FileSense Documentation

FileSense Documentation

Semantic document classifier that understands meaning, not just filenames.

FileSense uses SentenceTransformers (SBERT) and FAISS vector search to organize files by their actual content. When it encounters unknown file types, it leverages Google Gemini (or Local LLMs) to generate new categories automatically.


UPDATE: Infrastructure Shift to SFT

Integrated a Reinforcement Learning architecture. However, due to Google Gemini’s Free Tier Rate Limits, I am temporarily pivoting the generation backend to Local SFT.

The Challenge

The RL Agent works as intended, optimizing for speed. However, effectively “training” the agent requires frequent API calls, which triggers Google’s 429 Rate Limit, forcing 60-second delays.

The Solution

Not removing RL, just removing the latency. By switching to Supervised Fine-Tuning (SFT) of a local model, I eliminate the API bottleneck. This will allow the RL Agent to function at full speed without external throttling.

See the full analysis: Reinforcement Learning Architecture



Core Features

Feature Description
Semantic Sorting Classifies by meaning (e.g., “Newton’s Laws” → Physics)
Reinforcement Learning Adaptive agent that optimizes sorting policies over time
AI-Powered Labeling Uses GenAI to create new categories automatically
FAISS Vector Search Lightning-fast similarity matching with embeddings
Self-Updating Automatically rebuilds index when new labels are created
OCR Support Extracts text from scanned PDFs and images
Keyword Boosting Hybrid approach: Vector similarity + keyword matching
GUI & CLI Desktop app with system tray + command-line interface

How It Works

flowchart TD
    A[Input File] --> B[Extract Text]
    B --> C[Generate Embedding<br/>SBERT BAAI/bge-base-en-v1.5]
    C --> D{Similarity >= 0.40?}
    D -->|Yes| E[Classify to Existing Folder]
    D -->|No| F[Ask Agent (RL)]
    F --> G{Policy A/B/C?}
    G --> H[Update folder_labels.json]
    H --> I[Rebuild FAISS Index]
    I --> J[Move to Sorted Folder]

Quick Start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Set up API key
echo "API_KEY=your_gemini_key" > .env

# 3. Create initial index
python scripts/create_index.py

# 4. Run FileSense
python scripts/script.py --dir ./files --threads 6

Documentation Sections

For Users

For Developers

Research & Analysis


Key Insights

Important discoveries from testing:

  1. Keyword-based descriptions outperform natural language for SBERT embeddings (+32% accuracy)
  2. Semantic descriptions performed worse than expected (24% vs 56% accuracy)
  3. Lighter models significantly reduced performance - I recommend sticking with BAAI/bge-base-en-v1.5
  4. AG News dataset showed poor results - academic documents work best

See the NL vs Keywords Study for detailed analysis.


Performance Highlights

Metric Value
Accuracy (NCERT Test) 56% with keywords
Avg Similarity Score 0.355
Categorization Rate 89% (11% uncategorized)
Processing Speed ~0.27s per file
Embedding Model BAAI/bge-base-en-v1.5 (768 dims)

Contributing

FileSense is an open-source project. Contributions are welcome!


License

MIT License © 2025 Ayush Bhalerao


Ready to get started?Installation Guide