FileSense Documentation

Semantic document classifier that understands meaning, not just filenames.

FileSense uses SentenceTransformers (SBERT) and FAISS vector search to organize files by their actual content. When it encounters unknown file types, it leverages Google Gemini (or Local LLMs) to generate new categories automatically.

UPDATE: Infrastructure Shift to SFT

Integrated a Reinforcement Learning architecture. However, due to Google Gemini’s Free Tier Rate Limits, I am temporarily pivoting the generation backend to Local SFT.

The Challenge

The RL Agent works as intended, optimizing for speed. However, effectively “training” the agent requires frequent API calls, which triggers Google’s 429 Rate Limit, forcing 60-second delays.

The Solution

Not removing RL, just removing the latency. By switching to Supervised Fine-Tuning (SFT) of a local model, I eliminate the API bottleneck. This will allow the RL Agent to function at full speed without external throttling.

See the full analysis: Reinforcement Learning Architecture

Quick Links

Getting Started: Install and run FileSense in 5 minutes
Performance Metrics: See benchmarks and optimization studies
RL Architecture: Deep dive into the Adaptive Agent

Core Features

Feature	Description
Semantic Sorting	Classifies by meaning (e.g., “Newton’s Laws” → Physics)
Reinforcement Learning	Adaptive agent that optimizes sorting policies over time
AI-Powered Labeling	Uses GenAI to create new categories automatically
FAISS Vector Search	Lightning-fast similarity matching with embeddings
Self-Updating	Automatically rebuilds index when new labels are created
OCR Support	Extracts text from scanned PDFs and images
Keyword Boosting	Hybrid approach: Vector similarity + keyword matching
GUI & CLI	Desktop app with system tray + command-line interface

How It Works

flowchart TD
    A[Input File] --> B[Extract Text]
    B --> C[Generate Embedding<br/>SBERT BAAI/bge-base-en-v1.5]
    C --> D{Similarity >= 0.40?}
    D -->|Yes| E[Classify to Existing Folder]
    D -->|No| F[Ask Agent (RL)]
    F --> G{Policy A/B/C?}
    G --> H[Update folder_labels.json]
    H --> I[Rebuild FAISS Index]
    I --> J[Move to Sorted Folder]

Quick Start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Set up API key
echo "API_KEY=your_gemini_key" > .env

# 3. Create initial index
python scripts/create_index.py

# 4. Run FileSense
python scripts/script.py --dir ./files --threads 6

Documentation Sections

For Users

Getting Started - Installation and setup
FAQ - Common questions and troubleshooting

For Developers

Architecture - System design and data flow

Research & Analysis

Performance Metrics - Benchmarks and accuracy
Reinforcement Learning - Architecture & SFT Pivot
NL vs Keywords Study - Comprehensive comparison
Lessons Learned - Key insights from development

Key Insights

Important discoveries from testing:

Keyword-based descriptions outperform natural language for SBERT embeddings (+32% accuracy)

Semantic descriptions performed worse than expected (24% vs 56% accuracy)

Lighter models significantly reduced performance - I recommend sticking with BAAI/bge-base-en-v1.5

AG News dataset showed poor results - academic documents work best

See the NL vs Keywords Study for detailed analysis.

Performance Highlights

Metric	Value
Accuracy (NCERT Test)	56% with keywords
Avg Similarity Score	0.355
Categorization Rate	89% (11% uncategorized)
Processing Speed	~0.27s per file
Embedding Model	BAAI/bge-base-en-v1.5 (768 dims)

Contributing

FileSense is an open-source project. Contributions are welcome!

GitHub: ahhyoushh/FileSense
Issues: Report bugs or request features
Pull Requests: Submit improvements

License

Ready to get started? → Installation Guide