Frequently Asked Questions
No one asked.
General Questions
What is FileSense?
FileSense is an intelligent file organizer that classifies documents by their semantic meaning rather than just filenames or extensions. It uses:
- SBERT (Sentence-BERT) for text embeddings
- FAISS for fast vector similarity search
- Google Gemini for generating new categories
- An RL (reinforcement learning) agent for reducing API calls
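As a rough sketch of how these pieces fit together, the snippet below runs a cosine-similarity search over label vectors. The hand-made 3-dimensional vectors stand in for SBERT's 768-dimensional embeddings, and the linear scan stands in for FAISS; the labels and numbers are illustrative only, not from the real index.

```python
import math

# Toy "embeddings" standing in for SBERT vectors; in FileSense these are
# 768-d and the nearest-label lookup is done by a FAISS index.
LABELS = {
    "Science": [0.9, 0.1, 0.0],
    "Finance": [0.1, 0.9, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def classify(doc_vec):
    """Return (best_label, similarity): the core of semantic classification."""
    return max(((lbl, cosine(doc_vec, v)) for lbl, v in LABELS.items()),
               key=lambda t: t[1])

label, score = classify([0.8, 0.2, 0.05])
```

The real pipeline adds text extraction in front of the embedding step and applies a similarity threshold to the result (see the threshold sections below).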
How is this different from traditional file organizers?
| Traditional Organizers | FileSense |
|---|---|
| Rules-based (file extension, name patterns) | Semantic understanding (actual content) |
| Manual category creation | AI-powered auto-generation |
| Keyword matching only | Vector embeddings + keywords |
| Static classification | Self-learning system |
Is my data sent to the cloud?
Partially. Here's what happens:
- Local: Text extraction, embedding generation, vector search
- Cloud: Only when generating new labels (Gemini API)
- Privacy: File content is sent to Gemini only for classification, not stored
- Logs: All RL logs are sent to Supabase Storage
Info: You can disable cloud features with the `--no-generation` flag.
Technical Questions
What file types are supported?
| Format | Support | Notes |
|---|---|---|
| PDF | Full | Text-based and scanned (OCR) |
| DOCX | Full | Microsoft Word documents |
| TXT | Full | Plain text files |
| MD | Full | Markdown files |
| Images | Partial | OCR only (requires Tesseract) |
| Others | No | Filename-based classification only |
How accurate is the classification?
Current performance (NCERT test dataset):
- Accuracy: 56% with keyword-based descriptions
- Categorization rate: 89% (only 11% uncategorized)
- Average similarity: 0.355
Factors affecting accuracy:
- Quality of text extraction
- Similarity to existing labels
- Document length and clarity
- Threshold settings (default: 0.40)
What embedding model is used?
Model: BAAI/bge-base-en-v1.5 (SentenceTransformers)
- Dimensions: 768
- Performance: Best balance of speed and accuracy
- Size: ~420MB download on first run
Warning:
Testing showed that lighter models performed significantly worse. The BAAI/bge-base-en-v1.5 model is optimal for this use case.
Can I use a different embedding model?
Yes! Edit `scripts/create_index.py`:

```python
MODEL_NAME = "BAAI/bge-base-en-v1.5"  # Change this
```
Recommended alternatives:
- `all-MiniLM-L6-v2` - Faster, less accurate
- `all-mpnet-base-v2` - Competitive with the current model
- `multi-qa-mpnet-base-dot-v1` - Better for Q&A documents
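One practical caveat when swapping models: a FAISS index is built for one fixed vector dimension, so a model with a different embedding dimension cannot reuse the old index at all. The dimension table below comes from the models' published specifications, and the helper is hypothetical, not part of the FileSense code.

```python
# Published embedding dimensions for the models discussed above.
MODEL_DIMS = {
    "BAAI/bge-base-en-v1.5": 768,
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
    "multi-qa-mpnet-base-dot-v1": 768,
}

def index_dimension_ok(old_model: str, new_model: str) -> bool:
    """Whether the existing index even loads under the new model.
    Note: you should rebuild the index after ANY model change, because the
    embedding spaces differ; a dimension mismatch just fails loudly
    (the 'Dimension mismatch' error in Troubleshooting)."""
    return MODEL_DIMS[old_model] == MODEL_DIMS[new_model]
```

In short: after editing `MODEL_NAME`, always rerun `python scripts/create_index.py`.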
How do I organize my Downloads folder?
Option 1: One-time bulk sort

```shell
python scripts/script.py --dir ~/Downloads --threads 8
```

Option 2: Real-time monitoring

```shell
python scripts/watcher_script.py --dir ~/Downloads
```
The watcher will automatically sort new files as they appear!
Can I customize the categories?
Yes! Edit `folder_labels.json`:

```json
{
  "Custom Category": "keyword1, keyword2, keyword3, related terms, synonyms Keywords: key1, key2, key3"
}
```

Then rebuild the index:

```shell
python scripts/create_index.py
```
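If you prefer to add categories programmatically, a stdlib-only sketch is below. It writes to a temporary directory so it runs anywhere; in the repo you would point `labels_path` at the real `folder_labels.json` instead.

```python
import json
import tempfile
from pathlib import Path

# Temp dir keeps this sketch self-contained; swap in the repo's real
# folder_labels.json path when using it for real.
labels_path = Path(tempfile.mkdtemp()) / "folder_labels.json"
labels_path.write_text(json.dumps({"Science": "physics, chemistry, biology"}),
                       encoding="utf-8")

labels = json.loads(labels_path.read_text(encoding="utf-8"))
labels["Custom Category"] = ("keyword1, keyword2, keyword3, related terms, "
                             "synonyms Keywords: key1, key2, key3")
labels_path.write_text(json.dumps(labels, indent=2, ensure_ascii=False),
                       encoding="utf-8")

updated = json.loads(labels_path.read_text(encoding="utf-8"))
```

Either way, the index must be rebuilt afterwards for the new category to take effect.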
How do I adjust classification thresholds?
Edit `scripts/classify_process_file.py`:

```python
THRESHOLD = 0.4                  # Main threshold (increase for stricter matching)
low_confidence_threshold = 0.35  # Fallback threshold
```
Guidelines:
- 0.30-0.35: More files categorized, less accurate
- 0.40-0.45: Balanced (recommended)
- 0.50+: Very strict, more uncategorized files
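Conceptually, the main threshold acts as a gate between a confident match and the Uncategorized bucket. A minimal sketch of that decision follows; the function name is illustrative, not from the codebase, and it ignores the fallback-threshold logic.

```python
def route(best_label: str, best_score: float, threshold: float = 0.40) -> str:
    """Files whose best similarity falls below the threshold are routed to
    sorted/Uncategorized/ instead of the best-matching category."""
    return best_label if best_score >= threshold else "Uncategorized"
```

Lowering `threshold` therefore directly trades accuracy for a higher categorization rate, as the guidelines above describe.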
What happens to uncategorized files?
Files that don't match any category (similarity < threshold) are moved to `sorted/Uncategorized/`.
To reduce uncategorized files:
- Lower the threshold (e.g., 0.35)
- Enable generation: remove the `--no-generation` flag
- Add more diverse labels to `folder_labels.json`
Troubleshooting
Files are not being classified correctly
Possible causes:
- Poor text extraction
  - Check if the PDF is scanned (needs OCR)
  - Verify Tesseract is installed for images
  - Try fallback extraction: `extract_text(file, fallback=True)`
- No matching labels
  - Add relevant categories to `folder_labels.json`
  - Enable generation to create new labels
  - Check similarity scores in the logs
- Threshold too high
  - Lower `THRESHOLD` in `classify_process_file.py`
  - Check average similarities in your dataset
Gemini API errors
"API key not valid"
- Verify the `.env` file exists and contains a valid key
- Check for extra spaces or quotes in `.env`
- Regenerate the key at Google AI Studio

"Rate limit exceeded"
- Gemini enforces usage limits on the free tier
- Add delays between requests
- Consider upgrading to a paid tier
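"Add delays between requests" can be automated with exponential backoff. A generic stdlib sketch is below; FileSense itself may handle retries differently, and in practice you would catch only the rate-limit error raised by your Gemini client rather than every exception.

```python
import time

def call_with_backoff(call, max_retries=4, base_delay=1.0):
    """Retry a zero-argument callable, doubling the delay after each failure.
    Catching bare Exception here is for illustration only; narrow it to the
    rate-limit error your client library raises."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * 2 ** attempt)
```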
"Model not found"
- Ensure you're using `gemini-2.5-flash` (the current model)
- Check Google AI Studio for model availability
FAISS index errors
"Index file not found"

```shell
python scripts/create_index.py
```

"Dimension mismatch"
- Delete `folder_embeddings.faiss`
- Rebuild the index with the correct model

"Empty index"
- Ensure `folder_labels.json` has at least one label
- Check that the JSON file is valid (no syntax errors)
Slow processing
- Increase threads: `--threads 12`
- Use an SSD for faster file I/O
- Disable logging: `--no-logs`

High memory usage
- Reduce concurrent threads
- Process files in smaller batches
- Use `--single-thread` for large files

Embedding model download stuck
- Check your internet connection
- Manually download from Hugging Face
- Place it in `~/.cache/torch/sentence_transformers/`
Why do keyword descriptions work better than natural language?
Comprehensive testing showed:
- Keywords: 56% accuracy
- Natural Language: 24% accuracy
Reasons:
- SBERT embeddings cluster keyword lists more effectively
- Natural language adds grammatical noise
- Keywords provide broader semantic coverage
- Academic documents align well with keyword matching
See the NL vs Keywords Study for detailed analysis.
What's the processing speed?
Benchmarks (NCERT dataset, 75 files):
- Average: 0.27s per file
- Total: ~20s for 75 files (6 threads)
- Bottleneck: Text extraction (especially OCR)
Optimization tips:
- Use more threads for I/O-bound tasks
- Pre-extract text for repeated processing
- Use SSD storage
- Disable fallback extraction if not needed
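Because the per-file work is I/O-bound (disk reads, OCR), a thread pool parallelizes it well, which is why more threads help. A minimal sketch with a stand-in classifier follows; `classify_file` here is a placeholder, not the real pipeline.

```python
from concurrent.futures import ThreadPoolExecutor

def classify_file(path: str) -> tuple[str, str]:
    # Stand-in for the real per-file pipeline: extract text, embed, search.
    return path, "Science"

files = ["a.pdf", "b.docx", "c.txt"]
with ThreadPoolExecutor(max_workers=6) as pool:  # mirrors running with 6 threads
    results = dict(pool.map(classify_file, files))
```

For CPU-bound steps (embedding on CPU), threads help less due to the GIL; the gains reported above come mostly from overlapping file I/O and OCR waits.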
How much disk space does it need?
Minimal:
- Embedding model: ~440MB (one-time download)
- FAISS index: <1MB per 100 labels
- Dependencies: ~500MB total
- Logs: Varies (can be disabled)
Privacy & Security
Is my data safe?
Local processing:
- Text extraction happens locally
- Embeddings generated locally
- Vector search is local
Cloud processing:
- Only when generating new labels
- File content sent to Gemini API
- Not stored by Google (per their policy)
Recommendations:
- Use `--no-generation` for sensitive files
- Review `folder_labels.json` before sharing
- Keep the `.env` file private
Can I use FileSense offline?
Partially:
- Classification works offline (after initial setup)
- New label generation requires internet (Gemini API)
Offline workflow:
- Pre-generate all labels online
- Use the `--no-generation` flag
- All classification happens locally
Known Limitations
Current Limitations
- Natural language descriptions perform worse than keywords
- Lighter models reduce accuracy significantly
  - Now using `BAAI/bge-base-en-v1.5` as the optimal model
  - Don't use smaller models for production
- The AG News dataset showed poor results
  - FileSense works best with academic/professional documents
  - News articles may need a different approach
- Text classification is inherently challenging
  - This may be an inefficient approach for some use cases
  - Consider it a learning experience
Warning: These insights are from real testing and development.
Contributing
How can I contribute?
- Report bugs: GitHub Issues
- Suggest features: Open a discussion
- Submit PRs: Fork, improve, and submit
- Improve docs: Help expand this wiki
Where can I get help?
- GitHub Issues: Technical problems
- Discussions: General questions
- Email: Contact the maintainer
Additional Resources
Still have questions? Open an issue on GitHub!
← Back to Home