FileSense Documentation

Frequently Asked Questions

No one asked đŸ„€.


General Questions

What is FileSense?

FileSense is an intelligent file organizer that classifies documents by their semantic meaning rather than just filenames or extensions. It uses:

  ‱ Sentence embeddings (SentenceTransformers, BAAI/bge-base-en-v1.5) to represent document content
  ‱ A FAISS vector index built from the category descriptions in folder_labels.json
  ‱ Keyword-style label descriptions alongside the embeddings
  ‱ Optional AI-powered label generation via the Gemini API
  ‱ OCR (Tesseract) for scanned PDFs and images

How is this different from traditional file organizers?

Traditional Organizers                      | FileSense
--------------------------------------------|------------------------------------------
Rules-based (file extension, name patterns) | Semantic understanding of actual content
Manual category creation                    | AI-powered auto-generation
Keyword matching only                       | Vector embeddings + keywords
Static classification                       | Self-learning system

Is my data sent to the cloud?

Partially. Here’s what happens:

  ‱ Text extraction, embedding, classification, and file moves all run locally on your machine.
  ‱ Only the optional label-generation step sends document content to the Gemini API to propose new categories.
  ‱ Use the --no-generation flag to skip that step and keep everything local (see Privacy & Security below).


🔧 Technical Questions

What file types are supported?

Format | Support | Notes
-------|---------|------------------------------------
PDF    | Full    | Text-based and scanned (OCR)
DOCX   | Full    | Microsoft Word documents
TXT    | Full    | Plain text files
MD     | Full    | Markdown files
Images | Partial | OCR only (requires Tesseract)
Others | No      | Filename-based classification only
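
Extraction is format-specific, as the table above suggests. The snippet below is a minimal illustrative sketch of that kind of per-format dispatch, not FileSense's actual extractor; it assumes the pypdf, python-docx, Pillow, and pytesseract packages are installed, and the function name extract_text_demo is invented for this example.

# Illustrative per-format text extraction (not FileSense's real code).
# Assumes: pip install pypdf python-docx pillow pytesseract (plus the Tesseract binary).
from pathlib import Path

from pypdf import PdfReader
from docx import Document
from PIL import Image
import pytesseract

def extract_text_demo(path: str) -> str:
    """Best-effort plain text for a file, keyed on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        reader = PdfReader(path)
        # Scanned PDFs have no text layer; a real pipeline would fall back to OCR here.
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    if suffix in {".txt", ".md"}:
        return Path(path).read_text(encoding="utf-8", errors="ignore")
    if suffix in {".png", ".jpg", ".jpeg"}:
        return pytesseract.image_to_string(Image.open(path))   # OCR via Tesseract
    return ""  # anything else falls back to filename-based classification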

How accurate is the classification?

Current performance (NCERT test dataset):

Factors affecting accuracy:

  ‱ Quality of text extraction (scanned PDFs and images depend on OCR)
  ‱ How well the categories in folder_labels.json cover your document topics
  ‱ Threshold settings (see “How do I adjust classification thresholds?”)
  ‱ Document type (academic/professional content works best; see Known Limitations)

What embedding model is used?

Model: BAAI/bge-base-en-v1.5 (SentenceTransformers)

Warning:

Testing showed that lighter models performed significantly worse. The BAAI/bge-base-en-v1.5 model is optimal for this use case.

Can I use a different embedding model?

Yes! Edit scripts/create_index.py:

MODEL_NAME = "BAAI/bge-base-en-v1.5"  # Change this

Recommended alternatives:


How do I organize my Downloads folder?

Option 1: One-time bulk sort

python scripts/script.py --dir ~/Downloads --threads 8

Option 2: Real-time monitoring

python scripts/watcher_script.py --dir ~/Downloads

The watcher will automatically sort new files as they appear!
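
For context, real-time watchers like this are typically built on a filesystem-events library. The sketch below illustrates the idea with the watchdog package; it is not the contents of watcher_script.py, and classify_and_move is a made-up stand-in for FileSense's per-file pipeline.

# Illustrative folder watcher using the watchdog package (not watcher_script.py itself).
import time
from pathlib import Path

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

def classify_and_move(path: str) -> None:
    print(f"Would classify and sort: {path}")   # placeholder for the real pipeline

class NewFileHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:              # react only to new files, not folders
            classify_and_move(event.src_path)

observer = Observer()
observer.schedule(NewFileHandler(), str(Path.home() / "Downloads"), recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)                           # keep the main thread alive for the observer
except KeyboardInterrupt:
    observer.stop()
observer.join()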

Can I customize the categories?

Yes! Edit folder_labels.json:

{
  "Custom Category": "keyword1, keyword2, keyword3, related terms, synonyms Keywords: key1, key2, key3"
}

Then rebuild the index:

python scripts/create_index.py
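
Conceptually, rebuilding the index means embedding every label description and storing the vectors for similarity search. Here is a minimal sketch of that idea using SentenceTransformers and FAISS; it is not the actual create_index.py, and the output filenames (labels.index, label_order.json) are invented for illustration.

# Illustrative label-index build (not the real create_index.py).
import json

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

with open("folder_labels.json", encoding="utf-8") as f:
    labels = json.load(f)                       # {category name: keyword description}

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
# Normalized vectors let an inner-product index behave like cosine similarity.
vectors = model.encode(list(labels.values()), normalize_embeddings=True)

index = faiss.IndexFlatIP(vectors.shape[1])
index.add(np.asarray(vectors, dtype="float32"))
faiss.write_index(index, "labels.index")        # invented path, for illustration

with open("label_order.json", "w", encoding="utf-8") as f:
    json.dump(list(labels.keys()), f)           # remember which row maps to which category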

How do I adjust classification thresholds?

Edit scripts/classify_process_file.py:

THRESHOLD = 0.4              # Main threshold (increase for stricter)
low_confidence_threshold = 0.35  # Fallback threshold

Guidelines:

  ‱ Raise THRESHOLD for stricter matching: fewer wrong placements, but more files land in Uncategorized.
  ‱ Lower it (e.g., 0.35) to reduce uncategorized files, at the cost of occasional misclassification.
  ‱ Check the similarity scores in the logs to see where your documents typically fall before changing values.

What happens to uncategorized files?

Files that don’t match any category (similarity < threshold) are moved to sorted/Uncategorized/.

To reduce uncategorized files:

  1. Lower the threshold (e.g., 0.35)
  2. Enable generation: Remove --no-generation flag
  3. Add more diverse labels to folder_labels.json
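
For reference, the routing decision described above boils down to a similarity cutoff. The sketch below shows that logic in isolation; it is not the real classify_process_file.py, route_file is an invented helper, and the role of the fallback threshold may differ in the actual code.

# Illustrative routing of uncategorized files (not the real classify_process_file.py).
import shutil
from pathlib import Path

THRESHOLD = 0.4   # same main cutoff discussed above

def route_file(path: str, best_label: str, similarity: float, sorted_root: str = "sorted") -> Path:
    """Move a file to its best-matching category, or to Uncategorized if no match is close enough."""
    folder = best_label if similarity >= THRESHOLD else "Uncategorized"
    dest = Path(sorted_root) / folder
    dest.mkdir(parents=True, exist_ok=True)
    shutil.move(path, str(dest / Path(path).name))
    return dest

# Example: a 0.31 similarity is below the 0.4 cutoff, so the file goes to sorted/Uncategorized/.
# route_file("Downloads/notes.pdf", "Physics", 0.31)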

Troubleshooting

Files are not being classified correctly

Possible causes:

  1. Poor text extraction
    • Check if PDF is scanned (needs OCR)
    • Verify Tesseract is installed for images
    • Try fallback extraction: extract_text(file, fallback=True)
  2. No matching labels
    • Add relevant categories to folder_labels.json
    • Enable generation to create new labels
    • Check similarity scores in logs
  3. Threshold too high
    • Lower THRESHOLD in classify_process_file.py
    • Check average similarities in your dataset

Gemini API errors

“API key not valid”

“Rate limit exceeded”

“Model not found”

FAISS index errors

“Index file not found”

python scripts/create_index.py

“Dimension mismatch”

The index was built with a different embedding model than the one currently configured. Rebuild it with python scripts/create_index.py after changing MODEL_NAME.

“Empty index”

folder_labels.json contained no usable labels when the index was built. Add at least one category and rebuild the index.

Performance issues

Slow processing

High memory usage

Embedding model download stuck


📊 Performance Questions

Why do keyword descriptions work better than natural language?

Comprehensive testing showed:

Reasons:

  1. SBERT embeddings cluster keyword lists more effectively
  2. Natural language adds grammatical noise
  3. Keywords provide broader semantic coverage
  4. Academic documents align well with keyword matching

See the NL vs Keywords Study for detailed analysis.
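
If you want to reproduce the comparison on your own data, the underlying measurement is just cosine similarity between a document embedding and each label phrasing. A small sketch follows; the sample strings are invented for illustration, and the scores it prints are not the study's results.

# Compare a keyword-style label and a natural-language label against the same document.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-base-en-v1.5")

document = "Photosynthesis converts light energy into chemical energy inside plant cells."
labels = {
    "keywords": "biology, photosynthesis, plants, cells, chlorophyll, energy conversion",
    "natural language": "This folder contains documents about how plants make food using sunlight.",
}

doc_vec = model.encode(document, normalize_embeddings=True)
for style, text in labels.items():
    label_vec = model.encode(text, normalize_embeddings=True)
    score = util.cos_sim(doc_vec, label_vec).item()   # cosine similarity in [-1, 1]
    print(f"{style}: {score:.3f}")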

What’s the processing speed?

Benchmarks (NCERT dataset, 75 files):

Optimization tips:

How much disk space does it need?

Minimal:


Privacy & Security

Is my data safe?

Local processing:

  ‱ Text extraction, embeddings, FAISS search, and file moves all happen on your machine.

Cloud processing:

  ‱ Label generation (when enabled) sends document content to the Gemini API.

Recommendations:

  ‱ Use the --no-generation flag for sensitive documents so nothing leaves your machine.

Can I use FileSense offline?

Partially:

  ‱ Classification, embeddings, and file sorting work offline once the embedding model has been downloaded.
  ‱ Label generation requires internet access (Gemini API).

Offline workflow:

  1. Pre-generate all labels online
  2. Use --no-generation flag
  3. All classification happens locally

Known Limitations

Current Limitations

  1. Natural language descriptions perform worse than keywords
  2. Lighter models reduce accuracy significantly
    • Now using BAAI/bge-base-en-v1.5 as optimal
    • Don’t use smaller models for production
  3. AG News dataset showed poor results
    • FileSense works best with academic/professional documents
    • News articles may need different approach
  4. Text classification is inherently challenging
    • This might be an inefficient approach for some use cases
    • Consider as a learning experience

Warning: These insights are from real testing and development.


Contributing

How can I contribute?

  1. Report bugs: GitHub Issues
  2. Suggest features: Open a discussion
  3. Submit PRs: Fork, improve, and submit
  4. Improve docs: Help expand this wiki

Where can I get help?


Additional Resources


Still have questions? Open an issue on GitHub!


← Back to Home