System Architecture
Understanding how FileSense processes and classifies files.
High-Level Overview
```mermaid
flowchart TB
    subgraph Input
        A[Input Files<br/>PDF, DOCX, TXT, MD]
    end
    subgraph Extraction
        B[Text Extraction<br/>pdfplumber, python-docx]
        C[Fallback Extraction<br/>Middle pages, OCR]
    end
    subgraph RLAgent [RL Agent]
        D[Epsilon-Greedy Policy<br/>Exploit vs Explore]
    end
    subgraph Embedding
        E[SBERT Encoding<br/>BGE-Base v1.5<br/>768 dimensions]
    end
    subgraph Search
        F[FAISS Vector Search<br/>IndexFlatIP<br/>Cosine Similarity]
        G{Similarity >= 0.40?}
    end
    subgraph Classification
        H[High Confidence<br/>Assign to Folder]
        I[Generative Fallback<br/>Ask Gemini]
    end
    subgraph Update
        J[Update Labels<br/>folder_labels.json]
        K[Rebuild Index<br/>FAISS re-indexing]
        L[Re-classify]
    end
    subgraph Output
        M[Move to Sorted Folder]
    end
    A --> B
    B --> C
    C --> D
    D -->|Get State| E
    E --> F
    F --> G
    G -->|Yes| H
    G -->|No| I
    I --> J
    J --> K
    K --> L
    L --> F
    H --> M
```
Processing Pipeline
Step 1: Text Extraction
File: scripts/extract_text.py
```python
def extract_text(file_path, fallback=False):
    """
    Extract text from PDF, DOCX, or TXT files.

    Args:
        file_path: Path to the file
        fallback: If True, extract from middle pages (avoid TOC)

    Returns:
        Extracted text (max 2000 chars)
    """
```
Features:
- PDF: Uses `pdfplumber` with smart page selection
- DOCX: Extracts paragraphs with `python-docx`
- TXT: Direct file reading with encoding detection
- Fallback: Starts from middle pages to avoid the table of contents
- Quality scoring: Filters low-quality pages
- Header/footer removal: Crops margins to keep core content
Configuration:
```python
PDF_CONFIG = {
    'MAX_INPUT_CHARS': 2000,      # Max text length
    'QUALITY_THRESHOLD': 0.4,     # Min page quality
    'START_PAGE_ASSUMPTION': 3,   # Skip first N pages
    'HEADER_FOOTER_MARGIN': 70,   # Crop margins (pixels)
    'MIN_LINE_LENGTH': 25,        # Min line length
}
```
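To show how these settings interact, here is a minimal sketch of the PDF path, reusing the `PDF_CONFIG` above; the helper name `extract_pdf_text` and the exact cropping/paging logic are illustrative assumptions, not the project's actual implementation:

```python
import pdfplumber

def extract_pdf_text(file_path, fallback=False):
    """Illustrative only: crop header/footer margins and skip front matter
    (or start mid-document in fallback mode), stopping at the char budget."""
    chunks, total = [], 0
    with pdfplumber.open(file_path) as pdf:
        # Normal mode skips assumed front matter; fallback starts mid-document
        start = len(pdf.pages) // 2 if fallback else PDF_CONFIG['START_PAGE_ASSUMPTION']
        for page in pdf.pages[start:]:
            m = PDF_CONFIG['HEADER_FOOTER_MARGIN']
            body = page.crop((0, m, page.width, page.height - m))  # drop header/footer
            text = body.extract_text() or ""
            chunks.append(text)
            total += len(text)
            if total >= PDF_CONFIG['MAX_INPUT_CHARS']:
                break
    return "\n".join(chunks)[:PDF_CONFIG['MAX_INPUT_CHARS']]
```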
Step 2: The RL Agent (Decision Maker)
File: scripts/RL/rl_policy.py
Before any expensive operations, the Reinforcement Learning Agent decides the optimal strategy.
```python
class EpsilonGreedyBandit:
    def select_action(self, state):
        """
        Decide whether to EXPLOIT (use existing knowledge)
        or EXPLORE (try new generation).

        Policy A: High Accuracy (liberal use of API)
        Policy B: Balanced
        Policy C: Efficient (vector search only)
        """
```
Why RL?
- Cost Optimization: Prevents unnecessary API calls for simple files.
- Latency Reduction: Skips generation if vector confidence is high.
- Adaptability: Learns from user feedback (folder moves/renames).
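To make the explore/exploit loop concrete, here is a minimal, self-contained sketch; the action names, reward values, and update rule are illustrative assumptions, not the actual rl_policy.py implementation:

```python
import random

class EpsilonGreedyBanditSketch:
    """Illustrative bandit; the real agent lives in scripts/RL/rl_policy.py."""

    def __init__(self, actions, epsilon=0.2, learning_rate=0.1):
        self.q = {a: 0.0 for a in actions}  # running value estimate per policy
        self.epsilon = epsilon              # exploration rate
        self.lr = learning_rate

    def select_action(self):
        if random.random() < self.epsilon:      # EXPLORE: try a random policy
            return random.choice(list(self.q))
        return max(self.q, key=self.q.get)      # EXPLOIT: best policy so far

    def update(self, action, reward):
        # Nudge the estimate toward the observed reward, e.g. +1 if the user
        # keeps the assigned folder, -1 if they move or rename the file
        self.q[action] += self.lr * (reward - self.q[action])

bandit = EpsilonGreedyBanditSketch(["high_accuracy", "balanced", "efficient"])
action = bandit.select_action()
bandit.update(action, reward=1.0)
```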
Step 3: Embedding Generation
File: scripts/classify_process_file.py
MODEL_NAME = "BAAI/bge-base-en-v1.5"
model = SentenceTransformer(MODEL_NAME, device="cpu")
text_emb = model.encode([text], normalize_embeddings=True)
Model Details:
- Name: BAAI/bge-base-en-v1.5
- Dimensions: 768
- Normalization: L2 normalized for cosine similarity
- Performance: ~0.1s per encoding (Batch optimized)
- Size: 440MB (cached after first download)
Why this model?
Benchmark testing confirmed bge-base-en-v1.5 is 2x faster than MPNet and significantly more accurate than MiniLM. See Model Comparison.
Step 4: Vector Similarity Search
File: scripts/classify_process_file.py
```python
# Search FAISS index
D, I = index.search(text_emb, 10)  # Top 10 matches

# Calculate similarities
for idx, sim in zip(I[0], D[0]):
    if idx != -1:
        all_sims[idx] = sim
```
FAISS Configuration:
- Index Type: IndexFlatIP (Inner Product)
- Metric: Cosine similarity (via normalized embeddings)
- Top-K: Retrieves top 10 matches
- Boosting: Filename and keyword matching add bonus scores
Why Cosine Similarity?
I use cosine similarity because it compares semantic direction instead of vector magnitude. This avoids bias from unequal text lengths and keyword-heavy folder descriptions. It aligns better with how sentence-embedding models are trained (on angular distance), ensuring more accurate matching of file content to topic labels.
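This equivalence is easy to verify numerically: for L2-normalized vectors the inner product equals the cosine similarity, which is exactly why IndexFlatIP works here. A standalone check using only numpy:

```python
import numpy as np

a = np.random.rand(768).astype("float32")
b = np.random.rand(768).astype("float32")
a /= np.linalg.norm(a)  # L2-normalize, as encode(normalize_embeddings=True) does
b /= np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.isclose(a @ b, cosine)  # inner product == cosine for unit vectors
```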
Keyword Boosting:
```python
FILENAME_BOOST = 0.2  # If the label appears in the filename
TEXT_BOOST = 0.1      # If a keyword appears in the extracted text
```
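How these bonuses might combine with the raw similarity, as a hypothetical sketch (the function name and argument shapes are assumptions, reusing the constants above):

```python
def boosted_score(sim, label, keywords, filename, text):
    """Hypothetical: layer bonus scores on top of the raw cosine similarity."""
    score = sim
    if label.lower() in filename.lower():
        score += FILENAME_BOOST  # folder label found in the filename
    if any(kw in text.lower() for kw in keywords):
        score += TEXT_BOOST      # a folder keyword found in the extracted text
    return min(score, 1.0)       # keep the score comparable to a similarity
```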
Step 5: Classification Decision
Thresholds:
```python
THRESHOLD = 0.4                  # Main threshold
low_confidence_threshold = 0.35  # Fallback threshold
```
Decision Logic:
```python
if similarity >= 0.40:
    # High confidence - classify immediately
    return label, similarity
elif similarity >= 0.35:
    # Medium confidence - accept with a warning
    print("[!] Low confidence but accepting as fallback")
    return label, similarity
else:
    # Low confidence - generate a new label
    if allow_generation:
        generate_new_label_via_gemini()
    else:
        return "Uncategorized", 0.0
```
Step 6: Label Generation (Gemini)
File: scripts/generate_label.py
```python
def generate_folder_label(target_text, forced_label=None):
    """
    Generate or update a folder label using Gemini.

    Args:
        target_text: Document text + filename
        forced_label: Optional manual label override

    Returns:
        {
            "folder_label": "Physics",
            "description": "mechanics, forces, energy, ...",
            "keywords": "physics, motion, force, ..."
        }
    """
```
Prompt Strategy:
- Keyword-based descriptions (proven superior to natural language)
- Existing labels context (prefer reuse over new creation)
- Broad categorization (Physics, not Thermodynamics)
- 15 focused examples covering edge cases
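For orientation, here is a minimal sketch of what such a structured-output call can look like with the google-generativeai package; the prompt wording, helper name, and API-key handling are illustrative assumptions, not FileSense's actual code:

```python
import json
import google.generativeai as genai  # assumes the google-generativeai package

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-2.5-flash")

def ask_gemini_for_label(target_text, existing_labels):
    prompt = (
        "Suggest a broad folder label for this document. "
        f"Prefer reusing one of: {', '.join(existing_labels)}. "
        "Reply as JSON with folder_label, description, keywords.\n\n"
        f"{target_text}"
    )
    response = model.generate_content(
        prompt,
        generation_config=genai.GenerationConfig(
            temperature=0.5,
            response_mime_type="application/json",  # request structured JSON
        ),
    )
    return json.loads(response.text)
```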
Merging Logic:
When a label already exists, FileSense merges the new terminology with the existing metadata:
```python
def merge_folder_metadata(folder_label, old_desc, old_kw, new_desc, new_kw):
    """
    Intelligently merge metadata, preserving all unique terms.
    Hard requirement: NO terms from the old metadata are lost.
    """
```
Step 7: Index Rebuild
File: scripts/create_index.py
```python
import json
import faiss

def create_faiss_index():
    # Load labels
    with open("folder_labels.json") as f:
        folder_data = json.load(f)

    # Generate embeddings
    combined_desc = [f"{label}: {desc}" for label, desc in folder_data.items()]
    embeddings = model.encode(combined_desc, normalize_embeddings=True)

    # Build the FAISS index
    index = faiss.IndexFlatIP(768)  # 768 dimensions
    index.add(embeddings)

    # Save to disk
    faiss.write_index(index, "folder_embeddings.faiss")
```
Performance:
- Speed: ~1.5s for 10 labels
- Scaling: Linear with number of labels
- Memory: ~3KB per label
Data Structures
folder_labels.json
Format:
```json
{
  "Physics": "mechanics, thermodynamics, optics, electromagnetism, quantum mechanics, relativity, forces, energy, motion, waves, heat, light, electricity, magnetism, gravity, laboratory experiments, scientific formulas, physical laws Keywords: physics, mechanics, thermodynamics, optics, quantum, energy, force, motion",
  "Chemistry": "organic chemistry, inorganic chemistry, chemical reactions, molecular structure, stoichiometry, titration, synthesis, bonding, compounds, purification, acids, bases Keywords: chemistry, organic, reaction, chemical, lab, synthesis, molecule"
}
```
Structure:
- Key: Folder label (broad category)
- Value: `{description} Keywords: {keywords}` (see the parsing sketch below)
- Description: 20-40 comma-separated terms
- Keywords: 8-12 high-value search terms
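A tiny parser for this value format, as a hypothetical helper (it assumes the single " Keywords: " separator shown above):

```python
def split_metadata(value):
    """Split '{description} Keywords: {keywords}' into its two parts."""
    desc, _, kw = value.partition(" Keywords: ")
    return desc.strip(), [k.strip() for k in kw.split(",") if k.strip()]

desc, keywords = split_metadata(
    "mechanics, thermodynamics, optics Keywords: physics, mechanics, optics"
)
print(keywords)  # ['physics', 'mechanics', 'optics']
```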
Why keywords, not natural language?
Extensive testing showed keyword-based descriptions outperform natural language by 32 percentage points (56% vs. 24% accuracy). See NL vs Keywords Study.
folder_embeddings.faiss
Binary format containing:
- Vector embeddings for each label
- Index structure for fast search
- Metadata for reconstruction
Size: ~3KB per label (768 float32 values × 4 bytes = 3,072 bytes)
Multithreading Architecture
File: scripts/multhread.py
```python
def process_multiple(files_dir, max_threads, testing=False, allow_generation=True):
    """
    Process files in parallel with a thread pool.

    Args:
        files_dir: Directory containing files
        max_threads: Maximum concurrent threads
        testing: If True, don't move files
        allow_generation: Allow Gemini label generation
    """
```
Thread Safety:
- RLock: Prevents race conditions during label generation
- FAISS index: Read-only after loading (thread-safe)
- File I/O: Each thread processes different files
Performance:
- Optimal threads: 4-8 (I/O bound)
- Scaling: Near-linear up to 8 threads
- Bottleneck: Text extraction (especially OCR)
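A simplified sketch of this pattern: a ThreadPoolExecutor fans out per-file work, and a shared RLock serializes the label-generation/re-index critical section (the per-file pipeline is stubbed out here, so this is not the actual multhread.py code):

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

label_lock = threading.RLock()  # guards label generation + index rebuild

def process_one(file_path, testing=False, allow_generation=True):
    # extract -> embed -> search; on low confidence, take the lock before
    # generating a label and rebuilding the shared FAISS index
    with label_lock:
        pass  # label generation / re-indexing would happen here
    return file_path.name

def process_multiple(files_dir, max_threads, testing=False, allow_generation=True):
    files = [p for p in Path(files_dir).iterdir() if p.is_file()]
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [pool.submit(process_one, p, testing, allow_generation) for p in files]
        return [f.result() for f in futures]
```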
Fallback Mechanisms
1. Middle-Page Re-Extraction
If the initial classification has low confidence:

```python
if similarity < 0.35:
    # Try extracting from the middle of the document
    fallback_text = extract_text(file_path, fallback=True)
    new_sim = classify(fallback_text)
    if new_sim > similarity:
        # Use the fallback result
        text = fallback_text
        similarity = new_sim
```

Why? Many documents have a table of contents or cover pages that don’t represent the actual content.
2. Threshold Lowering
When generation is disabled:

```python
if not allow_generation:
    # Lower the threshold by 7 percentage points
    current_threshold -= 0.07
    if similarity > current_threshold:
        # Accept with the lowered threshold
        return label, similarity
```
3. Manual Labeling
After max retries:

```python
if retries >= MAX_RETRIES:
    # Ask the user for a manual label
    manual_label = input(f"Enter label for '{filename}': ")
    # Optionally generate a description
    if user_wants_description:
        generate_folder_label(text, forced_label=manual_label)
```
Time Complexity
| Operation | Complexity | Notes |
|---|---|---|
| Text Extraction | O(n) | n = file size |
| RL Policy Decision | O(1) | Table lookup / state check |
| Embedding Generation | O(1) | Fixed for 2000 chars |
| FAISS Search | O(k) | k = number of labels |
| Label Generation | O(1) | API call latency |
| Index Rebuild | O(k) | k = number of labels |
Space Complexity
| Component | Size | Scaling |
|---|---|---|
| Embedding Model | ~440MB | One-time |
| FAISS Index | ~3KB/label | Linear |
| folder_labels.json | ~500B/label | Linear |
| Logs | Variable | Can be disabled |
Bottlenecks
- Text Extraction (especially OCR)
  - Solution: Use more threads
  - Solution: Pre-extract text
- Gemini API Latency (~1-2s per call)
  - Solution: Batch processing
  - Solution: Pre-generate common labels
- Model Download (first run only)
  - Solution: Manual download and cache
Configuration Points
Similarity Thresholds
File: scripts/classify_process_file.py
```python
THRESHOLD = 0.4                  # Adjust for strictness
low_confidence_threshold = 0.35  # Fallback threshold
```
Extraction Settings
File: scripts/extract_text.py
```python
PDF_CONFIG = {
    'MAX_INPUT_CHARS': 2000,     # Increase for longer docs
    'QUALITY_THRESHOLD': 0.4,    # Lower to accept more pages
    'START_PAGE_ASSUMPTION': 3,  # Skip more/fewer initial pages
}
```
Embedding Model
File: scripts/create_index.py
MODEL_NAME = "BAAI/bge-base-en-v1.5" # Change model here
Gemini Settings
File: scripts/generate_label.py
MODEL = "gemini-2.5-flash" # Gemini model version
temperature = 0.5 # Creativity (0.0-1.0)
RL Settings
File: scripts/RL/rl_config.py
```python
EPSILON = 0.2        # Exploration rate
LEARNING_RATE = 0.1  # Q-learning rate
```
Design Decisions
Why FAISS over other vector DBs?
- Fast: Optimized for similarity search
- Lightweight: No server required
- Offline: Works without internet
- Scalable: Handles thousands of labels efficiently
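A self-contained toy run, independent of FileSense's data, showing that IndexFlatIP over L2-normalized vectors behaves as cosine-similarity search:

```python
import faiss
import numpy as np

vecs = np.random.rand(5, 768).astype("float32")
faiss.normalize_L2(vecs)          # in-place L2 normalization
index = faiss.IndexFlatIP(768)    # exact inner-product index, no server needed
index.add(vecs)

D, I = index.search(vecs[:1], 3)  # query with the first vector
print(I[0], D[0])                 # best hit is itself, similarity ~= 1.0
```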
Why SBERT over other embeddings?
- Semantic: Captures meaning, not just keywords
- Pre-trained: No training required
- Fast: ~0.1s per encoding
- Accurate: Best performance in testing
Why keyword descriptions?
Proven through testing:
- Keywords: 56% accuracy
- Natural language: 24% accuracy
- A 32-percentage-point improvement with keywords
See NL vs Keywords Study for details.
Why Gemini for generation?
- Structured output: JSON schema support
- Context understanding: Analyzes document content
- Merging logic: Preserves existing metadata
- Cost-effective: Free tier available