FileSense Documentation

System Architecture & Pipeline

Understanding how FileSense processes and classifies files.


High-Level Overview

flowchart TB
    subgraph Input
        A[Input Files<br/>PDF, DOCX, TXT, MD]
    end
    
    subgraph Extraction
        B[Text Extraction<br/>pdfplumber, python-docx]
        C[Fallback Extraction<br/>Middle pages, OCR]
    end
    
    subgraph RLAgent [RL Agent]
        D[Epsilon-Greedy Policy<br/>Exploit vs Explore]
    end
    
    subgraph Embedding
        E[SBERT Encoding<br/>BGE-Base v1.5<br/>768 dimensions]
    end
    
    subgraph Search
        F[FAISS Vector Search<br/>IndexFlatIP<br/>Cosine Similarity]
        G{Similarity >= 0.40?}
    end
    
    subgraph Classification
        H[High Confidence<br/>Assign to Folder]
        I[Generative Fallback<br/>Ask Gemini]
    end
    
    subgraph Update
        J[Update Labels<br/>folder_labels.json]
        K[Rebuild Index<br/>FAISS re-indexing]
        L[Re-classify]
    end
    
    subgraph Output
        M[Move to Sorted Folder]
    end
    
    A --> B
    B --> C
    C --> D
    D -->|Get State| E
    E --> F
    F --> G
    G -->|Yes| H
    G -->|No| I
    I --> J
    J --> K
    K --> L
    L --> F
    H --> M

Processing Pipeline

Step 1: Text Extraction

File: scripts/extract_text.py

def extract_text(file_path, fallback=False):
    """
    Extract text from PDF, DOCX, or TXT files.
    
    Args:
        file_path: Path to the file
        fallback: If True, extract from middle pages (avoid TOC)
    
    Returns:
        Extracted text (max 2000 chars)
    """

Features:

Configuration:

PDF_CONFIG = {
    'MAX_INPUT_CHARS': 2000,        # Max text length
    'QUALITY_THRESHOLD': 0.4,       # Min page quality
    'START_PAGE_ASSUMPTION': 3,     # Skip first N pages
    'HEADER_FOOTER_MARGIN': 70,     # Crop margins (pixels)
    'MIN_LINE_LENGTH': 25,          # Min line length
}
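
How two of these settings might interact can be sketched as follows. This is an illustration only, not the code in scripts/extract_text.py: the helper `clean_extracted_text` is hypothetical, and it assumes that MIN_LINE_LENGTH is used to drop short lines (page numbers, running headers) and that MAX_INPUT_CHARS truncates the result.

```python
PDF_CONFIG = {
    "MAX_INPUT_CHARS": 2000,   # Max text length (value from the config above)
    "MIN_LINE_LENGTH": 25,     # Min line length (value from the config above)
}

def clean_extracted_text(raw_text, config=PDF_CONFIG):
    """Hypothetical helper: drop short lines, then truncate to the character budget."""
    kept = [
        line.strip()
        for line in raw_text.splitlines()
        if len(line.strip()) >= config["MIN_LINE_LENGTH"]
    ]
    # Join surviving lines and cut to MAX_INPUT_CHARS
    return " ".join(kept)[: config["MAX_INPUT_CHARS"]]

sample = "Page 3\nNewton's second law relates force, mass, and acceleration.\n"
print(clean_extracted_text(sample))
```

Under these assumptions the short "Page 3" line is discarded and only the content sentence reaches the embedding step.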

Step 2: The RL Agent (Decision Maker)

File: scripts/RL/rl_policy.py

Before any expensive operation runs, the reinforcement learning agent decides which strategy to apply.

class EpsilonGreedyBandit:
    def select_action(self, state):
        """
        Decide whether to EXPLOIT (use existing knowledge) or EXPLORE (try new generation).
        
        Policy A: High Accuracy (Liberal use of API)
        Policy B: Balanced
        Policy C: Efficient (Vector Search only)
        """
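
The epsilon-greedy idea can be sketched in a few lines. This is a simplified stand-in, not the class in scripts/RL/rl_policy.py: the name `EpsilonGreedyBanditSketch`, the `update` method, and the reward values are assumptions, and the sketch ignores the `state` argument. The defaults mirror the EPSILON and LEARNING_RATE values listed under RL Settings.

```python
import random

class EpsilonGreedyBanditSketch:
    """Illustrative stateless bandit over a fixed set of policies."""

    def __init__(self, actions, epsilon=0.2, learning_rate=0.1, seed=None):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.learning_rate = learning_rate
        self.q_values = {action: 0.0 for action in self.actions}
        self.rng = random.Random(seed)

    def select_action(self):
        # Explore: with probability epsilon, pick a random policy
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        # Exploit: otherwise pick the policy with the highest estimated value
        return max(self.actions, key=self.q_values.get)

    def update(self, action, reward):
        # Move the estimate a fraction of the way toward the observed reward
        q = self.q_values[action]
        self.q_values[action] = q + self.learning_rate * (reward - q)

bandit = EpsilonGreedyBanditSketch(["A", "B", "C"], seed=0)
bandit.update("C", reward=1.0)
print(bandit.q_values)
```

With epsilon at 0.2, roughly one decision in five is exploratory; the rest reuse whichever policy currently has the best estimate.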

Why RL?

Step 3: Embedding Generation

File: scripts/classify_process_file.py

MODEL_NAME = "BAAI/bge-base-en-v1.5"
model = SentenceTransformer(MODEL_NAME, device="cpu")
text_emb = model.encode([text], normalize_embeddings=True)

Model Details:

Why this model?

Benchmark testing confirmed bge-base-en-v1.5 is 2x faster than MPNet and significantly more accurate than MiniLM. See Model Comparison.

Step 4: FAISS Vector Search

File: scripts/classify_process_file.py

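# Search the FAISS index for the 10 nearest folder labels
D, I = index.search(text_emb, 10)  # D: similarity scores, I: label indices

# Collect similarities per label index
for idx, sim in zip(I[0], D[0]):
    if idx != -1:  # FAISS pads with -1 when fewer than 10 labels exist
        all_sims[idx] = sim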

FAISS Configuration:

Why Cosine Similarity?

I use cosine similarity because it compares semantic direction instead of vector magnitude. This avoids bias from unequal text lengths and keyword-heavy folder descriptions. It also matches how sentence-embedding models are trained (on angular distance), which gives more accurate matching of file content to topic labels. Because the embeddings are normalized to unit length, the inner product computed by IndexFlatIP is exactly the cosine similarity.
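
A small worked example shows the effect of normalization. The vectors are made-up three-dimensional stand-ins for the 768-dimensional embeddings, and the helper names are illustrative only.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]   # same direction, 10x the magnitude
label = [2.0, 1.0, 0.0]

# Raw inner products grow with vector magnitude
raw_short = inner_product(short_doc, label)
raw_long = inner_product(long_doc, label)

# After normalization, the inner product depends only on direction
cos_short = inner_product(normalize(short_doc), normalize(label))
cos_long = inner_product(normalize(long_doc), normalize(label))

print(raw_short, raw_long)
print(round(cos_short, 3), round(cos_long, 3))
```

The raw scores differ by a factor of ten, while the normalized scores are identical, which is why both documents would be matched to the label with the same confidence.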

Keyword Boosting:

FILENAME_BOOST = 0.2  # If label appears in filename
TEXT_BOOST = 0.1      # If keyword appears in text
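
One way such boosts could be applied is sketched below. The function `boosted_similarity` is hypothetical, and two details are assumptions rather than documented behavior: that the boosts are added to the base similarity, and that keywords are matched against the extracted text.

```python
FILENAME_BOOST = 0.2  # value from the constants above
TEXT_BOOST = 0.1      # value from the constants above

def boosted_similarity(base_sim, label, keywords, filename, text):
    """Hypothetical helper: add fixed boosts for label and keyword matches."""
    score = base_sim
    # Assumption: case-insensitive substring match of the label in the filename
    if label.lower() in filename.lower():
        score += FILENAME_BOOST
    # Assumption: any single keyword hit in the text earns the boost once
    if any(keyword.lower() in text.lower() for keyword in keywords):
        score += TEXT_BOOST
    return score

score = boosted_similarity(
    0.30, "Physics", ["force", "motion"],
    filename="physics_notes.pdf", text="Newton's laws of motion",
)
print(round(score, 2))
```

In this sketch a file whose raw similarity sits below the main threshold can still clear it when its filename and content both point at the same label.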

Step 5: Classification Decision

Thresholds:

THRESHOLD = 0.4                    # Main threshold
low_confidence_threshold = 0.35    # Fallback threshold

Decision Logic:

if similarity >= 0.40:
    # High confidence - classify immediately
    return label, similarity

elif similarity >= 0.35:
    # Medium confidence - accept with warning
    print("[!] Low confidence but accepting as fallback")
    return label, similarity

else:
    # Low confidence - generate new label
    if allow_generation:
        generate_new_label_via_gemini()
    else:
        return "Uncategorized", 0.0

Step 6: Label Generation (Gemini)

File: scripts/generate_label.py

def generate_folder_label(target_text, forced_label=None):
    """
    Generate or update folder label using Gemini.
    
    Args:
        target_text: Document text + filename
        forced_label: Optional manual label override
    
    Returns:
        {
            "folder_label": "Physics",
            "description": "mechanics, forces, energy, ...",
            "keywords": "physics, motion, force, ..."
        }
    """

Prompt Strategy:

Merging Logic: When a label already exists, FileSense merges the new terminology with the existing metadata:

def merge_folder_metadata(folder_label, old_desc, old_kw, new_desc, new_kw):
    """
    Intelligently merge metadata preserving all unique terms.
    
    Hard requirement: NO terms from old metadata are lost.
    """
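
The "no old terms are lost" requirement can be illustrated with a deterministic merge of comma-separated terms. This is a stand-in for explanation only: `merge_terms` is a hypothetical helper, and the actual merge_folder_metadata may work differently (for example by asking Gemini to combine the metadata).

```python
def merge_terms(old_terms, new_terms):
    """Union of two comma-separated term lists; old terms come first."""
    merged, seen = [], set()
    for term in (old_terms + "," + new_terms).split(","):
        cleaned = term.strip()
        key = cleaned.lower()          # case-insensitive duplicate check
        if cleaned and key not in seen:
            seen.add(key)
            merged.append(cleaned)
    return ", ".join(merged)

old = "mechanics, forces, energy"
new = "energy, optics, Forces"
print(merge_terms(old, new))
```

Every term from the old list survives in its original spelling, and only terms that are actually new are appended.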

Step 7: Index Rebuild

File: scripts/create_index.py

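import json

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5", device="cpu")

def create_faiss_index():
    # Load labels (label -> keyword description)
    with open("folder_labels.json", encoding="utf-8") as f:
        folder_data = json.load(f)

    # Generate one normalized embedding per "label: description" string
    combined_desc = [f"{label}: {desc}" for label, desc in folder_data.items()]
    embeddings = model.encode(combined_desc, normalize_embeddings=True)

    # Build an inner-product index (cosine similarity on normalized vectors)
    index = faiss.IndexFlatIP(768)  # 768 dimensions
    index.add(embeddings)

    # Save to disk
    faiss.write_index(index, "folder_embeddings.faiss")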

Performance:


Data Structures

folder_labels.json

Format:

{
  "Physics": "mechanics, thermodynamics, optics, electromagnetism, quantum mechanics, relativity, forces, energy, motion, waves, heat, light, electricity, magnetism, gravity, laboratory experiments, scientific formulas, physical laws Keywords: physics, mechanics, thermodynamics, optics, quantum, energy, force, motion",
  
  "Chemistry": "organic chemistry, inorganic chemistry, chemical reactions, molecular structure, stoichiometry, titration, synthesis, bonding, compounds, purification, acids, bases Keywords: chemistry, organic, reaction, chemical, lab, synthesis, molecule"
}
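
Each value packs the description terms and the keywords into one string, separated by the literal marker "Keywords:". A minimal sketch of splitting an entry back into its two parts, assuming that marker appears at most once per value (`split_entry` is a hypothetical helper, not part of FileSense):

```python
def split_entry(value):
    """Split a folder_labels.json value into (description terms, keywords)."""
    description, _, keywords = value.partition(" Keywords: ")

    def to_list(text):
        return [term.strip() for term in text.split(",") if term.strip()]

    return to_list(description), to_list(keywords)

entry = "mechanics, optics, physical laws Keywords: physics, mechanics"
terms, keywords = split_entry(entry)
print(terms)
print(keywords)
```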

Structure:

Why keywords, not natural language?

Extensive testing showed that keyword-based descriptions outperform natural-language descriptions, improving accuracy by 32%. See NL vs Keywords Study.

folder_embeddings.faiss

Binary format containing:

Size: ~3KB per label (768 float32 values)
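
The per-label figure follows from the vector dimensions: 768 float32 values at 4 bytes each. A short calculation, ignoring any fixed file header (the function name is illustrative):

```python
import struct

DIMENSIONS = 768
BYTES_PER_FLOAT32 = struct.calcsize("<f")  # 4 bytes

def index_payload_bytes(num_labels):
    """Approximate size of the stored vectors for a flat index."""
    return num_labels * DIMENSIONS * BYTES_PER_FLOAT32

print(index_payload_bytes(1))           # bytes for one label
print(index_payload_bytes(100) / 1024)  # KiB for 100 labels
```

At this rate even a few thousand labels stay within a few megabytes, so index size is unlikely to be a constraint.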


Multithreading Architecture

File: scripts/multhread.py

def process_multiple(files_dir, max_threads, testing=False, allow_generation=True):
    """
    Process files in parallel with thread pool.
    
    Args:
        files_dir: Directory containing files
        max_threads: Maximum concurrent threads
        testing: If True, don't move files
        allow_generation: Allow Gemini label generation
    """
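
The general shape of a thread-pool driver is sketched below using the standard library. It is not the code in scripts/multhread.py: `process_all` and the `worker` callable are hypothetical, and the sketch only collects results in the calling thread rather than moving files.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_all(paths, worker, max_threads=4):
    """Run worker(path) for every path on a bounded thread pool."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Map each future back to the path it is processing
        futures = {pool.submit(worker, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # One failing file should not abort the whole batch
                results[path] = f"error: {exc}"
    return results

print(process_all(["a.pdf", "b.txt"], lambda path: path.upper(), max_threads=2))
```

Because results are gathered in the calling thread, this sketch needs no lock; shared state that workers modify directly, such as the label file, would need its own synchronization.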

Thread Safety:

Performance:


Fallback Mechanisms

1. Fallback Text Extraction

If initial classification has low confidence:

if similarity < 0.35:
    # Try extracting from middle of document
    fallback_text = extract_text(file_path, fallback=True)
    new_sim = classify(fallback_text)
    
    if new_sim > similarity:
        # Use fallback result
        text = fallback_text
        similarity = new_sim

Why? Many documents open with a table of contents or cover pages that don’t represent the actual content.
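
The retry flow can be exercised end to end with stub functions. In this sketch `classify_with_fallback`, `fake_extract`, and `fake_classify` are hypothetical, and `classify` is assumed to return a (label, similarity) pair.

```python
LOW_CONFIDENCE_THRESHOLD = 0.35  # value from the thresholds above

def classify_with_fallback(file_path, extract_text, classify):
    """Classify once; on low confidence, retry with middle-page text."""
    text = extract_text(file_path, fallback=False)
    label, similarity = classify(text)
    if similarity < LOW_CONFIDENCE_THRESHOLD:
        fallback_text = extract_text(file_path, fallback=True)
        fb_label, fb_similarity = classify(fallback_text)
        # Keep the fallback result only if it is an improvement
        if fb_similarity > similarity:
            label, similarity = fb_label, fb_similarity
    return label, similarity

# Stubs standing in for the real extractor and classifier
def fake_extract(path, fallback=False):
    return "chapter on thermodynamics" if fallback else "table of contents"

def fake_classify(text):
    if "thermodynamics" in text:
        return ("Physics", 0.62)
    return ("Uncategorized", 0.10)

print(classify_with_fallback("notes.pdf", fake_extract, fake_classify))
```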

2. Threshold Lowering

When generation is disabled:

if not allow_generation:
    # Lower the threshold by 0.07 (absolute, not 7% relative)
    current_threshold -= 0.07
    
    if similarity > current_threshold:
        # Accept with lowered threshold
        return label, similarity
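
As a worked example of the arithmetic, the sketch below assumes the lowered threshold starts from the main THRESHOLD of 0.40, which gives an acceptance cutoff of 0.33. The function name and the "Uncategorized" fallback for rejected files are assumptions for illustration.

```python
THRESHOLD = 0.40            # main threshold from the configuration
THRESHOLD_REDUCTION = 0.07  # absolute reduction when generation is disabled

def accept_without_generation(label, similarity, threshold=THRESHOLD):
    """Hypothetical helper: apply the lowered threshold when Gemini is off."""
    lowered = threshold - THRESHOLD_REDUCTION
    if similarity > lowered:
        return label, similarity
    # Assumption: files below the lowered threshold stay uncategorized
    return "Uncategorized", 0.0

print(accept_without_generation("Physics", 0.34))
print(accept_without_generation("Physics", 0.30))
```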

3. Manual Input

After max retries:

if retries >= MAX_RETRIES:
    # Ask user for manual label
    manual_label = input(f"Enter label for '{filename}': ")
    
    # Optionally generate description
    if user_wants_description:
        generate_folder_label(text, forced_label=manual_label)

Performance Characteristics

Time Complexity

| Operation | Complexity | Notes |
|---|---|---|
| Text Extraction | O(n) | n = file size |
| RL Policy Decision | O(1) | Table lookup / State check |
| Embedding Generation | O(1) | Fixed for 2000 chars |
| FAISS Search | O(k) | k = number of labels |
| Label Generation | O(1) | API call latency |
| Index Rebuild | O(k) | k = number of labels |

Space Complexity

| Component | Size | Scaling |
|---|---|---|
| Embedding Model | 420MB | One-time |
| FAISS Index | ~3KB/label | Linear |
| folder_labels.json | ~500B/label | Linear |
| Logs | Variable | Can be disabled |

Bottlenecks

  1. Text Extraction (especially OCR)
    • Solution: Use more threads
    • Solution: Pre-extract text
  2. Gemini API Latency (~1-2s per call)
    • Solution: Batch processing
    • Solution: Pre-generate common labels
  3. Model Download (first run only)
    • Solution: Manual download and cache

Configuration Points

Similarity Thresholds

File: scripts/classify_process_file.py

THRESHOLD = 0.4                    # Adjust for strictness
low_confidence_threshold = 0.35    # Fallback threshold

Text Extraction

File: scripts/extract_text.py

PDF_CONFIG = {
    'MAX_INPUT_CHARS': 2000,       # Increase for longer docs
    'QUALITY_THRESHOLD': 0.4,      # Lower to accept more pages
    'START_PAGE_ASSUMPTION': 3,    # Skip more/fewer initial pages
}

Embedding Model

File: scripts/create_index.py

MODEL_NAME = "BAAI/bge-base-en-v1.5"  # Change model here

Gemini Settings

File: scripts/generate_label.py

MODEL = "gemini-2.5-flash"         # Gemini model version
temperature = 0.5                  # Creativity (0.0-1.0)

RL Settings

File: scripts/RL/rl_config.py

EPSILON = 0.2          # Exploration rate
LEARNING_RATE = 0.1    # Q-Learning rate

Design Decisions

Why FAISS over other vector DBs?

Why SBERT over other embeddings?

Why keyword descriptions?

Proven through testing:

See NL vs Keywords Study for details.

Why Gemini for generation?


