Natural Language vs Original Keywords: Comprehensive Analysis
Test Date: 2025-12-05
Commit ID: 99d03e579a9fd413d7938f01559ebb0172a260e3
Test Files: 75 NCERT textbook markdown files
Model used for study: all-mpnet-base-v2 (SentenceTransformer)
Current Default Model: BAAI/bge-base-en-v1.5
Executive Summary
WINNER: Original Keywords Approach
The comprehensive testing revealed that comma-separated keyword terms dramatically outperform natural language descriptions for SBERT-based semantic matching with academic documents.
| Metric | Natural Language | Original Keywords | Difference |
|--------|------------------|-------------------|------------|
| Accuracy | 24.0% | 56.0% | +32.0 pts |
| Avg Similarity | 0.104 | 0.355 | +241% |
| Uncategorized Files | 53/75 (71%) | 8/75 (11%) | -60 pts |
| Failed (0.00 sim) | 53 files | 8 files | -45 files |
| Avg Time/File | 0.30s | 0.27s | -10% |
Verdict: Original keyword-based descriptions are superior for NCERT academic documents with SBERT embeddings.
Test Configuration
Test Setup
- Test Files: NCERT textbook markdown files (various subjects: Physics, Chemistry, Maths, Geography, Biology, Civics)
- Natural Language Test: NCERT_NL_TEST.log (Content-style descriptions)
- Original Keyword Test: NCERT_OG_TEST.log (Comma-separated terms)
- Total Files: 75 documents
- Embedding Model during test: all-mpnet-base-v2
- Vector Search: FAISS IndexFlatIP
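For reference, here is a minimal sketch of the retrieval setup shared by both tests. The two-label dict is illustrative only; the real descriptions live in folder_labels.json.

```python
# Minimal sketch of the shared retrieval setup; the label entries below are
# illustrative, not the real folder_labels.json contents.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # model used for this study

labels = {
    "Physics": "mechanics, thermodynamics, optics, forces, energy, motion",
    "Chemistry": "periodic table, chemical bonding, reactions, acids, bases",
}
names = list(labels)

# Encode label descriptions; normalization makes inner product equal cosine
desc_vecs = model.encode(list(labels.values()), normalize_embeddings=True)
index = faiss.IndexFlatIP(desc_vecs.shape[1])
index.add(np.asarray(desc_vecs, dtype="float32"))

# Classify one document by nearest label description
doc_vec = model.encode(["<extracted NCERT markdown text>"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(doc_vec, dtype="float32"), 1)
print(names[ids[0][0]], round(float(scores[0][0]), 3))
```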
Approaches Tested
Approach 1: Natural Language Descriptions
Hypothesis: Content-style descriptions would match better with SBERT
Format:
"Documents contain experimental procedures and physical laws. Content includes
laboratory observations, scientific principles like Bernoulli's equation, and
analysis of forces, energy, and motion. Typical terminology involves physical
quantities, units, and theoretical concepts."
Characteristics:
- Full sentences mimicking document content
- Natural grammatical structure
- Descriptive phrases about what documents contain
- 50-100 words per description
Approach 2: Original Keywords (Comma-Separated Terms)
Hypothesis: Dense keyword lists capture semantic essence
Format:
"mechanics, thermodynamics, electromagnetism, optics, quantum mechanics,
relativity, fluid dynamics, kinematics, forces, energy, motion, waves,
heat, light, electricity, magnetism, laboratory experiments, scientific
formulas, physical laws"
Characteristics:
- Comma-separated terms
- Dense terminology (20-40 terms)
- Domain-specific vocabulary
- Synonyms and related concepts
- No grammatical structure
Detailed Results
| Metric | Natural Language | Original Keywords | Winner |
|--------|------------------|-------------------|--------|
| Total Files | 75 | 75 | - |
| Accuracy | 24.0% | 56.0% | OG |
| Avg Similarity | 0.104 | 0.355 | OG |
| Uncategorized | 53 | 8 | OG |
| Low Confidence (<0.40) | 17 | 40 | NL |
| Avg Time/File | 0.30s | 0.27s | OG |
Similarity Score Distribution
| Range | Natural Language | Original Keywords | Analysis |
|-------|------------------|-------------------|----------|
| 0.00 (Failed) | 53 | 8 | NL failed on 71% of files |
| 0.01-0.20 (Very Low) | 0 | 0 | - |
| 0.21-0.30 (Low) | 4 | 8 | Both struggled here |
| 0.31-0.40 (Medium) | 14 | 34 | OG had 2.4x more medium matches |
| 0.41-0.50 (Good) | 3 | 16 | OG had 5.3x more good matches |
| 0.51+ (Excellent) | 1 | 9 | OG had 9x more excellent matches |
Visual Distribution
```
Natural Language:
0.00 (Failed): |||||||||||||||||||||||||||||||||||||||||||||||||| 53
0.31-0.40:     |||||||||| 14
0.41-0.50:     || 3
0.51+:         | 1

Original Keywords:
0.00 (Failed): |||||||| 8
0.21-0.30:     |||||||| 8
0.31-0.40:     |||||||||||||||||||||||||||||||||| 34
0.41-0.50:     |||||||||||||||| 16
0.51+:         ||||||||| 9
```
Label Distribution
Natural Language Approach
- Uncategorized: 53 files (71%)
- Biotechnology: 9 files
- Chemistry: 5 files
- Civics: 5 files
- English: 3 files
Original Keywords Approach
- Geography: 19 files
- English: 15 files
- Biotechnology: 14 files
- Physics: 14 files
- Uncategorized: 8 files (11%)
- Civics: 5 files
Analysis: Why Natural Language Failed
1. Too Specific / Overfitted
Natural language descriptions were too specific to the training examples:
BAD: "Integration by parts where u equals x squared, dv equals e to the x dx.
Series convergence testing using ratio test, comparison test."
This doesn’t match NCERT files which have different phrasing and structure.
2. SBERT Embedding Mismatch
- For this use case, SBERT matching rewards keyword density over natural sentences
- Academic documents use consistent terminology that keyword lists capture directly
- Natural sentences introduce grammatical noise
3. Semantic Noise
Natural sentences introduced grammatical structure that added noise:
- Articles: “the”, “a”, “an”
- Verbs: “contains”, “includes”, “discusses”
- Prepositions: “in”, “on”, “with”
- Phrases: “Documents contain…”, “Content includes…”
These dilute the semantic signal for academic content.
4. Reduced Keyword Density
Natural language spreads keywords across more tokens:
```
Natural Language (100 words):
"Documents contain experimental procedures investigating physical laws.
Content includes laboratory observations, scientific principles..."
→ ~15 meaningful keywords in 100 words (15% density)

Keywords (40 terms):
"mechanics, thermodynamics, electromagnetism, optics, quantum mechanics..."
→ 40 meaningful keywords in 40 words (100% density)
```
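A quick way to see this difference in code, assuming a toy stopword list (so the exact ratios are indicative rather than the 15%/100% figures quoted above):

```python
# Toy keyword-density check; STOPWORDS is a small illustrative subset.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "on", "with",
             "contain", "contains", "includes", "content", "documents"}

def keyword_density(text: str) -> float:
    tokens = [t.strip(",.").lower() for t in text.split()]
    meaningful = [t for t in tokens if t and t not in STOPWORDS]
    return len(meaningful) / len(tokens)

nl = "Documents contain experimental procedures investigating physical laws."
kw = "mechanics, thermodynamics, electromagnetism, optics, kinematics"
print(f"NL density: {keyword_density(nl):.0%}")  # below 100%: filler tokens remain
print(f"KW density: {keyword_density(kw):.0%}")  # 100%: every token is a keyword
```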
5. Embedding Space Geometry
- The embedding models cluster keyword lists more effectively
- Natural sentences might introduce variance in the embedding space
- Keywords create denser, more consistent semantic clusters
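A hedged toy check of this effect, assuming a short made-up physics snippet; absolute scores will vary by model version, but the gap illustrates the point:

```python
# Toy comparison: one document snippet scored against both description styles.
# Scores are indicative only; the snippet and descriptions are invented here.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

doc = "A body in uniform circular motion experiences centripetal acceleration."
nl_desc = ("Documents contain experimental procedures and physical laws. "
           "Content includes laboratory observations and scientific principles.")
kw_desc = ("mechanics, kinematics, dynamics, forces, energy, motion, "
           "circular motion, acceleration, velocity, physical laws")

d, n, k = model.encode([doc, nl_desc, kw_desc], normalize_embeddings=True)
print("vs natural language:", round(util.cos_sim(d, n).item(), 3))
print("vs keyword list:    ", round(util.cos_sim(d, k).item(), 3))
```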
Analysis: Why Keywords Succeeded
1. Broader Semantic Coverage
Keyword lists cover MORE semantic space with FEWER tokens:
GOOD: "mechanics, thermodynamics, optics, electromagnetism, quantum physics,
relativity, fluid dynamics, kinematics, forces, energy, motion, waves"
Each term is a direct semantic signal without noise.
2. Better Alignment with NCERT Structure
NCERT files use consistent academic terminology:
- Technical terms appear frequently
- Keyword matching aligns with document structure
- Domain-specific vocabulary is strong signal
3. Synonym Coverage
Keywords naturally include synonyms and related terms:
"electricity, electrical, electromagnetism, electromagnetic"
"optics, optical, light, vision"
"mechanics, mechanical, motion, kinematics, dynamics"
4. SBERT Model Characteristics
- Trained on diverse text, but keyword matching is robust
- Academic terminology clusters well in embedding space
- Synonym proximity enhances matching
5. Efficient Token Usage
Every token in keyword descriptions carries semantic weight:
- No grammatical filler
- No redundant phrases
- Maximum information density
Raw Data Summary
Natural Language Test
- Total Files: 75
- Correct Classifications: 18
- Accuracy: 24.0%
- Avg Similarity: 0.104
- Uncategorized: 53 (71%)
- Failed (0.00 sim): 53
- Good Matches (>0.40): 4 (5%)
Original Keywords Test
- Total Files: 75
- Correct Classifications: 42
- Accuracy: 56.0%
- Avg Similarity: 0.355
- Uncategorized: 8 (11%)
- Failed (0.00 sim): 8
- Good Matches (>0.40): 25 (33%)
Improvement over Natural Language:
- Accuracy: +32.0 percentage points (a 133% relative increase)
- Avg Similarity: +0.251 (a 241% relative increase)
- Uncategorized: down 60 percentage points (from 71% to 11%)
- Good Matches: up from 5% to 33% of files (a 525% increase)
Best Practices & Recommendations
DO: Use Keyword-Based Descriptions
```json
{
  "folder_label": "Physics",
  "description": "mechanics, thermodynamics, electromagnetism, optics, quantum mechanics, nuclear physics, relativity, astrophysics, fluid dynamics, solid state physics, particle physics, kinematics, dynamics, statics, forces, energy, motion, matter, waves, fields, radiation, heat, light, sound, electricity, magnetism, gravity, laboratory experiments, scientific formulas, physical laws, conservation laws, measurement, vectors, scalars, work, power, momentum, impulse",
  "keywords": "physics, mechanics, thermodynamics, optics, quantum, relativity, energy, force, motion, waves"
}
```
Key Elements
- 20-40 comma-separated terms in description
- Mix of:
- Core concepts (mechanics, thermodynamics)
- Sub-topics (fluid dynamics, particle physics)
- Related terms (energy, force, motion)
- Synonyms (electricity/electrical, optics/optical)
- 8-12 high-value keywords (most important terms)
- No articles, verbs, or grammatical structure
- Dense terminology without filler words
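A small lint-style helper can enforce this checklist before a new label is saved. The thresholds come straight from the list above; the function itself is hypothetical, not part of the project's scripts:

```python
# Hypothetical lint for a new label entry, enforcing the checklist above.
FILLER_WORDS = {"the", "a", "an", "contains", "includes", "involves", "documents"}

def check_label(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry passes."""
    problems = []
    terms = [t.strip() for t in entry["description"].split(",")]
    if not 20 <= len(terms) <= 40:
        problems.append(f"{len(terms)} description terms; want 20-40")
    keywords = [k.strip() for k in entry["keywords"].split(",")]
    if not 8 <= len(keywords) <= 12:
        problems.append(f"{len(keywords)} keywords; want 8-12")
    words = {w.lower() for t in terms for w in t.split()}
    if words & FILLER_WORDS:
        problems.append(f"filler words found: {sorted(words & FILLER_WORDS)}")
    return problems
```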
DON’T: Use Natural Language Descriptions
```json
{
  "folder_label": "Physics",
  "description": "Documents contain experimental procedures and physical laws. Content includes laboratory observations, scientific principles like Newton's laws and Bernoulli's equation. Typical terminology involves physical quantities, units, and theoretical concepts related to mechanics, thermodynamics, and electromagnetism.",
  "keywords": "physics, experiment, laboratory"
}
```
Problems
- Grammatical noise (contains, includes, involves)
- Low keyword density (too many filler words)
- Overly specific examples (Newton’s laws, Bernoulli’s equation)
- Reduced semantic coverage (fewer meaningful terms)
- Embedding space dilution (noise reduces signal)
Key Insights
For SBERT-Based Semantic Matching
What Works:
- Dense keyword lists (20-40 terms)
- Domain-specific vocabulary (technical terms)
- Synonym inclusion (multiple ways to express concepts)
- Broad coverage (core + sub-topics + related terms)
- No grammatical structure (pure semantic content)
What Doesn’t Work:
- Natural language sentences (too much noise)
- Explanatory text (“Documents contain…”)
- Specific examples (overfitting to training data)
- Grammatical connectors (articles, prepositions, verbs)
- Low keyword density (diluted semantic signal)
Why This Matters
Embedding Characteristics:
- Trained on natural text BUT keyword matching is robust
- Academic terminology clusters well in embedding space
- Keyword density creates stronger semantic signals
- Grammatical structure adds variance without value
Document Characteristics:
- NCERT files use consistent academic terminology
- Technical terms are strong classification signals
- Keyword matching aligns with document structure
- Domain vocabulary is predictable and stable
Actionable Next Steps
1. Keep Using Keywords (Confirmed Best Practice)
- Continue with comma-separated keyword terms in descriptions
- Maintain 20-40 terms per description
- Focus on domain-specific vocabulary
2. Expand Keyword Coverage (High Priority)
Add more synonyms and related terms to existing labels:
Example: Expand Physics
Current: "mechanics, thermodynamics, optics, electromagnetism, quantum physics"
Expanded: "mechanics, thermodynamics, optics, electromagnetism, quantum physics,
nuclear physics, relativity, astrophysics, fluid dynamics, solid state physics,
particle physics, kinematics, dynamics, statics, forces, energy, motion, matter,
waves, fields, radiation, heat, light, sound, electricity, magnetism, gravity,
laboratory experiments, scientific formulas, physical laws, conservation laws,
measurement, vectors, scalars, work, power, momentum, impulse, friction, tension,
pressure, density, velocity, acceleration, displacement, torque, angular momentum"
3. Optimize Prompts (Medium Priority)
Current prompt optimizations completed:
- Removed verbose explanations
- Consolidated examples (15 focused examples)
- Streamlined merge logic
- Better error messages
Further optimizations:
- Fine-tune temperature settings
- Adjust similarity thresholds based on performance
- Add more edge case examples
4. Test with Diverse Documents (Medium Priority)
Validate approach with:
- PDFs with different structures
- Scanned documents (OCR text)
- Short vs long documents
- Multi-topic documents
- Non-academic content (news, legal, medical)
5. Implement Training Dataset Optimization (Low Priority)
From README ideas:
> "Use the dataset with category labels, generate folder labels until similarity crosses threshold for all training files"
This could further optimize descriptions through iterative refinement.
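A sketch of that idea, assuming a `regenerate` callback standing in for the Gemini call in scripts/generate_label.py (the callback and seed description are hypothetical):

```python
# Iterative refinement sketch: regenerate a description until every training
# file clears the threshold. `regenerate` is a hypothetical hook to the LLM.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

def refine_description(label, seed_desc, training_texts, regenerate,
                       threshold=0.40, max_rounds=5):
    desc = seed_desc
    for _ in range(max_rounds):
        d = model.encode(desc, normalize_embeddings=True)
        docs = model.encode(training_texts, normalize_embeddings=True)
        scores = util.cos_sim(d, docs)[0]
        misses = [t for t, s in zip(training_texts, scores) if s < threshold]
        if not misses:
            break  # all training files cross the threshold
        desc = regenerate(label, misses)  # ask the LLM to broaden coverage
    return desc
```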
6. Add More Academic Labels (Low Priority)
Consider expanding to:
- Economics
- Psychology
- Political Science
- Sociology
- Philosophy
- Art/Music
- Physical Education
- Environmental Science
Files & Documentation
Created Documents
- NCERT_COMPARISON_REPORT.md - Detailed metrics and analysis
- FINAL_VERDICT.md - Visual summary with clear verdict
- OPTIMIZATION_SUMMARY.md - Actionable next steps
- comparison_metrics.json - Raw test data
- scripts/compare_logs.py - Log analysis tool
- wiki/NL_VS_OG.md - This comprehensive document
Test Logs
- logs/NCERT_NL_TEST.log - Natural language approach results
- logs/NCERT_OG_TEST.log - Original keywords approach results
Code Files
- scripts/generate_label.py - Label generation with Gemini (using keywords)
- scripts/classify_process_file.py - Classification and processing logic
- scripts/create_index.py - FAISS index creation
- folder_labels.json - Current keyword-based label database
Success Metrics
Achieved:
- 56% accuracy on NCERT test files
- 89% categorization rate (only 11% uncategorized)
- 0.355 average similarity (acceptable range)
- 33% of files with >0.40 similarity (good matches)
Targets:
- 70%+ accuracy
- 95%+ categorization rate
- 0.45+ average similarity
- 50%+ files with >0.40 similarity
How to Achieve Targets
- Expand keyword coverage in descriptions (+10-15% accuracy)
- Add more training examples to Gemini prompts (+5% accuracy)
- Fine-tune similarity thresholds based on data (+5% categorization)
- Test and iterate with larger datasets (validate improvements)
Technical Details
Embedding Model
- Current Model: BAAI/bge-base-en-v1.5
- Dimensions: 768
- Normalization: L2 normalized embeddings
- Similarity: Cosine similarity (via FAISS IndexFlatIP)
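Why IndexFlatIP yields cosine similarity here: after L2 normalization, the inner product of two vectors equals their cosine. A minimal numeric check:

```python
# On L2-normalized vectors, inner product == cosine similarity, which is why
# FAISS IndexFlatIP serves as a cosine index in this pipeline.
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=768)  # 768 dims, matching bge-base-en-v1.5
b = rng.normal(size=768)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)
assert np.isclose(np.dot(a, b), cosine)  # identical after normalization
```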
Classification Thresholds
- Main Threshold: 0.40 (accept classification)
- Low Confidence: 0.35 (fallback threshold)
- Fallback Logic: Retry with middle pages if initial extraction fails
Processing Pipeline
```
File Input
    ↓
Text Extraction (extract_text.py)
    ↓
SBERT Encoding (BAAI/bge-base-en-v1.5)
    ↓
FAISS Similarity Search
    ↓
Threshold Check (0.40)
    ↓
High Confidence? → Classify
Low Confidence?  → Generate Label (Gemini)
    ↓
Update folder_labels.json
    ↓
Rebuild FAISS Index
    ↓
Re-classify
```
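In code terms, one pass of this loop might look like the following sketch; the thresholds come from this document, while the extraction, Gemini, and index-rebuild steps are stubbed out as comments:

```python
# One pass of the pipeline above; a sketch, not the scripts' exact code.
MAIN_THRESHOLD = 0.40  # accept classification
LOW_CONFIDENCE = 0.35  # fallback threshold

def classify_text(text, model, index, label_names):
    vec = model.encode([text], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(vec, 1)  # FAISS IndexFlatIP lookup
    score, best = float(scores[0][0]), int(ids[0][0])
    if score >= MAIN_THRESHOLD:
        return label_names[best], score  # high confidence: classify directly
    if score >= LOW_CONFIDENCE:
        return label_names[best], score  # low confidence: caller may retry
    # Below both thresholds: generate a new label (Gemini), update
    # folder_labels.json, rebuild the FAISS index, then re-classify.
    return None, score
```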
Lessons Learned
1. Simpler is Better
For SBERT embeddings with academic content, simple keyword lists outperform complex natural language descriptions.
2. Keyword Density Matters
Maximum information density (100% keywords vs 15% keywords in sentences) creates stronger semantic signals.
3. SBERT Prefers Keywords
Despite being trained on natural text, SBERT embeddings cluster keyword lists more effectively for classification tasks.
4. Avoid Overfitting
Natural language descriptions that mimic specific examples fail to generalize to new documents.
5. Grammatical Noise is Real
Articles, verbs, and prepositions dilute semantic signals without adding classification value.
6. Test Before Assuming
The hypothesis that “content-style descriptions would match better” was proven wrong through rigorous testing.
7. Original Intuition Was Correct
The initial keyword-based approach was optimal all along; sometimes the simplest solution is the best.
Final Recommendations
DO:
- Continue using comma-separated keyword terms
- Maintain 20-40 terms per description
- Include synonyms and related vocabulary
- Focus on domain-specific terminology
- Keep descriptions dense without filler words
- Avoid generic words (document, file, report)
- Test with diverse document types
- Monitor accuracy metrics over time
- Expand keyword coverage iteratively
- Keep prompts concise and clear
DON’T:
- Switch to natural language descriptions
- Add explanatory text (“Documents contain…”)
- Include grammatical structure
- Use overly specific examples
- Reduce keyword density
- Ignore similarity metrics
- Overfit to training examples
- Add filler words or connectors
- Assume without testing
- Complicate what works simply
Conclusion
The comprehensive testing with 75 NCERT files definitively proved that:
Keyword Terms » Natural Language
Results:
- Accuracy: +32 percentage points (24% → 56%)
- Avg similarity: +241% (0.104 → 0.355)
- Uncategorized: down 60 percentage points (71% → 11%)
- Good matches: up from 5% to 33% of files (+525%)
Why Keywords Win:
- Maximum semantic density (100% vs 15%)
- Better SBERT embedding alignment
- Reduced grammatical noise
- Broader coverage per token
- Stronger classification signals
Recommendation:
Keep using the original keyword-based approach; it has proven superior for SBERT semantic matching with academic documents.
Status: Analysis Complete - Keyword Approach Validated
Date: 2025-12-05
Test Files: 75 NCERT documents
Verdict: Original Keywords Win by a Landslide!
This document consolidates all findings from the Natural Language vs Original Keywords comparison testing for the FileSense project.