Issue: List-Type Content Creates Toxic Chunks (100-300 entities)
Problem Description
During a 10-document test, we discovered that list-type content (bibliographies, indexes, name lists) creates toxic chunks containing 100-300 entities, regardless of the CHUNK_SIZE setting.
Test Results
- 9/10 documents: Normal processing (10-30 entities/chunk) ✅
- 1/10 documents: Toxic list chunks (104, 243, 279 entities) ⚠️
Example Toxic Chunks
Chunk 147 of 153: 104 Ent + 0 Rel (list detected: 0 relationships)
Chunk 148 of 153: 243 Ent + 0 Rel (list detected: 0 relationships)
Chunk 149 of 153: 279 Ent + 0 Rel (list detected: 0 relationships)
Telltale signs: High entity count + 0 relationships = list content
Impact on 8,000 File Run
- ~800 documents (10%) affected
- Each toxic chunk: 3-5x slower processing
- Adds 6-8 hours to total indexing time (28-32 hours total)
- System handles it without crashing ✅
- But significantly impacts performance ⚠️
Root Cause
List-type content yields one extracted entity per line:
- Albert Einstein
- Isaac Newton
- Marie Curie
...
CHUNK_SIZE reduction doesn't help because:
- 800-token chunk of a list = 200+ entities
- Each line is an entity
- Chunking doesn't reduce entity density
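A quick back-of-the-envelope estimate illustrates this (the ~4 tokens per list line is an assumption, not measured from the corpus):

```python
# Illustrative arithmetic only; the 4-token average per list line is an assumption.
CHUNK_SIZE_TOKENS = 800          # chunk size used in the test run
AVG_TOKENS_PER_LIST_LINE = 4     # e.g. "- Albert Einstein" is roughly 3-5 tokens

entities_per_chunk = CHUNK_SIZE_TOKENS // AVG_TOKENS_PER_LIST_LINE
print(entities_per_chunk)        # 200 -- halving CHUNK_SIZE halves this number,
                                 # but doubles the number of list chunks,
                                 # so total extraction work stays the same
```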
Proposed Solutions
Solution 1: List Detection & Filtering (RECOMMENDED)
Detect list chunks based on:
- Entity count > 50 AND
- Relationship count < 5 AND
- Average entity name length < 20 characters
Action: Skip or summarize instead of extracting
Implementation:
# In lightrag/operate.py, after extraction:

def is_list_chunk(entities, relationships):
    """Detect if a chunk is list-type content (many entities, few relationships)."""
    if len(entities) < 50:
        return False
    if len(relationships) > 5:
        return False
    avg_length = sum(len(e["entity_name"]) for e in entities) / len(entities)
    if avg_length > 20:
        return False
    return True

# In extract_entities():
if is_list_chunk(entities, relationships):
    logger.warning(f"List chunk detected with {len(entities)} entities - summarizing")
    # Keep only the top 20 most significant entities
    entities = filter_top_entities(entities, top_n=20)

Expected impact:
- Reduces toxic chunk processing time by 80%
- Saves 6-8 hours on 8K file run
- No quality loss (lists have low information density)
Effort: 2-3 hours
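The snippet above calls filter_top_entities(), which does not exist yet in lightrag/operate.py; a minimal sketch of what it could look like, where the scoring heuristic (description presence, then name length) is an assumption rather than the planned implementation:

```python
def filter_top_entities(entities, top_n=20):
    """Hypothetical helper: keep only the top_n most informative entities.

    Assumption: entities with a description and a longer name are more
    significant than bare one-word list lines.
    """
    def score(entity):
        has_description = bool(entity.get("description"))
        return (has_description, len(entity["entity_name"]))

    return sorted(entities, key=score, reverse=True)[:top_n]
```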
Solution 2: Entity Filtering
Filter low-value entities from list chunks (see the sketch below):
- Single-word entities without context
- Entities shorter than 3 characters
- Entities matching stop words
Implementation: See TOXIC_CHUNK_SOLUTIONS.md Solution #5
Expected impact: 40-60% reduction in entity count
Effort: 2-3 hours
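A minimal sketch of that filtering, assuming the extraction output uses entity_name/description keys on entities and src_id/tgt_id keys on relationships; the stop-word set and thresholds are illustrative, not taken from TOXIC_CHUNK_SOLUTIONS.md:

```python
STOP_WORDS = {"index", "bibliography", "appendix", "contents"}  # illustrative

def filter_low_value_entities(entities, relationships):
    """Drop entities from a list chunk that carry little graph value."""
    # Keep anything that participates in a relationship, whatever its name.
    related = {r["src_id"] for r in relationships} | {r["tgt_id"] for r in relationships}

    kept = []
    for entity in entities:
        name = entity["entity_name"].strip()
        if name in related:
            kept.append(entity)
            continue
        if len(name) < 3:                      # shorter than 3 characters
            continue
        if name.lower() in STOP_WORDS:         # matches a stop word
            continue
        if " " not in name and not entity.get("description"):
            continue                           # single word without context
        kept.append(entity)
    return kept
```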
Solution 3: Pre-processing (Manual)
Identify and remove bibliographies/indexes before indexing.
Pros: 100% effective
Cons: Manual work, not scalable
Recommendation: Not feasible for 8K files
Configuration Option (Quick Fix)
Add to .env:
# Toxic chunk handling
MAX_ENTITIES_PER_CHUNK=100 # Skip chunks above this
SKIP_TOXIC_CHUNKS=true # Set true to skip instead of process

Pros: Quick (10 minutes)
Cons: Data loss (skips entire chunks)
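Neither variable exists in LightRAG today; a sketch of how the quick fix could read and apply them (the function name and placement are assumptions):

```python
import os

# Proposed settings; neither variable exists in LightRAG yet.
MAX_ENTITIES_PER_CHUNK = int(os.getenv("MAX_ENTITIES_PER_CHUNK", "100"))
SKIP_TOXIC_CHUNKS = os.getenv("SKIP_TOXIC_CHUNKS", "false").lower() == "true"

def handle_chunk_entities(entities):
    """Apply the quick-fix policy before merging a chunk's entities into the graph."""
    if len(entities) <= MAX_ENTITIES_PER_CHUNK:
        return entities
    if SKIP_TOXIC_CHUNKS:
        # Drop the chunk's entities entirely (accepted data loss).
        return []
    # Otherwise fall through and process normally (current behaviour).
    return entities
```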
Recommended Implementation Order
Phase 1: Accept It (CURRENT - for 8K run)
- Status: IMPLEMENTED ✅
- Action: Proceed with 8K file run
- Expected time: 28-32 hours
- Risk: Low (test proved it works)
Phase 2: List Detection (AFTER 8K run)
- When: After initial indexing completes
- Effort: 2-3 hours implementation
- Benefit: 20-25% faster for future incremental updates
- Priority: Medium
Phase 3: Entity Filtering (FUTURE)
- When: If Phase 2 insufficient
- Effort: 2-3 hours implementation
- Benefit: Additional 10-15% speedup
- Priority: Low
Acceptance Criteria
For Phase 2 Implementation
- Implement is_list_chunk() detection function
- Add configuration options for list handling
- Test on 10-document corpus (include list-heavy doc)
- Verify processing time reduction (expect 20-25% faster)
- Ensure no quality loss on normal documents
- Update documentation
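A possible pytest sketch for the detection criterion, using fixture values that mirror the toxic chunks observed in the test run (the import path assumes Phase 2 places is_list_chunk() in lightrag/operate.py):

```python
# Assumes Phase 2 adds is_list_chunk() to lightrag/operate.py.
from lightrag.operate import is_list_chunk

def make_entities(n, name="Entity"):
    """Short, name-only entities like those extracted from a bibliography."""
    return [{"entity_name": f"{name} {i}"} for i in range(n)]

def test_list_chunk_is_detected():
    # 243 short entities, zero relationships: mirrors chunk 148 from the test run.
    assert is_list_chunk(make_entities(243), relationships=[])

def test_normal_chunk_is_not_flagged():
    # A typical chunk: few entities, several relationships.
    entities = make_entities(15, name="A normal multi-word entity name")
    relationships = [{"src_id": "a", "tgt_id": "b"}] * 10
    assert not is_list_chunk(entities, relationships)
```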
Success Metrics
- List chunks detected: >90% accuracy
- Processing time for toxic docs: <2 min (vs 5-10 min before)
- Entity quality: No regression on normal content
- Index completeness: >99% coverage
Related Documentation
- Test results: TEST_RESULTS_AND_NEXT_STEPS.md
- Detailed solutions: TOXIC_CHUNK_SOLUTIONS.md
- Optimization guide: OPTIMIZATION_SUMMARY.md
Current Status
- Issue identified in 10-doc test
- Impact quantified (~10% of documents)
- Root cause understood (list-type content)
- Solutions designed
- Implementation (blocked by 8K indexing run)
- Testing
- Deployment
Next action: Proceed with 8K file run (accept toxic chunks for now), implement detection after completion.
Estimated impact: 6-8 hour time savings on future indexing runs
Implementation effort: 2-3 hours
Priority: Medium (not blocking for initial 8K run)