# Textblitz

## Table of Contents

- Introduction
- Features
- Architecture
- How SimHash Works
- Installation
- Usage
- Error Handling
- Performance Benchmarks
- Conclusions and Recommendations
- Contributing
- License
## Introduction

Textblitz is a fast and scalable text indexing system written in Go, designed to efficiently search and retrieve data from large text files. By breaking files down into manageable chunks and using SimHash-based fingerprinting, Textblitz enables rapid content retrieval. Key performance highlights include:
- 0.33-second indexing of 100MB PDFs with 2 workers
- 6-17ms lookup latency across diverse file types
- Memory efficiency (0-3MB usage) even with large inputs
## Features

- Efficient Chunking: Configurable fixed-size text splitting (default: 4KB)
- SimHash Fingerprinting: Generates similarity-preserving 64-bit hashes
- Dual Feature Extraction:
  - Word-based: Optimal for natural-language text
  - N-gram-based: Effective for code and multilingual text patterns
- Fuzzy Matching: Adjustable Hamming distance thresholds (0-5)
- Parallel Processing: Multi-threaded architecture with worker pools
- Cross-Platform: Supports Linux, macOS, and Windows
## Architecture

Textblitz employs a pipeline architecture for optimized text processing:
```mermaid
graph TB
    Input[Text File] --> Chunker[Chunk Splitter]
    Chunker --> WorkerPool{Worker Pool}
    WorkerPool --> Worker1[Worker 1]
    WorkerPool --> Worker2[Worker 2]
    WorkerPool --> WorkerN[Worker N]
    Worker1 --> HashGen[SimHash Generator]
    Worker2 --> HashGen
    WorkerN --> HashGen
    HashGen --> IndexBuilder[Index Builder]
    IndexBuilder --> IndexFile[(Index File)]
    LookupCmd[Lookup Command] --> SearchIndex[Search Index]
    SearchIndex --> RetrieveChunk[Retrieve Chunk]
    IndexFile -.-> SearchIndex
```
- Chunk Splitting: Divides input files into fixed-size segments
- Parallel Processing: Distributes chunks across worker goroutines
- Hash Generation: Computes SimHash values using FNV-1a hashing
- Index Construction: Maps hashes to file positions using optimized storage
- Lookup System: Enables exact and fuzzy search through Hamming distance comparison
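To make the flow concrete, here is a minimal sketch of a chunk-and-hash worker pool in Go. It is illustrative only, not Textblitz's actual code: `chunk`, `entry`, `buildIndex`, and the stand-in `simhash` are names invented for this example, and the real index uses optimized storage rather than an in-memory Go map.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"os"
	"sync"
)

// chunk pairs a fixed-size slice of the input with its file offset.
type chunk struct {
	offset int64
	data   []byte
}

// entry records the fingerprint computed for one chunk.
type entry struct {
	hash   uint64
	offset int64
}

// simhash is a stand-in fingerprint; a fuller sketch appears under
// "How SimHash Works" below.
func simhash(data []byte) uint64 {
	h := fnv.New64a()
	h.Write(data)
	return h.Sum64()
}

// buildIndex splits the file into chunkSize-byte segments, fans them
// out to a pool of worker goroutines, and collects hash -> offsets.
func buildIndex(path string, chunkSize, workers int) (map[uint64][]int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	chunks := make(chan chunk)
	entries := make(chan entry)

	// Worker pool: hash chunks as they arrive.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for c := range chunks {
				entries <- entry{hash: simhash(c.data), offset: c.offset}
			}
		}()
	}
	go func() { wg.Wait(); close(entries) }()

	// Producer: read fixed-size chunks and feed the pool.
	go func() {
		defer close(chunks)
		var off int64
		for {
			buf := make([]byte, chunkSize)
			n, err := f.Read(buf)
			if n > 0 {
				chunks <- chunk{offset: off, data: buf[:n]}
				off += int64(n)
			}
			if err != nil {
				return // io.EOF (or a read error) ends the stream
			}
		}
	}()

	// Index builder: collect entries into a hash -> offsets map.
	index := make(map[uint64][]int64)
	for e := range entries {
		index[e.hash] = append(index[e.hash], e.offset)
	}
	return index, nil
}

func main() {
	index, err := buildIndex("input.txt", 4096, 4)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("indexed %d distinct fingerprints\n", len(index))
}
```

Because every worker pulls from a single shared channel, chunk hashing proceeds in parallel while the producer keeps reading ahead.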
## How SimHash Works

- Feature Extraction:
  - Split text into words or n-grams
  - Normalize to lowercase
- Weighted Hashing:
  - Hash each feature using FNV-1a
  - Aggregate weights per bit position (+1 if the bit is set, -1 if not)
- Threshold Determination:
  - Set output bits whose cumulative weight is positive
- Similarity Comparison:
  - Calculate the Hamming distance between hashes
| Example Text | SimHash | Hamming Distance |
|---|---|---|
| "The quick brown fox" | 0x3f7c9b1a | (reference) |
| "The quick brown dog" | 0x3f7c9b58 | 3 |
### Word-Based Features

- Tokenization: Splits on non-alphanumeric boundaries
- Optimization: 33% faster indexing for structured text
- Use Case: Natural-language documents (e.g., legal contracts)

### N-gram Features

- Configuration: 3-character sequences with step=5
- Strength: Detects 28% more code similarities than word-based features
- Use Case: Source code analysis, multilingual content
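Under the configuration above (3-character sequences, step=5), feature extraction might look like the sketch below. `ngrams` is an illustrative name, and the real tokenizer may handle Unicode and boundaries differently:

```go
package main

import (
	"fmt"
	"strings"
)

// ngrams returns n-character sequences starting every `step` runes.
// With n=3 and step=5 the grams are strided rather than overlapping.
func ngrams(text string, n, step int) []string {
	runes := []rune(strings.ToLower(text))
	var out []string
	for i := 0; i+n <= len(runes); i += step {
		out = append(out, string(runes[i:i+n]))
	}
	return out
}

func main() {
	fmt.Println(ngrams("func main() { fmt.Println(42) }", 3, 5))
}
```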
## Installation

Prerequisites:

- Go 1.16+
- Git
- Poppler-utils (for PDF support)

```bash
# Ubuntu/Debian
sudo apt-get install -y poppler-utils

# macOS
brew install poppler

# Windows: download from the poppler-windows releases
```

Build from source:

```bash
git clone https://github.com/kh3rld/textblitz.git
cd textblitz
go build -o textindex
```

Or use the bundled build script:

```bash
chmod +x build.sh
./build.sh
```
## Usage

Index a file:

```bash
textindex -c index -i input.txt -s 4096 -o index.idx -w 8
```

| Parameter | Description | Default |
|---|---|---|
| `-s` | Chunk size (bytes) | 4096 |
| `-w` | Worker threads | 4 |
| `-o` | Output index file | auto |
Look up a hash, with fuzzy matching controlled by the `-t` threshold:

```bash
textindex -c lookup -i index.idx -h 3e4f1b2c98a6 -t 2
```

| Threshold | Use Case |
|---|---|
| 0 | Exact matches |
| 1-2 | Minor text variations |
| 3-5 | Substantial rewrites |
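Internally, a threshold lookup amounts to keeping every indexed fingerprint within the given Hamming distance of the query. The sketch below shows this over the illustrative `map[uint64][]int64` index from the Architecture section; a linear scan is used here for clarity, while the real index may prune candidates more cleverly:

```go
package main

import (
	"fmt"
	"math/bits"
)

// fuzzyLookup returns the offsets of every indexed chunk whose
// fingerprint is within `threshold` bits of the query hash.
// A threshold of 0 reduces to an exact match.
func fuzzyLookup(index map[uint64][]int64, query uint64, threshold int) []int64 {
	var hits []int64
	for h, offsets := range index {
		if bits.OnesCount64(h^query) <= threshold {
			hits = append(hits, offsets...)
		}
	}
	return hits
}

func main() {
	index := map[uint64][]int64{
		0x3e4f1b2c98a6: {0},
		0x3e4f1b2c98a7: {4096}, // differs from the query by one bit
	}
	fmt.Println(fuzzyLookup(index, 0x3e4f1b2c98a6, 2)) // [0 4096] (order may vary)
}
```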
textindex -c index -i "file with spaces.pdf" -o special.idx
## Error Handling

| Error Type | Solution |
|---|---|
| Missing dependencies | Verify the poppler-utils installation |
| Permission denied | Check file/directory permissions |
| Invalid SimHash format | Use 64-bit hexadecimal values |
| Memory exhaustion | Reduce the worker count (`-w`) |
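For the "Invalid SimHash format" row: a lookup hash must parse as a 64-bit hexadecimal value. A quick way to pre-validate one in Go (illustrative, not part of the `textindex` CLI):

```go
package main

import (
	"fmt"
	"strconv"
)

func main() {
	// ParseUint rejects non-hex characters and values wider than 64 bits.
	h, err := strconv.ParseUint("3e4f1b2c98a6", 16, 64)
	if err != nil {
		fmt.Println("invalid SimHash:", err)
		return
	}
	fmt.Printf("parsed hash: %016x\n", h)
}
```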
## Performance Benchmarks

100MB PDF:

| Workers | Index Time | Memory | Lookup Latency |
|---|---|---|---|
| 2 | 0.33s | 0.88MB | 6.98ms |
| 8 | 0.33s | 2.36MB | 6.55ms |

Plain-text file:

| Workers | Index Time | Memory | Lookup Latency |
|---|---|---|---|
| 2 | 0.08s | 0MB | 12.71ms |
| 12 | 0.03s | 0MB | 14.84ms |

Key insights:

- PDF processing benefits from 2-8 workers
- Text files scale up to 12 workers
- Memory usage remains under 3MB in all tests
## Conclusions and Recommendations

| Use Case | Workers | Feature Set |
|---|---|---|
| PDF documents | 2-4 | Word-based |
| Small text files | 12 | Word-based |
| Code analysis | 4-8 | N-gram-based |

- Start with 4 workers for general use
- Use 3-gram features for multilingual content
- Set threshold=2 for plagiarism detection
- Monitor memory usage with files larger than 1GB
## Contributing

- Fork the repository
- Create a feature branch
- Include test coverage
- Submit a pull request
For detailed technical analysis and benchmarks, see the Textblitz Research Paper.
## License

MIT License - see LICENSE for details.