Information Retrieval Evaluation System

A system for evaluating and comparing information retrieval methods on product search data, built for the Amazon ESCI (Shopping Queries Dataset).

🎯 Overview

This package compares three search approaches:

  • BM25: Traditional keyword-based search
  • SBERT: Dense vector search using sentence transformers
  • Hybrid: BM25 and SBERT rankings combined with Reciprocal Rank Fusion (RRF); see the sketch below
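
Reciprocal Rank Fusion scores each document by summing 1/(k + rank) over the ranked lists it appears in, so items ranked highly by either method rise to the top. A minimal sketch, assuming plain lists of document IDs; k=60 is the constant from the original RRF paper, and the actual fusion in search_engines.py may differ in details:

def rrf_fuse(bm25_docs, sbert_docs, k=60):
    """Fuse two ranked lists of doc IDs into a single RRF ranking."""
    scores = {}
    for ranking in (bm25_docs, sbert_docs):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1/(k + rank) for every doc it ranks
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)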

🚀 Quick Start

1. Prerequisites

  • Python 3.8+
  • Elasticsearch 8.0+ running on localhost:9200
  • Your data files:
    • Product data in JSONL format
    • Queries in ESCI JSON format

2. Installation

# Clone or download this package
cd ir_evaluation_package

# Install dependencies
pip install -r requirements.txt

# Download NLTK data
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"

# Download spaCy model (optional)
python -m spacy download en_core_web_sm

3. Setup Elasticsearch

# Start Elasticsearch (adjust for your installation)
# Default: localhost:9200

# Verify it's running
curl -X GET "localhost:9200/"
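
The same check from Python, using the elasticsearch client the package depends on (a quick sanity check, separate from the pipeline itself):

from elasticsearch import Elasticsearch

# Connect to the local node; info() raises if it is unreachable
es = Elasticsearch("http://localhost:9200")
print(es.info())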

4. Prepare Your Data

Update file paths in main_pipeline.py:

# Update these paths
jsonl_file = "data/your_products.jsonl"     # Product data
queries_file = "data/your_queries.json"     # ESCI queries

5. Run Evaluation

python main_pipeline.py

📁 Package Structure

ir_evaluation_package/
├── main_pipeline.py        # Main execution script
├── ir_setup.py             # Data preprocessing & Elasticsearch setup
├── search_engines.py       # Search method implementations
├── evaluation.py           # Run generation & evaluation
├── advanced_analysis.py    # Advanced analysis & visualizations
├── requirements.txt        # Python dependencies
├── README.md               # This file
├── example_config.py       # Configuration examples
├── data/                   # Your data files
│   ├── products.jsonl
│   └── queries.json
└── results/                # Generated results
    ├── summary_report.txt
    ├── comparison_table.csv
    └── *.png

📊 Expected Input Format

Product Data (JSONL)

{"id": "B0013REMTE", "title": "Product Title", "description": "Product description...", "full_text": "Complete product text..."}
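
Each line is a self-contained JSON object, so the file can be streamed one record at a time. A generic reading sketch, independent of the package's own loader:

import json

# Stream products without loading the whole file into memory
with open("data/your_products.jsonl") as f:
    for line in f:
        product = json.loads(line)
        print(product["id"], product["title"])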

Queries (ESCI JSON)

[
  {
    "query": "search query text",
    "query_id": 1,
    "product_asins": ["B0013REMTE"],
    "esci_labels": ["E"],
    "product_locales": ["us"]
  }
]

ESCI Label Mapping

  • E (Exact): 3 relevance points
  • S (Substitute): 2 relevance points
  • C (Complement): 1 relevance point
  • I (Irrelevant): 0 relevance points
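
Relevance judgments (qrels) come from pairing each query's product_asins with its esci_labels under this mapping. A minimal sketch of that conversion, using the query format shown above (the helper name is illustrative):

import json

ESCI_TO_GRADE = {"E": 3, "S": 2, "C": 1, "I": 0}

def build_qrels(queries_file):
    """Build {query_id: {asin: grade}} from the ESCI query JSON."""
    with open(queries_file) as f:
        queries = json.load(f)
    qrels = {}
    for q in queries:
        # product_asins and esci_labels are parallel arrays
        qrels[str(q["query_id"])] = {
            asin: ESCI_TO_GRADE[label]
            for asin, label in zip(q["product_asins"], q["esci_labels"])
        }
    return qrels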

📈 Output Files

  • summary_report.txt: Executive summary with key findings
  • comparison_table.csv: Side-by-side performance metrics
  • evaluation_results.csv: Detailed metrics for all methods
  • per_query_results.csv: Per-query performance breakdown
  • performance_comparison.png: Visualization charts
  • run_*.txt: TREC-format runs for each method (sample line below)
  • significance_tests.json: Statistical significance results
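
Each line in a run file follows the standard six-column TREC format: query ID, the literal Q0, document ID, rank, score, and run tag. For example (values illustrative):

1 Q0 B0013REMTE 1 12.7310 bm25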

🔧 Configuration

Basic Configuration

Edit main_pipeline.py to adjust:

  • File paths
  • Elasticsearch connection
  • Processing parameters
  • Search parameters

Advanced Configuration

Create config.py:

config = {
    'max_products': 10000,        # Limit for testing
    'sentence_model': 'all-MiniLM-L6-v2',
    'use_stemming': True,
    'remove_stopwords': True,
    'top_k': 1000,
    'embedding_batch_size': 32
}

📊 Evaluation Metrics

The system evaluates each run with standard IR metrics:

  • MAP: Mean Average Precision
  • nDCG@k: Normalized Discounted Cumulative Gain
  • P@k: Precision at k
  • R@k: Recall at k
  • MRR: Mean Reciprocal Rank
  • Success@k: Success rate at k
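
All of these are available through the ir-measures library in requirements.txt. A sketch of computing aggregates plus a paired significance test, assuming qrels and run dictionaries keyed by query ID (e.g., as built from the ESCI labels above):

import ir_measures
from ir_measures import AP, nDCG, P, R, RR, Success
from scipy import stats

# qrels: {query_id: {doc_id: grade}}, run: {query_id: {doc_id: score}}
measures = [AP, nDCG@10, P@10, R@100, RR, Success@10]
print(ir_measures.calc_aggregate(measures, qrels, run))

# Per-query values support paired significance testing between two runs
ap_a = {m.query_id: m.value for m in ir_measures.iter_calc([AP], qrels, run_a)}
ap_b = {m.query_id: m.value for m in ir_measures.iter_calc([AP], qrels, run_b)}
shared = sorted(set(ap_a) & set(ap_b))
t, p = stats.ttest_rel([ap_a[q] for q in shared], [ap_b[q] for q in shared])
print(f"paired t-test on AP: t={t:.3f}, p={p:.4f}")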

🎨 Visualizations

Automatically generated visualizations include:

  • Performance comparison charts
  • Query difficulty analysis
  • System agreement heatmaps
  • Per-query performance distributions
  • ESCI label breakdown analysis

🛠 Individual Components

Run Components Separately

# Setup only
python ir_setup.py

# Test search engines
python search_engines.py

# Generate runs and evaluate
python evaluation.py

# Advanced analysis
python advanced_analysis.py

Custom Analysis

from ir_setup import setup_ir_system
from search_engines import SearchEngineManager
from evaluation import QrelsManager, EvaluationManager

# Setup your custom analysis
es_manager, preprocessor, processor = setup_ir_system("data/products.jsonl")
search_manager = SearchEngineManager(es_manager.es)

# Run custom searches
results = search_manager.search_all("your query", top_k=100)

🔍 Troubleshooting

Common Issues

  1. Elasticsearch Connection Error

    # Check if ES is running
    curl -X GET "localhost:9200/"
    
    # Check logs
    tail -f /path/to/elasticsearch/logs/elasticsearch.log
  2. Memory Issues

    # Reduce batch sizes in config
    config['max_products'] = 1000
    config['embedding_batch_size'] = 16
  3. NLTK Download Issues

    import nltk
    nltk.download('punkt')
    nltk.download('stopwords')

Performance Tips

  • Start with smaller datasets (max_products=1000)
  • Use faster embedding models for testing
  • Increase batch sizes if you have more RAM
  • Use SSD storage for Elasticsearch

📚 Dependencies

Core libraries:

  • elasticsearch: Search engine interface
  • sentence-transformers: Dense vector embeddings
  • ir-measures: Standard IR evaluation metrics
  • pandas/numpy: Data processing
  • matplotlib/seaborn: Visualizations
  • scipy: Statistical testing
  • nltk: Text preprocessing

🤝 Contributing

To extend the system:

  1. Add new search methods in search_engines.py (illustrative shape below)
  2. Add new metrics in evaluation.py
  3. Add new visualizations in advanced_analysis.py
  4. Update the main pipeline accordingly
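
As a rough illustration of step 1 (the class and method names here are assumptions, not the package's actual API; check SearchEngineManager in search_engines.py for the real interface):

from typing import List, Tuple

# Hypothetical shape of a new search method -- verify against the actual
# interface in search_engines.py before implementing.
class MyNewSearch:
    def search(self, query: str, top_k: int = 100) -> List[Tuple[str, float]]:
        """Return up to top_k (doc_id, score) pairs, best first."""
        raise NotImplementedError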

📄 License

This package is provided for research and educational purposes.

📞 Support

For issues:

  1. Check the logs in ir_evaluation.log
  2. Verify your data format matches the expected input
  3. Ensure all dependencies are installed correctly
  4. Check Elasticsearch is running and accessible

Happy Information Retrieval! 🔍📊
