A professional web scraping tool for extracting dealer and machine data from maschinensucher.de, built with Python and crawl4ai. This tool provides comprehensive data extraction capabilities with robust error handling, retry mechanisms, and output in multiple formats.
- Dealer Information: Company names, addresses, contact details, and geographic location
- Machine Inventory: Categories, subcategories, and inventory counts per dealer
- Contact Details: Phone numbers, fax, and contact persons (with phone number validation)
- Address Parsing: German address component extraction with structured fields
- Parallel Processing: Concurrent crawling for faster data extraction (see the sketch after this list)
- Smart Retry Logic: Exponential backoff with configurable retry parameters
- Rate Limiting: Configurable delays to respect website policies
- Error Handling: Comprehensive logging and screenshot capture on failures
- Memory Management: Adaptive concurrency control based on system resources
- CSV Export: Flattened data structure with category analytics
- JSON Export: Hierarchical data structure preserving original format
- Data Validation: Phone number validation and data cleaning
- Timestamped Files: Automatic file naming with category and timestamp
- Schema-Based Extraction: Dynamic CSS/JSON extraction schemas
- Category Filtering: Support for different machine categories
- Pagination Support: Multi-page crawling capabilities
- Debug Features: Skip lists, verbose logging, and error screenshots
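
To make the parallel-processing and rate-limiting features concrete, here is a minimal sketch of concurrency-limited crawling with asyncio; `scrape_dealer_page` is a hypothetical stand-in for the real crawl4ai-based coroutine, and the default parameter values simply echo the configuration section below:

```python
import asyncio
from typing import Dict, List

async def scrape_dealer_page(url: str) -> Dict:
    """Hypothetical placeholder for the real crawl4ai-based page extraction."""
    await asyncio.sleep(0.1)  # simulate network I/O
    return {"url": url}

async def crawl_all(urls: List[str], max_concurrent: int = 5,
                    delay_between_requests: float = 6.0) -> List[Dict]:
    """Crawl URLs concurrently, capping parallelism and pacing each worker."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> Dict:
        async with semaphore:
            result = await scrape_dealer_page(url)
            await asyncio.sleep(delay_between_requests)  # rate limiting
            return result

    return await asyncio.gather(*(bounded(u) for u in urls))

# Example: asyncio.run(crawl_all(["https://www.maschinensucher.de/..."]))
```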
- Python 3.8+
- Required dependencies listed in `requirements.txt`
```bash
# Clone the repository
git clone <repository-url>
cd maschinensucher

# Install dependencies
pip install -r requirements.txt

# Create necessary directories
mkdir -p output schemas browser_data
```
```bash
# Scrape dealers from the default category (Holzbearbeitungsmaschinen)
python maschinensucher.py

# Scrape a specific category across multiple pages
python maschinensucher.py --category 2 --pages 5

# Include contact information extraction
python maschinensucher.py --category 3 --pages 2 --contact
```
```
python maschinensucher.py [OPTIONS]

Options:
  --category TEXT   Category code (the tci-<category> value from the URL);
                    default: "3" (Holzbearbeitungsmaschinen)
  --pages INTEGER   Number of pages to scrape (default: 1)
  --contact         Extract detailed contact information
  --force-schema    Force regeneration of extraction schemas
  --help            Show the help message and exit
```

Common category codes for maschinensucher.de:
- 2: Metallbearbeitung (Metal Processing)
- 3: Holzbearbeitungsmaschinen (Woodworking Machines)
- 16: Druckmaschinen (Printing Machines)
- 19: Lebensmitteltechnik (Food Technology)
```
maschinensucher/
├── maschinensucher.py          # Main application
├── helpers.py                  # Utility functions
├── requirements.txt            # Python dependencies
├── README.md                   # This file
├── schemas/                    # Extraction schemas
│   ├── dealer_schema.json      # Dealer list extraction
│   ├── dealer_info.json        # Dealer contact details
│   └── machines_schema.json    # Machine inventory data
├── output/                     # Generated files
│   ├── *.csv                   # CSV exports
│   ├── *.json                  # JSON exports
│   └── error_screenshots/      # Debug screenshots
├── website_template/           # HTML templates for testing
└── __pycache__/                # Python cache files
```
Dealer listing extraction (`dealer_schema.json`) captures:
- Company names and addresses
- Distance from location
- Category information
- Dealer profile links
Contact detail extraction (`dealer_info.json`) provides:
- Phone number validation (see the sketch below)
- Contact person names
- Fax numbers
- Retry logic for failed extractions
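
A minimal sketch of what the phone validation might look like; the normalization rules and regex below are assumptions for illustration, not the project's actual logic:

```python
import re
from typing import Optional

def validate_phone(raw: str) -> Optional[str]:
    """Normalize a German phone number; return None for implausible values.

    A sketch under assumed rules: strip separators, fold 0049/0 prefixes
    into +49, then require 7-12 remaining digits.
    """
    digits = re.sub(r"[\s\-/().]", "", raw or "")
    if digits.startswith("0049"):
        digits = "+49" + digits[4:]
    elif digits.startswith("0") and not digits.startswith("00"):
        digits = "+49" + digits[1:]
    return digits if re.fullmatch(r"\+49\d{7,12}", digits) else None

assert validate_phone("0221 / 123 456-78") == "+4922112345678"
assert validate_phone("not a number") is None
```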
Machine inventory extraction (`machines_schema.json`) runs in parallel and collects:
- Main categories and counts
- Subcategories with inventory
- Dealer-specific machine listings
Data processing after extraction includes:
- Address component parsing (sketched after this list)
- Category analytics generation
- Data validation and cleaning
- Multi-format output generation
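
A minimal sketch of German address component parsing; the single-pattern approach below is an assumption, and real listings may need more cases:

```python
import re
from typing import Dict, Optional

def parse_german_address(raw: str) -> Optional[Dict[str, str]]:
    """Split 'Musterstraße 12, 12345 Berlin' into structured fields.

    A sketch under assumed formatting: street + house number,
    comma, five-digit postal code, city.
    """
    match = re.match(
        r"^(?P<street>.+?)\s+(?P<house_number>\d+[a-zA-Z]?)\s*,\s*"
        r"(?P<postal_code>\d{5})\s+(?P<city>.+)$",
        raw.strip(),
    )
    return match.groupdict() if match else None

print(parse_german_address("Musterstraße 12, 12345 Berlin"))
# {'street': 'Musterstraße', 'house_number': '12', 'postal_code': '12345', 'city': 'Berlin'}
```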
The application supports different crawling strategies with configurable parameters:
Conservative (7-8 requests/minute):

```python
delay_between_requests = 8.0
max_retries = 2
retry_base_delay = 10.0
retry_max_delay = 60.0
```

Moderate (10 requests/minute):

```python
delay_between_requests = 6.0
max_retries = 3
retry_base_delay = 8.0
retry_max_delay = 45.0
```

Fast (15 requests/minute):

```python
delay_between_requests = 4.0
max_retries = 2
retry_base_delay = 6.0
retry_max_delay = 30.0
```

Extraction schemas are defined in JSON files and can be generated automatically or configured manually:
- `dealer_schema.json`: Dealer listing extraction
- `dealer_info.json`: Contact detail extraction
- `machines_schema.json`: Machine inventory extraction
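
As an illustration, a schema of roughly this shape can be handed to crawl4ai's `JsonCssExtractionStrategy`; the selectors and field names below are placeholders, not the contents of the shipped schema files:

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Illustrative schema shape; selectors here are assumed, not the real ones.
dealer_schema = {
    "name": "dealers",
    "baseSelector": "div.dealer-card",  # one element per dealer listing (assumed)
    "fields": [
        {"name": "company_name", "selector": "h2", "type": "text"},
        {"name": "address", "selector": ".address", "type": "text"},
        {"name": "profile_link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

strategy = JsonCssExtractionStrategy(dealer_schema)
```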
The CSV export uses a flattened structure with columns:
- Dealer information (company_name, address components, contact details)
- Machine categories (main_category, main_category_count)
- Subcategories (sub_category_1, sub_category_1_count, etc.)
- Metadata (source_url, page_number, timestamp)
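
A minimal sketch of how such flattening can work; the record layout and field names are assumptions mirroring the column list above:

```python
import pandas as pd

# Hypothetical nested record as produced by the machine-inventory extraction.
dealer = {
    "company_name": "Example GmbH",
    "main_category": "Holzbearbeitungsmaschinen",
    "main_category_count": 42,
    "subcategories": [
        {"name": "Sägen", "count": 12},
        {"name": "Hobelmaschinen", "count": 7},
    ],
}

def flatten(record: dict) -> dict:
    """Spread nested subcategories into sub_category_N / sub_category_N_count columns."""
    row = {k: v for k, v in record.items() if k != "subcategories"}
    for i, sub in enumerate(record.get("subcategories", []), start=1):
        row[f"sub_category_{i}"] = sub["name"]
        row[f"sub_category_{i}_count"] = sub["count"]
    return row

pd.DataFrame([flatten(dealer)]).to_csv("dealers_flat.csv", index=False)
```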
The JSON export uses a hierarchical structure preserving:
- Original data relationships
- Nested subcategory arrays
- Complete metadata
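
A minimal sketch of a timestamped JSON export; the exact file-naming convention below is an assumption based on the category-and-timestamp pattern described earlier:

```python
import json
from datetime import datetime
from pathlib import Path

def write_json_export(data: list, category: str, out_dir: str = "output") -> Path:
    """Write hierarchical results to output/dealers_category<N>_<timestamp>.json (assumed naming)."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = Path(out_dir) / f"dealers_category{category}_{timestamp}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```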
- Configurable log levels (INFO, DEBUG, WARNING, ERROR)
- File logging (`app.log`) and console output
- Detailed extraction progress tracking
- Screenshot capture on extraction failures
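
A minimal sketch of the dual file/console logging setup; only the `app.log` name comes from this section, the format string is assumed:

```python
import logging

logging.basicConfig(
    level=logging.INFO,  # configurable: DEBUG, WARNING, ERROR
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    handlers=[
        logging.FileHandler("app.log", encoding="utf-8"),  # file logging
        logging.StreamHandler(),                           # console output
    ],
)
```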
- Exponential backoff retry mechanism (sketched after this list)
- Graceful handling of network timeouts
- Data consistency validation
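
A minimal sketch of capped exponential backoff using the retry parameters from the configuration section; `fetch` is a hypothetical stand-in for the actual crawl coroutine:

```python
import asyncio
import random
from typing import Awaitable, Callable

async def fetch_with_retry(fetch: Callable[[str], Awaitable[dict]], url: str,
                           max_retries: int = 3,
                           retry_base_delay: float = 8.0,
                           retry_max_delay: float = 45.0) -> dict:
    """Retry fetch(url) with capped exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return await fetch(url)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = min(retry_base_delay * (2 ** attempt), retry_max_delay)
            await asyncio.sleep(delay + random.uniform(0, 1))  # jitter spreads retries
```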
- Skip lists for problematic dealers
- Verbose extraction logging
- Error screenshot storage
- Performance timing metrics
Key dependencies include:
- `crawl4ai`: Advanced web crawling framework
- `pandas`: Data manipulation and analysis
- `asyncio`: Asynchronous programming support (Python standard library)
- `beautifulsoup4`: HTML parsing
- `aiohttp`: Async HTTP client

See `requirements.txt` for the complete dependency list.
- Minimum 4GB RAM recommended
- SSD storage for better I/O performance
- Stable internet connection
- Use parallel processing for large datasets
- Adjust retry parameters based on network conditions
- Monitor memory usage for large crawling operations
- Use appropriate delays to avoid rate limiting
- Respect website terms of service
- Implement appropriate rate limiting
- Avoid overloading target servers
- Comply with applicable data protection laws
- Use extracted data responsibly
- Fork the repository
- Create a feature branch
- Make your changes with appropriate tests
- Submit a pull request with detailed description
This project is provided as-is for educational and research purposes. Users are responsible for ensuring compliance with applicable laws and website terms of service.
For issues, questions, or contributions:
- Check the existing documentation
- Review log files for error details
- Examine schema files for extraction logic
- Test with single pages before bulk operations
Last updated: May 30, 2025