As a Software Analyst, I want to collect web links (URLs) from a given initial web link (URL)
This version (folder scrapy-single) is a standalone version that runs on your local machine (intended for learning)
To run the unit test for URL and database manipulation:
python3 -m unittest -v crawlerAppUnitTest.py
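The test file itself is not shown here, but a minimal sketch of the kind of check it could contain is below; the table schema, helper logic, and test names are illustrative assumptions, not the repository's actual code:

```python
import hashlib
import sqlite3
import unittest


class UrlStoreTest(unittest.TestCase):
    """Sketch of a URL/database test; schema and names are illustrative."""

    def setUp(self):
        # An in-memory database keeps the test self-contained.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE urls (hash TEXT PRIMARY KEY, url TEXT)")

    def test_duplicate_url_is_rejected(self):
        url = "https://www.ibm.com/"
        h = hashlib.md5(url.encode("utf-8")).hexdigest()
        self.conn.execute("INSERT INTO urls VALUES (?, ?)", (h, url))
        # A second insert with the same MD5 hash must hit the unique key.
        with self.assertRaises(sqlite3.IntegrityError):
            self.conn.execute("INSERT INTO urls VALUES (?, ?)", (h, url))


if __name__ == "__main__":
    unittest.main()
```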
To execute the crawler, using the IBM website as the starting point:
scrapy crawl mainSpider -a starting_url=https://www.ibm.com/
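For context, here is a minimal sketch of how a Scrapy spider named mainSpider can accept the starting_url argument passed with -a (the real spider's internals may differ):

```python
import scrapy


class MainSpider(scrapy.Spider):
    name = "mainSpider"

    def __init__(self, starting_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # -a starting_url=... arrives here as a keyword argument.
        self.start_urls = [starting_url] if starting_url else []

    def parse(self, response):
        # Collect every link on the page; DEPTH_LIMIT in settings.py
        # bounds how deep the follow-up requests recurse.
        for href in response.css("a::attr(href)").getall():
            yield {"url": response.urljoin(href)}
            yield response.follow(href, callback=self.parse)
```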
Notes:
- Currently uses DEPTH_LIMIT = 1, which can be changed in the settings.py file
- Stores URLs in SQLite (sqlite3), using an indexed column of MD5 hashes of the URLs to detect duplicates (a sketch follows the table below)
- Required libs (scrapy is installed with pip; sqlite3 ships with the Python standard library):
| lib | version |
|---|---|
| scrapy | 2.3.0 |
| sqlite3 | 2.6.0 |
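As a sketch of the dedup approach above (table and column names are assumptions): with a unique index on the MD5 hash column, a single INSERT OR IGNORE both stores a new URL and silently rejects a duplicate:

```python
import hashlib
import sqlite3

conn = sqlite3.connect("crawler.db")
conn.execute("CREATE TABLE IF NOT EXISTS urls (hash TEXT, url TEXT)")
conn.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_url_hash ON urls (hash)")

def store_if_new(url: str) -> bool:
    """Insert the URL; return False when its MD5 hash is already indexed."""
    h = hashlib.md5(url.encode("utf-8")).hexdigest()
    cur = conn.execute(
        "INSERT OR IGNORE INTO urls (hash, url) VALUES (?, ?)", (h, url)
    )
    conn.commit()
    return cur.rowcount == 1  # 1 = newly stored, 0 = duplicate ignored

print(store_if_new("https://www.ibm.com/"))  # True on first call
print(store_if_new("https://www.ibm.com/"))  # False: duplicate detected
```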
The enhanced version (folder scrapy-docker) runs with containers (Docker) and is based on the following component:
https://pypi.org/project/scrapy-redis/
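For reference, scrapy-redis documents a few settings.py entries that move Scrapy's scheduler and duplicate filter into Redis; the REDIS_URL value below is an assumption matching a typical docker-compose service name:

```python
# settings.py additions documented by scrapy-redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # queue requests in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # shared request dedup
SCHEDULER_PERSIST = True          # keep the queue across crawler restarts
REDIS_URL = "redis://redis:6379"  # "redis" = assumed docker-compose service name
```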
- Build your Docker containers (this also runs the unit tests):
bash buildContainers.sh
- Feed or list URLs into Redis (use -h for tips; a sketch of the underlying Redis operation follows this list):
python3 manageCrawler.py
- Follow the crawler logs with:
docker logs scrapydocker_crawler_1 --follow
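Under the hood, feeding a start URL with scrapy-redis amounts to pushing it onto a Redis list; this sketch assumes the default <spider>:start_urls key and a locally exposed Redis port (manageCrawler.py's actual options are whatever -h reports):

```python
import redis

r = redis.Redis(host="localhost", port=6379)
# scrapy-redis pops entries from this list and turns them into requests.
r.lpush("mainSpider:start_urls", "https://www.ibm.com/")
# Listing whatever is still queued:
print(r.lrange("mainSpider:start_urls", 0, -1))
```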