This repo contains some web scraping examples.
imdb_chart.py scrapes IMDb's most popular movies or TV shows.
Parameters
- -o: output CSV file to store results.
- -t: optional flag. If specified, the script scrapes the IMDb TV shows chart (TVmeter) instead of Moviemeter.
How to run
- Moviemeter:
cd .\IMDb\;python imdb_chart.py -o "./output/moviemeter-example.csv"
- TVmeter:
cd .\IMDb\;python imdb_chart.py -o "./output/tvmeter-example.csv" -t
NOTE: These commands use PowerShell syntax.
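The two options above could be parsed along the following lines. This is a minimal sketch, not the script's actual code: the helper names `parse_args` and `chart_name` are hypothetical, and treating -o as required is an assumption.

```python
import argparse


def parse_args(argv=None):
    """Parse -o and -t as described above (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Scrape an IMDb popularity chart.")
    # Assumption: the output file is mandatory.
    parser.add_argument("-o", dest="output", required=True, help="output CSV file")
    # -t takes no value; its presence switches to the TV shows chart.
    parser.add_argument("-t", dest="tv", action="store_true",
                        help="scrape the TV shows chart instead of Moviemeter")
    return parser.parse_args(argv)


def chart_name(args):
    """Return which chart the parsed options select."""
    return "tvmeter" if args.tv else "moviemeter"
```

Passing a list to `parse_args` (instead of reading `sys.argv`) keeps the parsing easy to try out in isolation.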
moviemeter_cast.py scrapes the cast of the films in the Moviemeter chart.
Parameters
- -o: output JSON file to store results.
- -l: optional. Maximum number of actors and actresses retrieved for each film.
How to run
cd .\IMDb\;python .\moviemeter_cast.py -l 3 -o .\output\moviemeter_cast.json
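To illustrate what the -l limit does, here is a small sketch that truncates each film's cast before the data is written as JSON. The record layout (`title` and `cast` keys) and the helper name `limit_cast` are assumptions made for the example, and the film data is placeholder text.

```python
import json


def limit_cast(films, limit=None):
    """Return copies of the film records with at most `limit` cast members.

    A limit of None keeps the full cast, since slicing with [:None] is a no-op.
    """
    return [{**film, "cast": film["cast"][:limit]} for film in films]


# Placeholder data, not real scraping output.
films = [{"title": "Film A", "cast": ["Actor 1", "Actor 2", "Actor 3", "Actor 4"]}]
as_json = json.dumps(limit_cast(films, limit=3), indent=2)
```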
billboard_hot100.py scrapes the Billboard Hot 100 chart. So far, the script fetches only the song title and the artist.
Parameters
- -o: output CSV file to store results.
How to run
cd .\Billboard\;python .\billboard_hot100.py -o .\output\billboard_top100.csv
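A CSV with one row per song could be produced roughly as follows. This is a sketch under assumptions: the column names `song` and `artist` and the helper `write_songs` are hypothetical, and the row is placeholder data.

```python
import csv
import io


def write_songs(rows, handle):
    """Write song/artist dictionaries to an open text handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=["song", "artist"])
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer stands in for the output file here.
buffer = io.StringIO()
write_songs([{"song": "Example Song", "artist": "Example Artist"}], buffer)
```

When writing to a real file, open it with `newline=""` so the `csv` module controls the line endings.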
goodreads_top100.py scrapes the Goodreads list "Top 100 - Highest Rated Books on Goodreads with at least 10,000 Ratings". It can also scrape any other list on the website if its URL is passed as an argument.
Parameters
- -o: output CSV file to store results.
- -m: optional flag. If specified, the script also writes the scraped data to the MongoDB collection configured through environment variables.
- -u: optional. URL of any list on the Goodreads website. Defaults to the Goodreads Top 100 list.
How to run
- Default URL (Goodreads Top 100):
cd .\goodreads\;python .\goodreads_top100.py -o .\output\goodreads_top100.csv [-u][-m]
For any other list:
- Best Books of the Decade 2020's:
cd .\goodreads\;python .\goodreads_top100.py -o .\output\top_decade_2020.csv -u https://www.goodreads.com/list/show/143500.Best_Books_of_the_Decade_2020_s?ref=ls_fl_0_seeall
- Best Books of the 20th Century:
cd .\goodreads\;python .\goodreads_top100.py -o .\output\top_20th_century.csv -u https://www.goodreads.com/list/show/6
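Since -m relies on environment variables, it can help to validate them before scraping starts. The sketch below shows one way to do that; the variable names `MONGO_URI`, `MONGO_DB`, and `MONGO_COLLECTION` are hypothetical, so check the script for the names it actually reads.

```python
import os

# Hypothetical variable names, used only for illustration.
REQUIRED_VARS = ("MONGO_URI", "MONGO_DB", "MONGO_COLLECTION")


def mongo_settings(env=os.environ):
    """Collect the MongoDB settings, failing early if any are missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing before the first request avoids scraping a whole list only to discover that the results cannot be stored.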
bbc_news.py scrapes BBC News by iterating through its articles.
Parameters
- -o: output JSON file to store results.
How to run
cd .\BBC\;python .\bbc_news.py -o .\output\bbc-news-example.json
trending-cryptocurrencies.py scrapes the hottest trending cryptocurrencies on CoinMarketCap.
Parameters
- -o: output CSV file to store results.
How to run
cd .\coinmarketcap\;python .\trending-cryptocurrencies.py -o .\output\trending-cryptocurrencies.csv
docker build -t scraping-examples-py .
docker compose up -d
docker exec -it scraping-examples-py bash
- Run a scraper:
cd BBC/ && python bbc_news.py -o ./output/docker-example-bbc.json