angelagonzalezp/scraping-examples-py

scraping-examples-py

This repo contains some web scraping examples.

Contents

Scrapers

IMDb

imdb_chart

imdb_chart.py scrapes IMDb's chart of most popular movies (Moviemeter) or TV shows.

Parameters

  • -o: output CSV file to store results.
  • -t: optional flag. If specified, the script scrapes the IMDb TV shows chart instead of the Moviemeter chart.
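
As an illustration of how the two options above could be parsed, here is a minimal sketch using argparse. The function names build_parser and chart_name are hypothetical and are not taken from imdb_chart.py; the actual script may handle its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented -o and -t options."""
    parser = argparse.ArgumentParser(description="Scrape an IMDb popularity chart.")
    # -o is documented as the output CSV file.
    parser.add_argument("-o", dest="output", required=True, help="output CSV file")
    # -t is a boolean switch: present means TV shows, absent means Moviemeter.
    parser.add_argument("-t", dest="tv", action="store_true",
                        help="scrape the TV shows chart instead of Moviemeter")
    return parser


def chart_name(args: argparse.Namespace) -> str:
    """Return which chart the parsed arguments select."""
    return "tvmeter" if args.tv else "moviemeter"


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Would scrape {chart_name(args)} into {args.output}")
```

Using action="store_true" keeps -t a pure switch, which matches the way it is passed without a value in the commands below.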

How to run

  • Moviemeter: cd .\IMDb\;python imdb_chart.py -o "./output/moviemeter-example.csv"
  • TVmeter: cd .\IMDb\;python imdb_chart.py -o "./output/tvmeter-example.csv" -t

NOTE: The commands in this README use PowerShell syntax.

moviemeter_cast

moviemeter_cast.py scrapes the cast of the films in the Moviemeter chart.

Parameters

  • -o: output JSON file to store results.
  • -l: optional. Limits the number of actors and actresses retrieved for each film (for example, -l 3).
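
The effect of -l can be pictured as truncating each film's cast list before the JSON file is written. The sketch below assumes the scraped data is a mapping from film title to a list of names; that shape, the function name limit_cast, and the placeholder titles are assumptions for illustration, not the script's actual data model.

```python
import json
from typing import Dict, List, Optional


def limit_cast(cast_by_film: Dict[str, List[str]],
               limit: Optional[int] = None) -> Dict[str, List[str]]:
    """Keep at most `limit` names per film; keep everything when limit is None."""
    if limit is None:
        return {film: list(cast) for film, cast in cast_by_film.items()}
    return {film: cast[:limit] for film, cast in cast_by_film.items()}


if __name__ == "__main__":
    # Placeholder data, not real scraped results.
    scraped = {"Film A": ["Actor 1", "Actor 2", "Actor 3", "Actor 4"]}
    print(json.dumps(limit_cast(scraped, limit=3), indent=2))
```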

How to run

  • cd .\IMDb\;python .\moviemeter_cast.py -l 3 -o .\output\moviemeter_cast.json

Billboard

billboard_hot100

billboard_hot100.py scrapes the Billboard Hot 100 chart. So far, the script fetches only the title and artist of each song.

Parameters

  • -o: output CSV file to store results.
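
For readers unfamiliar with the output format, a two-column CSV like the one described above could be produced with the standard csv module. This is a sketch only: the column names song and artist and the function write_songs are assumptions, not the headers or code used by billboard_hot100.py.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple


def write_songs(rows: Iterable[Tuple[str, str]], handle: TextIO) -> None:
    """Write (song, artist) pairs as CSV with a header row."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["song", "artist"])  # hypothetical column names
    writer.writerows(rows)


if __name__ == "__main__":
    # Placeholder rows, not real chart data.
    buffer = io.StringIO()
    write_songs([("Song A", "Artist A"), ("Song, B", "Artist B")], buffer)
    print(buffer.getvalue(), end="")
```

When writing to a real file, opening it with newline="" and an explicit encoding such as utf-8 avoids blank lines on Windows and keeps non-ASCII titles intact.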

How to run

  • cd .\Billboard\;python .\billboard_hot100.py -o .\output\billboard_top100.csv

Goodreads

goodreads_top100

goodreads_top100.py scrapes the Goodreads list "Top 100 - Highest Rated Books on Goodreads with at least 10,000 Ratings". It can also scrape any other list on the website if the list's URL is passed as an argument.

Parameters

  • -o: output CSV file to store results.
  • -m: optional flag. If specified, the script writes the scraped data to a MongoDB collection configured through environment variables.
  • -u: optional. URL of any list on the Goodreads website. If omitted, the Goodreads Top 100 list is scraped.
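
This README does not name the environment variables used by -m. The sketch below shows one way such settings could be read and validated before connecting; the variable names MONGO_URI, MONGO_DB and MONGO_COLLECTION and the function mongo_settings are hypothetical, so check the script or the Docker configuration for the real names.

```python
import os
from typing import Dict, Mapping

# Hypothetical variable names; the repository may use different ones.
REQUIRED = ("MONGO_URI", "MONGO_DB", "MONGO_COLLECTION")


def mongo_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the MongoDB settings from the environment, failing early if any is unset."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}


if __name__ == "__main__":
    try:
        print(mongo_settings())
    except RuntimeError as error:
        print(error)
```

Failing early with a list of the missing names tends to be easier to debug than a connection error raised later by the database driver.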

How to run

  • Default url value (Goodreads Top 100): cd .\goodreads\;python .\goodreads_top100.py -o .\output\goodreads_top100.csv [-u][-m]

For any other list:

  • Best Books of the Decade 2020's: cd .\goodreads\;python .\goodreads_top100.py -o .\output\top_decade_2020.csv -u https://www.goodreads.com/list/show/143500.Best_Books_of_the_Decade_2020_s?ref=ls_fl_0_seeall
  • Best Books of 20th century: cd .\goodreads\;python .\goodreads_top100.py -o .\output\top_20th_century.csv -u https://www.goodreads.com/list/show/6

BBC

bbc_news

bbc_news.py scrapes BBC News by iterating through its articles.

Parameters

  • -o: output JSON file to store results.

How to run

  • cd .\BBC\;python .\bbc_news.py -o .\output\bbc-news-example.json

CoinMarketCap

trending-cryptocurrencies

trending-cryptocurrencies.py scrapes the hottest trending cryptocurrencies on CoinMarketCap.

Parameters

  • -o: output CSV file to store results.

How to run

  • cd .\coinmarketcap\;python .\trending-cryptocurrencies.py -o .\output\trending-cryptocurrencies.csv

Docker setup

  • docker build -t scraping-examples-py .
  • docker compose up -d
  • docker exec -it scraping-examples-py bash
  • Run a scraper: cd BBC/ && python bbc_news.py -o ./output/docker-example-bbc.json

About

Python examples on how to scrape popular websites
