This library provides Scrapy+JavaScript integration using Splash. The license is BSD 3-clause.
Install ScrapyJS using pip:
$ pip install scrapyjs
ScrapyJS uses the Splash HTTP API, so you also need a Splash instance. Usually, to install and run Splash, something like this is enough:
$ docker run -p 8050:8050 scrapinghub/splash
Check Splash install docs for more info.
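To quickly verify that the Splash instance is reachable, you can hit its render.html endpoint directly. The sketch below assumes the docker command above (Splash listening on localhost:8050) and uses the requests library; any HTTP client works, and the target URL is just an example:

import requests

resp = requests.get("http://localhost:8050/render.html",
                    params={"url": "http://example.com", "wait": 0.5})
print(resp.status_code)   # 200 means Splash rendered the page
print(resp.text[:200])    # beginning of the browser-rendered HTML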
To process JavaScript from Scrapy spiders one can use the Splash HTTP API directly, without ScrapyJS. For example, let's fetch the HTML of a webpage, as returned by a browser:
import json

import scrapy
from scrapy.http.headers import Headers

RENDER_HTML_URL = "http://127.0.0.1:8050/render.html"

class MySpider(scrapy.Spider):
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            body = json.dumps({"url": url, "wait": 0.5})
            headers = Headers({'Content-Type': 'application/json'})
            yield scrapy.Request(RENDER_HTML_URL, self.parse, method="POST",
                                 body=body, headers=headers)

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        # ...
It was easy enough, but the code has some problems:
- There is a bit of boilerplate.
- As seen by Scrapy, we're sending requests to RENDER_HTML_URL instead of the target URLs. This affects concurrency and politeness settings: CONCURRENT_REQUESTS_PER_DOMAIN, DOWNLOAD_DELAY, etc. could work in unexpected ways because delays and concurrency settings are no longer per-domain.
- Some options depend on each other. For example, if you use the timeout Splash option then you may want to set the download_timeout scrapy.Request meta key as well, as shown in the sketch after this list.
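A minimal sketch of that last point, reusing the raw-API spider above: the numbers are illustrative only, and Splash's own maximum timeout setting may cap the 'timeout' value you can request.

# inside start_requests() of the raw-API spider above (illustrative values)
body = json.dumps({"url": url, "wait": 0.5, "timeout": 90})
headers = Headers({'Content-Type': 'application/json'})
yield scrapy.Request(RENDER_HTML_URL, self.parse, method="POST",
                     body=body, headers=headers,
                     meta={'download_timeout': 100})  # give Scrapy a bit more time than Splash's own timeout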
ScrapyJS utilities allow you to handle such edge cases and reduce the boilerplate.
Put the Splash server address in settings.py of your Scrapy project like this:
SPLASH_URL = 'http://192.168.59.103:8050'
Enable the middleware by adding it to DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

Order 725 is just before HttpProxyMiddleware (750) in default Scrapy settings.
You also have to set a custom DUPEFILTER_CLASS:
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
If you use Scrapy HTTP cache then a custom cache storage backend is required. ScrapyJS provides a subclass of scrapy.contrib.httpcache.FilesystemCacheStorage:

HTTPCACHE_STORAGE = 'scrapyjs.SplashAwareFSCacheStorage'
If you use another cache storage backend then it is necessary to subclass it and replace all scrapy.utils.request.request_fingerprint calls with scrapyjs.splash_request_fingerprint.
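For example, a Splash-aware version of Scrapy's DBM cache storage could look roughly like the sketch below. This is only an illustration: the _request_key method is an internal detail of DbmCacheStorage and may differ between Scrapy versions, so check your Scrapy source for where request_fingerprint is actually called.

from scrapy.contrib.httpcache import DbmCacheStorage  # scrapy.extensions.httpcache in newer Scrapy
from scrapyjs import splash_request_fingerprint

class SplashAwareDbmCacheStorage(DbmCacheStorage):
    # DbmCacheStorage derives its storage key from the request fingerprint;
    # overriding that single spot makes the cached entries Splash-aware.
    def _request_key(self, request):
        return splash_request_fingerprint(request)

You would then point HTTPCACHE_STORAGE at the dotted path of this class.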
Note
The custom DUPEFILTER_CLASS and HTTPCACHE_STORAGE are necessary because Scrapy doesn't provide a way to override the request fingerprint calculation algorithm globally; this could change in the future.
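Taken together, the Splash-related part of settings.py ends up looking like this (the Splash URL is whatever your instance uses; the cache line is only needed if HTTP caching is enabled):

SPLASH_URL = 'http://192.168.59.103:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'

# only if you use Scrapy HTTP cache:
HTTPCACHE_STORAGE = 'scrapyjs.SplashAwareFSCacheStorage'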
To render requests with Splash, use the 'splash' Request meta key:
yield Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,
            # 'url' is prefilled from request url
        },
        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # optional; overrides SPLASH_URL
        'slot_policy': scrapyjs.SlotPolicy.PER_DOMAIN,
    }
})
meta['splash']['args'] contains arguments sent to Splash. ScrapyJS adds request.url to these arguments automatically.

meta['splash']['endpoint'] is the Splash endpoint to use. By default render.json is used. See the Splash HTTP API docs for a full list of available endpoints and parameters.

meta['splash']['splash_url'] allows you to override the Splash URL set in settings.py.

meta['splash']['slot_policy'] allows you to customize how concurrency & politeness are maintained for Splash requests. Currently there are 3 policies available:

- scrapyjs.SlotPolicy.PER_DOMAIN (default) - send Splash requests to downloader slots based on the URL being rendered. It is useful if you want to maintain per-domain politeness & concurrency settings.
- scrapyjs.SlotPolicy.SINGLE_SLOT - send all Splash requests to a single downloader slot. It is useful if you want to throttle requests to Splash (see the example after this list).
- scrapyjs.SlotPolicy.SCRAPY_DEFAULT - don't do anything with slots. It is similar to the SINGLE_SLOT policy, but can be different if you access other services on the same address as Splash.
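For example, to throttle all Splash traffic by sending it to a single downloader slot, a request could set the policy like this. This is only a sketch: url and parse_result are placeholders, and everything else mirrors the meta example above.

import scrapy
import scrapyjs

# inside a spider method
yield scrapy.Request(url, self.parse_result, meta={
    'splash': {
        'args': {'html': 1},
        'slot_policy': scrapyjs.SlotPolicy.SINGLE_SLOT,
    }
})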
Get HTML contents:
import scrapy

class MySpider(scrapy.Spider):
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5},
                }
            })

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        # ...
Get HTML contents and a screenshot:
import json
import base64

import scrapy

class MySpider(scrapy.Spider):

    # ...
        yield scrapy.Request(url, self.parse_result, meta={
            'splash': {
                'args': {
                    'html': 1,
                    'png': 1,
                    'width': 600,
                    'render_all': 1,
                }
            }
        })

    # ...
    def parse_result(self, response):
        data = json.loads(response.body_as_unicode())
        body = data['html']
        png_bytes = base64.b64decode(data['png'])
        # ...
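If the screenshot should be kept, the decoded bytes can be written straight to disk from parse_result; the filename here is arbitrary:

        # continuing parse_result() above
        with open('screenshot.png', 'wb') as f:
            f.write(png_bytes)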
Run a simple Splash Lua Script:
import scrapy

class MySpider(scrapy.Spider):

    # ...
        script = """
        function main(splash)
            assert(splash:go(splash.args.url))
            return splash:evaljs("document.title")
        end
        """
        yield scrapy.Request(url, self.parse_result, meta={
            'splash': {
                'args': {'lua_source': script},
                'endpoint': 'execute',
            }
        })

    # ...
    def parse_result(self, response):
        doc_title = response.body_as_unicode()
        # ...
Source code and bug tracker are on GitHub: https://github.com/scrapinghub/scrapyjs
To run tests, install the "tox" Python package and then run the tox command from the source checkout.