Tags: commoncrawl/nutch
Add tool UrlSamplerHost:
- sample URLs based on limits per host (without leading www.)
- variant of UrlSampler (per-domain limits)
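The per-host limit described in this commit can be illustrated with a minimal sketch. This is not the UrlSamplerHost code: the class `HostLimitSampler`, its method names, and the "keep the first N URLs per host" rule are assumptions made for illustration, and a deterministic cap stands in for whatever sampling strategy the tool actually applies.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of per-host URL limits; not the actual UrlSamplerHost. */
class HostLimitSampler {

  /** Lower-cased host name without a leading "www."; null if no host can be parsed. */
  static String hostKey(String url) {
    String host;
    try {
      host = URI.create(url).getHost();
    } catch (IllegalArgumentException e) {
      return null; // not a syntactically valid URI
    }
    if (host == null) {
      return null;
    }
    host = host.toLowerCase(Locale.ROOT);
    return host.startsWith("www.") ? host.substring(4) : host;
  }

  /** Keeps at most maxPerHost URLs per host key, preserving input order (assumed rule). */
  static List<String> sample(List<String> urls, int maxPerHost) {
    Map<String, Integer> counts = new HashMap<>();
    List<String> kept = new ArrayList<>();
    for (String url : urls) {
      String key = hostKey(url);
      if (key == null) {
        continue; // skip URLs without a usable host
      }
      int seen = counts.merge(key, 1, Integer::sum);
      if (seen <= maxPerHost) {
        kept.add(url);
      }
    }
    return kept;
  }
}
```

With a limit of 2, `www.example.com` and `example.com` share one budget, while `blog.example.com` is counted separately. A per-domain variant would group both under `example.com`.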
WarcCdxWriter: normalize URL of redirect target location
Convert the redirect target location into an absolute URL and normalize the URL using the URL normalizer configured for scope "fetcher" before storing it as field "redirect" in the CDX file.
Create all instances of SimpleDateFormat using the ROOT locale, use timezone "UTC" consistently.
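The two changes in this commit can be sketched with JDK classes only. This is an illustration under stated assumptions, not the WarcCdxWriter code: the class `RedirectCdxSketch` and its methods are hypothetical, `URI.normalize()` is only a stand-in for the Nutch URL normalizer configured for scope "fetcher", and the 14-digit timestamp pattern is assumed here as a typical CDX timestamp format.

```java
import java.net.URI;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical sketch; not the actual WarcCdxWriter. */
class RedirectCdxSketch {

  /**
   * Resolves a (possibly relative) redirect location against the URL of the
   * fetched page. URI.normalize() stands in for the real URL normalizer.
   */
  static String absoluteRedirect(String pageUrl, String location) {
    return URI.create(pageUrl).resolve(location.trim()).normalize().toString();
  }

  /** Formats a timestamp independently of the default locale and time zone. */
  static String timestamp(long epochMillis) {
    // SimpleDateFormat is not thread-safe, so a new instance is created per call.
    SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmss", Locale.ROOT);
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
    return fmt.format(new Date(epochMillis));
  }
}
```

Fixing both the locale and the time zone keeps the output identical across machines. Without them, the default locale or the host's time zone can change the formatted value.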
NUTCH-3072 Fetcher to stop QueueFeeder if aborting with "hung threads"
- fix typo in counter name
Merge upstream branch 'master' into cc
Notable changes (speed up of unit tests):
- NUTCH-2771 Tests in nightly builds: skip long runners
- NUTCH-3084 Improve CI by filtering and separating plugin and core test execution
Merge branch 'NUTCH-3067' into cc cf. https://issues.apache.org/jira/browse/NUTCH-3067
Merge pull request #30 from commoncrawl/cc-http2 WARC writer support HTTP/2