
Tags: commoncrawl/nutch

CC-MAIN-2025-26

Add tool UrlSamplerHost:

- sample URLs based on limits per host (without leading www.)
- variant of UrlSampler (per-domain limits)
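
The per-host limit described above can be illustrated with a minimal sketch. It is not the UrlSamplerHost implementation: the class `HostLimitSampler`, its method names, and the "keep the first N per host" policy are assumptions made for illustration, and the real tool may select URLs differently (for example, randomly).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of per-host URL limits; not the actual UrlSamplerHost code. */
final class HostLimitSampler {

  /** Returns the lowercased host without a leading "www.", or null if no host can be parsed. */
  static String hostKey(String url) {
    try {
      String host = URI.create(url).getHost();
      if (host == null) {
        return null;
      }
      host = host.toLowerCase(Locale.ROOT);
      return host.startsWith("www.") ? host.substring(4) : host;
    } catch (IllegalArgumentException e) {
      return null; // syntactically invalid URL
    }
  }

  /** Keeps at most maxPerHost URLs per host key, preserving input order (assumed policy). */
  static List<String> sample(List<String> urls, int maxPerHost) {
    Map<String, Integer> counts = new HashMap<>();
    List<String> kept = new ArrayList<>();
    for (String url : urls) {
      String key = hostKey(url);
      if (key == null) {
        continue; // skip URLs without a usable host
      }
      int seen = counts.merge(key, 1, Integer::sum);
      if (seen <= maxPerHost) {
        kept.add(url);
      }
    }
    return kept;
  }
}
```

In this sketch `www.example.com` and `example.com` share one limit, while `blog.example.com` is counted separately, which is the difference from a per-domain limit.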

CC-MAIN-2025-21

Add tool UrlSamplerHost:

- sample URLs based on limits per host (without leading www.)
- variant of UrlSampler (per-domain limits)

CC-MAIN-2025-18

WarcCdxWriter: normalize URL of redirect target location

Convert the redirect target location into an absolute URL
and normalize the URL using the URL normalizer configured
for scope "fetcher" before storing it as field "redirect"
in the CDX file.
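
As a rough illustration of the step described above, the sketch below resolves a `Location` header value against the URL of the page that sent it. The class `RedirectTarget` and its method are hypothetical, and `java.net.URI.normalize()` is used only as a stand-in: the commit applies the URL normalizer configured for scope "fetcher" in Nutch, which this example does not call.

```java
import java.net.URI;

/** Hypothetical sketch; stands in for the redirect handling in WarcCdxWriter. */
final class RedirectTarget {

  /**
   * Resolves a possibly relative Location value against the page URL and
   * removes "." and ".." path segments. Throws IllegalArgumentException
   * if either string is not a syntactically valid URI.
   */
  static String toAbsolute(String pageUrl, String location) {
    URI resolved = URI.create(pageUrl).resolve(location.trim());
    // Stand-in for the configured URL normalizers (assumption).
    return resolved.normalize().toString();
  }
}
```

A relative target such as `../c/./d.html` seen on `https://example.com/a/b.html` becomes `https://example.com/c/d.html`; an already absolute target is returned unchanged apart from path normalization.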

Create all instances of SimpleDateFormat using the ROOT
locale, use timezone "UTC" consistently.
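
A minimal sketch of the date-format convention described above: every `SimpleDateFormat` is created with `Locale.ROOT` and set to UTC, so the output does not depend on the default locale or time zone of the machine. The class name `CdxTimestamps` and the 14-digit pattern `yyyyMMddHHmmss` are assumptions chosen for illustration, not taken from the commit.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical helper showing locale- and timezone-independent date formatting. */
final class CdxTimestamps {

  /** Creates a formatter with the ROOT locale and the UTC time zone. */
  static SimpleDateFormat newFormat(String pattern) {
    SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
    format.setTimeZone(TimeZone.getTimeZone("UTC"));
    return format;
  }

  /** Formats epoch milliseconds as a 14-digit timestamp (assumed pattern). */
  static String format14(long epochMillis) {
    // A new formatter per call, because SimpleDateFormat is not thread-safe.
    return newFormat("yyyyMMddHHmmss").format(new Date(epochMillis));
  }
}
```

With these settings, `CdxTimestamps.format14(0L)` yields `19700101000000` regardless of the JVM's default locale or time zone.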

CC-MAIN-2025-13

WarcCdxWriter: normalize URL of redirect target location

Convert the redirect target location into an absolute URL
and normalize the URL using the URL normalizer configured
for scope "fetcher" before storing it as field "redirect"
in the CDX file.

Create all instances of SimpleDateFormat using the ROOT
locale, use timezone "UTC" consistently.

CC-MAIN-2025-08

Merge upstream branch 'master' into cc

CC-MAIN-2025-05

Merge upstream branch 'master' into cc

CC-MAIN-2024-51

NUTCH-3072 Fetcher to stop QueueFeeder if aborting with "hung threads"

- fix typo in counter name

CC-MAIN-2024-46

Merge upstream branch 'master' into cc

Notable changes (speed up of unit tests):
- NUTCH-2771 Tests in nightly builds: skip long runners
- NUTCH-3084 Improve CI by filtering and separating plugin
  and core test execution

CC-MAIN-2024-42

Merge branch 'NUTCH-3067' into cc

cf. https://issues.apache.org/jira/browse/NUTCH-3067

CC-MAIN-2024-38

Merge pull request #30 from commoncrawl/cc-http2

WARC writer support HTTP/2