Releases: crawler-commons/crawler-commons

crawler-commons-1.5

03 Jul 07:15

Important Changes

  • The robots.txt parser is now pedantic regarding the user-agent names passed to the parseContent() method. The names in the robotNames parameter must be lower-case and the wildcard agent name "*" must not be included. An exception is thrown if these conditions are not met. Please see the Javadoc and #453.
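Callers that keep agent names in mixed case can normalize them before calling parseContent(). The following is a minimal sketch using only the JDK. The class RobotNames and its method normalize are hypothetical helpers, not part of crawler-commons, and the commented call at the end assumes the parseContent(url, content, contentType, robotNames) signature described in the Javadoc.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of crawler-commons. */
public class RobotNames {

    /**
     * Lower-cases agent names and drops the wildcard "*" as well as empty
     * names, preserving insertion order and removing duplicates.
     */
    public static Set<String> normalize(Collection<String> names) {
        Set<String> result = new LinkedHashSet<>();
        for (String name : names) {
            String n = name.trim().toLowerCase(Locale.ROOT);
            if (!n.isEmpty() && !n.equals("*")) {
                result.add(n);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> robotNames = normalize(List.of("MyBot", "*", "mybot-news"));
        System.out.println(robotNames);
        // Assumed usage with crawler-commons on the classpath:
        // new SimpleRobotRulesParser().parseContent(url, content, "text/plain", robotNames);
    }
}
```

Locale.ROOT is used so that lower-casing does not depend on the default locale of the JVM.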

Full List of Changes

  • Migrate publishing from OSSRH to Central Portal (jnioche, sebastian-nagel, Richard Zowalla, aecio) #510, #516
  • [Sitemaps] Add cross-submit feature (Avi Hayun, kkrugler, sebastian-nagel, Richard Zowalla) #85, #515
  • [Sitemaps] Complete sitemap extension attributes (sebastian-nagel, Richard Zowalla) #513, #514
  • [Sitemaps] Allow partial extension metadata (adriabonetmrf, sebastian-nagel, Richard Zowalla) #456, #458, #512
  • [Domains] EffectiveTldFinder to also take shorter suffix matches into account (sebastian-nagel, Richard Zowalla) #479, #505
  • Add package-info.java to all packages (sebastian-nagel, Richard Zowalla) #432, #504
  • [Robots.txt] Extend API to allow to check java.net.URL objects (sebastian-nagel, aecio, Richard Zowalla) #502
  • [Robots.txt] Incorrect robots.txt result for uppercase user agents (teammakdi, sebastian-nagel, aecio, Richard Zowalla) #453, #500
  • Remove class utils.Strings (sebastian-nagel, Richard Zowalla) #503
  • [BasicNormalizer] Complete normalization feature list of BasicURLNormalizer (sebastian-nagel, kkrugler) #494
  • [Robots] Document that URLs not properly normalized may not be matched by robots.txt parser (sebastian-nagel, kkrugler) #492, #493
  • [Sitemaps] Added https variants of namespaces (jnioche) #487
  • [Domains] Add version of public suffix list shipped with release packages (sebastian-nagel, Richard Zowalla) #433, #484
  • [Domains] Improve representation of public suffix match results by class EffectiveTLD (sebastian-nagel, Richard Zowalla) #478
  • Javadoc: fix links to Java core classes (sebastian-nagel, Richard Zowalla) #417, #483
  • [Sitemaps] Improve logging done by SiteMapParser (Valery Yatsynovich, sebastian-nagel) #457
  • [Sitemaps] Google Sitemap PageMap extensions (josepowera, sebastian-nagel, Richard Zowalla, jnioche) #388, #442
  • [Domains] Installation of a gzip-compressed public suffix list from Maven cache breaks EffectiveTldFinder (sebastian-nagel, Richard Zowalla) #441, #443
  • Upgrade dependencies (dependabot) #437, #444, #448, #451, #473, #465, #466, #468, #488, #491, #506, #511, #517
  • Upgrade Maven plugins (dependabot) #434, #438, #439, #449, #445, #452, #455, #459, #460, #464, #469, #467, #470, #471, #472, #474, #475, #476, #477, #480, #481, #482, #489, #490, #495, #496, #497, #498, #499, #508, #509, #518
  • Upgrade GitHub workflow actions v2 -> v4 (sebastian-nagel, Richard Zowalla) #501

crawler-commons-1.4

18 Jul 11:56

Important Changes

  • Java 11 is now required to run or build crawler-commons
  • The robots.txt parser (SimpleRobotRulesParser) is now compliant with RFC 9309

Full List of Changes

  • [Robots.txt] Implement Robots Exclusion Protocol (REP) IETF Draft: port unit tests (sebastian-nagel, Richard Zowalla) #245, #360
  • [Robots.txt] Close groups of rules as defined in RFC 9309 (kkrugler, garyillyes, jnioche, sebastian-nagel) #114, #390, #430
  • [Robots.txt] Empty disallow statement not to clear other rules (sebastian-nagel, jnioche) #422, #424
  • [Robots.txt] SimpleRobotRulesParser main() to follow five redirects (sebastian-nagel, jnioche) #428
  • [Robots.txt] Add more spelling variants and typos of robots.txt directives (sebastian-nagel, jnioche) #425
  • [Robots.txt] Document effect of rules merging in combination with multiple agent names (sebastian-nagel, Richard Zowalla) #423, #426
  • [Robots.txt] Pass empty collection of agent names to select rules for any robot (wildcard user-agent name) (sebastian-nagel, Richard Zowalla) #427
  • [Robots.txt] Rename default user-agent / robot name in unit tests (sebastian-nagel, Richard Zowalla) #429
  • [Robots.txt] Add unit tests based on examples in RFC 9309 (sebastian-nagel, Richard Zowalla) #420
  • [BasicNormalizer] Query parameters normalization in BasicURLNormalizer (aecio, sebastian-nagel, Richard Zowalla) #308, #421
  • [Robots.txt] Deduplicate robots rules before matching (sebastian-nagel, jnioche) #416
  • [Robots.txt] SimpleRobotRulesParser main to use the new API method (sebastian-nagel, jnioche) #413
  • Generate JaCoCo reports when testing (jnioche) #409, #412
  • Push Code Coverage to Coveralls (Richard Zowalla, jnioche) #414
  • [Robots.txt] Path analyse bug with url-decode if allow/disallow path contains escaped wild-card characters (tkalistratov, sebastian-nagel, Richard Zowalla) #195, #408
  • [Robots.txt] Handle allow/disallow directives containing unescaped Unicode characters (sebastian-nagel, Richard Zowalla, aecio) #389, #401
  • [Robots.txt] Improve readability of robots.txt unit tests (sebastian-nagel, Richard Zowalla) #383
  • Upgrade project to use Java 11 (Avi Hayun, Richard Zowalla, aecio, sebastian-nagel) #320, #376
  • [Robots.txt] RFC compliance: matching user-agent names when selecting rule blocks (sebastian-nagel, Richard Zowalla) #362
  • [Robots.txt] Matching user-agent names does not conform to robots.txt RFC (YossiTamari, sebastian-nagel) #192
  • [Robots.txt] Improve robots check draft rfc compliance (Eduardo Jimenez) #351
  • Upgrade dependencies (dependabot) #379, #384, #394, #399, #404, #419
  • Upgrade Maven plugins (dependabot) #377, #381, #386, #396, #397, #398, #400, #402, #403, #405, #406, #407, #415, #418
  • Javadoc: ensure Javascript search is working (sebastian-nagel, Richard Zowalla, aecio) #378, #380

crawler-commons-1.3

28 Jul 10:08
  • [Sitemaps] Disable support for DTDs in XML sitemaps and feeds by default (Kenneth Wong) #371
  • Migrate Continuous Integration from Travis to GitHub Actions (Valery Yatsynovich) #333
  • Upgrade dependencies (dependabot, Richard Zowalla) #334, #339, #345, #346, #347, #350, #354, #361, #369
  • Upgrade Maven plugins (dependabot, Richard Zowalla, sebastian-nagel) #328, #329, #330, #331, #335, #336, #337, #338, #340, #341, #343, #356, #363, #364, #366, #373, #374
  • Update pom.xml to address Maven warnings and deprecations (sebastian-nagel, Richard Zowalla, Avi Hayun) #342
  • Enable Dependabot (Valery Yatsynovich) #327
  • Removes test dependency towards mockito-core (Richard Zowalla) #367
  • Drops provided dependency towards servlet-api (Richard Zowalla) #368
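
For readers unfamiliar with what disabling DTD support means in practice, here is a minimal sketch using the JDK's built-in SAX parser. It is not taken from the crawler-commons sources: the class NoDtdSaxDemo and its methods are hypothetical, and the feature URI is the Xerces "disallow-doctype-decl" feature, which the JDK's default parser recognizes. With the feature enabled, any document containing a DOCTYPE declaration is rejected.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo class, not part of crawler-commons. */
public class NoDtdSaxDemo {

    /** Creates a namespace-aware SAX parser that rejects DOCTYPE declarations. */
    public static SAXParser newParser() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        return factory.newSAXParser();
    }

    /** Returns true if the XML parses without a SAX error, false otherwise. */
    public static boolean parses(String xml) throws Exception {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            newParser().parse(new ByteArrayInputStream(bytes), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String plain = "<urlset><url><loc>https://example.com/</loc></url></urlset>";
        String withDtd = "<?xml version=\"1.0\"?><!DOCTYPE urlset [<!ENTITY x \"y\">]><urlset/>";
        System.out.println("plain sitemap parses: " + parses(plain));
        System.out.println("sitemap with DTD parses: " + parses(withDtd));
    }
}
```

Rejecting the DOCTYPE outright also rules out entity-based attacks, since entities can only be declared inside a DTD.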

crawler-commons-1.2

14 Oct 14:44
  • [Sitemaps] Avoid calling java.net.URL::equals in equals method of sitemaps and extensions (sebastian-nagel) #322
  • [URLs] Provide a builder class to configure the URL normalizer (aecio) #321, #324
  • [URLs] Make normalization of IDNs configurable (to ASCII or Unicode) via builder (aecio, sebastian-nagel) #324
  • [Sitemaps] Fix XXE vulnerability in Sitemap parser (kovyrin) #323
  • [URLs] Sorting the Query Parameters (aecio) #246, #309
  • [URLs] Allows to (optionally) remove common irrelevant query parameters (aecio) #309
  • [Sitemaps] Allow to normalize URLs in sitemaps (murderinc, sebastian-nagel) #305
  • Normalize CHANGES.txt (Avi Hayun) #270
  • Readme.MD Overhaul of TOC, Installation, License (Avi Hayun) #311
  • [URLs] Normalize URL without a scheme (Avi Hayun, sebastian-nagel) #271
  • [Domains] EffectiveTldFinder: upgrade public suffix list / Download latest effective_tld_names.dat during Maven build (Richard Zowalla) #295, #302
  • [URLs] decode percent-encoded host names (sebastian-nagel) #303
  • [Sitemaps] Document options strict and allowPartial in SiteMapParser constructors (sebastian-nagel) #267
  • [Robots.txt] Maximum values (crawl-delay and warnings): document and make visible (sebastian-nagel, Avi Hayun) #276
  • [Sitemaps] Replace priority "NaN" by default value (sebastian-nagel) #296
  • [Sitemaps] Adding duration to the map generated by VideoAttributes.asMap (evanhalley) #300
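
As an illustration of the query parameter sorting mentioned above (#246, #309), the sketch below orders the "&"-separated parameters of a raw query string. It is a simplified stand-in, not the BasicURLNormalizer implementation: the class QuerySort and its method sortQuery are hypothetical, parameters are compared as whole "name=value" strings, and no percent-decoding is attempted.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of crawler-commons. */
public class QuerySort {

    /**
     * Sorts the '&'-separated parameters of a raw query string
     * lexicographically. Null and empty input is returned unchanged.
     */
    public static String sortQuery(String query) {
        if (query == null || query.isEmpty()) {
            return query;
        }
        String[] params = query.split("&");
        Arrays.sort(params);
        return String.join("&", params);
    }

    public static void main(String[] args) {
        System.out.println(sortQuery("b=2&a=1&c=3"));
    }
}
```

Sorting makes URLs that differ only in parameter order normalize to the same string, which helps a crawler deduplicate them.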

crawler-commons 1.1

29 Jun 17:10
crawler-commons-1.1

[maven-release-plugin] copy for tag crawler-commons-1.1

crawler-commons 1.0

21 Mar 21:04
crawler-commons-1.0

[maven-release-plugin] copy for tag crawler-commons-1.0

Release 0.10

07 Jun 08:24
  • Add JAX-B dependencies to POM (jnioche) #207
  • [Sitemaps] Add method to parse and iterate sitemap SiteMapParser#walkSiteMap(URL,Consumer) (Luc Boruta) #190
  • [Sitemaps] Sitemap file location to ignore query part of URL (sebastian-nagel) #202
  • [RSS sitemaps] Link extraction from RSS feeds fails on XML entities (sebastian-nagel) #204
  • [RSS sitemaps] Resolve relative links in RSS feeds (sebastian-nagel) #203
  • [RSS sitemaps] Extract links from elements (sebastian-nagel) #201
  • [Sitemaps] Limit on "bad url" log messages (sebastian-nagel) #145
  • EffectiveTldFinder to parse Internationalized Domain Names (sebastian-nagel) #179
  • Add main() to EffectiveTldFinder (sebastian-nagel) #187
  • Handle new suffixes in PaidLevelDomain (kkrugler) #183
  • Remove Tika dependency (kkrugler) #199
  • Improve MIME detection for sitemaps (sebastian-nagel) #200
  • Make RobotRules accessible (aecio via kkrugler) #134
  • SimpleRobotRulesParser: Expose MAX_WARNINGS and MAX_CRAWL_DELAY (aecio via kkrugler) #194
  • Added main to SimpleRobotRulesParser for testing (sebastian-nagel) #193
  • Allow for legacy URIs when checking sitemap namespaces (sebastian-nagel) #211

Release 0.9

31 Oct 09:50
  • [Sitemaps] Removed DOM-based sitemap parser (jnioche) #177
  • Incorrect domains returned by EffectiveTldFinder (sebastian-nagel) #172
  • [Sitemaps] Add namespace aware DOM/SAX parsing for XML Sitemaps (Marko Milicevic, jnioche, sebastian-nagel) #176
  • Upgraded Tika 1.16 (jnioche) #175
  • [Sitemaps] Sitemap SAX parsing mangles target URLs (jnioche, sebastian-nagel) #169
  • [Sitemaps] RSS parser ignores pubDate of link (MichealKum via kkrugler) #166

Release 0.8

09 Jun 09:22
  • Upgrade to JDK 1.8 (lewismc) #126
  • [Sitemaps] SitemapParser methods now protected (michaellavelle) #124
  • [Sitemaps] Faster parsing of dates (jnioche) #117
  • Upgraded Tika 1.13 (jnioche) #113
  • Fix license headers (jnioche) #108
  • Rename package crawlercommons.url (jnioche) #107
  • Sitemap url is not extracted if user agent matches earlier in file (srwilson, kkrugler) #112
  • Deprecate HTTP fetcher support (kkrugler) #92
  • Added URLFilter interface + BasicURLNormalizer (jnioche) #106
  • Updated tld names from publicsuffix.org (jnioche) #100
  • Upgraded http-client to version 4.5.1 (aecio via kkrugler) #84
  • Upgraded Tika 1.10 (jnioche) #89
  • [Sitemaps] Upgrade Valid / Legal / Strict SitemapUrls (Avi Hayun) #82
  • [Sitemaps] Upgrade Valid / Legal / Strict SitemapUrls (Avi Hayun) #60
  • Simplify pom file (jnioche, lewismc) #77
  • Upgrade javac.src.version and javac.target.version to 1.7 or 1.8 (lewismc) #93
  • [Sitemaps] Not able to detect RSS feeds (yogendrasoni via kkrugler) #87
  • [Robots] Added javadoc comments to the SimpleRobotRulesParser class (MichaelRoeder, kkrugler) #95

Release 0.7

21 Nov 15:14
  • Upgrade to JDK 1.8 (lewismc) #126
  • [Sitemaps] SitemapParser methods now protected (michaellavelle) #124
  • [Sitemaps] Faster parsing of dates (jnioche) #117
  • Upgraded Tika 1.13 (jnioche) #113
  • Fix license headers (jnioche) #108
  • Rename package crawlercommons.url (jnioche) #107
  • Sitemap url is not extracted if user agent matches earlier in file (srwilson, kkrugler) #112
  • Deprecate HTTP fetcher support (kkrugler) #92
  • Added URLFilter interface + BasicURLNormalizer (jnioche) #106
  • Updated tld names from publicsuffix.org (jnioche) #100
  • Upgraded http-client to version 4.5.1 (aecio via kkrugler) #84
  • Upgraded Tika 1.10 (jnioche) #89
  • [Sitemaps] Upgrade Valid / Legal / Strict SitemapUrls (Avi Hayun) #82
  • [Sitemaps] Upgrade Valid / Legal / Strict SitemapUrls (Avi Hayun) #60
  • Simplify pom file (jnioche, lewismc) #77
  • Upgrade javac.src.version and javac.target.version to 1.7 or 1.8 (lewismc) #93
  • [Sitemaps] Not able to detect RSS feeds (yogendrasoni via kkrugler) #87
  • [Robots] Added javadoc comments to the SimpleRobotRulesParser class (MichaelRoeder, kkrugler) #95