Releases: modesty/pdf2json

Stable Build v4.0.0 [Breaking Changes]

12 Oct 19:47
c8b372b

v4.0.0 Release Notes

v4.0.0 includes critical fixes for text encoding, space preservation, and text positioning, along with improved error handling. This release contains breaking changes that require attention when upgrading from v3.x.

🚨 Breaking Changes

Text Encoding Change (Issue #385, PR #410)

What Changed: Text in JSON output is no longer URI-encoded. All text now outputs as UTF-8 directly.

Why: To properly support Chinese, Japanese, Korean, and other multi-byte Unicode characters. The previous URI encoding caused issues with CJK text display and partial character extraction.

Migration Required: If your code expects URI-encoded text, you must update it to handle plain UTF-8 text.

JSON Output Examples

Before v4.0.0 (URI-encoded):

{
  "Pages": [{
    "Texts": [{
      "R": [{
        "T": "Added%20Text%20from%20Acrobat"
      }]
    }]
  }]
}

After v4.0.0 (UTF-8):

{
  "Pages": [{
    "Texts": [{
      "R": [{
        "T": "Added Text from Acrobat"
      }]
    }]
  }]
}

Code Migration

Before v4.0.0:

// Had to decode URI components
const text = decodeURIComponent(textObj.R[0].T);
// Output: "Added Text from Acrobat"

After v4.0.0:

// Direct text access, no decoding needed
const text = textObj.R[0].T;
// Output: "Added Text from Acrobat"
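
For code that walks a whole document rather than a single run, the version difference can be isolated in one helper. The following is a minimal sketch, not part of pdf2json: `collectText` and its `uriEncoded` flag are hypothetical names, and the code assumes only the `Pages[].Texts[].R[].T` shape shown in the JSON examples above.

```javascript
// Hypothetical helper (not part of pdf2json): gathers every text run from
// parsed output shaped like the JSON examples above.
// Pass uriEncoded = true only for JSON that was produced by v3.x.
function collectText(pdfData, uriEncoded = false) {
  const runs = [];
  for (const page of pdfData.Pages ?? []) {
    for (const textObj of page.Texts ?? []) {
      for (const run of textObj.R ?? []) {
        runs.push(uriEncoded ? decodeURIComponent(run.T) : run.T);
      }
    }
  }
  return runs;
}

// Sample data copied from the "before" and "after" JSON examples.
const v3Output = { Pages: [{ Texts: [{ R: [{ T: "Added%20Text%20from%20Acrobat" }] }] }] };
const v4Output = { Pages: [{ Texts: [{ R: [{ T: "Added Text from Acrobat" }] }] }] };

console.log(collectText(v3Output, true)); // [ 'Added Text from Acrobat' ]
console.log(collectText(v4Output));       // [ 'Added Text from Acrobat' ]
```

Keeping the flag explicit, rather than guessing from the text whether it is encoded, avoids mis-decoding v4.0.0 output that happens to contain a percent sign.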

CJK Character Support

Before v4.0.0:

{
  "T": "%E4%B8%AD%E6%96%87"
}

After v4.0.0:

{
  "T": "中文"
}
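
Leftover decoding calls should be removed rather than kept "just in case", because `decodeURIComponent` is not a no-op on plain text. The sketch below uses only standard JavaScript. The two sample strings containing `%` are illustrative assumptions, not real pdf2json output, and `tryDecode` is a hypothetical helper.

```javascript
// The v3.x value decodes to exactly the v4.0.0 value.
const legacy = "%E4%B8%AD%E6%96%87";
console.log(decodeURIComponent(legacy) === "中文"); // true

// Hypothetical helper: decode and report the error name instead of throwing.
function tryDecode(text) {
  try {
    return decodeURIComponent(text);
  } catch (err) {
    return err.name; // malformed escape sequences raise URIError
  }
}

// Decoding already-plain text is unsafe when it contains "%".
console.log(tryDecode("100% done")); // URIError
console.log(tryDecode("rate: %41")); // rate: A (silently altered)
```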

✨ Features & Enhancements

Accurate Space Preservation (Issues #355, #361, #319, PR #411)

Complete overhaul of space detection and preservation in text extraction (to test, run the CLI with the -c command-line option):

  • Glyph-based width calculation - Uses actual font metrics instead of estimates
  • Proper coordinate system handling - Correctly processes scaled positions with unscaled widths
  • Text scale support - Applies textHScale for compressed/expanded text
  • Dynamic Y-tolerance - Font size-aware vertical positioning (fontSize × 0.15)

Impact: Spaces in extracted text (both content.txt and JSON output) now accurately reflect the original PDF layout. Multi-word phrases, tables, and formatted text preserve proper spacing.
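
The dynamic Y-tolerance rule can be illustrated with a small sketch. This is not pdf2json's internal code: the function names, the item shape (`text`, `y`, `fontSize`), the assumption that `y` and `fontSize` share one unit, and the choice to use the larger of the two font sizes are all made up for the example. Only the `fontSize × 0.15` factor comes from the list above.

```javascript
// Factor taken from the release notes; everything else here is illustrative.
const Y_TOLERANCE_FACTOR = 0.15;

// Two items count as being on the same line when their vertical distance
// is within a tolerance that scales with font size.
function isSameLine(a, b) {
  const tolerance = Math.max(a.fontSize, b.fontSize) * Y_TOLERANCE_FACTOR;
  return Math.abs(a.y - b.y) <= tolerance;
}

// Sort by y, then start a new line whenever an item falls outside the
// tolerance of the first item in the current line.
function groupIntoLines(items) {
  const lines = [];
  for (const item of [...items].sort((p, q) => p.y - q.y)) {
    const current = lines[lines.length - 1];
    if (current && isSameLine(current[0], item)) {
      current.push(item);
    } else {
      lines.push([item]);
    }
  }
  return lines;
}

const items = [
  { text: "Name:", y: 100.0, fontSize: 12 },
  { text: "John", y: 100.9, fontSize: 12 }, // 0.9 apart, within 12 × 0.15
  { text: "SSN:", y: 120.0, fontSize: 12 }, // 20 apart, a new line
];

console.log(groupIntoLines(items).map((line) => line.map((i) => i.text)));
// [ [ 'Name:', 'John' ], [ 'SSN:' ] ]
```

A fixed tolerance would either merge adjacent lines of small text or split slightly misaligned runs of large text, which is the case a font-size-aware threshold is meant to handle.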

Example Output Improvement

Before v4.0.0:

Name:JohnDoeSSN:123-45-6789

After v4.0.0:

Name: John Doe    SSN: 123-45-6789

🐛 Bug Fixes

Text Block Coordinate Accuracy (Issue #408, PR #409)

  • Fixed text block coordinate calculations for proper positioning
  • Added comprehensive coordinate tests
  • Ensures accurate x/y values in JSON output

Character Extraction Completeness (Issue #385, PR #410)

  • Fixed missing character extraction for glyphs marked as "disabled"
  • Moved text extraction outside glyph.disabled check
  • All visible characters now properly extracted
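
The second bullet describes a control-flow change, which the sketch below illustrates. The glyph shape (`unicode`, `disabled`) and both function names are hypothetical stand-ins, not pdf2json's actual internals. The only point is that collecting text no longer sits behind the `disabled` check.

```javascript
// Before (illustrative): text was collected only for glyphs not marked disabled.
function extractTextBefore(glyphs) {
  let text = "";
  for (const glyph of glyphs) {
    if (!glyph.disabled) {
      text += glyph.unicode;
    }
  }
  return text;
}

// After (illustrative): text is always collected; the disabled flag only
// guards whatever rendering-specific work remains.
function extractTextAfter(glyphs) {
  let text = "";
  for (const glyph of glyphs) {
    text += glyph.unicode;
    if (!glyph.disabled) {
      // rendering-only work would go here
    }
  }
  return text;
}

// Made-up sample: the second glyph is flagged as disabled.
const sampleGlyphs = [
  { unicode: "中", disabled: false },
  { unicode: "文", disabled: true },
];

console.log(extractTextBefore(sampleGlyphs)); // 中
console.log(extractTextAfter(sampleGlyphs));  // 中文
```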

CLI Error Handling (Issue #414)

  • Unified error and exception handling for CLI operations
  • Better error messages for invalid input parameters
  • Auto-creates output directory when not specified (removed unnecessary validation)
  • Improved stack trace display

Other related issues that should also be fixed by these changes (test PDFs are needed to confirm):

  • #352 : unexpected space
  • #291 : problem with sentences broken into 1 word
  • #272 : unrecognized Text
  • #220 : two TEXTs unexpected joined together in one RUN
  • #212 : content is being randomly split into multiple lines
  • #177 : heading level of text is not captured
  • #156 : extracting table content
  • #94 : parser not handling some spaces between words

📦 Dependencies

  • Maintained zero runtime dependencies (since v3.1.6)
  • Updated development dependencies for build tooling

Stable Build v3.2.2

19 Sep 01:47
1faf820

  • fix #406
  • refactor: separate out logger functionality from nodeUtil

Stable build: V3.2.1

13 Sep 23:47
b03348e

  • types update:
    • fix #392
    • update types for root pdfparser.js
  • feat: add type3 glyph font test support
    • issue fixed: #389, #377, #332
    • architectural compliance: separate Type3 glyph font processing from rendering and use the standard canvas text rendering pipeline for glyphs; tested with /test/pdf/misc/i389_type3_glyph.pdf
  • chores: update README, bump dev dependency versions while keeping zero runtime dependencies

Stable build v3.2.0

26 Jul 23:02

  1. add support for Deno and Bun, plus tests
    • fix: issues #68 and #396
    • add the node: protocol to built-in module imports to make them explicit when running in environments other than Node, including Deno and Bun
  2. moved root pdfparser source and types to ./src and ./src/types respectively. Please double-check your import paths: all exports now come from ./dist
  3. reduce distributed package size to 2.1 MB, improve pack and build
  4. feat: enable reading multiple pdf files with a single PDFParser object, credit @nicolabaesso
  5. other chores, including tests, jest upgrade, readme update, etc.

Stable build v3.1.6

24 May 00:20
2298a86

What's Changed

  • zero dependency: remove dependency on @xmldom/xmldom to make pdf2json zero dependency
  • fix: correct link for open code of conduct #204
  • Fixed radio/checkbox return values in getAllFieldsTypes(), thanks @bogie for #383
  • fix: move package manager version from engines to devEngines, thanks @styfle for #387

Full Changelog: v3.1.5...v3.1.6

Stable build v3.1.5

03 Jan 23:13
49486ef

Features added:

  1. add commonjs type definition file generation, thanks @grainrigi
  2. add 'types' to package.json 'exports' root, thanks @jeremybanka

Issues addressed:

  1. fix #165: check and make buffer before parse
  2. fix #373: handle bad encoding exception by starting page rendering after the page operator list is resolved
  3. fix #306: infinite loop on invalid stream
  4. fix #369: handle object value for field's rectangle coordinates
  5. other maintenance, eslint, tsconfig, dependency version bumps, etc.

Stable Build v3.1.4

09 Aug 18:54
115d618

  • dev-dependency updates for braces
  • correct import for typescript type to fix #349: Cannot compile project with 3.1.3
  • plus issues addressed in v3.1.4:
    • #350: replace nodeUtil.warn with nodeUtil.p2jwarn
    • #274: Invalid XRef stream
    • #216: stream must have data, verified fix

Stable build v3.1.3

25 May 19:05

  • eslint is configured and enabled
  • typescript: configured and part of build
  • typescript: updated pdfparser.d.ts with more types
  • typescript: previous lib/p2jcmd*.js are replaced with src/cli/p2jcli*.ts
  • maint: previous root/pdf2json.js is removed, favor bin/pdf2json.js
  • tests: Jest test's Page content are validated with test/data/xxx.json
  • error and exception handling: addressed the following issues and added associated test PDFs:
    • ENOENT: no such file or directory, open '/var/task/../package.json' #343
    • Node.js Server got stuck when parsing specific PDF while it is working for other PDFs #321
    • TypeError: Cannot read property 'free' of undefined #318
    • parserError: 'bad XRef entry' #277
    • params.get is not a function #262
    • Error: Requesting object that isn't resolved yet #255

Stable build v3.1.2

04 May 19:00

  • add conditional export for both esm and cjs
  • remove unused dev dependency
  • more tests

Stable build v3.1.1

01 May 01:18
3c703fe

This v3.1.1 release replaces pdf2json@3.1.0.

  • output to both esm and commonJS bundles and source map with rollup
  • bundle outputs directory: ./dist
  • note: previous pdfparser.cjs from root is moved to ./dist/pdfparser.cjs
  • note: previous output bundles are now minified
  • note: previous vows tests are removed, test suites are rewritten in Jest, currently 23 test cases
  • note: npm build is required to run command line, output from build step is not tracked by git
  • more README.md updates and type corrections, thanks @gladykov @mkrishnan-codes
  • add env option to disable debugging logs, thanks @AyresMonteiro