v4.0.0 Release Notes

v4.0.0 includes critical fixes for text encoding, space preservation, and text positioning, along with improved error handling. This release contains breaking changes that require attention when upgrading from v3.x.

🚨 Breaking Changes

Text Encoding Change (Issue #385, PR #410)

What Changed: Text in JSON output is no longer URI-encoded. All text now outputs as UTF-8 directly.

Why: To properly support Chinese, Japanese, Korean, and other multi-byte Unicode characters. The previous URI encoding caused issues with CJK text display and partial character extraction.

Migration Required: If your code expects URI-encoded text, you must update it to handle plain UTF-8 text.

JSON Output Examples

Before v4.0.0 (URI-encoded):

{
  "Pages": [{
    "Texts": [{
      "R": [{
        "T": "Added%20Text%20from%20Acrobat"
      }]
    }]
  }]
}

After v4.0.0 (UTF-8):

{
  "Pages": [{
    "Texts": [{
      "R": [{
        "T": "Added Text from Acrobat"
      }]
    }]
  }]
}

Code Migration

Before v4.0.0:

// Had to decode URI components
const text = decodeURIComponent(textObj.R[0].T);
// Output: "Added Text from Acrobat"

After v4.0.0:

// Direct text access, no decoding needed
const text = textObj.R[0].T;
// Output: "Added Text from Acrobat"
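
If the same code has to read JSON produced by both v3.x and v4.0.0 during a transition, one option is to make the decoding step explicit. The sketch below is only an illustration: `collectTexts` and its `uriEncoded` option are hypothetical names, not part of pdf2json, and the sample objects simply mirror the JSON examples above.

```javascript
// Hypothetical helper (not a pdf2json API): collects every text run from parsed output.
// Pass { uriEncoded: true } only for JSON that was produced by v3.x.
function collectTexts(pdfData, { uriEncoded = false } = {}) {
  const texts = [];
  for (const page of pdfData.Pages ?? []) {
    for (const textObj of page.Texts ?? []) {
      for (const run of textObj.R ?? []) {
        texts.push(uriEncoded ? decodeURIComponent(run.T) : run.T);
      }
    }
  }
  return texts;
}

// Sample data shaped like the "before" and "after" JSON examples.
const v3Output = { Pages: [{ Texts: [{ R: [{ T: "Added%20Text%20from%20Acrobat" }] }] }] };
const v4Output = { Pages: [{ Texts: [{ R: [{ T: "Added Text from Acrobat" }] }] }] };

console.log(collectTexts(v3Output, { uriEncoded: true })); // [ 'Added Text from Acrobat' ]
console.log(collectTexts(v4Output));                       // [ 'Added Text from Acrobat' ]
```

Leaving `decodeURIComponent` in place "just in case" is not safe with v4.0.0 output: plain text that contains a literal percent sign, such as "50% off", makes `decodeURIComponent` throw a `URIError`.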

CJK Character Support

Before v4.0.0:

{
  "T": "%E4%B8%AD%E6%96%87"
}

After v4.0.0:

{
  "T": "中文"
}
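
The following sketch shows one plausible way the URI-encoded form could lead to partial character problems. It uses only standard JavaScript string functions; `tryDecode` is a hypothetical helper, and the truncation scenario is an assumed example rather than a reproduction of the original issue.

```javascript
const encoded = "%E4%B8%AD%E6%96%87"; // value of "T" before v4.0.0
const plain = "中文";                  // value of "T" in v4.0.0

console.log(decodeURIComponent(encoded) === plain); // true
console.log(encoded.length, plain.length);          // 18 2

// Hypothetical helper: returns null instead of throwing on malformed input.
function tryDecode(s) {
  try {
    return decodeURIComponent(s);
  } catch {
    return null;
  }
}

// Cutting the encoded string in the middle of a character leaves an
// incomplete UTF-8 sequence ("%E4%B8"), which cannot be decoded.
console.log(tryDecode(encoded.slice(0, 6))); // null

// The plain UTF-8 string can be sliced per character without a decoding step.
console.log(plain.slice(0, 1)); // 中
```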

✨ Features & Enhancements

Accurate Space Preservation (Issues #355, #361, #319, PR #411)

Complete overhaul of space detection and preservation in text extraction (to try it from the CLI, use the -c command line option):

Glyph-based width calculation - Uses actual font metrics instead of estimates
Proper coordinate system handling - Correctly processes scaled positions with unscaled widths
Text scale support - Applies textHScale for compressed/expanded text
Dynamic Y-tolerance - Font size-aware vertical positioning (fontSize × 0.15)

Impact: Spaces in extracted text (both content.txt and JSON output) now accurately reflect the original PDF layout. Multi-word phrases, tables, and formatted text preserve proper spacing.
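
To make the ideas above concrete, here is a minimal sketch of width-based space detection with a font-size-aware Y-tolerance. It is not pdf2json's implementation: the run shape (`text`, `x`, `y`, `width`, `fontSize`, `hScale`), the `joinRuns` name, and the 0.25 gap threshold are all assumptions made for illustration, and positions and widths are assumed to already be in the same units. Only the 0.15 Y-tolerance factor comes from the notes above.

```javascript
const Y_TOLERANCE_FACTOR = 0.15; // from the release notes: fontSize × 0.15
const SPACE_GAP_FACTOR = 0.25;   // assumed threshold, not taken from pdf2json

// Joins text runs, inserting a space when the horizontal gap between two
// runs on the same line exceeds a fraction of the font size.
function joinRuns(runs) {
  let out = "";
  let prev = null;
  for (const run of runs) {
    if (prev) {
      const sameLine =
        Math.abs(run.y - prev.y) <= prev.fontSize * Y_TOLERANCE_FACTOR;
      if (!sameLine) {
        out += "\n";
      } else {
        // Apply horizontal text scale to the width of the previous run.
        const prevEnd = prev.x + prev.width * (prev.hScale ?? 1);
        if (run.x - prevEnd > prev.fontSize * SPACE_GAP_FACTOR) out += " ";
      }
    }
    out += run.text;
    prev = run;
  }
  return out;
}

const runs = [
  { text: "Name:", x: 0, y: 100, width: 25, fontSize: 10 },
  { text: "Jo", x: 28, y: 100.5, width: 10, fontSize: 10 }, // small y jitter, same line
  { text: "hn", x: 38, y: 100, width: 10, fontSize: 10 },   // no gap, no space
  { text: "Doe", x: 51, y: 100, width: 15, fontSize: 10 },
  { text: "SSN:", x: 0, y: 115, width: 20, fontSize: 10 },  // new line
];

console.log(JSON.stringify(joinRuns(runs))); // "Name: John Doe\nSSN:"
```

Comparing the gap against the measured end of the previous run, rather than against an estimated character width, is what lets "Jo" and "hn" merge into one word while "Name:" and "John" stay separated.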

Example Output Improvement

Before v4.0.0:

Name:JohnDoeSSN:123-45-6789

After v4.0.0:

Name: John Doe    SSN: 123-45-6789

🐛 Bug Fixes

Text Block Coordinate Accuracy (Issue #408, PR #409)

Fixed text block coordinate calculations for proper positioning
Added comprehensive coordinate tests
Ensures accurate x/y values in JSON output

Character Extraction Completeness (Issue #385, PR #410)

Fixed missing character extraction for glyphs marked as "disabled"
Moved text extraction outside glyph.disabled check
All visible characters now properly extracted
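
The change described above can be pictured with the following simplified loop. This is a sketch under assumptions, not the actual pdf2json source: the glyph shape `{ unicode, disabled }` and the function names `extractBefore` and `extractAfter` are hypothetical.

```javascript
// Before: text collection sat behind the disabled check,
// so characters of "disabled" glyphs were dropped.
function extractBefore(glyphs) {
  let text = "";
  for (const glyph of glyphs) {
    if (glyph.disabled) continue;
    text += glyph.unicode;
  }
  return text;
}

// After: text is collected first; the disabled check only
// guards the rendering-specific work that follows.
function extractAfter(glyphs) {
  let text = "";
  for (const glyph of glyphs) {
    text += glyph.unicode;
    if (glyph.disabled) continue;
    // ... rendering-specific work would go here ...
  }
  return text;
}

const glyphs = [
  { unicode: "a", disabled: false },
  { unicode: "b", disabled: true },
  { unicode: "c", disabled: false },
];

console.log(extractBefore(glyphs)); // ac
console.log(extractAfter(glyphs));  // abc
```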

CLI Error Handling (Issue #414)

Unified error and exception handling for CLI operations
Better error messages for invalid input parameters
Auto-creates output directory when not specified (removed unnecessary validation)
Improved stack trace display

More related issues should have been fixed (test PDFs are needed to confirm)

#352 : unexpected space
#291 : problem with sentences broken into 1 word
#272 : unrecognized Text
#220 : two TEXTs unexpected joined together in one RUN
#212 : content is being randomly split into multiple lines
#177 : heading level of text is not captured
#156 : extracting table content
#94 : parser not handling some spaces between words

📦 Dependencies

Maintained zero runtime dependencies (since v3.1.6)
Updated development dependencies for build tooling

Stable Build v3.2.2

fix #406
refactor: separate out logger functionality from nodeUtil

Stable build: V3.2.1

types update:
- fix #392
- update types for root pdfparser.js
feat: add type3 glyph font test support
- issue fixed: #389, #377, #332
- architectural compliance, separate the type3 glyph fonts processing from rendering, use standard canvas text rendering pipeline for glyph, tested with /test/pdf/misc/i389_type3_glyph.pdf
chores: update README, bump dev dependencies versions while keeping zero dependency

Stable build v3.2.0

Contributors: @nicolabaesso

add support for deno and bun plus tests
- fix: issue #68 and #396
- add the node: protocol to imports to make them explicit when running in environments other than Node, including Deno and Bun
moved root pdfparser source and types to ./src and ./src/types respectively; please double-check your import paths, since all exports now come from ./dist
reduce distributed package size to 2.1 MB, improve pack and build
feat: enable reading multiple pdf files with a single PDFParser object, credit @nicolabaesso
other chores, including tests, jest upgrade, readme update, etc.

Stable build v3.1.6

Contributors: @bogie

What's Changed

zero dependency: remove dependency on @xmldom/xmldom to make pdf2json zero dependency
fix: correct link for open code of conduct #204
Fixed radio/checkbox return values in getAllFieldsTypes(), thanks @bogie for #383
fix: move package manager version from engines to devEngines, thanks @styfle for #387

New Contributors

@bogie made their first contribution in #383
@styfle made their first contribution in #387

Full Changelog: v3.1.5...v3.1.6

Stable build v3.1.5

Contributors: @grainrigi

feature added:

add commonjs type definition file generation, thanks @grainrigi
add 'types' to package.json 'exports' root, thanks @jeremybanka

Issues addressed:

fix #165: check and make buffer before parse
fix #373: handle bad encoding exception by starting page rendering after the page operator list is resolved
fix #306: infinite loop on invalid stream
fix #369: handle object value for field's rectangle coordinates
other maintenance, eslint, tsconfig, dependency version bumps, etc.

Stable Build v3.1.4

dev-dependency updates for braces
correct import for typescript type to fix #349: Cannot compile project with 3.1.3
plus issues addressed in v3.1.4:
- #350: replace nodeUtil.warn with nodeUtil.p2jwarn
- #274: Invalid XRef stream
- #216: stream must have data, verified fix

Stable build v3.1.3

eslint is configured and enabled
typescript: configured and part of build
typescript: updated pdfparser.d.ts with more types
typescript: previous lib/p2jcmd*.js are replaced with src/cli/p2jcli*.ts
maint: previous root/pdf2json.js is removed, favor bin/pdf2json.js
tests: Page content in Jest tests is validated against test/data/xxx.json
error and exception handling: address the following issues and also added associated test PDFs:
** ENOENT: no such file or directory, open '/var/task/../package.json' #343
** Node.js Server got stuck when parsing specific PDF while it is working for other PDFs #321
** TypeError: Cannot read property 'free' of undefined #318
** parserError: 'bad XRef entry' #277
** params.get is not a function #262
** Error: Requesting object that isn't resolved yet #255

Stable build v3.1.2

add conditional export for both esm and cjs
remove unused dev dependency
more tests

Stable build v3.1.1

This v3.1.1 release replaces pdf2json@3.1.0.

output to both esm and commonJS bundles and source map with rollup
bundle outputs directory: ./dist
note: previous pdfparser.cjs from root is moved to ./dist/pdfparser.cjs
note: previous output bundles are now minified
note: previous vows tests are removed, test suites are rewritten in Jest, currently 23 test cases
note: npm build is required to run command line, output from build step is not tracked by git
more README.md updates and type corrections, thanks @gladykov @mkrishnan-codes
add env option to disable debugging logs, thanks @AyresMonteiro

Releases: modesty/pdf2json

Stable Build v4.0.0 [Breaking Changes]
Stable Build v3.2.2
Stable build: V3.2.1
Stable build v3.2.0
Stable build v3.1.6
Stable build v3.1.5
Stable Build v3.1.4
Stable build v3.1.3
Stable build v3.1.2
Stable build v3.1.1