Releases: modesty/pdf2json
Stable Build v4.0.0 [Breaking Changes]
v4.0.0 Release Notes
v4.0.0 includes critical fixes for text encoding, space preservation, and text positioning, along with improved error handling. This release contains breaking changes that require attention when upgrading from v3.x.
🚨 Breaking Changes
Text Encoding Change (Issue #385, PR #410)
What Changed: Text in JSON output is no longer URI-encoded. All text now outputs as UTF-8 directly.
Why: To properly support Chinese, Japanese, Korean, and other multi-byte Unicode characters. The previous URI encoding caused issues with CJK text display and partial character extraction.
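The partial-extraction problem can be illustrated with a small sketch. It assumes a consumer truncates an encoded string at an arbitrary offset (the slice offset and the `tryDecode` helper are hypothetical, chosen only for illustration): cutting through a percent escape leaves a string that no longer decodes, while the same operation on plain UTF-8 text works per character.

```javascript
// A URI-encoded CJK string: each character takes nine ASCII characters.
const encoded = "%E4%B8%AD%E6%96%87"; // "中文"

// Hypothetical truncation that cuts through the second character's escapes.
const truncated = encoded.slice(0, 12); // "%E4%B8%AD%E6"

// Returns the decoded text, or the error name if decoding fails.
function tryDecode(value) {
  try {
    return decodeURIComponent(value);
  } catch (err) {
    return err.name;
  }
}

console.log(tryDecode(encoded));   // "中文"
console.log(tryDecode(truncated)); // "URIError"
console.log("中文".slice(0, 1));    // "中" (plain UTF-8 text slices per character)
```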
Migration Required: If your code expects URI-encoded text, you must update it to handle plain UTF-8 text.
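For code that has to handle both v3.x and v4.x output during an upgrade, one possible approach is a small accessor with an explicit flag. This is a minimal sketch, not part of pdf2json: `runText`, `pageTexts`, and the `uriEncoded` flag are hypothetical names, and the caller is assumed to know which major version produced the JSON.

```javascript
// Hypothetical helper: returns the text of one run ("R" entry).
// uriEncoded = true  -> JSON produced by pdf2json v3.x (URI-encoded "T")
// uriEncoded = false -> JSON produced by pdf2json v4.x (plain UTF-8 "T")
function runText(run, uriEncoded) {
  return uriEncoded ? decodeURIComponent(run.T) : run.T;
}

// Hypothetical helper: collects the text of every block on a page,
// following the Pages[].Texts[].R[].T shape of the JSON output.
function pageTexts(page, uriEncoded) {
  return page.Texts.map((block) =>
    block.R.map((run) => runText(run, uriEncoded)).join("")
  );
}

const v3Page = { Texts: [{ R: [{ T: "Added%20Text%20from%20Acrobat" }] }] };
const v4Page = { Texts: [{ R: [{ T: "Added Text from Acrobat" }] }] };

console.log(pageTexts(v3Page, true));  // [ 'Added Text from Acrobat' ]
console.log(pageTexts(v4Page, false)); // [ 'Added Text from Acrobat' ]
```

The flag is explicit rather than auto-detected because plain v4 text can legitimately contain `%` (for example `100%`), which `decodeURIComponent` would either reject or decode incorrectly.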
JSON Output Examples
Before v4.0.0 (URI-encoded):
{
"Pages": [{
"Texts": [{
"R": [{
"T": "Added%20Text%20from%20Acrobat"
}]
}]
}]
}
After v4.0.0 (UTF-8):
{
"Pages": [{
"Texts": [{
"R": [{
"T": "Added Text from Acrobat"
}]
}]
}]
}
Code Migration
Before v4.0.0:
// Had to decode URI components
const text = decodeURIComponent(textObj.R[0].T);
// Output: "Added Text from Acrobat"
After v4.0.0:
// Direct text access, no decoding needed
const text = textObj.R[0].T;
// Output: "Added Text from Acrobat"
CJK Character Support
Before v4.0.0:
{
"T": "%E4%B8%AD%E6%96%87"
}
After v4.0.0:
{
"T": "中文"
}
✨ Features & Enhancements
Accurate Space Preservation (Issues #355, #361, #319, PR #411)
Complete overhaul of space detection and preservation in text extraction (to try it, run the CLI with the -c command-line option):
- Glyph-based width calculation - Uses actual font metrics instead of estimates
- Proper coordinate system handling - Correctly processes scaled positions with unscaled widths
- Text scale support - Applies textHScale for compressed/expanded text
- Dynamic Y-tolerance - Font size-aware vertical positioning (fontSize × 0.15)
Impact: Spaces in extracted text (both content.txt and JSON output) now accurately reflect the original PDF layout. Multi-word phrases, tables, and formatted text preserve proper spacing.
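The general idea of gap-based space detection can be sketched as follows. This is an illustration only, not pdf2json's actual implementation: the run shape (`text`, `x`, `y`, `width`, `fontSize`, `hScale`), the `joinRuns` name, and the `spaceRatio` threshold are all assumptions. Only the vertical tolerance of fontSize × 0.15 comes from the notes above.

```javascript
// Illustrative sketch: joins positioned text runs into a string.
// Each run is assumed to look like { text, x, y, width, fontSize, hScale? },
// with all measurements in the same unit. spaceRatio is a made-up threshold.
function joinRuns(runs, spaceRatio = 0.25) {
  let out = "";
  let prev = null;
  for (const run of runs) {
    if (prev) {
      // Font size-aware vertical tolerance (fontSize × 0.15).
      const yTolerance = Math.max(prev.fontSize, run.fontSize) * 0.15;
      if (Math.abs(run.y - prev.y) > yTolerance) {
        out += "\n"; // the run starts a new line
      } else {
        // Horizontal scale shrinks or stretches the previous run's width.
        const prevEnd = prev.x + prev.width * (prev.hScale ?? 1);
        const gap = run.x - prevEnd;
        if (gap > run.fontSize * spaceRatio) out += " ";
      }
    }
    out += run.text;
    prev = run;
  }
  return out;
}

const line = joinRuns([
  { text: "Name:", x: 0, y: 100, width: 30, fontSize: 12 },
  { text: "John", x: 36, y: 100.5, width: 24, fontSize: 12 },
  { text: "Doe", x: 66, y: 100, width: 18, fontSize: 12 },
]);
console.log(line); // "Name: John Doe"
```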
Example Output Improvement
Before v4.0.0:
Name:JohnDoeSSN:123-45-6789
After v4.0.0:
Name: John Doe SSN: 123-45-6789
🐛 Bug Fixes
Text Block Coordinate Accuracy (Issue #408, PR #409)
- Fixed text block coordinate calculations for proper positioning
- Added comprehensive coordinate tests
- Ensures accurate x/y values in JSON output
Character Extraction Completeness (Issue #385, PR #410)
- Fixed missing character extraction for glyphs marked as "disabled"
- Moved text extraction outside glyph.disabled check
- All visible characters now properly extracted
CLI Error Handling (Issue #414)
- Unified error and exception handling for CLI operations
- Better error messages for invalid input parameters
- Auto-creates output directory when not specified (removed unnecessary validation)
- Improved stack trace display
More related issues should also be fixed by this release (test PDFs are needed to confirm):
- #352 : unexpected space
- #291 : problem with sentences broken into 1 word
- #272 : unrecognized Text
- #220 : two TEXTs unexpected joined together in one RUN
- #212 : content is being randomly split into multiple lines
- #177 : heading level of text is not captured
- #156 : extracting table content
- #94 : parser not handling some spaces between words
📦 Dependencies
- Maintained zero runtime dependencies (since v3.1.6)
- Updated development dependencies for build tooling
Stable Build v3.2.2
- fix #406
- refactor: separate out logger functionality from nodeUtil
Stable build: V3.2.1
- types update:
- fix #392
- update types for root pdfparser.js
- feat: add type3 glyph font test support
- chores: update README, bump dev dependencies versions while keeping zero dependency
Stable build v3.2.0
- add support for deno and bun plus tests
-- fix: issue #68 and #396
-- add the node: protocol to built-in module imports to make them explicit when running in environments other than node, including deno and bun
- moved root pdfparser source and types to ./src and ./src/types respectively
-- double check your import path please, all exports are from ./dist now
- reduce distributed package size to 2.1mb, improve pack and build
- feat: enable reading multiple pdf files with a single PDFParser object, credit @nicolabaesso
- other chores, including tests, jest upgrade, readme update, etc.
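Reusing one parser for several files could look roughly like the sketch below. It is written under assumptions: `parseOne` and `parseAll` are hypothetical helpers, and `parser` is assumed to behave like a PDFParser instance, that is, an event emitter with `loadPDF(path)` that emits `pdfParser_dataReady` on success and `pdfParser_dataError` on failure. Files are processed one at a time so that each result can be matched to its path.

```javascript
// Sketch: resolve with the parsed data for one file, or reject on error.
function parseOne(parser, path) {
  return new Promise((resolve, reject) => {
    const onReady = (data) => {
      parser.off("pdfParser_dataError", onError); // drop the unused listener
      resolve(data);
    };
    const onError = (err) => {
      parser.off("pdfParser_dataReady", onReady); // drop the unused listener
      reject(err);
    };
    parser.once("pdfParser_dataReady", onReady);
    parser.once("pdfParser_dataError", onError);
    parser.loadPDF(path);
  });
}

// Sketch: parse every path in order with the same parser object.
async function parseAll(parser, paths) {
  const results = [];
  for (const path of paths) {
    // Wait for the current file before loading the next one.
    results.push(await parseOne(parser, path));
  }
  return results;
}
```

With pdf2json installed, this might be called as `parseAll(new PDFParser(), ["a.pdf", "b.pdf"])`, where the file names are placeholders.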
Stable build v3.1.6
What's Changed
- zero dependency: remove dependency on @xmldom/xmldom to make pdf2json zero dependency
- fix: correct link for open code of conduct #204
- Fixed radio/checkbox return values in getAllFieldsTypes(), thanks @bogie for #383
- fix: move package manager version from engines to devEngines, thanks @styfle for #387
New Contributors
Full Changelog: v3.1.5...v3.1.6
Stable build v3.1.5
feature added:
- add commonjs type definition file generation, thanks @grainrigi
- add 'types' to package.json 'exports' root, thanks @jeremybanka
Issues addressed:
- fix #165: check and make buffer before parse
- fix #373: handle bad encoding exception by starting page rendering after page operator list is resolved
- fix #306: infinite loop of invalid stream
- fix #369: handle object value for field's rectangle coordinates
- other maintenance, eslint, tsconfig, dependency version bumps, etc.
Stable Build v3.1.4
Stable build v3.1.3
- eslint is configured and enabled
- typescript: configured and part of build
- typescript: updated pdfparser.d.ts with more types
- typescript: previous lib/p2jcmd*.js are replaced with src/cli/p2jcli*.ts
- maint: previous root/pdf2json.js is removed, favor bin/pdf2json.js
- tests: Jest test's Page content are validated with test/data/xxx.json
- error and exception handling: address the following issues and also added associated test PDFs:
** ENOENT: no such file or directory, open '/var/task/../package.json' #343
** Node.js Server got stuck when parsing specific PDF while it is working for other PDFs #321
** TypeError: Cannot read property 'free' of undefined #318
** parserError: 'bad XRef entry' #277
** params.get is not a function #262
** Error: Requesting object that isn't resolved yet #255
Stable build v3.1.2
- add conditional export for both esm and cjs
- remove unused dev dependency
- more tests
Stable build v3.1.1
This v3.1.1 release replaces pdf2json@3.1.0.
- output to both esm and commonJS bundles and source map with rollup
- bundle outputs directory: ./dist
- note: previous pdfparser.cjs from root is moved to ./dist/pdfparser.cjs
- note: previous output bundles are now minified
- note: previous vows tests are removed, test suites are rewritten in Jest, currently 23 test cases
- note: npm build is required to run command line, output from build step is not tracked by git
- more README.md updates and type corrections, thanks @gladykov @mkrishnan-codes
- add env option to disable debugging logs, thanks @AyresMonteiro