@angelplusultra
Contributor

Pull Request Type

  • ✨ feat
  • 🐛 fix
  • ♻️ refactor
  • 💄 style
  • 🔨 chore
  • 📝 docs

Relevant Issues

resolves #4339

What is in this change?

This PR adds a JSONB-safe sanitizer to the PGVector provider to address “unsupported Unicode escape sequence” errors that occur when inserting metadata containing disallowed control characters (most notably the NUL character, \u0000). Certain sources (e.g., some PDFs) can produce chunks with these control characters in the extracted text, causing Postgres to reject the jsonb payload.

  • Adds sanitizeForJsonb which recursively removes C0 control characters from strings while preserving tabs/newlines/CR, and traverses arrays/objects deeply.
  • Applies the sanitizer to submission.metadata immediately before insert into the metadata jsonb column.
  • No changes to schemas or query behavior; only non-printable, disallowed control chars are stripped. Printable content remains intact.
  • Scope is limited to the PGVector provider; other vector DB providers are unaffected.
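
As a rough illustration of the behavior listed above, here is a minimal sketch of such a sanitizer. It is not the code merged in this PR: the regex, the `DISALLOWED_CONTROL_CHARS` name, and the sample `metadata` object are assumptions made for the example, and types other than strings, arrays, and plain objects are simply passed through.

```javascript
// Hypothetical sketch, not the implementation merged in this PR.
// Matches C0 control characters (U+0000 to U+001F) except tab (U+0009),
// line feed (U+000A), and carriage return (U+000D).
const DISALLOWED_CONTROL_CHARS = /[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g;

function sanitizeForJsonb(value) {
  // null and undefined have nothing to strip.
  if (value === null || value === undefined) return value;

  // Strings are the only place control characters can hide.
  if (typeof value === "string") {
    return value.replace(DISALLOWED_CONTROL_CHARS, "");
  }

  // Arrays: build a new array so the input is not mutated.
  if (Array.isArray(value)) {
    return value.map((item) => sanitizeForJsonb(item));
  }

  // Plain objects: build a new object, sanitizing each value recursively.
  if (typeof value === "object") {
    const result = {};
    for (const [key, child] of Object.entries(value)) {
      result[key] = sanitizeForJsonb(child);
    }
    return result;
  }

  // Numbers, booleans, and anything else pass through unchanged.
  return value;
}

// Example metadata (made up) with NUL, BEL, and SOH characters embedded.
const metadata = {
  title: "Report\u0000.pdf",
  text: "line one\nline\u0007 two\tend",
  tags: ["a\u0001b", "clean"],
};

const safe = sanitizeForJsonb(metadata);
console.log(JSON.stringify(safe));
// prints {"title":"Report.pdf","text":"line one\nline two\tend","tags":["ab","clean"]}
```

In this sketch the sanitized copy, rather than the original object, would be what gets serialized for the metadata jsonb column; the tab and newline in `text` survive, while the NUL, BEL, and SOH characters are dropped.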

Additional Information

Developer Validations

  • I ran yarn lint from the root of the repo & committed changes
  • Relevant documentation has been updated
  • I have tested my code functionality
  • Docker build succeeds locally

This new method recursively sanitizes values intended for JSONB storage, removing disallowed control characters and ensuring safe insertion into PostgreSQL. The method is integrated into the vector insertion process to sanitize metadata before database operations.
@angelplusultra angelplusultra linked an issue Sep 24, 2025 that may be closed by this pull request
@angelplusultra angelplusultra added the PR:needs review Needs review by core team label Sep 24, 2025
Member

@timothycarambat timothycarambat left a comment

Add a test suite for sanitizeForJsonb to server/__tests__/utils/vectorDbProviders/pgvector/index.js so we can validate this against known valid/invalid cases.

@timothycarambat timothycarambat removed their assignment Sep 26, 2025
@timothycarambat timothycarambat removed the PR:needs review Needs review by core team label Sep 26, 2025
@angelplusultra angelplusultra added the PR: Requested Changes Changes have been requested by a reviewer label Sep 26, 2025
This commit introduces a comprehensive test suite for the PGVector.sanitizeForJsonb method, ensuring it correctly handles various input types, including null, undefined, strings with disallowed control characters, objects, arrays, and Date objects. The tests verify that the method sanitizes inputs without mutating the original data structures.
@angelplusultra
Contributor Author

Add a test suite for sanitizeForJsonb to server/__tests__/utils/vectorDbProviders/pgvector/index.js so we can validate this against known valid/invalid cases.

Fix in 0eb23b0

@angelplusultra angelplusultra added PR:needs review Needs review by core team and removed PR: Requested Changes Changes have been requested by a reviewer labels Sep 26, 2025
@timothycarambat
Member

Beautiful work - great test case coverage

@timothycarambat timothycarambat dismissed their stale review September 29, 2025 20:36

Applied changes

@timothycarambat timothycarambat merged commit 7ca2753 into master Sep 29, 2025
1 check passed
@timothycarambat timothycarambat deleted the 4339-unsupported-unicode-escape-sequence branch September 29, 2025 20:49
Labels

PR:needs review Needs review by core team

Development

Successfully merging this pull request may close these issues.

[BUG]: unsupported Unicode escape sequence

3 participants