Sanitize Metadata Before PG Vector Database Insertion #4434
This new method recursively sanitizes values intended for JSONB storage, removing disallowed control characters and ensuring safe insertion into PostgreSQL. The method is integrated into the vector insertion process to sanitize metadata before database operations.
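A minimal sketch of what such a recursive sanitizer could look like, not the code merged in this PR. The exact set of stripped characters (C0 control characters other than tab, line feed, and carriage return) and the conversion of Date objects to ISO strings are assumptions made for illustration; only the function name `sanitizeForJsonb` comes from the PR.

```javascript
// Hypothetical sketch of a JSONB-safe sanitizer; not the PR's implementation.
// Assumption: strip C0 control characters except tab (\u0009), LF (\u000A), CR (\u000D).
const DISALLOWED_CONTROL_CHARS = /[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g;

function sanitizeForJsonb(value) {
  // null and undefined pass through unchanged.
  if (value === null || value === undefined) return value;

  // Strings are the only place control characters can hide.
  if (typeof value === "string") {
    return value.replace(DISALLOWED_CONTROL_CHARS, "");
  }

  // Assumption: Dates become ISO strings; invalid Dates become null.
  if (value instanceof Date) {
    return Number.isNaN(value.getTime()) ? null : value.toISOString();
  }

  // Arrays and plain objects are rebuilt, so the input is never mutated.
  if (Array.isArray(value)) return value.map((item) => sanitizeForJsonb(item));
  if (typeof value === "object") {
    const out = {};
    for (const [key, val] of Object.entries(value)) {
      out[key] = sanitizeForJsonb(val);
    }
    return out;
  }

  // Numbers, booleans, and other primitives are returned as-is.
  return value;
}
```

Because the function returns new arrays and objects rather than editing in place, the caller's metadata stays untouched, which matches the non-mutation behavior the test suite is described as checking.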
timothycarambat
left a comment
Add a test suite for sanitizeForJsonb to server/__tests__/utils/vectorDbProviders/pgvector/index.js so we can validate this against known valid/invalid cases.
This commit introduces a comprehensive test suite for the PGVector.sanitizeForJsonb method, ensuring it correctly handles various input types, including null, undefined, strings with disallowed control characters, objects, arrays, and Date objects. The tests verify that the method sanitizes inputs without mutating the original data structures.
Fix in 0eb23b0
Beautiful work - great test case coverage
Pull Request Type
Relevant Issues
resolves #4339
What is in this change?
This PR adds a JSONB-safe sanitizer to the PGVector provider to address “unsupported Unicode escape sequence” errors that occur when inserting metadata containing disallowed control characters (most notably the NUL character, \u0000). Certain sources (e.g., some PDFs) can produce chunks with these control characters in the extracted text, causing Postgres to reject the jsonb payload.
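To illustrate where the escape sequence in the error message comes from, the short sketch below serializes a made-up chunk containing a NUL character. The sample text and variable names are hypothetical; the point is only that `JSON.stringify` emits the NUL as the literal escape `\u0000`, which is the sequence a jsonb column cannot store.

```javascript
// Hypothetical sample: extracted text with an embedded NUL character.
const chunkText = "Quarterly\u0000 report";

// JSON.stringify is valid JSON output, but it encodes NUL as the escape \u0000.
const payload = JSON.stringify({ text: chunkText });

// The serialized payload now carries the escape sequence Postgres rejects for jsonb.
const hasNulEscape = payload.includes("\\u0000");

console.log(payload); // {"text":"Quarterly\u0000 report"}
console.log(hasNulEscape); // true
```

The payload is well-formed JSON, so the failure only surfaces at insertion time, which is why the fix sanitizes metadata just before the database operation.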
Additional Information
Developer Validations
yarn lint from the root of the repo & committed changes