-
-
Notifications
You must be signed in to change notification settings - Fork 5.4k
feat: add support for variable chunk length #415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
fix: cleanup code for embedding length clarify resolves #388
docker/.env.exampleIt would be beneficial to add a comment explaining the purpose of the 'EMBEDDING_MODEL_MAX_CHUNK_LENGTH' environment variable. This will help other developers understand its usage and the impact of changing its value. # EMBEDDING_MODEL_MAX_CHUNK_LENGTH=1000 # The max chunk size in chars a string to embed can be. This is used to limit the size of the text that can be embedded to prevent memory overflow or performance issues.server/.env.exampleIt would be beneficial to add a comment explaining the purpose of the 'EMBEDDING_MODEL_MAX_CHUNK_LENGTH' environment variable. This will help other developers understand its usage and the impact of changing its value. # EMBEDDING_MODEL_MAX_CHUNK_LENGTH=1000 # The max chunk size in chars a string to embed can be. This is used to limit the size of the text that can be embedded to prevent memory overflow or performance issues.frontend/src/components/EmbeddingSelection/LocalAiOptions/index.jsxConsider adding a state variable and an event handler for the EmbeddingModelMaxChunkLength input field. This will allow the application to react to changes in the input field and update the state accordingly, which can be useful for future features or debugging. const [maxChunkLength, setMaxChunkLength] = useState(settings?.EmbeddingModelMaxChunkLength);
function updateMaxChunkLength(e) {
setMaxChunkLength(e.target.value);
}
// In the input field
onChange={updateMaxChunkLength}Consider adding a tooltip or a help icon next to the label "Max embedding chunk length". This would provide users with more context about what this setting does and how it affects the embedding process, improving the overall user experience and readability of the application. <label className="text-white text-sm font-semibold block mb-4">
Max embedding chunk length
<span className="tooltip">Tooltip text explaining max chunk length</span>
</label>server/utils/helpers/index.jsConsider adding a default value for the const EMBEDDING_MODEL_MAX_CHUNK_LENGTH = process.env.EMBEDDING_MODEL_MAX_CHUNK_LENGTH || 1000;Consider adding a comment to explain the purpose of the // The maximum chunk length for embedding models can be controlled via the EMBEDDING_MODEL_MAX_CHUNK_LENGTH environment variable.
// This allows for flexibility in controlling the size of the chunks that are sent for embedding.
const EMBEDDING_MODEL_MAX_CHUNK_LENGTH = process.env.EMBEDDING_MODEL_MAX_CHUNK_LENGTH || 1000;server/utils/vectorDbProviders/chroma/index.jsIt would be beneficial to improve the error handling and logging in the code. This would make it easier to debug and understand what's happening in the code. Specifically, when an error is thrown, it would be helpful to include more information about the error and the state of the program when the error occurred. This could include the values of relevant variables, the function in which the error occurred, and the specific error message. } catch (e) {
console.error(`Error in addDocumentToNamespace: ${e.message}`);
console.error(`Document data: ${JSON.stringify(documentData)}`);
console.error(`Namespace: ${namespace}`);
console.error(`Full file path: ${fullFilePath}`);
return false;
}It would be beneficial to improve the performance of the code by reducing the number of database operations. Specifically, instead of inserting document vectors one by one, it would be more efficient to insert them in bulk. This would reduce the number of database operations and thus improve the performance of the code. const documentVectors = [];
// ...
for (const chunk of chunks) {
// ...
documentVectors.push({ docId, vectorId: id });
// ...
}
// ...
await DocumentVectors.bulkInsert(documentVectors); |
fix: cleanup code for embedding length clarify resolves Mintplex-Labs#388
fix: cleanup code for embedding length clarify
resolves #388