
Conversation

@timothycarambat
Member

fix: cleanup code for embedding length clarify
resolves #388

@review-agent-prime

docker/.env.example

It would be beneficial to add a comment explaining the purpose of the 'EMBEDDING_MODEL_MAX_CHUNK_LENGTH' environment variable. This will help other developers understand its usage and the impact of changing its value.
Checkout the fix

    # EMBEDDING_MODEL_MAX_CHUNK_LENGTH=1000 # Maximum length, in characters, of a single chunk of text to embed. Limits chunk size to prevent memory overflow or performance issues.
git fetch origin && git checkout -b ReviewBot/Addin-b3fnl6u origin/ReviewBot/Addin-b3fnl6u

server/.env.example

It would be beneficial to add a comment explaining the purpose of the 'EMBEDDING_MODEL_MAX_CHUNK_LENGTH' environment variable. This will help other developers understand its usage and the impact of changing its value.
Checkout the fix

    # EMBEDDING_MODEL_MAX_CHUNK_LENGTH=1000 # Maximum length, in characters, of a single chunk of text to embed. Limits chunk size to prevent memory overflow or performance issues.
git fetch origin && git checkout -b ReviewBot/Addin-1aqrntu origin/ReviewBot/Addin-1aqrntu
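For context, a character-based chunk limit like this is typically applied by splitting text into pieces no longer than the configured length before embedding. A minimal sketch of that idea (a hypothetical `chunkText` helper for illustration, not the project's actual implementation):

```javascript
// Hypothetical helper illustrating how a character-based chunk limit works.
// Splits `text` into pieces of at most `maxLength` characters.
function chunkText(text, maxLength = 1000) {
  const chunks = [];
  for (let i = 0; i < text.length; i += maxLength) {
    chunks.push(text.slice(i, i + maxLength));
  }
  return chunks;
}
```

A real splitter would usually also respect word or sentence boundaries; this sketch only shows the hard length cap the env var enforces.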

frontend/src/components/EmbeddingSelection/LocalAiOptions/index.jsx

Consider adding a state variable and an event handler for the EmbeddingModelMaxChunkLength input field. This will allow the application to react to changes in the input field and update the state accordingly, which can be useful for future features or debugging.
Checkout the fix

    import { useState } from "react";

    const [maxChunkLength, setMaxChunkLength] = useState(
      settings?.EmbeddingModelMaxChunkLength
    );

    function updateMaxChunkLength(e) {
      setMaxChunkLength(e.target.value);
    }

    // On the input field:
    onChange={updateMaxChunkLength}
git fetch origin && git checkout -b ReviewBot/The-n-u0yc6b4 origin/ReviewBot/The-n-u0yc6b4

Consider adding a tooltip or a help icon next to the label "Max embedding chunk length". This would provide users with more context about what this setting does and how it affects the embedding process, improving the overall user experience and readability of the application.
Checkout the fix

    <label className="text-white text-sm font-semibold block mb-4">
      Max embedding chunk length
      <span className="tooltip">Tooltip text explaining max chunk length</span>
    </label>
git fetch origin && git checkout -b ReviewBot/The-n-f26cuwh origin/ReviewBot/The-n-f26cuwh

server/utils/helpers/index.js

Consider adding a default value for the EMBEDDING_MODEL_MAX_CHUNK_LENGTH environment variable. This will ensure that the application has a fallback value in case this environment variable is not provided.
Checkout the fix

    const EMBEDDING_MODEL_MAX_CHUNK_LENGTH = process.env.EMBEDDING_MODEL_MAX_CHUNK_LENGTH || 1000;
git fetch origin && git checkout -b ReviewBot/The-c-maztl30 origin/ReviewBot/The-c-maztl30

Consider adding a comment to explain the purpose of the EMBEDDING_MODEL_MAX_CHUNK_LENGTH environment variable. This will help other developers understand why this environment variable is used and how it affects the application.
Checkout the fix

    // The maximum chunk length for embedding models can be controlled via the EMBEDDING_MODEL_MAX_CHUNK_LENGTH environment variable.
    // This allows for flexibility in controlling the size of the chunks that are sent for embedding.
    const EMBEDDING_MODEL_MAX_CHUNK_LENGTH = process.env.EMBEDDING_MODEL_MAX_CHUNK_LENGTH || 1000;
git fetch origin && git checkout -b ReviewBot/The-c-660o3rk origin/ReviewBot/The-c-660o3rk
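One caveat with the `||` fallback in the suggestions above: environment variables always arrive as strings, so a numeric setting should also be parsed and validated. A hedged sketch of that (the helper name `resolveMaxChunkLength` is hypothetical, not from the project):

```javascript
// Hypothetical: resolve EMBEDDING_MODEL_MAX_CHUNK_LENGTH safely.
// Env vars are strings; fall back to 1000 for missing or invalid values.
function resolveMaxChunkLength(raw) {
  const parsed = Number.parseInt(raw ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : 1000;
}
```

Called as `resolveMaxChunkLength(process.env.EMBEDDING_MODEL_MAX_CHUNK_LENGTH)`, this guards against values like `"abc"` or `"-5"` that the plain `||` fallback would pass through as strings.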

server/utils/vectorDbProviders/chroma/index.js

Error handling and logging could be improved to make the code easier to debug. When an error is thrown, include more context about the state of the program at the time: the values of relevant variables, the function in which the error occurred, and the specific error message.
Checkout the fix

    } catch (e) {
      console.error(`Error in addDocumentToNamespace: ${e.message}`);
      console.error(`Document data: ${JSON.stringify(documentData)}`);
      console.error(`Namespace: ${namespace}`);
      console.error(`Full file path: ${fullFilePath}`);
      return false;
    }
git fetch origin && git checkout -b ReviewBot/Impro-29qllxm origin/ReviewBot/Impro-29qllxm

Performance could be improved by reducing the number of database operations: instead of inserting document vectors one by one, insert them in bulk so that many rows share a single round trip.
Checkout the fix

    const documentVectors = [];
    // ...
    for (const chunk of chunks) {
      // ...
      documentVectors.push({ docId, vectorId: id });
      // ...
    }
    // ...
    await DocumentVectors.bulkInsert(documentVectors);
git fetch origin && git checkout -b ReviewBot/Impro-yqzmrmi origin/ReviewBot/Impro-yqzmrmi
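The `DocumentVectors.bulkInsert` call above is assumed by the suggestion; the underlying idea is simply grouping records so each database round trip carries many rows. A generic batching helper that such a method might rely on could look like this (hypothetical, for illustration only):

```javascript
// Hypothetical: split an array of records into fixed-size batches,
// so each batch can be inserted with a single database call.
function toBatches(records, batchSize = 500) {
  const batches = [];
  for (let i = 0; i < records.length; i += batchSize) {
    batches.push(records.slice(i, i + batchSize));
  }
  return batches;
}
```

Capping the batch size matters in practice, since most databases limit the number of parameters or rows a single insert statement may carry.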

@timothycarambat timothycarambat merged commit 8cc1455 into master Dec 8, 2023
@timothycarambat timothycarambat deleted the 388-embedding-max-token-size branch December 8, 2023 00:27
cabwds pushed a commit to cabwds/anything-llm that referenced this pull request Jul 3, 2025
fix: cleanup code for embedding length clarify
resolves Mintplex-Labs#388


Development

Successfully merging this pull request may close these issues.

Not being able to set chunk size for local.ai embeddings

2 participants