[Fork] Batch embed by jwaltz #153

timothycarambat · 2023-07-20T19:03:51Z

Original Author: jwaltz

I was seeing very long embed times for a couple of large .docx documents ingested through /hotdir. One had 256 chunks and took many minutes to fully embed, and sometimes never finished at all.

I converted the embedChunk() method to accept a list of chunks instead of a single chunk and to make a single call to OpenAI's embedding API endpoint, which seems to have sped up the embed process dramatically. The refactored embedChunks() required some changes to its usage in addDocumentToNamespace() as well. I included these changes for the three vector DB options at /server/utils/vectorDbProviders/[vectorDb]/index.js
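For readers skimming the thread, the batching idea can be sketched roughly as follows. This is not the code from the PR: `createEmbedding` is a hypothetical injected function standing in for whatever client call the project uses, and it is assumed to resolve to an object shaped like the embeddings REST response (`{ data: [{ index, embedding }] }`). The model name is likewise only an assumption.

```javascript
// Hypothetical sketch, not the PR's implementation.
// `createEmbedding` is an injected stand-in for the real client call and is
// assumed to resolve to { data: [{ index, embedding }, ...] }.
async function embedChunks(createEmbedding, textChunks = []) {
  // Nothing to embed, so skip the request entirely.
  if (textChunks.length === 0) return [];

  const response = await createEmbedding({
    model: "text-embedding-ada-002", // assumed model name
    input: textChunks, // the whole batch goes out in one request
  });

  // Sort by index so each vector lines up with its input chunk.
  return [...response.data]
    .sort((a, b) => a.index - b.index)
    .map((item) => item.embedding);
}

// Stand-in client for local experimentation. It counts calls and returns
// fake two-element vectors instead of contacting any API.
function makeFakeClient() {
  const client = { calls: 0 };
  client.createEmbedding = async ({ input }) => {
    client.calls += 1;
    return {
      data: input.map((text, index) => ({
        index,
        embedding: [text.length, index],
      })),
    };
  };
  return client;
}
```

With the fake client, `await embedChunks(client.createEmbedding, chunks)` issues one call no matter how many chunks are passed, which is the property that replaces the per-chunk requests described above.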

Let me know what you think and whether this is helpful. Again, apologies for my editor's LSP changing some of the method signatures and line-length formatting; I hope that is OK.

I had to modify this code because it would otherwise break the application: we do both multiple embeds and single-text embeds, so each vector database needs to expose and support this interface, since in chat mode we manually embed the query.
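One way to picture an interface that serves both the batch path and the single-text path is a thin wrapper like the one below. It is only a sketch under stated assumptions: `makeEmbedder`, `embedMany`, and `embedTextInput` are hypothetical names chosen for illustration, and `embedMany` is assumed to be any async function that maps an array of strings to an array of vectors.

```javascript
// Hypothetical sketch of a provider-side interface; names are illustrative.
// `embedMany` is assumed to be async (string[]) => number[][].
function makeEmbedder(embedMany) {
  return {
    // Batch path: used when ingesting all chunks of a document.
    embedChunks: (textChunks = []) => embedMany(textChunks),

    // Single-text path: used to embed one chat query. It reuses the batch
    // function with a one-element list and unwraps the only vector.
    embedTextInput: async (textInput) => {
      const [vector] = await embedMany([textInput]);
      return vector ?? [];
    },
  };
}
```

Routing the single-text case through the batch function keeps one code path per provider, so the chat-mode query embed and the document ingest cannot drift apart.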

timothycarambat · 2023-07-20T19:06:39Z

@jwaltz I don't know what your Discord handle is, but I'm happy to add you as a contributor! Apologies for the delay in merging this, as it's a massive quality improvement.

AntonioCiolino · 2023-07-20T22:38:29Z

I'm excited to try this out!

jwaltz · 2023-07-20T22:46:12Z

@timothycarambat I don't use Discord much but am happy to have contributed nonetheless. Keep up the good work!

* refactor: convert chunk embedding to one API call
* chore: lint
* fix chroma for batch and single vectorization of text
* Fix LanceDB multi and single vectorization
* Fix pinecone for single and multiple embeddings

---------

Co-authored-by: Jonathan Waltz <volcanicislander@gmail.com>

jwaltz and others added 6 commits June 10, 2023 11:48

refactor: convert chunk embedding to one API call

e2b3b74

chore: lint

885d5e2

merge with master

fe0eebb

fix chroma for batch and single vectorization of text

10717e3

Fix LanceDB multi and single vectorization

f7728e8

Fix pinecone for single and multiple embeddings

f2d8ccc

timothycarambat mentioned this pull request Jul 20, 2023

refactor: convert chunk embedding to one API call #24

Closed

timothycarambat merged commit c1deca4 into master Jul 20, 2023

timothycarambat deleted the batch-embed branch July 20, 2023 19:05
