Open
Labels
api: firestore — Issues related to the googleapis/langchain-google-firestore-python API.
Description
I get an error when processing large text files from GCS to insert into Firestore for RAG usage. I have used text splitters to work around this, but my metadata contains file hashes so I can anticipate deletes and keep duplicates from being entered.
This may be more of a feature request for this library if this isn't anticipated usage.
Environment details
- OS type and version: Ubuntu 24.04
- Python version: Python 3.12.3
- pip version: pip 23.2.1
- langchain-google-firestore version: 0.5.0
Steps to reproduce
- Pass a list to FirestoreVectorStore.add_texts that contains more data than fits in a single Firestore transaction (my document is 3.6 MB; it also happens with other documents around ~150 KB).
Code example
chunk_size = 1000

# Processing large files for RAG usage
corpus_text = get_file_from_gcs(bucket_name, file_name)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.split_text(corpus_text)

# This is the portion I use to track the files between GCS and Firestore.
# file_hash is the hash of the whole source file, computed earlier;
# doc_hash identifies the individual chunk and doubles as its document ID.
metadatas = []
ids = []
for i, text in enumerate(texts):
    doc_hash = hashlib.sha256(text.encode()).hexdigest()
    metadatas.append({
        "file_name": file_name,
        "file_hash": file_hash,
        "chunk_index": i,
        "doc_hash": doc_hash,
    })
    ids.append(doc_hash)

embedding = OpenAIEmbeddings(
    model="text-embedding-3-small", chunk_size=chunk_size, max_retries=3,
)
vector_store = FirestoreVectorStore(
    client=firestore_client,
    collection=collection_name,
    embedding_service=embedding,
)
vector_store.add_texts(texts=texts, metadatas=metadatas, ids=ids)
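As a caller-side workaround until the library splits commits itself, the inputs can be sliced so that each `add_texts` call (and therefore each underlying Firestore batch commit) stays well under the transaction size limit. This is a sketch, not part of the library's API; the helper name and the default slice size of 100 are my own assumptions and should be tuned to your chunk size and metadata weight:

```python
def add_texts_in_batches(vector_store, texts, metadatas, ids, batch_size=100):
    """Call vector_store.add_texts on successive slices of the input.

    batch_size is an assumed value; pick it so that batch_size chunks
    plus their embeddings and metadata fit within a single Firestore
    transaction.
    """
    for start in range(0, len(texts), batch_size):
        end = start + batch_size
        vector_store.add_texts(
            texts=texts[start:end],
            metadatas=metadatas[start:end],
            ids=ids[start:end],
        )
```

Called in place of the single `vector_store.add_texts(...)` line above, this keeps the per-commit payload bounded regardless of how large the source file is.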
Stack trace
Processing file: ***.txt in bucket: ***
Traceback (most recent call last):
File "main.py", line 100, in <module>
main()
File "main.py", line 96, in main
process_file(bucket_name, file_name, args.collection, args.chunk_size)
File "main.py", line 52, in process_file
vector_store.add_texts(texts=texts, metadatas=metadatas, ids=ids)
File ".venv/lib/python3.12/site-packages/langchain_google_firestore/vectorstores.py", line 151, in add_texts
db_batch.commit()
File ".venv/lib/python3.12/site-packages/google/cloud/firestore_v1/batch.py", line 61, in commit
commit_response = self._client._firestore_api.commit(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".venv/lib/python3.12/site-packages/google/cloud/firestore_v1/services/firestore/client.py", line 1418, in commit
response = rpc(
^^^^
File ".venv/lib/python3.12/site-packages/google/api_core/gapic_v1/method.py", line 131, in __call__
return wrapped_func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".venv/lib/python3.12/site-packages/google/api_core/retry/retry_unary.py", line 293, in retry_wrapped_func
return retry_target(
^^^^^^^^^^^^^
File ".venv/lib/python3.12/site-packages/google/api_core/retry/retry_unary.py", line 153, in retry_target
_retry_error_helper(
File ".venv/lib/python3.12/site-packages/google/api_core/retry/retry_base.py", line 212, in _retry_error_helper
raise final_exc from source_exc
File ".venv/lib/python3.12/site-packages/google/api_core/retry/retry_unary.py", line 144, in retry_target
result = target()
^^^^^^^^
File ".venv/lib/python3.12/site-packages/google/api_core/timeout.py", line 130, in func_with_timeout
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File ".venv/lib/python3.12/site-packages/google/api_core/grpc_helpers.py", line 78, in error_remapped_callable
raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.InvalidArgument: 400 datastore transaction or write too big.