Transaction size too big when using FirestoreVectorStore.add_texts #126

@LukeBrumfield

Description

I get an error when processing large text files from GCS to insert into Firestore for RAG usage. I have used text splitters to work around this, but my metadata contains the file hashes so that I can anticipate deletes and keep duplicates from being entered.

This may be more of a feature request for this library if this isn't anticipated usage.

Environment details

  • OS type and version: Ubuntu 24.04
  • Python version: Python 3.12.3
  • pip version: pip 23.2.1
  • langchain-google-firestore version: 0.5.0

Steps to reproduce

  1. Pass a list to FirestoreVectorStore.add_texts whose contents exceed the transaction limit (my document is 3.6 MB; it also happens with other documents of around 150 KB)

Code example

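    import hashlib

    # Import paths are assumed; the original report did not show them.
    from langchain_google_firestore import FirestoreVectorStore
    from langchain_openai import OpenAIEmbeddings
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    chunk_size = 1000
    # Processing large files for RAG usage.
    # get_file_from_gcs, bucket_name, file_name, file_hash, firestore_client and
    # collection_name are not shown here (assumed to be defined elsewhere).
    corpus_text = get_file_from_gcs(bucket_name, file_name)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=50,
        length_function=len,
        is_separator_regex=False,
    )
    texts = text_splitter.split_text(corpus_text)

    # This is the portion I use to track the files between GCS and Firestore:
    # the SHA-256 of each chunk doubles as its document ID.
    metadatas = []
    ids = []
    for i, text in enumerate(texts):
        doc_hash = hashlib.sha256(text.encode()).hexdigest()
        metadatas.append(
            {"file_name": file_name, "file_hash": file_hash, "chunk_index": i, "doc_hash": doc_hash}
        )
        ids.append(doc_hash)

    embedding = OpenAIEmbeddings(model="text-embedding-3-small", chunk_size=chunk_size, max_retries=3)

    vector_store = FirestoreVectorStore(
        client=firestore_client, collection=collection_name, embedding_service=embedding
    )

    vector_store.add_texts(texts=texts, metadatas=metadatas, ids=ids)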

Stack trace

Processing file: ***.txt in bucket: ***
Traceback (most recent call last):
  File "main.py", line 100, in <module>
    main()
  File "main.py", line 96, in main
    process_file(bucket_name, file_name, args.collection, args.chunk_size)
  File "main.py", line 52, in process_file
    vector_store.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File ".venv/lib/python3.12/site-packages/langchain_google_firestore/vectorstores.py", line 151, in add_texts
    db_batch.commit()
  File ".venv/lib/python3.12/site-packages/google/cloud/firestore_v1/batch.py", line 61, in commit
    commit_response = self._client._firestore_api.commit(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.12/site-packages/google/cloud/firestore_v1/services/firestore/client.py", line 1418, in commit
    response = rpc(
               ^^^^
  File ".venv/lib/python3.12/site-packages/google/api_core/gapic_v1/method.py", line 131, in __call__
    return wrapped_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.12/site-packages/google/api_core/retry/retry_unary.py", line 293, in retry_wrapped_func
    return retry_target(
           ^^^^^^^^^^^^^
  File ".venv/lib/python3.12/site-packages/google/api_core/retry/retry_unary.py", line 153, in retry_target
    _retry_error_helper(
  File ".venv/lib/python3.12/site-packages/google/api_core/retry/retry_base.py", line 212, in _retry_error_helper
    raise final_exc from source_exc
  File ".venv/lib/python3.12/site-packages/google/api_core/retry/retry_unary.py", line 144, in retry_target
    result = target()
             ^^^^^^^^
  File ".venv/lib/python3.12/site-packages/google/api_core/timeout.py", line 130, in func_with_timeout
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.12/site-packages/google/api_core/grpc_helpers.py", line 78, in error_remapped_callable
    raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.InvalidArgument: 400 datastore transaction or write too big.
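
Possible workaround

The traceback ends in a single db_batch.commit() inside add_texts, which suggests that all chunks (text, metadata and embedding) go out in one write. If that reading is right, one way around the error is to call add_texts several times with smaller slices. The sketch below is not part of the library: iter_batches and add_texts_in_batches are hypothetical helper names, and batch_size=100 is an assumed starting value that may need tuning for larger chunks or embeddings. It only relies on the add_texts(texts=..., metadatas=..., ids=...) call shown in the code example above.

```python
from typing import Any, Iterator, Optional, Sequence


def iter_batches(n_items: int, batch_size: int) -> Iterator[slice]:
    """Yield consecutive slices covering range(n_items) in steps of batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_items, batch_size):
        yield slice(start, min(start + batch_size, n_items))


def add_texts_in_batches(
    vector_store: Any,
    texts: Sequence[str],
    metadatas: Optional[Sequence[dict]] = None,
    ids: Optional[Sequence[str]] = None,
    batch_size: int = 100,  # assumed value, not a documented limit
) -> int:
    """Call vector_store.add_texts once per slice; return the number of calls."""
    calls = 0
    for window in iter_batches(len(texts), batch_size):
        # Slice texts, metadatas and ids with the same window so they stay aligned.
        vector_store.add_texts(
            texts=list(texts[window]),
            metadatas=list(metadatas[window]) if metadatas is not None else None,
            ids=list(ids[window]) if ids is not None else None,
        )
        calls += 1
    return calls
```

With the variables from the code example, the last line would become add_texts_in_batches(vector_store, texts, metadatas, ids). Because the IDs are content hashes, re-running after a partial failure should overwrite the same documents rather than duplicate them, but the writes are no longer atomic across the whole file.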

Labels

api: firestore (Issues related to the googleapis/langchain-google-firestore-python API.)
