
[Bug]: Chunking a PDF manual of a sewing machine creates "Noah Carter"! #2333

@Audiojoy72

Description

Do you need to file an issue?

  • I have searched the existing issues and this bug is not already filed.
  • I believe this is a legitimate bug, not just a question or feature request.

Describe the bug

I set up LightRAG (latest) and noticed something curious.
I ingested a German PDF file, a manual for a sewing machine.
After processing finished, I saw this in the processing window:

Chunk 49 of 50 extracted 7 Ent + 6 Rel chunk-fe3df1c3d9bf3e9ea1d8efc6f8a58915
Chunk 50 of 50 extracted 7 Ent + 4 Rel chunk-896d6c3bb2d8e400e2b9f135d94eccba
Merging stage 1/1: Anleitungsbuch_Naehmaschine_2259_Deutsch.pdf
Phase 1: Processing 613 entities from doc-3f651dbe208da94184b27316170623a8 (async: 24)
LLMmrg: 100m Sprint Rekord | 0+9 (dd 2)
LLMmrg: Noah Carter | 0+21 (dd 5)
LLMmrg: Tokyo | 0+9 (dd 17)
LLMmrg: World Athletics Championship | 0+25
LLMmrg: Carbon-Fiber Spikes | 0+8 (dd 2)
Phase 2: Processing 53 relations from doc-3f651dbe208da94184b27316170623a8 (async: 24)
Chunks appended from relation: c60
LLMmrg: Tokyo~World Athletics Championship | 0+10 (dd 15)
LLMmrg: 100m Sprint Rekord~Noah Carter | 0+11
Chunks appended from relation: c104
LLMmrg: Carbon-Fiber Spikes~Noah Carter | 0+10
Phase 3: Updating final 616(613+3) entities and 53 relations from doc-3f651dbe208da94184b27316170623a8
Completed merging: 613 entities, 3 extra entities, 53 relations
Completed processing file 1/1: Anleitungsbuch_Naehmaschine_2259_Deutsch.pdf
Enqueued document processing pipeline stopped

When I check the graph, there is a "Noah Carter" entity.
I checked the content of the PDF file and there is nothing about him in it.
I removed the PDF completely and ingested a different PDF.
With that one, there is no Noah Carter.
So with this particular PDF, Noah Carter always appears.
I read in an 8-month-old Reddit post that there seems to be demo content about him in the file "prompt.py" in a lightrag folder.
I attached the PDF file. Maybe someone can confirm.

Anleitungsbuch_Naehmaschine_2259_Deutsch.pdf
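
For anyone trying to confirm this, here is a minimal sketch of the check I did by hand: list the extracted entity names that never occur in the source text. It does not use any LightRAG API. The function names, the sample text, and the entity list are hypothetical stand-ins (the last two names are copied from the merge log above), and a plain substring match is only a rough heuristic.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Unicode-normalize, casefold, and collapse whitespace for loose matching."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", text).strip()


def ungrounded_entities(entity_names, document_text):
    """Return the entity names that never occur in the document text."""
    haystack = normalize(document_text)
    return [name for name in entity_names if normalize(name) not in haystack]


# Hypothetical inputs: a short snippet standing in for the text extracted
# from the PDF, and a few entity names as they might appear in the graph.
sample_text = "Anleitungsbuch Nähmaschine: Einfädeln des Oberfadens und Spulen des Unterfadens."
extracted = ["Nähmaschine", "Oberfaden", "Noah Carter", "Tokyo"]

suspects = ungrounded_entities(extracted, sample_text)
print(suspects)  # → ['Noah Carter', 'Tokyo']
```

Names that the model legitimately rephrased or translated would also be flagged, so the result is a list of candidates to inspect, not proof of leaked example content.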

Steps to reproduce

Fully clear your database and ingest the attached PDF.

Expected Behavior

No response

LightRAG Config Used

Using MS Azure and Jina.

Logs and screenshots

see above

Additional Information

  • LightRAG Version: vv1.4.9.8/0251
  • Operating System: Debian 12
  • Python Version:
  • Related Issues:

Metadata

Labels

bug: Something isn't working