normalize parser struct for all file types #321

timothycarambat · 2023-11-01T23:43:31Z

No description provided.

review-agent-prime · 2023-11-01T23:44:19Z

Here are some suggestions to improve the code:

In as_docx.py and as_text.py, you have hardcoded the docAuthor and description fields as 'Unknown'. It would be better to extract these details from the document metadata if possible. If the metadata is not available, then default to 'Unknown'.

'docAuthor': get_author(fullpath) or 'Unknown',
'description': get_description(fullpath) or 'Unknown',
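The `get_author` and `get_description` helpers named above are not defined anywhere in this PR. As a minimal sketch, assuming the input is a `.docx` file (a zip archive whose `docProps/core.xml` part holds Dublin Core metadata), they might look like the following. The helper `read_core_property` is a hypothetical name introduced only for this illustration, and it uses the standard library rather than whatever parser the project actually relies on.

```python
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core namespace used by the core properties part of a .docx package.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def read_core_property(fullpath, tag):
    """Return the text of a Dublin Core element (e.g. 'creator'), or None.

    Hypothetical helper: returns None when the file is missing, is not a
    zip archive, has no core properties part, or the element is empty.
    """
    try:
        with zipfile.ZipFile(fullpath) as archive:
            root = ET.fromstring(archive.read("docProps/core.xml"))
    except (OSError, KeyError, zipfile.BadZipFile, ET.ParseError):
        return None
    node = root.find(f"dc:{tag}", DC_NS)
    text = (node.text or "").strip() if node is not None else ""
    return text or None


def get_author(fullpath):
    # Assumed mapping: the document author is stored in dc:creator.
    return read_core_property(fullpath, "creator")


def get_description(fullpath):
    # Assumed mapping: the description is stored in dc:description.
    return read_core_property(fullpath, "description")
```

Because both helpers return `None` when nothing usable is found, a fallback such as `get_author(fullpath) or 'Unknown'` covers the missing-metadata case with a single call.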

In as_mbox.py, you have removed the sender, recipient, subject, and date_sent fields. If these fields are not necessary for your use case, that's fine. However, if you need this information later, consider keeping these fields.

"sender": message["From"],
"recipient": message["To"],
"subject": subject,
"date_sent": date_sent,
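If the fields are kept, one way to gather them in a single place is sketched below using the standard library `email` package. The function name `message_metadata` and the ISO 8601 date format are assumptions made for this example; the PR does not show how `subject` and `date_sent` are actually derived in as_mbox.py.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def message_metadata(message):
    """Collect the four header-derived fields from an email.message.Message.

    Hypothetical helper: missing headers come back as None rather than
    raising, so the caller can decide on a default.
    """
    date_header = message["Date"]
    try:
        date_sent = (
            parsedate_to_datetime(date_header).isoformat() if date_header else None
        )
    except (TypeError, ValueError):
        # Unparseable Date header; the exception type varies by Python version.
        date_sent = None
    return {
        "sender": message["From"],
        "recipient": message["To"],
        "subject": message["Subject"],
        "date_sent": date_sent,
    }


raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Parser update\n"
    "Date: Wed, 01 Nov 2023 23:43:31 +0000\n"
    "\n"
    "Body text.\n"
)
fields = message_metadata(message_from_string(raw))
```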

In as_text.py, you have added a docSource field. This is a good addition, but it would be better to make the source more descriptive, such as including the file path.

'docSource': f"a text file uploaded by the user from {fullpath}",

In all files, you are calculating the word count using len(content). This will give you the character count, not the word count. To get the word count, split the content on whitespace and count the resulting tokens.

'wordCount': len(content.split()),
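A small illustration of the difference, using a made-up sample string. `count_words` is a hypothetical name for this sketch; note that `str.split()` with no argument splits on any run of whitespace, so repeated spaces and newlines do not inflate the count.

```python
def count_words(content):
    # split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops empty strings.
    return len(content.split())


sample = "normalize parser\nstruct  for all file types"
character_count = len(sample)    # counts every character, including whitespace
word_count = count_words(sample)  # counts whitespace-separated tokens
```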

In all files, you are using the file_creation_time(fullpath) function to get the published date. A file's creation time on disk can differ from the document's actual publication date, so if the two are not the same for your documents, consider using a different method to get the correct date.
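One possible shape for such a method is sketched below: prefer a date taken from the document itself when the parser can supply one, and fall back to a filesystem timestamp otherwise. The function `published_date`, its `metadata_date` parameter, the output format, and the use of modification time as the fallback are all assumptions for illustration; the behavior of the project's own `file_creation_time` is not shown in this PR.

```python
import os
from datetime import datetime, timezone

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed output format


def published_date(fullpath, metadata_date=None):
    """Return a formatted date, preferring document metadata over the filesystem.

    metadata_date: an optional datetime extracted from the document
    (hypothetical input; how it is obtained depends on the file type).
    """
    if metadata_date is not None:
        return metadata_date.strftime(DATE_FORMAT)
    # Fallback: last modification time, rendered in UTC so the result
    # does not depend on the machine's local timezone.
    modified = os.path.getmtime(fullpath)
    return datetime.fromtimestamp(modified, tz=timezone.utc).strftime(DATE_FORMAT)
```

The filesystem fallback reflects when the file was written on the server, which for an uploaded document is usually the upload time rather than the original publication date.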

normalize parser struct for all file types

06ffdd6

timothycarambat merged commit 5441717 into master Nov 1, 2023

timothycarambat deleted the normalize-parser branch November 1, 2023 23:44

franzbischoff referenced this pull request in franzbischoff/anything-llm Nov 4, 2023

normalize parser struct for all file types (#321)

c0cc9a2

cabwds pushed a commit to cabwds/anything-llm that referenced this pull request Jul 3, 2025

normalize parser struct for all file types (Mintplex-Labs#321)

e6ee7c8
