@timothycarambat
Member

No description provided.

@timothycarambat timothycarambat merged commit 5441717 into master Nov 1, 2023
@timothycarambat timothycarambat deleted the normalize-parser branch November 1, 2023 23:44
@review-agent-prime

Here are some suggestions to improve the code:

  1. In as_docx.py and as_text.py, the docAuthor and description fields are hardcoded as 'Unknown'. It would be better to extract these details from the document metadata when possible, and default to 'Unknown' only when the metadata is not available.
     'docAuthor': get_author(fullpath) or 'Unknown',
     'description': get_description(fullpath) or 'Unknown',
  2. In as_mbox.py, the sender, recipient, subject, and date_sent fields have been removed. If these fields are not necessary for your use case, that's fine. However, if you need this information later, consider keeping them.
     "sender": message["From"],
     "recipient": message["To"],
     "subject": subject,
     "date_sent": date_sent,
  3. In as_text.py, a docSource field has been added. This is a good addition, but the source could be more descriptive, for example by including the file path.
     'docSource': f"a text file uploaded by the user from {fullpath}",
  4. In all files, the word count is calculated with len(content). This returns the character count, not the word count. To get the word count, split the content on whitespace first.
     'wordCount': len(content.split()),
  5. In all files, the published date comes from file_creation_time(fullpath). If the file creation time is not the same as the published date, consider using a different method to get the correct date.
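
To illustrate the word-count point, here is a minimal sketch of the difference between len(content) and len(content.split()). The helper name count_words and the sample string are made up for this example and are not taken from the PR.

```python
def count_words(content: str) -> int:
    """Count whitespace-separated tokens in content.

    str.split() with no arguments splits on any run of whitespace
    (spaces, tabs, newlines) and drops empty strings.
    """
    return len(content.split())


sample = "Normalize the parser output"  # hypothetical sample text
print(len(sample))          # 27 (characters)
print(count_words(sample))  # 4 (words)
```

This is only a rough word count: punctuation stays attached to neighbouring tokens, which is usually acceptable for a metadata field.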

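For the first suggestion, a fallback helper along these lines might avoid calling the metadata getter twice. This is a sketch under assumptions: get_author and get_description are the hypothetical functions named in the suggestion, and metadata_or_default is a name invented here. None of them are known to exist in the codebase.

```python
from typing import Callable, Optional


def metadata_or_default(
    getter: Callable[[str], Optional[str]],
    fullpath: str,
    default: str = "Unknown",
) -> str:
    """Call getter(fullpath) once and fall back to default.

    The default is returned when the getter raises OSError or
    ValueError, returns None, or returns a blank string.
    """
    try:
        value = getter(fullpath)
    except (OSError, ValueError):
        # Assumption: a getter may fail this way on unreadable files.
        return default
    if value is None:
        return default
    text = str(value).strip()
    return text or default


# Stand-in getter for illustration only; a real one would read the file.
def fake_get_author(fullpath: str) -> Optional[str]:
    return None


print(metadata_or_default(fake_get_author, "notes.txt"))  # Unknown
```

The metadata dictionary could then use `'docAuthor': metadata_or_default(get_author, fullpath)`, assuming such a getter is available.
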
franzbischoff referenced this pull request in franzbischoff/anything-llm Nov 4, 2023
cabwds pushed a commit to cabwds/anything-llm that referenced this pull request Jul 3, 2025

2 participants