[BUG]: File Parsing Fails for URLs Without Explicit File Extensions #4513

@angelplusultra

Description

How are you running AnythingLLM?

All versions

What happened?

When attempting to pull and parse a file using either the RAG Modal or @agent mode, the process fails if the URL does not explicitly end with a file extension (e.g., .pdf, .csv). This occurs even when the server responds with a correct Content-Type header that identifies the file type.

Example:

The following URL fails to be processed, despite responding with an application/pdf content type:

https://arxiv.org/pdf/2307.10265

Observed Behavior:

The application logs display the following error:


[2] Error processing single file File extension .10265 not supported for parsing and cannot be assumed as text file type.

This error originates from the file extension guard located at:

    if (!SUPPORTED_FILETYPE_CONVERTERS.hasOwnProperty(fileExtension)) {
      if (isTextType(fullFilePath)) {
        console.log(
          `\x1b[33m[Collector]\x1b[0m The provided filetype of ${fileExtension} does not have a preset and will be processed as .txt.`
        );
        processFileAs = ".txt";
      } else {
        trashFile(fullFilePath);
        return {
          success: false,
          reason: `File extension ${fileExtension} not supported for parsing and cannot be assumed as text file type.`,
          documents: [],
        };
      }
    }
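
For illustration, here is a minimal sketch of why the guard sees `.10265`. It assumes the extension is derived from the last URL path segment with Node's `path.extname`; the helper name `extensionFromUrl` is hypothetical and is not taken from the AnythingLLM codebase.

```javascript
const path = require("path");

// Hypothetical helper: derive an "extension" from the last path segment of a URL,
// the way a purely filename-based check would.
function extensionFromUrl(url) {
  const { pathname } = new URL(url);
  // path.extname returns everything from the last "." in the final segment.
  return path.posix.extname(path.posix.basename(pathname)).toLowerCase();
}

console.log(extensionFromUrl("https://arxiv.org/pdf/2307.10265")); // ".10265"
console.log(extensionFromUrl("https://example.com/report.pdf")); // ".pdf"
```

Because the arXiv identifier contains a dot, its trailing digits are treated as a file extension, which is then rejected by the guard.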

Expected Behavior:

The system should be able to successfully pull and parse files from URLs that do not explicitly contain a file extension, provided the Content-Type header in the server's response clearly indicates the file's MIME type.
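
One possible shape for this, as a rough sketch only: keep the URL extension when it is already supported, and otherwise fall back to the `Content-Type` header. The function `resolveExtension`, the `MIME_TO_EXTENSION` map, and the `supported` set below are all hypothetical names for illustration; the real list of supported types lives in `SUPPORTED_FILETYPE_CONVERTERS`.

```javascript
const path = require("path");

// Hypothetical, deliberately small MIME-to-extension map (not AnythingLLM's own).
const MIME_TO_EXTENSION = new Map([
  ["application/pdf", ".pdf"],
  ["text/csv", ".csv"],
  ["text/plain", ".txt"],
  ["text/html", ".html"],
  ["application/json", ".json"],
]);

// Prefer a supported extension from the URL; otherwise fall back to Content-Type.
function resolveExtension(url, contentType, supportedExtensions) {
  const urlExt = path.posix.extname(new URL(url).pathname).toLowerCase();
  if (supportedExtensions.has(urlExt)) return urlExt;

  // Drop parameters such as "; charset=utf-8" before the lookup.
  const mime = (contentType || "").split(";")[0].trim().toLowerCase();

  // Unknown MIME type: return the URL-derived value so the existing guard still applies.
  return MIME_TO_EXTENSION.get(mime) ?? urlExt;
}

// Stand-in for the keys of SUPPORTED_FILETYPE_CONVERTERS.
const supported = new Set([".pdf", ".csv", ".txt"]);

console.log(
  resolveExtension("https://arxiv.org/pdf/2307.10265", "application/pdf", supported)
); // ".pdf"
```

In practice the `contentType` argument would come from the response headers of the download request, and the resolved extension would be applied to the saved file name before the extension guard runs.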


Are there known steps to reproduce?

No response

Metadata

Labels

bug: Something isn't working
possible bug: Bug was reported but is not confirmed or is unable to be replicated.
