@timothycarambat
Member

Migrate away from Python document processor and reimplement all supported files in NodeJS

@review-agent-prime

collector/processSingleFile/convert/asTxt.js

Consider using streams when reading and writing files. For large files in particular, streaming reduces memory usage and can improve performance.

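    const fs = require('fs');
    // With an encoding set, each chunk is already a string, so no toString() is needed.
    const stream = fs.createReadStream(fullFilePath, 'utf8');
    let content = '';
    stream.on('data', chunk => {
      content += chunk;
    });
    stream.on('error', err => {
      // Handle read failures (missing file, permissions, etc.)
    });
    stream.on('end', () => {
      // Continue processing the content
    });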

To check out the fix:

    git fetch origin && git checkout -b ReviewBot/Impro-eyl2x90 origin/ReviewBot/Impro-eyl2x90
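
One caveat on the streaming suggestion: a handler that appends every chunk to a single string still ends up holding the whole file in memory, so the saving only appears when each chunk is handled as it arrives. A minimal sketch of that pattern follows. The helper name `processInChunks` and the `onChunk` callback are assumptions made for illustration and are not part of this PR.

```javascript
const fs = require('fs');

// Hypothetical helper (not from the PR): reads a text file chunk by chunk and
// passes each chunk to `onChunk`, so the full content is never buffered here.
async function processInChunks(fullFilePath, onChunk) {
  // With an encoding set, chunks are strings and multi-byte characters
  // are not split across chunk boundaries.
  const stream = fs.createReadStream(fullFilePath, { encoding: 'utf8' });
  let totalChars = 0;
  // Readable streams are async iterable; read errors reject the promise.
  for await (const chunk of stream) {
    onChunk(chunk);
    totalChars += chunk.length;
  }
  return totalChars;
}
```

Whether this helps in asTxt.js depends on what happens next: if the converter needs the complete text as one string anyway, streaming mostly changes how the read is scheduled rather than how much memory is used.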

collector/index.js

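It's good practice to validate and sanitize input data, such as the filename taken from the request body here, to prevent security vulnerabilities. For a filename used in file-system lookups the most relevant risk is path traversal (for example "../" segments), alongside the injection attacks (SQL injection, cross-site scripting) that unsanitized input can enable elsewhere.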

    const sanitize = require('sanitize-filename');
    const targetFilename = sanitize(reqBody(request).filename);

To check out the fix:

    git fetch origin && git checkout -b ReviewBot/Impro-0mvfibv origin/ReviewBot/Impro-0mvfibv
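
Sanitizing the filename can be paired with a check that the resolved path stays inside the intended directory. The sketch below uses only Node's built-in `path` module. The helper name `resolveInside` and the `/srv/uploads` directory in the usage line are hypothetical, and this is not necessarily how the collector guards its file lookups.

```javascript
const path = require('path');

// Hypothetical helper (not from the PR): resolves `filename` inside `baseDir`
// and throws if the result would fall outside that directory.
function resolveInside(baseDir, filename) {
  const base = path.resolve(baseDir);
  const target = path.resolve(base, filename);
  const relative = path.relative(base, target);
  const escapes =
    relative === '' ||                      // resolves to the directory itself
    relative === '..' ||
    relative.startsWith('..' + path.sep) || // climbs out via parent segments
    path.isAbsolute(relative);              // e.g. a different drive on Windows
  if (escapes) {
    throw new Error(`Invalid filename: ${filename}`);
  }
  return target;
}

// Usage (assumed directory): throws for '../secret.txt', returns the full path for 'notes.txt'.
const safePath = resolveInside('/srv/uploads', 'notes.txt');
```

Comparing the relative path rather than testing string prefixes avoids treating a sibling directory such as `/srv/uploads-old` as if it were inside `/srv/uploads`.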

@timothycarambat timothycarambat changed the title Collector refactor Document Process v2 Dec 14, 2023
@timothycarambat timothycarambat changed the title Document Process v2 Document Processor v2 Dec 14, 2023
@timothycarambat timothycarambat merged commit 719521c into master Dec 14, 2023
@timothycarambat timothycarambat deleted the collector-refactor branch December 14, 2023 23:14
cabwds pushed a commit to cabwds/anything-llm that referenced this pull request Jul 3, 2025
* wip: init refactor of document processor to JS

* add NodeJs PDF support

* wip: parity with python processor
feat: add pptx support

* fix: forgot files

* Remove python scripts totally

* wip: update docker to boot new collector

* add package.json support

* update dockerfile for new build

* update gitignore and linting

* add more protections on file lookup

* update package.json

* test build

* update docker commands to use cap-add=SYS_ADMIN so web scraper can run
update all scripts to reflect this
remove docker build for branch

2 participants