Document Processor v2 #442

timothycarambat · 2023-12-13T23:06:16Z

Migrate away from the Python document processor and reimplement support for all currently supported file types in Node.js.

feat: add pptx support

review-agent-prime · 2023-12-13T23:07:29Z

collector/processSingleFile/convert/asTxt.js

Consider using streams when reading and writing files. Streaming reduces memory usage and can improve performance, especially with large files.

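    const fs = require('fs');
    // `fullFilePath` is the path of the file being converted.
    const stream = fs.createReadStream(fullFilePath, 'utf8');
    let content = '';
    stream.on('data', chunk => {
      // With an encoding set, each chunk is already a string.
      content += chunk;
    });
    stream.on('error', err => {
      console.error(`Failed to read ${fullFilePath}:`, err);
    });
    stream.on('end', () => {
      // Continue processing the content
    });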

git fetch origin && git checkout -b ReviewBot/Impro-eyl2x90 origin/ReviewBot/Impro-eyl2x90
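
The event-based read above can also be written as an awaitable function, which lets the caller handle read errors with an ordinary `try`/`catch`. The following is only a minimal sketch: the helper name `readTextFile` is hypothetical and does not appear in the PR.

```javascript
const fs = require('fs');

// Hypothetical helper (not from the PR): reads a UTF-8 text file through a
// stream and resolves with the full contents as a string.
async function readTextFile(filePath) {
  const stream = fs.createReadStream(filePath, { encoding: 'utf8' });
  let content = '';
  // Readable streams are async iterable; a read error makes the loop throw,
  // which rejects the promise returned by this function.
  for await (const chunk of stream) {
    content += chunk;
  }
  return content;
}
```

Note that accumulating every chunk into one string still holds the whole file in memory. Streaming only lowers peak memory when each chunk is processed and discarded as it arrives.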

collector/index.js

It's good practice to validate and sanitize input data to prevent security vulnerabilities such as SQL injection and cross-site scripting (XSS).

    const sanitize = require('sanitize-filename');
    const targetFilename = sanitize(reqBody(request).filename);

git fetch origin && git checkout -b ReviewBot/Impro-0mvfibv origin/ReviewBot/Impro-0mvfibv
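
For a filename taken from a request body, a complementary check is to resolve the name against the intended directory and confirm that the result stays inside it. The sketch below uses only Node's built-in `path` module. The helper name `resolveInside` and the `/srv/uploads` directory in the usage lines are hypothetical, chosen for illustration, and are not taken from the PR.

```javascript
const path = require('path');

// Hypothetical helper (not from the PR): returns the absolute path of
// `filename` inside `baseDir`, or null if the name would escape `baseDir`.
function resolveInside(baseDir, filename) {
  if (typeof filename !== 'string' || filename.length === 0) return null;
  const root = path.resolve(baseDir);
  const target = path.resolve(root, filename);
  const relative = path.relative(root, target);
  // '' means the target is the directory itself; '..' or a '..' path segment
  // at the start means it lies outside `root`; an absolute result can occur
  // on Windows when the target is on another drive.
  if (
    relative === '' ||
    relative === '..' ||
    relative.startsWith('..' + path.sep) ||
    path.isAbsolute(relative)
  ) {
    return null;
  }
  return target;
}

// Usage with an assumed upload directory:
const safe = resolveInside('/srv/uploads', 'notes.txt');
const unsafe = resolveInside('/srv/uploads', '../etc/passwd'); // null
```

A check like this rejects suspicious names outright rather than rewriting them, so the caller can return an error instead of silently reading a different file.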


* wip: init refactor of document processor to JS
* add NodeJs PDF support
* wip: partity with python processor
  feat: add pptx support
* fix: forgot files
* Remove python scripts totally
* wip:update docker to boot new collector
* add package.json support
* update dockerfile for new build
* update gitignore and linting
* add more protections on file lookup
* update package.json
* test build
* update docker commands to use cap-add=SYS_ADMIN so web scraper can run
  update all scripts to reflect this
  remove docker build for branch

timothycarambat added 4 commits December 13, 2023 13:36

wip: init refactor of document processor to JS

844601b

add NodeJs PDF support

73feffc

wip: partity with python processor

b5d9538

feat: add pptx support

fix: forgot files

d2b238c

review-agent-prime bot reviewed Dec 13, 2023

collector/processSingleFile/convert/asTxt.js (resolved)

review-agent-prime bot reviewed Dec 13, 2023

collector/index.js (resolved)

timothycarambat added 5 commits December 13, 2023 16:47

Remove python scripts totally

433bb07

wip:update docker to boot new collector

fb12c13

Merge branch 'master' of github.com:Mintplex-Labs/anything-llm into collector-refactor

c1c2cc5

add package.json support

a05c498

update dockerfile for new build

fc59ca5

timothycarambat changed the title ~~Collector refactor~~ Document Process v2 Dec 14, 2023

timothycarambat changed the title ~~Document Process v2~~ Document Processor v2 Dec 14, 2023

timothycarambat added 6 commits December 14, 2023 14:21

update gitignore and linting

ef3dc52

add more protections on file lookup

663be89

update package.json

f2176f1

Merge branch 'master' of github.com:Mintplex-Labs/anything-llm into collector-refactor

a05f07e

test build

0aa4534

update docker commands to use cap-add=SYS_ADMIN so web scraper can run

0193081

update all scripts to reflect this remove docker build for branch

timothycarambat merged commit 719521c into master Dec 14, 2023

timothycarambat deleted the collector-refactor branch December 14, 2023 23:14
