Document Processor v2 #442
collector/processSingleFile/convert/asTxt.js
It's recommended to use streams when reading and writing files to improve the performance and memory usage of your application, especially when dealing with large files.

const fs = require('fs');
const stream = fs.createReadStream(fullFilePath, 'utf8');
let content = '';
stream.on('data', chunk => {
  content += chunk.toString();
});
stream.on('end', () => {
  // Continue processing the content
});

collector/index.js
It's a good practice to validate and sanitize the input data to prevent potential security vulnerabilities such as SQL Injection, Cross-Site Scripting (XSS), etc.

const sanitize = require('sanitize-filename');
const targetFilename = sanitize(reqBody(request).filename);
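For a user-supplied filename, the most direct risk is path traversal (for example `../../etc/passwd`). The sketch below shows one way a filename could be reduced to a single path segment and then checked against a base directory, using only Node's built-in `path` module. The helper name `resolveInside` and the `baseDir` argument are hypothetical and do not come from this PR; the reviewer's `sanitize-filename` suggestion could be applied to the filename before calling it.

```javascript
const path = require('path');

// Hypothetical helper, not part of the PR: resolve `filename` inside
// `baseDir` and refuse anything that would land outside of it.
function resolveInside(baseDir, filename) {
  // Keep only the final path segment so leading "../" parts are dropped.
  const base = path.basename(String(filename));
  if (!base || base === '.' || base === '..') {
    throw new Error('Invalid filename');
  }

  const root = path.resolve(baseDir);
  const full = path.resolve(root, base);

  // Defensive check: the resolved path must still sit under the root.
  const prefix = root.endsWith(path.sep) ? root : root + path.sep;
  if (!full.startsWith(prefix)) {
    throw new Error('Path escapes base directory');
  }
  return full;
}

module.exports = { resolveInside };
```

A caller might use it as `resolveInside(uploadDir, reqBody(request).filename)`, where `uploadDir` is an assumed name for whatever directory the collector reads from. Checking the resolved path as well as stripping directory segments is deliberate redundancy: either check alone is easy to get subtly wrong.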
…ollector-refactor
update all scripts to reflect this remove docker build for branch
* wip: init refactor of document processor to JS
* add NodeJs PDF support
* wip: parity with python processor
  feat: add pptx support
* fix: forgot files
* Remove python scripts totally
* wip: update docker to boot new collector
* add package.json support
* update dockerfile for new build
* update gitignore and linting
* add more protections on file lookup
* update package.json
* test build
* update docker commands to use cap-add=SYS_ADMIN so web scraper can run
  update all scripts to reflect this
  remove docker build for branch
Migrate away from the Python document processor and reimplement support for all supported file types in NodeJS.