@angelplusultra commented Sep 18, 2025

Pull Request Type

  • ✨ feat
  • 🐛 fix
  • ♻️ refactor
  • 💄 style
  • 🔨 chore
  • 📝 docs

Relevant Issues

resolves #2110

What is in this change?

This PR introduces an enhancement to the document upload modal: the web scraping input field can now fetch, parse, and convert statically hosted files (e.g., https://example.com/a-file.pdf) and API endpoints returning JSON (e.g., https://jsonplaceholder.typicode.com/posts) into application documents.

Currently Supporting:

  • Text

    • Plain text: .txt, .md, .org, .adoc, .rst
    • HTML: .html
    • CSV: .csv
    • JSON: .json
  • Documents

    • Word (DOCX): .docx (NOTE: .doc files are currently not supported, only .docx)
    • OpenDocument Text: .odt
    • PDF: .pdf
    • EPUB: .epub
  • Presentations

    • PowerPoint (PPTX): .pptx
    • OpenDocument Presentation: .odp
  • Spreadsheets

    • Excel (XLSX): .xlsx
  • Email

    • MBOX: .mbox
  • Audio

    • WAV: .wav
    • MP3: .mp3
  • Video

    • MP4: .mp4
    • MPEG: .mpeg
  • Images

    • PNG: .png
    • JPEG/JPG: .jpg

Additional Information

The core functionality hinges on the Content-Type header of the response. If the reported mimetype is one of the supported types, the file is downloaded and processed into a document.
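
To illustrate the Content-Type check described above, here is a minimal sketch, not the code from this PR. The names `SUPPORTED_MIMETYPES`, `normalizeContentType`, and `extensionForContentType` are hypothetical, and the table covers only a small assumed subset of the formats listed. It shows one way a header value such as `application/json; charset=utf-8` could be reduced to a bare mimetype and matched against a supported list.

```javascript
// Hypothetical subset of supported mimetypes, for illustration only.
// The actual mapping used by the collector is not shown in this PR description.
const SUPPORTED_MIMETYPES = new Map([
  ["text/plain", ".txt"],
  ["text/html", ".html"],
  ["text/csv", ".csv"],
  ["application/json", ".json"],
  ["application/pdf", ".pdf"],
]);

// Reduce a raw Content-Type header value to a lowercase mimetype,
// dropping parameters such as "; charset=utf-8".
function normalizeContentType(headerValue) {
  if (typeof headerValue !== "string") return null;
  const mimetype = headerValue.split(";")[0].trim().toLowerCase();
  return mimetype.length > 0 ? mimetype : null;
}

// Return the file extension to use for a supported mimetype,
// or null when the header is missing or the type is not supported.
function extensionForContentType(headerValue) {
  const mimetype = normalizeContentType(headerValue);
  if (mimetype === null) return null;
  return SUPPORTED_MIMETYPES.get(mimetype) ?? null;
}
```

In a flow like the one described, a caller would presumably read the header from the HTTP response, call something like `extensionForContentType`, and fall back to regular page scraping when it returns `null`. That fallback is an assumption, not something stated in this PR.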

Developer Validations

  • I ran yarn lint from the root of the repo & committed changes
  • Relevant documentation has been updated
  • I have tested my code functionality
  • Docker build succeeds locally

@angelplusultra marked this pull request as draft September 18, 2025 01:08
…ponse content type, as a result unlocked all supported files
@angelplusultra marked this pull request as ready for review September 19, 2025 00:54
@timothycarambat added the PR:needs review label Sep 19, 2025
@@ -0,0 +1,31 @@
const { WATCH_DIRECTORY } = require("../../utils/constants");

Might make more sense to have this single util that downloads files moved to collector/utils/files/index.js since it has to do with files only and we can probably re-use it too

Fixed in 0bba2c7

@timothycarambat removed the PR:needs review label Sep 19, 2025
@timothycarambat merged commit f7b9057 into master Oct 1, 2025
1 check passed
@timothycarambat deleted the 2110-download-file-as-document branch October 1, 2025 22:49
Development

Successfully merging this pull request may close these issues.

[FEAT]: Allow bulk web scraper to download files that it discovers
