Closed
Labels
investigating, possible bug
Description
How are you running AnythingLLM?
All versions
What happened?
What I did
I just tried out AnythingLLM for the first time. I wanted to test website scraping on the naive-ui documentation (https://www.naiveui.com/en-US/os-theme/docs/introduction) so I could apply RAG to it.
I added the URL to the bulk link scraper.
After a short delay, a success message appears that says "successfully scraped 0 pages".
Retrieving the page in the browser works fine.
What I expected
I expected it to scrape the documentation for the 20 components that a human can find as links on the page above. If something goes wrong, I expected a meaningful error message.
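To illustrate the kind of reporting I have in mind, here is a minimal sketch. The function name `summarizeScrape` and its parameters (`scrapedCount`, `failures`) are hypothetical and not taken from the AnythingLLM codebase. The idea is only that a run which stores zero pages should not be reported as a success.

```javascript
// Hypothetical helper (not AnythingLLM code): turns scrape results into a
// user-facing status. `scrapedCount` is the number of pages stored and
// `failures` is a list of { url, reason } objects for pages that failed.
function summarizeScrape(scrapedCount, failures) {
  if (scrapedCount === 0) {
    const first = failures[0];
    const detail = first
      ? `${first.url}: ${first.reason}`
      : "no links were discovered";
    return { ok: false, message: `Scraping failed, 0 pages scraped (${detail}).` };
  }
  if (failures.length > 0) {
    return {
      ok: true,
      message: `Scraped ${scrapedCount} pages, ${failures.length} failed.`,
    };
  }
  return { ok: true, message: `Successfully scraped ${scrapedCount} pages.` };
}
```

With a result shaped like this, the UI could show an error instead of "successfully scraped 0 pages" whenever `ok` is false.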
What I observed
- This seems to be a special type of webpage that puppeteer cannot retrieve. Other webpages scrape fine for me.
- Container log
2024-12-03 21:56:24 [collector] info: Discovering links...
2024-12-03 21:56:25 [collector] info: Found 1 links to scrape.
2024-12-03 21:56:25 [collector] info: Starting bulk scraping...
2024-12-03 21:56:25 [collector] info: Scraping 1/1: https://www.naiveui.com/en-US/os-theme/docs/introduction
2024-12-03 21:56:25 [collector] error: Failed to scrape https://www.naiveui.com/en-US/os-theme/docs/introduction. TypeError: Cannot read properties of undefined (reading 'length')
2024-12-03 21:56:25 at bulkScrapePages (/app/collector/utils/extensions/WebsiteDepth/index.js:105:20)
2024-12-03 21:56:25 at async websiteScraper (/app/collector/utils/extensions/WebsiteDepth/index.js:160:23)
2024-12-03 21:56:25 at async /app/collector/extensions/index.js:122:29
2024-12-03 21:56:25 [collector] info: Scraped 0 pages.
- I started the frontend, server, and collector as separate processes from my IDE and set a breakpoint at the scraping step. I noticed that puppeteer fails with a "websocket hung up" error.
This does not seem to be handled properly.
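As a rough sketch of what more defensive handling might look like: the names `bulkScrapeSafely` and `fetchPage` below are hypothetical, and I am assuming the `reading 'length'` TypeError comes from the page loader returning `undefined` after the connection drops. The actual code at `WebsiteDepth/index.js:105` is not shown here, so treat that as a guess.

```javascript
// Hypothetical sketch (not AnythingLLM code) of a bulk-scrape loop that
// tolerates a failed page load. `fetchPage` stands in for whatever loads a
// URL (for example via puppeteer) and is assumed to resolve to an array of
// documents, or to reject or resolve to undefined when the browser
// connection drops.
async function bulkScrapeSafely(links, fetchPage) {
  const scraped = [];
  const failures = [];
  for (const url of links) {
    try {
      const docs = await fetchPage(url);
      // Guard against a missing or empty result before touching `.length`
      // on it, so one bad page is recorded instead of throwing a TypeError.
      if (!Array.isArray(docs) || docs.length === 0) {
        failures.push({ url, reason: "page returned no content" });
        continue;
      }
      scraped.push(...docs);
    } catch (err) {
      // Keep the underlying error message so it can be shown to the user.
      failures.push({
        url,
        reason: err instanceof Error ? err.message : String(err),
      });
    }
  }
  return { scraped, failures };
}
```

Collecting the failures per URL, rather than only logging them, would let the caller report why a page could not be scraped.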
Are there known steps to reproduce?
- Open the bulk scraping tool.
- Enter the URL https://www.naiveui.com/en-US/os-theme/docs/introduction with default parameters.
- Start the scrape.
timothycarambat
Assignees