[BUG]: Scraping failed #2758

@mrsimpson

Description

How are you running AnythingLLM?

All versions

What happened?

What I did

Just tried out AnythingLLM for the first time. I wanted to test the website scraping with the naive-ui documentation (https://www.naiveui.com/en-US/os-theme/docs/introduction) to apply RAG on it.

Added the URL to the Bulk-link-scraper.
After a short delay, there is a success-message which says "successfully scraped 0 pages".

Retrieving the page in the browser works fine.

What I expected

I expected it to scrape the documentation for the 20 components that a human can find as links on the above page. If something goes wrong, I expected a meaningful error message.

What I observed

  • This seems to be a special type of webpage that puppeteer cannot retrieve. Other webpages scrape fine.
  • Container log
2024-12-03 21:56:24 [collector] info: Discovering links...
2024-12-03 21:56:25 [collector] info: Found 1 links to scrape.
2024-12-03 21:56:25 [collector] info: Starting bulk scraping...
2024-12-03 21:56:25 [collector] info: Scraping 1/1: https://www.naiveui.com/en-US/os-theme/docs/introduction
2024-12-03 21:56:25 [collector] error: Failed to scrape https://www.naiveui.com/en-US/os-theme/docs/introduction. TypeError: Cannot read properties of undefined (reading 'length')
2024-12-03 21:56:25     at bulkScrapePages (/app/collector/utils/extensions/WebsiteDepth/index.js:105:20)
2024-12-03 21:56:25     at async websiteScraper (/app/collector/utils/extensions/WebsiteDepth/index.js:160:23)
2024-12-03 21:56:25     at async /app/collector/extensions/index.js:122:29
2024-12-03 21:56:25 [collector] info: Scraped 0 pages.
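  • I started the frontend, server, and collector as separate processes from my IDE and set a breakpoint at the scraping step. I noticed that puppeteer fails with a "websocket hung up".
    This does not seem to be handled properly: the failure surfaces only as the TypeError in the log above, and the UI still reports success.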
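
One way the missing handling could look, as a minimal sketch only. None of the names below come from the AnythingLLM codebase: `scrapeAll`, `summarize`, and the `loadPage` callback are hypothetical, and `loadPage` stands in for whatever puppeteer-based loader the collector uses. The sketch assumes the loader can either throw or resolve to `undefined` when the browser connection drops, and shows guarding the result before reading `.length`, then reporting a failure instead of "successfully scraped 0 pages".

```javascript
// Hypothetical sketch: these names are illustrative, not AnythingLLM's actual API.
// `loadPage(link)` is assumed to resolve to an array of documents, but may throw
// or resolve to undefined when the underlying browser connection fails.
async function scrapeAll(links, loadPage) {
  const scraped = [];
  const failed = [];

  for (const link of links) {
    try {
      const docs = await loadPage(link);
      // Guard before touching `.length`: a loader that fails quietly may return undefined.
      if (!Array.isArray(docs) || docs.length === 0) {
        failed.push({ link, reason: "loader returned no content" });
        continue;
      }
      scraped.push({ link, docs });
    } catch (error) {
      const reason = error instanceof Error ? error.message : String(error);
      failed.push({ link, reason });
    }
  }

  return { scraped, failed };
}

// Build a user-facing result that does not claim success when every page failed.
function summarize({ scraped, failed }) {
  if (scraped.length === 0 && failed.length > 0) {
    const details = failed.map((f) => `${f.link}: ${f.reason}`).join("; ");
    return {
      ok: false,
      message: `Scraping failed for all ${failed.length} page(s). ${details}`,
    };
  }
  return {
    ok: true,
    message: `Scraped ${scraped.length} page(s), ${failed.length} failed.`,
  };
}
```

With a result shaped like this, the UI could show the per-link reason (for example the websocket error) rather than a success message with a count of zero.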

Are there known steps to reproduce?

1. Open the bulk scraping tool.
2. Enter the URL https://www.naiveui.com/en-US/os-theme/docs/introduction with default parameters.
3. Start the scrape.

Labels

  • investigating: Core team or maintainer will or is currently looking into this issue
  • possible bug: Bug was reported but is not confirmed or is unable to be replicated.