[BUG]: Bulk link scraper tries to fetch child-links from root of website instead of defined origin #1528

@Propheticus

Description

How are you running AnythingLLM?

Docker (local)

What happened?

Discord thread: https://discord.com/channels/1114740394715004990/1243578185581334538

The bulk link scraper does not work when the URL you enter is not the root of the website.

If you enter https://www.somesite.com/some/sub with depth 1, it correctly identifies the children of that sub-path, e.g. https://www.somesite.com/some/sub/child_#
However, it then tries to scrape https://www.somesite.com/child_# instead, dropping the /some/sub part of the path.

For example, if you enter https://learn.microsoft.com/en-us/azure/well-architected/reliability
it will try to scrape https://learn.microsoft.com/metrics instead of https://learn.microsoft.com/en-us/azure/well-architected/reliability/metrics

404 - Page not found\n\nWe couldn't find this page.
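
The behaviour described above is consistent with child links being resolved against the site origin rather than against the URL that was entered. The following is only a minimal Python sketch of that difference, not AnythingLLM's actual code. The function names and the relative link `metrics` are hypothetical, and treating the entered URL as a directory (by appending a trailing slash) is an assumption chosen so that the result matches the expected URL in this report.

```python
from urllib.parse import urljoin, urlsplit


def resolve_against_origin(page_url: str, href: str) -> str:
    """Resolve href against scheme://host/ only (reproduces the reported URLs)."""
    parts = urlsplit(page_url)
    origin = f"{parts.scheme}://{parts.netloc}/"
    return urljoin(origin, href)


def resolve_against_page(page_url: str, href: str) -> str:
    """Resolve href against the entered URL, treated as a directory (assumption)."""
    base = page_url if page_url.endswith("/") else page_url + "/"
    return urljoin(base, href)


entered = "https://learn.microsoft.com/en-us/azure/well-architected/reliability"
href = "metrics"  # hypothetical relative link found on the entered page

# Resolving against the origin loses the sub-path and yields the 404 URL.
print(resolve_against_origin(entered, href))
# Resolving against the entered URL keeps the sub-path.
print(resolve_against_page(entered, href))
```

If the scraper does something like the first function, every child of a non-root URL would land on a path directly under the domain, which would explain the 404 page being scraped.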

Are there known steps to reproduce?

In the bulk link scraper, enter a URL that is not the root of the website / homepage, with depth 1.

Metadata

Labels

bug: Something isn't working