
Conversation

@shatfield4
Collaborator

resolves #363

Adds an input where users can paste links to be scraped and inserted into their documents
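
As a rough illustration of how that input could feed the collector, a minimal sketch of an endpoint that accepts a URL and runs it through the link scraper is shown below. The /process-link route, its payload shape, and the Flask wiring are assumptions for illustration only; process_single_link is the function discussed in the review comments further down.

    from flask import Flask, request, jsonify
    from scripts.link import process_single_link  # import path is an assumption

    app = Flask(__name__)

    @app.route("/process-link", methods=["POST"])  # hypothetical route
    def process_link():
        url = (request.json or {}).get("link")
        success, reason, document = process_single_link(url)
        if not success:
            return jsonify({"success": False, "reason": reason}), 400
        # document carries pageContent and token_count_estimate per the review
        # snippet below; the caller can then write it into custom-documents.
        return jsonify({"success": True, "document": document}), 200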

@shatfield4 shatfield4 self-assigned this Nov 14, 2023
@shatfield4 shatfield4 linked an issue Nov 14, 2023 that may be closed by this pull request
@shatfield4 shatfield4 changed the title Sdding url uploads to document picker Adding url uploads to document picker Nov 14, 2023
@review-agent-prime

collector/scripts/link.py

In the process_single_link function in collector/scripts/link.py, the code could be refactored to reduce the level of nesting and improve readability. This can be achieved by using guard clauses to handle error conditions and return early.

    def process_single_link(url):
        if not url:
            return False, "Invalid URL!", None

        session = HTMLSession()
        req = session.get(url)
        if not req.ok:
            return False, "Could not reach this URL.", None

        req.html.render()
        with tempfile.NamedTemporaryFile(mode="w") as tmp:
            tmp.write(req.html.html)
            tmp.seek(0)
            loader = UnstructuredHTMLLoader(tmp.name)
            data = loader.load()[0]
            full_text = data.page_content

        if not full_text:
            return False, "Could not parse any meaningful data from this URL.", None

        link_meta = append_meta(req, full_text, True)
        token_count = len(tokenize(full_text))
        link_meta['pageContent'] = full_text
        link_meta['token_count_estimate'] = token_count
        return True, None, link_meta

In the process_single_link function in collector/scripts/link.py, the rendered req.html.html attribute could be read once, stored in a variable, and reused, rather than being accessed inline when writing the page to the temporary file.

    def process_single_link(url):
        if not url:
            return False, "Invalid URL!", None

        session = HTMLSession()
        req = session.get(url)
        if not req.ok:
            return False, "Could not reach this URL.", None

        req.html.render()
        html_content = req.html.html
        with tempfile.NamedTemporaryFile(mode="w") as tmp:
            tmp.write(html_content)
            tmp.seek(0)
            loader = UnstructuredHTMLLoader(tmp.name)
            data = loader.load()[0]
            full_text = data.page_content

        if not full_text:
            return False, "Could not parse any meaningful data from this URL.", None

        link_meta = append_meta(req, full_text, True)
        token_count = len(tokenize(full_text))
        link_meta['pageContent'] = full_text
        link_meta['token_count_estimate'] = token_count
        return True, None, link_meta

@timothycarambat timothycarambat marked this pull request as draft November 14, 2023 22:03
@timothycarambat timothycarambat marked this pull request as ready for review November 17, 2023 01:12
@timothycarambat timothycarambat merged commit 7edfcca into master Nov 17, 2023
@timothycarambat timothycarambat deleted the 363-url-scraping-for-document-sourcing branch November 17, 2023 01:15
cabwds pushed a commit to cabwds/anything-llm that referenced this pull request Jul 3, 2025
* WIP adding url uploads to document picker

* fix manual script for uploading url to custom-documents

* fix metadata for url scraping

* wip url parsing

* update how async link scraping works

* docker-compose defaults added
no autocomplete on URLs

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
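
One of the commits above mentions updating how async link scraping works. The repository's actual approach is not shown in this thread, but a minimal sketch of fanning several submitted URLs out over a thread pool might look like the following; process_links and its max_workers parameter are hypothetical, while process_single_link is the function quoted in the review comments above.

    from concurrent.futures import ThreadPoolExecutor

    def process_links(urls, max_workers=4):
        # Hypothetical wrapper: scrape each URL concurrently and keep only the
        # successfully parsed documents, reporting failures per link.
        # Note: req.html.render() launches a headless browser, so real code may
        # need to cap concurrency or render sequentially.
        results = []
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            for url, (ok, reason, data) in zip(urls, pool.map(process_single_link, urls)):
                if ok:
                    results.append(data)
                else:
                    print(f"Skipping {url}: {reason}")
        return results

An asyncio-based version would look similar; the important point is that a failure on one URL is reported for that link rather than aborting the whole batch.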


Development

Successfully merging this pull request may close these issues.

URL Scraping for document sourcing
