
Conversation

@shatfield4
Collaborator

resolves #363

Adds an input where users can paste links to be scraped and inserted into their documents
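
As a rough illustration of how that input could feed the collector, a minimal sketch of an endpoint that accepts a URL and runs it through the link scraper is shown below. The /process-link route, its payload shape, and the Flask wiring are assumptions for illustration only; process_single_link is the function discussed in the review comments further down.

    from flask import Flask, request, jsonify
    from scripts.link import process_single_link  # import path is an assumption

    app = Flask(__name__)

    @app.route("/process-link", methods=["POST"])  # hypothetical route
    def process_link():
        url = (request.json or {}).get("link")
        success, reason, document = process_single_link(url)
        if not success:
            return jsonify({"success": False, "reason": reason}), 400
        # document carries pageContent and token_count_estimate per the review
        # snippet below; the caller can then write it into custom-documents.
        return jsonify({"success": True, "document": document}), 200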

@shatfield4 shatfield4 self-assigned this Nov 14, 2023
@shatfield4 shatfield4 linked an issue Nov 14, 2023 that may be closed by this pull request
@shatfield4 shatfield4 changed the title Sdding url uploads to document picker Adding url uploads to document picker Nov 14, 2023
@review-agent-prime

collector/scripts/link.py

In the process_single_link function in collector/scripts/link.py, the code could be refactored to reduce the level of nesting and improve readability. This can be achieved by using guard clauses to handle error conditions and return early.

    def process_single_link(url):
        if not url:
            return False, "Invalid URL!", None

        session = HTMLSession()
        req = session.get(url)
        if not req.ok:
            return False, "Could not reach this URL.", None

        req.html.render()
        with tempfile.NamedTemporaryFile(mode="w") as tmp:
            tmp.write(req.html.html)
            tmp.seek(0)
            loader = UnstructuredHTMLLoader(tmp.name)
            data = loader.load()[0]
            full_text = data.page_content

        if not full_text:
            return False, "Could not parse any meaningful data from this URL.", None

        link_meta = append_meta(req, full_text, True)
        token_count = len(tokenize(full_text))
        link_meta['pageContent'] = full_text
        link_meta['token_count_estimate'] = token_count
        return True, None, link_meta

In the process_single_link function in collector/scripts/link.py, the rendered req.html.html attribute could be read once, stored in a variable, and reused, rather than being accessed inline when writing the page to the temporary file.

    def process_single_link(url):
        if not url:
            return False, "Invalid URL!", None

        session = HTMLSession()
        req = session.get(url)
        if not req.ok:
            return False, "Could not reach this URL.", None

        req.html.render()
        html_content = req.html.html
        with tempfile.NamedTemporaryFile(mode="w") as tmp:
            tmp.write(html_content)
            tmp.seek(0)
            loader = UnstructuredHTMLLoader(tmp.name)
            data = loader.load()[0]
            full_text = data.page_content

        if not full_text:
            return False, "Could not parse any meaningful data from this URL.", None

        link_meta = append_meta(req, full_text, True)
        token_count = len(tokenize(full_text))
        link_meta['pageContent'] = full_text
        link_meta['token_count_estimate'] = token_count
        return True, None, link_meta

@timothycarambat timothycarambat marked this pull request as draft November 14, 2023 22:03
@timothycarambat timothycarambat marked this pull request as ready for review November 17, 2023 01:12
@timothycarambat timothycarambat merged commit 7edfcca into master Nov 17, 2023
@timothycarambat timothycarambat deleted the 363-url-scraping-for-document-sourcing branch November 17, 2023 01:15
cabwds pushed a commit to cabwds/anything-llm that referenced this pull request Jul 3, 2025
* WIP adding url uploads to document picker

* fix manual script for uploading url to custom-documents

* fix metadata for url scraping

* wip url parsing

* update how async link scraping works

* docker-compose defaults added
no autocomplete on URLs

---------

Co-authored-by: timothycarambat <rambat1010@gmail.com>
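
One of the commits above mentions updating how async link scraping works. The repository's actual approach is not shown in this thread, but a minimal sketch of fanning several submitted URLs out over a thread pool might look like the following; process_links and its max_workers parameter are hypothetical, while process_single_link is the function quoted in the review comments above.

    from concurrent.futures import ThreadPoolExecutor

    def process_links(urls, max_workers=4):
        # Hypothetical wrapper: scrape each URL concurrently and keep only the
        # successfully parsed documents, reporting failures per link.
        # Note: req.html.render() launches a headless browser, so real code may
        # need to cap concurrency or render sequentially.
        results = []
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            for url, (ok, reason, data) in zip(urls, pool.map(process_single_link, urls)):
                if ok:
                    results.append(data)
                else:
                    print(f"Skipping {url}: {reason}")
        return results

An asyncio-based version would look similar; the important point is that a failure on one URL is reported for that link rather than aborting the whole batch.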


Development

Successfully merging this pull request may close these issues.

URL Scraping for document sourcing
