Remove tabs and newlines from URLs #589

WGH- · 2021-02-16T03:00:17Z

This might sound weird, but the URL standard[1] specifies it,
and browsers do it as well.

Although the standard classifies it as a "validation error",
it is not a hard error: parsing continues.

This actually happens in the wild: as of now, this Google page[2]
contains the following fragment:

<a class="glue-header__link"
                              href="/intl/ru_ALL
/drive/download/"
>

Yes, the newline here is in the middle of the link, and browsers
do ignore it.

[1] https://url.spec.whatwg.org/#concept-basic-url-parser
[2] https://www.google.com/intl/ru/drive/download/
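The stripping step described above can be sketched in Go as follows. This is a minimal illustration, not the code from this PR: the function name stripTabsAndNewlines is hypothetical, and the set of removed characters (tab, LF, CR) follows the URL standard's definition of "ASCII tab or newline".

```go
package main

import (
	"fmt"
	"strings"
)

// stripTabsAndNewlines removes every ASCII tab (U+0009), line feed (U+000A)
// and carriage return (U+000D) from s, wherever they occur.
// The name is hypothetical and chosen for this sketch only.
func stripTabsAndNewlines(s string) string {
	return strings.Map(func(r rune) rune {
		switch r {
		case '\t', '\n', '\r':
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	// The href value from the fragment above, newline included.
	href := "/intl/ru_ALL\n/drive/download/"
	fmt.Printf("%q\n", stripTabsAndNewlines(href))
}
```

Unlike strings.TrimSpace, this removes the characters from the middle of the string as well, which is what the fragment above needs.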


WGH- · 2021-02-22T21:57:38Z

I know that this solution looks rather dirty, so it probably would be better to not merge it before discussing it a bit.

asciimoo · 2021-02-26T00:21:11Z

I don't think that it is dirty, it looks good to me. The only addition that we might consider is making an option to allow users to disable this feature. I don't know if it makes sense, but perhaps someone wants these white spaces in the URL for some reason.

rochakgupta · 2021-03-08T09:22:00Z

One observation: the URL-parsing logic lives in func Parse(rawurl string) (*URL, error), so it might be better to call the new function, func RemoveAsciiTabAndNewlines(s string) string, from there. Also, since the created function isn't called anywhere else, wouldn't it be better to make it private? Feel free to correct me if I'm wrong.
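A rough sketch of that suggestion, with the helper unexported and called from inside the parse function. The names parse and removeASCIITabAndNewlines are stand-ins for this illustration, and the standard library's net/url.Parse is used here only as an assumption about what the real Parse delegates to.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// removeASCIITabAndNewlines is unexported, so it is only reachable from
// within this package. Hypothetical name, modelled on the one in the PR.
func removeASCIITabAndNewlines(s string) string {
	return strings.NewReplacer("\t", "", "\n", "", "\r", "").Replace(s)
}

// parse normalizes the raw URL first and then hands it to net/url.Parse,
// so callers never have to remember to strip the input themselves.
func parse(rawurl string) (*url.URL, error) {
	return url.Parse(removeASCIITabAndNewlines(rawurl))
}

func main() {
	u, err := parse("https://example.com/intl/ru_ALL\n/drive/download/")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(u.Path)
}
```

Putting the call inside the parse function keeps the normalization in one place; the cost is that code outside the package can no longer reuse the helper.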

WGH- · 2021-03-08T23:42:31Z

> Also, since the created function isn't called anywhere else, wouldn't it be better to make it private?

Yes. I initially left it exported because I wanted to use it in my code as well: Request.AbsoluteURL doesn't report errors (which sadly can't be fixed without breaking API), so I had to normalize and resolve relative URLs myself.

In any case, I agree that it's better not to add more exported functions without a good reason. I can live with a copy-pasted version of this function until v3 comes and fixes the API issues that force me to call it myself in the first place :).

WGH- · 2021-03-08T23:56:12Z

> The only addition that we might consider is making an option to allow users to disable this feature. I don't know if it makes sense, but perhaps someone wants these white spaces in the URL for some reason.

I believe they should never appear in a URL in unescaped form, and this code doesn't touch their escaped forms. In theory, someone might have a use case insane enough to rely on the fact that some servers accept unescaped tabs AND treat them differently from escaped ones, but I'd argue that in that scenario they would be better off using custom code instead of a generic crawler framework.

EDIT: I'll add some test cases for escaped versions, just in case.
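To illustrate the distinction between literal and escaped characters, here is a small table-driven check. It is a sketch under assumptions, not the test cases from this PR: stripTabsAndNewlines is a hypothetical helper that removes only the literal tab, LF and CR characters.

```go
package main

import (
	"fmt"
	"strings"
)

// stripTabsAndNewlines drops literal tab, LF and CR characters only.
// Hypothetical helper written for this sketch.
func stripTabsAndNewlines(s string) string {
	return strings.NewReplacer("\t", "", "\n", "", "\r", "").Replace(s)
}

func main() {
	cases := []struct{ in, want string }{
		{"/a%09b", "/a%09b"},       // percent-encoded tab: left untouched
		{"/a%0Ab", "/a%0Ab"},       // percent-encoded LF: left untouched
		{"/a%0D%0Ab", "/a%0D%0Ab"}, // percent-encoded CR LF: left untouched
		{"/a\tb\r\n", "/ab"},       // literal tab, CR and LF: removed
	}
	for _, c := range cases {
		got := stripTabsAndNewlines(c.in)
		fmt.Printf("%q -> %q (ok=%t)\n", c.in, got, got == c.want)
	}
}
```

The percent-encoded sequences are plain ASCII text such as "%09", so a filter that matches only the raw control characters never sees them.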

asciimoo · 2021-03-09T23:54:26Z

I understand that servers don't accept these URLs, but what if someone wants to create a checker that displays malformed URLs?

WGH- · 2021-03-10T00:21:49Z

> I understand that servers don't accept these URLs, but what if someone wants to create a checker that displays malformed URLs?

This PR doesn't touch HTML attributes, so it doesn't prevent anyone from inspecting their original values. Tabs and newlines are stripped only when calling Visit and related methods. (If I understand your question correctly.)

asciimoo · 2021-03-15T14:09:02Z

My bad, you are right. Please fix the golint issues and it's ready to merge.

asciimoo · 2022-03-08T08:38:13Z

@WGH- could you rebase this PR?

WGH- · 2022-03-10T17:21:15Z

Sorry, I forgot to update the status of this PR. With #673, this PR is no longer necessary.

WGH- mentioned this pull request Mar 21, 2021

Better URL parsing according to whatwg URL standard #596

Open

WGH- closed this Mar 10, 2022
