+
Skip to content

Conversation

WGH-
Copy link
Collaborator

@WGH- WGH- commented Feb 16, 2021

This might sound weird, but both URL standard[1] specifies it,
and browsers do that as well.

Although the standard specifies it as a "validation error",
this is not a hard error.

This actually happens in the wild: as of now, this Google's page[2]
has the following fragment:

<a class="glue-header__link"
                              href="/intl/ru_ALL
/drive/download/"
>

Yes, the newline here is in the middle of the link, and browsers
do ignore it.

[1] https://url.spec.whatwg.org/#concept-basic-url-parser
[2] https://www.google.com/intl/ru/drive/download/

This might sound weird, but both URL standard[1] specifies it,
and browsers do that as well.

Although the standard specifies it as a "validation error",
this is not a hard error.

This actually happens in the wild: as of now, this Google's page[2]
has the following fragment:

    <a class="glue-header__link"
                                  href="/intl/ru_ALL
    /drive/download/"
    >

Yes, the newline here is in the middle of the link, and browsers
do ignore it.

[1] https://url.spec.whatwg.org/#concept-basic-url-parser
[2] https://www.google.com/intl/ru/drive/download/
@WGH-
Copy link
Collaborator Author

WGH- commented Feb 22, 2021

I know that this solution looks rather dirty, so it probably would be better to not merge it before discussing it a bit.

@asciimoo
Copy link
Member

I don't think that it is dirty, it looks good to me. The only addition that we might consider is making an option to allow users to disable this feature. I don't know if it makes sense, but perhaps someone wants these white spaces in the URL for some reason.

@rochakgupta
Copy link

rochakgupta commented Mar 8, 2021

One observation I can make is that the logic to parse the URL is in func Parse(rawurl string) (*URL, error). So, perhaps it would be better to call the created function func RemoveAsciiTabAndNewlines(s string) string in it. Also, since the created function isn't called anywhere else, wouldn't it be better to make it private? Feel free to correct me if I'm wrong.

@WGH-
Copy link
Collaborator Author

WGH- commented Mar 8, 2021

Also, since the created function isn't called anywhere else, wouldn't it be better to make it private?

Yes. I initially left it exported because I wanted to use it in my code as well: Request.AbsoluteURL doesn't report errors (which sadly can't be fixed without breaking API), so I had to normalize and resolve relative URLs myself.

In any case, I agree that it's better not add more exported functions without good reason. I can live with this function copy-pasted until v3 comes and fixes the API issues that require me to call this function myself to begin with :).

@WGH-
Copy link
Collaborator Author

WGH- commented Mar 8, 2021

The only addition that we might consider is making an option to allow users to disable this feature. I don't know if it makes sense, but perhaps someone wants these white spaces in the URL for some reason.

I believe they should never appear in URL in unescaped forms, and this code doesn't touch their escaped forms. In theory someone might have use cases insane enough to rely on that some servers do accept unescaped tabs AND might treat them differently than escaped ones, but I'd argue that in this scenario he better use custom code instead of generic crawler framework.

EDIT: I'll add some test cases for escaped versions, just in case.

@asciimoo
Copy link
Member

asciimoo commented Mar 9, 2021

I understand, that servers don't accept these urls, but what if someone wants to create a checker that displays malformed urls?

@WGH-
Copy link
Collaborator Author

WGH- commented Mar 10, 2021

I understand, that servers don't accept these urls, but what if someone wants to create a checker that displays malformed urls?

This PR doesn't touch HTML attributes, so it doesn't prevent one from inspecting ther original values. Tabs and newlines are stripped only when calling Visit and related methods. (if I get your question right)

@asciimoo
Copy link
Member

My bad, you are right. Please fix the golint issues and it's ready to merge

@asciimoo
Copy link
Member

asciimoo commented Mar 8, 2022

@WGH- could you rebase this PR?

@WGH-
Copy link
Collaborator Author

WGH- commented Mar 10, 2022

Sorry, I forgot to update the status of this PR. With #673, this PR is no longer necessary.

@WGH- WGH- closed this Mar 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

点击 这是indexloc提供的php浏览器服务,不要输入任何密码和下载