Remove tabs and newlines from URLs #589
This might sound weird, but the URL standard[1] specifies it, and browsers do it as well. Although the standard classifies it as a "validation error", this is not a hard error: the parser removes the tabs and newlines and carries on.

This actually happens in the wild: as of now, this Google page[2] has the following fragment:

<a class="glue-header__link" href="/intl/ru_ALL
/drive/download/" >

Yes, the newline here is in the middle of the link, and browsers do ignore it.

[1] https://url.spec.whatwg.org/#concept-basic-url-parser
[2] https://www.google.com/intl/ru/drive/download/
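The stripping step described above can be sketched in Go roughly as follows. This is a minimal illustration, not the code from this PR: the helper name `stripTabsAndNewlines` is hypothetical, and the set of removed characters (tab, line feed, carriage return) is taken from the URL standard's definition of "ASCII tab or newline".

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// stripTabsAndNewlines is a hypothetical helper (not the PR's actual function).
// It drops U+0009 TAB, U+000A LF and U+000D CR from s and leaves every other
// character, including percent-escapes such as %09, untouched.
func stripTabsAndNewlines(s string) string {
	return strings.Map(func(r rune) rune {
		switch r {
		case '\t', '\n', '\r':
			return -1 // a negative return value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	base, err := url.Parse("https://www.google.com/intl/ru/drive/download/")
	if err != nil {
		panic(err)
	}

	// The href from the fragment above, with the newline in the middle.
	href := "/intl/ru_ALL\n/drive/download/"

	ref, err := url.Parse(stripTabsAndNewlines(href))
	if err != nil {
		panic(err)
	}

	fmt.Println(base.ResolveReference(ref))
	// prints "https://www.google.com/intl/ru_ALL/drive/download/"
}
```

Cleaning the string before it reaches `url.Parse` keeps the original attribute value available to anyone who wants to inspect it separately.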
I know that this solution looks rather dirty, so it would probably be better not to merge it before discussing it a bit. |
I don't think it is dirty; it looks good to me. The only addition we might consider is an option that lets users disable this feature. I don't know if it makes sense, but perhaps someone wants this whitespace in the URL for some reason. |
One observation I can make is that the logic to parse the URL is in |
Yes. I initially left it exported because I wanted to use it in my code as well: In any case, I agree that it's better not to add more exported functions without a good reason. I can live with this function copy-pasted until v3 comes and fixes the API issues that require me to call this function myself to begin with :). |
I believe they should never appear in a URL in unescaped form, and this code doesn't touch their escaped forms. In theory, someone might have a use case insane enough to rely on the fact that some servers accept unescaped tabs AND might treat them differently from escaped ones, but I'd argue that in this scenario they would be better off using custom code instead of a generic crawler framework. EDIT: I'll add some test cases for escaped versions, just in case. |
I understand that servers don't accept these URLs, but what if someone wants to create a checker that displays malformed URLs? |
This PR doesn't touch HTML attributes, so it doesn't prevent one from inspecting their original values. Tabs and newlines are stripped only when calling |
My bad, you are right. Please fix the golint issues, and then it's ready to merge. |
@WGH- could you rebase this PR? |
Sorry, I forgot to update the status of this PR. With #673, this PR is no longer necessary. |