Two URLs that render as the same text but are not equal #6
Comments
It seems as if That is, |
It certainly feels like you're onto something. I can kind of explain why it does what it does, but I might want to wait until I'm in a less jet-lagged state. I believe this behavior is inherited from t.p.URL, so @glyph may also be a good explainer in the meantime. Here's a go anyway: the constructor, working from parts, is meant to work equally well with URIs and IRIs. Virtually no decoding happens in the constructor. When converting to text, minimal encoding is used. Percent signs are certainly valid in the path, and are left alone. A question mark is reserved for the query string, and is encoded. In short (and crudely), one is a URI and one is an IRI, according to this simplified explanation in the docs. I hope I'm right and that makes sense, but am certainly open to correction. It's definitely good to explore these APIs and maybe add some better docs/FAQs, so thank you! |
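A minimal sketch of the rendering behavior described above, using only the standard library rather than hyperlink itself. `render_segment` and its safe-character set are hypothetical stand-ins for the text conversion, not the library's actual code: the helper leaves `%` alone and encodes `?`, which is enough to show how two different `path` inputs can end up as the same text.

```python
from urllib.parse import quote

# Characters left untouched when rendering a path segment: '%' plus the
# RFC 3986 sub-delims, ':' and '@'. This set is an assumption made for the
# sketch, not hyperlink's actual table.
_SEGMENT_SAFE = "%!$&'()*+,;=:@"


def render_segment(segment: str) -> str:
    """Minimally encode one path segment (hypothetical helper)."""
    return quote(segment, safe=_SEGMENT_SAFE)


raw = "a?b"        # contains a literal question mark
escaped = "a%3Fb"  # already percent-encoded by the caller

print(render_segment(raw))      # a%3Fb
print(render_segment(escaped))  # a%3Fb
print(raw == escaped)           # False
```

Because no decoding happens on the way in, the two inputs stay distinct as stored values even though their rendered text is identical.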
Basically the idea here is that originally Twisted had a notion of the "real" value and the "encoded" value of a URL path segment; the "real" value had all the %xx values decoded and the "encoded" version was escaped. Unfortunately this is overly simplistic, because of course a human being could type an "encoded" URL into a browser bar, and then that is the text that they wanted to see. Such values can be "down"-converted into ASCII-only machine-readable things that can be sliced up to go onto the wire ("URI") or "up"-converted into textual human-readable things that might be in a language other than English, and/or include emojis 🐮. |
The problem with characters like # and ? is that, while it is possible to "up"-convert a |
So the current behavior is an accident of the implementation, but without some further thought, also kind of inevitable? I would be interested in how we might preserve both behaviors. |
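The "down"/"up" conversion described above can be sketched with the standard library. This is an illustration under stated assumptions, not Twisted's or hyperlink's implementation: `down_convert` and `up_convert` are hypothetical names, and the rule that only unreserved ASCII and non-ASCII text get decoded is one possible policy for keeping characters like `#` and `?` escaped.

```python
import re
import string
from urllib.parse import quote, unquote_to_bytes

_ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")
_UNRESERVED = frozenset(string.ascii_letters + string.digits + "-._~")


def down_convert(segment: str) -> str:
    """Text -> ASCII-only form; existing escapes are left alone."""
    return quote(segment, safe="%!$&'()*+,;=:@")


def up_convert(segment: str) -> str:
    """Decode escapes for display, but keep reserved ASCII escaped."""

    def decode_run(match: "re.Match[str]") -> str:
        raw = unquote_to_bytes(match.group(0))
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            return match.group(0)  # not valid UTF-8: leave as-is
        return "".join(
            ch if (ord(ch) > 127 or ch in _UNRESERVED) else f"%{ord(ch):02X}"
            for ch in text
        )

    return _ESCAPE_RUN.sub(decode_run, segment)


print(down_convert("🐮"))          # %F0%9F%90%AE
print(up_convert("%F0%9F%90%AE"))  # 🐮
print(up_convert("a%3Fb%23c"))     # a%3Fb%23c (delimiters stay escaped)
```

Under this policy the up-converted form never gains a raw `?` or `#`, so it can be down-converted again without changing which resource it names.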
I see two issues here:
The behavior in the example and noted by @glyph above looks like a bug to me, and should raise an exception if I recently ran into a similar issue in w3lib's url canonicalizer (formerly scrapy's). their existing function standardizes a There is room for debate on how to handle the various sub-delims that can either appear in the path or be percent encoded -- but A use case for the sub-delims is here:
But what happens about this URL?
You can canonicalize it for fingerprinting as:
But that has a fundamentally different meaning to a server than:
And for fun, how should this be handled?
IMHO, the safest/smartest bit is to require all input to be percent-encoded FOR THE SUB-DELIMS. You can keep a display version of the 'unescaped', but the only way to ensure you know what a URL intends is to be explicit on the construction. edit: added "for the sub-delims" in the last paragraph. |
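The example URLs from this comment were lost, so here is a hypothetical query string (not the one originally posted) that shows the same concern with the standard library: unescaping the sub-delims `&` and `=` changes what a server would parse.

```python
from urllib.parse import parse_qsl, unquote

# Hypothetical query string: one parameter whose value contains '&' and '='.
encoded = "q=a%26b%3Dc"

# A canonicalizer that blindly unescapes everything produces this instead.
naive = unquote(encoded)

print(naive)               # q=a&b=c
print(parse_qsl(encoded))  # [('q', 'a&b=c')]
print(parse_qsl(naive))    # [('q', 'a'), ('b', 'c')]
```

The two strings fingerprint to the same unescaped text, yet one carries a single parameter and the other carries two.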
Hey Jonathan! These are some good points and I see where you're coming from, but I recommend reviewing this bit of hyperlink design. To summarize here, Hyperlink is meant to represent both escaped and unescaped versions. Interpretation of the constituent parts is based on usage. If you call This contrasts with boltons' urlutils which fully decodes inputs, an approach which is less powerful because it doesn't allow for mixed representations (which are quite common in practice). There are bound to be some corner cases and I'm glad we're working through them. A lot of these scenarios don't seem to have right answers, but I'm still confident with a bit more thought we will arrive at some pleasantly intuitive behaviors :) |
I understand your points. The points I'm trying to convey are a bit nuanced and are not coming across, so I'll try again (if you'll put up with me). I've had my head buried in these RFCs and edge-cases for the past month, so I've got a few of these concepts dancing in my head and keeping the right words from forming.
I'm not talking about creating a url object from a full url string, but specifically using the keyword 'path' argument to the constructor (as in the example above). The 'human'/browser-bar representation will always have these two elements encoded as part of the path, because they would signify other components to the URL structure if decoded and necessarily point to a resource that is different. For example, consider this URI
The path must always be the following, and can never be decoded in a browser bar:
Unescaping the
The same holds true for
Using the examples in the docs, these two URLs will essentially always mean the same thing - so you can easily change context between them without concern. It doesn't matter how this is represented.
However, the RFC sub-delims and the percent sign are not guaranteed to be the same... !$&'()*+,;= I don't have an answer or a proposal for those characters. That's where the bulk of edge cases and exploration will be -- as there is a lot of mixed usage in there. IMHO, the input should be explicitly marked as encoded or not -- because they can shift the 'meaning' and intent of the URL. It looks like the library is doing the right thing by those characters and leaving them as-is / not recoding them. |
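To make the `?` and `#` point from this comment concrete, here is a small standard-library sketch. The URL is hypothetical (the one originally posted did not survive), and `urlsplit` stands in for any conforming parser.

```python
from urllib.parse import unquote, urlsplit

# Hypothetical URI whose path contains an escaped '?' and an escaped '#'.
escaped = "http://example.com/a%3Fb%23c"
unescaped = unquote(escaped)  # http://example.com/a?b#c

e = urlsplit(escaped)
u = urlsplit(unescaped)

print(e.path, repr(e.query), repr(e.fragment))  # /a%3Fb%23c '' ''
print(u.path, repr(u.query), repr(u.fragment))  # /a 'b' 'c'
```

Once the two characters are unescaped, the parser sees a shorter path plus a query and a fragment, so the text now points at a different resource.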
I see what you mean, and yes, enhancing the constructor behavior is seeming more and more like the way to go forward. I mentioned that over here, and I'm glad someone else with their head in the RFCs is kind of independently feeling that angle, too. |
I'm still bothered by the fact that the I think @glyph is saying that it's desirable for str(URL.fromText(X)) to always give you back X and not a normalized form of X. I think that's nuts, because (a) as @glyph said, it's not possible and (b) this bug is the result of trying to do the impossible. If you want the original string, save it somewhere; this isn't a string object. |
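A rough illustration of the "save it somewhere" suggestion, using the standard library rather than hyperlink. `SourcedURL` is a hypothetical wrapper invented for this sketch: it compares by parsed parts while keeping the original string alongside, since parsing and re-serializing is not guaranteed to reproduce the input.

```python
from urllib.parse import urlsplit, urlunsplit

original = "HTTP://example.com/a"
rebuilt = urlunsplit(urlsplit(original))

print(rebuilt)              # http://example.com/a (scheme was lowercased)
print(rebuilt == original)  # False


class SourcedURL:
    """Hypothetical wrapper: normalized parts plus the text they came from."""

    def __init__(self, text: str) -> None:
        self.source = text           # exactly what the caller passed in
        self.parts = urlsplit(text)  # normalized, used for comparison

    def __eq__(self, other: object) -> bool:
        return isinstance(other, SourcedURL) and self.parts == other.parts

    def __hash__(self) -> int:
        return hash(self.parts)


a = SourcedURL("HTTP://example.com/a")
b = SourcedURL("http://example.com/a")
print(a == b)    # True
print(a.source)  # HTTP://example.com/a
```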
I guess what I mean is that if |
The problem is that "encoded" and "unencoded" are ambiguous terms here. There are also representations of mixed validity, like |
You can also have validation failures going in the other direction, with un-representable / invalid IDNA names in the domain part. |
Is part of the problem here that we're using the same object for URI and IRI? Seems they have different rules, so maybe they shouldn't be the same… |
@wsanchez URIs and IRIs are both URLs. Because mixed URLs occur in practice, it's not just the rules that overlap. I think it's quite useful and powerful to pursue this hybrid approach. |
In these cases of mixed-validity, an input of So I would interpret these as distinct resources:
|
So, putting aside the design decision to support mixed encoding states, I think I have just committed a solution that fixes this issue (as well as #8 and countless other potential problems) while staying true to URL's unified design. In short, I add a bunch of checking for delimiters to various parameters. Some characters simply must always be encoded in some contexts, a fact that is made clear from the discussion above, not to mention the Let me know if you have any thoughts! |
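The kind of delimiter checking described above might look roughly like the following. This is a sketch, not the committed change: `check_part` and the `FORBIDDEN` table are hypothetical, and the exact character sets per part are assumptions.

```python
# Delimiters that would change a URL's structure if they appeared raw in a
# given part. These sets are assumptions made for the sketch.
FORBIDDEN = {
    "path segment": frozenset("/?#"),
    "query key": frozenset("&=#"),
    "query value": frozenset("&#"),
}


def check_part(kind: str, value: str) -> str:
    """Return value unchanged, or raise if it holds a raw delimiter."""
    bad = sorted(FORBIDDEN[kind].intersection(value))
    if bad:
        raise ValueError(
            f"{kind} {value!r} contains unencoded delimiter(s): {''.join(bad)}"
        )
    return value


print(check_part("path segment", "a%3Fb"))  # a%3Fb

try:
    check_part("path segment", "a?b")
except ValueError as exc:
    print(exc)
```

With a check like this, the raw and percent-encoded spellings can no longer both be constructed, so two unequal objects cannot render to the same text through that route.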
How do I add a URL parameter that contains
someURL = URL()
someURL.add(u'parameter', u'#value').to_text()
because it fails with:
But with older versions I can do:
someURL.add(u'parameter', u'#value').to_text()
# u'?parameter=%23value' |
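For reference, the percent-encoded form that the older version produced can be computed with the standard library. This sketch does not call hyperlink at all; whether its `add` should accept the raw or the pre-encoded value is the question the linked issues cover.

```python
from urllib.parse import quote, urlencode

value = "#value"

# Percent-encode the value on its own, escaping every reserved character.
pre_encoded = quote(value, safe="")
print(pre_encoded)  # %23value

# Or build the whole query string in one step.
query = "?" + urlencode({"parameter": value})
print(query)        # ?parameter=%23value
```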
@markrwilliams See #44, #11 . |
I can create two URLs that render as the same text but are not equal: