DecodedURL #54

Merged
mahmoud merged 28 commits into master from i44_decoded_url
Jan 7, 2018

Conversation

mahmoud
Member

@mahmoud mahmoud commented Dec 31, 2017

This still needs some documentation, but to address #44 and the handful of issues surrounding the central problem, I present DecodedURL. It takes care of handling reserved characters so you don't have to.

From the docstring:

DecodedURL is a type meant to act as a higher-level interface to the URL. It is the unicode to URL's bytes. DecodedURL has almost exactly the same API as URL, but everything going in and out is in its maximally decoded state. All percent decoding is handled automatically.

Where applicable, a UTF-8 encoding is presumed. Be advised that some interactions, can raise UnicodeEncodeErrors and UnicodeDecodeErrors, just like when working with bytestrings.

Examples of such interactions include handling query strings encoding binary data, and paths containing segments with special characters encoded with codecs other than UTF-8.

It's tested, works, and seems practical, though, so take a look!
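
The decoding caveat quoted from the docstring can be reproduced without hyperlink at all. The following is a minimal sketch that uses the standard library's `urllib.parse.unquote` as a stand-in; it is not hyperlink's decoder, and `DecodedURL` may surface the error at a different point. The segment `caf%E9` is an assumed example of text that was percent-encoded from Latin-1 rather than UTF-8.

```python
from urllib.parse import unquote

# "café" encoded as Latin-1 puts the single byte 0xE9 on the wire.
latin1_segment = "caf%E9"

# Strict UTF-8 decoding rejects that byte, which is the kind of failure
# the docstring warns about.
try:
    unquote(latin1_segment, encoding="utf-8", errors="strict")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Naming the codec that was actually used recovers the text.
recovered = unquote(latin1_segment, encoding="latin-1")

# The same text percent-encoded from UTF-8 decodes cleanly.
utf8_ok = unquote("caf%C3%A9", encoding="utf-8", errors="strict")
```

Note that `unquote` defaults to `errors="replace"`, so the strict mode has to be requested explicitly to see the exception.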

…der for any reserved characters, and used it in .child() and .sibling(), will add it in further methods shortly
…obably have the issue with not-yet-normalized query parameter names (mixed decoded and encoded query parameter names that overlap).
…ation (.append, etc.). Also add and test __eq__, __ne__, and __hash__
…RLs of some complexity. Had to add userinfo to URL.normalize() to help with equality checks.

codecov-io commented Dec 31, 2017

Codecov Report

Merging #54 into master will increase coverage by 0.13%.
The diff coverage is 98.59%.

@@            Coverage Diff             @@
##           master      #54      +/-   ##
==========================================
+ Coverage    97.8%   97.94%   +0.13%     
==========================================
  Files           6        8       +2     
  Lines        1137     1408     +271     
  Branches      137      164      +27     
==========================================
+ Hits         1112     1379     +267     
- Misses         13       14       +1     
- Partials       12       15       +3
| Impacted Files | Coverage Δ |
| --- | --- |
| hyperlink/test/test_parse.py | 100% <100%> (ø) |
| hyperlink/test/test_decoded_url.py | 100% <100%> (ø) |
| hyperlink/__init__.py | 100% <100%> (ø) ⬆️ |
| hyperlink/_url.py | 96.1% <97.57%> (+0.37%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 17dc8d3...e8616fa. Read the comment docs.

_UNRESERVED_DECODE_MAP = dict([(k, v) for k, v in _HEX_CHAR_MAP.items()
                               if v.decode('ascii', 'replace')
                               in _UNRESERVED_CHARS])

_ROOT_PATHS = frozenset(((), (u'',)))


def _encode_reserved(text, maximal=True):
    """A very comprehensive percent encoding for encoding all
    delimeters. Used for arguments to DecodedURL, where a % means a
Contributor

nit: "delimiters"

Member Author

Oooh, this kind of typo is so very unlike me, I assure you! ;) Thanks!

        return not self.__eq__(other)

    def __hash__(self):
        return hash((self.__class__, self.scheme, self.userinfo, self.host,
Contributor

minor: Since equality is delegated upwards to URL, would it be better to do the same for hashing?

e.g.

def __hash__(self):
    return hash(self.normalize().to_uri())

Member Author

That's a good point. I may have subtly changed my mind midway through development, and now I lean more toward not delegating up to URL for equality. I'll get that fixed, thanks!

Collaborator

Whenever you're delegating hash like this, remember you should also include a tweak, so that the DecodedURL and the URL representing the same data don't hash the same.


@glyph why is that? (Correctness shouldn't be affected, since dictionaries compare by equality, and I'm not entirely sure what performance problem you're trying to forestall.)

Member

I share @moshez's curiosity.

@mahmoud - will you follow @alexwlchan's advice about delegating __hash__?
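
To make the two positions in this thread concrete, here is a minimal sketch with hypothetical stand-in classes (`EncodedThing` and `DecodedThing` are not hyperlink types). It assumes equality is type-strict, which is an assumption about the intended design. Under that assumption a wrapper that delegates `__hash__` unchanged still behaves correctly in a dict, because lookups fall back to `__eq__`; the tweak only keeps the two objects out of the same hash bucket.

```python
class EncodedThing:
    """Hypothetical stand-in for URL: holds an encoded text form."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        if not isinstance(other, EncodedThing):
            return NotImplemented
        return self.text == other.text

    def __hash__(self):
        return hash(self.text)


class DecodedThing:
    """Hypothetical stand-in for DecodedURL: wraps an EncodedThing."""

    def __init__(self, inner):
        self._inner = inner

    def __eq__(self, other):
        if not isinstance(other, DecodedThing):
            return NotImplemented
        return self._inner == other._inner

    def __hash__(self):
        # Delegate to the wrapped object, but mix in the class as a tweak
        # so the wrapper and the wrapped value do not share a hash.
        return hash((self.__class__, self._inner))


class UntweakedDecodedThing(DecodedThing):
    """Same wrapper, but hashing is delegated with no tweak."""

    def __hash__(self):
        return hash(self._inner)


enc = EncodedThing("http://example.com/")
tweaked = DecodedThing(enc)
untweaked = UntweakedDecodedThing(enc)

# Both wrappers stay distinct from the wrapped value as dict keys.
with_tweak = {enc: "encoded", tweaked: "decoded"}
without_tweak = {enc: "encoded", untweaked: "decoded"}
```

The tuple in the diff above already includes `self.__class__`, which plays the same role as the tweak in `DecodedThing.__hash__` here.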


__all__ = [
    "URL",
    "DecodedURL",
Collaborator

I want to say that we should just avoid exposing this entirely, but it probably needs to be exposed for type annotations.

However, as per #44, could we have a "decoded" property on URL that provides an interface to this, and an "encoded" property on DecodedURL that maps back to a URL?

(At this point I think I'm in favor of adding an EncodedURL alias for URL, then maybe adding a top-level entry point like hyperlink.parse() which takes a decoded kwarg flag which defaults to true, to make it easier to get started with DecodedURL which is what I think we all want most of the time. That can definitely be deferred to a separate ticket though.)

Member Author

Yes, I am pretty much in favor of all those conveniences :) And I also agree that this probably needs to be exposed.
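
The shape of the conveniences proposed above might look roughly like the sketch below. Everything in it is hypothetical: `EncodedPath`, `DecodedPath`, and this `parse()` are toy stand-ins that handle a single path string, not hyperlink's classes or its eventual API, and the standard library's `unquote` stands in for the real decoding.

```python
from urllib.parse import unquote


class EncodedPath:
    """Hypothetical stand-in for URL / EncodedURL."""

    def __init__(self, text):
        self.text = text

    @property
    def decoded(self):
        # Interface to the decoded view, as suggested in the comment above.
        return DecodedPath(self)


class DecodedPath:
    """Hypothetical stand-in for DecodedURL."""

    def __init__(self, encoded):
        self._encoded = encoded

    @property
    def encoded(self):
        # Maps back to the encoded object this view wraps.
        return self._encoded

    @property
    def text(self):
        return unquote(self._encoded.text, errors="strict")


def parse(text, decoded=True):
    """Toy entry point: returns the decoded view unless told otherwise."""
    encoded = EncodedPath(text)
    return encoded.decoded if decoded else encoded


friendly = parse("a%20nice%20path")
raw = parse("a%20nice%20path", decoded=False)
```

With `decoded` defaulting to true, a caller who just wants text gets `friendly.text`, and `friendly.encoded` is the escape hatch back to the wire form.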

mahmoud and others added 6 commits December 31, 2017 21:29
…vel API. fix a bug with userinfo where double-escapes were possible because % wasn't marked as safe. add lazy option to DecodedURL, also exposed in parse and URL's new get_decoded_url() method. add a few more notes and comment tweaks
…verage-oriented tests. Remove the password attributes of DecodedURL pending future discussion.
Member

@markrwilliams markrwilliams left a comment

The approach seems good and it addresses my question:

>>> from hyperlink import DecodedURL, URL
>>> d = DecodedURL(URL())
>>> d.add(u'parameter', u'#value').to_text()
u'?parameter=%23value'

I think this means we should use DecodedURL in twisted.python.url. That's great!

I've left comments about documentation improvements and lingering TODOs. I'd also like to see clarity around __hash__. Please address these issues with changes or PR comments before merging.
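
The behavior in the doctest above can be compared with what the standard library does. This is a minimal sketch using `urllib.parse` only, as an analogy rather than a description of hyperlink's internals; the `http://example.com/` URL is an assumed example. It shows why a literal `#` in a query value has to be escaped before it is placed in a URL.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Encoding the pair escapes the reserved "#" in the value.
query = urlencode([("parameter", "#value")])

# Decoding the query string recovers the original pair.
roundtrip = parse_qsl(query)

# Pasting the raw value into a URL instead lets "#" start the fragment,
# so the value silently leaves the query.
naive = urlsplit("http://example.com/?parameter=#value")
safe = urlsplit("http://example.com/?" + query)
```

Here `query` matches the `parameter=%23value` text shown in the doctest, while `naive.query` ends up with an empty value and the intended value lands in `naive.fragment`.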

@@ -1040,7 +1083,7 @@ def from_text(cls, text):
                    rooted, userinfo, uses_netloc)

     def normalize(self, scheme=True, host=True, path=True, query=True,
-                  fragment=True):
+                  fragment=True, userinfo=True):
Member

The docstring should mention userinfo.

There's a TODO for userinfo. Should that go?
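
As background on why normalizing userinfo matters for equality, here is a minimal sketch of the percent-encoding normalization described in RFC 3986 section 6.2.2. The helper `normalize_percent_encoding` is hypothetical and is not hyperlink's `normalize()`; it only decodes escapes of unreserved characters and uppercases the hex digits of the rest.

```python
import re
import string

# Unreserved characters per RFC 3986: letters, digits, "-", ".", "_", "~".
_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def normalize_percent_encoding(text):
    """Decode escapes of unreserved characters; uppercase all other escapes."""

    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in _UNRESERVED:
            return char
        return "%" + match.group(1).upper()

    return re.sub(r"%([0-9A-Fa-f]{2})", fix, text)


# Two spellings of the same userinfo compare equal once normalized.
spelled_out = normalize_percent_encoding("%75%73%65%72")
plain = normalize_percent_encoding("user")

# Escapes that are not unreserved characters stay escaped.
kept = normalize_percent_encoding("a%3ab%00")
```

Without a step like this, `%75%73%65%72` and `user` would be treated as different userinfo values even though they identify the same user.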

handled automatically.

Where applicable, a UTF-8 encoding is presumed. Be advised that
some interactions, can raise UnicodeEncodeErrors and
Member

comma splice:

some interactions , can raise UnicodeEncodeErrors...

Also, maybe you want backticks around UnicodeEncodeErrors and UnicodeDecodeErrors.

UnicodeDecodeErrors, just like when working with
bytestrings.

Examples of such interactions include handling query strings
Member

Should this be part of the previous paragraph?

    encoding binary data, and paths containing segments with special
    characters encoded with codecs other than UTF-8.
    """
    def __init__(self, url, lazy=False):
Member

If this is a public class, then its initializer should be documented. Please include an explanation of url and lazy in the docstring.
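
One way to picture what a `lazy` flag could mean for an initializer like this is sketched below. It assumes that the eager default decodes up front so invalid input fails at construction, while `lazy=True` defers that check to first use; that reading is an assumption, not something the diff shown here confirms. `DecodedFragment` is a hypothetical toy class, and `unquote` stands in for the real decoding.

```python
from urllib.parse import unquote


class DecodedFragment:
    """Hypothetical wrapper around one percent-encoded string."""

    def __init__(self, encoded, lazy=False):
        self._encoded = encoded
        if not lazy:
            # Eager mode: decode once now so bad input fails immediately.
            self.value

    @property
    def value(self):
        # Strict decoding raises UnicodeDecodeError on invalid UTF-8.
        return unquote(self._encoded, errors="strict")


good = DecodedFragment("fr%C3%A9g")

# The byte 0xFF never appears in valid UTF-8.
try:
    DecodedFragment("%FF")
    eager_failed = False
except UnicodeDecodeError:
    eager_failed = True

# Lazy construction succeeds; the error is deferred to attribute access.
deferred = DecodedFragment("%FF", lazy=True)
try:
    deferred.value
    lazy_failed_on_access = False
except UnicodeDecodeError:
    lazy_failed_on_access = True
```

A docstring for the real initializer would need to say which of these behaviors callers can rely on.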


    def click(self, href=u''):
        "Return a new DecodedURL wrapping the result of :meth:`~hyperlink.URL.click()`"
        return type(self)(self._url.click(href=href))
Member

Why not self.__class__?

durl = durl.set(' ', 'spa%ed')
assert durl.get(' ') == ['spa%ed']

durl = DecodedURL(url=durl.to_uri())
Member

What's the point of this? Round tripping?
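
If round tripping is the intent, the property under test can be sketched with the standard library. This is an analogy using `urllib.parse.quote` and `unquote`, not hyperlink's code. The value `spa%ed` is a useful probe because `%ed` looks like a percent escape.

```python
from urllib.parse import quote, unquote

original = "spa%ed"

# Encoding escapes the literal "%" so it cannot be mistaken for an escape.
encoded = quote(original, safe="")

# Decoding the encoded form gives back exactly what went in.
decoded = unquote(encoded, errors="strict")

# Treating the raw value as if it were already encoded corrupts it,
# because "%ed" is read as the byte 0xED.
naive = unquote(original)
```

A round trip through the encoded form is what guarantees that `decoded == original`, whereas `naive` no longer matches the input.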


        assert durl.set('arg', 'd').get('arg') == ['d']

    def test_equivalences(self):
Member

This really tests __eq__ and __hash__, so maybe test_equality_and_hashability?


durl_map = {}
durl_map[durl] = durl
durl_map[durl2] = durl2
Member

This should also test that a URL and a DecodedURL that represent the same underlying URL don't overlap in a dict (or set.)


        assert len(durl_map) == 1

    def test_replace_roundtrip(self):
Member

This is good, but there should also be a roundtrip test between TOTAL_URL and DecodedURL.


BASIC_URL = 'http://example.com/#'
TOTAL_URL = "https://%75%73%65%72:%00%00%00%[email protected]:8080/a/nice%20nice/./path/?zot=23%25&zut#frég"
UNDECODABLE_FRAG_URL = TOTAL_URL + '%C3'
Member

It'd be good to note that this is undecodable because %C3 makes it invalid UTF-8.
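
The reason `%C3` is undecodable can be shown at the byte level. In UTF-8, 0xC3 is the lead byte of a two-byte sequence, so on its own at the end of the input it is incomplete. A minimal sketch, independent of hyperlink:

```python
# 0xC3 followed by 0xA9 is the complete UTF-8 encoding of "é".
complete = b"\xc3\xa9".decode("utf-8")

# 0xC3 alone promises a continuation byte that never arrives.
try:
    b"\xc3".decode("utf-8")
    dangling_lead_byte_failed = False
except UnicodeDecodeError:
    dangling_lead_byte_failed = True
```

Appending `%C3` to a URL therefore leaves a dangling lead byte after percent-decoding, which is what makes the fragment fail strict UTF-8 decoding.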

@mahmoud mahmoud changed the title from "WIP: DecodedURL" to "DecodedURL" Jan 7, 2018
@mahmoud mahmoud merged commit a23a1a4 into master Jan 7, 2018
@glyph glyph deleted the i44_decoded_url branch January 23, 2018 00:15
6 participants