Input format of idn-email and idn-hostname (and presumably irn and irn-reference, too) #247
Comments
I skipped these built-ins for kdl-py partially because I wasn't sure how to parse them, but reviewing https://datatracker.ietf.org/doc/html/rfc5890, I think the most reasonable answer is that it should allow both A-labels (all ASCII, with the xn-- prefix) and U-labels (Unicode in NFC, must contain at least one non-ASCII character). If your lang doesn't have a built-in for this type (which, uh, none that I know of do), I'd encourage your result class to support either form in the output. Meanwhile, the
I've been looking into the email RFCs and, uh, I can't make heads or tails of them. I presume they're saying that email vs idn-email use the same distinction for the part after the @, but I haven't a clue about the part before.
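A minimal sketch of "accept either form, expose both", assuming Python and its standard-library `idna` codec. Note that this codec implements the older IDNA 2003 rules, not the IDNA 2008 rules of RFC 5890, so it only approximates what is described above. The function names `classify_label` and `hostname_forms` are made up for illustration and are not part of kdl-py.

```python
import unicodedata


def classify_label(label: str) -> str:
    """Classify one hostname label as 'a-label', 'u-label', or plain 'ascii'."""
    if label.isascii():
        # A-labels are all ASCII and carry the xn-- prefix.
        return "a-label" if label.lower().startswith("xn--") else "ascii"
    # U-labels contain at least one non-ASCII character and must be in NFC.
    if unicodedata.normalize("NFC", label) != label:
        raise ValueError(f"label is not in NFC: {label!r}")
    return "u-label"


def hostname_forms(hostname: str) -> tuple[str, str]:
    """Return (ascii_form, unicode_form) for a hostname given in either form.

    Uses the stdlib 'idna' codec (IDNA 2003), which is an assumption of this
    sketch; a stricter implementation would follow IDNA 2008.
    """
    ascii_form = hostname.encode("idna").decode("ascii")
    unicode_form = ascii_form.encode("ascii").decode("idna")
    return ascii_form, unicode_form


if __name__ == "__main__":
    # Both spellings of the same name yield the same pair of forms.
    print(hostname_forms("b\u00fccher.example"))
    print(hostname_forms("xn--bcher-kva.example"))
```

A result class could store both members of the pair, so callers can pick the ASCII form for the wire and the Unicode form for display.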
The local part is pretty much up to the interpretation of the email server and should be forwarded "as-is" (as a UTF-8 string). Theoretically there are some syntax limitations, but a lot more things are allowed than people realize:
There is no (relevant) standard for encoding UTF-8 local parts into US-ASCII. Some servers did support Punycode in local parts, but that is non-standard and is equivalent to a mail alias, in a similar way of how for some, but not all, mail servers
Furthermore, as far as I remember, servers don't have to treat
In practice things are also much messier: quite a bunch of software has arbitrary, non-standards-conforming limitations on the local part, for example:
Additionally, for the host name any syntactically valid domain is an option, but converting from/to Punycode on the fly can lead to problems in in-transit edge cases. It shouldn't be done there, but it should be done when displaying the domain to users. Also, while mail addresses on a top-level domain are rare, they do exist.
Lastly, mail theoretically allows a host part which is not a domain, e.g. you can send a mail to an IP address. This was used in some data-center internal networking systems, but has lost most(?) relevance today, so it's often fine to ignore that syntax.
All in all, for many systems it's best to treat mail addresses as an opaque string, where you might check whether the part after the last @ is a syntactically valid domain and might Punycode-decode it for displaying it to users (but only then). And then do mail validation like always, i.e. send a "click this link to validate mail" link.
Hope that helps.
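The "opaque string, decode only for display" advice could be sketched roughly like this in Python. The helper names `split_address` and `display_form` are hypothetical, the stdlib `idna` codec used here follows IDNA 2003 rather than IDNA 2008, and no check of the domain syntax is attempted; the stored or transmitted address is never modified.

```python
def split_address(address: str) -> tuple[str, str]:
    """Split at the LAST '@'; the local part is kept verbatim (it may contain '@')."""
    local, sep, host = address.rpartition("@")
    if not sep or not local or not host:
        raise ValueError(f"not a mail address: {address!r}")
    return local, host


def display_form(address: str) -> str:
    """Return a user-facing rendering with a Punycode-decoded host part.

    Only for display: keep using the original string for storage and transit.
    """
    local, host = split_address(address)
    try:
        # Assumption of this sketch: the stdlib 'idna' codec (IDNA 2003).
        shown_host = host.encode("ascii").decode("idna")
    except UnicodeError:
        # Host is already Unicode or not decodable; show it unchanged.
        shown_host = host
    return f"{local}@{shown_host}"


if __name__ == "__main__":
    print(display_form("user@xn--bcher-kva.example"))
```

Splitting at the last @ rather than the first keeps quoted local parts such as `"a@b"` intact, which matches treating the local part as the server's business.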
So yes: Edit: Or |
The relevant RFCs are:
Also, the mail syntax used by MIME (the email payload format) theoretically allows even more unusual edge cases no one cares about. But any mail address which isn't compatible with SMTP can't be sent to, so it doesn't matter.
I'm wondering, what is the expected input for any of the internationalized email/hostname/url types?
I'm busy building the type annotation parsing for kdl-rb and I'm not really sure how to go about this.
I started with taking idn-hostname in as the punycode format (e.g. xn--9ckb.com) and converting it to unicode, but then when I got to email, it seems the local part is kept as-is, regardless of the domain. So I'm wondering, which of the following would we expect to see in a document?
ASCII format?
or Unicode format?