RE: "International" email addresses [I18N-ACTION-374]


Phillips, Addison-2

Dear Steven and XForms,

 

Firstly, the WG *very much* welcomes further discussion from any and all on this list: this is how we find stuff out. (Thanks to Anne, JcK, Jungshik, and Shawn for contributions so far)

 

This is just a note to let you know that the Internationalization WG has taken up a discussion of this topic, which has, obviously, some interesting issues associated with it. We’re aware that, although “EAI” (email address internationalization) has been slow to mature and gain traction, there are serious efforts from vendors and in various countries to bring non-ASCII mail addresses into the mainstream.

 

This doesn’t play well with the current description in HTML (cited by Anne) or various other places. As Shawn and John note, a regex description of IDNA is probably impossible. At best, such a regex would be an approximation.

 

The Internationalization WG is creating a discussion page to capture the issues [1]. We have not had a chance to discuss the issue in greater depth yet, but the WG’s consensus is that this is an interesting problem needing further investigation and documentation. Please note that, owing to the Thanksgiving holiday in the USA, the Internationalization WG is unlikely to make much more of a response for a couple of weeks.

 

Regards (for I18N),

 

Addison

 

[1] https://www.w3.org/International/wiki/EAI_Address_Issues

 

 

From: Shawn Steele [mailto:[hidden email]]
Sent: Wednesday, November 19, 2014 11:37 AM
To: Jungshik SHIN (신정식)
Cc: Anne van Kesteren; Steven Pemberton; [hidden email]; Forms WG
Subject: RE: "International" email addresses

 

Validating the IDN part is much more complicated than validating the local part, because you need to know the IDN rules.  Which means it probably isn’t just a “simple” regex. 

 

So maybe the rule should allow Unicode in the domain part and encourage complete IDN validation as an additional step?

 

-Shawn

 

From: [hidden email] [[hidden email]] On Behalf Of Jungshik SHIN (신정식)
Sent: Wednesday, November 19, 2014 10:53 AM
To: Shawn Steele
Cc: Anne van Kesteren; Steven Pemberton; [hidden email]; Forms WG
Subject: Re: "International" email addresses

 

https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489 deals with it (EAI support in email form validation) although the summary is a bit misleading (it only talks about IDN). 

 

Jungshik

 

On Wed, Nov 19, 2014 at 10:07 AM, Shawn Steele <[hidden email]> wrote:

Updating that to support EAI would be good.


-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of Anne van Kesteren
Sent: Wednesday, November 19, 2014 2:07 AM
To: Steven Pemberton
Cc: [hidden email]; Forms WG
Subject: Re: "International" email addresses

On Wed, Nov 19, 2014 at 11:00 AM, Steven Pemberton <[hidden email]> wrote:
> So as far as I can see, an internationalised email address is:
>
>  address: atom-list "@" atom-list.
>  atom-list: atom ( "." atom )*
>  atom: C+
>  C: any character in the world EXCEPT (),.:;<>@[\]
>
> a) Do you agree?
> b) It was really hard to find this out. The internet is rife with
> people asking and getting bad answers. Please help the internet by
> being definitive.

I recommend matching HTML's definition:

https://html.spec.whatwg.org/multipage/forms.html#valid-e-mail-address


--
https://annevankesteren.nl/

 


Re: "International" email addresses [I18N-ACTION-374]

Anne van Kesteren-4
On Thu, Nov 20, 2014 at 5:37 PM, Phillips, Addison <[hidden email]> wrote:
> This doesn’t play well with the current description in HTML (cited by Anne)
> or various other places.

Just to be clear, note that HTML does support them as *input* (the UI
done by the UA), it's just that it expects that to be translated to
ASCII. This is not so different from how we deal with this situation
when it comes to URLs.


--
https://annevankesteren.nl/


Re: "International" email addresses [I18N-ACTION-374]

Mark Davis ☕
In reply to this post by Phillips, Addison-2
The HTML spec already has the definition of a domain name, so that can be reused without a problem, IMO. Not a lot of work necessary to fix that.

The only change requiring a bit of work is the local-part. For that, I tend to agree with Anne that the EAI spec is overly broad (for compatibility's sake), and that the HTML spec can be somewhat tighter.



— Il meglio è l’inimico del bene —


Re: "International" email addresses [I18N-ACTION-374]

Asmus Freytag (c)
In reply to this post by Phillips, Addison-2
On 11/20/2014 8:37 AM, Phillips, Addison wrote:
As Shawn and John note, a regex description of IDNA is probably impossible. At best, such a regex would be an approximation.

 


The problem is not that it's impossible to do a rigorous description (essentially a regex) of the IDN rules for a given zone, but that the description varies along the tree, and that the knowledge of the rules that apply at each level is imperfect.

As I mentioned, there's an effort underway to define an XML format that allows one to capture any known descriptions in (essentially) a regex-like format expressed in XML that can be parsed and evaluated by a common engine.

If/when IANA's registry gets converted to this format, you should be able to do IDN validation, down to the second level at least, to any level of desired accuracy by querying the correct tables (or to build approximate regexes with known degrees of accuracy, because you could then test them against any published full specifications).

Anyway, you can find a draft here: https://datatracker.ietf.org/doc/draft-davies-idntables/

A./





RE: "International" email addresses [I18N-ACTION-374]

Shawn Steele
In reply to this post by Anne van Kesteren-4
> Just to be clear, note that HTML does support them as *input* (the UI done by the UA), it's just that it expects that to be translated to ASCII. This is not so different from how we deal with this situation when it comes to URLs.

ASCII isn't supported for EAI, and it's typically preferred to keep Domain Names in Unicode except for that pesky resolving step, where they must be punycoded.

RE: "International" email addresses [I18N-ACTION-374]

Shawn Steele
In reply to this post by Asmus Freytag (c)

Personally I don’t much see the point.  If it resolves it’s valid.  If it doesn’t then most apps could care less if it’s well formed.

 

There are a very few applications that actually assign these things, and those certainly need to be able to understand the rules of their domains, but I’m not sure that’s a general problem.

 

-Shawn

 


Re: "International" email addresses [I18N-ACTION-374]

John C Klensin-4
In reply to this post by Asmus Freytag (c)


--On Thursday, November 20, 2014 09:05 -0800 Asmus Freytag
<[hidden email]> wrote:

> On 11/20/2014 8:37 AM, Phillips, Addison wrote:
>> As Shawn and John note, a regex description of IDNA is
>> probably  impossible. At best, such a regex would be an
>> approximation.
>>
>>
> The problem is not that it's impossible to do a rigorous
> description (essentially a regex) of the IDN rules for a given
> zone, but that the description varies along the tree, and that
> the knowledge of the rules that apply at each level is
> imperfect.
>...

Asmus,

(Reluctantly putting on four separate virtual hats here: editor
of RFC 5321 (SMTP), co-author of RFC 6055 (the IAB Domain Name
Encoding spec);  co-editor of RFC 5890, 5891, and 5894 (the
framework and definitions specification,  protocol
specification, and background and rationale document for
IDNA2008) and contributor to most of the rest of the IDNA
documents; and EAI WG co-chair and co-editor of RFC 6530 (the
overview and definitions document for the EAI work) and
contributor to most of the other email address and header
internationalization documents.)

Basically I agree with the above, but I would have said "but, in
addition..." rather than "but that".

To be less cryptic and in the hope of not having the discussion
deteriorate into the confusion that has characterized parts of
the web-email interface for the last 15 or 20 years, there are
several separate issues, and they are not just about levels.  I
know you know most, if not all, of this but, in the hope of
drawing things together:

(1) The IDNA specifications (RFCs 5890ff) provide a set of
processing rules that, together, define the protocol-level
validity of IDN labels.    There was never any intent that those
rules be completely described by syntax alone and, despite the
XML effort you describe (excerpt quoted below), I don't believe
a complete syntax-based description is possible.  In addition,
while the rules are intended to be constant across versions of
Unicode, the list of permitted code points change with Unicode
versions and, because per-version exception processing in the
IETF is allowed for, the relationship is expected to be mostly
deterministic but may not be entirely so.   "Protocol-level
validity" effectively determines only the set of labels (and
code points, but the two are not the same) that cannot be used
in any zone.  They do not determine what labels _can_ be used in
a zone.  Per-zone criteria are required to specify only labels
that IDNA2008 allows on a protocol basis, but zones are expected
to control their own subsets and label repertoire.  That
expectation is an explicit requirement of IDNA2008.  

IDNA2008 also requires some very specific validity checks of any
applications doing DNS lookups.  It would be, IMO, unwise to not
follow and enforce those requirements.  But, as discussed in
more detail below, there are many possible labels, and even more
fully-qualified domain names, that are valid under IDNA2008 but
not valid in practice (whether actually registered or not).

(2) At the TLD level (entries in the root zone), ICANN processes
control the valid names.  The rules are expected to be very
conservative to, among other things, eliminate any plausible
chance of confusion among names.  At present, the decisions
about what is permitted and what is not are controlled by two
separate processes --one for ccTLD and one for new gTLDs-- that
use different methods and criteria.  A new system has been
developed for, at least, new gTLDs (I do not believe it is yet
clear how it will apply to country-based TLDs), but it is
not applicable to existing TLDs or applications now in the
system and has therefore not been tested in practice.

(3) At the second level (names appearing within TLDs), policies
differ by zone, with a range from "no IDN labels" to "almost any
IDN label allowed by IDNA2008" with "only specific characters
drawn from specific scripts or languages" lying somewhere in
between, with some zones applying additional restrictions
prohibiting specific strings and types of strings (just as has
been the case for non-IDN labels).  The long time ICANN policy
has been that each zone is allowed to develop its own rules
although they are invited (sometimes expected) to report the
characters they allow to IANA.  As I understand it, the new
ideas represented by:

> As I mentioned, there's an effort underway to define an XML
> format that allows one to capture any known descriptions in
> (essentially) a regex-like format expressed in XML that can be
> parsed and evaluated by a common engine.

are of importance to those second-level tables because it should
be a considerable improvement over simple lists of code points.
However, anyone expecting to use that format should understand
that, unless ICANN makes major changes in policy, using the
mechanism will require that one identify the TLD, identify the
second-level rule set associated with that TLD (if there is one)
and then apply that rule set.    I note in passing that the
ccTLD community has very strongly rejected ICANN's ability to
require that they submit such tables and even more strongly
rejected ICANN's authority to tell them what the registration
rules should be.   Also, if a zone prohibits particular labels
for moral, religious, aesthetic, political, or other reasons, it
is not clear whether any regex-like algorithm will be of
significant use in recording those rules (which also tend to be
moderately volatile).

> If/when IANA's registry gets converted to this format, you
> should be able to do IDN validation, down to the second level
> at least, to any level of desired accuracy by querying the
> correct tables (or able to build approximate regexes with
> known degrees of accuracy - because you could then test them
> against any published full specifications).
 
> Anyway, you find a draft here:
> https://datatracker.ietf.org/doc/draft-davies-idntables/

Indeed, subject to the comments above.  Note that, because there
is no requirement for uniform policies among TLDs, for
second-level domains one may well have to deal with at least a
thousand or two separate sets of rules in the near future.

(4) At the third level and below, labels are essentially open
season, constrained only by IDNA2008 and whatever sense of
propriety and user protection exists for a particular zone.
Even if individual zones were to publish their rules, we are
talking about many millions of zones, each potentially with its
own rules.  An SLD zone could try to restrict the names its
delegated zones use by contract, but such strategies have not
proven successful in the past, especially for subdomains of
those subdomains and below, and one could even argue that the
DNS was designed to make enforcement of such rules difficult.

So, it is possible to make syntax-based rules that will tell you
what domains (or labels) are clearly invalid.   One can push
those rules further by investing additional effort and
processing time in identifying a larger population of invalid
labels as invalid.    The one thing that, IMO, one should be
careful about is to not adopt rules, or extrapolate rules and
restrictions from one domain to another that would identify
perfectly valid and registered domain names as invalid.   We
have had that problem fairly extensively already due, for
example, to browsers, forms, or other software at the web-email
boundary deciding that such characters as "/", "+", and even "."
are invalid in email addresses.  Doing so makes legitimate
addresses inaccessible and causes a great deal of unhappiness
among users and those who use the web, especially those who
treat email addresses as personal identifiers.

As those who are concerned about making domains containing IDN
labels more consistently accessible are fond of pointing out
(somewhat unfairly given the history), we used to have a common
practice of applications "knowing" all the TLD names or at least
the rules by which TLD names were formed.  When new TLDs, and
TLDs not conforming to those historical naming rules, were
introduced, they were inaccessible from those applications until
they were upgraded (sometimes years or longer).  That is
generally a bad situation.

It is perhaps even worth pointing out that national and cultural
sensitivities about IDNs run extremely deep.  If a browser vendor,
or user of a form system, wanted to experiment with whether a
particular country could be provoked far enough to make use or
importation of a particular browser, product, or web site
illegal, making it impossible to use valid domains that the
country considered culturally or strategically important would
probably make a good test case for such an experiment.

There are tradeoffs about load on the root servers, but I have
some sympathy for enforcement of the IDNA2008 rules only and
otherwise following the principle that Shawn Steele advocates:

--On Thursday, November 20, 2014 19:38 +0000 Shawn Steele
<[hidden email]> wrote:

> Personally I don't much see the point.  If it resolves
> it's valid.  If it doesn't then most apps could care less
> if it's well formed.

All of the above is strictly about the domain part of an email
address.  As I and many others have noted, the issues with the
local part are entirely different and should really be discussed
separately.   But I do feel a need to comment on one suggestion:


--On Thursday, November 20, 2014 17:54 +0100 Mark Davis ☕️
<[hidden email]> wrote:

> The only change requiring a bit of work is the local-part. For
> that, I tend to agree with Anne that the EAI spec is overly
> broad (for compatibility's sake), and that the HTML spec can
> be somewhat tighter.

That "broadness" of the EAI specs is due to two things: a need
to be consistent with SMTP and a need to reflect actual
practices in email address usage.  I note in particular that
SMTP requires that upper and lower case local parts be treated
as distinct, even in ASCII.  Equivalencing or aliasing of
strings that differ only by case (or in other ways) is
explicitly permitted and mail server operators have been advised
for decades to not have such strings identify separate mailboxes
unless they have very specific reasons to do so.  Applied to the
EAI environment, that rule saves a world of pain (pain we have
experienced with IDNs and continue to experience) because the
decision as to whether one string is the upper or lower case
equivalent of another is determined entirely in the context of
the server supporting the mailbox -- sending or intermediate
systems are not allowed to assume the equivalence.  Those
supposedly "over broad" rules therefore protect us from
culturally-unpleasant arguments about the various case folding
edge cases.  
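
To make the rule above concrete, here is a minimal sketch in Python of how a sending or intermediate system might compare two addresses without assuming any case equivalence in the local part. The function name is hypothetical, and lower-casing the domain is a simplification that is only adequate for ASCII domain names; IDN labels would need IDNA processing instead.

```python
def same_address_for_sender(a: str, b: str) -> bool:
    """Conservative comparison for a system that does not own the mailbox.

    Hypothetical helper: only the server supporting the mailbox may decide
    that two local parts differing by case are equivalent, so the local
    part is compared exactly here.
    """
    local_a, _, domain_a = a.rpartition("@")
    local_b, _, domain_b = b.rpartition("@")
    # Domain names compare case-insensitively (ASCII-only simplification).
    # Local parts are compared byte for byte, with no folding.
    return local_a == local_b and domain_a.lower() == domain_b.lower()


print(same_address_for_sender("Smith@Example.COM", "Smith@example.com"))
print(same_address_for_sender("Smith@example.com", "smith@example.com"))
```

The first comparison differs only in the domain's case and is treated as the same address; the second differs in the local part and is not, because only the receiving server can say whether "Smith" and "smith" are the same mailbox.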

Please don't break that by deciding to impose "tighter" rules.
Also understand that, if HTML or the web-email interface makes
up its own set of rules, you folks will be taking responsibility
for telling the owners and users of email systems what email
local parts they can use and, given the number of environments
in which personal names are used as part of local parts, what
names they can have or give their children.  I wouldn't want to
go there.  YMMV.

best,
    john



Re: "International" email addresses [I18N-ACTION-374]

Martin J. Dürst
In reply to this post by Shawn Steele
On 2014/11/21 04:36, Shawn Steele wrote:

[Anne van Kesteren wrote]
>> Just to be clear, note that HTML does support them as *input* (the UI done by the UA), it's just that it expects that to be translated to ASCII. This is not so different from how we deal with this situation when it comes to URLs.
>
> ASCII isn't supported for EAI, and it's typically preferred to keep Domain Names in Unicode except for that pesky resolving step, where they must be punycoded.

"ASCII isn't supported for EAI" is extremely short. What it means is
that there is no ACE (ASCII-compatible encoding) for the left-hand side
(LHS; the part before the @) of an EAI email address.

So for a purely hypothetical case of an address like
café@café.example.com, it's possible to change the domain name part to
xn--caf-dma.example.com (using punycode), but there is no such encoding
for the left-hand side of an internationalized email address.
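
As a minimal illustration of the domain-side conversion, the sketch below uses Python's built-in "idna" codec. Note the assumption: that codec implements the older IDNA2003 rules rather than IDNA2008, so it is only a stand-in here, and the helper names are made up for this example.

```python
# Sketch only: Python's built-in "idna" codec follows IDNA2003, not IDNA2008.

def domain_to_ace(domain: str) -> str:
    """Return the ASCII-compatible (punycode-based) form of a domain name."""
    return domain.encode("idna").decode("ascii")


def domain_to_unicode(ace_domain: str) -> str:
    """Reverse the conversion, label by label."""
    return ace_domain.encode("ascii").decode("idna")


unicode_domain = "caf\u00e9.example.com"   # café.example.com
ace = domain_to_ace(unicode_domain)
print(ace)
print(domain_to_unicode(ace) == unicode_domain)
```

Nothing comparable can be written for the part before the @, since no ASCII-compatible encoding is defined for it.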

In a mailto: URI, the above becomes
mailto:caf%C3%A9@caf%C3%A9.example.com,
but it's impossible to make the plain email address
caf%C3%A9@caf%C3%A9.example.com, because the '%' could be part of the
left-hand side of another email address.
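
The ambiguity can be shown with a short Python sketch. It assumes that percent-encoding the UTF-8 bytes of everything except "@" is good enough for illustration; a real mailto: builder has more characters to worry about, and the function name is hypothetical.

```python
from urllib.parse import quote, unquote


def to_mailto(address: str) -> str:
    """Build a mailto: URI by percent-encoding the UTF-8 bytes of the address."""
    return "mailto:" + quote(address, safe="@")


uri = to_mailto("caf\u00e9@caf\u00e9.example.com")   # café@café.example.com
print(uri)

# '%' is a legal character in an ordinary ASCII local part, so this string
# is already a plausible address in its own right:
literal = "caf%C3%A9@example.com"

# Treating it as "percent-encoded email" would silently turn it into a
# different mailbox.
print(unquote(literal) == literal)
```

Inside a URI the percent-encoding is unambiguous because the URI syntax says so; as a bare email address the same bytes already mean something else.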

Also, while HTTP is, as per spec, limited to ASCII-only URIs, it's
impossible to send anything to café@café.example.com without the EAI
extension to SMTP, and with that extension, everything including the
addressee's address is in raw UTF-8.

Given that the Web is mostly UTF-8 these days and is asymptotically
approximating a UTF-8-only target state, and that EAI addresses are
handled as plain UTF-8 throughout the EAI email infrastructure, trying
to interpose a non-existing ASCII-only form between these two systems is
a non-starter.

For cafe@café.example.com (LHS ASCII only), it may make sense to
downgrade to [hidden email], because then neither the
logic and database behind the Web form nor the email infrastructure when
the backend sends a mail to that address need any changes.

But for café@café.example.com, the situation is quite different. The
backend has to make sure its mail sending infrastructure is updated to
EAI. For that, it will also have to upgrade/update its backend logic and
make sure the database takes and keeps the Unicode mail address (maybe
in UTF-16 if not in UTF-8). So it is clear that it wants the address as
UTF-8 rather than something else.
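
The two cases can be sketched as one decision in Python. This is an illustration under stated assumptions, not a recommendation for any particular backend: the function name is hypothetical, the address is assumed to contain a single "@", and the built-in "idna" codec (IDNA2003) stands in for a proper IDNA2008 implementation.

```python
from typing import Tuple


def wire_form(address: str) -> Tuple[str, bool]:
    """Return (address to hand to the mail system, whether EAI is required)."""
    local, _, domain = address.rpartition("@")
    if local.isascii():
        # ASCII left-hand side: only the domain needs downgrading, and the
        # result works with an unmodified mail infrastructure.
        return local + "@" + domain.encode("idna").decode("ascii"), False
    # Non-ASCII left-hand side: there is no ASCII form to fall back to, so
    # the address stays in Unicode and the sender must support EAI (SMTPUTF8).
    return address, True


print(wire_form("cafe@caf\u00e9.example.com"))
print(wire_form("caf\u00e9@caf\u00e9.example.com"))
```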

So "HTML does support them as *input*" is clearly not good enough and
counterproductive for true EAI email addresses. Please fix, thanks!

Regards,   Martin.


Re: "International" email addresses [I18N-ACTION-374]

Anne van Kesteren-4
On Tue, Nov 25, 2014 at 9:29 AM, "Martin J. Dürst"
<[hidden email]> wrote:
> So "HTML does support them as *input*" is clearly not good enough and
> counterproductive for true EAI email addresses. Please fix, thanks!

Sorry for confusing everyone. It seems I misunderstood the current
situation in HTML. This issue has been tracked since early 2012:

  https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489


--
https://annevankesteren.nl/


Re: "International" email addresses [I18N-ACTION-374]

Martin J. Dürst
On 2014/11/25 18:45, Anne van Kesteren wrote:
> On Tue, Nov 25, 2014 at 9:29 AM, "Martin J. Dürst"
> <[hidden email]> wrote:
>> So "HTML does support them as *input*" is clearly not good enough and
>> counterproductive for true EAI email addresses. Please fix, thanks!
>
> Sorry for confusing everyone. It seems I misunderstood the current
> situation in HTML. This issue has been tracked since early 2012:
>
>    https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489

Great. I just commented there. I hope we'll see some real action soon!

Regards,   Martin.


Re: "International" email addresses [I18N-ACTION-374]

Steven Pemberton-3
In reply to this post by Phillips, Addison-2
Addison, I18N group,

Many thanks for the discussion so far, and for creating an issue for this topic.

To add to the discussion, I would like to point out the several dimensions to this issue which have been exposed:

1. Syntax, Static Semantics, Dynamic Semantics

To draw an analogy with programming languages, there are several properties of an identifier that can be validated:
Things that can be checked at compile time:
   1. Syntax: Can this thing be an identifier?
   2. Static semantics: Has it been declared? (etc)
Things that can be checked at run-time:
   1. Does it have a value? (etc)

With respect to validating email addresses, there are several comparable properties:
   1. Syntax: Could this string imaginably be a valid email address (regardless of specific details, for instance for particular zones or available TLDs)?
   2. Static Semantics: Is this string an allowable email address, taking into account current rules for zones, which TLDs there are, etc.?
   3. Dynamic semantics: Does the domain really exist? Does the email address really work?

There is another dimension too, that XML Schema distinguishes as "lexical space" and "value space"[1]:
   1. Lexical space: in this case, what the user thinks of, and types in, as a valid email address.
   2. Value space: in this case the email address as it might go over the wire, which may include puny-code processing.

It is noticeable that many answers across the internet to the vexing question of what is a valid international email address mix these things up in lots of interesting ways, without properly distinguishing them.

In this case, the XForms group is only interested in the Syntax of the Lexical Space. We are not interested, at the level of processing that we are now talking about, in whether it is a valid domain, if the zone parts follow the rules for that zone, or whether the email address really exists. The user may be typing in an address that represents a future address for a domain that doesn't yet exist, or for a TLD that doesn't yet exist.

As a result, I still believe that my original message was more or less right on this point: a syntactically correct email address is defined by RFC 5322 as modified by RFC 6532:

   address: atom-list "@" atom-list.
   atom-list: atom ( "." atom )*
   atom: C+
   C: any character in the world EXCEPT (),.:;<>@[\]

with the added exclusion of control characters in the list for C.
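
A minimal sketch of the grammar above as a Python regular expression, assuming "control characters" means the C0 and C1 ranges plus DEL (U+0000..U+001F and U+007F..U+009F). The names ATOM, ATOM_LIST, ADDRESS and is_lexically_valid are illustrative only. It checks the syntax of the lexical space and nothing else.

```python
import re

# C: any character except ( ) , . : ; < > @ [ \ ] and control characters
ATOM = r"[^(),.:;<>@\[\\\]\x00-\x1f\x7f-\x9f]+"
# atom-list: atom ( "." atom )*
ATOM_LIST = rf"{ATOM}(?:\.{ATOM})*"
# address: atom-list "@" atom-list
ADDRESS = re.compile(rf"{ATOM_LIST}@{ATOM_LIST}")


def is_lexically_valid(candidate: str) -> bool:
    """True if the whole string matches the grammar; no DNS or zone checks."""
    return ADDRESS.fullmatch(candidate) is not None


for sample in ["caf\u00e9@caf\u00e9.example.com", "a..b@example.com", "a@b@c"]:
    print(repr(sample), is_lexically_valid(sample))
```

As written, the character class also admits spaces and double quotes, which the atext production of RFC 5322 does not; whether to tighten that is a separate decision.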

[1] http://www.w3.org/TR/xmlschema-2/#value-space

Best wishes,

Steven Pemberton
For the Forms WG

On Thu, 20 Nov 2014 17:37:23 +0100, Phillips, Addison <[hidden email]> wrote:

Dear Steven and XForms,

 

Firstly, the WG *very much* welcomes further discussion from any and all on this list: this is how we find stuff out. (Thanks to Anne, JcK, Jungshik, and Shawn for contributions so far)

 

This is just a note to let you know that the Internationalization WG has taken up a discussion of this topic, which has, obviously, some interesting issues associated with it. We’re aware that, although “EAI” (email address internationalization) has been slow to mature and gain traction, there are serious efforts from vendors and in various countries to bring non-ASCII mail addresses into the mainstream.

 

This doesn’t play well with the current description in HTML (cited by Anne) or various other places. As Shawn and John note, a regex description of IDNA is probably impossible. At best, such a regex would be an approximation.

 

The Internationalization WG is creating a discussion page to capture the issues [1]. We have not had a chance to discuss the issue in greater depth yet, but the WG’s consensus is that this is an interesting problem needing further investigation and documentation. Please note that, owing to the Thanksgiving holiday in the USA, the Internationalization WG is unlikely to make much more of a response for a couple of weeks.

 

Regards (for I18N),

 

Addison

 

[1] https://www.w3.org/International/wiki/EAI_Address_Issues

 

 

From: Shawn Steele [mailto:[hidden email]]
Sent: Wednesday, November 19, 2014 11:37 AM
To: Jungshik SHIN (
신정식)
Cc: Anne van Kesteren; Steven Pemberton; [hidden email]; Forms WG
Subject: RE: "International" email addresses

 

Validating the IDN part is much more complicated than validating the local part, because you need to know the IDN rules.  Which means it probably isn’t just a “simple” regex. 

 

So maybe the rule should allow Unicode in the domain part and encourage complete IDN validation as an additional step?

 

-Shawn

 

From: [hidden email] [[hidden email]] On Behalf Of Jungshik SHIN (???)
Sent: Wednesday, November 19, 2014 10:53 AM
To: Shawn Steele
Cc: Anne van Kesteren; Steven Pemberton; [hidden email]; Forms WG
Subject: Re: "International" email addresses

 

https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489 deals with it (EAI support in email form validation) although the summary is a bit misleading (it only talks about IDN). 

 

Jungshik

 

On Wed, Nov 19, 2014 at 10:07 AM, Shawn Steele <[hidden email]> wrote:

Updating that to support EAI would be good.


-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of Anne van Kesteren
Sent: Wednesday, November 19, 2014 2:07 AM
To: Steven Pemberton
Cc: [hidden email]; Forms WG
Subject: Re: "International" email addresses

On Wed, Nov 19, 2014 at 11:00 AM, Steven Pemberton <[hidden email]> wrote:
> So as far as I can see, an internationalised email address is:
>
>  address: atom-list "@" atom-list.
>  atom-list: atom ( "." atom )*
>  atom: C+
>  C: any character in the world EXCEPT (),.:;<>@[\]
>
> a) Do you agree?
> b) It was really hard to find this out. The internet is rife with
> people asking and getting bad answers. Please help the internet by
> being definitive.

I recommend matching HTML's definition:

https://html.spec.whatwg.org/multipage/forms.html#valid-e-mail-address


--
https://annevankesteren.nl/

 





RE: "International" email addresses [I18N-ACTION-374]

Phillips, Addison-2

Hi Steven,

 

I understand about the desire to limit yourself to the lexical space (which is something you can reasonably address and which has the most utility for what you’re working on).

 

I do have some concerns about your suggested syntax. While it certainly is consistent with what RFC6532 says, I’d be concerned that, for example, ‘atom’ can start with combining marks or consist solely of non-starting Unicode code points or other values that would be problematic. These are the sorts of problems described in [1] and [2]. That is, I’m pretty sure that the following Unicode code point sequence isn’t ever a valid email address, notwithstanding its apparent “lexical validity”:

 

U+0300 U+0301 U+FE0F U+0040 U+09C4 U+002E U+0063 U+006F U+006D

 

(that’s two combining accents, a variation selector, the @ sign, a Bengali combining vowel marker, “dot com”)

 

So I’d suggest that ‘atom’ at least always starts with a Unicode code point with a combining class of 0 (or possibly an unassigned code point for a given version of Unicode that might later be assigned a non-zero combining value).
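
As a rough illustration of the suggestion above, the following Python sketch inspects the example sequence and applies a "first code point has combining class 0" check to each atom. The function names (`describe`, `atoms`, `atom_starts_ok`) are hypothetical and are not taken from any specification; the results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# The example sequence from the message, written as code points.
EXAMPLE = "\u0300\u0301\uFE0F\u0040\u09C4\u002E\u0063\u006F\u006D"


def describe(text):
    """Return (code point, combining class, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.combining(ch), unicodedata.category(ch))
        for ch in text
    ]


def atoms(address):
    """Split an address into its dot-separated atoms on both sides of the '@'."""
    local, _, domain = address.partition("@")
    return local.split(".") + domain.split(".")


def atom_starts_ok(atom):
    """Suggested rule: an atom is non-empty and starts with a combining class of 0.

    unicodedata.combining() also returns 0 for code points that are unassigned
    in the interpreter's Unicode version, which matches the parenthetical above.
    """
    return bool(atom) and unicodedata.combining(atom[0]) == 0


if __name__ == "__main__":
    for row in describe(EXAMPLE):
        print(row)
    print(all(atom_starts_ok(a) for a in atoms(EXAMPLE)))
```

One caveat with this sketch: some marks (variation selectors, for instance) have a combining class of 0, so a combining-class test on its own may not flag every atom that begins with a mark. Printing the general category alongside the combining class makes such cases visible.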

 

Addison

 

[1] http://www.unicode.org/reports/tr31/

[2] http://www.w3.org/TR/charmod-norm/#unicodeNormalization

 

From: Steven Pemberton [mailto:[hidden email]]
Sent: Wednesday, November 26, 2014 1:50 PM
To: Steven Pemberton; Phillips, Addison
Cc: [hidden email]; Forms WG
Subject: Re: "International" email addresses [I18N-ACTION-374]

 

Addison, I18N group,

 

Many thanks for the discussion so far, and for creating an issue for this topic.

 

To add to the discussion, I would like to point out the several dimensions to this issue which have been exposed:

 

1. Syntax, Static Semantics, Dynamic Semantics

 

To draw an analogy with programming languages, there are several properties of an identifier that can be validated:

Things that can be checked at compile time:

   1. Syntax: Can this thing be an identifier?

   2. Static semantics: Has it been declared? (etc)

Things that can be checked at run-time:

   1. Does it have a value? (etc)

 

With respect to validating email addresses, there are several comparable properties:

   1. Syntax: Could this string imaginably be a valid email address (regardless of specific details, for instance for particular zones or available TLDs)?

   2. Static Semantics: Is this string an allowable email address, taking into account current rules for zones, which TLDs there are, etc.?

   3. Dynamic semantics: Does the domain really exist? Does the email address really work?

 

There is another dimension too, that XML Schema distinguishes as "lexical space" and "value space"[1]:

   1. Lexical space: in this case, what the user thinks of, and types in, as a valid email address.

   2. Value space: in this case the email address as it might go over the wire, which may include Punycode processing.
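
To make the lexical/value distinction concrete, here is a minimal Python sketch that maps an address as typed to one possible wire form by converting only the domain to its ASCII (Punycode-based) form. The helper name `to_value_space` is hypothetical. It relies on Python's built-in `idna` codec, which follows the older IDNA 2003 rules, so it should be read as an approximation rather than as what any particular mail system does; the local part is deliberately left untouched.

```python
def to_value_space(address):
    """Convert the domain of an address to its ASCII form, leaving the local part as typed.

    Assumption: the last '@' separates the local part from the domain.
    """
    local, _, domain = address.rpartition("@")
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{local}@{ascii_domain}"


if __name__ == "__main__":
    print(to_value_space("user@bücher.example"))  # user@xn--bcher-kva.example
    print(to_value_space("user@example.com"))     # unchanged: already ASCII
```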

 

It is noticeable that many answers across the internet to the vexing question of what is a valid international email address mix these things up in lots of interesting ways, without properly distinguishing them.

 

In this case, the XForms group is only interested in the Syntax of the Lexical Space. We are not interested, at the level of processing that we are now talking about, in whether it is a valid domain, if the zone parts follow the rules for that zone, or whether the email address really exists. The user may be typing in an address that represents a future address for a domain that doesn't yet exist, or for a TLD that doesn't yet exist.

 

As a result, I still believe that my original message was more or less right on this point: a syntactically correct email address is defined by RFC 5322 as modified by RFC 6532:


   address: atom-list "@" atom-list.
   atom-list: atom ( "." atom )*
   atom: C+
   C: any character in the world EXCEPT (),.:;<>@[\]

 

with the added exclusion of control characters in the list for C.
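
The grammar above can be sketched as a small Python checker. This is a literal reading under stated assumptions: "control characters" is taken to mean Unicode general category Cc, and nothing else (whitespace or double quotes, for example) is excluded because the grammar as posted does not list it. The function names are hypothetical.

```python
import unicodedata

# Characters excluded from C in the grammar as posted.
EXCLUDED = set('(),.:;<>@[\\]')


def is_c(ch):
    """C: any character except the excluded set and control characters (category Cc)."""
    return ch not in EXCLUDED and unicodedata.category(ch) != "Cc"


def is_atom_list(text):
    """atom-list: atom ( "." atom )*, where atom is C+ (so empty atoms are rejected)."""
    return all(atom and all(is_c(ch) for ch in atom) for atom in text.split("."))


def is_lexically_valid(address):
    """address: atom-list "@" atom-list."""
    local, sep, domain = address.partition("@")
    # A second "@" ends up in `domain` and is rejected there, since "@" is not in C.
    return sep == "@" and is_atom_list(local) and is_atom_list(domain)


if __name__ == "__main__":
    for candidate in ["user@example.com", "用户@例子.公司", "a..b@example.com", "a@b@c"]:
        print(candidate, is_lexically_valid(candidate))
```

Because this checks syntax only, it accepts strings that would fail later stages, including sequences made up entirely of combining marks.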

 

 

Best wishes,

 

Steven Pemberton

For the Forms WG

 



RE: "International" email addresses [I18N-ACTION-374]

Shawn Steele

The EAI RFCs don’t say anything about the local part making sense in Unicode, so the first part, though nonsense, is permitted.  Presumably whoever is assigning mailboxes in that domain would use wiser rules, though.

 

Presuming that the @ sign is actually interpreted as a delimiter per the RFCs despite the unexpected combining mark, the domain part is invalid IDN, so that would be clear.

 

-Shawn

 



Re: "International" email addresses [I18N-ACTION-374]

Mark Davis ☕
In reply to this post by Phillips, Addison-2
I mostly agree with the 3 distinctions that Steven is drawing. It is certainly important to distinguish between "well-formed" (syntax) and "valid" (would actually work at runtime). However, the syntactic distinction could be tighter than what he suggests, but looser than https://html.spec.whatwg.org/multipage/forms.html#valid-e-mail-address, since the latter doesn't take into account either EAI or IDNA. I'd suggest something like the following:
email         = local-part "@" host
local-part    = 1*atext2 *("." 1*atext2)
host          = < as defined in https://url.spec.whatwg.org/#host-parsing >

atext2        = atext | utext
atext         = < as defined in http://tools.ietf.org/html/rfc5322#section-3.2.3 >
utext         = XID_Start 1*XID_Continue
XID_Start     = < as defined in http://www.unicode.org/reports/tr31 >
XID_Continue  = < as defined in http://www.unicode.org/reports/tr31 >


Additional conditions: 
 atext2 must be in NFC format, as defined in http://www.unicode.org/reports/tr15

Notes: 
 * for local-part, see dot-atom-text in rfc5322 section 3.2.3
 * the above doesn't provide for quoted email addresses; the syntax would have to be enhanced to allow for those.
 * the restriction to NFC is recommended in http://tools.ietf.org/html/rfc6530#section-10.1, but not required there. (I'd prefer NFKC over NFC.)
 * the restriction to a Unicode identifier is not in rfc6530, but helps to prevent bizarre email addresses. However, it could be made more lenient, eg to allow symbols (if you want your email address, for example, to be an emoji).
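
The local-part rule above can be approximated in Python as follows. This is only a sketch under several assumptions: XID_Start and XID_Continue are approximated with `str.isidentifier()`, whose answers depend on the interpreter's Unicode version; `utext` is read literally as one XID_Start followed by at least one XID_Continue; the NFC condition is applied to the whole local part; and the `host` rule is not covered at all. The helper names are hypothetical.

```python
import string
import unicodedata

# atext from RFC 5322 section 3.2.3: ASCII letters, digits, and these symbols.
ATEXT = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~")


def is_xid_start(ch):
    # str.isidentifier() also accepts "_" as a first character, so exclude it here.
    return ch != "_" and ch.isidentifier()


def is_xid_continue(ch):
    # A character that may follow an identifier start is treated as XID_Continue.
    return ("a" + ch).isidentifier()


def is_atext2_run(segment):
    """True if `segment` can be split into one or more atext characters or utext tokens."""
    n = len(segment)
    if n == 0:
        return False
    reachable = [False] * (n + 1)  # reachable[i]: segment[:i] is a valid token sequence
    reachable[0] = True
    for i in range(n):
        if not reachable[i]:
            continue
        if segment[i] in ATEXT:
            reachable[i + 1] = True
        if is_xid_start(segment[i]):
            j = i + 1
            while j < n and is_xid_continue(segment[j]):
                j += 1
                reachable[j] = True  # utext token segment[i:j], at least two characters
    return reachable[n]


def is_local_part(text):
    """local-part = 1*atext2 *("." 1*atext2), with the NFC condition applied."""
    return unicodedata.is_normalized("NFC", text) and all(
        is_atext2_run(segment) for segment in text.split(".")
    )


if __name__ == "__main__":
    for candidate in ["john.smith", "jos\u00e9", "jose\u0301", "用户", "a..b"]:
        print(repr(candidate), is_local_part(candidate))
```

Under this literal reading, a local part consisting of a single non-ASCII letter is rejected because `utext` requires at least two code points; whether that is intended is not clear from the message.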



— Il meglio è l’inimico del bene —




Re: "International" email addresses [I18N-ACTION-374]

John C Klensin-4


--On Thursday, November 27, 2014 08:58 +0100 Mark Davis ☕️
<[hidden email]> wrote:

> I mostly agree with the 3 distinctions that Steven is drawing. It
> is certainly important to distinguish between "well-formed"
> (syntax) and "valid" (would actually work at runtime).
> However, the syntactic distinction could be tighter than what
> he suggests, but looser than
> https://html.spec.whatwg.org/multipage/forms.html#valid-e-mail
> -address, since the latter doesn't take into account either
> EAI or IDNA. I'd suggest something like the following:
>...

Folks,

Personally, I favor the types of restrictions Mark, Addison, and
others have suggested.  However, divergent specifications do no
one any good, especially when their effect is to prevent someone
from using an address as input in a web context that may be
valid and in use (even if, in your/our judgment unwise) in email
more generally.  So I strongly suggest:

(1) If these groups believe that the IETF specs are too
permissive or badly defined, put together a proposal and submit
it through the IETF process.

(2) Incorporate syntax rules and restrictions by reference to
the IETF specs only, thereby preventing both accidental
divergence and confusion about what the "real" rules are.

In addition to the obvious reasons for the above, there are some
email-specific issues involved in what the IETF documents
specified that may not be as familiar to this group.  For
example, Mark wrote...
 
>  * the above doesn't provide for quoted email addresses; the
> syntax would have to be enhanced to allow for those.

The decision to restrict quoted email addresses when the
SMTPUTF8 (aka "EAI") extensions were in use was discussed
carefully and at length.  The decision that was made was based
on circa 30 years of experience with the quoted forms causing
multiple problems as addresses were passed among systems with
different conventions.  The conclusion was that tightening the
rules a bit under the protection of the extension mechanism
would improve overall interoperability and not hurt anyone.  The
net result is that one can use both fancy quoting forms and
all-ASCII addresses or use non-ASCII addresses but not the
quoting forms.  It doesn't affect addresses unless name phrases
are used, but the SMTPUTF8 extensions also effectively prohibit
anything but (valid) UTF-8, even in encoded words [1], while,
historically, email with ASCII-only addresses and headers allows
encoded words in just about any "charset".   It seems to me
those restrictions are entirely in line with W3C and WHATWG
moves in other areas such as the Encoding spec.

    john

[1] Anyone participating in this discussion who doesn't know
_exactly_ what a "name phrase" and/or "encoded word" is in an
email context _really_ needs to go read the relevant specs.




Re: "International" email addresses [I18N-ACTION-374]

Mark Davis ☕

On Thu, Nov 27, 2014 at 9:33 AM, John C Klensin <[hidden email]> wrote:
> (1) If these groups believe that the IETF specs are too
> permissive or badly defined, put together a proposal and submit
> it through the IETF process.

I agree with you, in principle. There are times, however, where another spec (like HTML5) may support a narrower syntactic format than the original specification.

Now, it might be that what I suggest is too narrow. Perhaps something like this would be better:


What would be interesting would be to gather some statistics on the number of email messages in use that would be affected...

(And just speaking for myself, I have too few years of life left to get involved in another IETF proposal 😱. Addison is still young; maybe he is game 🎱 ...)

 
> The decision to restrict quoted email addresses when the
> SMTPUTF8 (aka "EAI") extensions were in use was discussed
> carefully and at length. ... The
> net result is that one can use both fancy quoting forms and
> all-ASCII addresses or use non-ASCII addresses but not the
> quoting forms. ... It seems to me
> those restrictions are entirely in line with W3C and WHATWG
> moves in other areas such as the Encoding spec.

Thanks for the note, and I agree with you. I meant my note to apply just to the ASCII forms.


— Il meglio è l’inimico del bene —