These are draft minutes from the "Birds of a Feather" session on IRIs held at IETF 76 in Hiroshima, Japan on November 10, 2009. If you have changes, please send them to the BoF chairs so that they can upload final minutes.
AR = Adam Roach
AM = Alexey Melnikov
BL = Barry Leiba
JH = Joe Hildebrand
JK = John Klensin
LD = Lisa Dusseault
LM = Larry Masinter
MD = Martin Duerst
MS = Michael Smith
PR = Pete Resnick
SC = Stuart Cheshire
TH = Ted Hardie
TH: We are trying to restore interoperability to a part of the Internet infrastructure where it has been lost. The URI mechanism is one of the most important pieces of the application space. URIs were originally designed to be (1) under the hood and (2) ASCII. The assumption was that IRIs were a way of presenting data that was really represented in a URI. That assumption has changed to make IRIs more of a first-class citizen, XML being a prime example. Now there are no fewer than 9 communities working on IRIs (W3C, etc.). We are not trying to add a 10th.
JK: Counter-theory is that IRIs have been a disaster for internationalization.
LM: There is a horrible mess, but there is the possibility of making things a little bit better.
1. IRI as protocol element vs. mapping of IRI to URI. Until now, IRI has been defined as a sequence of Unicode characters that's converted into a URI by translating to UTF-8 and then percent-encoding. The meaning of the IRI was to be exactly the URI to which it was mapped. Later, it became clear that most implementations parsed the UTF-8 and translated to hex-encoding only if necessary. We also found that some applications were using IRIs as strings for namespaces (e.g., XML namespaces). In other words, applications were using IRIs directly as protocol elements.
2. Normative reference to IDNA. The IRI spec defined translation of domain name components using IDNA, but IDNA is under transition.
3. Different levels of "liberal processing". IRIs that weren't actually valid were accepted in places like HTML documents. Two levels here: one defined by XML community, another by the HTML community (i.e., what browsers currently implement).
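As a rough illustration of the mapping described in item 1 (Unicode characters to UTF-8, then percent-encoding), here is a minimal Python sketch. The function name `iri_to_uri` and the example address are hypothetical, and the sketch deliberately ignores the separate question of how the host component is handled.

```python
from urllib.parse import quote


def iri_to_uri(iri: str) -> str:
    """Percent-encode the UTF-8 bytes of every non-ASCII character.

    ASCII characters (delimiters, existing %XX escapes) are left untouched.
    Hypothetical helper for illustration; no special handling of the host.
    """
    return "".join(
        ch if ord(ch) < 128 else quote(ch, safe="")
        for ch in iri
    )


# Hypothetical example IRI containing two non-ASCII characters.
print(iri_to_uri("http://example.org/r\u00e9sum\u00e9"))
```

Under this model the IRI has no meaning of its own beyond the URI it maps to, which is the assumption that item 1 says no longer matches how implementations behave.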
[slide] Other documents and committees
- HTML5 work in WhatWG and W3C HTML WG
- IETF IDNABIS WG
- IETF EAI WG
TH: Start introducing discussion. In particular, let's have comments and questions on IRI as protocol element.
MD: We still need to make sure that the conversion from IRI to URI is well-defined.
PR: Is everybody OK with movement from presentation layer to protocol element?
AM: Is there any difference on the wire? If so, it would be nice to show some examples.
LM: There are protocols and protocol elements. That's why we came up with IRIs vs. URIs in the first place.
AM: That was not my question. Will conversion processes described in different versions of the spec produce the same data on the wire?
LM: No, because formerly the conversion was Unicode to UTF-8, then hex-encoded. Converting Unicode to punycode was listed as an option. This would give you two processing paths, because you might end up with hex-encoded UTF-8 *or* ASCII punycode. If you then took a URI that had percent-encoding and passed it to a non-Unicode-aware resolver, it would give you different results than if you had passed it a punycode hostname in the URI.
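The two processing paths can be sketched in Python as follows. This is only an illustration: the helper names and the hostname are hypothetical, and Python's built-in "idna" codec implements the older IDNA 2003 rules, so it stands in for whichever IDNA version ends up being referenced.

```python
from urllib.parse import quote


def host_as_percent_encoded_utf8(host: str) -> str:
    """Path 1: UTF-8 bytes of the host, percent-encoded (hypothetical helper)."""
    return quote(host, safe="")


def host_as_punycode(host: str) -> str:
    """Path 2: ASCII punycode labels via Python's IDNA 2003 codec (hypothetical helper)."""
    return host.encode("idna").decode("ascii")


host = "b\u00fccher.example"  # hypothetical internationalized hostname
print(host_as_percent_encoded_utf8(host))
print(host_as_punycode(host))
```

The same hostname yields two different ASCII strings, so a resolver that is not Unicode-aware sees two unrelated names.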
MD: RFC 3987 didn't explain that very well. But you don't know if underneath you are dealing with DNS or some other kind of system. This needs to be an open issue.
JK: The plenary on Thursday night will discuss these issues as well. The argument there will be that it's not a good idea to have two different encodings of the same information (UTF-8 and punycode). It's an even worse idea to have three (UTF-8, UTF-8 hex-encoded, and punycode).
TH: That would appear to be opposed to deployed reality. Certainly we have these IRIs/URIs in content (e.g., HTML files), not just on the wire.
JK: One of the problems here is that we're showing many signs of digging a deep hole.
TH: So stop digging?
JK: Start digging a hole over which we have more control.
JH: I want to make sure there's an encoding that doesn't require conversion to punycode or hex-encoding.
PR: You mean UTF-8?
JH: XMPP is all UTF-8, it would be nice if we could just use that.
LM: URIs are a sequence of characters, not bytes. That needs to be encoded as UTF-8 or UTF-16 or whatever if you want a sequence of bytes instead of characters.
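A small Python sketch of the characters-versus-bytes distinction, using a hypothetical six-character string; the byte sequence depends entirely on which encoding is chosen.

```python
text = "r\u00e9sum\u00e9"  # hypothetical sequence of six characters

utf8_bytes = text.encode("utf-8")       # one or two bytes per character here
utf16_bytes = text.encode("utf-16-be")  # two bytes per character here

print(len(text), len(utf8_bytes), len(utf16_bytes))
print(utf8_bytes.hex())
print(utf16_bytes.hex())
```

Both byte sequences decode back to the same characters, which is why the encoding has to be stated before the identifier can be put on the wire.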
SC: We'll discuss this on Thursday night at the plenary. Unicode identifies characters, but you need to encode them.
AR: I hear that we're trying to make changes based on implementation, but are we trying to get rid of IDNA?
LM: No, we're discussing these issues with implementors, not ignoring implementation reality.
MD: I agree that two representations are bad, and three are worse. The problem is that domain names don't appear only in the authority component of an IRI or URI; they can appear elsewhere (e.g., in the query string).
TH: The reality is that things are very messy. Application developers try very hard to get people where they want to go, even if that input is not valid. Even if we came up with a totally new identifier, that would not fix what's out there. My take is that scrapping what we have now and starting over does us no good, because it might be completely better but completely undeployed. We need to find something in the middle, because leaving things as they are right now doesn't help.
MS: HTML5 spec has ended up including text about (1) error handling for URIs and (2) character encodings in query strings. We hope to have spec text that we can normatively reference in HTML5 so that implementors can do the right thing going forward.
TH: Was there discussion of other URI schemes at that point (in the W3C or WhatWG), or was it limited to HTTP?
MS: I believe it was limited to HTTP.
MS: The "goals" page linked to from the draft charter is very useful and I'd like to see those concerns addressed.
JK: Ted, I agree with you, but URI syntax is currently employed as a user interface, and I think we'll need to move away from that eventually. I'd be happy if we could deprecate ASCII-only URIs.
TH: Do we have consensus to do the work necessary to deprecate ASCII-only URIs?
JK: I would define it differently. Some URIs are not user-facing but instead are network-facing only, so we haven't felt the need to define internationalized versions of those.
TH: Yes, the browser location bar screwed us royally because HTTP URIs were originally supposed to be hidden behind hyperlinks in web pages.
JK: The choices are either looking at things on a scheme-by-scheme basis, or solving the problem globally for all schemes.
TH: How much work are people willing to put in? Because that's a lot of work.
MD: I've already put in a lot of time, but I'm willing to continue working. The deployed content that is not compliant is a small percentage, but the mountain of content is extremely large.
LM: A few points. draft-duerst-iri-bis-07 recommends (1) renaming the existing IANA registry of URI schemes to be a registry of URI/IRI schemes, (2) adding as a requirement for new schemes to define the non-ASCII characters that are appropriate for the scheme, and (3) reviewing all the existing schemes for their appropriateness as IRI identifiers, where the default is that it's an old-style URI. This would go a long way to making all of the identifiers into IRIs.
TH: I remember an effort to bring all the URIs up to date, and we burned out three people in the process. Now it would be just as much work, or more.
MD: Say there is a URI scheme for IP addresses, then it's just numbers and we don't need any internationalization. On the other hand, for something like mailto you can only use ASCII because it is so old.
JK: If we had a URI scheme for IP addresses, you can be sure that we would hear calls for encoding those digits in a localized version, instead of in "Arabic numerals".
LM: I think we have enough problems without imagining new ones.
BL: Why is this not a presentation issue instead of a protocol issue?
TH: Barry, that was the theory, but it failed the test of implementation.
BL: As we discover that more things need internationalization in the presentation layer, can't we just say that applications need to become better at presentation?
AR: If I understand that we'd go back and convert old URIs to IRIs, I would suggest that the effort to do that for SIP alone would be enormous.
LM: Currently you could hex-encode UTF-8.
AR: Non-ASCII characters are not allowed in SIP URIs. But SIP has messed this up because there is no normative text about it. I would be stunned if we're the only protocol in that boat.
LM: So either it works or it doesn't. If it doesn't, it's ASCII only until someone fixes it. The registry would enable that to be defined.
AR: So you would be defining a framework, not fixing each scheme.
LM: Right. A framework that says it's either like IRIs now, like URIs now, or some other definition.
TH: To Barry's point, we've never been able to force people to layer things correctly in their applications. The danger is that we're going to go back into defining human-friendly names. The reality is that the protocol elements will bleed into the human side of things. But we can't be so liberal that things break if there are no humans involved, e.g. if the data is provided to a lower layer.
LM: Preventing bad things from happening to humans is a high priority -- dealing with issues of ambiguity and reliability. Let me speak a bit about the bleeding of protocol elements into human interfaces, because a big part of the Internet economy came from being able to advertise your domain name -- which was bleeding of a protocol element into user space.
BL: I disagree, because if things worked right then a Chinese or Japanese or Cyrillic name could be converted correctly into a protocol element.
LM: I'm confused. The use of i18n identifiers works and it's deployed. There are just a few issues around the edges. The problem is that we need to address those issues in a coordinated fashion.
PR: The presentation layer leaks into protocol elements. Larry said IRIs are used as protocol elements, but they are not encoded. And percent encoded representations provide i18n, too. What we want to get away from is conflating i18n with a particular encoding. E.g., people use UTF-8, so what's being argued for is to standardize that usage by saying that internationalized identifiers are to be encoded in UTF-8.
LM: Sorry that I was not specific enough. Where they are deployed is in HTML documents.
PR: But those are not in UTF-8, they are in ISO 8859-foo. Do you want these identifiers to be represented in *any* encoding? We need to be careful about which path we're on.
LM: This isn't something that I *want*, it's what *is*. There is a lot of software and content that treats a sequence of characters in the encoding of that document, converts that into Unicode (usually UTF-16 but perhaps also UTF-8), and uses the result as a URI/IRI.
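The difference between the two paths under discussion can be sketched in Python. The byte string below is a hypothetical link target as it might be stored in an ISO 8859-1 document; the helper names are illustrative only.

```python
from urllib.parse import quote

# Hypothetical link target as stored in an ISO 8859-1 (Latin-1) document.
raw_bytes = b"/r\xe9sum\xe9"


def via_unicode(raw: bytes, document_encoding: str) -> str:
    """Decode using the document's encoding, then percent-encode as UTF-8."""
    chars = raw.decode(document_encoding)
    return quote(chars, safe="/")


def via_document_bytes(raw: bytes) -> str:
    """Percent-encode the document's own bytes without converting to Unicode."""
    return quote(raw, safe="/")


print(via_unicode(raw_bytes, "iso-8859-1"))
print(via_document_bytes(raw_bytes))
```

The two results differ, so a consumer has to know which path the producer took before the identifier can be compared or resolved reliably.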
BL: The inclusion of this in an HTML document does not make this into a protocol element.
TH: Some people think they are protocol elements. Both views might be correct. We can get sidetracked into which encodings we prefer. One of the goals here is to minimize the number of translations that occur between applications. Make it as simple as possible, but not simpler. Don't simplify identifiers, reduce the number of iterations of translation in any given protocol handoff. Not easy, but different from what was just described.
PR: Let's reframe. In RFC 822, in the header an address is a protocol element. In the body, the address is a protocol element in text.
BL: This is going in the wrong direction.
TH: Does anyone think we don't have a problem here?
BL: I'm not saying we shouldn't do this, only that we need to scope it.
[slide] Charter review
LM: There's a draft charter, perhaps we can review that? Can we address the problem only by working on these three documents? Do we need to work on more? Can we get away with working on even fewer?
JK: First, I don't think we can narrow the scope to this. We need to look at the impact on all schemes, not just HTTP URIs. Second, I'm leery of having this WG try to fix mailto, especially in the context of EAI, because mailto needs to be fixed by people who understand mail.
MD: The mailto draft says very little about EAI. I'm not an expert about all the details about how you do escaping in email addresses and the like, so we need people who know about that to provide comments.
AM: I'm happy about reducing the scope because success is good, but the mailto/EAI interaction needs to be addressed. However, if this WG will focus only on HTTP then it might make choices that are not generic enough.
LM: The intent is not to focus on HTTP.
AM: But HTTP is the main application. Maybe the concern lacks a basis.
LM: The charter mentions explicit coordination with other groups. Not meant to be an exclusive list, and mainly to coordinate requirements.
JK: We have once again fallen into the habit of always mentioning the example of web browsers. I agree with Alexey that we need more examples: at least one more, but three more would be even better. But mailto is probably not a good example.
TH: Maybe look at URNs. They are similar but different enough to be useful for this work.
MD: RFC 3987 references POP, IMAP, data URI scheme, URN, etc. The idea to look at other schemes is important. But do we put out a new spec for those?
LM: I don't think we need to update data: URIs.
MD: We need to look at other schemes, but we don't know which until we investigate them in detail.
JH: I do think XMPP is a good example because it is more recent. I think we also need recommendations for people defining new schemes. We can discuss in the XMPP WG whether more formal cooperation is needed.
LM: So maybe take mailto out and put others in.
AM: Completing the mailto update can happen elsewhere.
LM: There is discussion in the charter of perhaps splitting the documents, e.g., move everything about domain names into a separate document. Also perhaps a separate BCP for informational text about why some characters are problematic and others are not.
Chairs begin asking questions of the room...
TH: HUM Is there a problem here for the IETF to solve?
Hum "yes" -- many
Hum "no" -- silence
Hum "not enough information to decide" -- silence
TH: Raise your hand if you are willing to:
- be on a mailing list to discuss these issues
- review documents
- replace me and Pete at the front of the room
TH: A fair number of volunteers on the first two.
TH: I'm going to ask for two directional questions about the charter....
TH: HUM Should it be within the charter to scrap the existing approach from RFC 3987 and start over with an entirely new approach?
Hum "yes" -- about one-third.
Hum "no" -- about one-third.
Hum "not enough information to decide" -- about one-third.
TH: HUM Should the charter include an explicit list of schemes to review?
Hum "yes" -- about one-third.
Hum "no" -- about one-third.
Hum "not enough information to decide" -- about one-third.
Clarifying question from the mic: would that include making it a minimum list, not a limiting list?
LD: Can we ask if this is a deal-breaker?
TH: I don't think we need to go there now, let's clarify the charter first.
JK: I don't care which specific schemes are selected for review, just as long as the WG reviews multiple schemes.
TH: HUM Must we decide specifically which schemes need to be investigated *before* the WG is chartered, or can the WG sort that out?
Hum "yes" -- none for "before"
Hum "no" -- many for "WG can sort it out"
TH: HUM Must the charter specify a minimum number of schemes to investigate, or can the WG sort that out?
Hum "yes" -- one hum for "minimum number"
Hum "no" -- many for "WG can sort it out"
TH: So we seem to have consensus that there are volunteers to work on IRIs at the IETF. Thanks!