Re: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)


Re: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

Christophe Lauret
Jan wrote:

Help me, I am just not getting it:
Why do you insist on 'fixing STD 66'?
What is the reason you are not willing to reframe the problem to 'fixing how we get from the provided string -the input to the reference construction process- to a STD-66-valid result'?
To me this is really what you are aiming at and dropping the 'fix the URI spec' language would get everyone on board immediately in my perception.

I completely agree - I don't understand the insistence on "fixing" STD 66.

As a Web developer who's had to write code multiple times to handle URIs in very different contexts, I actually *like* the constraints in STD 66, there are many instances where it is simpler to assume that the error handling has been done prior and simply reject an invalid URI. By constraining the space of valid addresses with rigid rules so that it more closely matches the set of addressable resources, STD 66 reduces my scope of work.

I understand that it isn't practical in some contexts (e.g. browsers) where it is inevitable that strings that are not considered valid URI references will crop up and it is desirable for everyone to deal with these in a consistent manner. So, I applaud Anne and Ian for wanting to standardize the error handling and tackling this problem head on.
But why not do it as a separate spec?

Increasing the space of valid addresses, when the set of addressable resources is not actually increasing, only means more complex parsing rules.
I accept that it is a requirement in some contexts, but it is also completely unnecessary in others where coding against STD 66 is just fine if not more desirable.

Personally, I don't have any qualms about calling that larger set of addresses "URLs" if that is what most people use, and leaving STD 66 and URIs as they are.

Christophe-

(I've used 'address' as shorthand for a "string which should ultimately resolve to a valid STD 66 result")


Re: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

Ian Hickson
In reply to this post by Jan Algermissen-3
On Wed, 24 Oct 2012, Jan Algermissen wrote:
>
> What matters is that nothing of the existing URI spec *changes*.
>
> Can you agree on that?

Do you mean the actual text, or the normative meaning of the text?

--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

Ian Hickson
In reply to this post by Christophe Lauret
On Wed, 24 Oct 2012, Christophe Lauret wrote:
>
> As a Web developer who's had to write code multiple times to handle URIs
> in very different contexts, I actually *like* the constraints in STD 66,
> there are many instances where it is simpler to assume that the error
> handling has been done prior and simply reject an invalid URI.

I think we can agree that the error handling should be, at the option of
the software developer, either to handle the input as defined by the
spec's algorithms, or to abort and not handle the input at all.
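
As a rough illustration of that option -- the names and the toy "repair" step below are hypothetical, not taken from any spec -- the choice could be surfaced as a flag on a single parsing entry point (OCaml sketch):

    type error_handling = Repair | Abort

    (* toy conformance check and repair step; a real parser would follow
       the spec's algorithms instead *)
    let conforming input = not (String.contains input ' ')
    let repair input = String.concat "%20" (String.split_on_char ' ' input)

    let parse_url ~handling input =
      if conforming input then Some input
      else match handling with
        | Repair -> Some (repair input)  (* handle as the spec's algorithms define *)
        | Abort  -> None                 (* reject the non-conforming input *)

With such an interface, parse_url ~handling:Abort "a b" returns None, while parse_url ~handling:Repair "a b" returns Some "a%20b".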


> But why not do it as a separate spec?

Having multiple specs means an implementor has to refer to multiple specs
to implement one algorithm, which is not a way to get interoperability.
Bugs creep in much faster when implementors have to switch between specs
just in the implementation of one algorithm.


> Increasing the space of valid addresses, when the set of addressable
> resources is not actually increasing only means more complex parsing rules.

I'm not saying we should increase the space of valid addresses. The de
facto parsing rules are already complicated by de facto requirements for
handling errors, so defining those doesn't increase complexity either
(especially if such behaviour is left as optional, as discussed above.)

--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

Ted Hardie-2
On Tue, Oct 23, 2012 at 4:51 PM, Ian Hickson <[hidden email]> wrote:
> Having multiple specs means an implementor has to refer to multiple specs
> to implement one algorithm, which is not a way to get interoperability.
> Bugs creep in much faster when implementors have to switch between specs
> just in the implementation of one algorithm.
>

First, do you have data that supports this assertion?

Second, multiple folks in this conversation have asserted that the
right way to approach this is to have *two* algorithms.  The first is
"method to get from string to URI"  and the second is "Process URI".
It is not obvious that those need to be in the same document, any more
than the processing of DNS names needs to be described in the same
document as URLs whose schemes include DNS names.

(In case it is not obvious, it is the string-which-may-become-a-URI
that I have referred to as a "fleen" in previous notes).

It also seems far more likely to me that bugs will creep in from
re-defining a known algorithm (the "process URI" bit from the pair
above) than from the separation of that from a different operation.
If the results of the rewording would be different operations then, as
I have noted before, you really should use different terms and admit
to the fork.

My personal opinion, as has been noted,

regards,

Ted Hardie


Re: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

Ian Hickson
On Tue, 23 Oct 2012, Ted Hardie wrote:
> On Tue, Oct 23, 2012 at 4:51 PM, Ian Hickson <[hidden email]> wrote:
> > Having multiple specs means an implementor has to refer to multiple
> > specs to implement one algorithm, which is not a way to get
> > interoperability. Bugs creep in much faster when implementors have to
> > switch between specs just in the implementation of one algorithm.
>
> First, do you have data that supports this assertion?

No.


> Second, multiple folks in this conversation have asserted that the right
> way to approach this is to have *two* algorithms. The first is "method
> to get from string to URI"  and the second is "Process URI".

That would be inefficient, so isn't likely to be a solid implementation
strategy in many environments. I don't see any reason to do it this way.


> It also seems far more likely to me that bugs will creep in from
> re-defining a known algorithm (the "process URI" bit from the pair
> above) than from the separation of that from a different operation.

That's what regression tests and review are for. Obviously if we rewrote
the algorithms and introduced bugs and just left them as is, we'd be
pretty awfully incompetent. Nobody is suggesting that the work will be
done before we have reached a point where the new spec is as good as or
better than the existing specs.


> If the results of the rewording would be different operations then, as I
> have noted before, you really should use different terms and admit to
> the fork.

There's no desire to make the new spec incompatible with existing
software. That would not be a useful spec.

--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


RE: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

Manger, James H
In reply to this post by Ian Hickson
> From: Ian Hickson [mailto:[hidden email]]
> I think we can agree that the error handling should be, at the option
> of the software developer, either to handle the input as defined by the
> spec's algorithms, or to abort and not handle the input at all.

Currently, I don't think url.spec.whatwg.org distinguishes between strings that are
valid URLs and strings that can be interpreted as URLs by applying its standardised error handling. Consequently, error handling cannot be at the option of the software developer as you cannot tell which bits are error handling.

This might be why some are unhappy with url.spec.whatwg.org.

url.spec.whatwg.org does have separate "Writing" and "Parsing" sections. Perhaps the implicit idea is that any output of the "Writing" section is a valid URL (that all URL-processing software should handle). The "Parsing" section accepts more strings than the "Writing" section can produce; the difference is the error handling. It's OK for a software developer not to accept that extra set if doing so makes their parser simpler or safer, or if that is simply how their parser works today.


--
James Manger


Re: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

David Sheets-2
In reply to this post by Ian Hickson
On Tue, Oct 23, 2012 at 4:51 PM, Ian Hickson <[hidden email]> wrote:

> On Wed, 24 Oct 2012, Christophe Lauret wrote:
>>
>> As a Web developer who's had to write code multiple times to handle URIs
>> in very different contexts, I actually *like* the constraints in STD 66,
>> there are many instances where it is simpler to assume that the error
>> handling has been done prior and simply reject an invalid URI.
>
> I think we can agree that the error handling should be, at the option of
> the software developer, either to handle the input as defined by the
> spec's algorithms, or to abort and not handle the input at all.

Yes, input is handled according to the specs' algorithmS.

>> But why not do it as a separate spec?
>
> Having multiple specs means an implementor has to refer to multiple specs
> to implement one algorithm, which is not a way to get interoperability.
> Bugs creep in much faster when implementors have to switch between specs
> just in the implementation of one algorithm.

One algorithm? There seem to be several functions...

- URI reference parsing (parse : scheme -> string -> raw uri_ref)
- URI reference normalization (normalize : raw uri_ref -> normal uri_ref)
- absolute URI predicate (absp : normal uri_ref -> absolute uri_ref option)
- URI resolution (resolve : absolute uri_ref -> _ uri_ref -> absolute uri_ref)

Of course, some of these may be composed in any given implementation.
In the case of a/@href and img/@src, it appears that something like
(one_algorithm = (resolve base_uri) . normalize . parse (scheme
base_uri)) is in use.
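
As a sketch only (OCaml-style, following the notation above rather than any spec text), the four functions and that composition could be given the following shape:

    module type URI_PROCESSING = sig
      type scheme
      type raw
      type normal
      type absolute
      type 'stage uri_ref   (* 'stage tracks the processing stage *)

      val parse     : scheme -> string -> raw uri_ref
      val normalize : raw uri_ref -> normal uri_ref
      val absp      : normal uri_ref -> absolute uri_ref option
      val resolve   : absolute uri_ref -> 'stage uri_ref -> absolute uri_ref

      (* the a/@href-style composition above:
         one_algorithm base s = resolve base (normalize (parse (scheme_of base) s)) *)
      val scheme_of     : absolute uri_ref -> scheme
      val one_algorithm : absolute uri_ref -> string -> absolute uri_ref
    end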

A good way to get interop is to thoroughly define each function and
supply implementors with test cases for each processing stage
(one_algorithm's test cases define some tests for parse, normalize,
and resolve as well).

Some systems use more than the simple function composition of web browsers...

>> Increasing the space of valid addresses, when the set of addressable
>> resources is not actually increasing only means more complex parsing rules.
>
> I'm not saying we should increase the space of valid addresses.

Anne's current draft increases the space of valid addresses. This
isn't obvious as Anne's draft lacks a grammar and URI component
alphabets. You support Anne's draft and its philosophy, therefore you
are saying the space of valid addresses should be expanded.

Here is an example of a grammar extension that STD 66 disallows but
WHATWGRL allows:
<http://www.rfc-editor.org/errata_search.php?rfc=3986&eid=3330>

> The de facto parsing rules are already complicated by de facto requirements for
> handling errors, so defining those doesn't increase complexity either
> (especially if such behaviour is left as optional, as discussed above.)

*parse* is separate from *normalize* is separate from checking if a
reference is absolute (*absp*) is separate from *resolve*.

Why don't we have a discussion about the functions and types involved
in URI processing?

Why don't we discuss expanding allowable alphabets and production rules?

David



Re: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

John Cowan-3
David Sheets scripsit:

> Anne's current draft increases the space of valid addresses. This
> isn't obvious as Anne's draft lacks a grammar and URI component
> alphabets. You support Anne's draft and its philosophy, therefore you
> are saying the space of valid addresses should be expanded.

Before confusion is worse confounded, Anne's draft does not extend the space
of valid addresses, but rather provides processing for both valid and
invalid addresses.  As such, it extends the space of what may be called
processable or usable addresses.

--
"The serene chaos that is Courage, and the phenomenon   [hidden email]
of Unopened Consciousness have been known to the        John Cowan
Great World eons longer than Extaboulism."
"Why is that?" the woman inquired.
"Because I just made that word up", the Master said wisely.
        --Kehlog Albran, The Profit             http://www.ccil.org/~cowan


Re: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

David Sheets-2
On Tue, Oct 23, 2012 at 8:59 PM, John Cowan <[hidden email]> wrote:

> David Sheets scripsit:
>
>> Anne's current draft increases the space of valid addresses. This
>> isn't obvious as Anne's draft lacks a grammar and URI component
>> alphabets. You support Anne's draft and its philosophy, therefore you
>> are saying the space of valid addresses should be expanded.
>
> Before confusion is worse confounded, Anne's draft does not extend the space
> of valid addresses, but rather provides processing for both valid and
> invalid addresses.  As such, it extends the space of what may be called
> processable or usable addresses.

In the version of the spec that I am reading
<http://url.spec.whatwg.org/>, I see definition of an "invalid flag"
<http://url.spec.whatwg.org/#invalid-flag> and a "valid attribute"
<http://url.spec.whatwg.org/#dom-url-valid>.

It would appear that <##> is a valid WHATWGRL but an invalid URI reference.

Under WHATWGRL processing (and as is found in extant browsers), <##>
is handled as it is literally written and not through coercion to a
"valid" address from a "usable" address.

Given these facts, I don't understand how Anne's spec doesn't extend
the space of valid addresses. If the spec says that input is output
without modification and that input is outside of the STD 66 space,
doesn't that expand the space of valid addresses?

WHATWGRLs with "[" and "]" in path, query, and fragment are allowed as well.

David



RE: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

Ian Hickson
In reply to this post by David Sheets-2
On Wed, 24 Oct 2012, Manger, James H wrote:
>
> Currently, I don't think url.spec.whatwg.org distinguishes between
> strings that are valid URLs and strings that can be interpreted as URLs
> by applying its standardised error handling. Consequently, error
> handling cannot be at the option of the software developer as you cannot
> tell which bits are error handling.

Well first, the whole point of discussions like this is to work out what
the specs _should_ say; if the specs were perfect then there wouldn't be
any need for discussion.

But second, I believe it's already Anne's intention to add to the parsing
algorithm the ability to abort whenever the URL isn't conforming, he just
hasn't done that yet because he hasn't specced what's conforming in the
first place.


On Tue, 23 Oct 2012, David Sheets wrote:
>
> One algorithm? There seem to be several functions...
>
> - URI reference parsing (parse : scheme -> string -> raw uri_ref)
> - URI reference normalization (normalize : raw uri_ref -> normal uri_ref)
> - absolute URI predicate (absp : normal uri_ref -> absolute uri_ref option)
> - URI resolution (resolve : absolute uri_ref -> _ uri_ref -> absolute uri_ref)

I don't understand what your four algorithms are supposed to be.

There's just one algorithm as far as I can tell -- it takes as input an
arbitrary string and a base URL object, and returns a normalised absolute
URL object, where a "URL object" is a conceptual construct consisting of
the components scheme, userinfo, host, port, path, query, and
fragment, which can be serialised together into a string form.

(I guess you could count the serialiser as a second algorithm, in which
case there's two.)
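
As a rough sketch of that shape (record fields and function names illustrative, not taken from the draft), in ML-style notation:

    module type URL = sig
      type url = {
        scheme   : string;
        userinfo : string option;
        host     : string option;
        port     : int option;
        path     : string list;
        query    : string option;
        fragment : string option;
      }

      (* the one algorithm: arbitrary string plus base URL object, returning
         a normalised absolute URL object (or nothing if uninterpretable) *)
      val parse : base:url -> string -> url option

      (* the serialiser, arguably the second algorithm *)
      val serialize : url -> string
    end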


> Anne's current draft increases the space of valid addresses.

No, Anne hasn't finished defining conformance yet. (He just started
today.)

You may be getting confused by the "invalid flag", which doesn't mean the
input is non-conforming, but means that the input is uninterpretable.


> > The de facto parsing rules are already complicated by de facto
> > requirements for handling errors, so defining those doesn't increase
> > complexity either (especially if such behaviour is left as optional,
> > as discussed above.)
>
> *parse* is separate from *normalize* is separate from checking if a
> reference is absolute (*absp*) is separate from *resolve*.

No, it doesn't have to be. That's actually a more complicated way of
looking at it than necessary, IMHO.


> Why don't we have a discussion about the functions and types involved in
> URI processing?
>
> Why don't we discuss expanding allowable alphabets and production rules?

Personally I think this kind of open-ended approach is not a good way to
write specs. Better is to put forward concrete use cases, technical data,
etc, and let the spec editor take all that into account and turn it into a
standard. Arguing about what precise alphabets are allowed and whether to
spec something using prose or production rules is just bikeshedding.

--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


RE: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

Manger, James H
> On Wed, 24 Oct 2012, Manger, James H wrote:
> >
> > Currently, I don't think url.spec.whatwg.org distinguishes between
> > strings that are valid URLs and strings that can be interpreted as
> > URLs by applying its standardised error handling. Consequently, error
> > handling cannot be at the option of the software developer as you
> > cannot tell which bits are error handling.

> Well first, the whole point of discussions like this is to work out
> what the specs _should_ say; if the specs were perfect then there
> wouldn't be any need for discussion.
>
> But second, I believe it's already Anne's intention to add to the
> parsing algorithm the ability to abort whenever the URL isn't
> conforming, he just hasn't done that yet because he hasn't specced
> what's conforming in the first place.

That is good to hear. There is no hint about this in the current text/outline. There is an "invalid" flag in the current text -- but that is for strings that are so broken no error handling can resurrect a URL. There is no mention of a separate "conforming" flag, even if the rules for when to set it are yet to be fixed (though it should have been easy to say conforming = conforming-as-per-RFC 3986/3987 if that was the intention).

Assuming this is Anne's intention, then one spec covering URI/IRI/error-handling would be helpful. I'm not sure that parsing rules with conforming/non-conforming branches would be pretty, but perhaps that isn't necessary if it is clear from other parts of the spec what a conforming URL is.

--
James Manger


Re: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

Anne van Kesteren-4
On Wed, Oct 24, 2012 at 8:41 AM, Manger, James H
<[hidden email]> wrote:
> That is good to hear. There is no hint about this in the current text/outline. There is an "invalid" flag in the current text -- but that is for strings that are so broken no error handling can resurrect a URL. There is no mention of a separate "conforming" flag, even if the rules for when to set it are yet to be fixed (though it should have been easy to say conforming = conforming-as-per-RFC 3986/3987 if that was the intention).

Thanks for pointing that out. I renamed it to "fatal error flag" and
added an issue about the ability to halt on the first (non-fatal)
error. Ian is right that it's not defined yet because conformance is
not defined. And yes, as you say I've yet to figure out if branches in
the parser section is easy enough to do. If it is I think it's
worthwhile because it makes implementing a conformance checker (or
strict parser) much more straightforward.


--
http://annevankesteren.nl/


Re: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

Jan Algermissen-3
In reply to this post by Ian Hickson

On Oct 24, 2012, at 1:47 AM, Ian Hickson <[hidden email]> wrote:

> On Wed, 24 Oct 2012, Jan Algermissen wrote:
>>
>> What matters is that nothing of the existing URI spec *changes*.
>>
>> Can you agree on that?
>
> Do you mean the actual text, or the normative meaning of the text?

I ideally mean the actual text, but it might be that there is some overlap in the construction algorithms - I am not expert enough there to judge that.

The point really was to make it very clear that *additional* stuff is going to be said and that existing implementations that follow the URI spec strictly remain conforming.

Jan


>
> --
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
>



Re: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

John C Klensin
In reply to this post by Mark Nottingham-2


--On Wednesday, October 24, 2012 11:39 +0100 Brian E Carpenter
<[hidden email]> wrote:

>
> On 23/10/2012 00:32, Mark Nottingham wrote:
> ...
>> The underlying point that people seem to be making is that
>> there's legitimate need for URIs to be a separate concept
>> from "strings that will become URIs." By collapsing them into
>> one thing, you're doing those folks a disservice. Browser
>> implementers may not care, but it's pretty obvious that lots
>> of other people do.
>
> Thanks for bringing this point out. It was explained to me in
> 1993 by TBL and Robert Cailliau that URLs (the only term used
> then, I think) should never be typed in by a user, and
> preferably never even seen by a user. It's because that
> doctrine was abandoned a year or so later that we have this
> problem today. I think there would be value in a document
> making this clear, as a framework for clearly separating the
> specification of what is allowable as a URI on the wire from
> what is acceptable as a user input string (UIS?).
>
> UIS to URI conversion may well end up as a heuristic algorithm.

Very useful perspective, IMO.

Seen that way, IRIs might then be considered a different flavor
of UIS.  Less heuristic than some, but not a flavor of URI.

best,
   john



Re: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

Stephen Farrell
In reply to this post by Ted Hardie-2

On 10/24/2012 11:36 AM, Jari Arkko wrote:
> Ted, Ian,
>
>> Un-marked context shifts are
>> likely, and likely to be bad.  Avoiding them by picking a new term is
>> both easy and appropriate.
>
> FWIW, I agree with Ted's advice above.

Further to that, some guy's fine tool [1] says that
RFC 3986 is referenced by 193 other RFCs. Google
scholar says that RFC 3986 has 2090 citations. [2]
While that's nowhere near the full story, it is
nonetheless significant.

Buggering about with a spec like that whilst only
considering a limited context seems hugely dumb to
me, no matter how good a case you seem to have for
changing bits and pieces of it. So far, I've not seen
the argument for changing 3986, but there do seem
to be good arguments for additional things such as
error-handling specific to the browser context.

I honestly don't get the argument for forking,
and I agree with Ted that that's what's at issue
and would be stupid. I don't know that inventing a
new term is that good an approach either really,
it'd seem better to me to try specify the additional
stuff needed in the whatwg context and then see if
there's any change to 3986 needed and if so, then
propose those changes as an update to the RFC,
following the IETF process. That is not what I
see happening now and I do see an apparent intent
to create the stupid fork that Ted identified.

Getting an update to 3986 through the IETF process
will I'm sure be a huge PITA for whoever does
propose it, since there are so many other dependent
specs. But that's life, and it's quite do-able, if
done by someone who's able to handle that kind of
work.

S.

[1] http://www.arkko.com/tools/allstats/citations-rfc3986.html
[2]
http://scholar.google.com/scholar?cites=6121683772362091274&as_sdt=2005&sciodt=0,5&hl=en


>
>>
>> My personal opinion, as always,
>>
>
> Mine too.
>
> Jari
>
>
>


Re: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

Carsten Bormann
In reply to this post by David Sheets-2
On Oct 24, 2012, at 06:20, David Sheets <[hidden email]> wrote:

> WHATWGRL

Hey, call them EARLs.  Error-tolerant web-Address Repairing Labels or whatever.
(Just not URLs, that term is already taken in the Web.)

Regards, Carsten



Re: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

Ian Hickson
In reply to this post by Jan Algermissen-3
On Wed, 24 Oct 2012, Jan Algermissen wrote:

> On Oct 24, 2012, at 1:47 AM, Ian Hickson <[hidden email]> wrote:
> > On Wed, 24 Oct 2012, Jan Algermissen wrote:
> >>
> >> What matters is that nothing of the existing URI spec *changes*.
> >>
> >> Can you agree on that?
> >
> > Do you mean the actual text, or the normative meaning of the text?
>
> I ideally mean the actual text, but it might be that there is some
> overlap in the construction algorithms - I am not expert enough there to
> judge that.

Well I definitely don't think we should constrain a spec editor to being
forced to use text he didn't even write; that seems like a very poor way
to write a spec. Especially given that here the text is already spread
across two specs (URI and IRI).

I think it makes sense to be conservative and say that URL syntax
conformance requirements should probably not change from what the IRI spec
says today unless there's a really compelling reason, though.


> The point really was to make it very clear that *additional* stuff is
> going to be said and that existing implementations that follow the URI
> spec strictly remain conforming.

Well unless there's a very good reason (e.g. following the current specs
involves a security vulnerability or something like that) then I'd think
that was a reasonably strong technical requirement, sure. But that's
independent of how the spec is written. It's trivial to write a spec that
uses existing text while making all existing implementations
non-conforming, for example (just add a line that says "implementations
MUST NOT do what the following section says" or something...).

--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

Roy T. Fielding
In reply to this post by Mark Nottingham-2
On Oct 24, 2012, at 3:39 AM, Brian E Carpenter wrote:
> On 23/10/2012 00:32, Mark Nottingham wrote:
> ...
>> The underlying point that people seem to be making is that there's legitimate need for URIs to be a separate concept from "strings that will become URIs." By collapsing them into one thing, you're doing those folks a disservice. Browser implementers may not care, but it's pretty obvious that lots of other people do.
>
> Thanks for bringing this point out. It was explained to me in 1993 by TBL and
> Robert Cailliau that URLs (the only term used then, I think)

As a historical footnote, the term URL was created by the same
BOF that created the Uniform Resource Identifiers working group
at the IETF meeting in July 1992.

The early Web protocol specs had used the term "network address".

The term "Document Identifiers" came from Brewster Kahle and was
later used in a call for proposals by the Coalition for Networked
Information's Architectures & Standards Working Group, which in
turn led TimBL to propose Web addresses as Universal Document
Identifiers for a BOF at IETF 24 (Cambridge, MA).  Somewhere
in that BOF discussion, the URI working group was proposed and
TimBL's proposal was renamed Uniform Resource Locators
to distinguish it from other ideas for URNs
[see IETF 24 proceedings, p.184, and the following link].

 ftp://ftp.ietf.org/ietf/92jul/udi-minutes-92jul.txt

TimBL had originally specified that addresses in HREF could be
provided in full or partial form.  The IETF removed the partial
form, leading to all sorts of bad decisions regarding syntax,
and so I revived it in 1994 as Relative URLs [RFC1808].  That
spec is the only one that came close to defining what Anne
is trying to do here -- a single parsing standard for
potentially relative references.

It is easy to claim that the merging of syntax specs that
created RFC2396 lost some value when the parsing standard was
replaced by a non-normative appendix.  However, it was discussed
extensively at the time, including with the browser developers,
and there was simply nothing common enough to make standard.
The best I could do for 2396 and 3986 was to include a
regular expression that accepts all strings and parses them
into the component parts.
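
For reference, that is the non-normative expression in RFC 3986, Appendix B, in which groups 2, 4, 5, 7, and 9 are the scheme, authority, path, query, and fragment. A rough OCaml Str rendering of it (an illustrative sketch, not normative text):

    let uri_reference_re =
      Str.regexp
        "^\\(\\([^:/?#]+\\):\\)?\\(//\\([^/?#]*\\)\\)?\\([^?#]*\\)\\(\\?\\([^#]*\\)\\)?\\(#\\(.*\\)\\)?"

    (* splits any string into its five components; the expression accepts every string *)
    let components s =
      if Str.string_match uri_reference_re s 0 then
        let g n = try Some (Str.matched_group n s) with Not_found -> None in
        Some (g 2, g 4, g 5, g 7, g 9)
      else
        None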

I have absolutely no problem with writing a proposed standard
for parsing references, particularly if browser developers are
willing to adhere to one.  However, it is not a redefinition of
URLs, nor does it make sense for error-correcting transformations
(like pct-encoding embedded spaces) to be "the standard" for
parsing when there are plenty of applications that string
parse references for the sake of generating invalid test cases
(e.g., the example attributed to curl).

It is not non-interoperable behavior to parse input data
differently depending on the context in which it is entered.
What matters is that the context be properly documented to
indicate what pre/post-processing is applied, just as we
expect a browser's combined search/location dialog bars to
be documented as not merely URL-entry forms (or be banned
due to the privacy leakage of incremental search results).

....Roy

Re: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

David Sheets-2
In reply to this post by Ian Hickson
On Tue, Oct 23, 2012 at 10:05 PM, Ian Hickson <[hidden email]> wrote:

> On Wed, 24 Oct 2012, Manger, James H wrote:
>>
>> Currently, I don't think url.spec.whatwg.org distinguishes between
>> strings that are valid URLs and strings that can be interpreted as URLs
>> by applying its standardised error handling. Consequently, error
>> handling cannot be at the option of the software developer as you cannot
>> tell which bits are error handling.
>
> Well first, the whole point of discussions like this is to work out what
> the specs _should_ say; if the specs were perfect then there wouldn't be
> any need for discussion.

Good! Let's have a discussion about what the spec should say.

> On Tue, 23 Oct 2012, David Sheets wrote:
>>
>> One algorithm? There seem to be several functions...
>>
>> - URI reference parsing (parse : scheme -> string -> raw uri_ref)
>> - URI reference normalization (normalize : raw uri_ref -> normal uri_ref)
>> - absolute URI predicate (absp : normal uri_ref -> absolute uri_ref option)
>> - URI resolution (resolve : absolute uri_ref -> _ uri_ref -> absolute uri_ref)
>
> I don't understand what your four algorithms are supposed to be.

Ian, these are common descriptors (and function signatures).

Here are (longer) prose descriptions for those unfamiliar with
standard functional notation:

*parse* is a function which takes the contextual scheme and a string
to be parsed and produces a structure of unnormalized reference
components.
*normalize* is a function which takes a structure of unnormalized
reference components and produces a structure of normalized reference
components (lower-casing scheme, lower-casing host for some schemes,
collapsing default ports, coercing invalid codepoints, etc).
*absp* is a function which takes a structure of normalized reference
components and potentially produces a structure of normalized
reference components which is guaranteed to be absolute (or nothing:
in JS, this roughly corresponds to nullable).
*resolve* is a function which takes a URI structure and a reference
component structure and produces a URI structure corresponding to the
reference resolution of the second argument against the first (base)
argument.

See my original message for how these compose into your one_algorithm.

> There's just one algorithm as far as I can tell -- it takes as input an
> arbitrary string and a base URL object, and returns a normalised absolute
> URL object, where a "URL object" is a conceptual construct consisting of
> the components scheme, userinfo, host, port, path, query, and
> fragment, which can be serialised together into a string form.

How is the arbitrary string deconstructed? How is the result
normalized? What constitutes an absolute reference? How does a
reference resolve against a base URI?

>> Anne's current draft increases the space of valid addresses.
>
> No, Anne hasn't finished defining conformance yet. (He just started
> today.)

This is a political dodge to delay the inevitable discussion of
address space expansion.

From what I have read of WHATWG's intentions and discussed with you
and others, you are codifying current browser behavior for
'interoperability'. Current browsers happily consume and emit URIs
that are invalid per STD 66.

<http://url.spec.whatwg.org/#writing> presently says:
"A fragment is "#", followed by any URL unit that is not one of
U+0009, U+000A, and U+000D."
This is larger than STD 66's space of valid addresses.
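
For comparison, STD 66 restricts fragments to *( pchar / "/" / "?" ). A sketch of that character set follows (the RFC 3986 side of the comparison only; '%' is allowed only as the start of a percent-encoded triplet):

    let rfc3986_fragment_char = function
      | 'a'..'z' | 'A'..'Z' | '0'..'9'
      | '-' | '.' | '_' | '~'                                             (* unreserved *)
      | '!' | '$' | '&' | '\'' | '(' | ')' | '*' | '+' | ',' | ';' | '='  (* sub-delims *)
      | ':' | '@' | '/' | '?' | '%' -> true
      | _ -> false

    (* '#', '[', and ']' fall outside this set, which is the difference
       pointed out here and earlier in the thread *)
    let () =
      assert (List.for_all (fun c -> not (rfc3986_fragment_char c)) ['#'; '['; ']'])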

>> > The de facto parsing rules are already complicated by de facto
>> > requirements for handling errors, so defining those doesn't increase
>> > complexity either (especially if such behaviour is left as optional,
>> > as discussed above.)
>>
>> *parse* is separate from *normalize* is separate from checking if a
>> reference is absolute (*absp*) is separate from *resolve*.
>
> No, it doesn't have to be. That's actually a more complicated way of
> looking at it than necessary, IMHO.

Why use several simple, flexible sharp tools when you could use a
single complicated, monolithic blunt tool?

Why do you insist on producing a single, brittle, opaque function when
you could produce several simply-defined functions that actually model
the data type transformations?

Vendors are, of course, always free to implement an optimized
composition for their specific use cases.

>> Why don't we have a discussion about the functions and types involved in
>> URI processing?
>>
>> Why don't we discuss expanding allowable alphabets and production rules?
>
> Personally I think this kind of open-ended approach is not a good way to
> write specs.

The specs already exist and use these formalisms successfully. Why do
you think discussions about the model of the problem-space are
'open-ended'? Why are you trying to stop a potentially productive
discussion?

> Better is to put forward concrete use cases, technical data,
> etc, and let the spec editor take all that into account and turn it into a
> standard.

Is <https://github.com/dsheets/ocaml-uri/blob/master/lib/uri.ml#L108>
correct or should safe_chars_for_fragment include '#'?

Whatever 'standard' you produce will require me to exert significant
effort on your One Giant Algorithm to factor it into its proper
components and reconcile it with competing standards for my users. I
have applications that use each of the above functions separately.

> Arguing about what precise alphabets are allowed and whether to
> spec something using prose or production rules is just bikeshedding.

I can only conclude that you understand neither the value of precision
nor the meaning of "bikeshedding".

You are not constructing anything remotely comparable to a nuclear reactor.

I am expressing a genuine desire to discuss the actual technical
content of the relevant specifications in the most precise and concise
way possible.

I am losing confidence in your technical leadership.

Sincerely,

David Sheets



Re: [whatwg] New URL Standard from Anne van Kesteren on 2012-09-24 (public-whatwg-archive@w3.org from September 2012)

Anne van Kesteren-4
On Thu, Oct 25, 2012 at 5:37 AM, David Sheets <[hidden email]> wrote:

> On Tue, Oct 23, 2012 at 10:05 PM, Ian Hickson <[hidden email]> wrote:
>> No, Anne hasn't finished defining conformance yet. (He just started
>> today.)
>
> This is a political dodge to delay the inevitable discussion of
> address space expansion.
>
> From what I have read of WHATWG's intentions and discussed with you
> and others, you are codifying current browser behavior for
> 'interoperability'. Current browsers happily consume and emit URIs
> that are invalid per STD 66.

Correct, but that does not mean valid input is similarly relaxed. Nor is
that the case for HTML, for example: literally anything produces a tree
of sorts, but far from all input is considered valid.


> <http://url.spec.whatwg.org/#writing> presently says:
> "A fragment is "#", followed by any URL unit that is not one of
> U+0009, U+000A, and U+000D."
> This is larger than STD 66's space of valid addresses.

I aligned it with IRI now, apart from private Unicode ranges. Not
really sure why we should ban them in one place and not in another.


We discussed how to write the algorithms on the WHATWG list before and
again I challenge you to write out your approach and convince the
world it's better. I'm not interested in doing that work for you.


--
http://annevankesteren.nl/
