URI components question

URI components question

michele vivoda

Hi all,
I have a question about URIs.

I was wondering whether it is correct to think that a URI can be decomposed
into components that can be stored in unescaped form while maintaining the
URI semantics, so that the (same or an equivalent) URI can be reconstructed
from those components.

Perhaps better put, the question is: can we always build a URI from
unescaped components?

Many programming APIs offer the possibility of building a URI from unescaped
components.
In 99% of cases this has worked well for me. But consider the following
URI:

http://a/b?p1=R%26D&p2=q

The unescaped query component, originally containing 2 parameters, becomes:

p1=R&D&p2=q

losing its meaning, since now we have 3 parameters.
My conclusion is that (at least) the query component cannot be unescaped.
Is this right? Does it apply only to the query, or should unescaped
components not exist at all?

Regards
Michele Vivoda
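
The effect described above can be reproduced with a short sketch. It assumes Python's standard urllib.parse module; the helper split_pairs is a hypothetical name used only for illustration, and the "&" / "=" structure is the form-style convention from the example, not something the generic URI syntax defines.

```python
from urllib.parse import urlsplit, unquote

uri = "http://a/b?p1=R%26D&p2=q"
raw_query = urlsplit(uri).query   # "p1=R%26D&p2=q", still percent-encoded


def split_pairs(query):
    """Split on '&', then on the first '=' only; nothing is decoded here."""
    return [tuple(part.split("=", 1)) for part in query.split("&")]


# Decoding the whole component first turns the data "&" into a delimiter.
decoded_first = split_pairs(unquote(raw_query))

# Splitting on the literal delimiters first, then decoding each piece,
# keeps the two original parameters intact.
split_first = [tuple(unquote(x) for x in pair) for pair in split_pairs(raw_query)]

print(decoded_first)  # [('p1', 'R'), ('D',), ('p2', 'q')]
print(split_first)    # [('p1', 'R&D'), ('p2', 'q')]
```

The order of operations is the whole point: once the component is stored in decoded form, the distinction between the two kinds of "&" cannot be recovered.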





Re: URI components question

Frank Ellermann

michele vivoda wrote:

> My conclusion is that (at least) query component cannot be
> unescaped. Is this right, does it apply only to query or
> unescaped components should not exist at all ?

Everywhere.  The standard says query = *( pchar / "/" / "?" )

In other words, you can use "/", "?", and any unescaped pchar
directly, without percent-escapes.  A parser that has found the
query is no longer interested in the "?" that started it, nor
in any "/" used in the path before this "?".

If you check pchar you'll find that it allows ":" and "@" to be
used directly, for similar reasons: the only places where ":"
and "@" are relevant are before the path / query.

But once the parser has reached the query it still has to find
its end, e.g. ">", '"', "#", or white space.  These characters
must be escaped if they are part of the query; pchar doesn't
contain them directly.

But you can use "&" and "=" directly in a query, as in your
example p1=R%26D&p2=q.  The issue is that the standard does
not define an internal structure for queries; this could be
anything, depending on the scheme and / or server.

E.g. for http some servers accept ";" instead of "&" to
delimit parameters (key=value or simply value).  So if you
send query strings to servers where "&", ";", and "=" have a
meaning as delimiters, you must not escape them where you
intend them as delimiters; everywhere else you must escape them.

In your example you have a value R&D for p1.  Because "&" is
used to delimit parameters in your query, you need p1=R%26D.
You would also %-encode "=" or ";" within keys or values of
queries sent to normal http servers.

Actually you could get away with p1=R&D&p2=q if "&" is not
allowed in key names, and if singleton values (no key =) are
also not allowed (excl. the special "isindex" query):

p1=R plus D&p2=q would give an invalid key D&p2
p1=R plus D plus p2=q would give an invalid singleton D
p1=R&D plus p2=q would be the last chance to make sense of it.

                       Bye, Frank
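
As a rough illustration of the rule above, the following sketch encodes keys and values for a server that is assumed to treat "&", ";" and "=" as delimiters. It uses Python's urllib.parse.quote; the function name encode_query and the chosen safe set are assumptions made for this example, not part of any standard API.

```python
from urllib.parse import quote

# "/", "?", ":" and "@" may appear directly in any query, so they stay as-is.
# "&", "=" and ";" are left out of the safe set because the (assumed) server
# gives them meaning as delimiters, so inside keys and values they are escaped.
SAFE_IN_KEY_OR_VALUE = "/?:@"


def encode_query(pairs):
    """Join (key, value) pairs with literal '=' and '&', escaping the data."""
    return "&".join(
        quote(key, safe=SAFE_IN_KEY_OR_VALUE)
        + "="
        + quote(value, safe=SAFE_IN_KEY_OR_VALUE)
        for key, value in pairs
    )


print(encode_query([("p1", "R&D"), ("p2", "q")]))
# p1=R%26D&p2=q
print(encode_query([("next", "/a/b?x"), ("note", 'say "hi" #1')]))
# next=/a/b?x&note=say%20%22hi%22%20%231
```

The second call shows both halves of the argument: "/" and "?" pass through untouched, while the characters that would end the query ('"', "#", white space) are always escaped.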




Re: URI components question

Jeremy Carroll
In reply to this post by michele vivoda

I have been thinking about this too in the last week or two, and cannot
work out a decent API that captures the escaping/unescaping semantics.

My analysis is as follows:

1) The issue revolves around the reserved characters:
  reserved      = gen-delims / sub-delims
    gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

2) Some of these reserved characters (usually the gen-delims) have
syntactic significance in the generic syntax. For these it is possible
to then give the unescaped form of terms from that generic syntax.

3) Under (2) we are mainly talking about components; however, in some
cases we are talking about subcomponents. For example, if a path
contains a segment which contains a "/" in unescaped form, then that "/"
must be % encoded in the URL, and it is not possible to provide an API
that treats the path as an atomic component that can be presented in
both escaped and unescaped form, because the unescaped form of

http://example.org/a/b/c/d
http://example.org/a%2Fb/c/d
http://example.org/a/b%2Fc/d
http://example.org/a%2Fb%2Fc/d

are all the same, yet the segments are different in each case.
It is possible to conceive of an API that talks about a path as an array
of strings, each being segments, in which each segment is presented in
either escaped or unescaped form.

4) The sub-delims are used for both scheme-specific and application-
specific semantics. So for instance, the ftp scheme reserves ';' in a
path. So in this case we would be best served by an API that explicitly
supported that, and splits the path on ';' and (re)uses a generic path
API for the part before a syntactically significant ';' and then perhaps
has a name=value API for the part after the ';'.

5) The query string is left as totally generic in the HTTP spec, but is
often used, as in your example, with a value that follows the HTML form
behaviour of a sequence of name=value pairs.

6) Perhaps the starting point is to split a URL into a sequence of
pairs, each pair consisting of a string of syntactically significant
reserved characters, and a string of characters.

e.g.
http://example.org/a/b%2Fc/d

==>   ""  "http"
       "://" "example.org"
       "/"   "a"
       "/"   "b/c"
       "/"   "d"


If we %-escape any reserved character from the second column then we
should be able to construct a correct URI.
However, this representation is not very useful, because it does not
reflect the semantic grouping into components. Also we will
unnecessarily %-escape many reserved characters that are not
syntactically significant in that context.
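
The segment-array idea from (3) can be sketched as follows. This is only an illustration using Python's urllib.parse; path_segments and build_path are hypothetical names, and the sketch assumes an absolute path with no scheme-specific use of sub-delims.

```python
from urllib.parse import quote, unquote, urlsplit


def path_segments(uri):
    """Return the decoded segments of an absolute path, one string each."""
    raw_path = urlsplit(uri).path          # still percent-encoded
    return [unquote(seg) for seg in raw_path.split("/")[1:]]


def build_path(segments):
    """Escape each segment (including any '/') and join with literal '/'."""
    return "/" + "/".join(quote(seg, safe="") for seg in segments)


uris = [
    "http://example.org/a/b/c/d",
    "http://example.org/a%2Fb/c/d",
    "http://example.org/a/b%2Fc/d",
    "http://example.org/a%2Fb%2Fc/d",
]

for u in uris:
    # The fully unescaped path is "/a/b/c/d" every time; the segments differ.
    print(unquote(urlsplit(u).path), path_segments(u))

print(build_path(["a", "b/c", "d"]))  # /a/b%2Fc/d
```

Because the split happens before decoding, each of the four URIs yields a different list, and build_path can reproduce the original path from it.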

Another issue here is that within any of these components there may be
an embedded URL, which may itself have some %-escapes, which should in
turn be %-escaped!

e.g. modifying your example:


http://a/b?p1=R%26D&p2=q

If the query values:
    p1   R&D
    p2   q
have a third value
    p3   http://a/b?p1=R%26D&p2=q

then the correct URL may be

http://a/b?p1=R%26D&p2=q&p3=http://a/b?p1%3DR%2526D%26p2%3Dq

Where the %2526 represents an & doubly encoded.
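
A small sketch of this double encoding, assuming Python's urllib.parse and the same "&" / "=" query convention as in the example (the variable names are illustrative only):

```python
from urllib.parse import quote, unquote, urlsplit

inner = "http://a/b?p1=R%26D&p2=q"   # already a valid, percent-encoded URI

# Escape "%" (giving %25) together with "&" and "=", which delimit the OUTER
# query. "/", "?", ":" and "@" are left alone since a query may contain them.
encoded_inner = quote(inner, safe="/?:@")
outer = "http://a/b?p1=R%26D&p2=q&p3=" + encoded_inner
print(outer)
# http://a/b?p1=R%26D&p2=q&p3=http://a/b?p1%3DR%2526D%26p2%3Dq

# Reading it back: split the outer query first, then decode exactly once.
name, _, raw_value = urlsplit(outer).query.split("&")[2].partition("=")
recovered = unquote(raw_value)
print(name, recovered)
# p3 http://a/b?p1=R%26D&p2=q
```

Decoding once returns the inner URI with its own %26 still in place; decoding a second time would corrupt it in the same way as the original R&D example.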

Perhaps the API design should have methods such as
   String[][] URI.getQuery(String regex)
returning an array of pairs of Strings as above, where the regex may be
something like "([^=]*=[^&]*&)*([^=]*=[^&]*)" and is used to know which
terms should be escaped/unescaped. At least in this case, the same regex
and an array of just the names and values could be used to construct the
query part correctly, with the regex being used to insert the syntactic
& and =.

Jeremy
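
One way to picture such a pair of methods is the sketch below. Note that it swaps the regex for two plain delimiter parameters, which is a simplification and not the API proposed above; get_query and build_query are hypothetical names, and Python's urllib.parse is assumed.

```python
from urllib.parse import quote, unquote


def get_query(raw_query, pair_sep="&", kv_sep="="):
    """Split a still-encoded query on its delimiters, then decode each term."""
    pairs = []
    for part in raw_query.split(pair_sep):
        name, _, value = part.partition(kv_sep)
        pairs.append((unquote(name), unquote(value)))
    return pairs


def build_query(pairs, pair_sep="&", kv_sep="="):
    """Encode each name and value, then insert the syntactic delimiters."""
    return pair_sep.join(
        quote(name, safe="") + kv_sep + quote(value, safe="")
        for name, value in pairs
    )


pairs = get_query("p1=R%26D&p2=q")
print(pairs)                             # [('p1', 'R&D'), ('p2', 'q')]
print(build_query(pairs))                # p1=R%26D&p2=q
print(build_query(pairs, pair_sep=";"))  # p1=R%26D;p2=q
```

The same description of the syntax (here the two separators) drives both directions, which is what lets the round trip reproduce the original query.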






RE: URI components question

Dave Risney
In reply to this post by michele vivoda

> Perhaps better said the question is: can we always build an URI from
> unescaped components ?

        Michele, as you noted in your example and in Jeremy's example we
can't build an equivalent URI from a set of URI components that have had
percent-encoded reserved characters decoded.

> My conclusion is that (at least) query component cannot be unescaped.
> Is this right, does it apply only to query or unescaped components
> should not exist at all ?

        There is no good way to do this for an arbitrary scheme.  When
writing a general URI parsing API you shouldn't decode every
percent-encoded octet in a URI component.

As stated in RFC 3986 if you percent-encode or decode a reserved
character you get a new URI that is not equivalent to the original.
This is true in the query component as well as anywhere else reserved
characters may appear.  If you decode reserved characters in your stored
URI component you have now lost information as to whether that character
was originally percent-encoded or not.  Jeremy's paths from (3) are
great examples.

        Why do you want to decode the URI components?  If your intent is
to extract the underlying data the URI components represent then this
requires much more than just decoding percent-encoded octets.  To do
this you must know more about the scheme, specifically how the scheme
converts its underlying data to and from URI components.  For something
like mailto where it's clear what the underlying data is (email
addresses, subject, body, etc) and where it's clear what the associated
transformations to and from URI components are then you can decode
appropriately and obtain the underlying information.

For the http scheme this is trickier because except for perhaps the
userinfo and host components it's not clear what the underlying data is.
An HTTP server may convert the URI path component into a Unix file path,
or a database query, or it may base64 decode the path and return the
resulting data.  Without specific knowledge of what the URI components
represent you may only rely on the rules set by the RFC.  You could
split the URI as Jeremy suggested in (6), but you would have to split on
all reserved characters, since you wouldn't be able to distinguish which
are "syntactically significant" and which are not.

At that point you'd have to decide if such general parsing of URI
components would be useful.  My guess is that in the 99% of cases you
mentioned, having the path split around every reserved character would
be irritating when you want the path to just represent directories in
your file system.

        I think the better way to go about this is to have APIs that
compose and decompose URIs to and from a set of URI components
(appropriately percent-encoded) and to have a separate set of APIs to
convert between URI components and their underlying data for very
specific scenarios (which would handle percent-encoding and decoding).

For example, to get the name/value pairs out of a query URI component
that uses the application/x-www-form-urlencoded
<http://www.w3.org/TR/html4/interact/forms.html#form-content-type>
encoding, one would first call the hypothetical URI decomposition
API to obtain the query URI component.  Then one would call the
hypothetical getMapFromUrlEncodedQueryComponent function to obtain the
name-to-value map that is the underlying data.  The second function
would handle decoding the percent-encoded octets, since it's the
application/x-www-form-urlencoded encoding that mandated the
percent-encoding in the first place and has the knowledge to
decode appropriately.

Or, to convert a file path from your favorite OS to a path URI
component, one would first call the createPathURIComponentFromFilePath
function, which would among other things handle percent-encoding
characters.  Then one would call the URI composition API to construct
a new URI using the path URI component.

-Dave
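
The two-layer split described above might look roughly like this in Python. It is a sketch under assumptions: urlsplit and urlunsplit from urllib.parse stand in for the hypothetical decomposition and composition APIs, the two functions are Python-style renderings of the hypothetical names from the message, and the file path is assumed to be an absolute POSIX path.

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qs, quote, urlsplit, urlunsplit


def get_map_from_urlencoded_query_component(raw_query):
    """Form-urlencoded layer: split on the delimiters, then decode."""
    return parse_qs(raw_query, keep_blank_values=True)


def create_path_uri_component_from_file_path(file_path):
    """File-path layer: escape each directory or file name separately."""
    names = [p for p in PurePosixPath(file_path).parts if p != "/"]
    return "/" + "/".join(quote(name, safe="") for name in names)


# Layer 1: generic decomposition, components stay percent-encoded.
parts = urlsplit("http://a/b?p1=R%26D&p2=q")
# Layer 2: scheme- or application-specific conversion to underlying data.
params = get_map_from_urlencoded_query_component(parts.query)
print(params)  # {'p1': ['R&D'], 'p2': ['q']}

# The reverse direction: underlying data -> component -> composed URI.
path = create_path_uri_component_from_file_path("/docs/R&D 50%.txt")
uri = urlunsplit(("http", "a", path, "", ""))
print(uri)     # http://a/docs/R%26D%2050%25.txt
```

Keeping the generic layer free of decoding means it never has to guess which reserved characters were data; only the layer that knows the encoding convention touches the percent-escapes.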

-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of Jeremy
Carroll
Sent: Friday, January 27, 2006 3:39 AM
To: michele vivoda
Cc: [hidden email]
Subject: Re: URI components question









RE: URI components question

Windows-world
In reply to this post by michele vivoda
Hi,
 
You don't need an API to replace special characters.
If you are writing PHP code, simply use the function 'urlencode()' to encode special characters.

Afterwards, use 'urldecode()' to decode them.
 
Friendly,