If not JSON, what then ?

If not JSON, what then ?

Poul-Henning Kamp
Based on discussions in email and at the workshop in Stockholm,
JSON doesn't seem like a good fit for HTTP headers.

A number of inputs came up in Stockholm which inform the process:
Mark's earlier attempt to classify header syntax into groups, and the
desire for an efficient binary encoding in HTTP[3-6] (or HPACK++).

My personal intuition was that we should find a binary serialization
(like CORS), and base64 it into HTTP1-2:  Ie: design for the future
and shoe-horn into the present.  But no obvious binary serialization
seems to exist, CORS was panned by a number of people in the WS as
too complicated, and gag-reflexes were triggered by ASN.1.

Inspired by Mark's HTTP-header classification, I spent the train-trip
back home to Denmark pondering the opposite attack:  Is there a
common data structure which (many) existing headers would fit into,
which could serve our needs going forward?

This document chronicles my deliberations, and the strawman I came
up with:  Not only does it seem possible, it has some very interesting
possibilities down the road.

Disclaimer:  ABNF may not be perfect.

Structure of headers
====================

I surveyed current headers, and a very large fraction of them
fit into this data structure:

        header: ordered sequence of named dictionaries

The "ordered" constraint arises in two ways:  We have explicitly
ordered headers like {Content|Transfer}-Encoding, and we have headers
which are ordered by their q=%f parameters.

If we unserialize this model from RFC723x definitions, then ',' is
the list separator and ';' the dictionary indicator and separator:

     Accept: audio/*; q=0.2, audio/basic

The "ordered ... named" combination does not map directly to most
contemporary object models (JSON, python, ...) where dictionary
order is undefined, so a definition list is required to represent
this in JSON:

        [
            [ "audio/*", { "q": 0.2 }],
            [ "audio/basic", {}]
        ]

It looks tempting to find a way to make the toplevel JSON a dictionary
too, but given the use of wildcards in many of the keys ("text/*"),
and the q=%f ordering, that would not be helpful.
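
In Python terms, the data model is just an ordered list of (name,
parameters) pairs; a rough sketch, purely for illustration:

        # The Accept example above as an ordered list of pairs; a plain
        # top-level dict would lose the q= ordering discussed above.
        accept = [
            ("audio/*", {"q": 0.2}),
            ("audio/basic", {}),
        ]

        # iteration preserves the on-the-wire order
        for name, params in accept:
            print(name, params.get("q", 1.0))    # absent q counts as 1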

Next we want to give people the ability to have deeper structure,
and we can either do that recursively (ie: nested ordered seq of
dict) or restrict the deeper levels to only dict.

That is probably a matter of taste more than anything, but the
recursive design will probably appeal aesthetically to more than
just me, and as we shall see shortly, it comes with certain economies.

So let us use '<...>' to mark the recursion, since <> are shorter than
[] and {} in HPACK/huffman.

Here is a two level example:

        foobar: foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar

Parsed into JSON that would be:

        [
            [
                "foo",
                {
                    "p1": 1,
                    "p4": {},
                    "p3": [
                        [
                            "x1",
                            {}
                        ],
                        [
                            "x2",
                            {}
                        ],
                        [
                            "x3",
                            {
                                "y2": 2
                                "y1": 1,
                            }
                        ]
                    ],
                    "p2": "abc"
                }
            ],
            [
                "bar",
                {}
            ]
        ]

(NB shuffled dictionary elements to show that JSON dicts are unordered)

And now comes the recursion economy:

First we wrap the entire *new* header in <...>:

        foobar: <foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar>

This way, the first character of the header tells us that this header
has "common structure".

That explicit "common structure" signal means privately defined
headers can use "common structure" as well, and middleware and
frameworks will automatically Do The Right Thing with them.
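
To make "common structure" concrete, here is a rough Python sketch of
a parser for it (illustration only: no quoted-strings, escapes, number
conversion or error handling):

        def parse_common(text):
            s = text.strip()
            if s.startswith("<") and s.endswith(">"):    # explicit angle-bracket form
                s = s[1:-1]
            items, _ = _parse_list(s, 0)
            return items

        def _parse_list(s, i):
            items = []
            while i < len(s):
                name, i = _parse_value(s, i)
                params = {}
                while i < len(s) and s[i] == ";":
                    key, i = _parse_value(s, i + 1)
                    if i < len(s) and s[i] == "=":
                        params[key], i = _parse_value(s, i + 1)
                    else:
                        params[key] = {}                 # bare parameter, e.g. "p4"
                items.append([name, params])
                if i < len(s) and s[i] == ",":
                    i += 1
                else:
                    break
            return items, i

        def _parse_value(s, i):
            while i < len(s) and s[i] == " ":
                i += 1
            if s[i] == "<":                              # recurse: nested ordered list
                nested, i = _parse_list(s, i + 1)
                return nested, i + 1                     # skip the closing ">"
            j = i
            while j < len(s) and s[j] not in ",;=<>":
                j += 1
            return s[i:j].strip(), j

        # parse_common("foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar")
        # yields the structure shown above (numbers stay strings in this sketch).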

Next, we add a field to the IANA HTTP header registry (one can do
that I hope ?) classifying their "angle-bracket status":

 A) not angle-brackets -- incompatible structure, use topical parser
        Range

 B) implicit angle-brackets -- Has common structure but is not <> enclosed
        Accept
        Content-Encoding
        Transfer-Encoding

 C) explicit angle-brackets -- Has common structure and is <> enclosed
        all new headers go here

 D) unknown status.
        As it says on the tin.

Using this as a whitelist, and given suitable schemas, a good number
of existing headers can go into the common parser.

And then for the final trick:   We can now define new variants of
existing headers which "sidegrade" them into the common parser:

        Date: < 1469734833 >

This obviously needs a signal/negotiation so we know the other side
can grok them (HTTP2: SETTINGS, HTTP1: TE?)

Next:

Data Types
==========

I think we need these fundamental data types, and subtypes:

1)   Unicode strings

2)      ascii-string (maybe)

3)      binary blob

4)   Token

5)   Qualified-token

6)   Number

7)      integer

8)   Timestamp

In addition to these subtypes, schemas can constrain types
further, for instance integer ranges, string lengths etc.;
more on this below.

I will talk about each type in turn, but it goes without saying
that we need to fit them all into RFC723x, in a way that is not
going to break anything important and HPACK should not hate
them either.

In HTTP3+, they should be serialized intelligently, but that
should be trivial and I will not cover that here.

1) Unicode string
-----------------

The first question is whether we mean "unrestricted unicode" or
whether we want to try to sanitize it.

An example of sanitization is RFC7230's "quoted-string" which bans
control characters except forward horizontal white-space (=TAB).

Another is I-JSON (RFC7493)'s:

   MUST NOT include code points that identify Surrogates or
   Noncharacters as defined by UNICODE.

As far as I can tell, that means that you have to keep a full UNICODE
table handy at all times, and update it whenever additions are made
to unicode.  Not cool IMO.

Imposing an RFC7230-like restriction on unicode gets totally
rococo:  What does "forward horizontal white-space" mean on
a line which uses both left-to-right and right-to-left alphabets?
What does it mean in alphabets which write vertically?

Let us absolve the parser from such intimate unicode scholarship
and simply say that the data type "unicode string" is what it says,
and use the schemas to sanitize its individual use.

Encoding unicode strings in HTTP1+2 requires new syntax and
for any number of reasons, I would like to minimize that
and {re-|ab-}use quoted-strings.

RFC7230 does not specify what %80-%FF means in quoted-string, but
hints that it might be ISO8859.

Now we want it to become UTF-8.

My proposal at the workshop, to make the first three characters
inside the quotes a UTF-8 BOM is quite pessimal in HPACK's huffman
encoding:  It takes 68 bits.

Encoding the BOM as '\ufeff' helps but still takes an unreasonable
48 bits in HPACK/huffman.

In both H1 and H2 defining a new "\U" escape seems better.

Since we want to carry unrestricted unicode, we also need escapes
to put the <%20 codepoints back in.  I suggest "\u%%%%" like JSON.

(We should not restrict which codepoints may/should use \u%%%% until
we have studied if \u%%%% may HPACK/huffman better than "raw" UTF-8
in Asian codepages.)

The heuristic for parsing a quoted-string then becomes:

        1) If the quoted-string first two characters are "\U"
                -> UTF-8

        2)  If the quoted-string contains "\u%%%%" escape anywhere
                -> UTF-8

        3)  If the quoted-string contains only %09-%7E
                -> UTF-8 (actually: ASCII)

        4)  If the quoted-string contains any %7F-%8F
                -> UTF-8

        5)  If header definition explicitly says ISO-8859
                -> ISO8859

        6)  else
                -> UTF-8
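
A rough Python sketch of that heuristic, assuming the quoted-string
content is handed over with escapes intact and raw high bytes mapped
1:1 to code points (e.g. decoded as latin-1):

        import re

        def guess_charset(content, header_says_8859=False):
            if content.startswith("\\U"):
                return "utf-8"                          # 1) explicit \U marker
            if re.search(r"\\u[0-9a-fA-F]{4}", content):
                return "utf-8"                          # 2) \u%%%% escape present
            if all("\x09" <= c <= "\x7e" for c in content):
                return "utf-8"                          # 3) plain ASCII
            if any("\x7f" <= c <= "\x8f" for c in content):
                return "utf-8"                          # 4) high bytes present
            if header_says_8859:
                return "iso-8859-1"                     # 5) per header definition
            return "utf-8"                              # 6) default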

2) Ascii strings
----------------

I'm not sure if we need these or if they are even a good idea.

The "pro" argument is if we insist they are also english text
so we have something the entire world stands a chance to understand.

The "contra" arguement is that some people will be upset about that.

If we want them, they're quoted-strings from RFC723x without %7F-%FF.

It is probably better to derive them via schema from unicode strings.

3) Binary blobs
---------------

Binary blobs from crypto should squeeze into RFC7230
quoted-strings as well, since we cannot put any kind of markers or
escapes on tokens without breaking things.

Proposal:

        Quoted-string with "\#" as first two chars indicates base64
        encoded binary blob.

I chose "\#" because "#" is not in the base64 set, so if some
nonconforming implementation eliminates the "unnecessary escape"
it will be clearly visible (and likely recoverable) rather than
munging up the content of the base64.

Base64 is chosen because it is the densest well known encoding which
works well with HPACK/huffman:  The b64 characters on average emit
6.46 bits.

I have no idea how these blobs would look when parsed into JSON,
probably as base64 ?  But in languages which can represent them,
they should probably become native byte-strings.
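
A rough sketch of the round-trip in Python (assuming the "\#"-then-base64
convention proposed above):

        import base64

        def blob_to_quoted_string(data):
            return '"\\#' + base64.b64encode(data).decode("ascii") + '"'

        def quoted_string_to_blob(qs):
            content = qs[1:-1]                    # strip the surrounding quotes
            if not content.startswith("\\#"):
                raise ValueError("not a binary blob")
            return base64.b64decode(content[2:])

        # blob_to_quoted_string(b"\x00\x01\xfe") -> '"\\#AAH+"'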

4) Token
--------

As we know it from RFC7230:

   tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /
    "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
   token = 1*tchar

5) Qualified Token
------------------

   qualified_token = token 0*1("/" token)

All keys in all dictionaries are of this type.  (In JSON/python...
the keys are strings)

Schemas can restrict this further.

6) Numbers
----------

These are signed decimal numbers which may have a fraction.

In HTTP1+2 we want them always in "%f" format and we want them to
fit in IEEE754 64 bit floating point, which leads to the following
definition:

        0*1"-" DIGIT 0*nDIGIT 0*1("." 0*mDIGIT ) n+m < 15

(15 digits fit in IEEE754 64 binary floating point.)

These numbers can (also) be used as millisecond-resolution, absolute,
UNIX-epoch-relative timestamps for the foreseeable future.
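
A rough sketch of a validator for that definition, reading "n+m < 15"
as at most 15 significant digits in total:

        import re

        _NUMBER = re.compile(r"^-?([0-9]+)(?:\.([0-9]*))?$")

        def is_common_number(text):
            m = _NUMBER.match(text)
            if m is None:
                return False
            digits = len(m.group(1)) + len(m.group(2) or "")
            return digits <= 15        # survives an IEEE754 binary64 round-trip

        # is_common_number("1469734833.123")    -> True  (13 digits)
        # is_common_number("3.141592653589793") -> False (16 digits)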

7) Integers
-----------

        0*1"-" 1*15 DIGIT

Same restriction as above to fit into IEEE 754.

Range can & should be restricted by schemas as necessary.

8) Timestamps
-------------

I propose we do these as a subtype of Numbers, as UNIX-epoch-relative
time.  That is somewhat human-hostile and is leap-second-challenged.

If you know from the schema that a timestamp is coming, the parser
can easily tell the difference between an RFC7231 IMF-fixdate and a
Number-Date.

Without guidance from a schema it becomes inefficient to determine
if it is an IMF-fixdate, since the week day part looks like a token,
but it is not impossible.
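
With a schema saying "timestamp", the disambiguation is cheap; a rough
sketch:

        from datetime import datetime, timezone
        from email.utils import parsedate_to_datetime

        def parse_timestamp(value):
            v = value.strip()
            if v and (v[0].isdigit() or v[0] == "-"):
                # Number-Date: seconds since the UNIX epoch, possibly fractional
                return datetime.fromtimestamp(float(v), tz=timezone.utc)
            return parsedate_to_datetime(v)       # classic IMF-fixdate

        # parse_timestamp("1469734833")
        # parse_timestamp("Mon, 01 Aug 2016 07:43:34 GMT")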


Schemas
=======

There needs a "ABNF"-parallel to specify what is mandatory and
allowed for these headers in "common structure".

Ideally this should be in machine-readable format, so that
validation tools and parser-code can be produced without
(too much) human intervention.  I'm tempted to say we should
make the schemas JSON, but then we need to write JSON schemas
for our schemas :-/

Since schemas basically restrict what you are allowed to
express, we need to examine and think about what restrictions
we want to be able to impose, before we design the schema.

This is the least thought about part of this document, since
the train is now in Lund:

Unicode strings:
----------------

* Limit by (UTF-8) encoded length.
        Ie: a resource restriction, not a typographical restriction.

* Limit by codepoints
        Example: Allow only "0-9" and "a-f"
        The specification of code-points should be a list of codepoint
        ranges.  (Ascii strings could be defined this way)

* Limit by allowed strings
        ie: Allow only "North", "South", "East" and "West"

Tokens
------

* Limit by codepoints
        Example: Allow only "A-Z"

* Limit by length
        Example: Max 7 characters

* Limit by pattern
        Example: "A-Z" "a-z" "-" "0-9" "0-9"
        (use ABNF to specify ?)

* Limit by well known set
        Example: Token must be ISO3166-1 country code
        Example: Token must be in IANA FooBar registry

Qualified Tokens
----------------

* Limit each of the two component tokens as above.
       
Binary Blob
-----------

* Limit by length in bytes
        Example: 128 bytes
        Example: 16-64 or 80 bytes

Number
------

* Limit resolution
        Example: exactly 3 decimal digits

* Limit range
        Example: [2.716 ... 3.1415]

Integer
-------

* Limit range
        Example [0 ... 65535]

Timestamp
---------

(I can't think of usable restrictions here)
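
Pulling a few of the restrictions above together, a purely hypothetical
machine-readable schema (all field names invented for illustration) for
an imaginary "Compass" header could look like:

        compass_schema = {
            "header": "Compass",
            "angle-brackets": "explicit",            # category C above
            "element": {
                "type": "token",
                "allowed": ["North", "South", "East", "West"],
                "params": {
                    "degrees": {"type": "integer", "range": [0, 359]},
                },
            },
            "max-elements": 1,
        }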


Aaand... I'm in Copenhagen...

Let me know if any of this looks usable...

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[hidden email]         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.


Re: If not JSON, what then ?

Carsten Bormann
Hi Poul,

(I'm only lurking in this WG because the HTTP protocols are useful for
certain IoT applications; however I do understand the design focus here
is more on browser-web/big-web applications.)

Poul-Henning Kamp wrote:
> My personal intuition was that we should find a binary serialization
> (like CORS),

I'm assuming here you mean CBOR?

> and base64 it into HTTP1-2:  Ie: design for the future
> and shoe-horn into the present.  But no obvious binary serialization
> seems to exist, CORS was panned by a number of people in the WS as
> too complicated,

(If you are talking about CBOR:)
Well, it is more complicated than doing nothing.

Once a bespoke design of a data model and serialization is completed,
that is likely to be as complicated as CBOR (or even more).

The real problem is then that we have added another data model and
serialization of that data model to the overall complexity that needs to
be managed by a system that connects to the web.
(That may not make a difference for a browser, but it does for IoT and
other machine-to-machine applications.)

The specific data model that you have designed looks fine to me; it
should be representable in CBOR without problem.  What remains is the
cognitive dissonance of having done a base64(url) transformation, but as
you say that is not a real technical problem with HPACK.
(I am aware of the debugging advantages of text encoding, but these are
values that go into HTTP/2 headers and are obscured by HPACK, anyway.)
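
To illustrate, a rough sketch using the cbor2 Python package (any CBOR
implementation would do):

        import cbor2

        accept = [
            ["audio/*", {"q": 0.2}],
            ["audio/basic", {}],
        ]

        wire = cbor2.dumps(accept)   # compact binary form of the list-of-pairs model
        back = cbor2.loads(wire)     # decodes back to nested lists/dicts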

> and gag-reflexes were triggered by ASN.1.

I do sympathize here :-)

Regards, Carsten


Re: If not JSON, what then ?

Willy Tarreau-3
In reply to this post by Poul-Henning Kamp
Hi Poul-Henning,

I've read this a bit quickly but had a thought regarding this:

On Mon, Aug 01, 2016 at 07:43:34AM +0000, Poul-Henning Kamp wrote:
> 2) Ascii strings
> ----------------
>
> I'm not sure if we need these or if they are even a good idea.

That made me think that most of the header fields I'm seeing do not use
non-ascii characters at all, I'd even say non-printable-ascii. Most of
them contain :
  - host names (Host)
  - uris (Referer, Location)
  - user-agent strings (UA)
  - tokens (Connection, Accept, ...)
  - numbers

Thus in fact I'm wondering if it's really worth focusing the efforts on
non-ascii strings instead. I think we should be able to support them, but
not try to save too many resources on them if this comes at the expense of
the rest of the encoding, or with extra implementation complexity and/or
risks of vulnerabilities (eg: when an unterminated utf-8 sequence precedes
an LF character, I've seen some parsers consume that LF as part of the
character thus not seeing it as a line delimiter).

Just a few thoughts anyway. I'll take more time reading everything in
details, but for now I'm seeing good stuff here overall.

Cheers,
Willy


Re: If not JSON, what then ?

Poul-Henning Kamp
In reply to this post by Carsten Bormann
--------
In message <[hidden email]>, Carsten Bormann writes:

>Poul-Henning Kamp wrote:
>> My personal intuition was that we should find a binary serialization
>> (like CORS),
>
>I'm assuming here you mean CBOR?

Sorry yes, not enough tea yet...

>Once a bespoke design of a data model and serialization is completed,
>that is likely to be as complicated as CBOR (or even more).

Not if we follow the datamodel I proposed.

>The real problem is then that we have added another data model and
>serialization of that data model to the overall complexity that needs to
>be managed by a system that connects to the web.

Well, the problem is that if we do not add a common datamodel,
each and every new header brings its own.

And by basing the datamodel on the existing HTTP(1) header syntax,
we would not need to base64 encode a binary format in HTTP1.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[hidden email]         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.


Re: If not JSON, what then ?

Poul-Henning Kamp
In reply to this post by Willy Tarreau-3
--------
In message <[hidden email]>, Willy Tarreau writes:

>That made me think that most of the header fields I'm seeing do not use
>non-ascii characters at all, I'd even say non-printable-ascii. Most of
>them contain :
>  - host names (Host)
>  - uris (Referer, Location)
>  - user-agent strings (UA)
>  - tokens (Connection, Accept, ...)
>  - numbers
>
>Thus in fact I'm wondering if it's really worth focusing the efforts on
>non-ascii strings instead.

My take is that the data-model and serialization should be general
and unconstrained, and the constraints be applied in a/the schema
for each individual header.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[hidden email]         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.


Re: If not JSON, what then ?

Nicolas Mailhot
In reply to this post by Poul-Henning Kamp
IMHO it would be way simpler to specify that the dicts used in HTTP are ordered rather than invent another representation.

Anyway, please do not use < or >; web people have enough tag-soup problems in HTML (which will be used with HTTP).

If you're ready to invent binary representations, it's way simpler to specify UTF-8 as the encoding than to fall again into the multiple-encodings trap, which instead of helping anyone means everyone is incompatible with everyone else in subtle ways.

Finally, the "%f" number format is hostile to everyone who writes numbers unlike the USA.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: If not JSON, what then ?

Willy Tarreau-3
In reply to this post by Poul-Henning Kamp
On Mon, Aug 01, 2016 at 09:57:25AM +0000, Poul-Henning Kamp wrote:

> --------
> In message <[hidden email]>, Willy Tarreau writes:
>
> >That made me think that most of the header fields I'm seeing do not use
> >non-ascii characters at all, I'd even say non-printable-ascii. Most of
> >them contain :
> >  - host names (Host)
> >  - uris (Referer, Location)
> >  - user-agent strings (UA)
> >  - tokens (Connection, Accept, ...)
> >  - numbers
> >
> >Thus in fact I'm wondering if it's really worth focusing the efforts on
> >non-ascii strings instead.
>
> My take is that the data-model and serialization should be general
> and unconstrained, and the constraints be applied in a/the schema
> for each individual header.

But we're talking about protocol efficiency as well, which requires
taking into account what we have. We could for example consider the
notion of "extended strings" which are only used for header fields
which are not relevant to the protocol itself (eg: not used in
accept/range/connection/...) and which would allow unicode to be
safely transmitted. It might be used for user-agent if needed.

Willy


Re: If not JSON, what then ?

Poul-Henning Kamp
In reply to this post by Nicolas Mailhot
--------
In message <[hidden email]>, Nicolas Mailhot
writes:

>IMHO it would be way simpler to specify that the dicts used in
>http are ordered rather than invent another representation

We sort of already did that, only we never formally declared
that they were dicts or what the datamodel actually looked like.

My document was an attempt to do that.

No matter what we decide, we cannot change how JSON defined their
dicts, and consequently whatever we do needs to be mapped into JSON,
python, $lang's data models somehow.

>Anyway, please do not use < or > web people have enough tag
>soup problems in html (that Will be used with http)

They are part of the serialization, like ',' and ';' and they would
not be visible in any context near HTML.

>If you're ready to invent binary representations it's way
> simple to specify utf8 as encoding than fall again on multiple
> encoding trap which instead of helping anyone means everyone is
> incompatible with everyone else in subtle way

Please elaborate, I have no idea what you are talking about here.

>Finaly , is hostile to everyone that writes numbers unlike the USA

We already use ',' as the field delimiter in HTTP headers, and we
should *never* have to take I18N/NLS into account to *parse* a
HTTP header.

I18N/NLS may be necessary to *interpret* the HTTP header, but it
should not be necessary to *parse* the HTTP header.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[hidden email]         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.


Re: If not JSON, what then ?

Poul-Henning Kamp
In reply to this post by Willy Tarreau-3
--------
In message <[hidden email]>, Willy Tarreau writes:

>> My take is that the data-model and serialization should be general
>> and unconstrained, and the constraints be applied in a/the schema
>> for each individual header.
>
>But we're talking about protocol efficiency as well, which requires
>taking into account what we have. We could for example consider the
>notion of "extended strings" which are only used for header fields
>which are not relevant to the protocol itself (eg: not used in
>accept/range/connection/...) and which would allow unicode to be
>safely transmitted. It might be used for user-agent if needed.

Unicode can already be safely transmitted as UTF-8, the problem is that
people don't know if it is UTF-8 or ISO8859.

The "\U" prefix/escape would solve that efficiently.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[hidden email]         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.


Re: If not JSON, what then ?

Cory Benfield
In reply to this post by Poul-Henning Kamp

> On 1 Aug 2016, at 11:50, Poul-Henning Kamp <[hidden email]> wrote:
>
> No matter what we decide, we cannot change how JSON defined their
> dicts, and consequently whatever we do needs to be mapped into JSON,
> python, $lang's data models somehow.

JSON, sure, but don’t let Python hold you back. All supported versions of Python have an OrderedDict in their standard library. And any Python tool dealing with HTTP has inevitably had to invent something like a CaseInsensitiveOrderedMultiDict in order to deal with HTTP headers, so any tool that’s likely to deal with this kind of thing is already swimming in dictionary representations that we can use for ordering fields in header values.

So just to clarify: the lack of ordering in a JSON object is a reasonable problem with using JSON, but that doesn’t mean we can’t use ordered representations in other serialisation formats. Programming languages have all the abstractions required to do this, and it’s just not that hard to write an Ordered Mapping in $LANG that wraps $LANG’s built-in Mapping type. (Hell, some Python interpreters have *all* dicts ordered, such that they define OrderedDict by simply writing “OrderedDict = dict”).
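
For illustration, a minimal sketch of the kind of structure I mean (not any particular library's API):

        class HeaderDict:
            """Case-insensitive, order-preserving multidict (sketch only)."""

            def __init__(self):
                self._items = []                       # list of (name, value)

            def add(self, name, value):
                self._items.append((name, value))      # keeps insertion order

            def getall(self, name):
                key = name.lower()
                return [v for n, v in self._items if n.lower() == key]

        h = HeaderDict()
        h.add("Accept", "audio/*; q=0.2")
        h.add("accept", "audio/basic")
        h.getall("ACCEPT")    # -> ['audio/*; q=0.2', 'audio/basic']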

Cory

Re: If not JSON, what then ?

James M Snell
In reply to this post by Poul-Henning Kamp
phk,

I'm very happy to see the discussion of efficient binary encoding of
HTTP headers coming back around. This is an area that I had explored
fairly extensively early in the process of designing HTTP/2 with the
"Binary-optimized Header Encoding" I-D's (see
https://tools.ietf.org/html/draft-snell-httpbis-bohe-13). While HPACK
won out with regards to being the header compression scheme used for
HTTP/2, there is still quite a bit in the BOHE drafts that could be
useful here.

- James

On Mon, Aug 1, 2016 at 12:43 AM, Poul-Henning Kamp <[hidden email]> wrote:

> [...]


Re: If not JSON, what then ?

Alcides Viamontes E-2
In reply to this post by Cory Benfield
Hi!


TL;DR: I also think that trying to fit HTTP headers in anything other than their current representation is a bad idea. But creating a semi-formal compilation of rules and behaviours for core HTTP headers would be worth it.

Long rant: 

Recently we revamped how ShimmerCat handles HTTP headers and we ended up creating a separate library with a "Headers Document Object Model". The bare minimum set of different headers we needed to understand and manipulate to offer basic functionality is 20, and for each of them we needed to take the following into account:


       * Representation round-trip: How header ASCII values are parsed to "things that the program can easily manipulate" (see next), and the other way around, how to convert back to ASCII values. This is slightly different for HTTP/1.1 and HTTP/2, because of connection-specific headers, the "Cookie: " header, and the rather non-trivial dance with "Host: " and ":authority".

        * What in-memory representation makes sense for the program: "Date: " should be a date, "Cookie: "  is  a dictionary, "Set-Cookie: " is a set indexed by cookie name,  path and perhaps other attributes (exercising the RFC with  Wordpress teaches you one or two things), "Forwarded: " is actually a list, "Link: " headers from the point of view of a server doing HTTP/2 Push are all different beasts each getting their own thing, and so on. This of course is very program specific and probably not generally interesting, but it is easier to talk about data structures instead of ASCII text when defining:

       * How header values combine: there shouldn't be more than one "Date: " in a given response, even if both a proxy server and an application may try to stamp a "Date: ". However, a server may "add" cookies to an application response, and the "Forwarded" header needs to be composed in a sequence. Similar decisions are needed with CORS headers, Link headers, Cache-Control, ETag and so on (a small sketch of such policies follows this list).

        * Headers are extensible, so one needs default policies for header values where there are no RFC dispositions. 
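
Here is a tiny sketch of the kind of combination policies I mean (the policies themselves are illustrative, not RFC text):

        # per-header merge policies; each takes (existing, incoming) value lists
        COMBINE = {
            "date":       lambda old, new: old,          # keep a single Date
            "set-cookie": lambda old, new: old + new,    # server may add cookies
            "forwarded":  lambda old, new: old + new,    # composed hop by hop
        }

        def combine(name, old_values, new_values):
            policy = COMBINE.get(name.lower(), lambda old, new: new)
            return policy(old_values, new_values)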

I would find it daunting to fit all the idiosyncrasies and behaviours of HTTP headers into a common bytes representation (serialisation) without some kind of updated compendium of what they do and how they behave. For example, it would be nice to have a doc similar to RFC 4229, with formalised candidate data structures and algorithms for how intermediaries in different roles should handle the headers. Furthermore, some HTTP headers are more important/common than others (is there anybody using the "From:" header?), or they are relevant to different roles, so maybe we need to group headers in some sensible way (so that we can say, e.g. "my CMS emits core http content headers" or "my server is caching-compliant because it interprets correctly http core caching headers" or "my server/application implements correctly the security measures implied by the core security HTTP headers", whatever any of these can be).


On Mon, Aug 1, 2016 at 2:30 PM, Cory Benfield <[hidden email]> wrote:

> On 1 Aug 2016, at 11:50, Poul-Henning Kamp <[hidden email]> wrote:
>
> No matter what we decide, we cannot change how JSON defined their
> dicts, and consequently whatever we do needs to be mapped into JSON,
> python, $lang's data models somehow.

JSON, sure, but don’t let Python hold you back. All supported versions of Python have an OrderedDict in their standard library. And any Python tool dealing with HTTP has inevitably had to invent something like a CaseInsensitiveOrderedMultiDict in order to deal with HTTP headers, so any tool that’s likely to deal with this kind of thing is already swimming in dictionary representations that we can use for ordering fields in header values.

So just to clarify: the lack of ordering in a JSON object is a reasonable problem with using JSON, but that doesn’t mean we can’t use ordered representations in other serialisation formats. Programming languages have all the abstractions required to do this, and it’s just not that hard to write an Ordered Mapping in $LANG that wraps $LANG’s built-in Mapping type. (Hell, some Python interpreters have *all* dicts ordered, such that they define OrderedDict by simply writing “OrderedDict = dict”).

Cory


./Alcides

Re: If not JSON, what then ?

Mark Nottingham-2
In reply to this post by Poul-Henning Kamp
Hey PHK,

Sorry for the delay, been in (and still remain in) transit hell (hi from FRA!).

Overall I like this.

A few thoughts come to mind:

1) Using the first character of the field-value as a signal that the encoding is in use is interesting. I was thinking of indicating it with a suffix on the header field name (e.g., Date-J). Either is viable, but I don't think it's a good idea to reuse existing header field names and rely on that signal to differentiate the value type; that seems like it would cause a lot of interop problems to me. Defining a new header field (whether it's Date-J or Date2 or whatever) seems much safer to me.

2) Regardless of #1, using < as your indicator character is going to collide with the existing syntax of the Link header.

3) I really, really wonder whether we need recursion beyond one level; e.g., I can see a list of dicts, or a dict of dicts, but beyond that seems like a lot of complexity to support. Fields like Accept that have complex structure turn out not to be implemented (qvalues are commonly ignored); having an ordered list would work much better (and defining new header fields as per #1 means we have an opportunity to do this!).

4) I agree with the sentiment that non-ascii strings in header field values are comparatively rare (since most headers are not intended for display), so while we should accommodate them, they shouldn't be the default.

5) I like the idea of 'implicit angle brackets' to retrofit some existing headers. Depending on the parse algorithm we define, we could potentially fit a fair number of existing headers into this, although deriving the specific data types of things like parameter arguments is going to be difficult (or maybe impossible). Needs some investigation before we know whether this would be viable.

Cheers,




> On 1 Aug 2016, at 9:43 AM, Poul-Henning Kamp <[hidden email]> wrote:
>
> Based on discussions in email and at the workshop in Stockholm,
> JSON doesn't seem like a good fit for HTTP headers.
>
> A number of inputs came up in Stockholm which informs the process,
> Marks earlier attempt to classify header syntax into groups and the
> desire the for a efficient binary encoding in HTTP[3-6] (or HPACK++)
>
> My personal intuition was that we should find a binary serialization
> (like CORS), and base64 it into HTTP1-2:  Ie: design for the future
> and shoe-horn into the present.  But no obvious binary serialization
> seems to exist, CORS was panned by a number of people in the WS as
> too complicated, and gag-reflexes were triggered by ASN.1.
>
> Inspired by Marks HTTP-header classification, I spent the train-trip
> back home to Denmark pondering the opposite attack:  Is there a
> common data structure which (many) existing headers would fit into,
> which could serve our needs going forward?
>
> This document chronicles my deliberations, and the strawman I came
> up with:  Not only does it seem possible, it has some very interesting
> possibilities down the road.
>
> Disclaimer:  ABNF may not be perfect.
>
> Structure of headers
> ====================
>
> I surveyed current headers, and a very large fraction of them
> fit into this data structure:
>
> header: ordered sequence of named dictionaries
>
> The "ordered" constraint arises in two ways:  We have explicitly
> ordered headers like {Content|Transfer}-Encoding and we have headers
> which have order by their q=%f parameters.
>
> If we unserialize this model from RFC723x definitions, then ',' is
> the list separator and ';' the dictionary indicator and separator:
>
>     Accept: audio/*; q=0.2, audio/basic
>
> The "ordered ... named" combination does not map directly to most
> contemporary object models (JSON, python, ...) where dictionary
> order is undefined, so a definition list is required to represent
> this in JSON:
>
> [
>    [ "audio/*", { "q": 0.2 }],
>    [ "audio/basic", {}]
> ]
>
> It looks tempting to find a way to make the toplevel JSON a dictionary
> too, but given the use of wildcards in many of the keys ("text/*"),
> and the q=%f ordering, that would not be helpful.
>
> Next we want to give people the ability to have deeper structure,
> and we can either do that recursively (ie: nested ordered seq of
> dict) or restrict the deeper levels to only dict.
>
> That is probably a matter of taste more than anything, but the
> recursive design will probably appeal aesthetically to more than
> just me, and as we shall see shortly, it comes with certain economies.
>
> So let us use '<...>' to mark the recursion, since <> are shorter than
> [] and {} in HPACK/huffman.
>
> Here is a two level example:
>
> foobar: foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar
>
> Parsed into JSON that would be:
>
> [
>    [
> "foo",
> {
>    "p1": 1,
>    "p4": {},
>    "p3": [
> [
>    "x1",
>    {}
> ],
> [
>    "x2",
>    {}
> ],
> [
>    "x3",
>    {
> "y2": 2
> "y1": 1,
>    }
> ]
>    ],
>    "p2": "abc"
>        }
>    ],
>    [
> "bar",
> {}
>    ]
> ]
>
> (NB shuffled dictionary elements to show that JSON dicts are unordered)
>
> And now comes the recursion economy:
>
> First we wrap the entire *new* header in <...>:
>
> foobar: <foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar>
>
> This way, the first character of the header tells us that this header
> has "common structure".
>
> That explicit "common structure" signal means privately defined
> headers can use "common structure" as well, and middleware and
> frameworks will automatically Do The Right Thing with them.
>
> Next, we add a field to the IANA HTTP header registry (one can do
> that I hope ?) classifying their "angle-bracket status":
>
> A) not angle-brackets -- incompatible structure use topical parser
> Range
>
> B) implicit angle-brackets -- Has common structure but is not <> enclosed
> Accept
> Content-Encoding
> Transfer-Encoding
>
> C) explicit angle-brackets -- Has common structure and <> encloosed
> all new headers go here
>
> D) unknown status.
> As it says on the tin.
>
> Using this as whitelist, and given suitable schemas, a good number
> of existing headers can go into the common parser.
>
> And then for the final trick:   We can now define new variants of
> existing headers which "sidegrade" them into the common parser:
>
> Date: < 1469734833 >
>
> This obviously needs a signal/negotiation so we know the other side
> can grok them (HTTP2: SETTINGS, HTTP1: TE?)
>
> Next:
>
> Data Types
> ==========
>
> I think we need these fundamental data types, and subtypes:
>
> 1)   Unicode strings
>
> 2) ascii-string (maybe)
>
> 3) binary blob
>
> 4)   Token
>
> 5)   Qualified-token
>
> 6)   Number
>
> 7)      integer
>
> 8)   Timestamp
>
> In addition to these subtypes, schemas can constrain types
> further, for instance integer ranges, string lengths etc.
> more on this below.
>
> I will talk about each type in turn, but it goes without saying
> that we need to fit them all into RFC723x, in a way that is not
> going to break anything important and HPACK should not hate
> them either.
>
> In HTTP3+, they should be serialized intelligently, but that
> should be trivial and I will not cover that here.
>
> 1) Unicode string
> -----------------
>
> The first question is do we mean "unrestricted unicode" or do
> we want to try to sanitize it.
>
> An example of sanitation is RFC7230's "quoted-string" which bans
> control characters except forward horizontal white-space (=TAB).
>
> Another is I-JSON (RFC7493)'s:
>
>   MUST NOT include code points that identify Surrogates or
>   Noncharacters as defined by UNICODE.
>
> As far as I can tell, that means that you have to keep a full UNICODE
> table handy at all times, and update it whenever additions are made
> to unicode.  Not cool IMO.
>
> Imposing a RFC7230 like restriction on unicode gets totally
> roccoco:  What does "forward horizontal white-space" mean on
> a line which used both left-to-right and right-to-left alphabets ?
> What does it mean in alphabets which write vertically ?
>
> Let us absolve the parser from such intimate unicode scholarship
> and simply say that the data type "unicode string" is what it says,
> and use the schemas to sanitize its individual use.
>
> Encoding unicode strings in HTTP1+2 requires new syntax and
> for any number of reasons, I would like to minimize that
> and {re-|ab-}use quoted-strings.
>
> RFC7230 does not specify what %80-%FF means in quoted-string, but
> hints that it might be ISO8859.
>
> Now we want it to become UTF-8.
>
> My proposal at the workshop, to make the first three characters
> inside the quotes a UTF-8 BOM is quite pessimal in HPACK's huffman
> encoding:  It takes 68 bits.
>
> Encoding the BOM as '\ufeff' helps but still takes an unreasonable
> 48 bits in HPACK/huffman.
>
> In both H1 and H2 defining a new "\U" escape seems better.
>
> Since we want to carry unrestricted unicode, we also need escapes
> to put the <%20 codepoints back in.  I suggest "\u%%%%" like JSON.
>
> (We should not restict which codepoints may/should use \u%%%% until
> we have studied if \u%%%% may HPACK/huffman better than "raw" UTF-8
> in asian codepages.)
>
> The heuristic for parsing a quoted-string then becomes:
>
> 1) If the quoted-string first two characters are "\U"
> -> UTF-8
>
> 2)  If the quoted-string contains "\u%%%%" escape anywhere
> -> UTF-8
>
> 3)  If the quoted-string contains only %09-%7E
> -> UTF-8 (actually: ASCII)
>
> 4)  If the quoted-string contains any %7F-%8F
> -> UTF-8
>
> 5)  If header definition explicitly says ISO-8859
> -> ISO8859
>
> 6)  else
> -> UTF-8
>
> 2) Ascii strings
> ----------------
>
> I'm not sure if we need these or if they are even a good idea.
>
> The "pro" argument is if we insist they are also english text
> so we have something the entire world stands a chance to understand.
>
> The "contra" arguement is that some people will be upset about that.
>
> If we want them, they're quoted-strings from RFC723x without %7F-%FF.
>
> It is probably better the schema them from unicode strings.
>
> 3) Binary blobs
> ---------------
>
> Fitting binary blobs from crypto into RFC7230 should squeeze into
> quoted-string as well, since we cannot put any kinds of markers or
> escapes on tokens without breaking things.
>
> Proposal:
>
> Quoted-string with "\#" as first two chars indicates base64
> encoded binary blob.
>
> I chose "\#" because "#" is not in the base64 set, so if some
> nonconforming implementation eliminates the "unnecessary escape"
> it will be clearly visible (and likely recoverable) rather than
> munge up the content of the base64.
>
> Base64 is chosen because it is the densest well known encoding which
> works well with HPACK/huffman:  The b64 characters on average emit
> 6.46 bits.
>
> I have no idea how these blobs would look when parsed into JSON,
> probably as base64 ?  But in languages which can, they should
> probably become native byte-strings.
>
> 4) Token
> --------
>
> As we know it from RFC7230:
>
>   tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /
>    "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
>   token = 1*tchar
>
> 5) Qualified Token
> ------------------
>
>   qualified_token = token 0*1("/" token)
>
> All keys in all dictionaries are of this type.  (In JSON/python...
> the keys are strings)
>
> Schemas can restrict this further.
>
> 6 Numbers
> ---------
>
> These are signed decimal numbers which may have a fraction
>
> In HTTP1+2 we want them always on "%f" format and we want them to
> fit in IEEE754 64 bit floating point, which lead to the following
> definition:
>
> 0*1"-" DIGIT 0*nDIGIT 0*1("." 0*mDIGIT ) n+m < 15
>
> (15 digits fit in IEEE754 64 binary floating point.)
>
> These numbers can (also) be used for millisecond resolution absolute
> UNIX-epoch relative timestamps for all forseeable future.
>
> 7) Integers
> -----------
>
> 0*1"-" 1*15 DIGIT
>
> Same restriction as above to fit into IEEE 754.
>
> Range can & should be restricted by schemas as necessary.
>
> 8 Timestamps
> ------------
>
> I propose we do these as subtype of Numbers, as UNIX-epoch relative
> time.  That is somewhat human-hostile and is leap-second-challenged.
>
> If you know from the schema that a timestamp is coming, the parser
> can easily tell the difference between a RFC7231 IMF-fixdate or a
> Number-Date.
>
> Without guidance from a schema it becomes inefficient to determine
> if it is an IMF-fixdate, since the week day part looks like a token,
> but it is not impossible.
>
>
> Schemas
> =======
>
> There needs a "ABNF"-parallel to specify what is mandatory and
> allowed for these headers in "common structure".
>
> Ideally this should be in machine-readable format, so that
> validation tools and parser-code can be produced without
> (too much) human intervation.  I'm tempted to say we should
> make the schemas JSON, but then we need to write JSON schemas
> for our schemas :-/
>
> Since schemas basically restict what you are allowed to
> express, we need to examine and think about what restrictions
> we want to be able to impose, before we design the schema.
>
> This is the least thought-about part of this document, since
> the train is now in Lund (a rough example schema is sketched
> after the lists below):
>
> Unicode strings:
> ----------------
>
> * Limit by (UTF-8) encoded length.
> Ie: a resource restriction, not a typographical restriction.
>
> * Limit by codepoints
> Example: Allow only "0-9" and "a-f"
> The specification of codepoints should be a list of codepoint
> ranges.  (ASCII strings could be defined this way.)
>
> * Limit by allowed strings
> ie: Allow only "North", "South", "East" and "West"
>
> Tokens
> ------
>
> * Limit by codepoints
> Example: Allow only "A-Z"
>
> * Limit by length
> Example: Max 7 characters
>
> * Limit by pattern
> Example: "A-Z" "a-z" "-" "0-9" "0-9"
> (use ABNF to specify ?)
>
> * Limit by well known set
> Example: Token must be ISO3166-1 country code
> Example: Token must be in IANA FooBar registry
>
> Qualified Tokens
> ----------------
>
> * Limit each of the two component tokens as above.
>
> Binary Blob
> -----------
>
> * Limit by length in bytes
> Example: 128 bytes
> Example: 16-64 or 80 bytes
>
> Number
> ------
>
> * Limit resolution
> Example: exactly 3 decimal digits
>
> * Limit range
> Example: [2.716 ... 3.1415]
>
> Integer
> -------
>
> * Limit range
> Example [0 ... 65535]
>
> Timestamp
> ---------
>
> (I can't think of usable restrictions here)
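>
> To make the kinds of restrictions above concrete, here is a purely
> hypothetical schema for an invented "Fruit-Preference" header,
> written as a Python/JSON-style dict (every field name in it is made
> up for this example):
>
>   # Fruit-Preference: apple;q=0.9;origin="Fyn", banana;q=0.2
>   FRUIT_PREFERENCE_SCHEMA = {
>       "element": {
>           "type": "token",
>           "well_known_set": "iana-fruit-registry",   # invented registry
>       },
>       "parameters": {
>           "q":      {"type": "number", "range": [0.0, 1.0], "resolution": 3},
>           "origin": {"type": "unicode", "max_encoded_length": 64},
>       },
>       "ordered_by": "q",
>   }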
>
>
> Aaand... I'm in Copenhagen...
>
> Let me know if any of this looks usable...
>
> --
> Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
> [hidden email]         | TCP/IP since RFC 956
> FreeBSD committer       | BSD since 4.3-tahoe
> Never attribute to malice what can adequately be explained by incompetence.
>

--
Mark Nottingham   https://www.mnot.net/






Re: If not JSON, what then ?

Willy Tarreau-3
Hi Mark,

On Tue, Aug 02, 2016 at 01:33:39PM +0200, Mark Nottingham wrote:
> 1) Using the first character of the field-value as a signal that the encoding
> is in use is interesting. I was thinking of indicating it with a suffix on
> the header field name (e.g., Date-J). Either is viable, but I don't think
> it's a good idea to reuse existing header field names and rely on that signal
> to differentiate the value type; that seems like it would cause a lot of
> interop problems to me. Defining a new header field (whether it's Date-J or
> Date2 or whatever) seems much safer to me.

I had the same feeling initially but I changed my mind. I fear that having two
header fields will result in inconsistencies between the two (possibly
intentional, when that may be used to benefit an attacker). We'd rather
avoid reproducing the Proxy-Connection vs Connection mess we've been seeing
for a decade, where both were sent "just in case".

However, if we enumerate certain header fields that would deserve being
encoded differently and find a way to group them, we may think about
sending a composite, compact header field for transport/routing, and another
one for the entity, where the available information is grouped when relevant.
Then maybe it could be decided that when one agent consumes such a field,
before passing the message on it must delete occurrences of the other ones,
and/or rebuild them from the composite one, in order to avoid inconsistency
issues.

We have more or less this already with Transfer-Encoding, which voids
Content-Length, and with the Host header field, which must always match the
authority part of the URI if present.

These are just thoughts, maybe they are stupid.

Willy


Re: If not JSON, what then ?

Poul-Henning Kamp
In reply to this post by Mark Nottingham-2
--------
In message <[hidden email]>, Mark Nottingham writes:

>A few thoughts come to mind:
>
>1) Using the first character of the field-value as a signal that the
>encoding is in use is interesting. I was thinking of indicating it with
>a suffix on the header field name (e.g., Date-J).

Yeah, that could work too, but I suspect it would be more cumbersome
to implement, and it creates a new class of mistakes which need to
be detected  - "Both Date and Date-J ??"
 
>2) Regardless of #1, using < as your indicator character is going to
>collide with the existing syntax of the Link header.

If Link is "<> blacklisted" in the IANA registry, that wouldn't be a
problem, and all currently defined headers will need to be checked
against some kind of white/black list, if we want them to use the
new "common structure".

I picked <> because they were a cheap balanced pair in HPACK/huffman
and I only found Link that might cause a false positive.

Strictly speaking, it doesn't have to be a balanced pair, it could
even be control characters, but HPACK/huffman punishes those hard.

I didn't dare pick () even though it had even shorter HPACK/huffman.

Thinking about it now, I can't recall any headers starting with a '('
so () might be better than <> and thus avoid the special case of Link.

>3) I really, really wonder whether we need recursion beyond one level;

As do I.

However, if it is recursion, the implementation cost is very low,
and I would prefer to "deliver tools, not policy" and let people
recurse until they hurt if they want.

In particular, I do not want to impose complexity limits on private
headers, based on the simplicity of public headers, because my
experience is that private headers are more complex.

I would prefer a simple, general model, restricted by machine
readable schemas, rather than a complex model with built in
limitations.
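
To give a feel for how small that cost is, here is a rough,
non-normative Python sketch of a recursive parse of the strawman
syntax (no quoting, escaping or error handling; all names are mine):

    def parse_list(s, i=0, depth=0):
        # Parse "elem;k=v;k2=<...>, elem2" into [[name, {params}], ...]
        items = []
        while i < len(s):
            name, i = parse_atom(s, i)
            params = {}
            while i < len(s) and s[i] == ";":
                key, i = parse_atom(s, i + 1)
                val = {}
                if i < len(s) and s[i] == "=":
                    if s[i + 1] == "<":              # recurse into <...>
                        val, i = parse_list(s, i + 2, depth + 1)
                    else:
                        val, i = parse_atom(s, i + 1)
                params[key] = val
            items.append([name, params])
            if i < len(s) and s[i] == ",":
                i += 1
            elif depth and i < len(s) and s[i] == ">":
                return items, i + 1
            else:
                break
        return (items, i) if depth else items

    def parse_atom(s, i):
        j = i
        while j < len(s) and s[j] not in ";,=<>":
            j += 1
        return s[i:j].strip(), j

    # parse_list("foo;p1=1;p3=<x1,x2;y1=1>;p4, bar") ->
    #   [["foo", {"p1": "1",
    #             "p3": [["x1", {}], ["x2", {"y1": "1"}]],
    #             "p4": {}}],
    #    ["bar", {}]]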

>4) I agree with the sentiment that non-ascii strings in header field
>values are comparatively rare (since most headers are not intended for
>display), so while we should accommodate them, they shouldn't be the
>default.

That was the idea behind "\U": make people explicitly tag UTF-8.

>5) I like the idea of 'implicit angle brackets' to retrofit some
>existing headers. Depending on the parse algorithm we define, we could
>potentially fit a fair number of existing headers into this, although
>deriving the specific data types of things like parameter arguments is
>going to be difficult (or maybe impossible). Needs some investigation
>before we know whether this would be viable.

Schemas!  Have I already mentioned how smart I think schemas you can
build code from would be? :-)

PS: I had expected you to ask if I was trying to sabotage your Key header :-)

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[hidden email]         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.


Re: If not JSON, what then ?

Poul-Henning Kamp
In reply to this post by Willy Tarreau-3
--------
In message <[hidden email]>, Willy Tarreau writes:

>However if we enumerate certain header fields that would deserve being
>encoded differently and find a way to group them, we may think about
>sending a composite, compact header field for transport/routing, another
>one for the entity where available information are grouped when relevant.

>These are just thoughts, maybe they are stupid.

Nope, that's actually a very interesting idea.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[hidden email]         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.


Re: If not JSON, what then ?

Stefan Eissing
Please never, ever introduce "Date" and "Date-J" or something like that.

> On 02.08.2016 at 14:05, Poul-Henning Kamp <[hidden email]> wrote:
>
> --------
> In message <[hidden email]>, Willy Tarreau writes:
>
>> However if we enumerate certain header fields that would deserve being
>> encoded differently and find a way to group them, we may think about
>> sending a composite, compact header field for transport/routing, another
>> one for the entity where available information are grouped when relevant.
>
>> These are just thoughts, maybe they are stupid.
>
> Nope, that's actually a very interesting idea.
>
> --
> Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
> [hidden email]         | TCP/IP since RFC 956
> FreeBSD committer       | BSD since 4.3-tahoe    
> Never attribute to malice what can adequately be explained by incompetence.
>



Re: If not JSON, what then ?

Poul-Henning Kamp
--------
In message <[hidden email]>, Stefan Eissing writes:

>Please never, ever introduce "Date" and "Date-J" or something like that.

That was exactly _not_ what Willy suggested :-)

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[hidden email]         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.


Re: If not JSON, what then ?

Stefan Eissing
Just stamping it as a thought-crime in advance...

> On 02.08.2016 at 14:38, Poul-Henning Kamp <[hidden email]> wrote:
>
> --------
> In message <[hidden email]>, Stefan Eissing writes:
>
>> Please never, ever introduce "Date" and "Date-J" or something like that.
>
> That was exactly _not_ what Willy suggested :-)
>
> --
> Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
> [hidden email]         | TCP/IP since RFC 956
> FreeBSD committer       | BSD since 4.3-tahoe    
> Never attribute to malice what can adequately be explained by incompetence.



Re: If not JSON, what then ?

Nicolas Mailhot
In reply to this post by Poul-Henning Kamp


----- Original Message -----
From: "Poul-Henning Kamp"

>> If you're ready to invent binary representations it's way simpler
>> to specify utf8 as the encoding than to fall again into the multiple-
>> encoding trap, which instead of helping anyone means everyone is
>> incompatible with everyone else in subtle ways

>Please elaborate, I have no idea what you are talking about here.

I thoroughly dislike all the encoding dance and escaping, one time it's UTF-8, another time not.

It's too complex; people won't read it, they'll just do their usual mess and assume it will work (just like they post garbage nowadays and add ISO-8859-1 boilerplate without checking anything). Encoding hints do not help: instead of a wrong-encoding failure you get combos of a bad encoding hint plus whatever other failures.

Please use a simpler rule like:
 1. Everything not explicitly encoded is UTF-8, with no escaping allowed (or escaping restricted to a shortlist, not generic escaping that people will apply to all codepoints several times over just in case; no BOM, no pseudo-BOM; if they write nonsensical UTF-8 it's *their* problem, not the transport's problem, as long as they respect basic UTF-8 rules)
 2. A UTF-8 header that does not validate in whatever version of the next hop's unicode engine aborts the frame (if you don't want to learn UTF-8, restrict yourself to the basic latin block in your headers; it's valid UTF-8 and no more complex than ASCII)
 3. Anything that does not fit there must use binary with an explicit binary flag

That is simple enough that people will remember it; they will curse you but apply it, instead of feel-good rules that everyone follows approximately, with an explosion of special cases to handle all the approximation results.
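
As a rough sketch of rule 2 (nothing normative, the function name is
invented):

    def check_header_utf8(raw):
        # Rule 2: a header that does not validate as UTF-8 aborts the frame.
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            raise ValueError("invalid UTF-8 in header field, abort the frame")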

Regards,

--
Nicolas Mailhot
