If not JSON, what then ?

Re: If not JSON, what then ?

Kari Hurtta
Willy Tarreau <[hidden email]>: (Wed Aug  3 09:46:33 2016)

> On Wed, Aug 03, 2016 at 09:37:30AM +0300, Kari hurtta wrote:
> > | 2) Regardless of #1, using < as your indicator character is going to collide with the existing syntax of the
> > | Link header.
> >
> > Or perhaps use ':' as indicator? Causes double '::' on HTTP/1
> >
> > Date::1470205476
> >
> > Is this viable ?
>
> It could but strictly speaking it will not be "::", it would just be ":"
> to start the value, because your field above parses as ":1470205476" for
> the value and will be rewritten like this along the path by many
> implementations :
>
>     Date: :1470205476


Yes, that is true.  

Another possibility is to use an indicator from the ASCII control
block, for example byte 01. It does not collide with existing use.

That is
Date:1470205476
on HTTP/1 (but probably not visible; the first character
after ':' is byte 01 (ctrl-A)).

( I think that someone suggested that already. )

( byte 01 seems to be valid TEXT on ABNF ? )


> Willy

/ Kari Hurtta


Re: If not JSON, what then ?

Willy Tarreau-3
On Wed, Aug 03, 2016 at 12:37:51PM +0300, Kari hurtta wrote:

> Willy Tarreau <[hidden email]>: (Wed Aug  3 09:46:33 2016)
> > On Wed, Aug 03, 2016 at 09:37:30AM +0300, Kari hurtta wrote:
> > > | 2) Regardless of #1, using < as your indicator character is going to collide with the existing syntax of the
> > > | Link header.
> > >
> > > Or perhaps use ':' as indicator? Causes double '::' on HTTP/1
> > >
> > > Date::1470205476
> > >
> > > Is this viable ?
> >
> > It could but strictly speaking it will not be "::", it would just be ":"
> > to start the value, because your field above parses as ":1470205476" for
> > the value and will be rewritten like this along the path by many
> > implementations :
> >
> >     Date: :1470205476
>
>
> Yes, that is true.  
>
> Another possibility is to use an indicator from the ASCII control
> block, for example byte 01. It does not collide with existing use.
>
> That is
> Date:1470205476
> on HTTP/1 (but probably not visible; first character
> after ':' is byte 01 (ctrl-A)).

It doesn't change anything. I have no problem with either proposal,
it's just that what you need to understand is that there is no such
"after ':'". The ":" is not part of the header field, it's the field
name delimiter. So before ':' you have the header field name. After
it you can have any amount of spaces (including zero) which are *not*
part of the value, and the first non-space character starts the value.

Thus, the following are exactly equivalent, though the last one is
deprecated :

    Date:1470205476
    Date: 1470205476
    Date:      1470205476
    Date:
      1470205476
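
A minimal Python sketch of the rule above, assuming a single header field
is handed over as one string. The function name `split_header_field` is
made up for illustration; it only shows the delimiter and whitespace
handling, not a full HTTP/1 parser.

```python
import re


def split_header_field(raw: str) -> tuple[str, str]:
    """Split one HTTP/1 header field into (name, value).

    The ':' only delimits the field name. Whitespace around the value is
    not part of the value, and a deprecated folded continuation line is
    unfolded before splitting.
    """
    # Unfold: a line break followed by spaces or tabs continues the line.
    unfolded = re.sub(r"\r?\n[ \t]+", " ", raw)
    name, sep, value = unfolded.partition(":")
    if not sep:
        raise ValueError("no ':' delimiter in header field")
    return name, value.strip(" \t")
```

With this sketch all four spellings above yield the same pair, and
"Date::1470205476" yields the value ":1470205476".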

> ( I think that someone suggested that already. )
>
> ( byte 01 seems to be valid TEXT on ABNF ? )

Such bytes are rare and will have a large huffman encoding in H2. Martin's
suggestion of '><' could be more efficient, though I haven't checked.

Regards,
Willy


Re: If not JSON, what then ?

Martin J. Dürst
In reply to this post by Poul-Henning Kamp
On 2016/08/03 05:24, Poul-Henning Kamp wrote:

> --------
> In message <[hidden email]>, Mark Nottingham writes:
>
>> If containers are only allowed to contain simple types, the need for a
>> schema language diminishes quite a bit; headers can be defined pretty
>> easily in prose, perhaps with references to registries where
>> appropriate.
>
> It is not significantly harder to specify recursive structures than
> flat structures, but of course the work to do so will make many
> people want not to.

Also, I'd be afraid of the first time there is a real use case that is
recursive or has more levels than planned for; the separate syntax and
implementation will be ugly, or people might just go recursive the
obvious way but implementations will vary on how they take it.

Regards,   Martin.


Re: If not JSON, what then ?

Martin Thomson-3
In reply to this post by Willy Tarreau-3
On 3 August 2016 at 19:45, Willy Tarreau <[hidden email]> wrote:
> Such bytes are rare and will have a large huffman encoding in H2. Martin's
> suggestion of '><' could be more efficient, though I haven't checked.

I tend to think that we should not let hpack drive this.  We should
maybe avoid Huffman encoding and then throw octets into the value
field.  Starting with a > or : or other octet is still probably useful
and might even be necessary.  In HTTP/1.1 we can use base64(url) as a
reasonable space/speed trade-off, again with the same demarc octet.
But we'd be defining a binary encoding.

Binary avoids the nastiness with character encoding (just use UTF-8),
makes numbers and dates much more numbery, and lets us tailor the
other types to our needs.

I think that PHK is perfectly right in recognizing that we don't have
complex needs.  I actually think that this is good.  Limitations are
empowering.

I said this privately to someone at the workshop, but my realization
was that we currently have schema-aware parsing with extremely limited
points of extension.  A revised system that supports that doesn't need
to be very complex.  Even a single level map of string key to
(optional) string value is more extensibility than we can sensibly
defend.


Re: If not JSON, what then ?

Poul-Henning Kamp
--------
In message <CABkgnnWqHTinXDNXxM7Lw9SBGCCPb-j6BgKF=[hidden email]>, Martin Thomson writes:
>On 3 August 2016 at 19:45, Willy Tarreau <[hidden email]> wrote:

>> Such bytes are rare and will have a large huffman encoding in H2. Martin's
>> suggestion of '><' could be more efficient, though I haven't checked.
>
>I tend to think that we should not let hpack drive this.

Agreed to that, but if we can easily be huffman-friendly, we should.

>But we'd be defining a binary encoding.
>
>Binary avoids the nastiness with character encoding (just use UTF-8),
>makes numbers and dates much more numbery, and lets us tailor the
>other types to our needs.

What I'm trying to do here and now, is a data model and HTTP/1
serialization which by design overlaps as many existing headers
defined syntax as possible, to minimize the number of parsers
required now and in the future.

By leeching on existing H1 syntax we avoid the b64 hack and we don't
need to touch HPACK before this can go live.

For H3+'s HPACK-ism, we may decide to serialize these structured
headers as "native binary" rather than "text-compression", but until
we know more about H3, that is a premature decision.

The downside of this approach is that the RFC723x's "quoted string"
also carries "utf8-string" and "binary-blob" in this new syntax,
signposted by two new, backwards compatible escape sequences.

>I think that PHK is perfectly right in recognizing that we don't have
>complex needs.  I actually think that this is good.  Limitations are
>empowering.

Indeed.

>I said this privately to someone at the workshop, but my realization
>was that we currently have schema-aware parsing with extremely limited
>points of extension.  A revised system that supports that doesn't need
>to be very complex.  Even a single level map of string key to
>(optional) string value is more extensibility than we can sensibly
>defend.

As with HPACK/huffman, I don't see the need to heuristically restrict
the data model, I would rather precisely restrict its use for
individual headers.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
[hidden email]         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.


Re: If not JSON, what then ?

Mark Nottingham-2

> On 3 Aug 2016, at 3:07 PM, Poul-Henning Kamp <[hidden email]> wrote:
>
>> I said this privately to someone at the workshop, but my realization
>> was that we currently have schema-aware parsing with extremely limited
>> points of extension.  A revised system that supports that doesn't need
>> to be very complex.  Even a single level map of string key to
>> (optional) string value is more extensibility than we can sensibly
>> defend.
>
> As with HPACK/huffman, I don't see the need to heuristically restrict
> the data model, I would rather precisely restrict its use for
> individual headers.

Keep in mind that we're not going to require new headers to use this thing; if their needs aren't met, they can do something else.

Looking at recently minted headers, I'm still sharing Martin's inclination to keep it simple and flat.


--
Mark Nottingham   https://www.mnot.net/






Re: If not JSON, what then ?

Poul-Henning Kamp
--------
In message <[hidden email]>, Mark Nottingham writes:

>Keep in mind that we're not going to require new headers to use this
>thing; if their needs aren't met, they can do something else.
>
>Looking at recently minted headers, I'm still sharing Martin's
>inclination to keep it simple and flat.

I fully agree, big fan of KISS here.

I just prefer the limitations be located in the per-header schema,
rather than in the data format itself.

Either way, as long as we don't paint ourselves into a corner which
makes it impossible to have more levels later, I'm fine with only
"releasing" one level initially.



Re: If not JSON, what then ?

Martin Thomson-3
In reply to this post by Poul-Henning Kamp
On 3 August 2016 at 15:07, Poul-Henning Kamp <[hidden email]> wrote:
>
> What I'm trying to do here and now, is a data model and HTTP/1
> serialization which by design overlaps as many existing headers
> defined syntax as possible, to minimize the number of parsers
> required now and in the future.

I'm skeptical that you will be able to do that without sacrificing
something.  And if the point of the exercise is to define the one true
format (or three or some small number) that is used hereafter, then I
don't see much inherent value in minimizing the distance between the
old thing and the new thing.  I'd rather sacrifice similarity than
lose (for example) decoding efficiency, or UTF-8, or any of the many
things that have been dreamed up.


Re: If not JSON, what then ?

Poul-Henning Kamp
--------
In message <[hidden email]>, Martin Thomson writes:
>On 3 August 2016 at 15:07, Poul-Henning Kamp <[hidden email]> wrote:
>>
>> What I'm trying to do here and now, is a data model and HTTP/1
>> serialization which by design overlaps as many existing headers
>> defined syntax as possible, to minimize the number of parsers
>> required now and in the future.
>
>I'm skeptical that you will be able to do that without sacrificing
>something.

There are always tradeoffs.

For instance the HTTP/1 serialization should not be a security risk
if it pops up in a traditional application which does naïve
string processing on HTTP headers.

It would also be nice if HTTP/1 can still be debugged by eye, without
needing b64 decoding, but that is much less important than not
blowing up all PHP or JS programs.

>And if the point of the exercise is to define the one true
>format (or three or some small number) that is used hereafter,

It doesn't have to be perfect, I'll be happy if we can come up with
a common structure which is so usable, that the exceptions become
rare and well considered.

>I don't see much inherent value in minimizing the distance between the
>old thing and the new thing.

Like the HPACK/huffman thing:  I fully agree.  But if the minimal
distance does not hurt us going forward, we would be silly to
increase it needlessly.

And please remember:  I deliberately approached this from the far
opposite side relative to adopting JSON, in order to get another data
point on the size of the solution space.

I liked the outcome, it was surprisingly much better than I had
expected, but I would be very surprised if there isn't a better
solution once we ponder it more.



Re: If not JSON, what then ?

Sam Johnston-4
In reply to this post by Poul-Henning Kamp
Shame to have missed this discussion as it starts to look like one we had a few years ago around using headers directly rather than trying to embed another envelope format in them (ala SOAP in the body):


I actually wrote drafts for "Category" and "Attribute" headers at the time; the latter is yet to see the light of day but the former is here: https://tools.ietf.org/html/draft-johnston-http-category-header-02

My view is that you should be able to have e.g. a photo with headers containing attributes like title & summary, categories like landscape, and links to e.g. author (which Mark has already standardised in RFC5988 Web Linking).

We got caught up with things like unicode, client library support, etc. at the time, but I expect some of these things have been resolved in the interim.

Sam


On Mon, Aug 1, 2016 at 8:43 AM, Poul-Henning Kamp <[hidden email]> wrote:
Based on discussions in email and at the workshop in Stockholm,
JSON doesn't seem like a good fit for HTTP headers.

A number of inputs came up in Stockholm which inform the process:
Mark's earlier attempt to classify header syntax into groups and the
desire for an efficient binary encoding in HTTP[3-6] (or HPACK++).

My personal intuition was that we should find a binary serialization
(like CBOR), and base64 it into HTTP1-2:  Ie: design for the future
and shoe-horn into the present.  But no obvious binary serialization
seems to exist, CBOR was panned by a number of people in the WS as
too complicated, and gag-reflexes were triggered by ASN.1.

Inspired by Mark's HTTP-header classification, I spent the train-trip
back home to Denmark pondering the opposite attack:  Is there a
common data structure which (many) existing headers would fit into,
which could serve our needs going forward?

This document chronicles my deliberations, and the strawman I came
up with:  Not only does it seem possible, it has some very interesting
possibilities down the road.

Disclaimer:  ABNF may not be perfect.

Structure of headers
====================

I surveyed current headers, and a very large fraction of them
fit into this data structure:

        header: ordered sequence of named dictionaries

The "ordered" constraint arises in two ways:  We have explicitly
ordered headers like {Content|Transfer}-Encoding and we have headers
which have order by their q=%f parameters.

If we unserialize this model from RFC723x definitions, then ',' is
the list separator and ';' the dictionary indicator and separator:

     Accept: audio/*; q=0.2, audio/basic

The "ordered ... named" combination does not map directly to most
contemporary object models (JSON, python, ...) where dictionary
order is undefined, so a definition list is required to represent
this in JSON:

        [
            [ "audio/*", { "q": 0.2 }],
            [ "audio/basic", {}]
        ]
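
As a rough illustration of that mapping, here is a small Python sketch
that turns a flat RFC723x-style value into the list-of-pairs shape shown
above. The names `parse_flat` and `_convert` are invented for this
example, and the naive split on ',' and ';' assumes no quoted-strings
are present, so this is not a complete parser.

```python
import re

_NUMBER = re.compile(r"-?[0-9]+(\.[0-9]*)?")


def _convert(text: str):
    """Turn a parameter value into int, float or str (illustrative only)."""
    if _NUMBER.fullmatch(text):
        return float(text) if "." in text else int(text)
    return text


def parse_flat(value: str) -> list:
    """Parse 'a; k=v, b' into [[name, {param: value}], ...].

    A Python list keeps the order of the elements, which is what the
    "ordered sequence" part of the model needs.
    """
    result = []
    for element in value.split(","):
        name, *params = [part.strip() for part in element.split(";")]
        attrs = {}
        for param in params:
            key, sep, val = param.partition("=")
            # A parameter without '=' is recorded as an empty dictionary.
            attrs[key.strip()] = _convert(val.strip()) if sep else {}
        result.append([name, attrs])
    return result
```

Applied to the Accept example, `parse_flat("audio/*; q=0.2, audio/basic")`
gives the same structure as the JSON above.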

It looks tempting to find a way to make the toplevel JSON a dictionary
too, but given the use of wildcards in many of the keys ("text/*"),
and the q=%f ordering, that would not be helpful.

Next we want to give people the ability to have deeper structure,
and we can either do that recursively (ie: nested ordered seq of
dict) or restrict the deeper levels to only dict.

That is probably a matter of taste more than anything, but the
recursive design will probably appeal aesthetically to more than
just me, and as we shall see shortly, it comes with certain economies.

So let us use '<...>' to mark the recursion, since <> are shorter than
[] and {} in HPACK/huffman.

Here is a two level example:

        foobar: foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar

Parsed into JSON that would be:

        [
            [
                "foo",
                {
                    "p1": 1,
                    "p4": {},
                    "p3": [
                        [
                            "x1",
                            {}
                        ],
                        [
                            "x2",
                            {}
                        ],
                        [
                            "x3",
                            {
                                "y2": 2,
                                "y1": 1
                            }
                        ]
                    ],
                    "p2": "abc"
                }
            ],
            [
                "bar",
                {}
            ]
        ]

(NB shuffled dictionary elements to show that JSON dicts are unordered)
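
To make the recursion concrete, here is a hedged Python sketch of a
recursive-descent parser for the '<...>' syntax. All names (`_Parser`,
`parse_common`) are hypothetical, quoted-strings are not handled, and
the scalar rule is a simplification rather than the token grammar.

```python
import re

_SCALAR = re.compile(r"[^,;=<>\s]+")
_NUMBER = re.compile(r"-?[0-9]+(\.[0-9]*)?")


class _Parser:
    def __init__(self, text: str):
        self.text = text
        self.pos = 0

    def _peek(self) -> str:
        """Skip blanks and return the next character ('' at end of input)."""
        while self.pos < len(self.text) and self.text[self.pos] in " \t":
            self.pos += 1
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def _scalar(self) -> str:
        self._peek()
        match = _SCALAR.match(self.text, self.pos)
        if match is None:
            raise ValueError(f"expected a name or value at offset {self.pos}")
        self.pos = match.end()
        return match.group()

    def parse_list(self) -> list:
        """list = element *( ',' element )"""
        items = [self._element()]
        while self._peek() == ",":
            self.pos += 1
            items.append(self._element())
        return items

    def _element(self) -> list:
        """element = name *( ';' key [ '=' value ] )"""
        name, attrs = self._scalar(), {}
        while self._peek() == ";":
            self.pos += 1
            key = self._scalar()
            if self._peek() == "=":
                self.pos += 1
                attrs[key] = self._value()
            else:
                attrs[key] = {}
        return [name, attrs]

    def _value(self):
        """value = '<' list '>' / scalar"""
        if self._peek() == "<":
            self.pos += 1
            inner = self.parse_list()  # the recursion step
            if self._peek() != ">":
                raise ValueError("missing '>'")
            self.pos += 1
            return inner
        text = self._scalar()
        if _NUMBER.fullmatch(text):
            return float(text) if "." in text else int(text)
        return text


def parse_common(value: str) -> list:
    parser = _Parser(value)
    result = parser.parse_list()
    if parser._peek():
        raise ValueError(f"trailing data at offset {parser.pos}")
    return result
```

Because the nested value is parsed by the same `parse_list` as the top
level, one parser covers every depth.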

And now comes the recursion economy:

First we wrap the entire *new* header in <...>:

        foobar: <foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar>

This way, the first character of the header tells us that this header
has "common structure".

That explicit "common structure" signal means privately defined
headers can use "common structure" as well, and middleware and
frameworks will automatically Do The Right Thing with them.

Next, we add a field to the IANA HTTP header registry (one can do
that I hope ?) classifying their "angle-bracket status":

 A) not angle-brackets -- incompatible structure, use topical parser
        Range

 B) implicit angle-brackets -- Has common structure but is not <> enclosed
        Accept
        Content-Encoding
        Transfer-Encoding

 C) explicit angle-brackets -- Has common structure and <> enclosed
        all new headers go here

 D) unknown status.
        As it says on the tin.

Using this as whitelist, and given suitable schemas, a good number
of existing headers can go into the common parser.
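
One way such a whitelist could drive parser selection is sketched below
in Python. The registry field does not exist today, so `REGISTRY`,
`BracketStatus` and `choose_parser` are purely hypothetical, and the
handling of unknown headers is one possible reading of the proposal.

```python
from enum import Enum


class BracketStatus(Enum):
    NOT_COMMON = "A"  # incompatible structure, use topical parser
    IMPLICIT = "B"    # common structure, not <> enclosed
    EXPLICIT = "C"    # common structure, <> enclosed
    UNKNOWN = "D"


# Hypothetical excerpt of the proposed registry field, keyed by
# lower-cased header name.
REGISTRY = {
    "range": BracketStatus.NOT_COMMON,
    "accept": BracketStatus.IMPLICIT,
    "content-encoding": BracketStatus.IMPLICIT,
    "transfer-encoding": BracketStatus.IMPLICIT,
}


def choose_parser(name: str, value: str) -> tuple[str, str]:
    """Return (parser kind, text to hand to that parser)."""
    status = REGISTRY.get(name.lower(), BracketStatus.UNKNOWN)
    text = value.strip()
    if status is BracketStatus.NOT_COMMON:
        return "topical", text
    if status is BracketStatus.IMPLICIT:
        return "common", text
    # Explicit or unknown: the enclosing <...> is the only signal.
    if len(text) >= 2 and text[0] == "<" and text[-1] == ">":
        return "common", text[1:-1].strip()
    return "opaque", text
```

A privately defined header such as "foobar: <foo;p1=1, bar>" would then
reach the common parser without any registry entry.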

And then for the final trick:   We can now define new variants of
existing headers which "sidegrade" them into the common parser:

        Date: < 1469734833 >

This obviously needs a signal/negotiation so we know the other side
can grok them (HTTP2: SETTINGS, HTTP1: TE?)

Next:

Data Types
==========

I think we need these fundamental data types, and subtypes:

1)   Unicode strings

2)      ascii-string (maybe)

3)      binary blob

4)   Token

5)   Qualified-token

6)   Number

7)      integer

8)   Timestamp

In addition to these subtypes, schemas can constrain types
further, for instance integer ranges, string lengths etc.;
more on this below.

I will talk about each type in turn, but it goes without saying
that we need to fit them all into RFC723x, in a way that is not
going to break anything important and HPACK should not hate
them either.

In HTTP3+, they should be serialized intelligently, but that
should be trivial and I will not cover that here.

1) Unicode string
-----------------

The first question is do we mean "unrestricted unicode" or do
we want to try to sanitize it.

An example of sanitation is RFC7230's "quoted-string" which bans
control characters except forward horizontal white-space (=TAB).

Another is I-JSON (RFC7493)'s:

   MUST NOT include code points that identify Surrogates or
   Noncharacters as defined by UNICODE.

As far as I can tell, that means that you have to keep a full UNICODE
table handy at all times, and update it whenever additions are made
to unicode.  Not cool IMO.

Imposing a RFC7230 like restriction on unicode gets totally
rococo:  What does "forward horizontal white-space" mean on
a line which uses both left-to-right and right-to-left alphabets ?
What does it mean in alphabets which write vertically ?

Let us absolve the parser from such intimate unicode scholarship
and simply say that the data type "unicode string" is what it says,
and use the schemas to sanitize its individual use.

Encoding unicode strings in HTTP1+2 requires new syntax and
for any number of reasons, I would like to minimize that
and {re-|ab-}use quoted-strings.

RFC7230 does not specify what %80-%FF means in quoted-string, but
hints that it might be ISO8859.

Now we want it to become UTF-8.

My proposal at the workshop, to make the first three characters
inside the quotes a UTF-8 BOM is quite pessimal in HPACK's huffman
encoding:  It takes 68 bits.

Encoding the BOM as '\ufeff' helps but still takes an unreasonable
48 bits in HPACK/huffman.

In both H1 and H2 defining a new "\U" escape seems better.

Since we want to carry unrestricted unicode, we also need escapes
to put the <%20 codepoints back in.  I suggest "\u%%%%" like JSON.

(We should not restrict which codepoints may/should use \u%%%% until
we have studied if \u%%%% may HPACK/huffman better than "raw" UTF-8
in asian codepages.)

The heuristic for parsing a quoted-string then becomes:

        1) If the quoted-string first two characters are "\U"
                -> UTF-8

        2)  If the quoted-string contains "\u%%%%" escape anywhere
                -> UTF-8

        3)  If the quoted-string contains only %09-%7E
                -> UTF-8 (actually: ASCII)

        4)  If the quoted-string contains any %7F-%8F
                -> UTF-8

        5)  If header definition explicitly says ISO-8859
                -> ISO8859

        6)  else
                -> UTF-8
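
The six rules can be written down almost literally. The Python sketch
below assumes the input is the raw bytes between the quotes, that
"%%%%" means four hex digits as in JSON, and that "ISO8859" means
ISO-8859-1; the function name `guess_encoding` is invented here.

```python
import re

# "\u" followed by four hex digits (assumption: same shape as in JSON).
_U_ESCAPE = re.compile(rb"\\u[0-9A-Fa-f]{4}")


def guess_encoding(body: bytes, header_is_iso8859: bool = False) -> tuple[int, str]:
    """Return (rule number, codec name) for a quoted-string body."""
    if body.startswith(b"\\U"):
        return 1, "utf-8"
    if _U_ESCAPE.search(body):
        return 2, "utf-8"
    if all(0x09 <= byte <= 0x7E for byte in body):
        return 3, "ascii"
    if any(0x7F <= byte <= 0x8F for byte in body):
        return 4, "utf-8"
    if header_is_iso8859:
        return 5, "iso-8859-1"
    return 6, "utf-8"
```

The rules are evaluated in order, so an explicit ISO-8859 header
definition only matters for strings that reach rule 5.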

2) Ascii strings
----------------

I'm not sure if we need these or if they are even a good idea.

The "pro" argument is if we insist they are also english text
so we have something the entire world stands a chance to understand.

The "contra" argument is that some people will be upset about that.

If we want them, they're quoted-strings from RFC723x without %7F-%FF.

It is probably better to derive them from unicode strings via the schema.

3) Binary blobs
---------------

Fitting binary blobs from crypto into RFC7230 should squeeze into
quoted-string as well, since we cannot put any kinds of markers or
escapes on tokens without breaking things.

Proposal:

        Quoted-string with "\#" as first two chars indicates base64
        encoded binary blob.

I chose "\#" because "#" is not in the base64 set, so if some
nonconforming implementation eliminates the "unnecessary escape"
it will be clearly visible (and likely recoverable) rather than
munge up the content of the base64.

Base64 is chosen because it is the densest well known encoding which
works well with HPACK/huffman:  The b64 characters on average emit
6.46 bits.

I have no idea how these blobs would look when parsed into JSON,
probably as base64 ?  But in languages which can, they should
probably become native byte-strings.
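
A small Python sketch of the "\#" proposal, using the standard base64
alphabet. The helper names `encode_blob` and `decode_blob` are made up,
and tolerating a dropped backslash is just one way to do the recovery
mentioned above.

```python
import base64


def encode_blob(data: bytes) -> str:
    """Serialize bytes as a quoted-string whose first two chars are \\#."""
    return '"\\#' + base64.b64encode(data).decode("ascii") + '"'


def decode_blob(quoted: str) -> bytes:
    """Decode a blob; also accept '#' alone if the escape was stripped."""
    if len(quoted) < 2 or quoted[0] != '"' or quoted[-1] != '"':
        raise ValueError("not a quoted-string")
    body = quoted[1:-1]
    if body.startswith("\\#"):
        body = body[2:]
    elif body.startswith("#"):
        # '#' is outside the base64 alphabet, so it still marks a blob.
        body = body[1:]
    else:
        raise ValueError("quoted-string is not marked as a binary blob")
    return base64.b64decode(body, validate=True)
```

In Python the decoded value is a native `bytes` object, which matches
the idea of native byte-strings where the language has them.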

4) Token
--------

As we know it from RFC7230:

   tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /
    "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
   token = 1*tchar

5) Qualified Token
------------------

   qualified_token = token 0*1("/" token)

All keys in all dictionaries are of this type.  (In JSON/python...
the keys are strings)

Schemas can restrict this further.
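
For reference, the two ABNF rules translate directly into regular
expressions. This Python sketch uses invented helper names `is_token`
and `is_qualified_token`.

```python
import re

# tchar from RFC7230: the listed punctuation plus DIGIT and ALPHA.
_TCHAR = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]"
_TOKEN = re.compile(rf"{_TCHAR}+")
_QUALIFIED_TOKEN = re.compile(rf"{_TCHAR}+(?:/{_TCHAR}+)?")


def is_token(text: str) -> bool:
    return _TOKEN.fullmatch(text) is not None


def is_qualified_token(text: str) -> bool:
    """token, optionally followed by exactly one '/' and another token."""
    return _QUALIFIED_TOKEN.fullmatch(text) is not None
```

Note that "text/*" is a qualified token under this rule because "*" is
itself a tchar.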

6) Numbers
----------

These are signed decimal numbers which may have a fraction.

In HTTP1+2 we want them always on "%f" format and we want them to
fit in IEEE754 64 bit floating point, which leads to the following
definition:

        0*1"-" DIGIT 0*nDIGIT 0*1("." 0*mDIGIT )        n+m < 15

(15 digits fit in IEEE754 64 bit binary floating point.)

These numbers can (also) be used for millisecond resolution absolute
UNIX-epoch relative timestamps for all foreseeable future.
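
A hedged Python sketch of that definition, reading "n+m < 15" as at
most 15 digits in total. The name `parse_number` is hypothetical.

```python
import re

# Optional sign, at least one digit, optional '.' with optional fraction.
_NUMBER = re.compile(r"-?([0-9]+)(?:\.([0-9]*))?")


def parse_number(text: str) -> float:
    match = _NUMBER.fullmatch(text)
    if match is None:
        raise ValueError("not in %f format")
    int_digits = match.group(1)
    frac_digits = match.group(2) or ""
    # DIGIT 0*nDIGIT means n is the count after the first digit.
    n, m = len(int_digits) - 1, len(frac_digits)
    if n + m >= 15:
        raise ValueError("more than 15 digits")
    return float(text)
```

A millisecond timestamp such as "1470205476.123" uses 13 digits and so
stays within the limit.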

7) Integers
-----------

        0*1"-" 1*15 DIGIT

Same restriction as above to fit into IEEE 754.

Range can & should be restricted by schemas as necessary.

8) Timestamps
-------------

I propose we do these as subtype of Numbers, as UNIX-epoch relative
time.  That is somewhat human-hostile and is leap-second-challenged.

If you know from the schema that a timestamp is coming, the parser
can easily tell the difference between a RFC7231 IMF-fixdate or a
Number-Date.

Without guidance from a schema it becomes inefficient to determine
if it is an IMF-fixdate, since the week day part looks like a token,
but it is not impossible.
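
When the schema says a timestamp is coming, telling the two forms apart
can be as simple as looking at the first character. The Python sketch
below assumes exactly that; `parse_timestamp` is an invented name and
`email.utils.parsedate_to_datetime` stands in for an IMF-fixdate parser.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_timestamp(value: str) -> datetime:
    """Accept either a Number-Date or an RFC7231 IMF-fixdate."""
    text = value.strip()
    if text[:1].isdigit() or text[:1] == "-":
        # Number-Date: seconds relative to the UNIX epoch.
        return datetime.fromtimestamp(float(text), tz=timezone.utc)
    # IMF-fixdate starts with a day name such as "Sun,".
    return parsedate_to_datetime(text)
```

Both "784111777" and "Sun, 06 Nov 1994 08:49:37 GMT" come out as the
same instant with this sketch.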


Schemas
=======

There needs to be an "ABNF"-parallel to specify what is mandatory and
allowed for these headers in "common structure".

Ideally this should be in machine-readable format, so that
validation tools and parser-code can be produced without
(too much) human intervention.  I'm tempted to say we should
make the schemas JSON, but then we need to write JSON schemas
for our schemas :-/

Since schemas basically restrict what you are allowed to
express, we need to examine and think about what restrictions
we want to be able to impose, before we design the schema.

This is the least thought about part of this document, since
the train is now in Lund:

Unicode strings:
----------------

* Limit by (UTF-8) encoded length.
        Ie: a resource restriction, not a typographical restriction.

* Limit by codepoints
        Example: Allow only "0-9" and "a-f"
        The specification of code-points should be list of codepoint
        ranges.  (Ascii strings could be defined this way)

* Limit by allowed strings
        ie: Allow only "North", "South", "East" and "West"

Tokens
------

* Limit by codepoints
        Example: Allow only "A-Z"

* Limit by length
        Example: Max 7 characters

* Limit by pattern
        Example: "A-Z" "a-z" "-" "0-9" "0-9"
        (use ABNF to specify ?)

* Limit by well known set
        Example: Token must be ISO3166-1 country code
        Example: Token must be in IANA FooBar registry

Qualified Tokens
----------------

* Limit each of the two component tokens as above.

Binary Blob
-----------

* Limit by length in bytes
        Example: 128 bytes
        Example: 16-64 or 80 bytes

Number
------

* Limit resolution
        Example: exactly 3 decimal digits

* Limit range
        Example: [2.716 ... 3.1415]

Integer
-------

* Limit range
        Example [0 ... 65535]

Timestamp
---------

(I can't think of usable restrictions here)
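
To get a feel for what a machine-readable schema might have to express,
here is a Python sketch of a few of the restrictions listed above. The
class names `StringSchema` and `RangeSchema` and their fields are
assumptions made for this example, not a proposed schema format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StringSchema:
    max_utf8_len: Optional[int] = None      # limit by encoded length
    codepoint_ranges: Optional[list] = None  # list of inclusive (lo, hi)
    allowed: Optional[frozenset] = None      # limit by allowed strings

    def accepts(self, text: str) -> bool:
        if self.max_utf8_len is not None:
            if len(text.encode("utf-8")) > self.max_utf8_len:
                return False
        if self.codepoint_ranges is not None:
            for char in text:
                if not any(lo <= ord(char) <= hi
                           for lo, hi in self.codepoint_ranges):
                    return False
        if self.allowed is not None and text not in self.allowed:
            return False
        return True


@dataclass
class RangeSchema:
    """Inclusive range limit for Number or Integer values."""
    low: float
    high: float

    def accepts(self, value: float) -> bool:
        return self.low <= value <= self.high


# Examples taken from the lists above.
hex_string = StringSchema(codepoint_ranges=[(0x30, 0x39), (0x61, 0x66)])
compass = StringSchema(allowed=frozenset({"North", "South", "East", "West"}))
port = RangeSchema(0, 65535)
```

The length limit counts UTF-8 bytes rather than characters, in line with
it being a resource restriction.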


Aaand... I'm in Copenhagen...

Let me know if any of this looks usable...


