base64Binary lexical/octet length

base64Binary lexical/octet length

xmlplus custodians
Hi

The XSD 1.1 Datatypes spec, in the base64Binary section, gives the following pseudo-code for calculating the octet length of a base64Binary-encoded string.

---------------------------------------------------------------------------------
1) lex2   := killwhitespace(lexform)    -- remove whitespace characters
2) lex3   := strip_equals(lex2)         -- strip padding characters at end
3) length := floor (length(lex3) * 3 / 4)         -- calculate length

---------------------------------------------------------------------------------


My understanding is that, for a base64Binary-encoded string, its lexical length would be a multiple of 4 and its octet length would be a multiple of 3.

As an example, take a base64Binary-encoded string that contains no whitespace or padding chars (=), so that lexform is the same as lex3 in the above code. If lex3 has length 10 then, according to the above code, the octet length would be 7 (not a multiple of 4).
Are octet lengths that are not a multiple of 4 valid for a base64Binary-encoded string?

Also, what should the formula be for calculating the lexical length from the octet length of a base64Binary string?
Should it be something like this:

lexical-length := ceil( octet-length*4/3)

If we take an example with octet-length=10, the lexical-length is not a multiple of 4.
I am clueless here. Appreciate your help on the same. 

--
Best Regards,
Satya Prakash Tripathi



Re: base64Binary lexical/octet length

C. M. Sperberg-McQueen-2

On Apr 9, 2011, at 2:49 PM, xmlplus custodians wrote:

> Hi
>
> The XSD 1.1 Datatypes spec, in the base64Binary section, gives the following pseudo-code for calculating the octet length of a base64Binary-encoded string.
>
> ---------------------------------------------------------------------------------
> 1) lex2   := killwhitespace(lexform)    -- remove whitespace characters
> 2) lex3   := strip_equals(lex2)         -- strip padding characters at end
> 3) length := floor (length(lex3) * 3 / 4)         -- calculate length
> ---------------------------------------------------------------------------------
>
>
> My understanding is that, for a base64Binary-encoded string, its lexical length would be a multiple of 4 and its octet length would be a multiple of 3.

It's been a while since I read the base64 spec, but my recollection is that base64 encodes
octet sequences of any length, not just octet sequences whose length is a multiple of three.

The lexical length (ignoring whitespace) will indeed always be a multiple of four; the
padding characters are added at the end in order to ensure that this is so.  

>
> As an example, take a base64Binary-encoded string that contains no whitespace or padding
> chars (=), so that lexform is the same as lex3 in the above code. If lex3 has length 10 then,
> according to the above code, the octet length would be 7 (not a multiple of 4).

Yes, precisely.  If the lexical form, ignoring whitespace, is twelve characters long
and the last two characters are equals signs, then what you have is two
clusters of four characters, each of which encodes three octets, followed
by a final cluster of two non-padding characters, which encodes the final
octet.
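
The pseudo-code quoted above can be sketched in Python roughly as follows. This is only an illustration, not the spec's own text: the function name octet_length is made up here, and the input is assumed to be an already valid base64Binary lexical form.

```python
import base64


def octet_length(lexform: str) -> int:
    """Octet length of a base64Binary lexical form, per the quoted pseudo-code."""
    lex2 = "".join(lexform.split())  # killwhitespace: drop all whitespace
    lex3 = lex2.rstrip("=")          # strip_equals: drop trailing padding
    return len(lex3) * 3 // 4        # floor(length(lex3) * 3 / 4)


# Seven octets encode to twelve characters, the last two of which are "=".
encoded = base64.b64encode(b"1234567").decode("ascii")
print(encoded, len(encoded), octet_length(encoded))
```

Ten non-padding characters give floor(10 * 3 / 4) = 7, matching the two full clusters (six octets) plus the final octet described above.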

> Are octet lengths that are not a multiple of 4 valid for a base64Binary-encoded string?

Yes.

> Also, what should the formula be for calculating the lexical length from the octet length of a base64Binary string?
> Should it be something like this:
>
> lexical-length := ceil( octet-length*4/3)
>
> If we take an example with octet-length=10, the lexical-length is not a multiple of 4.
> I am clueless here. Appreciate your help on the same.

In base64 encoding, any input octet stream is subdivided into 24-bit
(i.e. three-octet) groups, each of which is encoded in four base64
digits.  If there are fewer than 24 bits in the final group of bits, then
padding characters are used.  So if you wish to calculate the minimum
length of the base64 encoding for an arbitrary sequence of octets (i.e.
the length of an encoding without any white space), then I think the
formula you want will be 4 * ceil( octet-length / 3).  It is a good idea,
though, to follow the recommendations in the RFC for adding
whitespace and newlines; it makes debugging problems easier, if
nothing else.
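
As a quick check of the formula 4 * ceil( octet-length / 3), here is a small Python sketch that compares it with the output of the standard library's base64 encoder. The function name encoded_length is invented for this example, and the formula is written in integer arithmetic.

```python
import base64


def encoded_length(octet_length: int) -> int:
    """Minimum base64 length (no whitespace) for the given number of octets."""
    # Integer form of 4 * ceil(octet_length / 3).
    return 4 * ((octet_length + 2) // 3)


# Compare the formula with what the encoder actually produces.
for n in range(8):
    actual = len(base64.b64encode(bytes(n)))
    print(n, encoded_length(n), actual)
```

For octet-length = 10 this gives 16, whereas ceil( octet-length*4/3 ) would give 14, which is not a multiple of 4.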

You may find it helpful to read RFC 3548, which is normatively referred
to from the XSD spec.

http://www.ietf.org/rfc/rfc3548.txt

I hope this helps.

 
--
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com 
* http://cmsmcq.com/mib                 
* http://balisage.net
****************************************************************






Re: base64Binary lexical/octet length

Michael Kay

> You may find it helpful to read RFC 3548, which is normatively referred
> to from the XSD spec.
>
> http://www.ietf.org/rfc/rfc3548.txt
>

It may also be worth noting that XSD requires strict conformance to the
RFC, whereas most base64 implementations available "in the wild" are
liberal in what they accept, for example in areas such as the exact
number of trailing "=" signs.
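
To illustrate what a strict check might look like, here is a Python sketch. The pattern is my reading of the base64Binary lexical grammar (full four-character groups, then at most one padded group whose unused bits are zero), so treat it as an assumption to verify against the spec; it also simply ignores all whitespace, which is more permissive than the grammar. The name is_strict_base64 is made up for the example.

```python
import re

# Full four-character groups, optionally followed by one padded final group.
# In a padded group the last data character must leave the unused bits zero.
_STRICT = re.compile(
    r"(?:[A-Za-z0-9+/]{4})*"
    r"(?:[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=|[A-Za-z0-9+/][AQgw]==)?"
)


def is_strict_base64(lexform: str) -> bool:
    """True if lexform, ignoring whitespace, is a strictly padded base64 string."""
    compact = "".join(lexform.split())
    return _STRICT.fullmatch(compact) is not None


for candidate in ("Nw==", "Nw=", "Nw", "Nw===", "MTIzNDU2Nw=="):
    print(repr(candidate), is_strict_base64(candidate))
```

Under this check "Nw==" passes, while "Nw=", "Nw" and "Nw===" are rejected, even though a liberal decoder might accept some of them.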

Michael Kay
Saxonica


Re: base64Binary lexical/octet length

xmlplus custodians
In reply to this post by C. M. Sperberg-McQueen-2
Hi Sperberg-McQueen,

The concept of 3 octets mapping to 4 base64 chars is what gave me the wrong idea that octet lengths have to be a multiple of 3. I guess I had not accounted for the use of padding chars when encoding to base64. With the use of 1 or 2 padding chars, the base64-encoded string, once stripped of whitespace, always has a length that is a multiple of 4. It's clear now!

Thanks for the detailed and patient reply! It was both very insightful and helpful.

-- 
Best Regards,
Satya Prakash Tripathi


(Replying to C. M. Sperberg-McQueen's message of Mon, Apr 11, 2011, 9:50 PM.)


Re: base64Binary lexical/octet length

xmlplus custodians
In reply to this post by Michael Kay
Mike,

I agree. It is evident that only 1 or 2 trailing "=" chars should be allowed.
I assume this can also be deduced by those who prefer an intuitive argument. If we end up with fewer than 24 bits in the last group, i.e. either 8 bits or 16 bits, then they encode to 2 and 3 base64 chars respectively. Hence the need for exactly 2 and 1 padding (=) chars in those cases to complete the 4-char base64 group.
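
The deduction above can be checked with a few lines of Python. This is just a sketch; padding_chars is a name invented for the example, and the standard library encoder is used only to confirm the counts.

```python
import base64


def padding_chars(octet_length: int) -> int:
    """Number of trailing "=" characters for the given number of octets."""
    leftover = octet_length % 3  # octets in the final, partial group
    return (3 - leftover) % 3    # 0 -> 0, 1 -> 2, 2 -> 1


# One, two and three zero octets: two, one and no padding characters.
for n in (1, 2, 3):
    encoded = base64.b64encode(bytes(n)).decode("ascii")
    print(n, encoded, padding_chars(n))
```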

-- 
Best Regards,
Satya Prakash Tripathi

(Replying to Michael Kay's message of Mon, Apr 11, 2011, 11:03 PM.)