HTTP Spec: PUT without data transfer, since hash of data is known to server

Thomas Güttler
I have seen a lot of useless uploads when syncing a local file system with a remote WebDAV server.

I thought about this and asked on Stack Overflow.

My idea is to have a PUT which uses ETags, or something ETag-like, so that the
data transfer can be omitted if the server already knows the hash sum of
the data.
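
A minimal sketch of the idea above, assuming a toy in-memory server rather than real HTTP. `DedupStore`, `put_by_hash` and `sync_file` are hypothetical names, and SHA-256 is an assumed choice of hash. Nothing here is part of any HTTP specification.

```python
import hashlib


class DedupStore:
    """Toy in-memory stand-in for a server that indexes bodies by SHA-256."""

    def __init__(self):
        self.blobs = {}      # hex digest -> body bytes
        self.resources = {}  # path -> hex digest

    def put(self, path, data):
        """Ordinary PUT: the full body is transferred and stored."""
        digest = hashlib.sha256(data).hexdigest()
        self.blobs[digest] = data
        self.resources[path] = digest
        return digest

    def put_by_hash(self, path, digest):
        """Hash-only PUT (hypothetical): succeeds only if the body is known."""
        if digest not in self.blobs:
            return False
        self.resources[path] = digest
        return True

    def get(self, path):
        return self.blobs[self.resources[path]]


def sync_file(store, path, data):
    """Try the hash-only request first and fall back to a full upload.

    Returns the number of body bytes that had to be sent.
    """
    digest = hashlib.sha256(data).hexdigest()
    if store.put_by_hash(path, digest):
        return 0
    store.put(path, data)
    return len(data)
```

With this model, the first `sync_file` call for some content sends the whole body, and a later call with the same content under another path sends none.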

I got a really good answer from someone who knows the HTTP-specs much
better than I do:

http://stackoverflow.com/questions/32794863/http-spec-put-without-data-transfer-since-hash-of-data-is-known-to-server

Maybe my goal is too ambitious ... but I don't want to implement this. I want an official spec :-)

What do you think?

Could something like this become an official recommendation?

BTW, I can't decide **how** to implement this. My knowledge
of the HTTP spec is too limited at the moment.

For now I want to ask:

 - Is it possible at all to create a spec for HTTP PUT which omits
   the data if the server knows the hash sum (like deduplicating file systems)?

 - If yes, then what is the next step?

PS: Of course the spec should be optional. A server can support it, but doesn't need to.

Regards,
  Thomas Güttler

--
http://www.thomas-guettler.de/


Re: HTTP Spec: PUT without data transfer, since hash of data is known to server

Ed McClanahan

Hmm... HTTP PATCH sounds like a problem then. Imagine that a previous PUT of some other resource included said hash. A later PATCH modifies a portion of that old resource. In order to be able to reference the new content of that old resource, a new hash for the entire resource needs to be recalculated. Not very practical for small PATCHes to large resources....

Still, it seems HTTP PATCH also provides an elegant solution. Using PATCH, the payload could be a simple "the data for my new resource has this hash" rather than the data itself. The HTTP server could accept or reject the PATCH request based upon whether or not it has seen this hash before. If rejected, the client just does the normal PUT with unique data anyway.

Going further, some sort of rsync-like HTTP PATCH payload could be used where blocks of the resource to be uploaded are individually hashed. The PATCH response could be "OK, I have these blocks but not those". A subsequent PATCH could upload only those blocks that contain new data.
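
The block-wise exchange described above could look roughly like the following sketch. It assumes fixed-size blocks and SHA-256, which is a simplification (rsync itself uses rolling checksums). All function names are hypothetical, and the dictionary `known` stands in for the server's block index.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, chosen only to keep the example readable


def split_blocks(data, size=BLOCK_SIZE):
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def block_digests(data, size=BLOCK_SIZE):
    """Client side: one SHA-256 hex digest per block."""
    return [hashlib.sha256(b).hexdigest() for b in split_blocks(data, size)]


def missing_blocks(known, digests):
    """Server side: indices of blocks whose digest is not in the index."""
    return [i for i, d in enumerate(digests) if d not in known]


def assemble(known, digests, uploaded):
    """Server side: rebuild the resource from known and newly sent blocks.

    `uploaded` maps block index -> bytes. Each uploaded block is checked
    against the digest the client announced before it is accepted.
    """
    parts = []
    for i, d in enumerate(digests):
        if i in uploaded:
            block = uploaded[i]
            if hashlib.sha256(block).hexdigest() != d:
                raise ValueError("block %d does not match its digest" % i)
            known[d] = block
        parts.append(known[d])
    return b"".join(parts)
```

In this model the first request carries only `block_digests(new_data)`, the response carries the result of `missing_blocks`, and the second request carries just those blocks.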

I would like to add that hashes aren't perfect - most notably MD5. False positives would seemingly be a problem. Some scheme might be needed to be able to detect false positives.

Finally, there is definitely a security question. The best example of it was once described to me this way:

1) I work at a company that archives the form letters containing all job offers differing only by the employee's name and salary.

2) I want to know John Smith's salary (i.e. I know his name but not his salary).

3) I compose a series of form letter offers each with John Smith's name but with varying salaries.

4) I try this dedupe-able PUT/PATCH operation for each such offer letter.

5) My HTTP client reports which one is dedupe-able.

The result of #5 reveals John Smith's salary. Oops!
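
To make the concern concrete, here is a small simulation of steps 1 to 5, assuming the server exposes a "do you already have this hash?" answer. The letter template, the names and the salary range are invented for illustration, and `server_knows` is a hypothetical stand-in for the dedupe-able request.

```python
import hashlib

# Invented form letter; only the name and the salary vary.
TEMPLATE = "Dear {name}, we are pleased to offer you a salary of {salary} EUR."


def digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def probe_salary(server_knows, name, candidates):
    """Return the first candidate salary whose letter the server already holds.

    `server_knows` is any callable that answers whether a digest is known,
    which is exactly the information a dedupe-able PUT/PATCH would reveal.
    """
    for salary in candidates:
        letter = TEMPLATE.format(name=name, salary=salary)
        if server_knows(digest(letter)):
            return salary
    return None
```

The probe never transfers a letter body; it only needs the yes/no answer per hash, which is why the number of guesses is the only cost to the attacker.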

Just wanted to throw my PATCH alternative out there.



Re: HTTP Spec: PUT without data transfer, since hash of data is known to server

Manfred Baedke
In reply to this post by Thomas Güttler
Hi Thomas,

This is the wrong mailing list - the topic is plain HTTP.
> I want an official spec
It's already there. PUT and ETags are defined by the HTTP spec, see
RFC 7230 through RFC 7235.

Best regards,
Manfred

On 07.10.15 08:06, Thomas Güttler wrote:

> I have seen a lot of useless uploads when syncing a local file system with a remote WebDAV server.
>
> I thought about this and asked on Stack Overflow.
>
> My idea is to have a PUT which uses ETags, or something ETag-like, so that the
> data transfer can be omitted if the server already knows the hash sum of
> the data.
>
> I got a really good answer from someone who knows the HTTP-specs much
> better than I do:
>
> http://stackoverflow.com/questions/32794863/http-spec-put-without-data-transfer-since-hash-of-data-is-known-to-server
>
> Maybe my goal is too ambitious ... but I don't want to implement this. I want an official spec :-)
>
> What do you think?
>
> Could something like this become an official recommendation?
>
> BTW, I can't decide **how** to implement this. My knowledge
> of the HTTP spec is too limited at the moment.
>
> For now I want to ask:
>
>   - Is it possible at all to create a spec for HTTP PUT which omits
>     the data if the server knows the hash sum (like deduplicating file systems)?
>
>   - If yes, then what is the next step?
>
> PS: Of course the spec should be optional. A server can support it, but doesn't need to.
>
> Regards,
>    Thomas Güttler
>

--
Manfred Baedke

<green/>bytes GmbH
Hafenweg 16
D-48155 Münster
Germany
Amtsgericht Münster: HRB5782


Re: HTTP Spec: PUT without data transfer, since hash of data is known to server

Julian Reschke
In reply to this post by Ed McClanahan
On 2015-10-07 16:26, Ed McClanahan wrote:

> Hmm... HTTP PATCH sounds like a problem then. Imagine that a previous
> PUT of some other resource included said hash. A later PATCH modifies a
> portion of that old resource. In order to be able to reference the new
> content of that old resource, a new hash for the entire resource needs
> to be recalculated. Not very practical for small PATCHes to large
> resources....
>
> Still, it seems HTTP PATCH also provides an elegant solution. Using
> PATCH, the payload could be a simple "the data for my new resource has
> this hash" rather than the data itself. The HTTP server could accept or
> reject the PATCH request based upon whether or not it has seen this hash
> before. If rejected, the client just does the normal PUT with unique
> data anyway.

Right, that's one way to do it that is easier to implement than
extending PUT. (Essentially a new Internet Media Type with
PATCH-specific semantics)
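
As a rough sketch of what such a media type could mean on the server side: the media type name, the JSON layout and the use of 409 for an unknown hash are all assumptions made for illustration. Only 415 for an unsupported patch document format follows what the PATCH specification suggests.

```python
import json

# Hypothetical, unregistered media type for a "hash reference" patch document.
HASH_PATCH_TYPE = "application/example-hash-ref+json"


def handle_patch(blobs, resources, path, content_type, body):
    """Apply a hash-reference PATCH and return an HTTP status code.

    `blobs` maps digest -> stored body, `resources` maps path -> digest.
    """
    if content_type != HASH_PATCH_TYPE:
        return 415  # Unsupported Media Type: unknown patch document format
    digest = json.loads(body)["sha256"]
    if digest not in blobs:
        return 409  # assumed status; the client then falls back to a normal PUT
    resources[path] = digest
    return 204  # applied, nothing to return
```

A client would send a body such as `{"sha256": "<hex digest>"}` and retry with an ordinary PUT whenever the request is rejected.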

Another approach would be the use of a new Content-Coding...

> Going further, some sort of rsync-like HTTP PATCH payload could be used
> where blocks of the resource to be loaded are individually hashed. The
> PATCH response could be "OK, I have these blocks but not those". A
> subsequent PATCH could upload only those blocks that contain new data.
>
> I would like to add that hashes aren't perfect - most notably MD5. False
> positives would seemingly be a problem. Some scheme might be needed to
> be able to detect false positives.
>
> Finally, there is definitely a security question. The best example of it
> was once described to me this way:

Right. Google for deduplication + security.

> 1) I work at a company that archives the form letters containing all job
> offers differing only by the employee's name and salary.
>
> 2) I want to know John Smith's salary (i.e. I know his name but not his
> salary).
>
> 3) I compose a series of form letter offers each with John Smith's name
> but with varying salaries.
>
> 4) I try this dedupe-able PUT/PATCH operation for each such offer letter.
>
> 5) My HTTP client reports which one is dedupe-able.
>
> The result of #5 reveals John Smith's salary. Oops!
>
> Just wanted to throw out there my PATCH alternative.
> ...


Best regards, Julian

Re: HTTP Spec: PUT without data transfer, since hash of data is known to server

Thomas Güttler
In reply to this post by Ed McClanahan
On 07.10.2015 at 16:26, Ed McClanahan wrote:
> Hmm... HTTP PATCH sounds like a problem then. Imagine that a previous PUT
> of some other resource included said hash. A later PATCH modifies a portion
> of that old resource. In order to be able to reference the new content of
> that old resource, a new hash for the entire resource needs to be
> recalculated. Not very practical for small PATCHes to large resources...

Yes, a small PATCH to a big resource would result in a recalculation
of the hash sum. This recalculation would need to scan the whole resource,
although only a small part has changed. That's true.
But "that's life", I see no problem. At least in my environment PATCH is hardly used.
I mostly see this: whole files get uploaded and downloaded.

> Still, it seems HTTP PATCH also provides an elegant solution. Using PATCH,
> the payload could be a simple "the data for my new resource has this hash"
> rather than the data itself. The HTTP server could accept or reject the
> PATCH request based upon whether or not it has seen this hash before. If
> rejected, the client just does the normal PUT with unique data anyway.

I am not sure if I can follow your thoughts.

Do you want to use PATCH to implement uploads without data transfer, or
do you want to use "sending data without transfer" for PATCH, too?

From RFC 5789:

 The PATCH method requests that a set of changes described in the
 request entity be applied to the resource identified by the Request-URI.

AFAIK you can only PATCH existing resources. My idea is to PUT new
resources. The same way could be used for PATCH, but I would like to
handle this later.

> Going further, some sort of rsync-like HTTP PATCH payload could be used
> where blocks of the resource to be loaded are individually hashed. The
> PATCH response could be "OK, I have these blocks but not those". A
> subsequent PATCH could upload only those blocks that contain new data.

I would like to keep it simple during the first step and focus on whole uploads only.

> I would like to add that hashes aren't perfect - most notably MD5. False
> positives would seemingly be a problem. Some scheme might be needed to be
> able to detect false positives.

Yes, I know. Client and server need to agree on a hash method somehow.
If both want MD5, they can use it. But I would not offer it if I were
writing a server.

> Finally, there is definitely a security question. The best example of it
> was once described to me this way:
>
> 1) I work at a company that archives the form letters containing all job
> offers differing only by the employee's name and salary.
>
> 2) I want to know John Smith's salary (i.e. I know his name but not his
> salary).
>
> 3) I compose a series of form letter offers each with John Smith's name but
> with varying salaries.
>
> 4) I try this dedupe-able PUT/PATCH operation for each such offer letter.
>
> 5) My HTTP client reports which one is dedupe-able.
>
> The result of #5 reveals John Smith's salary. Oops!


Yes, that's a security concern.

This could be a solution: If the data with the same hash value comes
from a different area, then the server should answer with "I have
the data for this hash sum" only if the data was uploaded twice or more.
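
One possible reading of this rule, sketched in Python. It assumes that "area" means something like a per-user or per-tenant scope and that "uploaded twice or more" counts full uploads across all areas. `ThresholdIndex` and its methods are hypothetical names, and whether the rule is sufficient against probing is not settled in this thread.

```python
from collections import defaultdict


class ThresholdIndex:
    """Tracks how often each digest was fully uploaded, per area."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        # digest -> area -> number of full uploads
        self.uploads = defaultdict(lambda: defaultdict(int))

    def record_upload(self, digest, area):
        self.uploads[digest][area] += 1

    def may_confirm(self, digest, area):
        """Should the server admit to `area` that it knows this digest?"""
        per_area = self.uploads.get(digest, {})
        if per_area.get(area, 0) > 0:
            return True  # the requesting area uploaded this content itself
        # Content from other areas is confirmed only above the threshold.
        return sum(per_area.values()) >= self.threshold
```

Content that exists exactly once in someone else's area is therefore never confirmed to an outsider, at the cost of one redundant upload.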

I won't be able to reply next week.

I was told this is the wrong list, since my topic is about HTTP and not WebDAV.

I will write to the HTTP list in the week of 19 October.

I hope to see you there.

Thank you for reading and your interest in this topic.

Regards,
  Thomas Güttler


--
http://www.thomas-guettler.de/

Re: HTTP Spec: PUT without data transfer, since hash of data is known to server

Julian Reschke
On 2015-10-08 19:47, Thomas Güttler wrote:

> On 07.10.2015 at 16:26, Ed McClanahan wrote:
>> Hmm... HTTP PATCH sounds like a problem then. Imagine that a previous PUT
>> of some other resource included said hash. A later PATCH modifies a portion
>> of that old resource. In order to be able to reference the new content of
>> that old resource, a new hash for the entire resource needs to be
>> recalculated. Not very practical for small PATCHes to large resources...
>
> Yes, a small PATCH to a big resource would result in a recalculation
> of the hash sum. This recalculation would need to scan the whole resource,
> although only a small part has changed. That's true.
> But "that's life", I see no problem. At least in my environment PATCH is hardly used.
> I mostly see this: whole files get uploaded and downloaded.
>
>> Still, it seems HTTP PATCH also provides an elegant solution. Using PATCH,
>> the payload could be a simple "the data for my new resource has this hash"
>> rather than the data itself. The HTTP server could accept or reject the
>> PATCH request based upon whether or not it has seen this hash before. If
>> rejected, the client just does the normal PUT with unique data anyway.
>
> I am not sure if I can follow your thoughts.
>
> Do you want to use PATCH to implement uploads without data transfer, or
> do you want to use "sending data without transfer" for PATCH, too?
>
> From RFC 5789:
>
>   The PATCH method requests that a set of changes described in the
>   request entity be applied to the resource identified by the Request-URI.
>
> AFAIK you can only PATCH existing resources. My idea is to PUT new
> resources. The same way could be used for PATCH, but I would like to
> handle this later.

The resource always exists (it has a URI), it just might not have a
GETable representation yet.

So yes, you can use PATCH in that case; it all depends on the PATCH
semantics defined for the media type used in the request.

>...


Best regards