libwww and avoiding download of binary/unknown files

libwww and avoiding download of binary/unknown files

Silvan Calarco <silvan.calarco@mambasoft.it>

Hi.
I'm writing my first app based on libwww; it aims to do something similar to
webbot, but I'm facing a problem that I can't solve because of my limited
knowledge of the libwww architecture.
When a web site is scanned recursively using anchors and requests, all the
files are downloaded, including binary files. For these, a save file name is
prompted to the user (my app and webbot behave in the same manner), but I
don't want binary files to be downloaded at all. If I define the following
callback, the user is not prompted anymore, but the file is still transferred
from the network to the black hole, thus generating useless traffic:

HTMIME_setSaveStream(HTBlackHoleConverter);

So my question is, can I detect the content type of a file (presumably letting
libwww read just a part of it) and then decide not to download it? How?

Thanks!

Silvan

--
mambaSoft di Silvan Calarco - http://www.mambasoft.it



RE: libwww and avoiding download of binary/unknown files

Adam Mlodzinski

> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]]
> On Behalf Of Silvan Calarco
> Sent: Monday, September 04, 2006 6:55 AM
> To: [hidden email]
> Subject: libwww and avoiding download of binary/unknown files
>
>
> Hi.
> I'm writing my first app based on libwww, it aims to do
> something similar to webbot but I'm facing a problem that I
> can't solve because of my limited knowledge of the libwww
> architecture.
> When a web site is scanned recursively using anchors and
> requests


How do you accomplish the recursive scanning? Is this a feature of
libwww, or have you written your own code to do this?



> all the files are downloaded including binary files.


This is always tricky. What is a binary file? Is an image file binary?
Probably, if it's a GIF or PNG, but what about an SVG file? Okay, easy
enough. But what about a PDF file, or files with no extension at all?
Everyone has their own ideas of what makes a binary file binary, and not
text/ASCII.


> For these, a save file name is prompted to the user (my app and
> webbot behave in the same manner), but I don't want binary
> files to be downloaded at all. If I define the following
> callback, the user is not prompted anymore but the file is
> transferred from the network to the black hole, thus generating
> useless traffic:
>
> HTMIME_setSaveStream(HTBlackHoleConverter);
>
> So my question is, can I detect the content type of a file
> (presumably letting libwww read just a part of it) and then
> decide not to download it? How?

You have two options: use a HEAD request for each file during the
recursive scan (although I don't think all servers support HEAD requests
properly) instead of a GET - then decide whether you want the file
based on its MIME type (probably set up a filter to do that); OR, decide
if you want the file based solely on the file name and/or extension
(essentially what MIME does, only instead of asking the server, you
decide for yourself).

Keep in mind that file extensions don't always give away the file
contents - it's a nice convention used 99.9% of the time, but there's
nothing preventing anyone from naming a file, ASCII or binary, with any
extension they feel like. I know of (vaguely) a Perl script that can
tell you if a file is ASCII or binary by reading the first few bytes of
the file - but that requires the file to be present, an option you don't
have in your case.



--
Adam Mlodzinski


Re: libwww and avoiding download of binary/unknown files

Silvan Calarco <silvan.calarco@mambasoft.it>

At 01:17 on Friday, 8 September 2006, Adam Mlodzinski wrote:
> How do you accomplish the recursive scanning? Is this a feature of
> libwww, or have you written your own code to do this?

I have defined a link callback function that receives every link from the top
page I request. In this function I issue a request for each internal link
found, then I wait for the event loop to end, and I've got all the pages.

> > all the files are downloaded including binary files.
>
> This is always tricky. What is a binary file? Is an image file binary?
> Probably, if it's a GIF or PNG, but what about an SVG file? Okay, easy
> enough. But what about a PDF file, or files with no extension at all?
> Everyone has their own ideas of what makes a binary file binary, and not
> text/ASCII.

By default libwww prompts for saving all the files it doesn't recognize or
considers binary; that's enough for me for now. libwww does it, but maybe I
need to understand better how it does it... I just want to get HTML pages and
avoid downloading any other file.

> You have two options: use a HEAD request for each file during the
> recursive scan (although I don't think all servers support HEAD requests
> properly) instead of a GET - then decide whether you want the file
> based on its MIME type (probably set up a filter to do that); OR, decide
> if you want the file based solely on the file name and/or extension
> (essentially what MIME does, only instead of asking the server, you
> decide for yourself).
>
> Keep in mind that file extensions don't always give away the file
> contents - it's a nice convention used 99.9% of the time, but there's
> nothing preventing anyone from naming a file, ASCII or binary, with any
> extension they feel like. I know of (vaguely) a Perl script that can
> tell you if a file is ASCII or binary by reading the first few bytes of
> the file - but that requires the file to be present, an option you don't
> have in your case.

I suppose the HEAD request will read only the beginning of a non-HTML file and
return either that the header is not recognized or the MIME type it is
recognized as. If I can do that, it's enough. I'll try what you suggest and
let you know.
Thanks.

Bye,
Silvan

--
mambaSoft di Silvan Calarco - http://www.mambasoft.it


RE: libwww and avoiding download of binary/unknown files

Adam Mlodzinski
In reply to this post by Silvan Calarco <silvan.calarco@mambasoft.it>

> -----Original Message-----
> From: Silvan Calarco [mailto:[hidden email]]
> Sent: Thursday, September 07, 2006 8:43 PM
> To: Adam Mlodzinski
> Cc: [hidden email]
> Subject: Re: libwww and avoiding download of binary/unknown files
>
> At 01:17 on Friday, 8 September 2006, Adam Mlodzinski wrote:
> > How do you accomplish the recursive scanning? Is this a feature of
> > libwww, or have you written your own code to do this?
>
> I have defined a link callback function which gets any link
> from the top page I request. In this function I perform a
> request for any internal link found, then I wait for the
> event loop to end and I've got all the pages.

It sounds like the decision whether to download or not should be made here, in your link callback function. You are essentially telling libwww, when it finds a 'link', to 'go and get this file'. Is the SRC of an <IMG> tag a link?
Or are you hoping that if you just tell libwww to 'go and get this file', it will respond with 'no, I don't think so - it's a binary file'?


 

> > > all the files are downloaded including binary files.
> >
> > This is always tricky. What is a binary file? Is an image file binary?
> > Probably, if it's a GIF or PNG, but what about an SVG file? Okay, easy
> > enough. But what about a PDF file, or files with no extension at all?
> > Everyone has their own ideas of what makes a binary file binary, and
> > not text/ASCII.
>
> By default libwww prompts for saving all the files it doesn't
> recognize or considers binary, that's enough for me now,

Okay, so you want to decide yourself (with libwww's help) instead of asking the server.


> libwww does it, but maybe I need to know better how it does
> it... I just want to get html pages and avoid downloading any
> other file.

Well, it looks like libwww defines file-extension mappings to binary file types in HTBInit.c. You might be able to use HTBind_getFormat in your link callback function to get information about the file type based on its name.

 

> > You have two options: use a HEAD request for each file during the
> > recursive scan (although I don't think all servers support HEAD
> > requests properly) instead of a GET - then decide whether you want
> > the file based on its MIME type (probably set up a filter to do
> > that); OR, decide if you want the file based solely on the file name
> > and/or extension (essentially what MIME does, only instead of asking
> > the server, you decide for yourself).
> >
> > Keep in mind that file extensions don't always give away the file
> > contents - it's a nice convention used 99.9% of the time, but there's
> > nothing preventing anyone from naming a file, ASCII or binary, with
> > any extension they feel like. I know of (vaguely) a Perl script that
> > can tell you if a file is ASCII or binary by reading the first few
> > bytes of the file - but that requires the file to be present, an
> > option you don't have in your case.
>
> I suppose the HEAD request will read only the beginning of a
> non html file and return that the header is not recognized or
> is recognized with a MIME type.

That probably depends on the web server software - most web servers use a file-extension to MIME-type mapping, though there might be some that do what you suggest.

Use a HEAD request if you want the web server to tell you what type of file a link points to; use libwww's HTBind_getFormat if you want to figure it out yourself. The latter doesn't even require a HEAD request, so network bandwidth is reduced even further.



> If I can do that it's enough. I'll try to do what you suggest
> and let you know.
> Thanks.
>
> Bye,
> Silvan
>
> --
> mambaSoft di Silvan Calarco - http://www.mambasoft.it
>