Download only HTTP headers

Nguyen

Hi folks,

I'm writing a small app that needs to list all HTML pages (only HTML
pages: no images, CSS, PDFs, GIFs etc.) reachable from a starting page.
Given any link, I use Net::HTTP to download the content and examine the
Content-Type header to determine whether it is an HTML page or
something else. (I don't think pattern matching on the URL will work,
because dynamic pages can return arbitrary resources [HTML, images
etc.].) However, I cannot find any way to read only the headers without
reading the whole content of the resource (e.g. a PDF). That makes my
app very slow, when all I want is to list the HTML pages.

Does anyone have any suggestion as to how this problem can be solved?

Thanks in advance

Nguyen
 
Tom Werner

Net::HTTP can issue HEAD requests in addition to GET. The response to a
HEAD request carries headers but no body, so you can check Content-Type
without downloading the resource.
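
A minimal sketch of the HEAD approach, assuming plain http/https URLs. The helper names `head_content_type` and `html_content_type?` are made up for illustration and are not part of Net::HTTP. The sketch does not follow redirects, and some servers may reject HEAD or answer it differently from GET, so a real crawler would probably need a fallback.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: true when the media type in a Content-Type
# header value is text/html (parameters such as charset are ignored).
def html_content_type?(value)
  return false if value.nil?
  value.split(";").first.to_s.strip.downcase == "text/html"
end

# Hypothetical helper: send a HEAD request and return the raw
# Content-Type header value (or nil if the server sent none).
# Only headers are transferred; no body is downloaded.
def head_content_type(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    response = http.head(uri.request_uri)
    response["content-type"]
  end
end

# Example usage (performs a network request; the URL is a placeholder):
#   type = head_content_type("http://www.example.com/")
#   puts "HTML page" if html_content_type?(type)
```

Splitting the header check into its own method keeps the parsing testable without any network access.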

Tom
 
Nguyen

Works like a charm. Thanks a great deal, Tom.

Nguyen

 
