Distinguish text URLs from non-text URLs?

Kaidi

Hello,
I have a question about writing a crawler-like program in Java.
Since I only need the text (HTML) files, I am wondering whether anyone knows a good way to distinguish text URLs (files such as .html, .htm, etc.) from non-text URLs?

What I want is: given a URL as a String, how can I decide whether it points to a text file (.htm, etc.) or not? Text pages usually have URLs ending in .htm, .html, etc., but many dynamic pages do not. For example,
http://www.amazon.com/exec/obidos/ASIN/B00006HXJ6/ref=nosim/fatwalletcom/002-5149236-2409652
points to an HTML page, but the URL has no file extension.

I have tried getContentType() from class URLConnection, but it works badly: it even reports many .pdf files as text. :-(

Does anyone have any ideas?
Thanks and happy new year!
 
Tony Morris

The content type of the response is a web server setting.
If the server is responding with "Content-Type: text/plain" from a PDF file,
then it has not been configured correctly.

Server configuration is *usually* done with a mapping between file extension
and content type, so if you were to duplicate this functionality on the
client side, you may be prone to problems.

I'd configure the server correctly and use the getContentType() call.
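
To illustrate why a client-side extension mapping is fragile, here is a minimal sketch. The class name ExtensionGuess and the small table of types are made up for this example and are not taken from any server's real configuration. It only looks at the last path segment of the URL, so it has nothing to say about extensionless URLs like the Amazon one in the question.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

final class ExtensionGuess {

    // Hypothetical, deliberately tiny mapping; a real server's table is much larger.
    private static final Map<String, String> TYPES = Map.of(
            "html", "text/html",
            "htm", "text/html",
            "txt", "text/plain",
            "pdf", "application/pdf");

    /** Guesses a content type from the URL's file extension, if it has a known one. */
    static Optional<String> guessFromExtension(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return Optional.empty(); // opaque URI without a path
        }
        // Only the last path segment can carry a file extension.
        String fileName = path.substring(path.lastIndexOf('/') + 1);
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty(); // no extension at all
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(TYPES.get(ext));
    }
}
```

For a URL ending in index.html this returns text/html, but for the dynamic URL in the question it returns an empty Optional, which is exactly the case the crawler needs to handle.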
 
Tor Iver Wilhelmsen

Kaidi said:
What I want is: given a String type url, how can I decide whether this URL points to a text file (.htm, etc) or not?

Connect to the URL, do a HEAD request, and check the content type.

Kaidi said:
We know text pages usually have URLs ending with .htm, .html, etc.

Not necessarily.

Kaidi said:
I have tried to use the getContentType() from Class URLConnection, but it works so badly that it even considers many .pdf files as text. :-(

No, it's not the method URLConnection.getContentType() that is bad,
it's the web server sending the wrong content type. The API cannot fix
outside errors.
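
The HEAD approach could be sketched roughly as follows. The class and method names (HeadContentTypeCheck, fetchContentType, isTextContentType) and the 5-second timeouts are assumptions for this example; only HttpURLConnection and its methods come from the JDK. The result is only as good as the header the server sends.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;

final class HeadContentTypeCheck {

    /** Returns true if a Content-Type header value names a text/* media type. */
    static boolean isTextContentType(String contentType) {
        if (contentType == null) {
            return false; // the server sent no Content-Type header
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semicolon = contentType.indexOf(';');
        String mediaType = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return mediaType.startsWith("text/");
    }

    /** Sends a HEAD request and returns the reported Content-Type, or null if absent. */
    static String fetchContentType(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD"); // ask for headers only, not the body
            conn.setConnectTimeout(5000);  // assumed timeouts, in milliseconds
            conn.setReadTimeout(5000);
            return conn.getContentType();
        } finally {
            conn.disconnect();
        }
    }
}
```

A crawler would then call isTextContentType(fetchContentType(url)) before downloading the body. Some servers may not answer HEAD the same way they answer GET, so it seems sensible to treat a missing or odd header as "unknown" rather than "not text".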
 
Daniel Sjöblom

Kaidi said:
Hello,
I have a question about writing a crawler-like program in Java.
Since I only need the text (HTML) files, I am wondering whether anyone knows a good way to distinguish text URLs (files such as .html, .htm, etc.) from non-text URLs?

You could try 'sniffing' the first few bytes of the files. If they start
with <!DOC or <html you can be pretty sure they're html files. Of
course, this isn't foolproof.
 
Kaidi

Thanks friends.
I will try the HEAD request as suggested above.
Currently, I 'sniff' the bytes to see whether they contain any HTML tags
such as HTML, HEAD, META, BODY, <p>, etc. It works OK for me, although it is
somewhat troublesome and something of a "heuristic".
 
Andrew Thompson

Kaidi said:
Thanks friends.
I will try the HEAD request as suggested above.
Currently, I 'sniff' the bytes to see whether they contain any HTML tags
such as HTML, HEAD, META, BODY, <p>, etc. It works OK for me, although it is
somewhat troublesome and something of a "heuristic".

Properly formed HTML documents have a
declaration like this at the very top:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">

So if the file _starts_ with '<!DOCTYPE HTML'
you can tell early that this is an HTML document.
[ Unfortunately, very few pages _are_ properly
formed. ]

Otherwise I would recommend searching for the strings
you mentioned, but with the opening '<', such as
'<head' or '<html'.

That reminds me of something else: make sure
you check for both upper and lower case,
as either is valid.

HTH
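
The sniffing idea discussed above might look something like this minimal sketch. The class name HtmlSniffer, the marker list, and the 1024-byte limit are assumptions for illustration; the markers simply follow the strings suggested in this thread. It is still only a heuristic and can be fooled by non-HTML files that happen to contain these strings.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class HtmlSniffer {

    // Lower-cased markers based on the suggestions in this thread.
    private static final String[] MARKERS = {"<!doctype html", "<html", "<head", "<body"};

    // Assumed limit: only the start of the document is inspected.
    private static final int MAX_SNIFF_BYTES = 1024;

    /** Returns true if the first bytes of a response contain a typical HTML marker. */
    static boolean looksLikeHtml(byte[] firstBytes) {
        int len = Math.min(firstBytes.length, MAX_SNIFF_BYTES);
        // ISO-8859-1 maps every byte to exactly one char, so decoding cannot fail.
        // Lower-casing makes the comparison case-insensitive.
        String prefix = new String(firstBytes, 0, len, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        for (String marker : MARKERS) {
            if (prefix.contains(marker)) {
                return true;
            }
        }
        return false;
    }
}
```

The caller would read roughly the first kilobyte of the response body into a byte array and pass it in. This could serve as a fallback when the Content-Type header looks wrong.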
 
