Distinguish text URLs from non-text URLs?

Kaidi · Dec 31, 2003

Hello,
I have a question about writing a crawler-like program in Java.
As I only need the text (HTML) files, I am wondering whether anyone
knows a good way to distinguish text URLs (files such as .html, .htm,
etc.) from non-text URLs?

What I want is: given a URL as a String, how can I decide whether it
points to a text file (.htm, etc.) or not? We know text pages usually
have URLs ending with .htm, .html, etc. But many dynamic pages do not.
For example,
http://www.amazon.com/exec/obidos/ASIN/B00006HXJ6/ref=nosim/fatwalletcom/002-5149236-2409652
points to an HTML page, but the URL has no file extension.

I have tried getContentType() from class URLConnection, but it works
badly: it even reports many .pdf files as text. :-(

Does anyone have any ideas?
Thanks and happy new year!

Tony Morris · Dec 31, 2003

The content type of the response is a web server setting.
If the server is responding with "Content-Type: text/plain" from a PDF file,
then it has not been configured correctly.

Server configuration is *usually* done with a mapping between file extension
and content type, so if you were to duplicate this functionality on the
client side, you may be prone to problems.

I'd configure the server correctly and use the getContentType() call.
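
To illustrate why duplicating the extension mapping on the client is fragile, here is a minimal sketch. The class name ExtensionGuess, the method guessFromExtension, and the four-entry table are all made up for illustration; a real server's table is much larger and may differ from site to site.

```java
import java.util.Locale;
import java.util.Map;

class ExtensionGuess {
    // Hypothetical, deliberately tiny extension-to-type table (an assumption, not a standard list).
    private static final Map<String, String> TYPES = Map.of(
            "html", "text/html",
            "htm", "text/html",
            "txt", "text/plain",
            "pdf", "application/pdf");

    /** Returns the guessed content type, or null when the last path segment has no known extension. */
    static String guessFromExtension(String url) {
        // Drop any query string or fragment before looking at the path.
        String path = url.split("[?#]", 2)[0];
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot < 0 || dot < slash) {
            return null; // no extension in the last path segment
        }
        String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TYPES.get(ext);
    }

    public static void main(String[] args) {
        System.out.println(guessFromExtension("http://example.com/index.html")); // text/html
        // The dynamic page from the question has no extension, so the lookup gives up.
        System.out.println(guessFromExtension(
                "http://www.amazon.com/exec/obidos/ASIN/B00006HXJ6/ref=nosim/fatwalletcom/002-5149236-2409652")); // null
    }
}
```

The second call shows the problem: an extension table has nothing to say about dynamic URLs, so it can only ever be a fallback.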

Tor Iver Wilhelmsen · Jan 1, 2004

Kaidi said:
What I want is: given a String type url, how can I decide whether this
URL points to a text file (.htm, etc) or not?

Connect to the URL, do a HEAD request, and check the content type.

Kaidi said:
We know text pages usually have URLs ending with .htm, .html, etc.

Not necessarily.

Kaidi said:
I have tried to use the getContentType() from Class URLConnection,
but it works so badly that it even considers many .pdf files as text.
:-(

No, it's not the method URLConnection.getContentType() that is bad,
it's the web server sending the wrong content type. The API cannot fix
outside errors.
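
A HEAD request followed by a content-type check, as suggested above, might look roughly like the sketch below. The names HeadCheck, isTextType and looksLikeText, the five-second timeouts, and the choice to treat application/xhtml+xml as text are assumptions for illustration. The cast to HttpURLConnection also assumes an http or https URL.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;

class HeadCheck {
    /** True when a Content-Type header value names a text/* type or XHTML. */
    static boolean isTextType(String contentType) {
        if (contentType == null) {
            return false; // the server sent no Content-Type header
        }
        // Strip parameters such as "; charset=ISO-8859-1" and normalise case.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.startsWith("text/") || mediaType.equals("application/xhtml+xml");
    }

    /** Sends a HEAD request and reports whether the server claims a text type. */
    static boolean looksLikeText(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD"); // ask for headers only, no body
            conn.setConnectTimeout(5000);  // hypothetical timeout values
            conn.setReadTimeout(5000);
            return conn.getResponseCode() == HttpURLConnection.HTTP_OK
                    && isTextType(conn.getContentType());
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Offline demonstration of the header check.
        System.out.println(isTextType("text/html; charset=ISO-8859-1")); // true
        System.out.println(isTextType("application/pdf"));               // false
        // Pass a URL on the command line to try a real HEAD request.
        if (args.length > 0) {
            System.out.println(looksLikeText(args[0]));
        }
    }
}
```

This still trusts whatever the server reports, so a misconfigured server that labels PDF files as text/plain will fool it just the same.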

Daniel Sjöblom · Jan 1, 2004

Kaidi said:
Hello,
I have a question about writing a crawler-like program in Java.
As I only need the text (HTML) files, I am wondering whether anyone
knows a good way to distinguish text URLs (files such as .html, .htm,
etc.) from non-text URLs?

You could try 'sniffing' the first few bytes of the files. If they start
with <!DOC or <html, you can be pretty sure they're HTML files. Of
course, this isn't foolproof.
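
A rough sketch of that kind of sniffing, assuming Java 11 or later for readNBytes and stripLeading. The class name PrefixSniffer, the method names, and the 64-byte look-ahead are hypothetical choices, and a leading byte-order mark is not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class PrefixSniffer {
    private static final int SNIFF_BYTES = 64; // hypothetical look-ahead size

    /** Checks whether the given leading bytes start like an HTML page. */
    static boolean startsLikeHtml(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so decoding never fails.
        String start = new String(head, StandardCharsets.ISO_8859_1)
                .stripLeading()            // tolerate leading whitespace
                .toLowerCase(Locale.ROOT); // tolerate <HTML> as well as <html>
        return start.startsWith("<!doc") || start.startsWith("<html");
    }

    /** Reads at most SNIFF_BYTES from the stream and sniffs them. */
    static boolean sniff(InputStream in) throws IOException {
        return startsLikeHtml(in.readNBytes(SNIFF_BYTES));
    }

    public static void main(String[] args) throws IOException {
        InputStream page = new ByteArrayInputStream(
                "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01//EN\">"
                        .getBytes(StandardCharsets.ISO_8859_1));
        InputStream pdf = new ByteArrayInputStream(
                "%PDF-1.4".getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(sniff(page)); // true
        System.out.println(sniff(pdf));  // false
    }
}
```

In a crawler the stream would come from the URL connection, so only the first few bytes of a non-HTML file need to be downloaded before giving up on it.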

Kaidi · Jan 3, 2004

Thanks, friends.
I will try the HEAD request as suggested above.
Currently, I 'sniff' the bytes to see if they contain any HTML tags
such as HTML, HEAD, META, BODY, <p>, etc. It works OK for me, although
it is somewhat troublesome and only a "heuristic".

Andrew Thompson · Jan 4, 2004

Kaidi said:
Thanks, friends.
I will try the HEAD request as suggested above.
Currently, I 'sniff' the bytes to see if they contain any HTML tags
such as HTML, HEAD, META, BODY, <p>, etc. It works OK for me, although
it is somewhat troublesome and only a "heuristic".

Properly formed HTML documents will have a
string like this at the very top:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">

So if the file _starts_ with '<!DOCTYPE HTML'
you can tell early that this is an HTML document.
[ Unfortunately, very few pages _are_ properly
formed. ]

Otherwise I would recommend searching for the strings
you mentioned, but with the opening '<', such as
'<head' or '<html'.

That reminds me of something else: make sure
you check them in both upper and lower case,
as either is valid.
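
A minimal sketch of that case-insensitive search, run over whatever leading chunk of the page has already been read into a String. The class name TagHeuristic, the method containsHtmlMarker, and the exact marker list are assumptions for illustration.

```java
import java.util.List;
import java.util.Locale;

class TagHeuristic {
    // Markers in lower case, each with its opening '<' (an illustrative list, not exhaustive).
    private static final List<String> MARKERS =
            List.of("<!doctype html", "<html", "<head", "<body");

    /** True when the snippet contains any marker, ignoring upper/lower case. */
    static boolean containsHtmlMarker(String snippet) {
        // Lower-casing the text once makes every comparison case-insensitive.
        String lower = snippet.toLowerCase(Locale.ROOT);
        return MARKERS.stream().anyMatch(lower::contains);
    }

    public static void main(String[] args) {
        System.out.println(containsHtmlMarker("<HTML><HEAD><TITLE>Hi</TITLE>")); // true
        System.out.println(containsHtmlMarker("<Html lang=\"en\">"));            // true
        // Bare words without '<' do not count, which avoids matching ordinary text.
        System.out.println(containsHtmlMarker("%PDF-1.4 HEAD BODY"));            // false
    }
}
```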

HTH
