Kaidi
Hello,
I have a question about writing a crawler-like program in Java.
As I only need the text (HTML) files, I am wondering whether anyone knows a good way to distinguish text URLs (files such as .html, .htm, etc.) from non-text URLs.
What I want is: given a URL as a String, how can I decide whether it points to a text file (.htm, etc.) or not? Text pages usually have URLs ending in .htm, .html, etc., but many dynamic pages do not. For example,
http://www.amazon.com/exec/obidos/ASIN/B00006HXJ6/ref=nosim/fatwalletcom/002-5149236-2409652
points to an HTML page, but the URL has no file extension.
I have tried getContentType() from the URLConnection class, but it works badly: it even reports many .pdf files as text. :-(
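A minimal sketch of the kind of header-based check described above, assuming an http/https URL and a server that reports an accurate Content-Type header (which, as noted, is not always the case). The class name ContentTypeCheck and both method names are made up for illustration; only URL, HttpURLConnection, setRequestMethod() and getContentType() come from the JDK.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Locale;

// Hypothetical helper, not part of any library.
public class ContentTypeCheck {

    // Returns true when a Content-Type header value names a text type
    // (text/html, text/plain, ...) or XHTML.
    public static boolean isTextContentType(String contentType) {
        if (contentType == null) {
            return false; // the server sent no Content-Type header
        }
        // Drop parameters such as "; charset=UTF-8" and normalise case.
        int semicolon = contentType.indexOf(';');
        String mime = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return mime.startsWith("text/") || mime.equals("application/xhtml+xml");
    }

    // Sends a HEAD request (headers only, no body) and checks the
    // Content-Type the server reports. Assumes an http or https URL;
    // other schemes would fail the cast below.
    public static boolean pointsToText(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");
        try {
            return isTextContentType(conn.getContentType());
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ContentTypeCheck <url>
        if (args.length == 1) {
            System.out.println(pointsToText(args[0]));
        }
    }
}
```

This still trusts whatever the server says, so a misconfigured server that labels a PDF as text would be accepted by it.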
Does anyone have any ideas?
Thanks and happy new year!~