Kaidi
Hi,
I did a Google search on this topic but still cannot solve my problem. :-(
My problem, basically, is this:
I am writing a crawler in Java, and some sites use cookies. Since
Java does not handle cookies automatically, I find I cannot access
some pages.
I have read some articles on this, such as:
http://martin.nobilitas.com/java/cookies.html
http://www.informit.com/isapi/product_id~{1DF8B22B-055F-48DB-BD36-20B8017E9956}/content/index.asp
Basically, what we need to do is read the Set-Cookie response
header, then send its value back in a Cookie header on later requests.
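As a minimal sketch of that idea (the class name and the sample header value below are made up for illustration), the only part of a Set-Cookie value that needs to be echoed back is the leading name=value pair; attributes like Path and Domain are instructions to the client, not part of the cookie value:

```java
// Minimal sketch: keep only the "name=value" part of a raw
// Set-Cookie header value, so it can be sent back later in a
// Cookie request header. The sample value is invented.
public class CookieEcho {
    // Extract "name=value" from a raw Set-Cookie header value,
    // dropping attributes such as Path, Domain, and Expires.
    static String cookiePair(String setCookieValue) {
        int semi = setCookieValue.indexOf(';');
        return (semi >= 0) ? setCookieValue.substring(0, semi).trim()
                           : setCookieValue.trim();
    }

    public static void main(String[] args) {
        String raw = "JSESSIONID=abc123; Path=/; Domain=.bestbuy.com";
        System.out.println(cookiePair(raw)); // JSESSIONID=abc123
    }
}
```

On a real request, the extracted pair would be attached with `urlConnection.setRequestProperty("Cookie", pair)` before reading the response.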
However, when I tested this on Best Buy's home page, it did not
work well. Some pages do not appear to ask to store cookies, yet
without cookies they cannot be accessed. One example is:
http://www.bestbuy.com/site/olspage.jsp?j=1&id=cat12074&type=page&categoryRep=cat02000
When I try to crawl this page with my Java program, it only returns
a page saying that my browser does not support cookies. :-(
(IE can access it properly. I deleted the cookies in IE's options
before trying the above page, and it still worked.)
Does anyone have any idea about this? Thanks a lot.
PS: The code I am using is from the end of this page:
http://www.hccp.org/java-net-cookie-how-to.html
http://www.hccp.org/cvs/org/hccp/net/CookieManager.java
In the above code, I added a print statement in storeCookies so that I
can see all the headers:
.........
for (int i = 1; (headerName = conn.getHeaderFieldKey(i)) != null; i++) {
    System.out.println("In storeCookies, " + headerName + "-->"
                       + conn.getHeaderField(i));
}
.........
The only headers I can see are:
In storeCookies, Server-->Apache
In storeCookies, Last-Modified-->Mon, 24 Nov 2003 15:19:52 GMT
In storeCookies, ETag-->"b0da7d-14ee-3fc22198"
In storeCookies, Accept-Ranges-->bytes
In storeCookies, Content-Length-->5358
In storeCookies, Content-Type-->text/html
In storeCookies, Date-->Fri, 16 Jan 2004 09:37:10 GMT
In storeCookies, Connection-->keep-alive
{bestbuy.com={}}
So, since it does not set any cookies, why can't my Java program
crawl it?
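One thing worth checking (this is only a guess, and the class below is just an illustration): HttpURLConnection follows redirects automatically, so if the server sets its cookie on an intermediate 302 response and then redirects, storeCookies only ever sees the headers of the final page, where no Set-Cookie appears. Disabling redirect-following lets you inspect that first response:

```java
import java.net.HttpURLConnection;

// Sketch of how to catch a Set-Cookie sent on an intermediate
// redirect. Only the status check is exercised here; the live
// usage is shown in the comment below.
public class RedirectCheck {
    // True for 3xx HTTP status codes (redirects).
    static boolean isRedirect(int status) {
        return status >= 300 && status < 400;
    }

    // Live usage would look like:
    //   HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    //   conn.setInstanceFollowRedirects(false);
    //   if (isRedirect(conn.getResponseCode())) {
    //       String cookie = conn.getHeaderField("Set-Cookie");
    //       String next   = conn.getHeaderField("Location");
    //       // send "Cookie: ..." when requesting next
    //   }

    public static void main(String[] args) {
        System.out.println(isRedirect(302)); // true
    }
}
```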
For page crawling, I am using this code:
--------------
try {
    // try opening the URL
    URL url = new URL(url_string);
    URLConnection urlConnection = url.openConnection();
    urlConnection.setAllowUserInteraction(false);
    InputStream urlStream = url.openStream();
    // search the input stream for links
    // first, read in the entire URL
    byte b[] = new byte[1000];
    int numRead = urlStream.read(b);
    String content;
    if (numRead > 0)
        content = new String(b, 0, numRead);
    else
        content = "";
    while ((numRead != -1) && (content.length() < MAXSIZE)) {
        numRead = urlStream.read(b);
        if (numRead != -1) {
            String newContent = new String(b, 0, numRead);
            content += newContent;
        }
    }
    return content;
--------------
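One likely problem with the snippet above: url.openStream() opens a brand-new connection, so anything configured on urlConnection (including a Cookie request property) never reaches the server. A reworked sketch (readStream is my name, and the MAXSIZE value is an assumption carried over from the snippet) that reads from the configured connection's stream and avoids repeated String concatenation:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Reworked download loop. Two changes from the snippet above:
// 1. Read from urlConnection.getInputStream() rather than
//    url.openStream() -- openStream() opens a *second* connection,
//    so request properties (e.g. a Cookie header) set on
//    urlConnection are silently ignored.
// 2. Accumulate into a StringBuilder instead of concatenating
//    Strings in a loop.
public class PageReader {
    static final int MAXSIZE = 1_000_000; // assumed limit

    static String readStream(InputStream in) throws IOException {
        byte[] buf = new byte[1000];
        StringBuilder content = new StringBuilder();
        int numRead;
        while ((numRead = in.read(buf)) != -1 && content.length() < MAXSIZE) {
            content.append(new String(buf, 0, numRead));
        }
        return content.toString();
    }

    // Against a live connection this would be:
    //   URLConnection urlConnection = url.openConnection();
    //   urlConnection.setRequestProperty("Cookie", storedCookies);
    //   String page = readStream(urlConnection.getInputStream());

    public static void main(String[] args) throws IOException {
        InputStream demo = new ByteArrayInputStream("<html>demo</html>".getBytes());
        System.out.println(readStream(demo)); // <html>demo</html>
    }
}
```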