Webcrawler / Search engine

Noni

Hi

Need help !!

I've embarked on a journey to develop a Webcrawler / Search engine.
It will eventually get late holiday deals and info from the web and
display them in an orderly fashion.

I have to implement it using Java.
Since I have very little experience in Java.... I don't even know where
to start.
Please help.... my livelihood depends on it.

Regards
 
Andrew Thompson

Need help !! .....
Since I have very little experience in Java....
..I don't even know where
to start.

Both those statements scream..
Please help.... my livelihood depends on it.

That sounds like a definite problem.

I estimate it might take 6 months,
depending upon your proficiency at
taking in information, and prior
knowledge of OO design, to successfully
attempt the project you describe.
 
Noni

Andrew Thompson said:
Both those statements scream..


That sounds like a definite problem.

I estimate it might take 6 months,
depending upon your proficiency at
taking in information, and prior
knowledge of OO design, to successfully
attempt the project you describe.

I don't have 6 months. I have until the 20th of April!!!
Please please please please HELP !!!!
 
Christophe Vanfleteren

Noni said:
I don't have 6 months. I have until the 20th of April!!!
Please please please please HELP !!!!

Not to be pessimistic or anything, but that just isn't going to happen by
then.
 
Chris Uppal

Christophe Vanfleteren wrote in reply to Noni:
Not to be pessimistic or anything, but that just isn't going to happen by
then.

To be more accurate, someone who knew Java well could write a fully functional
prototype in a few days -- say, a week. To do that they would use Java's
networking classes like java.net.URL and java.net.HttpURLConnection to do
the downloading, and javax.swing.text.html.parser.DocumentParser to parse the
webpages and find the URLs embedded in them.
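
As a rough illustration of those two steps, here is a minimal sketch. It is
not a finished crawler. The class and method names (PageFetchSketch, download,
extractLinks) are made up for this example, the page is assumed to be UTF-8,
and it uses ParserDelegator (from the same javax.swing.text.html.parser
package, a thin wrapper that drives DocumentParser) rather than calling
DocumentParser directly.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the names are illustrative only.
class PageFetchSketch {

    // Downloads the body of a page as text.
    // Assumption: the page is UTF-8 encoded, which real pages may not be.
    static String download(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestMethod("GET");
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    // Returns the href value of every <a> tag, in document order.
    // The values are returned as written in the page (possibly relative).
    static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }
}
```

Something like extractLinks(download("http://java.sun.com/")) would then give
the raw link targets of one page. Links built by JavaScript are not found.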

Of course, it would *only* be a prototype -- it would not handle all valid
HTML, for example (let alone the *masses* of invalid HTML you find). It would
not even start to handle links that were generated by / embedded in JavaScript
(been there, done that, got the scars). It would probably not scale very well
to downloading large numbers of pages at once. It would lack many features
that a production-quality crawler would want (e.g. management interface). Etc,
etc. But it *would* be a start.

To Noni:

The problem with that is that you have to know Java first. I'm sorry, but I
have to agree with the others that you are unlikely to be able to learn Java
well enough to write a fairly sophisticated program, *and* use it to do so, in
12 days.

This is the point where you go discuss matters with your manager. They have
given you an impossible job, and that's not *your* fault. The important thing
is to make constructive suggestions instead of just saying "I can't do it in
the time". E.g. would your time be better spent trying to find an
off-the-shelf solution ? Is there someone else who could do it while you take
over their job for a while ? And so on...

Do you know any other programming languages ? Scripting languages such as
Perl/Python/Ruby would be good. If you already know one of those then you
could almost certainly hack together a simple web crawler in that faster than I
could in Java. Maybe that would do for the time being ?

(BTW, if this is course work, rather than a real job, then much the same points
hold -- if you've reached the point where you know you can't complete your
assignment, then the sooner you ask for help from your teachers the better.
The earlier you ask for help the more likely they are to be able to *give* you
help, and the more likely you are to get credit for the work you *have* been
able to do)

-- chris
 
Noni

Guys the above is much appreciated.

Chris, you're right that this is my course work. I have left it to the
last minute through no one's fault but my own.

How about I not build a full crawler but something that will go to
maybe a couple of sites and grab what I need and display it ???

I cannot stress how much I appreciate the help !!!!

Thank you
Regards
Noni
 
Andrew Thompson

You're right that this is my course work. I have left it to the
last minute through no one's fault but my own.

How about I not build a full crawler but something that will go to
maybe a couple of sites and grab what I need and display it ???

How about you vacate your position at
the educational institution whose time
(and your own) you are obviously wasting,
so that someone who deserves the position
might have the opportunity.
I cannot stress how much I appreciate the help !!!!

....is that supposed to get us beyond the
fact that you are lying to people publicly,
cheating on your homework, and lazy besides???
 
Chris Uppal

Noni,
How about I not build a full crawler but something that will go to
maybe a couple of sites and grab what I need and display it ???

I think you'll find that you have to write *more* code to restrict the area
that the crawler will trawl.

I'd break it down into three bits. Think about them separately (the order
doesn't matter), and see how far you can get with each. I don't know your
course or your teachers, but I imagine that you'd get *some* credit for solving
any one of the bits, and if you manage all three then you're home and dry.

+ If you've been given an URL like "http://java.sun.com/" how do you download
it from the net ?

+ If you have a String (or a file) containing the text of a webpage, how do
you parse it to find the URLs in it ?

+ If you imagine that you've magically solved the first two problems, how do
you structure your code to make a whole crawler ? You'll need to keep a list
of URLs to download as you loop: {take an URL off the list; download the
webpage; find the URLs in that, add them to the list; repeat}
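
The loop in the third point can be sketched roughly as below. This is only one
way to structure it. The names (CrawlLoopSketch, crawl, linkFinder, maxPages)
are invented for the example, and linkFinder is a stand-in for "download the
page and find the URLs in it", so the first two problems stay separate. The
maxPages cap is an addition of the sketch so that the loop cannot run forever.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch; class, method and parameter names are illustrative only.
class CrawlLoopSketch {

    /**
     * Visits pages breadth-first, starting from startUrl.
     * linkFinder takes a URL and returns the URLs found on that page.
     * maxPages limits how many pages are visited in total.
     * Returns the URLs in the order they were visited.
     */
    static List<String> crawl(String startUrl,
                              Function<String, List<String>> linkFinder,
                              int maxPages) {
        Deque<String> toVisit = new ArrayDeque<>();   // the list of URLs still to download
        Set<String> seen = new HashSet<>();           // every URL ever put on that list
        List<String> visited = new ArrayList<>();

        toVisit.add(startUrl);
        seen.add(startUrl);

        while (!toVisit.isEmpty() && visited.size() < maxPages) {
            String url = toVisit.remove();            // take an URL off the list
            visited.add(url);
            for (String link : linkFinder.apply(url)) {
                // Set.add returns false for a URL that was already seen,
                // so each page is queued at most once
                if (seen.add(link)) {
                    toVisit.add(link);
                }
            }
        }
        return visited;
    }
}
```

Keeping the set of seen URLs separate from the queue is what stops the loop
from going round in circles when two pages link to each other.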

My earlier post had pointers to the Java library classes which will help you do
the first two things, or maybe you've already done something in your classes
which will help.

But I repeat: go talk to your teachers. They *want* to help you learn (or they
wouldn't be teaching). Also they know *what* help you need (I can only guess),
and can give it without cheating, whereas I think *I've* said all that I can
without cheating.

So. Don't panic (very important!). Break the problem down into small bits.
Do as much as you can of as many of the bits as you can manage. Ask for help
in the right place (school).

And stop wasting time on Usenet ;-)

-- chris
 
Noni

Andrew Thompson said:
How about you vacate the position at
the educational institution of which
you are obviously wasting their (and
your) time, so that someone who deserves
the position might have the opportunity.


...is that supposed to get us beyond the
fact that you are lying to people publicly,
cheating on your homework, and lazy besides???


I guess people such as Andrew Thompson are here to kick people while
they're down and laugh at them, rather than lending a helping hand !!!!

Andrew, you don't know anything about me. It's rude to judge people like
you have above!!!

Everybody else!! Is it safe for me to turn away from here or do I have
any chance of some assistance ??

Regards
Noni
 
Andrew Thompson

I guess people such as Andrew Thompson are here to kick people while
they're down and laugh at them, rather than lending a helping hand !!!!

Your suppositions, like your efforts
thus far, are piss-poor.

You ignored the advice I gave you in..
<http://groups.google.com/groups?th=caf1c7b9b30e3d2b>
advising you to head on over to c.l.j.h.,
which is where I advised you to go when
I realised you were yet another student
who wanted us to do your homework.

Posters on c.l.j.h. get some of the same people
answering qns as appear here or on c.l.j.gui,
but those people are altogether more patient
than when they reply on the other groups.
Andrew, you don't know anything about me. It's rude to judge people like
you have above!!!

It's rude to lie to people.
Everybody else!! Is it safe for me to turn away from here or do I have
any chance of some assistance ??

Yes, indeed turn away from here!

Get yourself over to c.l.j.help.

But do that _only_ if you actually intend to put
some effort in, hack out some code, however bad,
and actually try to learn Java.

Otherwise you will achieve nothing (nobody there
intends to do your homework for you) and waste
everybody's time and bandwidth.

It is up to _you_.
 
Christophe Vanfleteren

Noni said:
Everybody else!! Is it safe for me to turn away from here or do I have
any chance of some assistance ??

You'll have assistance if you ask specific questions.
You've already been told what classes are useful for implementing what you
need. I suggest you try to start working with them and see how far you get.
If you have a specific question, feel free to ask, but just don't expect
that anyone will do your homework for you.
 
mromarkhan

import java.io.*;
import java.net.*;
import java.util.*;
import java.util.regex.*;

public class PullUrl3
{
    final static boolean DEBUG = false;

    // Every URL seen so far; put() returns null only the first time a URL is added
    static Hashtable<String, String> urls = new Hashtable<String, String>();

    public static void main(String[] args)
    {
        String rootString = "http://etext.lib.virginia.edu/koran.html";
        ArrayList<String> baseListing = getLinks(rootString, rootString);
        if (!baseListing.isEmpty())
        {
            Driller(rootString, baseListing);
        }
        System.out.println("Done");
    }

    // Follows each URL in the listing and recurses into any new links found there.
    // There is no depth limit: it stops only when no unseen links remain.
    public static void Driller(String thebase, ArrayList<String> urlListing)
    {
        // Matches the scheme and host part of a URL, e.g. "http://host/"
        Pattern pattern = Pattern.compile("http://.*?/", Pattern.DOTALL);
        for (String singleURL : urlListing)
        {
            Matcher matcher = pattern.matcher(singleURL);
            if (!matcher.find())
            {
                continue;
            }
            String newBaseString = matcher.group();
            ArrayList<String> newBase = getLinks(newBaseString, singleURL);
            if (!newBase.isEmpty())
            {
                // if have listing get it and pass back to driller
                Driller(newBaseString, newBase);
            }
            // if does not have listing leave alone
        }
    }

    // Downloads theurl and returns the links in it that have not been seen before.
    // Relative links are resolved against baseString.
    public static ArrayList<String> getLinks(String baseString, String theurl)
    {
        ArrayList<String> returnThis = new ArrayList<String>();
        StringBuffer strbuffer = new StringBuffer();
        try
        {
            // Fetch the page itself (theurl), not just the site root in baseString
            URL u = new URL(theurl);
            HttpURLConnection huc = (HttpURLConnection) u.openConnection();
            huc.setRequestMethod("GET");
            huc.setDoInput(true);
            huc.setDoOutput(false);
            huc.setUseCaches(false);
            huc.connect();
            InputStream inputStream = huc.getInputStream();
            BufferedInputStream bis = new BufferedInputStream(inputStream);
            // Read the whole page; each byte is treated as one character
            while (true)
            {
                int cint = bis.read();
                if (cint == -1)
                {
                    break;
                }
                strbuffer.append((char) cint);
            }
            huc.disconnect();
            Pattern pattern = Pattern.compile("href=\".*?\"", Pattern.DOTALL);
            Matcher matcher = pattern.matcher(strbuffer);
            String fullUrl = "";
            while (matcher.find())
            {
                fullUrl = fullURL(baseString, removeHref(matcher.group()));
                // skip anchors and links that fullURL could not resolve (empty string)
                if (fullUrl.length() > 0 && fullUrl.indexOf('#') == -1)
                {
                    // check if in database
                    if (urls.put(fullUrl, fullUrl) == null)
                    {
                        System.out.println(fullUrl);
                        returnThis.add(fullUrl);
                    }
                }
            }
        }
        catch (IOException e)
        {
            System.out.println("Error : " + e);
        }
        return returnThis;
    }

    private static void debug(String msg)
    {
        if (DEBUG)
        {
            System.out.println(msg);
        }
    }

    public static String fullURL(String baseString, String value)
    {
        // case # anchor in page - # at char 0
        // case relative url - virtual directory ~ - remove ~.*?/
        // case relative url - / at the beginning
        // case full url - http:// at the beginning
        // case non http protocol urls - mailto ftp (not handled yet)
        // make sure to check if slash at end of string before appending
        baseString = baseString.endsWith("/") ? baseString : baseString + "/";
        String returnVal = "";
        value = value.trim();
        if (value.length() > 1)
        {
            switch (value.charAt(0))
            {
                case '#':
                    debug("#");
                    break;
                case '/':
                    if (value.charAt(1) == '~')
                    {
                        Pattern patternVirtual = Pattern.compile("/~.*?/", Pattern.DOTALL);
                        Matcher matcherVirtual = patternVirtual.matcher(value);
                        value = matcherVirtual.replaceFirst("");
                        returnVal = baseString + value;
                        debug("/1");
                        break;
                    }
                    // drop the leading slash: baseString already ends with one
                    debug(value.endsWith("/") ? "/2" : "/3");
                    returnVal = baseString + value.substring(1);
                    break;
                case 'h':
                    if (value.startsWith("http://"))
                    {
                        returnVal = value;
                        debug("http");
                    }
                    else
                    {
                        // a relative link that merely starts with 'h', e.g. "help.html"
                        returnVal = baseString + value;
                    }
                    break;
                case '~':
                {
                    Pattern patternVirtual = Pattern.compile("~.*?/", Pattern.DOTALL);
                    Matcher matcherVirtual = patternVirtual.matcher(value);
                    value = matcherVirtual.replaceFirst("");
                    returnVal = baseString + value;
                    debug("~");
                    break;
                }
                default:
                    // plain relative link; keep its first character
                    debug(value.endsWith("/") ? "def1" : "def2");
                    returnVal = baseString + value;
            }
        }
        return returnVal;
    }

    // Strips the leading href=" (6 characters) and the closing quote
    public static String removeHref(String value)
    {
        return value.substring(6, value.length() - 1);
    }
}
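
A possible alternative to the hand-written fullURL method above: the standard
class java.net.URI can resolve a link against the address of the page it was
found on. The sketch below is only an illustration. The names ResolveSketch
and resolve are made up, and returning null for anything that is not an http
or https link (mailto:, ftp:, malformed values) is a choice made for this
example, not something the code above does.

```java
import java.net.URI;

// Hypothetical helper; names are illustrative only.
class ResolveSketch {

    // Turns an href found on pageUrl into an absolute URL without its #fragment.
    // Returns null if the result is not an http/https URL or cannot be parsed.
    static String resolve(String pageUrl, String href) {
        try {
            URI resolved = URI.create(pageUrl).resolve(href.trim());
            String scheme = resolved.getScheme();
            if (!"http".equals(scheme) && !"https".equals(scheme)) {
                return null;                       // e.g. mailto: or ftp:
            }
            String text = resolved.toString();
            int hash = text.indexOf('#');
            return hash >= 0 ? text.substring(0, hash) : text;
        } catch (IllegalArgumentException e) {
            return null;                           // href was not a valid URI
        }
    }
}
```

For example, resolving "other.html" against "http://example.com/dir/page.html"
gives "http://example.com/dir/other.html", and "/top.html" gives
"http://example.com/top.html".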
 
