Web Spider

Discussion in 'Java' started by Chase Preuninger, Mar 4, 2008.

  1. If I was parsing a web page and extracting data from it in order to
    make a search engine, what should I extract?
     
    Chase Preuninger, Mar 4, 2008
    #1

  2. Chase Preuninger

    Lord Zoltar Guest

    On Mar 4, 11:10 am, Chase Preuninger <>
    wrote:
    > If I was parsing a web page and extracting data from it in order to
    > make a search engine, what should I extract?


    That depends on what you are interested in.
     
    Lord Zoltar, Mar 4, 2008
    #2

  3. Chase Preuninger

    Daniel Pitts Guest

    Lord Zoltar wrote:
    > On Mar 4, 11:10 am, Chase Preuninger <>
    > wrote:
    >> If I was parsing a web page and extracting data from it in order to
    >> make a search engine, what should I extract?

    >
    > That depends on what you are interested in.

    Instead of a "that depends" answer...

    You extract the information you want!

    --
    Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>
     
    Daniel Pitts, Mar 4, 2008
    #3
  4. Chase Preuninger

    Jeff Higgins Guest

    Chase Preuninger wrote:
    > If I was parsing a web page and extracting data from it in order to
    > make a search engine, what should I extract?


    <http://en.wikipedia.org/wiki/Search_engine_%28computing%29>
     
    Jeff Higgins, Mar 4, 2008
    #4
  5. On Mar 4, 4:23 pm, "Jeff Higgins" <> wrote:
    > Chase Preuninger wrote:
    > > If I was parsing a web page and extracting data from it in order to
    > > make a search engine, what should I extract?

    >
    > <http://en.wikipedia.org/wiki/Search_engine_%28computing%29>


    just useful stuff for a web search
     
    Chase Preuninger, Mar 4, 2008
    #5
  6. Chase Preuninger

    Jeff Higgins Guest

    "Chase Preuninger" <> wrote in message
    news:...
    On Mar 4, 4:23 pm, "Jeff Higgins" <> wrote:
    > Chase Preuninger wrote:
    > > If I was parsing a web page and extracting data from it in order to
    > > make a search engine, what should I extract?

    >
    > <http://en.wikipedia.org/wiki/Search_engine_%28computing%29>


    just useful stuff for a web search

    <http://en.wikipedia.org/wiki/Sitemaps>

    Both the links I've provided,
    I've found using a web search engine
    and the search terms: search, engine, wiki.

    You could try searching on:
    "most frequent search query", or
    "most interesting search query", or
    "most useful data for a web search engine".
     
    Jeff Higgins, Mar 4, 2008
    #6
  7. Chase Preuninger

    timjowers Guest

    On Mar 4, 5:30 pm, "Jeff Higgins" <> wrote:
    > "Chase Preuninger" <> wrote in message
    >
    > news:...
    > On Mar 4, 4:23 pm, "Jeff Higgins" <> wrote:
    >
    > > Chase Preuninger wrote:
    > > > If I was parsing a web page and extracting data from it in order to
    > > > make a search engine, what should I extract?

    >
    > > <http://en.wikipedia.org/wiki/Search_engine_%28computing%29>

    >
    > just useful stuff for a web search
    >
    > <http://en.wikipedia.org/wiki/Sitemaps>
    >
    > Both the links I've provided,
    > I've found using a web search engine
    > and the search terms: search, engine, wiki.
    >
    > You could try searching on:
    > "most frequent search query", or
    > "most interesting search query", or
    > "most useful data for a web search engine".


    Chase, the basic unit is the words. Then you correlate the words into
    clusters. Historically these are called Information Retrieval Systems
    ("IRS", oh my). The simplest idea is that pages with words in common must
    be like one another and like your topic. Imagine what would happen if you
    had one list with all URLs, one list with all words, and one list
    connecting the two. Then you could look up all matching URLs for each
    word. These lists might be large though! Then you could find the set
    matching the search phrase by intersecting the sets for each word.
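
    A minimal sketch of that word-to-URL lookup, assuming pages fit in
    memory and that splitting on non-word characters is good enough.
    The class and method names (TinyInvertedIndex, addPage, search) and
    the example.com URLs are made up for illustration.

    ```java
    import java.util.*;

    // Hypothetical sketch of the "list connecting words and URLs" idea.
    class TinyInvertedIndex {
        // word -> set of URLs whose text contains that word
        private final Map<String, Set<String>> postings = new HashMap<>();

        // Record every word of the page text against the page's URL.
        void addPage(String url, String text) {
            for (String word : text.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    postings.computeIfAbsent(word, k -> new TreeSet<>()).add(url);
                }
            }
        }

        // Intersect the URL sets of every word in the query.
        Set<String> search(String query) {
            Set<String> result = null;
            for (String word : query.toLowerCase().split("\\W+")) {
                if (word.isEmpty()) continue;
                Set<String> urls = postings.getOrDefault(word, Collections.emptySet());
                if (result == null) {
                    result = new TreeSet<>(urls);   // copy, so the index is not modified
                } else {
                    result.retainAll(urls);         // keep only URLs that match every word
                }
            }
            return result == null ? new TreeSet<>() : result;
        }

        public static void main(String[] args) {
            TinyInvertedIndex index = new TinyInvertedIndex();
            index.addPage("http://example.com/a", "Java web spider tutorial");
            index.addPage("http://example.com/b", "Web search engine basics");
            System.out.println(index.search("web spider"));
        }
    }
    ```

    With real crawl volumes the maps would not fit in memory, so treat
    this only as the shape of the data structure.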

    The second thing to know is that words have forms, so maybe you'd work
    off of all lower case and reduce all words to a base form. Well, what about
    "Fahrenheit 451"? Do you also store numbers? What about "(...)"? Can
    you also search on computerese? So it starts to get complicated. One
    idea is the "edit distance", or number of changes to get from the word
    entered to a base word. That might tell you whether it is the same word.
    What about synonyms? (I haven't seen a search engine do this.) What
    about bigrams and n-grams, that is, multi-word combinations? If one
    types super computer then maybe any occurrences of "super computer"
    should be ranked higher than a page with just the word super or
    computer.
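
    Two of those ideas are easy to sketch. Below, editDistance is the
    usual Levenshtein distance (insert, delete, substitute, each costing
    one change) and bigrams lists adjacent word pairs. The class name
    TermTools is hypothetical, and splitting on whitespace is an
    assumption about what counts as a word.

    ```java
    import java.util.*;

    // Hypothetical helper class; a sketch, not a tuned implementation.
    class TermTools {
        // Levenshtein distance using two rolling rows of the DP table.
        static int editDistance(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                                prev[j] + 1),      // deletion
                                       prev[j - 1] + cost);        // substitution or match
                }
                int[] tmp = prev; prev = curr; curr = tmp;          // reuse the rows
            }
            return prev[b.length()];
        }

        // Lower-cased pairs of adjacent words, e.g. "super computer".
        static List<String> bigrams(String text) {
            String[] words = text.toLowerCase().trim().split("\\s+");
            List<String> out = new ArrayList<>();
            for (int i = 0; i + 1 < words.length; i++) {
                out.add(words[i] + " " + words[i + 1]);
            }
            return out;
        }

        public static void main(String[] args) {
            System.out.println(editDistance("farenheit", "fahrenheit"));
            System.out.println(bigrams("Fast super computer"));
        }
    }
    ```

    Indexing the bigrams alongside the single words is one way to let
    a phrase match outrank pages that only contain the separate words.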

    OK, so a real search engine uses ranking and bases this on many
    things: how long the site has been up, how many other
    sites link to it, how stuffed full of links its pages are. Maybe
    whether they buy ads from the search engine? Nah, that wouldn't be
    right. :) Etc. Also, by clustering a person's past searches or areas
    of interest you can greatly increase your precision.
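
    As a toy illustration of just one of those signals, here is a
    sketch that orders pages by how many other pages link to them. It
    is not how any real engine ranks; the class name InboundLinkRanker
    and the input shape (page -> pages it links to) are assumptions.

    ```java
    import java.util.*;

    // Hypothetical sketch: rank pages by inbound-link count only.
    class InboundLinkRanker {
        // links maps each crawled page to the pages it links to.
        static List<String> rankByInboundLinks(Map<String, List<String>> links) {
            Map<String, Integer> inbound = new TreeMap<>();
            for (String page : links.keySet()) {
                inbound.putIfAbsent(page, 0);            // pages nobody links to still appear
            }
            for (Map.Entry<String, List<String>> e : links.entrySet()) {
                // A HashSet counts each linking page once, however many links it has.
                for (String target : new HashSet<>(e.getValue())) {
                    if (!target.equals(e.getKey())) {    // ignore self-links
                        inbound.merge(target, 1, Integer::sum);
                    }
                }
            }
            List<String> ranked = new ArrayList<>(inbound.keySet());
            // The sort is stable, so ties keep the TreeMap's alphabetical order.
            ranked.sort((x, y) -> inbound.get(y) - inbound.get(x));
            return ranked;
        }

        public static void main(String[] args) {
            Map<String, List<String>> links = new HashMap<>();
            links.put("a", Arrays.asList("b", "c"));
            links.put("b", Arrays.asList("c"));
            links.put("c", new ArrayList<>());
            System.out.println(rankByInboundLinks(links));
        }
    }
    ```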

    In 2001 I took an IR course and we studied MSN, Google, and Yahoo.
    Everyone found Google to have about the same recall (document
    universe) but superior precision (accuracy). Now if I'd had the common
    sense to buy stock!!!!
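
    For anyone who wants to measure this themselves, here is a sketch
    of the usual textbook definitions: precision is the share of
    returned results that are relevant, and recall is the share of
    relevant documents that were returned. The class name
    RetrievalMetrics is hypothetical, and deciding which documents
    count as relevant is assumed to be done by hand.

    ```java
    import java.util.*;

    // Hypothetical helper for scoring one query's result set.
    class RetrievalMetrics {
        // relevant results returned / all results returned
        static double precision(Set<String> returned, Set<String> relevant) {
            if (returned.isEmpty()) return 0.0;
            return (double) overlap(returned, relevant) / returned.size();
        }

        // relevant results returned / all relevant documents
        static double recall(Set<String> returned, Set<String> relevant) {
            if (relevant.isEmpty()) return 0.0;
            return (double) overlap(returned, relevant) / relevant.size();
        }

        // Number of documents present in both sets.
        private static int overlap(Set<String> returned, Set<String> relevant) {
            Set<String> both = new HashSet<>(returned);
            both.retainAll(relevant);
            return both.size();
        }

        public static void main(String[] args) {
            Set<String> returned = new HashSet<>(Arrays.asList("a", "b", "c", "d"));
            Set<String> relevant = new HashSet<>(Arrays.asList("a", "b", "e"));
            System.out.println("precision = " + precision(returned, relevant));
            System.out.println("recall    = " + recall(returned, relevant));
        }
    }
    ```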
     
    timjowers, Mar 5, 2008
    #7
  8. Chase Preuninger

    timjowers Guest

    timjowers, Mar 5, 2008
    #8
  9. Chase Preuninger

    Roedy Green Guest

    On Tue, 4 Mar 2008 08:10:53 -0800 (PST), Chase Preuninger
    <> wrote, quoted or indirectly quoted someone
    who said :

    >If I was parsing a web page and extracting data from it in order to
    >make a search engine, what should I extract?

    You DON'T want the HTML tags.
    You DON'T want the header.
    You DON'T want header/footer info common to all pages at a website.
    You DON'T want common words like the that then is a ...
    You DON'T want URLs.
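
    A rough sketch of most of that filtering, assuming regular
    expressions are good enough for a first pass (a real spider would
    more likely use an HTML parser). The class name PageTextCleaner and
    the five-word stop list are made up for illustration. Dropping
    header/footer text shared by every page of a site is left out,
    since it needs several pages to compare.

    ```java
    import java.util.*;

    // Hypothetical sketch: turn raw HTML into a list of indexable words.
    class PageTextCleaner {
        // Tiny illustrative stop-word list; a real one would be much longer.
        private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "that", "then", "is", "a"));

        static List<String> indexableWords(String html) {
            String text = html
                .replaceAll("(?is)<head\\b.*?</head>", " ")          // drop the <head> section
                .replaceAll("(?is)<(script|style)\\b.*?</\\1>", " ") // drop scripts and styles
                .replaceAll("(?s)<[^>]*>", " ")                      // drop the remaining tags
                .replaceAll("(?i)\\bhttps?://\\S+", " ");            // drop URLs left in the text
            List<String> words = new ArrayList<>();
            for (String w : text.toLowerCase().split("[^a-z0-9]+")) {
                if (!w.isEmpty() && !STOP_WORDS.contains(w)) {
                    words.add(w);
                }
            }
            return words;
        }

        public static void main(String[] args) {
            String html = "<html><head><title>Ignored</title></head>"
                + "<body><p>The spider is a crawler</p></body></html>";
            System.out.println(indexableWords(html));
        }
    }
    ```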
    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
     
    Roedy Green, Mar 6, 2008
    #9