Java Basic Search engine

ibrahimover

Hi,

I have almost 100 HTML pages on my local disk. First I have to
index them, then I have to run a simple word search on that index to
find which pages contain a given word.

I don't know how to index them. I mean, which algorithm should I use
(a B+ tree or something else), and what should I index
({word, page URL, ...})?

Does anybody have simple Java code that shows what to index, how to
index it, and how to search?

Any help would be great.

Thank you
 
Luc The Perverse

There are existing tools out there that are better than anything you are
going to be able to write.

If this is a learning exercise - then any kind of balanced tree is going to
be fine unless you wanted to try some kind of hashing to try to save some
time.

If you are hashing and sorting you will probably benefit from converting
your search terms into numbers early on through some kind of secure
hashing algorithm (SHA-1 for example). Then you could choose an arbitrary
number of bits to divide the hashes into separate groups (if you choose 8
bits then you would have 256 groups), each of which contains a tree that
is searched for your 160-bit SHA-1 value. Every search term points back to
zero or more HTML pages; each hit corresponds to one or more HTML pages.

Repeat for every item in your list, and then compute your "best" results as
you see fit.
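
A minimal sketch of the bucketing idea above, not a definitive implementation. It assumes the first 8 bits of the SHA-1 digest pick one of 256 groups and that each group is a java.util.TreeMap (a balanced tree) keyed by the hex digest. The class name HashBucketIndex and its methods are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: 256 groups chosen by the first 8 bits of a SHA-1 digest.
class HashBucketIndex {
    // One balanced tree per group: hex digest -> pages containing the word.
    private final List<TreeMap<String, Set<String>>> buckets = new ArrayList<>();

    HashBucketIndex() {
        for (int i = 0; i < 256; i++) {
            buckets.add(new TreeMap<>());
        }
    }

    // SHA-1 digest of the lower-cased word (20 bytes, i.e. 160 bits).
    static byte[] sha1(String word) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            return md.digest(word.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // Hex form of the digest, used as the tree key.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    void add(String word, String page) {
        byte[] digest = sha1(word);
        int group = digest[0] & 0xFF; // first 8 bits select the group
        buckets.get(group).computeIfAbsent(hex(digest), k -> new TreeSet<>()).add(page);
    }

    Set<String> search(String word) {
        byte[] digest = sha1(word);
        Set<String> pages = buckets.get(digest[0] & 0xFF).get(hex(digest));
        return pages == null ? Collections.emptySet() : pages;
    }

    public static void main(String[] args) {
        HashBucketIndex index = new HashBucketIndex();
        index.add("investigation", "page1.html");
        System.out.println(index.search("investigation"));
    }
}
```

For about 100 pages a plain java.util.HashMap keyed by the word itself would likely do the same job with less code; the digest and the groups only show the scheme as described.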
 
Chris Uppal

If you aren't doing this as some sort of exercise, then you would probably find
it easier to use a pre-packaged search/indexing engine such as Apache Lucene.

If you /are/ doing it as an exercise, then text searching and indexing is a big
topic and I don't really know where to start. Perhaps you should ask your
teacher (if you have one) for guidance and more detail about what you are
supposed to do.

-- chris
 
ibrahimover

Thanks for the answer.

I'm doing it as an exercise. In fact this isn't my exercise, but if I
succeed with this, most of it will be done; the other parts are just
the usual reports etc.

I hope I can find someone to guide me here.
 
Luc The Perverse

He did . . . He suggested Apache Lucene
 
Julian Treadwell

If you're doing it as an exercise, then you will need to follow these
basic steps:

(1) Write a program that will extract all appropriate words from your
web pages (exclude HTML tags and short words like "a") and build a
cross-reference table of these words against the pages they reside in.
This table should be in some database, doesn't much matter whether it's
XML or MySQL or whatever. In the real world this program should be run
at least nightly to keep the database up-to-date.

(2) Create a search function on your main page to this database which
will allow the user to check this table for a particular word and then
allow him to link to any pages found.
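
The two steps above could be sketched roughly as follows, with an in-memory map standing in for the table. This is only an illustration under assumptions: tags are stripped with a crude regex (script contents and HTML entities are not handled), the stop-word list is a tiny made-up sample, and the names CrossReferenceSketch, addPage and search are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of a word -> pages cross-reference table.
class CrossReferenceSketch {
    // Illustrative stop-word list; a real one would be longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "to", "in", "is"));

    // word -> names of the pages that contain it
    private final Map<String, Set<String>> table = new TreeMap<>();

    // Crude tag removal; fine for simple pages, not a real HTML parser.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    // Step 1: extract the words of one page and record them in the table.
    void addPage(String pageName, String html) {
        String text = stripTags(html).toLowerCase(Locale.ROOT);
        for (String word : text.split("[^a-z0-9]+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) {
                continue;
            }
            table.computeIfAbsent(word, k -> new TreeSet<>()).add(pageName);
        }
    }

    // Step 2: look a single word up in the table.
    Set<String> search(String word) {
        Set<String> pages = table.get(word.toLowerCase(Locale.ROOT));
        return pages == null ? Collections.emptySet() : Collections.unmodifiableSet(pages);
    }

    public static void main(String[] args) {
        CrossReferenceSketch index = new CrossReferenceSketch();
        index.addPage("a.html", "<html><body><h1>A simple</h1> investigation</body></html>");
        System.out.println(index.search("investigation"));
    }
}
```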

Good luck,

Julian
 
ibrahimover

Hi all,

Thanks for all the answers. First of all, I looked at the tools you
suggested. Swish-e looks great. I'm supposed to build a very simple one
like Swish-e, so they are good examples for me.

As Julian said, I have started on those steps:

First parse the HTML, then exclude tags and unwanted words, then index
them.

The question is how to index. There is one more point which I don't
know how to explain; maybe this example helps. If a page has a word
like "investigation", I have a tool which breaks that word down into
its related forms (investigation-investigate) and I will index the
page under these words.
I'm planning a structure such that if someone searches for
"investigation", the first results will be for "investigation", then
"investigate", and so on.

I hope that's clear.
So far everything is OK. The question is which indexing algorithm to
use and what else to index. It shouldn't be too complicated. There is
maybe one more important thing I should index; again I will give an
example:

If we search for "simple investigation", the pages which contain the
exact phrase "simple investigation" should come first, and then the
pages where the two words are further apart
("simple ... investigation").

Only these two criteria are important to me, so I need to find an
indexing algorithm that supports them.

Thank you
 
Julian Treadwell

One way to allow phrase searching would be to include the word position
in your index table.

So the table structure would be:

field1: word (key)
field2: page # (multi-value)
field3: position (multi-value,linked to page)

So if the user searches for "simple investigation" and your search
program found "simple" on page 100 at position 32 and "investigation" at
page 100 at position 33 it could decide there's a phrase match and list
page 100 at the top of the list.
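
As a rough illustration of that table (word, page #, position), here is a small in-memory sketch. It assumes integer page numbers, two-word queries only, and that every word is kept so that positions stay contiguous; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: word -> (page number -> positions of the word on that page).
class PositionalIndexSketch {
    private final Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();

    void addPage(int page, String text) {
        int position = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (word.isEmpty()) {
                continue;
            }
            index.computeIfAbsent(word, k -> new TreeMap<>())
                 .computeIfAbsent(page, k -> new ArrayList<>())
                 .add(position++);
        }
    }

    private Map<Integer, List<Integer>> postings(String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptyMap());
    }

    // Pages where `second` appears at the position right after `first`.
    Set<Integer> phraseMatches(String first, String second) {
        Set<Integer> result = new TreeSet<>();
        Map<Integer, List<Integer>> b = postings(second);
        for (Map.Entry<Integer, List<Integer>> e : postings(first).entrySet()) {
            List<Integer> next = b.get(e.getKey());
            if (next == null) {
                continue;
            }
            for (int pos : e.getValue()) {
                if (next.contains(pos + 1)) {
                    result.add(e.getKey());
                    break;
                }
            }
        }
        return result;
    }

    // Phrase matches first, then the other pages that contain both words.
    List<Integer> search(String first, String second) {
        Set<Integer> ranked = new LinkedHashSet<>(phraseMatches(first, second));
        Map<Integer, List<Integer>> b = postings(second);
        for (Integer page : postings(first).keySet()) {
            if (b.containsKey(page)) {
                ranked.add(page);
            }
        }
        return new ArrayList<>(ranked);
    }

    public static void main(String[] args) {
        PositionalIndexSketch index = new PositionalIndexSketch();
        index.addPage(100, "a simple investigation");
        index.addPage(200, "investigation is simple");
        System.out.println(index.search("simple", "investigation"));
    }
}
```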
 
ibrahimover

Hi, I forgot to say there is a problem:

I'm not allowed to use any DB, so I have to solve this with text
files.

I'm planning to make an index file which has something like

investigate|5
investigation|7
field|56
....

Something like this. I should keep these words in some order, like a
B-tree, then search for "investigation" in that index; when I find it
I will get the pointer "7".
But the problem is I don't want to build the B-tree every time, so I
guess I have to know how to implement a B-tree over a text file (how to
add/delete/search) instead of in memory. But I'm not sure; that's just
what I came up with from my limited knowledge.

Then in another object file, the data about "investigation", like
page #, position, ..., will be in the 7th object, so that with one
search I can go directly to the 7th object and get the information
about it.
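
A minimal sketch of that two-file layout, under stated assumptions: the index file holds "word|pointer" lines already sorted by word, the object file holds one record per line (record n on line n), and both are small enough to read into memory at startup. A binary search over the sorted lines then takes the place of a B-tree. The class name and the record format are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: sorted "word|pointer" index file plus a record file.
class TwoFileIndexSketch {
    private final List<String> words = new ArrayList<>();     // sorted, from the index file
    private final List<Integer> pointers = new ArrayList<>(); // parallel to words
    private final List<String> records;                       // line n-1 holds record n

    TwoFileIndexSketch(Path indexFile, Path recordFile) throws IOException {
        for (String line : Files.readAllLines(indexFile)) {
            int bar = line.indexOf('|');
            if (bar < 0) {
                continue; // skip malformed lines
            }
            words.add(line.substring(0, bar));
            pointers.add(Integer.parseInt(line.substring(bar + 1).trim()));
        }
        records = Files.readAllLines(recordFile);
    }

    // Binary search over the sorted words, then follow the pointer.
    // Returns null when the word or its record is absent.
    String lookup(String word) {
        int i = Collections.binarySearch(words, word);
        if (i < 0) {
            return null;
        }
        int pointer = pointers.get(i);
        return (pointer >= 1 && pointer <= records.size()) ? records.get(pointer - 1) : null;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: TwoFileIndexSketch <indexFile> <recordFile> <word>");
            return;
        }
        TwoFileIndexSketch index = new TwoFileIndexSketch(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(index.lookup(args[2]));
    }
}
```

The binary search only works if the index file is written in sorted order, so the indexing program would sort the words before saving them.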


Another way that I thought of is a dictionary listing. I don't know
much about it, but I think it's something like
invest
---igate(5)
---igation(7)
---igator(88)
etc. Building the index like this would be hard at first, but later
it's easy to search. But this time I don't know how to save that index
to a file.
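
The listing above resembles a prefix tree (trie). A minimal sketch, assuming one node per character and an integer pointer stored where a word ends; the names are hypothetical. For saving, one simple option is to flatten the tree back into sorted "word|pointer" lines and rebuild it from those lines on load.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a prefix tree mapping words to pointers.
class PrefixTreeSketch {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        Integer pointer; // non-null only where a complete word ends
    }

    private final Node root = new Node();

    void put(String word, int pointer) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.pointer = pointer;
    }

    // Returns the pointer stored for the word, or null if it was never added.
    Integer get(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;
            }
        }
        return node.pointer;
    }

    // Flattens the tree into sorted "word|pointer" lines, ready to write to a text file.
    List<String> toLines() {
        List<String> lines = new ArrayList<>();
        collect(root, new StringBuilder(), lines);
        return lines;
    }

    private void collect(Node node, StringBuilder prefix, List<String> out) {
        if (node.pointer != null) {
            out.add(prefix + "|" + node.pointer);
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            prefix.append(e.getKey().charValue());
            collect(e.getValue(), prefix, out);
            prefix.setLength(prefix.length() - 1);
        }
    }

    public static void main(String[] args) {
        PrefixTreeSketch tree = new PrefixTreeSketch();
        tree.put("investigate", 5);
        tree.put("investigation", 7);
        tree.put("investigator", 88);
        tree.toLines().forEach(System.out::println);
    }
}
```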


I guess I'm confused :(
 
Andrew Thompson

Hi, I forgot to say there is a problem:

I'm not allowed to use any DB, so I have to solve this with text
files.

Huh?

Do you mean your boss said "Don't use a database!"
Why would the boss care, so long as it does
not cost anything?

Or, is it that you are teaching yourself Java, and
set the (arbitrary) rule that this code would not
use a database?

OTOH, if this is a college assignment, just how
much do you expect to learn by asking...
"does anybody have simple code to understand
what to index and how to index , how to search
writen with java "?

Something else, stranger altogether??

Andrew T.
 
ibrahimover

Hi, thanks for the answer, even though it doesn't seem helpful to me.

As I said before:

"I'm doing it as an exercise. In fact this isn't my exercise, but if I
succeed with this, most of it will be done; the other parts are just
the usual reports etc."

As you can guess, I'm a student, and if you read my last post, I have
some ideas about what to do. But I'm not an expert, and I don't want to
waste a lot of time trying useless or poor algorithms. I'm not asking
for B-tree code or anything else; I just want to find the best way to
do it, and I guess asking helps me find it.

If that's not how it works here, I'm very sorry. I just thought I might
get some ideas and some guidance.
 
Julian Treadwell

You can use an alphabetically ordered text file instead of a database to
store your word index but you'll have to read through it sequentially
each time you do a search instead of doing a direct read. But with
modern computer speeds that won't be noticeable. You'll need to
have a line for each occurrence of each word.
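
A minimal sketch of such a sequential read, assuming a hypothetical line format of "word|page|position" with the file sorted by word. Because of the ordering, the scan can stop as soon as it passes the place where the word would be.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: sequential search of a sorted, one-line-per-occurrence index file.
class SequentialScanSketch {
    static Set<String> pagesContaining(Path indexFile, String word) throws IOException {
        Set<String> pages = new TreeSet<>();
        try (BufferedReader reader = Files.newBufferedReader(indexFile)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\\|");
                if (fields.length < 3) {
                    continue; // skip malformed lines
                }
                int cmp = fields[0].compareTo(word);
                if (cmp == 0) {
                    pages.add(fields[1]);
                } else if (cmp > 0) {
                    break; // file is sorted, so the word cannot appear later
                }
            }
        }
        return pages;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: SequentialScanSketch <indexFile> <word>");
            return;
        }
        System.out.println(pagesContaining(Paths.get(args[0]), args[1]));
    }
}
```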
 
