ranking texts against a white list

Mario Protto

Hi all,

I have many small texts (200-1000 chars) and a white list (100 words). I
have to score each text for its relevance to the word list.
Right now I'm using a very simple algorithm:
_______________________
Does the text contain at least 1 word from the list?
yes --> rank = 1
no --> rank = 0
_______________________

But I'd like the rank to be a real number between 0 and 1. I have thought of
something like counting how many different words from the list appear in the
text and normalizing to 1, but perhaps there is some other, more
intelligent ;) way to do that.
Any suggestions?

thx

Mario
www.mario-online.com
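
The idea described in the question (count how many different white-list words appear in a text and normalize to 1) can be sketched as follows. This is only an illustration under stated assumptions: the function name `whitelist_rank` is hypothetical, words are matched case-insensitively on `\w+` tokens, and the score is normalized by the size of the white list, none of which is specified in the thread.

```python
import re


def whitelist_rank(text, whitelist):
    """Return the fraction of white-list words found in text (0.0 to 1.0).

    Assumptions (not from the thread): case-insensitive matching,
    tokens are runs of word characters, no stemming.
    """
    wanted = {w.lower() for w in whitelist}
    if not wanted:
        return 0.0
    # Distinct words of the text that are also on the white list.
    found = set(re.findall(r"\w+", text.lower())) & wanted
    return len(found) / len(wanted)


if __name__ == "__main__":
    words = ["perl", "php", "python", "ruby"]
    print(whitelist_rank("Perl and PHP talk to Postgresql", words))
```

With a 100-word list and texts of 200-1000 characters, this normalization will usually give small values, so dividing by a smaller constant (for example the largest count seen across all texts) may be more useful in practice.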
 
Arndt Jonasson

This question doesn't have anything to do with Perl, until there is
a particular implementation problem you want help with, so this is
not the proper news group for it.

If you don't know what the meaning of the relevancy number is, how
can anyone else? It's easy to start speculating, but before even doing
that I would want to know how the number is to be used.

If you search with google using some of the words "rank text white list",
you may find more information. Another source of ideas is documentation
(and source) of existing text search and ranking tools. 'Glimpse' comes
to mind, but there are probably many.

There's probably a proper news group dealing with such questions, but
I don't know what it might be called.
 
Mario Protto

Arndt Jonasson said:
This question doesn't have anything to do with Perl, until there is
a particular implementation problem you want help with, so this is
not the proper news group for it.

Ehm... sorry, I forgot to say that this function is embedded in a Perl
project that fetches text in various ways, stores it in a PostgreSQL
database and, via a PHP front end, lets human operators filter and
show the contents.

Arndt Jonasson said:
If you don't know what the meaning of the relevancy number is, how
can anyone else? It's easy to start speculating, but before even doing
that I would want to know how the number is to be used.

Well, the relevancy number could be something like "how much does this
document talk about my terms". I know it may be an almost theoretical
question, but it seems to me a common need for Perl programmers who
manage text... isn't it?

Arndt Jonasson said:
If you search with google using some of the words "rank text white list",
you may find more information. Another source of ideas is documentation
(and source) of existing text search and ranking tools. 'Glimpse' comes
to mind, but there are probably many.

Of course I did some CPAN and Google searching before posting. Also (for
anyone interested), in the Italian Perl newsgroup Stefano Rodighiero
suggested a very interesting article:
* "Building a Vector Space Search Engine in Perl"
http://www.perl.com/pub/a/2003/02/19/engine.html

Arndt Jonasson said:
There's probably a proper news group dealing with such questions, but
I don't know what it might be called.

Me too... :)

Mario
 
Mark Clements

Hi

Check out

http://www.perl.com/pub/a/2003/02/19/engine.html

It is an article on building vector-space searches. It may be what you are after.

Mark
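
For readers who want a feel for the vector-space idea before reading the article, here is a minimal sketch. It is not the article's code: it assumes the text is represented by raw term counts, the white list by a vector with 1 for each listed word, and the rank is the cosine of the angle between the two. The name `cosine_rank` is hypothetical, and there is no stemming, stop-word removal, or term weighting.

```python
import math
import re
from collections import Counter


def cosine_rank(text, whitelist):
    """Cosine similarity between a text's term counts and a white list.

    Assumptions (illustrative only): case-insensitive \\w+ tokens,
    each white-list word has weight 1. The result lies in 0.0 to 1.0.
    """
    terms = {w.lower() for w in whitelist}
    counts = Counter(re.findall(r"\w+", text.lower()))
    if not terms or not counts:
        return 0.0
    # Dot product: the white-list vector is 1 on its terms, 0 elsewhere.
    dot = sum(counts[t] for t in terms)
    text_norm = math.sqrt(sum(c * c for c in counts.values()))
    list_norm = math.sqrt(len(terms))
    return dot / (text_norm * list_norm)


if __name__ == "__main__":
    print(cosine_rank("perl perl php", ["perl", "php"]))
```

Unlike a plain count of matching words, this score drops when a text contains many words that are not on the list, so it rewards texts that are mostly "about" the listed terms.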
 
A. Sinan Unur

Mario Protto said:
Ehm...sorry but I forgot to tell that this function is embedded in a
Perl project that start fetching text in a various way,

Still irrelevant.

To get a better idea of what types of topics are relevant here, you should
read the posting guidelines for this group. They are posted here regularly
or you can Google for them on the web.

Sinan
 
David K. Wall

A. Sinan Unur said:
Still irrelevant.

Maybe comp.programming? It seems like it might be a better place to
discuss an algorithm without caring about what language it's
implemented in.

A. Sinan Unur said:
To get a better idea of what types of topics are relevant here,
you should read the posting guidelines for this group. They are
posted here regularly or you can Google for them on the web.

I bet Google hates the use of their trademarked name as a generic
verb.... :)
 
Tad McClellan

David K. Wall said:
I bet Google hates the use of their trademarked name as a generic
verb.... :)


I hope the smiley means you mean just the opposite...?

I would think they _love_ it.
 
Chris Mattern

Tad said:
I hope the smiley means you mean just the opposite...?

I would think they _love_ it.

Er, no. Because that's how you lose trademarks. Ask Bayer,
for whom aspirin used to be a trademark. Also escalator,
linoleum, zipper and yo-yo, all of which used to be brand
names, and were lost to their owners because they became
generic terms.

--
Christopher Mattern

"Which one you figure tracked us?"
"The ugly one, sir."
"...Could you be more specific?"
 
