best method to perform operations on word lists

F

Francois Massion

Hi folks,

I am rather bad at perl and would like some advice on the best
methodology to do the following:

I have a list of approx 20,000 terms extracted from a database. The
list is sorted alphabetically. The entries look like this:

überzeugt
überzeugt,
überzogen
überzogen,
überzogen.
üblich
übliche
üblichen
üblicherweise

I want to eliminate the variants of a basic word. In the example above
I want to end up with:
-überzeugt
-überzogen
-üblich
-üblicherweise

I have thought of the following:

(i) I read the list in a hash made of an index and the term
1 ==> überzeugt
2 ==> überzeugt,
etc.

(ii) I compare each term with its followers

(iii) if the following condition is not met, I delete the entry
(key+value) with "delete"

$term ist a substring of next term AND
the length difference is, say, below 3 (to avoid deleting
"üblicherweise" which is a different term)

I am not sure it is the right methodology. I don't like so much the
idea of creating artificially the index list (1 ==> Term1).

I wonder if I should work with references but it is sort of a blackbox
to me.

Any comments are appreciated.

Francois
 
D

David Squire

Francois said:
Hi folks,

I am rather bad at perl and would like some advice on the best
methodology to do the following:

I have a list of approx 20,000 terms extracted from a database. The
list is sorted alphabetically. The entries look like this:

überzeugt
überzeugt,
überzogen
überzogen,
überzogen.
üblich
übliche
üblichen
üblicherweise

I want to eliminate the variants of a basic word. In the example above
I want to end up with:
-überzeugt
-überzogen
-üblich
-üblicherweise

[snip]

At first I thought you just wanted to remove trailing punctuation, which
is easy, but then I saw that you want to treat üblich, übliche, üblichen
as equivalent. This is hard to do properly.

In information retrieval, this is known as stemming. In linguistics it
is sometimes called lemmatization. To do this well requires a model of
the language in question. There two basic approaches: algorithmic
stemming, which is approximate (e.g. the famous Porter stemmer for
English), and look-up tables, that map all word variants to the
appropriate stem.

I know where to point you for some English stemmers, but not for German.
I have thought of the following:

(i) I read the list in a hash made of an index and the term
1 ==> überzeugt
2 ==> überzeugt,
etc.

(ii) I compare each term with its followers

(iii) if the following condition is not met, I delete the entry
(key+value) with "delete"

$term ist a substring of next term AND
the length difference is, say, below 3 (to avoid deleting
"üblicherweise" which is a different term)

Well, this is the beginning of an stemming algorithm, but I foresee
problems. What about plurals created by changing vowel sounds in the
middle of a word with an umlaut?

If you go to cpan and search for stemmer, you will find quite a few
modules for stemming, including some that do German, but I do not know
about their internals/quality.

Regards,

DS
 
D

Dr.Ruud

Francois Massion schreef:
I have a list of approx 20,000 terms extracted from a database. The
list is sorted alphabetically. The entries look like this:

überzeugt
überzeugt,
überzogen
überzogen,
überzogen.
üblich
übliche
üblichen
üblicherweise

You can first clean it up by removing the punctuations at the end of the
line, and then pipe it through uniq:

perl -ple 's/[.,]$//' infile | uniq > infile-1

I want to eliminate the variants of a basic word. In the example above
I want to end up with:
-überzeugt
-überzogen
-üblich
-üblicherweise

You are bound to loose more than you win.

If you are not in a hurry and have plenty of memory, you can slurp the
whole file in, and then do

1 while s/ \n (.+) \n \1 (?:e|en|t) \n /\n$1\n/x ;

but a while-loop that remembers the previous line is far more efficient.
 
F

Francois Massion

Hi David,
Dag Mijnheer/Mevrouw Ruud,

That was a quick reply! Stemming and lemmatization are a known concept,
but I want to make this exercise language independant (the project I am
working on involves French, German and Polish as a matter of fact).
Therefore I cannot start to cut known characters from the end of a
word, it would be too time-consuming and unsafe.

I'll proceed first with the method described above which is probably
slow and not the most elegant but with some luck it'll work. I'll
report on the result...

Francois
 
B

Bart Van der Donck

Francois said:
That was a quick reply! Stemming and lemmatization are a known concept,
but I want to make this exercise language independant (the project I am
working on involves French, German and Polish as a matter of fact).
Therefore I cannot start to cut known characters from the end of a
word, it would be too time-consuming and unsafe.

I'll proceed first with the method described above which is probably
slow and not the most elegant but with some luck it'll work. I'll
report on the result...

I'm afraid there is no definable algorithm that you can use. What about
the following sample entries (Dutch words):

bi<->bis
de<->den
re<->ree
do<->doen
keer<->keren
etc...

I think you would need access to some dictionary files anyhow.
 
F

Francois Massion

Thanks Bart,

Well, the issue is really a matter of pragmatism. If I do the work
manually or with some VBA macros it will take ages. The situation I am
trying to address is not so uncommon to people working on glossary
issues. Therefore I am trying to find a language-independant solution
which works, say, for 90% of the words. It won't work in situations
with irregular plurals like the one you mention (or e.g. French for
"Work/works": "Travail / travaux", German for "House/Houses":
"Haus/Häuser") or with character swaps in suffixes but at least it
would reduce substantially the number of cases to deal with.

I can define the length of a suffix with something like this:
for ($suffix=0 ; $suffix <= 1 ; $suffix++) {
or as a length difference between 2 words

I can also find out what is the root and the suffix of terms with
something like this:
$wordend = substr ($term,-$suffix);
$startposition = rindex ($term,$wordend); # position of suffix from
the end
$root = substr ($term,0,$startposition);
But for the moment I am struggling getting the value of one term and
the next one in order to compare them...Hope Dies Last !

Francois
 
D

David Squire

Francois said:
Thanks Bart,

Well, the issue is really a matter of pragmatism. If I do the work
manually or with some VBA macros it will take ages. The situation I am
trying to address is not so uncommon to people working on glossary
issues. Therefore I am trying to find a language-independant solution
which works, say, for 90% of the words. It won't work in situations
with irregular plurals like the one you mention (or e.g. French for
"Work/works": "Travail / travaux", German for "House/Houses":
"Haus/Häuser") or with character swaps in suffixes but at least it
would reduce substantially the number of cases to deal with.

I can define the length of a suffix with something like this:
for ($suffix=0 ; $suffix <= 1 ; $suffix++) {
or as a length difference between 2 words

I can also find out what is the root and the suffix of terms with
something like this:
$wordend = substr ($term,-$suffix);
$startposition = rindex ($term,$wordend); # position of suffix from
the end
$root = substr ($term,0,$startposition);
But for the moment I am struggling getting the value of one term and
the next one in order to compare them...Hope Dies Last !

Are your three languages mixed together, or do they occur in separate
files? If separate, even if the files are not tagged with the language
it should be possible to estimate it quickly with spell-check dictionary
look-ups, and then to use the appropriate language-specific stemmer.

Dictionaries, spell-checkers and stemmers for many languages are
available from CPAN.

DS
 
F

Francois Massion

Hi David,

The languages are not mixed. There are good stemmers for the most
"important" languages. A while ago I found some of them on the
following site:

http://snowball.tartarus.org/

But it is probably too complicated for my purpose.

Francois
 
D

David Squire

Francois said:
Hi David,

Thanks for the salutation, but it is not appropriate here. You are not
replying to me, you are posting to the whole comp.lang.perl.misc community.

Also, please quote context and retain attribution(s) (thus obviating the
need for a salutation) when replying, as has everyone who has replied to
your posts. See the posting guidelines for this group, which are posted
here twice a week.
The languages are not mixed. There are good stemmers for the most
"important" languages. A while ago I found some of them on the
following site:

http://snowball.tartarus.org/

But it is probably too complicated for my purpose.

Well, there is a Perl module providing an interface to the snowball
stemmers available from CPAN, so it should not be at all complicated to
use them, and a much better long-term solution.

DS
 
B

Bart Van der Donck

Francois said:
Well, the issue is really a matter of pragmatism. If I do the work
manually or with some VBA macros it will take ages. The situation I am
trying to address is not so uncommon to people working on glossary
issues. Therefore I am trying to find a language-independant solution
which works, say, for 90% of the words.

If you can afford such an error margin, here is a brute approach:

#!perl
use strict; use warnings;
my $list =
"überzeugt
überzeugt,
überzogen
überzogen,
überzogen.
üblich
übliche
üblichen
üblicherweise";
my @terms = split /\n/, $list;
my $prev = 'nonesuch584685542256RANOM58544';
s/(\.|,|e|en|e,|en,|e\.|en\.)$// for @terms;
@terms = grep($_ ne $prev && ($prev = $_), sort @terms);
print $_."\n" for @terms;

FWIW,
 
D

Dr.Ruud

Bart Van der Donck schreef:
s/(\.|,|e|en|e,|en,|e\.|en\.)$// for @terms;


Weaker alternative:

(1) s/ (?:e|en)? [.,]? $//x for @terms ;


or even

(2) s/ (?:en?)? [.,]? $//x for @terms ;

It's weaker, because the regex matches the empty string as well.


Proper alternative:

(3) s/(?: en? [.,]? | [.,] )$//x for @terms ;


because it matches at least /e$/ or /[.,]$/.


(untested)
 
F

Francois Massion

Bart said:
If you can afford such an error margin, here is a brute approach:

#!perl
use strict; use warnings;
my $list =
"überzeugt
überzeugt,
überzogen
überzogen,
überzogen.
üblich
übliche
üblichen
üblicherweise";
my @terms = split /\n/, $list;
my $prev = 'nonesuch584685542256RANOM58544';

This didn't modify the list. Maybe the reason is the $prev definition.

s/(\.|,|e|en|e,|en,|e\.|en\.)$// for @terms;

I also tried Dr. Ruud's regex but it would have to be rewritten for
each language. Here a Polish list example:
zeliwa
zeliwa.
zeliwa,
zeliwna
zeliwnej
zeliwny
zeliwo
zelu
zurawia
Zurawie
zuraw
 
B

Bart Van der Donck

Francois said:
[...]
#!perl
use strict; use warnings;
my $list =
"überzeugt
überzeugt,
überzogen
überzogen,
überzogen.
üblich
übliche
üblichen
üblicherweise";
my @terms = split /\n/, $list;
my $prev = 'nonesuch584685542256RANOM58544';

This didn't modify the list.

I didn't mean to modify $list; the new content is in @terms. If you
want $list to contain the new words, you can use something like this at
the end of the program.

$list = join "\n", @terms;
Maybe the reason is the $prev definition.

$prev has no direct importance here, it's only required that it should
not be present in @terms, because it is used to delete double entries
from @terms.
I also tried Dr. Ruud's regex but it would have to be rewritten for
each language.

That is correct, hence my thoughts about language files. My code is a
very brute algorithm - it only strips out the following from the end of
each line:

. , e en e en, e. en.

If you are planning to use this for different languages, you would
obviously need to modify those patterns each time.
 
F

Francois Massion

I have come up with something which seems to work (partially). For my
current purpose it'll do the trick but any suggestion for optimization
is welcome:

(...)
chomp(my $prev = <TERMS>);

my @reducedlist = $prev;

while ( <TERMS> ) {
chomp;
push @reducedlist, $_ unless (/^$prev/ && length ($_)- length($prev)<3)
; # I can set the maximum lenth of a suffix here
$prev = $_;
}
print "$_\n" for @reducedlist;

[Amazing to see how much time people can invest in a few lines of code
when they are no professionals ;-)! ]

Francois
Francois said:
[...]
#!perl
use strict; use warnings;
my $list =
"überzeugt
überzeugt,
überzogen
überzogen,
überzogen.
üblich
übliche
üblichen
üblicherweise";
my @terms = split /\n/, $list;
my $prev = 'nonesuch584685542256RANOM58544';

This didn't modify the list.

I didn't mean to modify $list; the new content is in @terms. If you
want $list to contain the new words, you can use something like this at
the end of the program.

$list = join "\n", @terms;
Maybe the reason is the $prev definition.

$prev has no direct importance here, it's only required that it should
not be present in @terms, because it is used to delete double entries
from @terms.
I also tried Dr. Ruud's regex but it would have to be rewritten for
each language.

That is correct, hence my thoughts about language files. My code is a
very brute algorithm - it only strips out the following from the end of
each line:

. , e en e en, e. en.

If you are planning to use this for different languages, you would
obviously need to modify those patterns each time.
 
D

Dr.Ruud

Francois Massion schreef:
I have come up with something which seems to work (partially). For my
current purpose it'll do the trick but any suggestion for optimization
is welcome:

(...)
chomp(my $prev = <TERMS>);

my @reducedlist = $prev;

while ( <TERMS> ) {
chomp;
push @reducedlist, $_ unless (/^$prev/ && length ($_)-
length($prev)<3) ; # I can set the maximum lenth of a suffix here
$prev = $_;
}
print "$_\n" for @reducedlist;

[Amazing to see how much time people can invest in a few lines of code
when they are no professionals ;-)! ]


#!/usr/bin/perl
use strict ;
use warnings ;

my @reducedlist ;
my $prev = quotemeta '......' ;

while ( <> )
{
s/\W+$// ; # chop-chop

next if / \A $prev .{0,3} \z /x ;

push @reducedlist, $_ ;
$prev = quotemeta ;
}
print "$_\n" for @reducedlist;

(untested)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,013
Latest member
KatriceSwa

Latest Threads

Top