best method to perform operations on word lists

Discussion in 'Perl Misc' started by Francois Massion, Jun 10, 2006.

  1. Hi folks,

    I am rather bad at perl and would like some advice on the best
    methodology to do the following:

    I have a list of approx 20,000 terms extracted from a database. The
    list is sorted alphabetically. The entries look like this:

    überzeugt
    überzeugt,
    überzogen
    überzogen,
    überzogen.
    üblich
    übliche
    üblichen
    üblicherweise

    I want to eliminate the variants of a basic word. In the example above
    I want to end up with:
    -überzeugt
    -überzogen
    -üblich
    -üblicherweise

    I have thought of the following:

    (i) I read the list in a hash made of an index and the term
    1 ==> überzeugt
    2 ==> überzeugt,
    etc.

    (ii) I compare each term with its followers

    (iii) if the following condition is not met, I delete the entry
    (key+value) with "delete"

    $term is a substring of the next term AND
    the length difference is, say, below 3 (to avoid deleting
    "üblicherweise" which is a different term)

    I am not sure this is the right methodology. I don't much like the
    idea of artificially creating the index list (1 ==> Term1).

    I wonder if I should work with references, but they are something of
    a black box to me.
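    Steps (i)-(iii) can be sketched without the artificial index, using a
    plain sorted array; this is only a sketch, with the length threshold
    of 3 taken from the condition above and the sample list hard-coded:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @terms = split /\n/, <<'END';
überzeugt
überzeugt,
überzogen
überzogen,
überzogen.
üblich
übliche
üblichen
üblicherweise
END

my @kept = shift @terms;              # the first term always survives
for my $term (@terms) {
    my $prev = $kept[-1];
    # drop $term if it starts with the last kept term and the extra
    # tail is short (fewer than 3 characters)
    next if index($term, $prev) == 0
         && length($term) - length($prev) < 3;
    push @kept, $term;
}
print "$_\n" for @kept;
```

    Comparing against the last *kept* term (rather than the immediately
    preceding input line) lets a whole run of variants collapse onto one
    base word.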

    Any comments are appreciated.

    Francois
     
    Francois Massion, Jun 10, 2006
    #1

  2. Francois Massion

    David Squire Guest

    Francois Massion wrote:
    > Hi folks,
    >
    > I am rather bad at perl and would like some advice on the best
    > methodology to do the following:
    >
    > I have a list of approx 20,000 terms extracted from a database. The
    > list is sorted alphabetically. The entries look like this:
    >
    > überzeugt
    > überzeugt,
    > überzogen
    > überzogen,
    > überzogen.
    > üblich
    > übliche
    > üblichen
    > üblicherweise
    >
    > I want to eliminate the variants of a basic word. In the example above
    > I want to end up with:
    > -überzeugt
    > -überzogen
    > -üblich
    > -üblicherweise


    [snip]

    At first I thought you just wanted to remove trailing punctuation, which
    is easy, but then I saw that you want to treat üblich, übliche, üblichen
    as equivalent. This is hard to do properly.

    In information retrieval, this is known as stemming. In linguistics it
    is sometimes called lemmatization. To do this well requires a model of
    the language in question. There are two basic approaches: algorithmic
    stemming, which is approximate (e.g. the famous Porter stemmer for
    English), and look-up tables that map all word variants to the
    appropriate stem.

    I know where to point you for some English stemmers, but not for German.
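    As a toy illustration of the look-up-table approach, here is a
    hand-built variant-to-stem map; the entries are examples made up for
    this post, not drawn from a real German dictionary resource:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical look-up table: variant => stem
my %stem_of = (
    'üblich'   => 'üblich',
    'übliche'  => 'üblich',
    'üblichen' => 'üblich',
);

my @words = ('üblich', 'übliche', 'üblichen', 'üblicherweise');
# fall back to the word itself when it is not in the table
my @stems = map { exists $stem_of{$_} ? $stem_of{$_} : $_ } @words;
print "$_\n" for @stems;
```

    A real table would have one entry per inflected form, which is why
    this approach needs dictionary files rather than an algorithm.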

    > I have thought of the following:
    >
    > (i) I read the list in a hash made of an index and the term
    > 1 ==> überzeugt
    > 2 ==> überzeugt,
    > etc.
    >
    > (ii) I compare each term with its followers
    >
    > (iii) if the following condition is not met, I delete the entry
    > (key+value) with "delete"
    >
    > $term is a substring of the next term AND
    > the length difference is, say, below 3 (to avoid deleting
    > "üblicherweise" which is a different term)


    Well, this is the beginning of a stemming algorithm, but I foresee
    problems. What about plurals formed by changing a vowel in the
    middle of the word to an umlaut?

    If you go to cpan and search for stemmer, you will find quite a few
    modules for stemming, including some that do German, but I do not know
    about their internals/quality.

    Regards,

    DS
     
    David Squire, Jun 10, 2006
    #2

  3. Francois Massion

    Dr.Ruud Guest

    Francois Massion wrote:

    > I have a list of approx 20,000 terms extracted from a database. The
    > list is sorted alphabetically. The entries look like this:
    >
    > überzeugt
    > überzeugt,
    > überzogen
    > überzogen,
    > überzogen.
    > üblich
    > übliche
    > üblichen
    > üblicherweise


    You can first clean it up by removing the punctuation at the end of
    each line, and then pipe it through uniq:

    perl -ple 's/[.,]$//' infile | uniq > infile-1
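    The same cleanup-plus-uniq pass can also be done entirely in Perl,
    for systems without uniq; a sketch, wrapped in a subroutine:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# equivalent of: perl -ple 's/[.,]$//' infile | uniq
sub clean_uniq {
    my @out;
    my $prev;
    for my $line (@_) {
        (my $term = $line) =~ s/[.,]$//;          # strip trailing . or ,
        next if defined $prev && $term eq $prev;  # uniq: skip repeats
        push @out, $term;
        $prev = $term;
    }
    return @out;
}

print "$_\n" for clean_uniq('überzeugt', 'überzeugt,', 'überzogen,', 'überzogen.');
```

    Like uniq, this only removes *adjacent* duplicates, which is enough
    because the input list is sorted.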


    > I want to eliminate the variants of a basic word. In the example above
    > I want to end up with:
    > -überzeugt
    > -überzogen
    > -üblich
    > -üblicherweise


    You are bound to lose more than you win.

    If you are not in a hurry and have plenty of memory, you can slurp the
    whole file in, and then do

    1 while s/ \n (.+) \n \1 (?:e|en|t) \n /\n$1\n/x ;

    but a while-loop that remembers the previous line is far more efficient.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Jun 10, 2006
    #3
  4. Hi David,
    Hello Mr/Ms Ruud,

    That was a quick reply! Stemming and lemmatization are known
    concepts, but I want to make this exercise language-independent (the
    project I am working on involves French, German and Polish as a
    matter of fact). Therefore I cannot start cutting known characters
    from the ends of words; it would be too time-consuming and unsafe.

    I'll proceed first with the method described above, which is probably
    slow and not the most elegant, but with some luck it'll work. I'll
    report on the result...

    Francois
     
    Francois Massion, Jun 10, 2006
    #4
  5. Francois Massion wrote:

    > That was a quick reply! Stemming and lemmatization are known
    > concepts, but I want to make this exercise language-independent (the
    > project I am working on involves French, German and Polish as a
    > matter of fact). Therefore I cannot start cutting known characters
    > from the ends of words; it would be too time-consuming and unsafe.
    >
    > I'll proceed first with the method described above which is probably
    > slow and not the most elegant but with some luck it'll work. I'll
    > report on the result...


    I'm afraid there is no definable algorithm that you can use. What about
    the following sample entries (Dutch words):

    bi<->bis
    de<->den
    re<->ree
    do<->doen
    keer<->keren
    etc...

    I think you would need access to some dictionary files anyhow.

    --
    Bart
     
    Bart Van der Donck, Jun 10, 2006
    #5
  6. Thanks Bart,

    Well, the issue is really a matter of pragmatism. If I do the work
    manually or with some VBA macros it will take ages. The situation I am
    trying to address is not so uncommon for people working on glossary
    issues. Therefore I am trying to find a language-independent solution
    which works, say, for 90% of the words. It won't work in situations
    with irregular plurals like the ones you mention (or e.g. French for
    "Work/works": "Travail / travaux", German for "House/Houses":
    "Haus/Häuser") or with character swaps in suffixes, but at least it
    would substantially reduce the number of cases to deal with.

    I can define the length of a suffix with something like this:
    for ($suffix = 0; $suffix <= 1; $suffix++) {
    or as a length difference between two words.

    I can also find the root and the suffix of a term with something
    like this:
    $wordend = substr($term, -$suffix);
    $startposition = rindex($term, $wordend); # start position of the suffix
    $root = substr($term, 0, $startposition);
    But for the moment I am struggling to get the value of one term and
    the next one in order to compare them... Hope dies last!
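    A minimal sketch of keeping the previous term around while reading,
    so that each term can be compared with its predecessor; the in-memory
    filehandle and the four sample words are only for the demonstration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $data = join "\n", 'üblich', 'übliche', 'üblichen', 'üblicherweise';
open my $terms, '<', \$data or die "open: $!";

chomp(my $prev = <$terms>);
my @pairs;
while (my $term = <$terms>) {
    chomp $term;
    # does the previous term start the current one?
    if (index($term, $prev) == 0) {
        push @pairs, [ $prev, substr($term, length $prev) ];
    }
    $prev = $term;
}
printf "%s + %s\n", @$_ for @pairs;
```

    The trick is simply to read one line ahead of the comparison and copy
    the current line into $prev at the bottom of the loop.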

    Francois
     
    Francois Massion, Jun 10, 2006
    #6
  7. Francois Massion

    David Squire Guest

    Francois Massion wrote:
    > Thanks Bart,
    >
    > Well, the issue is really a matter of pragmatism. If I do the work
    > manually or with some VBA macros it will take ages. The situation I am
    > trying to address is not so uncommon to people working on glossary
    > issues. Therefore I am trying to find a language-independent solution
    > which works, say, for 90% of the words. It won't work in situations
    > with irregular plurals like the one you mention (or e.g. French for
    > "Work/works": "Travail / travaux", German for "House/Houses":
    > "Haus/Häuser") or with character swaps in suffixes but at least it
    > would reduce substantially the number of cases to deal with.
    >
    > I can define the length of a suffix with something like this:
    > for ($suffix=0 ; $suffix <= 1 ; $suffix++) {
    > or as a length difference between 2 words
    >
    > I can also find out what is the root and the suffix of terms with
    > something like this:
    > $wordend = substr ($term,-$suffix);
    > $startposition = rindex ($term,$wordend); # position of suffix from
    > the end
    > $root = substr ($term,0,$startposition);
    > But for the moment I am struggling to get the value of one term and
    > the next one in order to compare them... Hope dies last!


    Are your three languages mixed together, or do they occur in separate
    files? If separate, even if the files are not tagged with the language
    it should be possible to estimate it quickly with spell-check dictionary
    look-ups, and then to use the appropriate language-specific stemmer.

    Dictionaries, spell-checkers and stemmers for many languages are
    available from CPAN.

    DS
     
    David Squire, Jun 10, 2006
    #7
  8. Hi David,

    The languages are not mixed. There are good stemmers for the most
    "important" languages. A while ago I found some of them on the
    following site:

    http://snowball.tartarus.org/

    But it is probably too complicated for my purpose.

    Francois
     
    Francois Massion, Jun 10, 2006
    #8
  9. Francois Massion

    David Squire Guest

    Francois Massion wrote:
    > Hi David,


    Thanks for the salutation, but it is not appropriate here. You are not
    replying to me, you are posting to the whole comp.lang.perl.misc community.

    Also, please quote context and retain attribution(s) (thus obviating the
    need for a salutation) when replying, as has everyone who has replied to
    your posts. See the posting guidelines for this group, which are posted
    here twice a week.

    >
    > The languages are not mixed. There are good stemmers for the most
    > "important" languages. A while ago I found some of them on the
    > following site:
    >
    > http://snowball.tartarus.org/
    >
    > But it is probably too complicated for my purpose.


    Well, there is a Perl module providing an interface to the Snowball
    stemmers available from CPAN, so it should not be at all complicated
    to use them, and it would be a much better long-term solution.

    DS
     
    David Squire, Jun 10, 2006
    #9
  10. Francois Massion wrote:

    > Well, the issue is really a matter of pragmatism. If I do the work
    > manually or with some VBA macros it will take ages. The situation I am
    > trying to address is not so uncommon to people working on glossary
    > issues. Therefore I am trying to find a language-independent solution
    > which works, say, for 90% of the words.


    If you can afford such an error margin, here is a brute-force approach:

    #!perl
    use strict; use warnings;
    my $list =
    "überzeugt
    überzeugt,
    überzogen
    überzogen,
    überzogen.
    üblich
    übliche
    üblichen
    üblicherweise";
    my @terms = split /\n/, $list;
    my $prev = 'nonesuch584685542256RANOM58544';
    s/(\.|,|e|en|e,|en,|e\.|en\.)$// for @terms;
    @terms = grep($_ ne $prev && ($prev = $_), sort @terms);
    print $_."\n" for @terms;

    FWIW,

    --
    Bart
     
    Bart Van der Donck, Jun 10, 2006
    #10
  11. Francois Massion

    Dr.Ruud Guest

    Bart Van der Donck wrote:

    > s/(\.|,|e|en|e,|en,|e\.|en\.)$// for @terms;



    Weaker alternative:

    (1) s/ (?:e|en)? [.,]? $//x for @terms ;


    or even

    (2) s/ (?:en?)? [.,]? $//x for @terms ;

    It's weaker, because the regex matches the empty string as well.


    Proper alternative:

    (3) s/(?: en? [.,]? | [.,] )$//x for @terms ;


    because it matches at least /e$/ or /[.,]$/.


    (untested)
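    A quick check of alternative (3) on a few sample terms; note that a
    blind suffix strip also shortens words such as 'überzogen' to
    'überzog', which is part of the error margin accepted above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @terms = ('überzeugt,', 'übliche', 'üblichen', 'üblich', 'überzogen');
# alternative (3): strip e/en with optional punctuation, or bare punctuation
s/(?: en? [.,]? | [.,] )$//x for @terms;
print "$_\n" for @terms;
```

    'üblich' passes through unchanged, as intended, because the pattern
    requires at least an 'e' or a trailing punctuation mark.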

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Jun 10, 2006
    #11
    Bart Van der Donck wrote:

    > Francois Massion wrote:
    >
    > > Well, the issue is really a matter of pragmatism. If I do the work
    > > manually or with some VBA macros it will take ages. The situation I am
    > > trying to address is not so uncommon to people working on glossary
    > > issues. Therefore I am trying to find a language-independant solution
    > > which works, say, for 90% of the words.

    >
    > If you can afford such an error margin, here is a brute approach:
    >
    > #!perl
    > use strict; use warnings;
    > my $list =
    > "überzeugt
    > überzeugt,
    > überzogen
    > überzogen,
    > überzogen.
    > üblich
    > übliche
    > üblichen
    > üblicherweise";
    > my @terms = split /\n/, $list;
    > my $prev = 'nonesuch584685542256RANOM58544';


    This didn't modify the list. Maybe the reason is the $prev definition.


    > s/(\.|,|e|en|e,|en,|e\.|en\.)$// for @terms;


    I also tried Dr. Ruud's regex, but it would have to be rewritten for
    each language. Here is a Polish example list:
    zeliwa
    zeliwa.
    zeliwa,
    zeliwna
    zeliwnej
    zeliwny
    zeliwo
    zelu
    zurawia
    Zurawie
    zuraw


    > @terms = grep($_ ne $prev && ($prev = $_), sort @terms);
    > print $_."\n" for @terms;
    >
    > FWIW,
    >
    > --
    > Bart
     
    Francois Massion, Jun 10, 2006
    #12
  13. Francois Massion wrote:

    > [...]
    > > #!perl
    > > use strict; use warnings;
    > > my $list =
    > > "überzeugt
    > > überzeugt,
    > > überzogen
    > > überzogen,
    > > überzogen.
    > > üblich
    > > übliche
    > > üblichen
    > > üblicherweise";
    > > my @terms = split /\n/, $list;
    > > my $prev = 'nonesuch584685542256RANOM58544';

    >
    > This didn't modify the list.


    I didn't mean to modify $list; the new content is in @terms. If you
    want $list to contain the new words, you can use something like this
    at the end of the program:

    $list = join "\n", @terms;

    > Maybe the reason is the $prev definition.


    $prev has no direct importance here; it's only required that it not
    be present in @terms, because it is used to remove duplicate entries
    from @terms.

    > > s/(\.|,|e|en|e,|en,|e\.|en\.)$// for @terms;

    > I also tried Dr. Ruud's regex but it would have to be rewritten for
    > each language.


    That is correct, hence my thoughts about language files. My code is a
    very brute-force algorithm - it only strips the following from the
    end of each line:

    . , e en e, en, e. en.

    If you are planning to use this for different languages, you would
    obviously need to modify those patterns each time.
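    One way to keep those per-language patterns out of the main code is a
    hash of language => suffix regex. Only the 'de' entry below follows
    the pattern from this thread; the 'fr' and 'pl' suffix sets are
    hypothetical placeholders, not real morphology:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %suffix_re = (
    de => qr/(?: en? [.,]? | [.,] )$/x,    # Dr.Ruud's German pattern
    fr => qr/(?: e?s [.,]? | [.,] )$/x,    # hypothetical French set
    pl => qr/(?: [aeoy] [.,]? | [.,] )$/x, # hypothetical Polish set
);

sub strip_suffix {
    my ($lang, $word) = @_;
    $word =~ s/$suffix_re{$lang}// if exists $suffix_re{$lang};
    return $word;
}

print strip_suffix(de => 'üblichen'), "\n";
print strip_suffix(pl => 'zeliwa,'), "\n";
```

    Adding a language then only means adding one table entry, not
    touching the stripping code.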

    --
    Bart
     
    Bart Van der Donck, Jun 12, 2006
    #13
    I have come up with something which seems to work (partially). For my
    current purpose it'll do the trick, but any suggestion for optimization
    is welcome:

    (...)
    chomp(my $prev = <TERMS>);

    my @reducedlist = $prev;

    while ( <TERMS> ) {
        chomp;
        # I can set the maximum length of a suffix here
        push @reducedlist, $_
            unless /^$prev/ && length($_) - length($prev) < 3;
        $prev = $_;
    }
    print "$_\n" for @reducedlist;

    [Amazing to see how much time people can invest in a few lines of code
    when they are not professionals ;-)!]

    Francois

    Bart Van der Donck wrote:

    > Francois Massion wrote:
    >
    > > [...]
    > > > #!perl
    > > > use strict; use warnings;
    > > > my $list =
    > > > "überzeugt
    > > > überzeugt,
    > > > überzogen
    > > > überzogen,
    > > > überzogen.
    > > > üblich
    > > > übliche
    > > > üblichen
    > > > üblicherweise";
    > > > my @terms = split /\n/, $list;
    > > > my $prev = 'nonesuch584685542256RANOM58544';

    > >
    > > This didn't modify the list.

    >
    > I didn't mean to modify $list; the new content is in @terms. If you
    > want $list to contain the new words, you can use something like this at
    > the end of the program.
    >
    > $list = join "\n", @terms;
    >
    > > Maybe the reason is the $prev definition.

    >
    > $prev has no direct importance here, it's only required that it should
    > not be present in @terms, because it is used to delete double entries
    > from @terms.
    >
    > > > s/(\.|,|e|en|e,|en,|e\.|en\.)$// for @terms;

    > > I also tried Dr. Ruud's regex but it would have to be rewritten for
    > > each language.

    >
    > That is correct, hence my thoughts about language files. My code is a
    > very brute algorithm - it only strips out the following from the end of
    > each line:
    >
    > . , e en e, en, e. en.
    >
    > If you are planning to use this for different languages, you would
    > obviously need to modify those patterns each time.
    >
    > --
    > Bart
     
    Francois Massion, Jun 12, 2006
    #14
  15. Francois Massion

    Dr.Ruud Guest

    Francois Massion wrote:

    > I have come up with something which seems to work (partially). For my
    > current purpose it'll do the trick but any suggestion for optimization
    > is welcome:
    >
    > (...)
    > chomp(my $prev = <TERMS>);
    >
    > my @reducedlist = $prev;
    >
    > while ( <TERMS> ) {
    > chomp;
    > push @reducedlist, $_ unless (/^$prev/ && length ($_)-
    > length($prev)<3) ; # I can set the maximum length of a suffix here
    > $prev = $_;
    > }
    > print "$_\n" for @reducedlist;
    >
    > [Amazing to see how much time people can invest in a few lines of code
    > when they are no professionals ;-)! ]



    #!/usr/bin/perl
    use strict ;
    use warnings ;

    my @reducedlist ;
    my $prev = quotemeta '......' ;

    while ( <> )
    {
        s/\W+$// ; # chop-chop: strip trailing punctuation and newline

        next if / \A $prev .{0,3} \z /x ;

        push @reducedlist, $_ ;
        $prev = quotemeta ;
    }
    print "$_\n" for @reducedlist;

    (untested)
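    For the record, the same logic run on the sample list from the start
    of the thread, reading from an in-memory handle instead of <> so the
    check is self-contained:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $data = <<'END';
überzeugt
überzeugt,
überzogen
überzogen,
überzogen.
üblich
übliche
üblichen
üblicherweise
END
open my $in, '<', \$data or die "open: $!";

my @reducedlist;
my $prev = quotemeta '......';

while (<$in>) {
    s/\W+$//;                        # strip trailing punctuation and newline
    next if / \A $prev .{0,3} \z /x; # variant of the previous kept term
    push @reducedlist, $_;
    $prev = quotemeta;
}
print "$_\n" for @reducedlist;
```

    The quotemeta on $prev matters: it keeps any metacharacters in a term
    from being treated as part of the pattern.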

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Jun 12, 2006
    #15
