best method to perform operations on word lists

Discussion in 'Perl Misc' started by Francois Massion, Jun 10, 2006.

  1. Hi folks,

    I am rather bad at perl and would like some advice on the best
    methodology to do the following:

    I have a list of approx 20,000 terms extracted from a database. The
    list is sorted alphabetically. The entries look like this:

    überzeugt
    überzeugt,
    überzogen
    überzogen,
    überzogen.
    üblich
    übliche
    üblichen
    üblicherweise

    I want to eliminate the variants of a basic word. In the example above
    I want to end up with:
    -überzeugt
    -überzogen
    -üblich
    -üblicherweise

    I have thought of the following:

    (i) I read the list in a hash made of an index and the term
    1 ==> überzeugt
    2 ==> überzeugt,
    etc.

    (ii) I compare each term with its followers

    (iii) if the following condition is not met, I delete the entry
    (key+value) with "delete"

    $term is a substring of the next term AND
    the length difference is, say, below 3 (to avoid deleting
    "üblicherweise" which is a different term)

    I am not sure this is the right methodology. I don't much like the
    idea of artificially creating the index list (1 ==> Term1).

    I wonder if I should work with references, but they are something of
    a black box to me.
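    Steps (i)-(iii) can be sketched without the artificial index, using a
    plain sorted array; this is only a sketch, with the length threshold
    of 3 taken from the condition above and the sample list hard-coded:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @terms = split /\n/, <<'END';
überzeugt
überzeugt,
überzogen
überzogen,
überzogen.
üblich
übliche
üblichen
üblicherweise
END

my @kept = shift @terms;              # the first term always survives
for my $term (@terms) {
    my $prev = $kept[-1];
    # drop $term if it starts with the last kept term and the extra
    # tail is short (fewer than 3 characters)
    next if index($term, $prev) == 0
         && length($term) - length($prev) < 3;
    push @kept, $term;
}
print "$_\n" for @kept;
```

    Comparing against the last *kept* term (rather than the immediately
    preceding input line) lets a whole run of variants collapse onto one
    base word.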

    Any comments are appreciated.

    Francois
     
    Francois Massion, Jun 10, 2006
    #1

  2. Francois Massion

    David Squire Guest

    Francois Massion wrote:
    > Hi folks,
    >
    > I am rather bad at perl and would like some advice on the best
    > methodology to do the following:
    >
    > I have a list of approx 20,000 terms extracted from a database. The
    > list is sorted alphabetically. The entries look like this:
    >
    > überzeugt
    > überzeugt,
    > überzogen
    > überzogen,
    > überzogen.
    > üblich
    > übliche
    > üblichen
    > üblicherweise
    >
    > I want to eliminate the variants of a basic word. In the example above
    > I want to end up with:
    > -überzeugt
    > -überzogen
    > -üblich
    > -üblicherweise


    [snip]

    At first I thought you just wanted to remove trailing punctuation, which
    is easy, but then I saw that you want to treat üblich, übliche, üblichen
    as equivalent. This is hard to do properly.

    In information retrieval, this is known as stemming. In linguistics it
    is sometimes called lemmatization. To do this well requires a model of
    the language in question. There are two basic approaches: algorithmic
    stemming, which is approximate (e.g. the famous Porter stemmer for
    English), and look-up tables that map all word variants to the
    appropriate stem.

    I know where to point you for some English stemmers, but not for German.
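    As a toy illustration of the look-up-table approach, here is a
    hand-built variant-to-stem map; the entries are examples made up for
    this post, not drawn from a real German dictionary resource:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical look-up table: variant => stem
my %stem_of = (
    'üblich'   => 'üblich',
    'übliche'  => 'üblich',
    'üblichen' => 'üblich',
);

my @words = ('üblich', 'übliche', 'üblichen', 'üblicherweise');
# fall back to the word itself when it is not in the table
my @stems = map { exists $stem_of{$_} ? $stem_of{$_} : $_ } @words;
print "$_\n" for @stems;
```

    A real table would have one entry per inflected form, which is why
    this approach needs dictionary files rather than an algorithm.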

    > I have thought of the following:
    >
    > (i) I read the list in a hash made of an index and the term
    > 1 ==> überzeugt
    > 2 ==> überzeugt,
    > etc.
    >
    > (ii) I compare each term with its followers
    >
    > (iii) if the following condition is not met, I delete the entry
    > (key+value) with "delete"
    >
    > $term is a substring of the next term AND
    > the length difference is, say, below 3 (to avoid deleting
    > "üblicherweise" which is a different term)


    Well, this is the beginning of a stemming algorithm, but I foresee
    problems. What about plurals formed by changing a vowel in the
    middle of the word to an umlaut?

    If you go to cpan and search for stemmer, you will find quite a few
    modules for stemming, including some that do German, but I do not know
    about their internals/quality.

    Regards,

    DS
     
    David Squire, Jun 10, 2006
    #2

  3. Francois Massion

    Dr.Ruud Guest

    Francois Massion wrote:

    > I have a list of approx 20,000 terms extracted from a database. The
    > list is sorted alphabetically. The entries look like this:
    >
    > überzeugt
    > überzeugt,
    > überzogen
    > überzogen,
    > überzogen.
    > üblich
    > übliche
    > üblichen
    > üblicherweise


    You can first clean it up by removing the punctuation at the end of
    each line, and then pipe it through uniq:

    perl -ple 's/[.,]$//' infile | uniq > infile-1
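    The same cleanup-plus-uniq pass can also be done entirely in Perl,
    for systems without uniq; a sketch, wrapped in a subroutine:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# equivalent of: perl -ple 's/[.,]$//' infile | uniq
sub clean_uniq {
    my @out;
    my $prev;
    for my $line (@_) {
        (my $term = $line) =~ s/[.,]$//;          # strip trailing . or ,
        next if defined $prev && $term eq $prev;  # uniq: skip repeats
        push @out, $term;
        $prev = $term;
    }
    return @out;
}

print "$_\n" for clean_uniq('überzeugt', 'überzeugt,', 'überzogen,', 'überzogen.');
```

    Like uniq, this only removes *adjacent* duplicates, which is enough
    because the input list is sorted.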


    > I want to eliminate the variants of a basic word. In the example above
    > I want to end up with:
    > -überzeugt
    > -überzogen
    > -üblich
    > -üblicherweise


    You are bound to lose more than you win.

    If you are not in a hurry and have plenty of memory, you can slurp the
    whole file in, and then do

    1 while s/ \n (.+) \n \1 (?:e|en|t) \n /\n$1\n/x ;

    but a while-loop that remembers the previous line is far more efficient.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Jun 10, 2006
    #3
  4. Hi David,
    Hello Mr/Ms Ruud,

    That was a quick reply! Stemming and lemmatization are known
    concepts, but I want to make this exercise language-independent (the
    project I am working on involves French, German and Polish as a
    matter of fact). Therefore I cannot start cutting known characters
    from the ends of words; it would be too time-consuming and unsafe.

    I'll proceed first with the method described above, which is probably
    slow and not the most elegant, but with some luck it'll work. I'll
    report on the result...

    Francois
     
    Francois Massion, Jun 10, 2006
    #4
  5. Francois Massion wrote:

    > That was a quick reply! Stemming and lemmatization are known
    > concepts, but I want to make this exercise language-independent (the
    > project I am working on involves French, German and Polish as a
    > matter of fact). Therefore I cannot start cutting known characters
    > from the ends of words; it would be too time-consuming and unsafe.
    >
    > I'll proceed first with the method described above which is probably
    > slow and not the most elegant but with some luck it'll work. I'll
    > report on the result...


    I'm afraid there is no definable algorithm that you can use. What about
    the following sample entries (Dutch words):

    bi<->bis
    de<->den
    re<->ree
    do<->doen
    keer<->keren
    etc...

    I think you would need access to some dictionary files anyhow.

    --
    Bart
     
    Bart Van der Donck, Jun 10, 2006
    #5
  6. Thanks Bart,

    Well, the issue is really a matter of pragmatism. If I do the work
    manually or with some VBA macros it will take ages. The situation I am
    trying to address is not so uncommon for people working on glossary
    issues. Therefore I am trying to find a language-independent solution
    which works, say, for 90% of the words. It won't work in situations
    with irregular plurals like the ones you mention (or e.g. French for
    "Work/works": "Travail / travaux", German for "House/Houses":
    "Haus/Häuser") or with character swaps in suffixes, but at least it
    would substantially reduce the number of cases to deal with.

    I can define the length of a suffix with something like this:
    for ($suffix = 0; $suffix <= 1; $suffix++) {
    or as a length difference between two words.

    I can also find the root and the suffix of a term with something
    like this:
    $wordend = substr($term, -$suffix);
    $startposition = rindex($term, $wordend); # start position of the suffix
    $root = substr($term, 0, $startposition);
    But for the moment I am struggling to get the value of one term and
    the next one in order to compare them... Hope dies last!
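    A minimal sketch of keeping the previous term around while reading,
    so that each term can be compared with its predecessor; the in-memory
    filehandle and the four sample words are only for the demonstration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $data = join "\n", 'üblich', 'übliche', 'üblichen', 'üblicherweise';
open my $terms, '<', \$data or die "open: $!";

chomp(my $prev = <$terms>);
my @pairs;
while (my $term = <$terms>) {
    chomp $term;
    # does the previous term start the current one?
    if (index($term, $prev) == 0) {
        push @pairs, [ $prev, substr($term, length $prev) ];
    }
    $prev = $term;
}
printf "%s + %s\n", @$_ for @pairs;
```

    The trick is simply to read one line ahead of the comparison and copy
    the current line into $prev at the bottom of the loop.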

    Francois
     
    Francois Massion, Jun 10, 2006
    #6
  7. Francois Massion

    David Squire Guest

    Francois Massion wrote:
    > Thanks Bart,
    >
    > Well, the issue is really a matter of pragmatism. If I do the work
    > manually or with some VBA macros it will take ages. The situation I am
    > trying to address is not so uncommon to people working on glossary
    > issues. Therefore I am trying to find a language-independent solution
    > which works, say, for 90% of the words. It won't work in situations
    > with irregular plurals like the one you mention (or e.g. French for
    > "Work/works": "Travail / travaux", German for "House/Houses":
    > "Haus/Häuser") or with character swaps in suffixes but at least it
    > would reduce substantially the number of cases to deal with.
    >
    > I can define the length of a suffix with something like this:
    > for ($suffix=0 ; $suffix <= 1 ; $suffix++) {
    > or as a length difference between 2 words
    >
    > I can also find out what is the root and the suffix of terms with
    > something like this:
    > $wordend = substr ($term,-$suffix);
    > $startposition = rindex ($term,$wordend); # position of suffix from
    > the end
    > $root = substr ($term,0,$startposition);
    > But for the moment I am struggling to get the value of one term and
    > the next one in order to compare them... Hope dies last!


    Are your three languages mixed together, or do they occur in separate
    files? If separate, even if the files are not tagged with the language
    it should be possible to estimate it quickly with spell-check dictionary
    look-ups, and then to use the appropriate language-specific stemmer.

    Dictionaries, spell-checkers and stemmers for many languages are
    available from CPAN.

    DS
     
    David Squire, Jun 10, 2006
    #7
  8. Hi David,

    The languages are not mixed. There are good stemmers for the most
    "important" languages. A while ago I found some of them on the
    following site:

    http://snowball.tartarus.org/

    But it is probably too complicated for my purpose.

    Francois
     
    Francois Massion, Jun 10, 2006
    #8
  9. Francois Massion

    David Squire Guest

    Francois Massion wrote:
    > Hi David,


    Thanks for the salutation, but it is not appropriate here. You are not
    replying to me, you are posting to the whole comp.lang.perl.misc community.

    Also, please quote context and retain attribution(s) (thus obviating the
    need for a salutation) when replying, as has everyone who has replied to
    your posts. See the posting guidelines for this group, which are posted
    here twice a week.

    >
    > The languages are not mixed. There are good stemmers for the most
    > "important" languages. A while ago I found some of them on the
    > following site:
    >
    > http://snowball.tartarus.org/
    >
    > But it is probably too complicated for my purpose.


    Well, there is a Perl module providing an interface to the Snowball
    stemmers available from CPAN, so it should not be at all complicated
    to use them, and it would be a much better long-term solution.

    DS
     
    David Squire, Jun 10, 2006
    #9
  10. Francois Massion wrote:

    > Well, the issue is really a matter of pragmatism. If I do the work
    > manually or with some VBA macros it will take ages. The situation I am
    > trying to address is not so uncommon to people working on glossary
    > issues. Therefore I am trying to find a language-independent solution
    > which works, say, for 90% of the words.


    If you can afford such an error margin, here is a brute-force approach:

    #!perl
    use strict; use warnings;
    my $list =
    "überzeugt
    überzeugt,
    überzogen
    überzogen,
    überzogen.
    üblich
    übliche
    üblichen
    üblicherweise";
    my @terms = split /\n/, $list;
    my $prev = 'nonesuch584685542256RANOM58544';
    s/(\.|,|e|en|e,|en,|e\.|en\.)$// for @terms;
    @terms = grep($_ ne $prev && ($prev = $_), sort @terms);
    print $_."\n" for @terms;

    FWIW,

    --
    Bart
     
    Bart Van der Donck, Jun 10, 2006
    #10
  11. Francois Massion

    Dr.Ruud Guest

    Bart Van der Donck wrote:

    > s/(\.|,|e|en|e,|en,|e\.|en\.)$// for @terms;



    Weaker alternative:

    (1) s/ (?:e|en)? [.,]? $//x for @terms ;


    or even

    (2) s/ (?:en?)? [.,]? $//x for @terms ;

    It's weaker, because the regex matches the empty string as well.


    Proper alternative:

    (3) s/(?: en? [.,]? | [.,] )$//x for @terms ;


    because it matches at least /e$/ or /[.,]$/.


    (untested)
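    A quick check of alternative (3) on a few sample terms; note that a
    blind suffix strip also shortens words such as 'überzogen' to
    'überzog', which is part of the error margin accepted above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @terms = ('überzeugt,', 'übliche', 'üblichen', 'üblich', 'überzogen');
# alternative (3): strip e/en with optional punctuation, or bare punctuation
s/(?: en? [.,]? | [.,] )$//x for @terms;
print "$_\n" for @terms;
```

    'üblich' passes through unchanged, as intended, because the pattern
    requires at least an 'e' or a trailing punctuation mark.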

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Jun 10, 2006
    #11
    Bart Van der Donck wrote:

    > Francois Massion wrote:
    >
    > > Well, the issue is really a matter of pragmatism. If I do the work
    > > manually or with some VBA macros it will take ages. The situation I am
    > > trying to address is not so uncommon to people working on glossary
    > > issues. Therefore I am trying to find a language-independant solution
    > > which works, say, for 90% of the words.

    >
    > If you can afford such an error margin, here is a brute approach:
    >
    > #!perl
    > use strict; use warnings;
    > my $list =
    > "überzeugt
    > überzeugt,
    > überzogen
    > überzogen,
    > überzogen.
    > üblich
    > übliche
    > üblichen
    > üblicherweise";
    > my @terms = split /\n/, $list;
    > my $prev = 'nonesuch584685542256RANOM58544';


    This didn't modify the list. Maybe the reason is the $prev definition.


    > s/(\.|,|e|en|e,|en,|e\.|en\.)$// for @terms;


    I also tried Dr. Ruud's regex, but it would have to be rewritten for
    each language. Here is a Polish example list:
    zeliwa
    zeliwa.
    zeliwa,
    zeliwna
    zeliwnej
    zeliwny
    zeliwo
    zelu
    zurawia
    Zurawie
    zuraw


    > @terms = grep($_ ne $prev && ($prev = $_), sort @terms);
    > print $_."\n" for @terms;
    >
    > FWIW,
    >
    > --
    > Bart
     
    Francois Massion, Jun 10, 2006
    #12
  13. Francois Massion wrote:

    > [...]
    > > #!perl
    > > use strict; use warnings;
    > > my $list =
    > > "überzeugt
    > > überzeugt,
    > > überzogen
    > > überzogen,
    > > überzogen.
    > > üblich
    > > übliche
    > > üblichen
    > > üblicherweise";
    > > my @terms = split /\n/, $list;
    > > my $prev = 'nonesuch584685542256RANOM58544';

    >
    > This didn't modify the list.


    I didn't mean to modify $list; the new content is in @terms. If you
    want $list to contain the new words, you can use something like this
    at the end of the program:

    $list = join "\n", @terms;

    > Maybe the reason is the $prev definition.


    $prev has no direct importance here; it's only required that it not
    be present in @terms, because it is used to remove duplicate entries
    from @terms.

    > > s/(\.|,|e|en|e,|en,|e\.|en\.)$// for @terms;

    > I also tried Dr. Ruud's regex but it would have to be rewritten for
    > each language.


    That is correct, hence my thoughts about language files. My code is a
    very brute-force algorithm - it only strips the following from the
    end of each line:

    . , e en e, en, e. en.

    If you are planning to use this for different languages, you would
    obviously need to modify those patterns each time.
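    One way to keep those per-language patterns out of the main code is a
    hash of language => suffix regex. Only the 'de' entry below follows
    the pattern from this thread; the 'fr' and 'pl' suffix sets are
    hypothetical placeholders, not real morphology:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %suffix_re = (
    de => qr/(?: en? [.,]? | [.,] )$/x,    # Dr.Ruud's German pattern
    fr => qr/(?: e?s [.,]? | [.,] )$/x,    # hypothetical French set
    pl => qr/(?: [aeoy] [.,]? | [.,] )$/x, # hypothetical Polish set
);

sub strip_suffix {
    my ($lang, $word) = @_;
    $word =~ s/$suffix_re{$lang}// if exists $suffix_re{$lang};
    return $word;
}

print strip_suffix(de => 'üblichen'), "\n";
print strip_suffix(pl => 'zeliwa,'), "\n";
```

    Adding a language then only means adding one table entry, not
    touching the stripping code.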

    --
    Bart
     
    Bart Van der Donck, Jun 12, 2006
    #13
    I have come up with something which seems to work (partially). For my
    current purpose it'll do the trick, but any suggestion for optimization
    is welcome:

    (...)
    chomp(my $prev = <TERMS>);

    my @reducedlist = $prev;

    while ( <TERMS> ) {
        chomp;
        # I can set the maximum length of a suffix here
        push @reducedlist, $_
            unless /^$prev/ && length($_) - length($prev) < 3;
        $prev = $_;
    }
    print "$_\n" for @reducedlist;

    [Amazing to see how much time people can invest in a few lines of code
    when they are not professionals ;-)!]

    Francois

    Bart Van der Donck wrote:

    > Francois Massion wrote:
    >
    > > [...]
    > > > #!perl
    > > > use strict; use warnings;
    > > > my $list =
    > > > "überzeugt
    > > > überzeugt,
    > > > überzogen
    > > > überzogen,
    > > > überzogen.
    > > > üblich
    > > > übliche
    > > > üblichen
    > > > üblicherweise";
    > > > my @terms = split /\n/, $list;
    > > > my $prev = 'nonesuch584685542256RANOM58544';

    > >
    > > This didn't modify the list.

    >
    > I didn't mean to modify $list; the new content is in @terms. If you
    > want $list to contain the new words, you can use something like this at
    > the end of the program.
    >
    > $list = join "\n", @terms;
    >
    > > Maybe the reason is the $prev definition.

    >
    > $prev has no direct importance here, it's only required that it should
    > not be present in @terms, because it is used to delete double entries
    > from @terms.
    >
    > > > s/(\.|,|e|en|e,|en,|e\.|en\.)$// for @terms;

    > > I also tried Dr. Ruud's regex but it would have to be rewritten for
    > > each language.

    >
    > That is correct, hence my thoughts about language files. My code is a
    > very brute algorithm - it only strips out the following from the end of
    > each line:
    >
    > . , e en e, en, e. en.
    >
    > If you are planning to use this for different languages, you would
    > obviously need to modify those patterns each time.
    >
    > --
    > Bart
     
    Francois Massion, Jun 12, 2006
    #14
  15. Francois Massion

    Dr.Ruud Guest

    Francois Massion wrote:

    > I have come up with something which seems to work (partially). For my
    > current purpose it'll do the trick but any suggestion for optimization
    > is welcome:
    >
    > (...)
    > chomp(my $prev = <TERMS>);
    >
    > my @reducedlist = $prev;
    >
    > while ( <TERMS> ) {
    > chomp;
    > push @reducedlist, $_ unless (/^$prev/ && length ($_)-
    > length($prev)<3) ; # I can set the maximum length of a suffix here
    > $prev = $_;
    > }
    > print "$_\n" for @reducedlist;
    >
    > [Amazing to see how much time people can invest in a few lines of code
    > when they are no professionals ;-)! ]



    #!/usr/bin/perl
    use strict ;
    use warnings ;

    my @reducedlist ;
    my $prev = quotemeta '......' ;

    while ( <> )
    {
        s/\W+$// ; # chop-chop: strip trailing punctuation and newline

        next if / \A $prev .{0,3} \z /x ;

        push @reducedlist, $_ ;
        $prev = quotemeta ;
    }
    print "$_\n" for @reducedlist;

    (untested)
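    For the record, the same logic run on the sample list from the start
    of the thread, reading from an in-memory handle instead of <> so the
    check is self-contained:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $data = <<'END';
überzeugt
überzeugt,
überzogen
überzogen,
überzogen.
üblich
übliche
üblichen
üblicherweise
END
open my $in, '<', \$data or die "open: $!";

my @reducedlist;
my $prev = quotemeta '......';

while (<$in>) {
    s/\W+$//;                        # strip trailing punctuation and newline
    next if / \A $prev .{0,3} \z /x; # variant of the previous kept term
    push @reducedlist, $_;
    $prev = quotemeta;
}
print "$_\n" for @reducedlist;
```

    The quotemeta on $prev matters: it keeps any metacharacters in a term
    from being treated as part of the pattern.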

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Jun 12, 2006
    #15
