Francois Massion
Hi folks,
I am rather bad at Perl and would like some advice on the best
methodology to do the following:
I have a list of approx 20,000 terms extracted from a database. The
list is sorted alphabetically. The entries look like this:
überzeugt
überzeugt,
überzogen
überzogen,
überzogen.
üblich
übliche
üblichen
üblicherweise
I want to eliminate the variants of a basic word. In the example above
I want to end up with:
-überzeugt
-überzogen
-üblich
-üblicherweise
I have thought of the following:
(i) I read the list into a hash made of an index and the term
1 ==> überzeugt
2 ==> überzeugt,
etc.
(ii) I compare each term with the terms that follow it
(iii) if the following condition is met, I delete the following entry
(key+value) with "delete":
$term is a substring of the next term AND
the length difference is, say, below 3 (to avoid deleting
"üblicherweise", which is a different term)
I am not sure this is the right methodology. I don't much like the
idea of artificially creating the index list (1 ==> Term1).
I wonder if I should work with references, but they are something of a
black box to me.
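For illustration, steps (i) to (iii) might be sketched roughly as follows, using a plain array instead of an index hash. This is only a sketch under some assumptions: the list is already sorted and fits in memory, "substring" is narrowed to "prefix" (which is what a sorted list gives you), and a term is dropped when the last kept term is a prefix of it and it is fewer than 3 characters longer. The names dedupe_variants and $max_diff are made up for the example.

```perl
use strict;
use warnings;
use utf8;    # the source below contains literal umlauts

# Hypothetical helper: drop near-duplicate variants from a sorted list.
# $max_diff is the assumed length threshold (3 unless given).
sub dedupe_variants {
    my ($terms, $max_diff) = @_;
    $max_diff //= 3;
    my @kept;
    my $base;    # the last term that was kept
    for my $term (@$terms) {
        if (defined $base
            && index($term, $base) == 0                     # $base is a prefix of $term
            && length($term) - length($base) < $max_diff)   # and $term is only slightly longer
        {
            next;    # treat $term as a variant of $base and skip it
        }
        push @kept, $term;
        $base = $term;
    }
    return @kept;
}

my @terms = ('überzeugt', 'überzeugt,', 'überzogen', 'überzogen,', 'überzogen.',
             'üblich', 'übliche', 'üblichen', 'üblicherweise');

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for dedupe_variants(\@terms);
```

With the nine sample entries this prints überzeugt, überzogen, üblich and üblicherweise. Each term is compared with the last kept term rather than with its immediate predecessor, so "üblichen" is still measured against "üblich".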
Any comments are appreciated.
Francois