Text::Levenshtein and utf8 woes

Guest

Hello utf8 wizards,

My series of utf8-related problem reports continues. Please have a look
at the output of the following script:

#!/usr/bin/perl # uncommenting one or either of these
# binmode(STDOUT,":utf8"); # lines changes output significantly!
# use utf8;
use Text::Levenshtein qw(distance);
$lemma="Štein";# the first letter is capital S with hacek, or U+0160
@candidates=("stein","Stein","steïn","Steïn","štein","šteïn");
for $candidate (@candidates) {
print "$lemma -> $candidate: ".
distance($lemma,$candidate)."\n";

}

Please note the edit distances; apparently the Text::Levenshtein module
works bytewise and not characterwise. To make things even more complicated,
the return values of 'distance' change with the settings of the first
lines. Again, perl is v.5.8.5 on a Linux box, everything utf8-enabled.

Best regards,
Oliver.
 
Peter J. Holzer

> My series of utf8-related problem reports continues. Please have a
> look at the output of the following script:
>
> #!/usr/bin/perl # uncommenting one or either of these
> # binmode(STDOUT,":utf8"); # lines changes output significantly!
> # use utf8;

Well, that is to be expected. binmode determines how your strings are
printed, and utf8 determines the source charset of your script.
So if you change them, of course the output changes.

> use Text::Levenshtein qw(distance);
> $lemma="Štein";# the first letter is capital S with hacek, or U+0160
> @candidates=("stein","Stein","steïn","Steïn","štein","šteïn");

The source code is obviously in UTF-8, so you need the "use utf8" pragma.
If you omit it, $lemma will contain a string with 6 characters, the
first two of which are "Å" (char 0xC5, Latin capital letter A with ring
above) and " " (char 0xA0, Non-breaking space).
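
A minimal sketch of that difference, assuming the Encode module (which ships with Perl 5.8). The UTF-8 bytes are written as escapes so the result does not depend on how the file is saved. The undecoded string stands in for the literal as read without "use utf8", the decoded one for the literal as read with it. It should report length 6 with U+00C5 U+00A0 up front for the former, and length 5 starting with U+0160 for the latter:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of "Štein", spelled out byte by byte.
my $bytes = "\xC5\xA0tein";

# Undecoded, every byte counts as one character (the "no utf8 pragma" case).
# Decoded, the two bytes C5 A0 collapse into the single character U+0160.
my $chars = decode('UTF-8', $bytes);

printf "undecoded: length %d, first two chars U+%04X U+%04X\n",
    length($bytes), ord(substr($bytes, 0, 1)), ord(substr($bytes, 1, 1));
printf "decoded:   length %d, first char U+%04X\n",
    length($chars), ord(substr($chars, 0, 1));
```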

> Please note the edit distances; apparently the Text::Levenshtein
> module works bytewise and not characterwise.

This is not at all apparent to me. The distance between Štein and stein
is 1, as would be expected in character semantics. However, the distance
between Štein and šteïn is 4 (instead of 2), and the distance between
Štein and Šteïn is also 4 (instead of 1), which is quite puzzling.
Looking at the source-code I see that it does use character semantics,
but I don't really understand the code. The algorithm differs from the
one on the web page referenced in the POD, and I am unsure if this is a
valid optimization or a bug. Since it doesn't produce the expected
results, I am inclined to think it's a bug.

This also seems to be completely independent of whether you use UTF-8 or
non-ASCII characters. Consider the following changes to your script:

$lemma="stein";
@candidates=("xtein","sxein","stxan","stexn","steix");

All candidates differ by exactly 1 character from $lemma. Yet the script
prints:

stein -> xtein: 1
stein -> sxein: 2
stein -> stxan: 4
stein -> stexn: 4
stein -> steix: 1
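
For comparison, here is a rough sketch of the textbook dynamic-programming edit distance, working on characters. The name reference_distance is made up for illustration; this is not the code inside Text::Levenshtein, just a baseline to check its results against:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Levenshtein: classic
# row-by-row edit distance over characters.
sub reference_distance {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;

    # Distance from the empty prefix of $s to every prefix of $t.
    my @prev = (0 .. scalar(@t));
    for my $i (1 .. scalar(@s)) {
        my @cur = ($i);    # distance from first $i chars of $s to ""
        for my $j (1 .. scalar(@t)) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            my $best = $prev[$j - 1] + $cost;                      # match or substitute
            $best = $prev[$j] + 1    if $prev[$j] + 1    < $best;  # delete from $s
            $best = $cur[$j - 1] + 1 if $cur[$j - 1] + 1 < $best;  # insert into $s
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

for my $candidate (qw(xtein sxein stxan stexn steix)) {
    printf "stein -> %s: %d\n", $candidate,
        reference_distance('stein', $candidate);
}
```

Compared character by character, this gives 1 for xtein, sxein, stexn and steix, and 2 for stxan, which differs from stein in two positions.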

> To make things even more complicated, the return values of 'distance'
> change with the settings of the first lines.

Not "lines". They change exactly with the presence or absence of the
utf8 pragma, since this determines how your source code is parsed.

hp
 
Donald King

> Hello utf8 wizards,
>
> My series of utf8-related problem reports continues. Please have a look
> at the output of the following script:
>
> #!/usr/bin/perl # uncommenting one or either of these
> # binmode(STDOUT,":utf8"); # lines changes output significantly!
> # use utf8;
> use Text::Levenshtein qw(distance);
> $lemma="Štein";# the first letter is capital S with hacek, or U+0160
> @candidates=("stein","Stein","steïn","Steïn","štein","šteïn");
> for $candidate (@candidates) {
> print "$lemma -> $candidate: ".
> distance($lemma,$candidate)."\n";
>
> }
>
> Please note the edit distances; apparently the Text::Levenshtein module
> works bytewise and not characterwise. To make things even more complicated,
> the return values of 'distance' change with the settings of the first
> lines. Again, perl is v.5.8.5 on a Linux box, everything utf8-enabled.
>
> Best regards,
> Oliver.

Once I discombobulated your newsreader's "creative" idea of posting in
UTF-8, I uncommented the "use utf8" and "binmode" lines, ran the
program, and got the following result:

Štein -> stein: 1
Štein -> Stein: 1
Štein -> steïn: 4
Štein -> Steïn: 4
Štein -> štein: 1
Štein -> šteïn: 4

(I assume that's correct -- U+0160 S w/ caron, U+0161 s w/ caron, U+00EF
i w/ diaeresis.)

While those 4's are wrong, I think it's a general bug in
Text::Levenshtein unrelated to UTF-8, since similar all-ASCII strings
give the same results. (If the T, E, or I change, the result is 2, 3,
or 4 respectively. If the N changes, the result is 1.)

Anyways, as a debugging aid, you might try the following program:

#!/usr/bin/perl
use utf8;
printf "U+%04X\n", ord "Š";

Which should, of course, output "U+0160". If you get anything else,
something fishy is going on with your terminal settings or editor.
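
If a single code point is not enough to go on, a small variation can dump every character of a string. This is only a sketch: codepoints is a made-up helper name, and the test strings are written with escapes so the output does not depend on the editor or the file encoding:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: list the code point of each character in a string.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

# A proper character string: starts with the single character U+0160.
print codepoints("\x{160}tein"), "\n";

# The same text as undecoded UTF-8 bytes: starts with U+00C5 U+00A0.
print codepoints("\xC5\xA0tein"), "\n";
```

Running the same helper on a literal pasted into your own script shows which of the two cases your editor and pragma settings actually produce.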
 
Guest

: Once I discombobulated your newsreader's "creative" idea of posting in
: UTF-8, I uncommented the "use utf8" and "binmode" lines, ran the
: program, and got the following result:

I apologize for any inconvenience caused.

This is tin, running on a Unix box, seen via ssh in a Mac OS X terminal.
Where do I start tweaking?

: While those 4's are wrong, I think it's a general bug in
: Text::Levenshtein unrelated to UTF-8, since similar all-ASCII strings
: give the same results. (If the T, E, or I change, the result is 2, 3,
: or 4 respectively. If the N changes, the result is 1.)

Right; I had trouble with some pure-ASCII strings but since I always
had accented material in my data I never paid attention to the mangled
ASCII stuff :-(

: Anyways, as a debugging aid, you might try the following program:

: #!/usr/bin/perl
: use utf8;
: printf "U+%04X\n", ord "Š";

: Which should, of course, output "U+0160". If you get anything else,
: something fishy is going on with your terminal settings or editor.

: --
: Donald King, a.k.a. Chronos Tachyon
: http://chronos-tachyon.net/
 
