Normalize a Polish L

Peter Bengtsson

In UTF8, \u0141 is a capital L with a little dash through it as can be
seen in this image:
http://static.peterbe.com/lukasz.png

I tried this:
''

I was hoping it would convert it to 'L' because that's what it
visually looks like. And I've seen it becoming a normal ascii L before
in other programs such as Thunderbird.

I also tried the other forms: 'NFC', 'NFKC', 'NFD', and 'NFKD' but
none of them helped.

What am I doing wrong?
 
Thorsten Kampe

* Peter Bengtsson (Mon, 15 Oct 2007 16:33:26 -0000)
In UTF8, \u0141 is a capital L with a little dash through it as can be
seen in this image:
http://static.peterbe.com/lukasz.png
I tried this:
''

I was hoping it would convert it to 'L' because that's what it
visually looks like. And I've seen it becoming a normal ascii L before
in other programs such as Thunderbird.

The 'L' is actually pronounced like the English "w"...
I also tried the other forms: 'NFC', 'NFKC', 'NFD', and 'NFKD' but
none of them helped.
'0043 0327'
''
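
The two results above look like output from unicodedata.decomposition(). A minimal sketch of that comparison follows, written in Python 3 syntax (the thread predates it). Choosing U+00C7 (C with cedilla) as the character behind '0043 0327' is an assumption inferred from the code points; the calls that produced the output are not shown in the post.

```python
import unicodedata

# U+00C7 (C with cedilla) decomposes into a base letter plus a combining cedilla.
c_decomp = unicodedata.decomposition("\u00c7")
print(repr(c_decomp))   # '0043 0327'

# U+0141 (L with stroke) has no decomposition at all.
l_decomp = unicodedata.decomposition("\u0141")
print(repr(l_decomp))   # ''

# Normalizing and then discarding non-ASCII therefore keeps "C" but loses the L.
c_ascii = unicodedata.normalize("NFKD", "\u00c7").encode("ascii", "ignore")
l_ascii = unicodedata.normalize("NFKD", "\u0141").encode("ascii", "ignore")
print(c_ascii, l_ascii)  # b'C' b''
```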
 
Bjoern Schliessmann

Thorsten said:
The 'L' is actually pronounced like the English "w"...

'?' originally comes from "L" (<http://en.wikipedia.org/wiki/?>) and
is AFAIK transcribed so.

Also, a friend of mine writes himself "Lukas" (pronounced L-) even
though in Polish his name is ?ukas (short Wh-).

Regards,


Björn
 
Rob Wolfe

Peter Bengtsson said:
In UTF8, \u0141 is a capital L with a little dash through it as can be
seen in this image:
http://static.peterbe.com/lukasz.png

I tried this:
''

I was hoping it would convert it to 'L' because that's what it
visually looks like. And I've seen it becoming a normal ascii L before
in other programs such as Thunderbird.

I also tried the other forms: 'NFC', 'NFKC', 'NFD', and 'NFKD' but
none of them helped.

What am I doing wrong?

I had the same problem, and a little research showed that it is
caused by the Unicode standard itself. I don't know why,
but characters with a stroke don't have a canonical decomposition.
I looked into this file:
http://unicode.org/Public/UNIDATA/UnicodeData.txt

and compared two sets of entries:

1.
<UnicodeData.txt>
0142;LATIN SMALL LETTER L WITH STROKE;Ll;0;L;;;;;N;LATIN SMALL LETTER L SLASH \
;;0141;;0141
0141;LATIN CAPITAL LETTER L WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER L SLASH \
;;;0142;
</UnicodeData.txt>

2.
<UnicodeData.txt>
0105;LATIN SMALL LETTER A WITH OGONEK;Ll;0;L;0061 0328;;;;N;LATIN SMALL LETTER A OGONEK \
;;0104;;0104
</UnicodeData.txt>

In the second case the sixth field holds a canonical decomposition
(0061 0328), but in the first case that field is empty. I don't know
what the justification for that is, but there probably is one. ;)
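
To make the comparison concrete, here is a small sketch that pulls the sixth field out of the records quoted above. It assumes the usual semicolon-delimited layout of UnicodeData.txt and joins the wrapped lines by hand; the helper name decomposition_field is only illustrative.

```python
# Records copied from the post above, with the line continuations joined.
RECORDS = [
    "0141;LATIN CAPITAL LETTER L WITH STROKE;Lu;0;L;;;;;N;"
    "LATIN CAPITAL LETTER L SLASH;;;0142;",
    "0142;LATIN SMALL LETTER L WITH STROKE;Ll;0;L;;;;;N;"
    "LATIN SMALL LETTER L SLASH;;0141;;0141",
    "0105;LATIN SMALL LETTER A WITH OGONEK;Ll;0;L;0061 0328;;;;N;"
    "LATIN SMALL LETTER A OGONEK;;0104;;0104",
]

def decomposition_field(record):
    """Return (code point, decomposition) for one UnicodeData.txt record."""
    fields = record.split(";")
    return fields[0], fields[5]  # index 5 is the sixth field

decompositions = dict(decomposition_field(r) for r in RECORDS)
for code_point, decomp in decompositions.items():
    print(code_point, repr(decomp))
```

The two L WITH STROKE records come back with an empty string, while A WITH OGONEK yields the base letter plus the combining ogonek.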


Regards,
Rob
 
Thorsten Kampe

* Bjoern Schliessmann (Mon, 15 Oct 2007 21:51:54 +0200)
'?' originally comes from "L" (<http://en.wikipedia.org/wiki/?>) and
is AFAIK transcribed so.

There are lots of possible transcriptions for "LATIN CAPITAL LETTER L
WITH STROKE". Transcription is language dependent so the English and
German transcriptions of Polish names are different.
Also, a friend of mine writes himself "Lukas" (pronounced L-) even
though in Polish his name is ?ukas (short Wh-).

Why do you try to use characters in a character set that does not
contain these characters? That doesn't make any sense.


Thorsten
 
John Machin

In UTF8, \u0141 is a capital L with a little dash through it as can be
seen in this image: http://static.peterbe.com/lukasz.png

I tried this:
>>> import unicodedata

''

I was hoping it would convert it to 'L' because that's what it
visually looks like. And I've seen it becoming a normal ascii L before
in other programs such as Thunderbird.

I also tried the other forms: 'NFC', 'NFKC', 'NFD', and 'NFKD' but
none of them helped.

What am I doing wrong?

The character in question is NOT composed (in the way that Unicode
means) of an 'L' and a little slash; hence the concepts of
"normalization" and "decomposition" don't apply.

To "asciify" such text, you need to build a look-up table that suits
your purpose. unicodedata.decomposition() is (accidentally) useful in
providing *some* of the entries for such a table.
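
One way such a table might be put together is sketched below in Python 3. The seeding range, the hand-written entries for U+0141/U+0142, and the "?" placeholder for unmapped characters are all assumptions made for illustration, not part of the advice above.

```python
import unicodedata

def seed_from_decomposition(chars):
    """Collect char -> ASCII base letter wherever a canonical decomposition gives one."""
    table = {}
    for ch in chars:
        parts = unicodedata.decomposition(ch).split()
        # Skip characters with no decomposition and compatibility ones ("<compat> ...").
        if not parts or parts[0].startswith("<"):
            continue
        base = chr(int(parts[0], 16))
        if ord(base) < 128:
            table[ch] = base
    return table

# Seed from Latin-1 Supplement and Latin Extended-A, then add by hand what
# decomposition() cannot supply (these two mappings are a choice, not a standard).
TABLE = seed_from_decomposition(map(chr, range(0x00C0, 0x0180)))
TABLE.update({"\u0141": "L", "\u0142": "l"})

def asciify(text, table=TABLE):
    """Map each non-ASCII character through the table; '?' marks unmapped ones."""
    return "".join(ch if ord(ch) < 128 else table.get(ch, "?") for ch in text)

print(asciify("\u0141ukasz"))  # Lukasz
```

Characters the table does not cover come out as "?", which makes gaps visible instead of silently dropping letters.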
 
Bjoern Schliessmann

Thorsten said:
Why do you try to use characters in a character set that does not
contain these characters? That doesn't make any sense.

I thought KNode was smart enough to switch to UTF-8; obviously, it
isn't.

Regards,


Björn
 
Bjoern Schliessmann

Thorsten said:
The 'L' is actually pronounced like the English "w"...

'?' originally comes from "L" (<http://en.wikipedia.org/wiki/?>) and
is AFAIK transcribed so.

Also, a friend of mine writes himself "Lukas" (pronounced L-) even
though in Polish his name is Łukas (short Wh-).

Regards,


Björn
 
Peter Bengtsson

The character in question is NOT composed (in the way that Unicode
means) of an 'L' and a little slash; hence the concepts of
"normalization" and "decomposition" don't apply.

To "asciify" such text, you need to build a look-up table that suits
your purpose. unicodedata.decomposition() is (accidentally) useful in
providing *some* of the entries for such a table.

Thank you! That explains it.
 
Roberto Bonvallet

To "asciify" such text, you need to build a look-up table that suits
your purpose. unicodedata.decomposition() is (accidentally) useful in
providing *some* of the entries for such a table.

This is the only approach that can actually work, because every
language has different conventions on how to represent text without
diacritics.

For example, in Spanish, "ü" (u with umlaut) should be represented as
"u", but in German, it should be represented as "ue".

pingüino -> pinguino
Frühstück -> Fruehstueck

I'd like web applications (e.g. blogs) to take these conventions
into account when creating URLs from the title of an article.
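
A rough sketch of what language-dependent tables could look like, using only the conversions mentioned in this thread. The TABLES layout, the language codes, and the function name are hypothetical.

```python
# Per-language replacement tables; the same letter maps differently by language.
TABLES = {
    "es": {"\u00fc": "u"},                    # ü -> u
    "de": {"\u00fc": "ue", "\u00df": "ss"},   # ü -> ue, ß -> ss
}

def transliterate(text, lang):
    """Apply the replacement table for the given language code."""
    return text.translate(str.maketrans(TABLES[lang]))

print(transliterate("ping\u00fcino", "es"))      # pinguino
print(transliterate("Fr\u00fchst\u00fcck", "de"))  # Fruehstueck
```

Passing the language explicitly keeps the choice of convention out of the conversion code itself.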
 
Mike Orr

For example, in Spanish, "ü" (u with umlaut) should be represented as
"u", but in German, it should be represented as "ue".

pingüino -> pinguino
Frühstück -> Fruehstueck

I'd like web applications (e.g. blogs) to take these conventions
into account when creating URLs from the title of an article.

Well, that gets into official vs unofficial conversions. Does the
Spanish Academy really say 'ü' should be converted to 'u'? In
German, 'ü' -> 'ue' is an official standard used by Germans themselves.
In contrast, I've heard that Swedish unlike German prefers 'o' rather
than 'oe' for 'ö', and Norwegian prefers 'o' for 'ö', even though
they're all etymologically the same letter as the German 'ö'. Russian
has some four common ways to romanize/ASCII'ify their alphabet (sylniy
or sylnyj or silnii? schi or shchi? byt' or bit' -- the latter
creates a false homograph with bit'. s"yest'?) Yes, on my US-ASCII
keyboard I simply drop the accents unless I know there's a standard
conversion (German 'ß' to 'ss'). But whether that should be hardcoded
into a blog URL library is a different matter, and if it is there should
probably be plugin tables for different preferred standards.

--Mike
 
Roberto Bonvallet

Well, that gets into official vs unofficial conversions. Does the
Spanish Academy really say 'ü' should be converted to 'u'?

No, but it's the only conversion that makes sense. The only Spanish
letter without a commonly agreed conversion is 'ñ', which is usually
ASCIIfied as n, nn, gn, nh, ni, ny, ~n, n~, or N, all of which are
frequently seen on the Internet.
But whether that should be hardcoded
into a blog URL library is a different matter, and if it is there should
probably be plugin tables for different preferred standards.

Actually, there is a hardcoded conversion already: dropping all
accented letters altogether, which is IMHO the worst possible
convention. I have a gallery of pictures of Valparaíso and Viña del
Mar whose URL is .../ValparaSoViADelMar. And if I wrote a blog entry
about pingüinos and ñandúes, it would probably appear as
.../ping-inos-and-and-es. Ugly and off-topic :)
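
The difference between dropping accented letters and stripping only the accents can be sketched as follows. Both function names are made up for illustration, and mapping 'ñ' to 'n' is just one of the conventions listed above, not a recommendation.

```python
import re
import unicodedata

def slug_dropping(title):
    """Replace every run of characters outside ASCII letters/digits with a hyphen."""
    return re.sub(r"[^A-Za-z0-9]+", "-", title).strip("-").lower()

def slug_stripping_marks(title):
    """Decompose first, remove combining marks, then build the slug."""
    decomposed = unicodedata.normalize("NFD", title)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return slug_dropping(no_marks)

title = "ping\u00fcinos and \u00f1and\u00faes"
print(slug_dropping(title))         # ping-inos-and-and-es
print(slug_stripping_marks(title))  # pinguinos-and-nandues
```

Stripping marks still loses letters such as U+0141, which have no decomposition, so a look-up table remains necessary for those.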
 
