Normalize a Polish L

Discussion in 'Python' started by Peter Bengtsson, Oct 15, 2007.

  1. In Unicode, \u0141 is a capital L with a stroke through it, as can be
    seen in this image:
    http://static.peterbe.com/lukasz.png

    I tried this:
    >>> import unicodedata
    >>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
    ''

    I was hoping it would convert it to 'L' because that's what it
    visually looks like. I've also seen it become a plain ASCII L
    in other programs such as Thunderbird.

    I also tried the other forms, 'NFC', 'NFKC', and 'NFD', but
    none of them helped.

    What am I doing wrong?
    Peter Bengtsson, Oct 15, 2007
    #1

  2. * Peter Bengtsson (Mon, 15 Oct 2007 16:33:26 -0000)
    > In UTF8, \u0141 is a capital L with a little dash through it as can be
    > seen in this image:
    > http://static.peterbe.com/lukasz.png
    > I tried this:
    > >>> import unicodedata
    > >>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')

    > ''
    >
    > I was hoping it would convert it to 'L' because that's what it
    > visually looks like. And I've seen it becoming a normal ascii L before
    > in other programs such as Thunderbird.


    The 'L' is actually pronounced like the English "w"...

    > I also tried the other forms: 'NFC', 'NFKC', 'NFD', and 'NFKD' but
    > none of them helped.


    >>> unicodedata.decomposition(u'\N{LATIN CAPITAL LETTER C WITH CEDILLA}')
    '0043 0327'
    >>> unicodedata.normalize('NFKD', u'\N{LATIN CAPITAL LETTER C WITH CEDILLA}').encode('ascii','ignore')
    'C'
    >>> unicodedata.decomposition(u'\N{LATIN CAPITAL LETTER L WITH STROKE}')
    ''
    Thorsten Kampe, Oct 15, 2007
    #2

  3. Thorsten Kampe wrote:

    > The 'L' is actually pronounced like the English "w"...


    '?' originally comes from "L" (<http://en.wikipedia.org/wiki/?>) and
    is AFAIK transcribed so.

    Also, a friend of mine writes himself "Lukas" (pronounced L-) even
    though in Polish his name is ?ukas (short Wh-).

    Regards,


    Björn

    --
    BOFH excuse #126:

    it has Intel Inside
    Bjoern Schliessmann, Oct 15, 2007
    #3
  4. Peter Bengtsson writes:

    > In UTF8, \u0141 is a capital L with a little dash through it as can be
    > seen in this image:
    > http://static.peterbe.com/lukasz.png
    >
    > I tried this:
    > >>> import unicodedata
    > >>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
    > ''
    >
    > I was hoping it would convert it to 'L' because that's what it
    > visually looks like. And I've seen it becoming a normal ascii L before
    > in other programs such as Thunderbird.
    >
    > I also tried the other forms: 'NFC', 'NFKC', 'NFD', and 'NFKD' but
    > none of them helped.
    >
    > What am I doing wrong?


    I had the same problem, and a little research showed that it is
    caused by the Unicode standard itself. I don't know why, but
    characters with a stroke have no canonical decomposition.
    I looked into this file:
    http://unicode.org/Public/UNIDATA/UnicodeData.txt

    and compared two entries:

    1.
    <UnicodeData.txt>
    0142;LATIN SMALL LETTER L WITH STROKE;Ll;0;L;;;;;N;LATIN SMALL LETTER L SLASH \
    ;;0141;;0141
    0141;LATIN CAPITAL LETTER L WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER L SLASH \
    ;;;0142;
    </UnicodeData.txt>

    2.
    <UnicodeData.txt>
    0105;LATIN SMALL LETTER A WITH OGONEK;Ll;0;L;0061 0328;;;;N;LATIN SMALL LETTER A OGONEK \
    ;;0104;;0104
    </UnicodeData.txt>

    In the second entry the sixth field holds the canonical decomposition
    (0061 0328), but in the first pair of entries that field is empty. I
    don't know what the justification is, but there probably is one. ;)
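
    The comparison above can be reproduced in a few lines. This is only a
    minimal sketch, assuming Python 3 and the three records quoted above
    joined onto single lines; the helper name decomposition_field is made
    up for illustration.

```python
import unicodedata

# The three UnicodeData.txt records quoted above, joined onto single lines.
RECORDS = [
    "0141;LATIN CAPITAL LETTER L WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER L SLASH;;;0142;",
    "0142;LATIN SMALL LETTER L WITH STROKE;Ll;0;L;;;;;N;LATIN SMALL LETTER L SLASH;;0141;;0141",
    "0105;LATIN SMALL LETTER A WITH OGONEK;Ll;0;L;0061 0328;;;;N;LATIN SMALL LETTER A OGONEK;;0104;;0104",
]


def decomposition_field(record):
    """Return (code point, name, decomposition) from one record (hypothetical helper)."""
    fields = record.split(";")
    # The sixth field (index 5) holds the decomposition; empty means none.
    return fields[0], fields[1], fields[5]


parsed = [decomposition_field(r) for r in RECORDS]
for code, name, decomp in parsed:
    print(code, name, repr(decomp))
    # The unicodedata module exposes the same field for the same character.
    assert unicodedata.decomposition(chr(int(code, 16))) == decomp
```

    Both L-with-stroke records print an empty string, while the a-with-ogonek
    record prints '0061 0328', which is why NFKD has nothing to split off
    for \u0141.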


    Regards,
    Rob
    Rob Wolfe, Oct 15, 2007
    #4
  5. * Bjoern Schliessmann (Mon, 15 Oct 2007 21:51:54 +0200)
    > Thorsten Kampe wrote:
    > > The 'L' is actually pronounced like the English "w"...

    >
    > '?' originally comes from "L" (<http://en.wikipedia.org/wiki/?>) and
    > is AFAIK transcribed so.


    There are lots of possible transcriptions for "LATIN CAPITAL LETTER L
    WITH STROKE". Transcription is language dependent so the English and
    German transcriptions of Polish names are different.

    > Also, a friend of mine writes himself "Lukas" (pronounced L-) even
    > though in Polish his name is ?ukas (short Wh-).


    Why do you try to use characters in a character set that does not
    contain these characters? That doesn't make any sense.


    Thorsten
    Thorsten Kampe, Oct 15, 2007
    #5
  6. On Oct 16, 2:33 am, Peter Bengtsson wrote:
    > In UTF8, \u0141 is a capital L with a little dash through it as can be
    > seen in this image: http://static.peterbe.com/lukasz.png
    >
    > I tried this:
    > >>> import unicodedata
    > >>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
    > ''
    >
    > I was hoping it would convert it to 'L' because that's what it
    > visually looks like. And I've seen it becoming a normal ascii L before
    > in other programs such as Thunderbird.
    >
    > I also tried the other forms: 'NFC', 'NFKC', 'NFD', and 'NFKD' but
    > none of them helped.
    >
    > What am I doing wrong?


    The character in question is NOT composed (in the way that Unicode
    means) of an 'L' and a little slash; hence the concepts of
    "normalization" and "decomposition" don't apply.

    To "asciify" such text, you need to build a look-up table that suits
    your purpose. unicodedata.decomposition() is (accidentally) useful in
    providing *some* of the entries for such a table.
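
    One way to sketch that idea, assuming Python 3: a small hand-written
    table covers the letters that have no decomposition, and NFKD supplies
    the rest. The table contents and the name asciify are illustrative
    assumptions, not a standard; pick the entries that suit your purpose.

```python
import unicodedata

# Hand-maintained entries for letters that have no Unicode decomposition.
# These particular choices (e.g. L-with-stroke -> L) are assumptions.
MANUAL = {
    "\u0141": "L",   # LATIN CAPITAL LETTER L WITH STROKE
    "\u0142": "l",   # LATIN SMALL LETTER L WITH STROKE
    "\u00d8": "O",   # LATIN CAPITAL LETTER O WITH STROKE
    "\u00f8": "o",   # LATIN SMALL LETTER O WITH STROKE
    "\u00df": "ss",  # LATIN SMALL LETTER SHARP S
}


def asciify(text):
    """Map text to ASCII: manual table first, then NFKD with marks dropped."""
    out = []
    for ch in text:
        if ch in MANUAL:
            out.append(MANUAL[ch])
        else:
            # Decompose, then discard whatever still is not ASCII
            # (typically the combining marks).
            decomposed = unicodedata.normalize("NFKD", ch)
            out.append(decomposed.encode("ascii", "ignore").decode("ascii"))
    return "".join(out)


print(asciify("\u0141ukasz"))  # Lukasz
print(asciify("\u00c7a"))      # Ca
```

    Characters that are neither in the table nor decomposable are silently
    dropped here; replacing them with a visible placeholder instead makes
    gaps in the table easier to notice.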
    John Machin, Oct 15, 2007
    #6
  7. Thorsten Kampe wrote:

    > Why do you try to use characters in a character set that does not
    > contain these characters? That doesn't make any sense.


    I thought KNode was smart enough to switch to UTF-8; obviously, it
    isn't.

    Regards,


    Björn

    --
    BOFH excuse #121:

    halon system went off and killed the operators.
    Bjoern Schliessmann, Oct 15, 2007
    #7
  8. Thorsten Kampe wrote:

    > The 'L' is actually pronounced like the English "w"...


    'Ł' originally comes from "L" (<http://en.wikipedia.org/wiki/Ł>) and
    is AFAIK transcribed so.

    Also, a friend of mine writes himself "Lukas" (pronounced L-) even
    though in Polish his name is Łukas (short Wh-).

    Regards,


    Björn

    --
    BOFH excuse #126:

    it has Intel Inside
    Bjoern Schliessmann, Oct 15, 2007
    #8
  9. On Oct 15, 10:57 pm, John Machin wrote:
    > On Oct 16, 2:33 am, Peter Bengtsson wrote:
    >
    > > In UTF8, \u0141 is a capital L with a little dash through it as can be
    > > seen in this image: http://static.peterbe.com/lukasz.png
    > >
    > > I tried this:
    > > >>> import unicodedata
    > > >>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
    > > ''
    > >
    > > I was hoping it would convert it to 'L' because that's what it
    > > visually looks like. And I've seen it becoming a normal ascii L before
    > > in other programs such as Thunderbird.
    > >
    > > I also tried the other forms: 'NFC', 'NFKC', 'NFD', and 'NFKD' but
    > > none of them helped.
    > >
    > > What am I doing wrong?

    >
    > The character in question is NOT composed (in the way that Unicode
    > means) of an 'L' and a little slash; hence the concepts of
    > "normalization" and "decomposition" don't apply.
    >
    > To "asciify" such text, you need to build a look-up table that suits
    > your purpose. unicodedata.decomposition() is (accidentally) useful in
    > providing *some* of the entries for such a table.


    Thank you! That explains it.
    Peter Bengtsson, Oct 16, 2007
    #9
  10. On Oct 15, 6:57 pm, John Machin wrote:
    > To "asciify" such text, you need to build a look-up table that suits
    > your purpose. unicodedata.decomposition() is (accidentally) useful in
    > providing *some* of the entries for such a table.


    This is the only approach that can actually work, because every
    language has different conventions on how to represent text without
    diacritics.

    For example, in Spanish, "ü" (u with umlaut) should be represented as
    "u", but in German, it should be represented as "ue".

    pingüino -> pinguino
    Frühstück -> Fruehstueck

    I'd like web applications (e.g. blogs) to take these conventions
    into account when creating URLs from the title of an article.
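
    The two examples above could be handled with one override table per
    language, applied before the generic mark-stripping step. A rough
    sketch, assuming Python 3; the language codes, the table contents and
    the name transliterate are illustrative assumptions.

```python
import unicodedata

# Per-language overrides applied before the generic fallback.
# The entries reflect the conventions mentioned in this thread only.
OVERRIDES = {
    "de": {
        "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",  # a, o, u with umlaut
        "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",  # capital forms
        "\u00df": "ss",                                   # sharp s
    },
    "es": {},  # no overrides: the diacritic is simply dropped
}


def transliterate(text, lang):
    """Apply language-specific replacements, then strip remaining marks."""
    # Compose first so that "u" + combining diaeresis matches the table keys.
    text = unicodedata.normalize("NFC", text)
    text = text.translate(str.maketrans(OVERRIDES.get(lang, {})))
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(transliterate("ping\u00fcino", "es"))         # pinguino
print(transliterate("Fr\u00fchst\u00fcck", "de"))  # Fruehstueck
```

    The same input letter gives different output depending on the language
    argument, so the caller has to know (or guess) the language of the text.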
    --
    Roberto Bonvallet
    Roberto Bonvallet, Oct 16, 2007
    #10
  11. On Oct 16, 9:51 am, Roberto Bonvallet wrote:
    > For example, in Spanish, "ü" (u with umlaut) should be represented as
    > "u", but in German, it should be represented as "ue".
    >
    > pingüino -> pinguino
    > Frühstück -> Fruehstueck
    >
    > I'd like that web applications (e.g. blogs) took into account these
    > conventions when creating URLs from the title of an article.


    Well, that gets into official vs. unofficial conversions. Does the
    Spanish Academy really say 'ü' should be converted to 'u'? In
    German, 'ü' -> 'ue' is an official standard used by Germans themselves.
    In contrast, I've heard that Swedish, unlike German, prefers 'o' rather
    than 'oe' for 'ö', and Norwegian prefers 'o' for 'ø', even though
    they're all etymologically the same letter as the German 'ö'. Russian
    has some four common ways to romanize/ASCII'ify its alphabet (sylniy
    or sylnyj or silnii? schi or shchi? byt' or bit' -- the latter
    creates a false homograph with bit'. s"yest'?). Yes, on my US-ASCII
    keyboard I simply drop the accents unless I know there's a standard
    conversion (German 'ß' to 'ss'). But whether that should be hardcoded
    into a blog URL library is a different matter, and if it is, there should
    probably be plugin tables for different preferred standards.

    --Mike
    Mike Orr, Oct 22, 2007
    #11
  12. On Oct 22, 7:50 pm, Mike Orr wrote:
    > Well, that gets into official vs unofficial conversions. Does the
    > Spanish Academy really say 'ü' should be converted to 'u'?


    No, but it's the only conversion that makes sense. The only Spanish
    letter that doesn't have a standard common conversion by convention
    is 'ñ', which is usually ASCIIfied as n, nn, gn, nh, ni, ny, ~n, n~,
    or N, with all of them being frequently seen on the Internet.

    > But whether that should be hardcoded
    > into a blog URL library is different matter, and if it is there should
    > probably be plugin tables for different preferred standards.


    Actually there is a hardcoded conversion: dropping all accented
    letters altogether, which is IMHO the worst possible convention. I
    have a gallery of pictures of Valparaíso and Viña del Mar whose URL
    is .../ValparaSoViADelMar. And if I wrote a blog entry about
    pingüinos and ñandúes, it would probably appear as
    .../ping-inos-and-and-es. Ugly and off-topic :)
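
    The difference between the two behaviours can be sketched as follows,
    assuming Python 3. Neither function is taken from any real blog
    engine; both names are made up for illustration.

```python
import re
import unicodedata


def slug_dropping(title):
    """The convention criticised above: non-ASCII letters become separators."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def slug_stripping_marks(title):
    """Decompose first, so that only the combining marks are lost."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")


title = "ping\u00fcinos and \u00f1and\u00faes"
print(slug_dropping(title))         # ping-inos-and-and-es
print(slug_stripping_marks(title))  # pinguinos-and-nandues
```

    Letters without a decomposition, such as \u0141, would still vanish in
    the second variant unless a look-up table handles them first.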

    --
    Roberto Bonvallet
    Roberto Bonvallet, Oct 23, 2007
    #12