Re: Unicode support

Discussion in 'Python' started by Richy2004, Aug 6, 2004.

  1. Richy2004

    Richy2004 Guest

    code:
    import sys,codecs
    file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
    print (file.readline())

    output:
    File "./test.py", line 5, in ?
    print (file.readline())
    File "C:\Python23\lib\codecs.py", line 384, in readline
    return self.reader.readline(size)
    File "c:\Python23\lib\encodings\utf_16.py", line 57, in readline
    raise NotImplementedError, '.readline() is not implemented for
    UTF-16'
    NotImplementedError: .readline() is not implemented for UTF-16

    ======================================================
    code:
    import sys, codecs
    file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
    print (file.read())

    output:
    Traceback (most recent call last):
    File "./test.py", line 5, in ?
    print (file.read())
    File "c:\Python23\lib\encodings\cp850.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
    UnicodeEncodeError: 'charmap' codec can't encode characters in position
    0-2: character maps to <undefined>

    ======================================================
    code:
    import sys, codecs
    file = codecs.open("accountmgr_words_arb.txt", "rb", "utf-16")
    lines = file.readlines()
    print lines

    this works !, output:
    [u'\u0646\u0648\u0639 \u062d\u0633\u0627\u0628 \u062c\u062f\u064a\u062f
    \u0645\u062e\u062a\u0627\u0631.\r\n']

    if I add these lines:
    line = lines[0]
    tokens = line.split("\\u")
    print tokens[0]

    I get this: :(
    Traceback (most recent call last):
    File "./test.py", line 8, in ?
    print tokens[0]
    File "c:\Python23\lib\encodings\cp850.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
    UnicodeEncodeError: 'charmap' codec can't encode characters in position
    0-2: character maps to <undefined>

    Thanks,
    Richard
     
    Richy2004, Aug 6, 2004
    #1

  2. Richy2004 wrote:

    > code:
    > import sys,codecs
    > file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
    > print (file.readline())
    >
    > output:
    > File "./test.py", line 5, in ?
    > print (file.readline())
    > File "C:\Python23\lib\codecs.py", line 384, in readline
    > return self.reader.readline(size)
    > File "c:\Python23\lib\encodings\utf_16.py", line 57, in readline
    > raise NotImplementedError, '.readline() is not implemented for
    > UTF-16'
    > NotImplementedError: .readline() is not implemented for UTF-16
    >
    > ======================================================
    > code:
    > import sys, codecs
    > file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
    > print (file.read())
    >
    > output:
    > Traceback (most recent call last):
    > File "./test.py", line 5, in ?
    > print (file.read())
    > File "c:\Python23\lib\encodings\cp850.py", line 18, in encode
    > return codecs.charmap_encode(input,errors,encoding_map)
    > UnicodeEncodeError: 'charmap' codec can't encode characters in position
    > 0-2: character maps to <undefined>
    >
    > ======================================================
    > code:
    > import sys, codecs
    > file = codecs.open("accountmgr_words_arb.txt", "rb", "utf-16")
    > lines = file.readlines()
    > print lines


    > this works !, output:
    > [u'\u0646\u0648\u0639 \u062d\u0633\u0627\u0628 \u062c\u062f\u064a\u062f
    > \u0645\u062e\u062a\u0627\u0631.\r\n']


    You understand this is just one line, and not multiple lines? Just
    checking. The reason why it works is that printing a list prints the
    repr of each item, and the repr uses only ASCII \uXXXX escapes.

    > line = lines[0]
    > tokens = line.split("\\u")

    This line doesn't make sense. Do you want to split up the line into a
    list of individual characters as in:
    >> tokens = list(lines[0])
    >> print tokens

    [u'\u0646', u'\u0648', u'\u0639', u'\u062d', u'\u0633', u'\u0627',
    u'\u0628', u'\u062c', u'\u062f', u'\u064a', u'\u062f', u'\u0645',
    u'\u062e', u'\u062a', u'\u0627', u'\u0631', u'.', u'\r', u'\n']
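
A minimal sketch of why the split has no effect, written in Python 3 syntax (an assumption; the thread itself uses Python 2.3). The sample string is just the first two words of the line quoted above. The `\uXXXX` sequences appear only in the repr; the string itself contains no backslash, so there is nothing for `split("\\u")` to match.

```python
# First two words of the sample line from the thread (Python 3 syntax assumed).
line = "\u0646\u0648\u0639 \u062d\u0633\u0627\u0628\r\n"

# The string holds real Arabic characters, not a backslash followed by "u",
# so splitting on the two-character text "\u" finds nothing to split on.
tokens = line.split("\\u")
unchanged = (tokens == [line])

# Splitting on whitespace yields words; list() yields individual characters.
words = line.split()
chars = list(line.strip())

# Only ASCII is printed here, so this is safe on any console.
print(unchanged, len(words), len(chars))
```

If the goal was one token per word, `line.split()` is probably what was wanted; `list(line)` gives one token per character, including spaces and line endings.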


    > print tokens[0]
    >
    > I get this: :(
    > Traceback (most recent call last):
    > File "./test.py", line 8, in ?
    > print tokens[0]
    > File "c:\Python23\lib\encodings\cp850.py", line 18, in encode
    > return codecs.charmap_encode(input,errors,encoding_map)
    > UnicodeEncodeError: 'charmap' codec can't encode characters in position
    > 0-2: character maps to <undefined>


    Anyway, you are trying to print to the console window. AFAIK, Python 2.3
    guesses the console encoding, which in your case is cp850, and uses
    that single-byte encoding to encode your Unicode characters before
    writing them to stdout. Unfortunately, you cannot print what I believe
    are Arabic characters to a CP850-encoded console (as a matter of fact,
    you can't print any of the so-called 'complex scripts' to any Windows
    console, but that is a different matter).

    If you run the same script in, let's say, IDLE, you won't have that
    problem. In other words, if you need to print these characters, you have
    to either print them as Unicode characters to a Unicode-savvy output, or
    encode them in an appropriate single-byte encoding (e.g. "cp1256") and
    send them to an output window that knows how to deal with it.
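
The encoding point can be checked without a console at all. The sketch below uses Python 3 syntax (an assumption; the thread uses Python 2.3), and `can_encode` is a hypothetical helper written for illustration, not a library function. It tests whether the first word of the sample line fits in cp850 and in cp1256, then shows an escape-based fallback that never raises.

```python
text = "\u0646\u0648\u0639"  # first word from the thread's sample line

def can_encode(s, encoding):
    """Hypothetical helper: True if every character of s exists in encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

fits_cp850 = can_encode(text, "cp850")    # Western European DOS code page
fits_cp1256 = can_encode(text, "cp1256")  # Windows Arabic code page

# Fallback that never raises: unencodable characters become \uXXXX escapes.
safe = text.encode("cp850", errors="backslashreplace").decode("cp850")

print(fits_cp850, fits_cp1256, safe)
```

Encoding to cp1256 only helps if whatever displays the bytes also interprets them as cp1256; on a cp850 console they would show up as unrelated characters.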

    --
    Vincent Wehren
    >
    > Thanks,
    > Richard
    >
     
    vincent wehren, Aug 6, 2004
    #2

  3. Richy2004 wrote:
    > NotImplementedError: .readline() is not implemented for UTF-16


    As it says: this is, unfortunately, not implemented. Use readlines
    instead.
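
For comparison, a sketch of line-by-line reading in Python 3 (an assumption; the limitation above concerns the Python 2.3 codec). The built-in `open` in text mode decodes first and splits lines afterwards, so `readline` works with UTF-16. The file name and its contents are made up for the example.

```python
import os
import tempfile

# Create a small UTF-16 file to stand in for the poster's data file.
path = os.path.join(tempfile.mkdtemp(), "sample_utf16.txt")
with open(path, "w", encoding="utf-16", newline="") as f:
    f.write("\u0646\u0648\u0639\r\nsecond line\r\n")

# Text mode decodes UTF-16 (honouring the BOM) and splits lines after decoding;
# with the default newline handling, "\r\n" is returned as "\n".
with open(path, "r", encoding="utf-16") as f:
    first = f.readline()
    rest = f.readlines()

# ascii() keeps the output printable on any console.
print(ascii(first), len(rest))
```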

    > print (file.read())

    [...]
    > UnicodeEncodeError: 'charmap' codec can't encode characters in position
    > 0-2: character maps to <undefined>


    The .read works perfectly. Don't try to print it, though!
    You can only print when the terminal actually supports the characters,
    which your terminal doesn't. Try

    print repr(file.read())

    instead.
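
A short sketch of the same idea in Python 3 syntax (an assumption; in Python 3, `repr` keeps printable non-ASCII characters, so `ascii` is the closer equivalent of Python 2's `repr` on a unicode string). The result contains only ASCII, so any terminal can show it.

```python
text = "\u0646\u0648\u0639"  # first word from the thread's sample line

# ascii() returns a quoted, escaped form that contains only ASCII characters.
escaped = ascii(text)
only_ascii = all(ord(c) < 128 for c in escaped)

print(escaped, only_ascii)
```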

    > print tokens[0]

    [...]
    > UnicodeEncodeError: 'charmap' codec can't encode characters in position
    > 0-2: character maps to <undefined>


    Same issue: As Vincent explains, you can't print ARABIC LETTER NOON
    to your terminal, as your terminal simply cannot display that character.
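
To see which character the traceback is complaining about, the standard `unicodedata` module can look up its name. This is a sketch in Python 3 syntax (an assumption; the thread uses Python 2.3), using the first character of the sample line.

```python
import unicodedata

ch = "\u0646"  # first character of the sample line

# Look up the official Unicode name and the code point in hex.
name = unicodedata.name(ch)
codepoint = "U+%04X" % ord(ch)

print(codepoint, name)
```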

    Regards,
    Martin
     
    Martin v. Löwis, Aug 6, 2004
    #3
  4. On 6 Aug 2004 07:57:44 -0700, Richy2004 wrote:
    > code:
    > import sys,codecs
    > file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
    > print (file.readline())
    >
    > output:
    > File "./test.py", line 5, in ?
    > print (file.readline())
    > File "C:\Python23\lib\codecs.py", line 384, in readline
    > return self.reader.readline(size)
    > File "c:\Python23\lib\encodings\utf_16.py", line 57, in readline
    > raise NotImplementedError, '.readline() is not implemented for
    > UTF-16'
    > NotImplementedError: .readline() is not implemented for UTF-16
    >


    UTF-16 readline is supported by CJKCodecs 1.1. :)

    >>> import codecs
    >>> codecs.open("u16test", "r", "cjkcodecs.utf-16")

    <open file 'u16test', mode 'rb' at 0x81ab7e0>
    >>> _.readline()

    u'\u25ce \ud30c\uc774\uc36c(Python)\uc740 \ubc30\uc6b0\uae30
    \uc27d\uace0, \uac15\ub825\ud55c \ud504\ub85c\uadf8\ub798\ubc0d
    \uc5b8\uc5b4\uc785\ub2c8\ub2e4. \ud30c\uc774\uc36c\uc740\n'


    Hye-Shik
     
    Hye-Shik Chang, Aug 7, 2004
    #4
