Help with Unicode

Discussion in 'Python' started by self.python, Jul 3, 2012.

  1. self.python

    self.python Guest

    It's a simple source-view program.

    The target website is encoded in UTF-8,
    so I read it and print the decoded text.

    --------------------------------------------------------------
    #-*-coding:utf8-*-
    import urllib2

    rf=urllib2.urlopen(r"http://gall.dcinside.com/list.php?id=programming")

    print rf.read().decode('utf-8')

    raw_input()
    ---------------------------------------------------------------

    It works fine in the Python shell,

    but when I save it as the file "wrong.py" and run it,
    an error is raised.

    ----------------------------------------------------------------
    Traceback (most recent call last):
      File "C:\wrong.py", line 8, in <module>
        print rf.read().decode('utf-8')
    UnicodeEncodeError: 'cp949' codec can't encode character u'\u1368' in position 55122: illegal multibyte sequence
    ---------------------------------------------------------------------

    cp949 is the default codec of sys.stdout and cmd.exe,
    but I have no idea why it doesn't work.
    Printing without decode('utf-8') works fine in IDLE, but in cmd it prints broken characters (the ASCII portion is still fine; the problem is only with the Korean).

    The question may look silly :(
    but I want to know what the problem is, or how to print the strings without breaking them.

    thanks for reading.
     
    self.python, Jul 3, 2012
    #1

  2. self.python

    Andrew Berg Guest

    On 7/2/2012 7:49 PM, self.python wrote:
    > ----------------------------------------------------------------
    > Traceback (most recent call last):
    >   File "C:\wrong.py", line 8, in <module>
    >     print rf.read().decode('utf-8')
    > UnicodeEncodeError: 'cp949' codec can't encode character u'\u1368' in position 55122: illegal multibyte sequence
    > ---------------------------------------------------------------------
    >
    > cp949 is the default codec of sys.stdout and cmd.exe,
    > but I have no idea why it doesn't work.
    > Printing without decode('utf-8') works fine in IDLE, but in cmd it prints broken characters (the ASCII portion is still fine; the problem is only with the Korean).

    Your terminal can't display those characters. You could try using other
    code pages with chcp (a CLI utility that is part of Windows). IDLE is a
    GUI, so it does not have to work with code pages.

    Python 3.3 supports cp65001 (which is the equivalent of UTF-8 for
    Windows terminals), but unfortunately, previous versions do not.
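
To check which encoding Python will actually use for the console (for example before and after switching code pages with chcp), a minimal sketch along these lines may help. The helper name describe_output_encoding is hypothetical and not from the thread; it only reads sys.stdout.encoding and locale.getpreferredencoding() from the standard library.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import locale
import sys


def describe_output_encoding(stream=None):
    """Return (stream encoding, locale preferred encoding) as a tuple.

    Hypothetical helper for illustration only.
    """
    stream = sys.stdout if stream is None else stream
    # The attribute can be missing or None, e.g. when output is redirected.
    stream_encoding = getattr(stream, "encoding", None)
    return stream_encoding, locale.getpreferredencoding()


# Both values are plain ASCII names, so printing them is safe on any console.
print(describe_output_encoding())
```

On a Korean Windows console the first value would be expected to be 'cp949', matching the codec named in the traceback.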
    --
    CPython 3.3.0a4 | Windows NT 6.1.7601.17803
     
    Andrew Berg, Jul 3, 2012
    #2

  3. self.python

    MRAB Guest

    On 03/07/2012 01:49, self.python wrote:
    > It's a simple source-view program.
    >
    > The target website is encoded in UTF-8,
    > so I read it and print the decoded text.
    >
    > --------------------------------------------------------------
    > #-*-coding:utf8-*-
    > import urllib2
    >
    > rf=urllib2.urlopen(r"http://gall.dcinside.com/list.php?id=programming")
    >
    > print rf.read().decode('utf-8')
    >
    > raw_input()
    > ---------------------------------------------------------------
    >
    > It works fine in the Python shell,
    >
    > but when I save it as the file "wrong.py" and run it,
    > an error is raised.
    >
    > ----------------------------------------------------------------
    > Traceback (most recent call last):
    >   File "C:\wrong.py", line 8, in <module>
    >     print rf.read().decode('utf-8')
    > UnicodeEncodeError: 'cp949' codec can't encode character u'\u1368' in position 55122: illegal multibyte sequence
    > ---------------------------------------------------------------------
    >
    > cp949 is the default codec of sys.stdout and cmd.exe,
    > but I have no idea why it doesn't work.
    > Printing without decode('utf-8') works fine in IDLE, but in cmd it prints broken characters (the ASCII portion is still fine; the problem is only with the Korean).
    >
    > The question may look silly :(
    > but I want to know what the problem is, or how to print the strings without breaking them.
    >
    > thanks for reading.
    >

    The encoding of your console is 'cp949', so when you try to print the
    Unicode string, Python tries to encode it as 'cp949'.

    Unfortunately, the character (actually, when talking about Unicode the
    correct term is 'codepoint') u'\u1368' cannot be encoded into 'cp949'
    because that codepoint does not exist in that encoding, in the same way
    that ASCII doesn't have Korean characters.

    So what is that codepoint?

    >>> import unicodedata
    >>> unicodedata.name(u'\u1368')
    'ETHIOPIC PARAGRAPH SEPARATOR'

    Apparently 'cp949', which is for the Korean language, doesn't support
    Ethiopic codepoints. Somehow that doesn't surprise me! :)
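
The same point can be checked by trying to encode sample text. This is only a sketch: the helper can_encode is hypothetical, and the Hangul sample word is an assumption chosen for illustration, not text taken from the website in question.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import unicodedata


def can_encode(text, encoding):
    """Return True if every codepoint in text can be encoded with encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


hangul = u'\ud55c\uae00'   # two Hangul syllables, sample text for illustration
ethiopic = u'\u1368'       # the codepoint named in the traceback

for sample in (hangul, ethiopic):
    # Print only ASCII (the codepoint names), so this is safe on any console.
    names = [unicodedata.name(ch) for ch in sample]
    print(names, can_encode(sample, 'cp949'))
```

The Hangul sample encodes to cp949 without trouble, while the Ethiopic codepoint raises the same UnicodeEncodeError seen in the traceback.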
     
    MRAB, Jul 3, 2012
    #3
  4. self.python

    Terry Reedy Guest

    On 7/2/2012 8:49 PM, self.python wrote:
    > It's a simple source-view program.
    >
    > The target website is encoded in UTF-8,
    > so I read it and print the decoded text.


    which re-encodes before printing

    > --------------------------------------------------------------
    > #-*-coding:utf8-*-
    > import urllib2
    >
    > rf=urllib2.urlopen(r"http://gall.dcinside.com/list.php?id=programming")
    >
    > print rf.read().decode('utf-8')
    >
    > raw_input()
    > ---------------------------------------------------------------
    >
    > It works fine in the Python shell,


    Do you mean the Windows Command Prompt shell?
    >
    > but when I save it as the file "wrong.py" and run it,
    > an error is raised.
    >
    > ----------------------------------------------------------------
    > Traceback (most recent call last):
    >   File "C:\wrong.py", line 8, in <module>
    >     print rf.read().decode('utf-8')
    > UnicodeEncodeError: 'cp949' codec can't encode character u'\u1368' in position 55122: illegal multibyte sequence
    > ---------------------------------------------------------------------
    >
    > cp949 is the default codec of sys.stdout and cmd.exe,
    > but I have no idea why it doesn't work.


    cp949 is a Korean multibyte encoding (Microsoft's extension of EUC-KR) whose mapping is given at
    http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP949.TXT
    u'\u1368' is not in the mapping. There is no reason the utf-8 site would
    restrict itself to the cp949 subset.

    Perhaps it prints in the interpreter because 2.x uses errors = 'replace'
    rather than 'strict' (as in 3.x).

    Try print rf.read().decode('utf-8').encode('cp949', errors = 'replace')
    Non-cp949 chars will print as '?'.
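
A small sketch of that suggestion, assuming a cp949 console. The helper encode_for_console and the sample string are hypothetical. The 'backslashreplace' handler is a standard alternative shown for comparison; it was not proposed in the thread.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def encode_for_console(text, encoding='cp949', errors='replace'):
    """Encode text, substituting whatever the codec cannot represent."""
    # errors is passed positionally, which works on old and new Pythons alike.
    return text.encode(encoding, errors)


# Sample text: ASCII, two Hangul syllables, and the Ethiopic codepoint.
page = u'ok \ud55c\uae00 \u1368'

replaced = encode_for_console(page)                            # '?' for U+1368
escaped = encode_for_console(page, errors='backslashreplace')  # literal \u1368

# repr() keeps the output ASCII-only regardless of the console encoding.
print(repr(replaced))
print(repr(escaped))
```

With 'replace' the information about which character was lost disappears; 'backslashreplace' keeps it visible as an escape sequence, which can be easier to debug.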

    > Printing without decode('utf-8') works fine in IDLE


    because IDLE encodes to utf-8, and x.decode('utf-8').encode('utf-8') == x

    > but in cmd it prints broken characters


    Printing utf-8 encoded bytes as if cp949 encoded bytes is pretty hilarious
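
Both observations can be reproduced without a console. This is a sketch under assumptions: the Hangul sample and the ASCII fragment are invented for illustration, and 'replace' is used on decode only so that byte sequences invalid in cp949 do not raise.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

hangul = u'\ud55c\uae00'            # sample Korean text (an assumption)
utf8_bytes = hangul.encode('utf-8')

# Decoding and re-encoding with the same codec gives back the original bytes.
round_trip = utf8_bytes.decode('utf-8').encode('utf-8')

# Reading the same bytes as cp949 yields different, broken text.
misread = utf8_bytes.decode('cp949', 'replace')

# ASCII bytes mean the same thing in both encodings, so that part survives.
ascii_part = u'list.php'.encode('utf-8').decode('cp949')

# Only booleans are printed, so the output is ASCII on any console.
print(round_trip == utf8_bytes, misread == hangul, ascii_part == u'list.php')
```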
    >
    > The question may look silly :(
    > but I want to know what the problem is, or how to print the strings without breaking them.
    >
    > thanks for reading.
    >



    --
    Terry Jan Reedy
     
    Terry Reedy, Jul 3, 2012
    #4