Re: lxml can't output right unicode result

Discussion in 'Python' started by MRAB, Sep 7, 2012.


    On 07/09/2012 01:21, contro opinion wrote:
    > I edited a file and saved it in GBK encoding, named test. My system is
    > Debian, locale en.utf-8; Python 2.6, locale utf-8.
    >
    > <html>
    > <p>你</p>
    > </html>
    >
    > In a terminal I run:
    >
    > xxd test
    >
    > 0000000: 3c68 746d 6c3e 0a3c 703e c4e3 3c2f 703e <html>.<p>..</p>
    > 0000010: 0a3c 2f68 746d 6c3e 0a .</html>.
    >
    > 你 means "you" in English.
    > "\xc4\xe3" is its GBK encoding.
    > "\xe4\xbd\xa0" is its UTF-8 encoding.
    > u"\u4f60" is its Unicode code point.
    > Now I parse the file with lxml.
    >
    > >>> "你"
    > '\xe4\xbd\xa0'
    > >>> "你".decode("utf-8")
    > u'\u4f60'
    > >>> "你".decode("utf-8").encode("gbk")
    > '\xc4\xe3'
    > >>>
    >
    > code1:
    >
    > >>> import lxml.html
    > >>> root=lxml.html.parse("test")
    > >>> d=root.xpath("//p")
    > >>> d[0].text_content()
    > u'\xc4\xe3'
    >
    > According to the documentation, lxml parses a file and returns Unicode text.
    > Why doesn't d[0].text_content() return u'\u4f60'?
    >
    > code2:
    >
    > import codecs
    > import lxml.html
    > f = codecs.open('test', 'r', 'gbk')
    > root=lxml.html.parse(f)
    > d=root.xpath("//p")
    > d[0].text_content()
    > u'\xe4\xbd\xa0'
    >
    > Why doesn't d[0].text_content() return u'\u4f60' here either?
    >
    > I have been confused by this problem for two days.
    >

    You can't just put some text into a file and expect it to know
    "magically" what the encoding is. You have to specify that the encoding
    is GBK, something like this (in a file actually encoded as GBK, of
    course):

    <html>
    <meta http-equiv="content-type" content="text/html; charset=gbk">
    <p>你</p>
    </html>
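
    To see why a missing declaration matters, the two results in the
    question can be reproduced with plain byte decoding, without lxml.
    This is only a sketch, written for Python 3 rather than the 2.6 used
    in the question. It assumes that the parser fell back to a
    single-byte, Latin-1 style decoding when no charset was declared;
    the variable names are made up for illustration.

```python
# Bytes of the <p> content as stored in the GBK-encoded file (from the xxd dump).
gbk_bytes = b"\xc4\xe3"

# Decoding with the right codec yields the intended character, U+4F60.
correct = gbk_bytes.decode("gbk")

# Assumption: with no declared charset, each byte was read as one Latin-1
# character. This matches the result shown for code1.
code1_like = gbk_bytes.decode("latin-1")

# Assumption: in code2 the decoded text was re-encoded as UTF-8 and those
# bytes were then read as Latin-1. This matches the result shown for code2.
code2_like = correct.encode("utf-8").decode("latin-1")

print(ascii(correct), ascii(code1_like), ascii(code2_like))
```

    In both cases no data is lost; the bytes are simply interpreted with
    the wrong codec, which is why declaring the encoding fixes it.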

    I hope there's a good reason why you're using that encoding and not
    UTF-8.
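
    If the file has to stay in GBK and cannot be given a charset
    declaration, another option is to tell the parser the encoding
    directly. A minimal sketch, assuming Python 3 and an lxml version
    whose lxml.html.HTMLParser accepts the encoding keyword; the bytes
    are the ones from the xxd dump, held in memory instead of a file.

```python
import io

import lxml.html

# Same bytes as the xxd dump in the question: GBK-encoded, no charset declared.
raw = b"<html>\n<p>\xc4\xe3</p>\n</html>\n"

# Tell the parser the encoding up front instead of letting it guess.
parser = lxml.html.HTMLParser(encoding="gbk")
tree = lxml.html.parse(io.BytesIO(raw), parser=parser)

# text_content() now returns the decoded character rather than raw byte values.
text = tree.xpath("//p")[0].text_content()
print(ascii(text))
```

    Passing a file name such as "test" instead of the BytesIO object
    should behave the same way. Note that the file is handed to lxml as
    bytes: decoding it first with codecs.open, as in code2, is not needed.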
    MRAB, Sep 7, 2012