Re: Problem processing Chinese character with Python

Discussion in 'Python' started by Anthony Liu, Mar 7, 2004.

  1. Anthony Liu

    Anthony Liu Guest

    Hey, I fiddled with the Chinese punctuation, and it
    works elegantly now.

    Thanks a lot!
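
The exact change is not shown in this thread. As a minimal sketch of the idea, assuming the text has already been decoded to Unicode and that the sentence-ending marks are the ASCII `.`, `!`, `?` plus the full-width Chinese `。`, `！`, `？` (the `SENTENCE_END` pattern and the `sentences` name are illustrative, not Andrew's original generator):

```python
import re

# Assumed set of sentence-ending marks: ASCII '.', '!', '?' plus the
# full-width Chinese full stop, exclamation mark and question mark.
SENTENCE_END = re.compile(u'[.!?\u3002\uff01\uff1f]')

def sentences(text):
    """Yield each sentence in `text`, including its closing punctuation."""
    start = 0
    for match in SENTENCE_END.finditer(text):
        sentence = text[start:match.end()].strip()
        if sentence:
            yield sentence
        start = match.end()
    # Whatever follows the last punctuation mark is yielded as a final piece.
    tail = text[start:].strip()
    if tail:
        yield tail
```

This splits on every matching mark, so a decimal point such as the one in "3.14" would also end a sentence; a real splitter would probably need extra rules for that.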

    --- Andrew Bennetts wrote:
    > On Sat, Mar 06, 2004 at 02:05:11AM -0800, Anthony
    > Liu wrote:
    > > Andrew gave me some sample code which lets me read a
    > > text file sentence by sentence.
    > >
    > > Suppose I just wanna read the part between 2 full
    > > stops each time.
    > >
    > > It works nicely with English text files, where the
    > > full stop is a dot (.).
    > >
    > > But when I tried to read Chinese text files, I found
    > > that it sometimes reads a few sentences at one time.
    >
    > Yep -- you'll notice I'm reading bytes, but the
    > sentences generator is
    > expecting characters. That assumption holds for
    > ASCII, but not for many other encodings, where one
    > character can span several bytes.
    >
    > You need some way of reading *characters*, rather
    > than bytes from the file.
    > To do this you need to know the encoding of the file
    > (of course), and then I
    > guess you need to try to decode the bytes as you
    > read them in. I'm just a
    > boring mono-lingual English speaker, so I haven't
    > really played with unicode
    > much, but I guess something along these lines would
    > work:
    >
    > def characters(textFile, encoding):
    >     # textFile must be opened in binary mode.
    >     buf = b''
    >     for byte in iter(lambda: textFile.read(1), b''):
    >         buf += byte
    >         try:
    >             char = buf.decode(encoding)
    >         except UnicodeDecodeError:
    >             # Incomplete multi-byte sequence: keep reading.
    >             continue
    >         buf = b''
    >         yield char
    >     if buf:
    >         yield buf.decode(encoding)
    >
    > Hopefully someone who knows more about unicode will
    > tell me if I've somehow
    > got this completely wrong.
    >
    > Again, reading one byte at a time is pretty
    > inefficient. You can probably
    > optimise fairly easily by reading and decoding large
    > chunks.
    >
    > -Andrew.
    >
    >
    > --
    > http://mail.python.org/mailman/listinfo/python-list
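
Andrew's suggestion of reading and decoding larger chunks can be sketched with the standard library's incremental decoders, which hold back an incomplete multi-byte sequence at the end of one chunk and finish it with the next. This is a sketch, not code from the thread: the `chunk_size` parameter and its default are assumptions, and the file is assumed to be opened in binary mode.

```python
import codecs
import io

def characters(binary_file, encoding, chunk_size=4096):
    """Yield decoded characters from a file opened in binary mode."""
    # The incremental decoder buffers an incomplete multi-byte sequence
    # at the end of one chunk and completes it with the next chunk.
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in iter(lambda: binary_file.read(chunk_size), b''):
        for char in decoder.decode(chunk):
            yield char
    # final=True raises UnicodeDecodeError if the file ends mid-character.
    for char in decoder.decode(b'', final=True):
        yield char

# Usage with an in-memory file; a tiny chunk size forces characters to be
# split across chunk boundaries.
sample = u'\u4f60\u597d\u3002Hi.'
decoded = u''.join(characters(io.BytesIO(sample.encode('utf-8')), 'utf-8', chunk_size=2))
```

On Python versions that have it, `io.open(path, encoding=...)` returns a text file object that does this decoding internally, so a hand-written generator like this is mainly useful for seeing what happens underneath.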



    Anthony Liu, Mar 7, 2004