Problem processing Chinese characters with Python

Anthony Liu

Andrew gave me some sample code that lets me read a text
file sentence by sentence.

Suppose I just wanna read the part between 2 full
stops each time.

It works nicely with English text files, where the
full stop is a dot (.).

But when I tried to read Chinese text files, I found
that it sometimes reads a few sentences at one time.

I guess the reason is that in Chinese, the full stop
is not a dot (.), but a little circle, as many of you
probably know.

Indeed, if I replace the Chinese full stop with the
dot, it nicely gets only one sentence each time.

So, how should I fix this problem? I am really having a
headache processing Chinese characters with Python.

Here is the sample code that Andrew offered:

def bytes(f):
    # Below: f.read(2) to process Chinese
    for byte in iter(lambda: f.read(1), ''):
        yield byte

def sentences(iterable):
    sentence = ''
    for char in iterable:
        sentence += char
        # The little circle is the Chinese
        # full stop. Some of you might not be able
        # to view it if you don't have
        # East Asian language support.
        if char in ('。','.'):
            yield sentence.strip()
            sentence = ''
    sentence = sentence.strip()
    if sentence:
        yield sentence
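
One possible cause of the behaviour described above is that the file is read one byte at a time, while the Chinese full stop takes more than one byte in the common Chinese encodings, so a single byte never compares equal to it. Below is a minimal sketch of one way around that, not a definitive fix. It assumes Python 3 and a UTF-8 encoded file (the actual encoding is not stated in this post and could be something else, such as GB2312 or Big5). The names `chars` and `stops` are made up for illustration.

```python
import io

def chars(f):
    # f must be a text-mode file object, so that read(1) returns
    # one decoded character rather than one raw byte.
    for ch in iter(lambda: f.read(1), ''):
        yield ch

def sentences(iterable, stops=('。', '.')):
    # Same logic as the original generator; only the input differs:
    # it now receives whole characters, so '。' can actually match.
    sentence = ''
    for ch in iterable:
        sentence += ch
        if ch in stops:
            yield sentence.strip()
            sentence = ''
    sentence = sentence.strip()
    if sentence:
        yield sentence

# Demo on in-memory data. The UTF-8 encoding here is an assumption.
raw = '你好。我是学生。Hello. Bye'.encode('utf-8')
with io.TextIOWrapper(io.BytesIO(raw), encoding='utf-8') as f:
    result = list(sentences(chars(f)))
```

With a real file, the same generators should work on something like `open(path, encoding='utf-8')`, where `path` and the encoding name are placeholders to be replaced with the actual values.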

