MeCab UTF-8 Decoding Problem

fobos3

Hi,

I am trying to use a program called MeCab, which does syntax analysis on Japanese text. The problem I am having is that it returns a byte string, and if I try to print it, it prints question marks for almost all characters. However, if I try to use .decode, it throws an error. Here is my code:

#!/usr/bin/python
# -*- coding:utf-8 -*-

import MeCab
tagger = MeCab.Tagger("-Owakati")
text = 'MeCabで遊んでみよう!'

result = tagger.parse(text)
print result

result = result.decode('utf-8')
print result

And here is the output:

MeCab �� �� ��んで�� �� ��う!

Traceback (most recent call last):
  File "test.py", line 11, in <module>
    result = result.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7: invalid continuation byte


------------------
(program exited with code: 1)
Press return to continue

Also my terminal is able to display Japanese characters properly. For example print '日本語' works perfectly fine.

Any ideas?
 
Giorgos Tzampanakis

Find out what the output of tagger.parse is. Your program assumes it is a
bytestring that contains the utf-8 encoded representation of some text,
but it is obvious that this assumption is wrong.
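
One way to act on this advice is to look at the raw bytes and see which codecs can decode them at all. The sketch below is a hypothetical helper, not part of MeCab. The candidate list is an assumption (EUC-JP and Shift_JIS are other encodings commonly used for Japanese text), and `sample` merely stands in for whatever tagger.parse returns. A codec that decodes without error is only a candidate, not proof of the real encoding.

```python
# -*- coding: utf-8 -*-

def decodable_with(raw, candidates=("utf-8", "euc_jp", "shift_jis")):
    """Return the candidate codecs that decode raw without raising."""
    ok = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok

# Stand-in for the tagger output (an assumption): spaced text encoded as EUC-JP.
sample = u"MeCab で 遊ん で みよ う !".encode("euc_jp")

print(repr(sample))            # the escapes show the actual byte values
print(decodable_with(sample))  # codecs that accept these bytes
```

If several codecs accept the bytes, decoding with each and reading the result is still needed to tell which one gives sensible text.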
 
Dave Angel


Using Python 2.7 on Linux, presumably? It'd be better to be explicit.
What do the MeCab docs say the tagger.parse byte string represents?
Maybe it's not text at all. But surely it's not utf-8.
Are your terminal and your text editor using utf-8, or something else?
If you put your print '日本語' statement in the source file above, does it
also work fine?

Are you actually running it from the terminal, or some GUI? I notice
you get "(program exited with code: 1)" and "Press return to continue".
Neither of those is standard terminal fare on any OS I know of.
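
The question about which encodings are in play can be checked from inside Python. This is a minimal sketch using only the standard library; `report_encodings` is a hypothetical helper name, and the values it reports depend entirely on how and where the script is launched (terminal, IDE, pipe).

```python
import locale
import sys

def report_encodings():
    """Collect the encodings Python believes are in effect."""
    return {
        # Encoding used when printing text; can be None if output is not a terminal.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the user's locale settings.
        "preferred": locale.getpreferredencoding(),
        # Encoding Python uses for implicit byte/text conversions.
        "default": sys.getdefaultencoding(),
    }

for name, value in sorted(report_encodings().items()):
    print("%s: %s" % (name, value))
```

Running this once from the terminal and once from the GUI that prints "Press return to continue" would show whether the two environments agree.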
 
Terry Reedy

It is generally nice to give a link when asking about 3rd party
software. https://code.google.com/p/mecab/
In this case, nearly all the non-boilerplate text is Japanese ;-(.

The problem with bytes is that they can have any encoding.
In Python 2 (indicated by your print *statements*), a byte string is
just a string.

https://mecab.googlecode.com/svn/trunk/mecab/doc/index.html
MeCab: Yet Another Part-of-Speech and Morphological Analyzer
followed by Japanese.

Parts of this appear in the output, as indicated by spaces.
'MeCabで遊 んで みよ う!'

Python normally prints bytes with ascii chars representing either
themselves or other values with hex escapes. This looks more like
unicode sent to a terminal with a limited character set. I would add

print type(result)

to be sure.
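
A small sketch of that check, assuming nothing about MeCab itself: `describe` is a hypothetical helper that reports the type name together with the escaped repr, which is the ASCII-only view of a byte string mentioned above.

```python
# -*- coding: utf-8 -*-

def describe(value):
    """Return the type name and an escaped view of value."""
    return type(value).__name__, repr(value)

as_text = u"で"                      # one Unicode character
as_bytes = as_text.encode("utf-8")  # the same character as UTF-8 bytes

print(describe(as_bytes))           # the repr shows hex escapes, not glyphs
print(len(as_text))                 # counted in characters
print(len(as_bytes))                # counted in bytes
```

Printing the repr rather than the value itself takes the terminal out of the picture, so what is shown is what the program actually holds.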
 
Terry Reedy

My daughter translated the summary paragraph for me.

MeCab is an open source morphological analysis engine
developed through a collaborative unit project between Kyoto
University's Informatics Research Department and Nippon Telegraph and
Telephone Corporation Communication Science Laboratories. Its
fundamental premise is a design which is general-purpose and not reliant
on a language, dictionary, or corpus. It uses Conditional Random Fields
(CRF) for the estimation of the parameters, and has improved performance
over ChaSen, which uses a hidden Markov model. In addition, on average
it is faster than ChaSen, Juman, and KAKASI. Incidentally, the creator's
favorite food is mekabu (thick leaves of wakame, a kind of edible
seaweed, from near the root of the stalk).
 
MRAB

The text literal is a bytestring. Are you sure it shouldn't be a Unicode string
instead, i.e. u'MeCabで遊んでみよう!'?
 
Steven D'Aprano

#!/usr/bin/python
# -*- coding:utf-8 -*-

import MeCab
tagger = MeCab.Tagger("-Owakati")
text = 'MeCabで遊んでみよう!'

I see from below you are using Python 2.7.

Here you are using a byte-string rather than Unicode. The actual bytes
that you get *may* be indeterminate. I don't think that Python guarantees
that just because the source file is declared as UTF-8, that *implicit*
encoding into bytes will necessarily use UTF-8.

Even if it does, it is still better to use an explicit Unicode string,
and explicitly encode into bytes using whatever encoding MeCab expects
you to use, say:

text = u'MeCabで遊んでみよう!'.encode('utf-8')

By the way, what makes you think that MeCab expects, and returns, text
encoded using UTF-8?

result = tagger.parse(text)
print result

result = result.decode('utf-8')
print result

And here is the output:

MeCab �� �� ��んで�� �� ��う!

MeCab has returned a bunch of bytes, representing some text in some
encoding. When you print those bytes, your terminal uses whatever its
default encoding is (probably UTF-8, on a Linux system) and tries to make
sense of the bytes, using � for any byte it cannot make sense of. This is
good evidence that MeCab is *not* actually using UTF-8.

And sure enough, when you try to decode it manually:

Traceback (most recent call last):
  File "test.py", line 11, in <module>
    result = result.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7:
invalid continuation byte
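
Both symptoms, the replacement characters and the strict decode failure, can be reproduced without MeCab. The bytes below are made up for illustration (valid UTF-8 for one character with a space inserted part-way through it); they are not captured MeCab output. A multi-byte sequence that has been cut apart is one way an "invalid continuation byte" error can arise, though the sketch does not show that this is what happened here.

```python
# -*- coding: utf-8 -*-

# Made-up input: "MeCab " followed by UTF-8 "で" with a space inserted
# after the second of its three bytes.
char = u"で".encode("utf-8")                  # b'\xe3\x81\xa7'
broken = b"MeCab " + char[:2] + b" " + char[2:]

# Roughly what a UTF-8 terminal does: substitute U+FFFD for bad bytes.
shown = broken.decode("utf-8", "replace")
print(shown.count(u"\ufffd"))                 # how many replacements were made

# Strict decoding raises, and the exception records where and why.
try:
    broken.decode("utf-8")
except UnicodeDecodeError as exc:
    failure = (exc.start, exc.end, exc.reason)
    print(failure)
```

The `start`, `end`, and `reason` attributes carry the same information as the last line of the traceback, which makes them handy for comparing a suspected cause against the real error.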

Assuming that the bytes being returned are *supposed* to be encoded in
UTF-8, it's possible that MeCab is simply buggy and cannot produce proper
UTF-8 encoded byte strings. This wouldn't surprise me -- after all, using
*byte strings* as non-ASCII text strongly suggests that the author
doesn't understand Unicode very well.

But perhaps more likely, MeCab isn't using UTF-8 at all. What does the
documentation say?

A third possibility is that the string you feed to MeCab is simply
mangled beyond recognition due to the way you create it using the
implicit encoding from chars to bytes. Change the line

text = 'MeCab ...'

to use an explicit Unicode string and encode, as above, and maybe the
error will go away.
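
The encode, parse, decode pattern suggested above can be wrapped in one place. This is a sketch under stated assumptions: `parse_text` and `fake_parse` are hypothetical names, `fake_parse` stands in for tagger.parse (which, in the Python 2 setting of this thread, takes and returns byte strings), and the encoding argument has to be whatever the tool really uses, which the thread has not yet established.

```python
# -*- coding: utf-8 -*-

def parse_text(parse, text, encoding):
    """Encode text, hand the bytes to parse, and decode the result.

    parse is any callable that takes and returns a byte string.
    encoding must match what that callable expects (an assumption here).
    """
    raw_in = text.encode(encoding)
    raw_out = parse(raw_in)
    return raw_out.decode(encoding)

def fake_parse(raw):
    """Hypothetical stand-in for tagger.parse: echoes its input plus a newline."""
    return raw + b"\n"

result = parse_text(fake_parse, u"MeCabで遊んでみよう!", "utf-8")
print(len(result))
```

Keeping Unicode strings everywhere except at the boundary with the external tool means there is exactly one place where the encoding assumption lives and can be corrected.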
 
