japanese encoding iso-2022-jp in python vs. perl

kettle · Oct 23, 2007

Hi,
I am rather new to python, and am currently struggling with some
encoding issues. I have some utf-8-encoded text which I need to
encode as iso-2022-jp before sending it out to the world. I am using
python's encode functions:
--
var = var.encode("iso-2022-jp", "replace")
print var
--

I am using the 'replace' argument because there seem to be a couple
of utf-8 japanese characters which python can't correctly convert to
iso-2022-jp. The output looks like this:
↓東京???日比谷線?北千住行

However, if I use perl's Encode module to re-encode the exact same bit
of text:
--
$var = encode("iso-2022-jp", decode("utf8", $var));
print $var;
--

I get proper output (no unsightly question-marks):
↓東京メトロ日比谷線・北千住行

So, what's the deal? Why can't python properly encode some of these
characters? I know there are a host of different iso-2022-jp
variants, could it be using a different one than I think (the
default)? I'm quite liking python at the moment for a variety of
different reasons (I suspect perl will forever win when it comes to
regular expressions but everything else is pretty darn nice), but this
is a bit worrying.

-Joe

Ryan Ginstrom · Oct 23, 2007

On Behalf Of kettle

I am rather new to python, and am currently struggling with some
encoding issues. I have some utf-8-encoded text which I need to
encode as iso-2022-jp before sending it out to the world. I am using
python's encode functions:

Possibly silly question: Is that a utf-8 string, or Unicode?

print unicode(var, "utf8").encode("iso-2022-jp")

On my computer (Japanese XP), your string round-trips between utf-8 and
iso-2022-jp without problems.

Another possible thing to look at is whether your Python output terminal can
print Japanese OK. Does it choke when printing the string as Unicode?

Regards,
Ryan Ginstrom
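
The distinction Ryan asks about (UTF-8 bytes versus a Unicode string) can be sketched as follows. This is written for Python 3, not the Python 2 used in the thread, and the sample literal is an assumed reconstruction of the text being discussed, so treat it as an illustration only:

```python
# Python 3 sketch; the thread itself uses Python 2.
# Assumption: this literal is a reconstruction of the thread's sample text.
text = "↓東京メトロ日比谷線・北千住行"

utf8_bytes = text.encode("utf-8")          # what arrives "as UTF-8"
decoded = utf8_bytes.decode("utf-8")       # bytes -> str (Unicode)
jis_bytes = decoded.encode("iso-2022-jp")  # str -> ISO-2022-JP bytes

print(repr(jis_bytes))
print(jis_bytes.decode("iso-2022-jp") == text)
```

With full-width katakana, as here, the round trip goes through without needing the 'replace' argument.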

Guest · Oct 23, 2007

var = var.encode("iso-2022-jp", "replace")

print var [...]
↓東京メトロ日比谷線・北千住行

So, what's the deal? Why can't python properly encode some of these
characters?

It's not clear. As Ryan says, it works just fine (and so it does for
me with Python 2.4.4 on Debian).

What Python version are you using, and what is the precise string that
you want to encode? (use "print repr(var)" to report that exact value)

HTH,
Martin

Leo Kislov · Oct 24, 2007

Hi,
I am rather new to python, and am currently struggling with some
encoding issues. I have some utf-8-encoded text which I need to
encode as iso-2022-jp before sending it out to the world. I am using
python's encode functions:
--
var = var.encode("iso-2022-jp", "replace")
print var
--

I am using the 'replace' argument because there seem to be a couple
of utf-8 japanese characters which python can't correctly convert to
iso-2022-jp. The output looks like this:
↓東京???日比谷線?北千住行

However, if I use perl's Encode module to re-encode the exact same bit
of text:
--
$var = encode("iso-2022-jp", decode("utf8", $var));
print $var;
--

I get proper output (no unsightly question-marks):
↓東京メトロ日比谷線・北千住行

So, what's the deal?

Thankfully my crystal ball is working. I can see clearly that the
fourth character of the input is 'HALFWIDTH KATAKANA LETTER ME'
(U+FF92), which is not present in ISO-2022-JP as defined by RFC 1468,
so python converts it into a question mark, as you requested.
Meanwhile perl, as usual, is trying to guess what you want and
silently converts that character into 'KATAKANA LETTER ME' (U+30E1),
which is present in ISO-2022-JP.
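
The claim about the two code points can be checked with a small sketch. It is written for Python 3 rather than the Python 2 of the thread, and `can_encode` is a hypothetical helper introduced only for this illustration:

```python
import unicodedata

halfwidth_me = "\uff92"  # HALFWIDTH KATAKANA LETTER ME
fullwidth_me = "\u30e1"  # KATAKANA LETTER ME

def can_encode(text, encoding="iso2022_jp"):
    """Return True if text encodes under the codec without errors."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for ch in (halfwidth_me, fullwidth_me):
    print(unicodedata.name(ch), can_encode(ch))

# With 'replace', the unencodable character becomes a question mark.
print(halfwidth_me.encode("iso2022_jp", "replace"))
```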

Why can't python properly encode some of these
characters?

Because "Explicit is better than implicit". Do you care about
roundtripping? Do you care about the width of characters? What about
full-width " (U+FF02)? Python doesn't know the answers to these
questions, so it doesn't do anything with your input. You have to do
it yourself. Assuming you don't care about roundtripping and width,
here is an example demonstrating how to deal with narrow characters:

from unicodedata import normalize

# Map each code point in U+FF61..U+FFDF to its NFKC (compatibility) form.
iso2022_squeezing = dict((i, normalize('NFKC', unichr(i)))
                         for i in range(0xFF61, 0xFFE0))
print repr(u'\uFF92'.translate(iso2022_squeezing))

It prints u'\u30e1'. Feel free to ask questions if something is not
clear.

Note, this is just an example, I *don't* claim it does what you want
for any character in the FF61-FFDF range. You may want to carefully
review the whole unicode block:
http://www.unicode.org/charts/PDF/UFF00.pdf

-- Leo.
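
Leo's table can be extended into a whole-string conversion along the following lines. This is a Python 3 sketch, not code from the thread: `widen_halfwidth` and `to_iso2022jp` are hypothetical names, the sample string is an assumed reconstruction of the original input, and the extra NFC pass is an added assumption so that a half-width kana followed by a half-width voiced mark is composed into one full-width character:

```python
from unicodedata import normalize

# Per-code-point NFKC table for U+FF61..U+FFDF, as in Leo's example.
HALFWIDTH_TABLE = {cp: normalize("NFKC", chr(cp))
                   for cp in range(0xFF61, 0xFFE0)}

def widen_halfwidth(text):
    """Replace half-width forms, then compose kana + voiced marks (NFC)."""
    widened = text.translate(HALFWIDTH_TABLE)
    return normalize("NFC", widened)

def to_iso2022jp(text, errors="strict"):
    """Widen half-width forms and encode the result as ISO-2022-JP."""
    return widen_halfwidth(text).encode("iso2022_jp", errors)

# Assumed input: half-width ME, TO, RO (U+FF92, U+FF84, U+FF9B)
# and a half-width middle dot (U+FF65).
sample = "↓東京\uff92\uff84\uff9b日比谷線\uff65北千住行"
encoded = to_iso2022jp(sample)
print(encoded.decode("iso2022_jp"))
```

Only the FF61-FFDF block is translated, so other full-width characters keep their width; whether that is the right choice depends on the answers to the questions Leo raises above.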

kettle · Oct 24, 2007

Thanks Leo, and everyone else, these were very helpful replies. The
issue was exactly as Leo described, and I apologize for not being
aware of it, and thus not quite reporting it correctly.

At the moment I don't care about round-tripping between half-width and
full-width kana. I only need to be able to rely on any particular
kana character being translated correctly to its half-width or
full-width equivalent, and I need the Japanese I send out to be
readable.

I appreciate the 'implicit versus explicit' point, and have read about
it in a few different python mailing lists. In this instance it seems
that perl perhaps ought to flash a warning notification regarding what
it is doing, but as this conversion between half-width and full-width
characters is by far the most logical one available, it also seems
reasonable that python might perhaps include such capabilities by
default, just as it currently includes the 'replace' option for
mapping missed characters generically to '?'.

I still haven't worked out the entire mapping routine, but Leo's hint
is probably sufficient to get it working with a bit more effort.

Again, thanks for the help.

-Joe
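
An opt-in counterpart to 'replace', of the kind Joe describes, can be approximated with a custom error handler. The sketch below is for Python 3 and uses `codecs.register_error`; the handler name `nfkc_fallback` is made up for this illustration and is not a built-in option:

```python
import codecs
from unicodedata import normalize

def nfkc_fallback(exc):
    """Encode-error handler: retry the failing characters in NFKC form."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    candidate = normalize("NFKC", bad)
    try:
        # Check that the normalized form is encodable at all.
        candidate.encode(exc.encoding)
    except UnicodeEncodeError:
        candidate = "?"  # same outcome as the built-in 'replace'
    return candidate, exc.end

codecs.register_error("nfkc_fallback", nfkc_fallback)

# Half-width ME, TO, RO (U+FF92, U+FF84, U+FF9B).
data = "\uff92\uff84\uff9b".encode("iso2022_jp", "nfkc_fallback")
print(data.decode("iso2022_jp"))
```

Because the handler sees one failing span at a time, a half-width voiced mark is handled separately from the kana before it and would fall back to '?', so converting the whole string before encoding is likely the more robust route.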
