[2.5.1] ShiftJIS to Unicode?

Gilles Ganault · Nov 26, 2008

Hello

I'm trying to read pages from Amazon JP, whose web pages are
supposed to be encoded in ShiftJIS, and decode contents into Unicode
to keep Python happy:

www.amazon.co.jp
<meta http-equiv="content-type" content="text/html; charset=Shift_JIS" />

But this doesn't work:

======
m = try.search(the_page)
if m:
    #UnicodeEncodeError: 'charmap' codec can't encode characters in position 49-55: character maps to <undefined>
    title = m.group(1).decode('shift_jis').strip()
======

Has someone successfully accessed Shift-JIS-encoded Japanese contents
with Python?

Thank you.

skip · Nov 26, 2008

Gilles> ======
Gilles> m = try.search(the_page)
Gilles> if m:
Gilles>     #UnicodeEncodeError: 'charmap' codec can't encode characters in position 49-55: character maps to <undefined>
Gilles>     title = m.group(1).decode('shift_jis').strip()
Gilles> ======

Gilles> Has someone successfully accessed Shift-JIS-encoded Japanese
Gilles> contents with Python?

Have you verified that the characters in position 49-55 are actually
Shift-JIS characters? In my experience problems decoding a source string in
any given character set are because of errors in the source, not errors in
Python.

OTOH, the characters in position 49-55 look like plain old ASCII to me.
Does Shift-JIS have ASCII as a proper subset?
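The ASCII question can be checked directly. The sketch below uses Python 3 syntax (an assumption; the thread itself is on Python 2.5) and made-up sample strings. It only shows how Python's `shift_jis` codec treats plain ASCII bytes. As far as I know, the Shift_JIS standard itself assigns 0x5C and 0x7E to the yen sign and the overline, so "proper subset" holds only approximately.

```python
# Plain ASCII bytes decode to the same text under both codecs.
ascii_bytes = b"Amazon.co.jp title 123"
same = ascii_bytes.decode("shift_jis") == ascii_bytes.decode("ascii")

# Mixed text: the ASCII part stays one byte per character,
# each kanji takes two bytes.
mixed = "Python\u5165\u9580".encode("shift_jis")  # "Python" + two kanji
ascii_prefix = mixed[:6]
total_length = len(mixed)
```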

MRAB · Nov 26, 2008

Gilles said:
Gilles> Hello
Gilles>
Gilles> I'm trying to read pages from Amazon JP, whose web pages are
Gilles> supposed to be encoded in ShiftJIS, and decode contents into Unicode
Gilles> to keep Python happy:
Gilles>
Gilles> www.amazon.co.jp
Gilles> <meta http-equiv="content-type" content="text/html; charset=Shift_JIS" />
Gilles>
Gilles> But this doesn't work:
Gilles>
Gilles> ======
Gilles> m = try.search(the_page)

How can you have the name "try"? It's a reserved word!

Gilles> if m:
Gilles>     #UnicodeEncodeError: 'charmap' codec can't encode characters in position 49-55: character maps to <undefined>
Gilles>     title = m.group(1).decode('shift_jis').strip()
Gilles> ======
Gilles>
Gilles> Has someone successfully accessed Shift-JIS-encoded Japanese contents
Gilles> with Python?

No problem here:

Walter Dörwald · Nov 27, 2008

Gilles said:
Gilles> Hello
Gilles>
Gilles> I'm trying to read pages from Amazon JP, whose web pages are
Gilles> supposed to be encoded in ShiftJIS, and decode contents into Unicode
Gilles> to keep Python happy:
Gilles>
Gilles> www.amazon.co.jp
Gilles> <meta http-equiv="content-type" content="text/html; charset=Shift_JIS" />
Gilles>
Gilles> But this doesn't work:
Gilles>
Gilles> ======
Gilles> m = try.search(the_page)
Gilles> if m:
Gilles>     #UnicodeEncodeError: 'charmap' codec can't encode characters in position 49-55: character maps to <undefined>
Gilles>     title = m.group(1).decode('shift_jis').strip()
Gilles> ======

There's something fishy going on: You're calling the decode method and
get a UnicodeEncodeError. This means that you're calling the decode
method on something that already *is* unicode. What does

print type(m.group(1))

output?

Servus,
Walter
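Walter's type check can be turned into a small guard. This is only a sketch in Python 3 syntax (an assumption; in the thread's Python 2.5 the two types are `str` and `unicode`, and calling `.decode()` on a `unicode` object triggers an implicit encode first). The helper name `to_text` is hypothetical.

```python
def to_text(value, encoding="shift_jis"):
    """Return text, decoding only when the value is still raw bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already text: decoding again would be the mistake Walter describes.
    return value


raw = "\u65e5\u672c".encode("shift_jis")  # bytes, as read from the network
already_text = "\u65e5\u672c"             # text, e.g. from an earlier decode

from_bytes = to_text(raw)
from_text = to_text(already_text)
```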

Gilles Ganault · Nov 27, 2008

MRAB> No problem here:

Thanks, but it seems like some pages contain ShiftJIS mixed with some
other code page, and Python complains when trying to display this. I
ended up not displaying the string, and just sending it directly to
the database:

========
title = None
m = firsttry.search(the_page)
if m:
    try:
        title = m.group(1).decode('shift-jis').strip()
    except UnicodeEncodeError:
        title = m.group(1).decode('iso8859-1').strip()
    except:
        title = ""
else:
    m = secondtry.search(the_page)
    if m:
        try:
            title = m.group(1).decode('shift-jis').strip()
        except UnicodeEncodeError:
            title = m.group(1).decode('iso8859-1').strip()
        except:
            title = ""
    else:
        print "Nothing found for ISBN %s" % isbn

if title:
    #UnicodeEncodeError: 'charmap' codec can't encode characters in position 49-55: character maps to <undefined>
    #print "Found : %s" % title
    print "Found stuff"

    sql = 'INSERT INTO books (title) VALUES (?)'
    cursor.execute(sql,(title,))
========

Thank you

Mark Tolonen · Nov 27, 2008

This is correct. You should read in the whole page and convert it to
Unicode immediately.
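A minimal sketch of that decode-first order, in Python 3 syntax (an assumption; the thread is on Python 2.5). The names `TITLE_RE` and `extract_title`, the `<title>` pattern, and the sample page are all hypothetical, since the thread never shows its real regexes or data.

```python
import re

# Hypothetical pattern, compiled against text rather than raw bytes.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(raw_page, encoding="shift_jis"):
    """Decode the whole page once, then search the decoded text."""
    # errors="replace" substitutes U+FFFD for undecodable bytes
    # instead of raising, so one bad byte does not lose the page.
    text = raw_page.decode(encoding, errors="replace")
    match = TITLE_RE.search(text)
    return match.group(1).strip() if match else None


# Made-up page bytes standing in for a downloaded response body.
sample = "<html><head><title> \u65e5\u672c\u8a9e\u306e\u672c </title></head></html>".encode("shift_jis")
title = extract_title(sample)
print(ascii(title))  # ascii() keeps the output safe on any console
```

If a page declared as Shift_JIS contains vendor extension characters, trying `encoding="cp932"` may decode more of it, though that is a guess about such pages and not something the thread confirms.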

Gilles> Thanks, but it seems like some pages contain ShiftJIS mixed with some
Gilles> other code page, and Python complains when trying to display this. I
Gilles> ended up not displaying the string, and just sending it directly to
Gilles> the database:
Gilles>
Gilles> ========
Gilles> title = None
Gilles> m = firsttry.search(the_page)
Gilles> if m:
Gilles>     try:
Gilles>         title = m.group(1).decode('shift-jis').strip()

You should not search the raw data and decode it later...decode the data
when first brought into the program and do all processing in Unicode.

Gilles>     except UnicodeEncodeError:
Gilles>         title = m.group(1).decode('iso8859-1').strip()
Gilles>     except:
Gilles>         title = ""
Gilles> else:
Gilles>     m = secondtry.search(the_page)
Gilles>     if m:
Gilles>         try:
Gilles>             title = m.group(1).decode('shift-jis').strip()
Gilles>         except UnicodeEncodeError:
Gilles>             title = m.group(1).decode('iso8859-1').strip()
Gilles>         except:
Gilles>             title = ""
Gilles>     else:
Gilles>         print "Nothing found for ISBN %s" % isbn
Gilles>
Gilles> if title:
Gilles>     #UnicodeEncodeError: 'charmap' codec can't encode characters in position 49-55: character maps to <undefined>
Gilles>     #print "Found : %s" % title
Gilles>     print "Found stuff"

Note here that you are getting an "encode" error. When trying to print the
data, Python will try to encode the Unicode data using the terminal's
default encoding, which I suspect is not Shift-JIS.

-Mark
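The encode failure Mark describes can be reproduced without a terminal. The sketch below uses Python 3 syntax and assumes a Western code page (`cp1252`) as a stand-in for the console encoding; the thread does not say which encoding the poster's terminal actually used.

```python
title = "\u65e5\u672c\u8a9e"  # three kanji, already decoded to text

# Encoding to a code page that lacks these characters fails,
# which is what happens implicitly when print targets such a console.
try:
    title.encode("cp1252")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# One workaround for display only: escape what the console cannot show.
printable = title.encode("cp1252", errors="backslashreplace").decode("cp1252")
print(printable)
```

The text stored in `title` is untouched by this; only the copy prepared for display is escaped, so the value sent to the database stays intact.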
