[2.5.1] ShiftJIS to Unicode?

Gilles Ganault

Hello

I'm trying to read pages from Amazon JP, whose web pages are
supposed to be encoded in ShiftJIS, and decode contents into Unicode
to keep Python happy:

www.amazon.co.jp
<meta http-equiv="content-type" content="text/html; charset=Shift_JIS" />

But this doesn't work:

======
m = try.search(the_page)
if m:
    # UnicodeEncodeError: 'charmap' codec can't encode characters in position 49-55: character maps to <undefined>
    title = m.group(1).decode('shift_jis').strip()
======

Has someone successfully accessed Shift-JIS-encoded Japanese contents
with Python?

Thank you.
 
skip

Gilles> ======
Gilles> m = try.search(the_page)
Gilles> if m:
Gilles>     # UnicodeEncodeError: 'charmap' codec can't encode characters in position 49-55: character maps to <undefined>
Gilles>     title = m.group(1).decode('shift_jis').strip()
Gilles> ======

Gilles> Has someone successfully accessed Shift-JIS-encoded Japanese
Gilles> contents with Python?

Have you verified that the characters in position 49-55 are actually
Shift-JIS characters? In my experience, problems decoding a source string in
any given character set are caused by errors in the source, not errors in
Python.

OTOH, the characters in position 49-55 look like plain old ASCII to me.
Does Shift-JIS have ASCII as a proper subset?
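
On the ASCII question, a quick check is possible. The sketch below is written for Python 3 (the thread itself uses Python 2.5.1), and the sample strings are made up for illustration. It suggests that CPython's `shift_jis` codec passes ASCII letters and digits through unchanged, while Japanese characters take two bytes each. Strictly speaking, the single-byte half of Shift JIS comes from JIS X 0201, which assigns the yen sign and the overline to bytes 0x5C and 0x7E, so it is close to a superset of ASCII rather than an exact one.

```python
# Illustrative strings only; not taken from the Amazon page in question.
ascii_bytes = b"Python 2.5.1"
ascii_text = ascii_bytes.decode("shift_jis")

japanese_text = "日本語"
japanese_bytes = japanese_text.encode("shift_jis")

# ASCII letters, digits, spaces and dots decode to the same characters.
print(ascii_text == ascii_bytes.decode("ascii"))

# Each of the three kanji occupies two bytes in Shift JIS.
print(len(japanese_text), len(japanese_bytes))

# The round trip through Shift JIS is lossless for this sample.
print(japanese_bytes.decode("shift_jis") == japanese_text)
```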
 
MRAB

Gilles> m = try.search(the_page)

How can you have the name "try"? It's a reserved word!

Gilles> Has someone successfully accessed Shift-JIS-encoded Japanese contents
Gilles> with Python?

No problem here:
 
Walter Dörwald

Gilles> ======
Gilles> m = try.search(the_page)
Gilles> if m:
Gilles>     # UnicodeEncodeError: 'charmap' codec can't encode characters in position 49-55: character maps to <undefined>
Gilles>     title = m.group(1).decode('shift_jis').strip()
Gilles> ======

There's something fishy going on: You're calling the decode method and
get a UnicodeEncodeError. This means that you're calling the decode
method on something that already *is* unicode. What does

print type(m.group(1))

output?

Servus,
Walter
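
The reasoning above relies on Python 2 behaviour: calling `.decode()` on a `unicode` object first encodes it implicitly with the default codec, which is how a decode call can end in an *encode* error. In Python 3, `str` has no `.decode()` method at all, so the same mistake shows up as an `AttributeError` instead. A minimal sketch of a type guard, in Python 3, where `to_text` is a hypothetical helper name and the sample title is invented:

```python
def to_text(value, encoding="shift_jis"):
    """Return text, decoding only when given raw bytes (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, str):
        # Already decoded; decoding it a second time would be the mistake.
        return value
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)


raw_title = "日本語の本".encode("shift_jis")  # made-up sample title

# Bytes go in, text comes out.
print(type(raw_title).__name__, type(to_text(raw_title)).__name__)

# Passing already-decoded text through the helper is harmless.
print(to_text(to_text(raw_title)) == "日本語の本")
```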
 
Gilles Ganault

MRAB> No problem here:

Thanks, but it seems like some pages contain ShiftJIS mixed with some
other code page, and Python complains when trying to display this. I
ended up not displaying the string, and just sending it directly to
the database:

========
title = None
m = firsttry.search(the_page)
if m:
    try:
        title = m.group(1).decode('shift-jis').strip()
    except UnicodeEncodeError:
        title = m.group(1).decode('iso8859-1').strip()
    except:
        title = ""
else:
    m = secondtry.search(the_page)
    if m:
        try:
            title = m.group(1).decode('shift-jis').strip()
        except UnicodeEncodeError:
            title = m.group(1).decode('iso8859-1').strip()
        except:
            title = ""
    else:
        print "Nothing found for ISBN %s" % isbn

if title:
    # UnicodeEncodeError: 'charmap' codec can't encode characters in position 49-55: character maps to <undefined>
    # print "Found : %s" % title
    print "Found stuff"

    sql = 'INSERT INTO books (title) VALUES (?)'
    cursor.execute(sql, (title,))
========

Thank you
 
Mark Tolonen

This is correct. You should read in the whole page and convert it to
Unicode immediately.
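
As a rough illustration of decoding first and searching afterwards, here is a sketch in Python 3 (the thread uses Python 2.5.1). The thread never shows the real regular expressions, so `TITLE_RE`, `extract_title`, and the sample page are all assumptions made for the example.

```python
import re

# Hypothetical pattern; the thread never shows the real regular expressions.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(raw_page, encoding="shift_jis"):
    """Decode the whole page once, then search the resulting text."""
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising,
    # which is one possible policy for pages containing stray bytes.
    text = raw_page.decode(encoding, errors="replace")
    match = TITLE_RE.search(text)
    return match.group(1).strip() if match else None


# Made-up stand-in for the bytes returned by the HTTP request.
sample_page = "<html><head><title> 日本語の本 </title></head></html>".encode("shift_jis")

print(extract_title(sample_page) == "日本語の本")
print(extract_title(b"<html></html>"))
```

One reason to prefer this order: the second byte of a two-byte Shift JIS character can fall in the ASCII range, so a pattern run over the raw bytes may match in the middle of a character.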
Gilles> Thanks, but it seems like some pages contain ShiftJIS mixed with some
Gilles> other code page, and Python complains when trying to display this. I
Gilles> ended up not displaying the string, and just sending it directly to
Gilles> the database:
Gilles>
Gilles> title = None
Gilles> m = firsttry.search(the_page)
Gilles> if m:
Gilles>     try:
Gilles>         title = m.group(1).decode('shift-jis').strip()

You should not search the raw data and decode it later; decode the data
when first brought into the program and do all processing in Unicode.

Gilles> [...]
Gilles> if title:
Gilles>     # UnicodeEncodeError: 'charmap' codec can't encode characters in position 49-55: character maps to <undefined>
Gilles>     # print "Found : %s" % title
Gilles>     print "Found stuff"

Note here that you are getting an "encode" error. When trying to print the
data, Python will try to encode the Unicode data using the terminal's
default encoding, which I suspect is not Shift-JIS.

-Mark
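
The encode-on-print behaviour can be reproduced without a terminal. The sketch below is in Python 3 and uses `cp1252` as an assumed stand-in for a Western Windows console encoding; the thread does not say which encoding the terminal actually used. `printable` is a hypothetical helper and the title is invented.

```python
import sys

title = "日本語の本"  # made-up sample title

error_message = None
try:
    # cp1252 has no Japanese characters, so encoding fails.
    title.encode("cp1252")
except UnicodeEncodeError as exc:
    error_message = str(exc)
    print(error_message)


def printable(text, encoding=None):
    """Return text with characters the target encoding lacks escaped."""
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# Safe to print on a cp1252 console: unsupported characters become \uXXXX escapes.
print(printable(title, "cp1252"))
```

Escaping is only a display workaround; the text stored in the database can stay as full Unicode.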
 
