How to convert between Japanese coding systems?

Dietrich Bollmann

Hi,

Are there any functions in python to convert between different Japanese
coding systems?

I would like to convert between (at least) ISO-2022-JP, UTF-8, EUC-JP
and SJIS. I also need some function to encode / decode base64 encoded
strings.

I get the strings (which actually are emails) from a server on the
internet with:

import urllib
server = urllib.urlopen(serverURL, parameters)
email = server.read()

The coding systems are given in the response string:

Example:

email = '''[...]
Subject:
=?UTF-8?Q?romaji=E3=81=B2=E3=82=89=E3=81=8C=E3=81=AA=E3=82=AB=E3=82=BF?=
=?UTF-8?Q?=E3=82=AB=E3=83=8A=E6=BC=A2=E5=AD=97?=
[...]
Content-Type: text/plain; charset=EUC-JP
[...]
Content-Transfer-Encoding: base64
[...]

cm9tYWpppNKk6aSspMqlq6W/paulyrTBu/oNCg0K

'''

My idea is to first parse the 'email' string and extract the email
body as well as the values of the 'Subject: ', 'Content-Type: ' and
'Content-Transfer-Encoding: ' headers, and then use them to convert
the text to some other coding system.

Something along the lines of:

(subject, contentType, contentTransferEncoding, content) = parseEmail(email)

to = 'utf-8'
subjectUtf8 = decodeSubject(subject, to)

from = contentType
to = 'utf-8'
contentUtf8 = convertCodingSystem(decodeBase64(content), from, to)

The only problem is that I could not find any standard functionality to
convert between different Japanese coding systems.

Thanks,

Dietrich Bollmann
 
Peter Otten

Dietrich said:
I get the strings (which actually are emails) from a server on the
internet with:

import urllib
server = urllib.urlopen(serverURL, parameters)
email = server.read()

The coding systems are given in the response string:

Example:

email = '''[...]
Subject:
=?UTF-8?Q?romaji=E3=81=B2=E3=82=89=E3=81=8C=E3=81=AA=E3=82=AB=E3=82=BF?=
=?UTF-8?Q?=E3=82=AB=E3=83=8A=E6=BC=A2=E5=AD=97?=
[...]
Content-Type: text/plain; charset=EUC-JP
[...]
Content-Transfer-Encoding: base64
[...]

cm9tYWpppNKk6aSspMqlq6W/paulyrTBu/oNCg0K

'''

Is that an email? Maybe you can get it in a format that is supported by the
email package in the standard library.

Dietrich said:
The only problem is that I could not find any standard functionality to
convert between different Japanese coding systems.

Then you didn't look hard enough:
'\x89\xef\x8e\xd0\x8aT\x97v'
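
A minimal sketch of the kind of conversion this points at, written for Python 3 (the thread itself uses Python 2) and assuming the byte string above is Shift_JIS. The helper name `transcode` is made up for illustration; `bytes.decode`, `str.encode` and `codecs.lookup` are standard library calls.

```python
import codecs

# Byte string quoted in the reply above; assumed here to be Shift_JIS.
SJIS_BYTES = b"\x89\xef\x8e\xd0\x8aT\x97v"


def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Convert bytes from one encoding to another via Unicode text."""
    text = data.decode(src)   # bytes -> str
    return text.encode(dst)   # str -> bytes in the target encoding


# The charset labels from the question are accepted as codec names or aliases.
for label in ("ISO-2022-JP", "UTF-8", "EUC-JP", "SJIS"):
    codec_name = codecs.lookup(label).name
    print(label, "->", codec_name, transcode(SJIS_BYTES, "shift_jis", label))
```

Going through Unicode text as the intermediate form means any pair of the four encodings can be converted with the same two calls.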

See also http://www.amk.ca/python/howto/unicode

Peter
 
Justin Ezequiel

Dietrich said:
Are there any functions in python to convert between different Japanese
coding systems?

I would like to convert between (at least) ISO-2022-JP, UTF-8, EUC-JP
and SJIS.  I also need some function to encode / decode base64 encoded
strings.

Example:

email = '''[...]
Subject:
=?UTF-8?Q?romaji=E3=81=B2=E3=82=89=E3=81=8C=E3=81=AA=E3=82=AB=E3=82=BF?=
=?UTF-8?Q?=E3=82=AB=E3=83=8A=E6=BC=A2=E5=AD=97?=
[...]
Content-Type: text/plain; charset=EUC-JP
[...]
Content-Transfer-Encoding: base64
[...]

cm9tYWpppNKk6aSspMqlq6W/paulyrTBu/oNCg0K

'''

  from = contentType
  to = 'utf-8'
  contentUtf8 = convertCodingSystem(decodeBase64(content), from, to)

The only problem is that I could not find any standard functionality to
convert between different Japanese coding systems.

Thanks,

Dietrich Bollmann

import base64

ENCODINGS = ['ISO-2022-JP', 'UTF-8', 'EUC-JP', 'SJIS']

def decodeBase64(content):
    # Undo the base64 transfer encoding; the result is a byte string.
    return base64.decodestring(content)

def convertCodingSystem(s, _from, _to):
    # Decode to a unicode object first, then encode to the target charset.
    text = s.decode(_from)
    return text.encode(_to)

if __name__ == '__main__':
    content = 'cm9tYWpppNKk6aSspMqlq6W/paulyrTBu/oNCg0K'
    _from = 'EUC-JP'
    for _to in ENCODINGS:
        x = convertCodingSystem(decodeBase64(content), _from, _to)
        print _to, repr(x)
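
On Python 3 the same idea needs two small changes: `base64.decodestring` is no longer available (it was removed in Python 3.9), and byte strings and text are separate types. A hedged sketch of a port follows; the snake_case function names are a renaming for illustration, not part of the original post.

```python
import base64

ENCODINGS = ["ISO-2022-JP", "UTF-8", "EUC-JP", "SJIS"]

# Base64 body taken from the example email in the question.
CONTENT = "cm9tYWpppNKk6aSspMqlq6W/paulyrTBu/oNCg0K"


def decode_base64(content: str) -> bytes:
    """Undo the base64 transfer encoding; returns raw bytes."""
    return base64.b64decode(content)


def convert_coding_system(data: bytes, src: str, dst: str) -> bytes:
    """Decode bytes to text with `src`, then encode the text with `dst`."""
    return data.decode(src).encode(dst)


raw = decode_base64(CONTENT)
for dst in ENCODINGS:
    # Printing bytes shows escapes, so this is safe on any terminal.
    print(dst, convert_coding_system(raw, "EUC-JP", dst))
```

The charset taken from the Content-Type header can be passed as `src` directly, since Python accepts the usual MIME labels as codec aliases.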
 
Justin Ezequiel

import email
from email.Header import decode_header
from unicodedata import name as un

MS = '''\
Subject: =?UTF-8?Q?romaji=E3=81=B2=E3=82=89=E3=81=8C=E3=81=AA=E3=82=AB=E3=82=BF?=
Date: Thu, 19 Feb 2009 09:34:56 -0000
MIME-Version: 1.0
Content-Type: text/plain; charset=EUC-JP
Content-Transfer-Encoding: base64

cm9tYWpppNKk6aSspMqlq6W/paulyrTBu/oNCg0K
'''

def get_header(msg, name):
    # Assumes the header decodes to exactly one (value, charset) part.
    (value, charset), = decode_header(msg.get(name))
    if not charset: return value
    return value.decode(charset)

if __name__ == '__main__':
    msg = email.message_from_string(MS)
    s = get_header(msg, 'Subject')
    print repr(s)
    # Print the Unicode name of each character to check the decoding.
    for c in s:
        try: print un(c)
        except ValueError: print repr(c)
    print

    # Charset from Content-Type; decode=True undoes the base64 encoding.
    e = msg.get_content_charset()
    b = msg.get_payload(decode=True).decode(e)
    print repr(b)
    for c in b:
        try: print un(c)
        except ValueError: print repr(c)
    print
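
The single-tuple unpacking in `get_header` assumes the header decodes to exactly one part, while a header may contain several encoded words, possibly in different charsets. A possible Python 3 sketch that covers the general case with `email.header.make_header`; the sample message is rebuilt from the question (with the second encoded word as a folded continuation line), and the helper names `decoded_header` and `decoded_body` are hypothetical.

```python
import email
from email.header import decode_header, make_header

# Sample rebuilt from the question; the second encoded word is a folded
# continuation line (leading space), as it would appear in a real message.
RAW_MESSAGE = (
    "Subject: =?UTF-8?Q?romaji=E3=81=B2=E3=82=89=E3=81=8C=E3=81=AA=E3=82=AB=E3=82=BF?=\n"
    " =?UTF-8?Q?=E3=82=AB=E3=83=8A=E6=BC=A2=E5=AD=97?=\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=EUC-JP\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "cm9tYWpppNKk6aSspMqlq6W/paulyrTBu/oNCg0K\n"
)


def decoded_header(msg, name):
    """Return the header as text, joining all of its encoded words."""
    raw = msg.get(name)
    if raw is None:
        return None
    return str(make_header(decode_header(raw)))


def decoded_body(msg):
    """Undo the transfer encoding, then decode with the declared charset."""
    charset = msg.get_content_charset() or "ascii"
    return msg.get_payload(decode=True).decode(charset)


msg = email.message_from_string(RAW_MESSAGE)
subject = decoded_header(msg, "Subject")
body = decoded_body(msg)
# ascii() keeps the output printable even on a non-Unicode terminal.
print(ascii(subject))
print(ascii(body))
```

Once both values are text, `subject.encode('shift_jis')` or any other target codec gives the converted bytes.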
 
