Shift-JIS to UTF-8 conversion

Discussion in 'Python' started by PyTJ, May 19, 2005.

  1. PyTJ

    PyTJ Guest

    Hello everybody,

    I need to convert a Japanese Shift-JIS CSV file to Unicode UTF-8.

    My machine is an English Windows 98 computer with Python 2.3.4.

    Any hints?
    PyTJ, May 19, 2005
    #1

  2. PyTJ

    Jeff Epler Guest

    I think you do something like this (untested):

    import codecs

    def transcode(infile, outfile, incoding="shift-jis",
                  outcoding="utf-8"):
        f = codecs.open(infile, "rb", incoding)
        g = codecs.open(outfile, "wb", outcoding)

        g.write(f.read())
        # If the file is so large that it can't be read at once, do a loop which
        # reads and writes smaller chunks
        # while 1:
        #     block = f.read(4096000)
        #     if not block: break
        #     g.write(block)

        f.close()
        g.close()

    Jeff

    Jeff Epler, May 20, 2005
    #2
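    The chunked loop that the reply above leaves commented out can be sketched as follows. This is a minimal sketch, not the poster's code: it assumes a modern Python 3 interpreter (where the Shift-JIS codec ships with the standard library) rather than the Python 2.3 of the thread, and the name transcode_stream and the chunk_chars parameter are hypothetical.

```python
def transcode_stream(infile, outfile, incoding="shift_jis",
                     outcoding="utf-8", chunk_chars=64 * 1024):
    """Re-encode a text file chunk by chunk (assumes Python 3)."""
    # newline="" turns off newline translation, so the CSV's original
    # line endings (for example \r\n) pass through unchanged.
    with open(infile, "r", encoding=incoding, newline="") as src, \
         open(outfile, "w", encoding=outcoding, newline="") as dst:
        while True:
            # read() counts decoded characters, not bytes, so a
            # multi-byte Shift-JIS character is never split in half.
            block = src.read(chunk_chars)
            if not block:
                break
            dst.write(block)
```

    Usage would look like transcode_stream("input.csv", "output.csv"), where both file names are placeholders. Reading in text mode lets the decoder carry partial byte sequences across chunk boundaries, which a manual bytes-level loop would have to handle itself.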

  3. PyTJ

    Guest

    Hello,
    I think the answer is basically correct, but shift-jis is not a standard
    part of Python 2.3. You will either need to use Python 2.4, where the
    cjkcodecs are integrated, or install them under Python 2.3. The link is
    http://cjkpython.i18n.org/

    You then also need:
    import cjkcodecs.aliases

    Richard

    Jeff Epler wrote:
    > I think you do something like this (untested):
    >
    > import codecs
    >
    > def transcode(infile, outfile, incoding="shift-jis",
    >               outcoding="utf-8"):
    >     f = codecs.open(infile, "rb", incoding)
    >     g = codecs.open(outfile, "wb", outcoding)
    >
    >     g.write(f.read())
    >     # If the file is so large that it can't be read at once, do a loop which
    >     # reads and writes smaller chunks
    >     # while 1:
    >     #     block = f.read(4096000)
    >     #     if not block: break
    >     #     g.write(block)
    >
    >     f.close()
    >     g.close()
    >
    > Jeff
    , May 20, 2005
    #3
  4. PyTJ wrote:

    > I need to convert a Japanese Shift-JIS CSV file to Unicode UTF-8.
    >
    > My machine is an English Windows 98 computer with Python 2.3.4.
    >
    > Any hints?
    >


    First, you need to install codecs to support Japanese encodings.
    Python 2.3.* does not support SJIS by default.

    I'll give you two options.

    - Japanese Codecs
    http://www.python.jp/Zope/download/JapaneseCodecs

    http://ftp.python.jp/pub/JapaneseCodecs/JapaneseCodecs-1.4.10.win32-py2.3.exe

    - CJKCodecs
    http://cjkpython.i18n.org/
    http://download.berlios.de/cjkpython/cjkcodecs-1.1.win32-py2.3.exe

    If you only need Japanese support, Japanese Codecs might be handy.
    On the other hand, CJKCodecs can handle a much broader range of encodings.
    Aside from that, starting from 2.4, Python ships with CJKCodecs,
    so I'd recommend CJKCodecs without reservations.

    -- george
    George Yoshida, May 20, 2005
    #4
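    Both replies above turn on whether the running interpreter actually has a Shift-JIS codec registered. One way to check before attempting the conversion is to ask the codec registry directly. This is a small sketch; the helper name codec_available is hypothetical, while codecs.lookup is the standard-library call that raises LookupError for an unknown encoding.

```python
import codecs

def codec_available(name):
    """Return True if the interpreter can find a codec called `name`."""
    try:
        codecs.lookup(name)
    except LookupError:
        # No codec is registered under this name, for example
        # Shift-JIS on a plain Python 2.3 without an add-on package.
        return False
    return True

if __name__ == "__main__":
    for encoding in ("shift_jis", "utf-8"):
        print(encoding, codec_available(encoding))
```

    If the check fails on an older interpreter, installing one of the codec packages listed above (or upgrading to Python 2.4 or later) is the fix the thread suggests.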
  5. PyTJ

    Jeff Epler Guest

    On Fri, May 20, 2005 at 12:16:15AM -0700, wrote:
    > Hello, I think the answer is basically correct but shift-jis is not a
    > standard part of Python 2.3.


    Ah, I was fooled --- I tested on Python 2.3, but my packager must have
    included the codecs you went on to mention.

    Jeff

    Jeff Epler, May 23, 2005
    #5