Convert big5 to unicode

Discussion in 'Python' started by GM, Sep 7, 2006.

  1. GM

    GM Guest

    Dear all,

    Could you give me some guidance on how to convert my Big5 string to
    Unicode using Python? I already know that I can use cjkcodecs or
    Python 2.4, but I still have no idea what exactly I should do.
    Please give me some sample code if you can. Thanks a lot.


    GM, Sep 7, 2006

  2. GM

    xiejw Guest

    Install the codecs. On Debian, you can run:
    apt-get install python-cjkcodecs

    Then it is easy to decode (I use 'gb2312' here):

    str = '我们'
    u = unicode(str,'gb2312')

    The conversion is done, and you can get the string as UTF-8:
    str_utf8 = u.encode("utf-8")

    You can also get back the original GB2312 string:
    str_gb = u.encode("gb2312")
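
    The snippet above is Python 2. A minimal Python 3 sketch of the same three steps follows, assuming the input arrives as GB2312-encoded bytes. In Python 3, str is already Unicode, so unicode(...) becomes bytes.decode(...). The name gb_bytes is illustrative and not from the original post.

```python
# Stand-in for GB2312 data read from a file or socket (hypothetical input).
gb_bytes = "我们".encode("gb2312")

u = gb_bytes.decode("gb2312")   # bytes -> str (Unicode text)
str_utf8 = u.encode("utf-8")    # str -> UTF-8 bytes
str_gb = u.encode("gb2312")     # str -> the original GB2312 bytes

# ascii() keeps the output printable on terminals that cannot show CJK.
print(ascii(u), str_utf8, str_gb == gb_bytes)
```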

    GM wrote:
    xiejw, Sep 7, 2006

  3. GM

    John Machin Guest

    xiejw topposted:
    With Python 2.4 on Windows, no extra installation step is required.

    | Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
    | >>> bc = '\xb1i'
    | >>> unicode(bc, 'big5')
    | u'\u5f35'
    | >>>
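
    For comparison, here is a Python 3 sketch of the same check. It is an addition for readers on Python 3, not part of the original session. Byte strings take a b prefix and are decoded with .decode(); the big5 codec is part of the standard library.

```python
bc = b"\xb1i"                 # the same two Big5 bytes as in the session above
text = bc.decode("big5")      # bytes -> str (Unicode text)

print(ascii(text))            # prints '\u5f35'
print(text.encode("big5") == bc)   # round trip back to Big5: prints True
```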

    John Machin, Sep 7, 2006
  4. Gary, I used this Java program quite a few years ago to convert
    various Big5 files to UTF-16. (Sorry it's Java not Python, but I'm a
    very recent convert to the latter.) My newsgroup reader has messed the
    formatting up somewhat. If this causes a problem, email me and I'll
    send you the source directly.

    -Richard Schulman

    /* This program converts an input file of one encoding format to
     * an output file of another format. It will be mainly used to
     * convert Big5 text files to Unicode text files.
     */

    import java.io.*;

    public class ConvertEncoding
    {   public static void main(String[] args)
        {   try
            {   // Arguments are "input file", "output file",
                // "input encoding", "output encoding".
                // The output encoding is cut off in the post; "UTF-16LE"
                // is assumed here from the comments below.
                convert(args[0], args[1], "BIG5", "UTF-16LE");
                // Or, for GB2312 input:
                // convert(args[0], args[1], "GB2312", "UTF-16LE");
                // or numerous variations thereon. Among possible
                // choices for input or output:
                // "GB2312", "BIG5", "UTF8", "UTF-16LE".
                // The last named is MS UCS-2 format.
            }
            catch (Exception e)
            {   System.out.print(e.getMessage());
            }
        }

        public static void convert(String infile, String outfile,
                                   String from, String to)
            throws IOException, UnsupportedEncodingException
        {   // Set up byte streams; fall back to stdin/stdout
            InputStream in;
            if (infile != null)
                in = new FileInputStream(infile);
            else
                in = System.in;

            OutputStream out;
            if (outfile != null)
                out = new FileOutputStream(outfile);
            else
                out = System.out;

            // Set up character streams
            Reader r = new BufferedReader(new InputStreamReader(in, from));
            Writer w = new BufferedWriter(new OutputStreamWriter(out, to));

            w.write("\ufeff"); // This character signals Unicode in the NT environment
            char[] buffer = new char[4096];
            int len;
            while ((len = r.read(buffer)) != -1)
                w.write(buffer, 0, len);
            r.close();
            w.close(); // also flushes the buffered output
        }
    }
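
    Since the original question asked for Python, here is a rough Python 3 sketch of the same idea. It is not a line-by-line translation of the Java program. The function name convert, its parameter names, and the default encodings are assumptions chosen to mirror the Java version.

```python
def convert(infile, outfile, from_enc="big5", to_enc="utf-16-le"):
    """Re-encode the text in `infile` and write it to `outfile`."""
    # newline="" turns off newline translation, so line endings are
    # copied through unchanged.
    with open(infile, "r", encoding=from_enc, newline="") as src, \
         open(outfile, "w", encoding=to_enc, newline="") as dst:
        dst.write("\ufeff")  # byte order mark, as in the Java program
        # Copy in 4096-character chunks, mirroring the Java buffer size.
        for chunk in iter(lambda: src.read(4096), ""):
            dst.write(chunk)
```

    For example, convert("in.txt", "out.txt") would read Big5 text and write UTF-16LE; both file names are placeholders.
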
    Richard Schulman, Sep 7, 2006