Need help on UNICODE conversion

Discussion in 'Python' started by Bernd Preusing, Sep 6, 2003.

  1. Hi,

    today I (Python beginner) ran into a problem:

    I have a JPG file which contains some comment as unicode.

    After reading in the string with s=file.read(70) from file offset 4
    I get a string which is shown as
    'UNICODE\\0x00\\ox00K\\0x00o' and so forth in the debugger
    (using Komodo).

    How do I convert such a string to a real unicode string and to
    a windows_1252 or latin1 string afterwards? I know it's a text with
    German umlauts.

    I tried this:
    if rawdata[:7] == "UNICODE":
        ustring = rawdata[7:]
        us2 = unicode(ustring, "windows_1252")
        as2 = us2.encode("windows_1252")
        self.dic["ComUNI"] = rawdata

    But all I get at each stage is a normal string with lots of \\0x00.

    TIA
    Bernd
     
    Bernd Preusing, Sep 6, 2003
    #1

  2. Bernd Preusing writes:

    > After reading in the string with s=file.read(70) from file offset 4
    > I get a string which is shown as
    > 'UNICODE\\0x00\\ox00K\\0x00o' and so forth in the debugger
    > (using Komodo).


    Can you find out what the real value of that string is? I very much
    doubt that it contains literal backslashes. Also, I find it strange
    that it has the letter 'o' after one backslash, but the number '0'
    after all other backslashes.

    Regards,
    Martin
     
    Martin v. Löwis, Sep 6, 2003
    #2

  3. Bernd Preusing wrote:

    > I have a JPG file which contains some comment as unicode.
    >
    > After reading in the string with s=file.read(70) from file offset 4
    > I get a string which is shown as
    > 'UNICODE\\0x00\\ox00K\\0x00o' and so forth in the debugger
    > (using Komodo).


    It seems this was not cut and pasted properly :-(

    I suppose that "\\0x00" is just a complicated replacement for "\x00" used by
    the debugger. As long as all characters are in the range 0..255, you could
    simply remove every other character:

    >>> "XHXeXlXlXoX XWXoXrXlXd"[1::2]
    'Hello World'
    >>>


    Use 9 instead of 1 as the start index to also skip "UNICODE" and the
    zero bytes that follow it. That might eliminate the need for a unicode
    string, or you could easily create one from the "normal" string.
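
    A rough sketch of that slicing in Python 3 terms, where the raw data
    is a bytes object. It assumes the layout shown in the hex dump later
    in this thread: an 8-byte "UNICODE\x00" header, then two bytes per
    character with a zero high byte. The variable names are made up for
    illustration, and the sample is shortened to the first word.

```python
# Assumed layout (taken from the hex dump in this thread): 8-byte header,
# then big-endian 16-bit units whose high byte is zero for Latin-1 text.
raw = b"UNICODE\x00" + "Kommentar".encode("utf-16-be")

# Indices 0..7 hold the header. Index 8 is the zero high byte of the
# first character, so the character bytes sit at 9, 11, 13, ...
text_bytes = raw[9::2]
print(text_bytes)  # b'Kommentar'

# Only valid while every character is in the range 0..255.
text = text_bytes.decode("latin-1")
```

    This shortcut silently produces garbage as soon as a character outside
    0..255 appears, which is why decoding with a real codec is safer.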


    Peter
     
    Peter Otten, Sep 6, 2003
    #3
  4. Bernd Preusing wrote:

    > I have a JPG file which contains some comment as unicode.
    >
    > After reading in the string with s=file.read(70) from file offset 4
    > I get a string which is shown as
    > 'UNICODE\\0x00\\ox00K\\0x00o' and so forth in the debugger
    > (using Komodo).


    As others have pointed out, this seems to be an unfaithful cut and
    paste; to really tell what it is we'd have to see the actual contents of
    the string. If it is really Unicode, however, it looks like it might be
    a UTF-16 encoding. Try 'utf-16' for the encoding name.

    --
    Erik Max Francis && && http://www.alcyone.com/max/
    __ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
    / \ You're wasting time / Asking what if / You linger on too long
    \__/ Chante Moore
     
    Erik Max Francis, Sep 6, 2003
    #4
  5. Erik Max Francis wrote:

    >Bernd Preusing wrote:
    >
    >> I have a JPG file which contains some comment as unicode.
    >>
    >> After reading in the string with s=file.read(70) from file offset 4
    >> I get a string which is shown as
    >> 'UNICODE\\0x00\\ox00K\\0x00o' and so forth in the debugger
    >> (using Komodo).

    >
    >As others have pointed out, this seems to be an unfaithful cut and
    >paste; to really tell what it is we'd have to see the actual contents of
    >the string. If it is really Unicode, however, it looks like it might be
    >a UTF-16 encoding. Try 'utf-16' for the encoding name.


    Yes, sorry. Cut & paste was not possible, so I typed it out by hand
    and made some errors, being very tired and frustrated :-(
    I had tried to attach a small screenshot, but this is not a binaries
    newsgroup...

    My first mistake was to cut off only the first 7 bytes; I had to
    remove 8.

    The byte array is
    0000: 55 4e 49 43 4f 44 45 00 00 4b 00 6f 00 6d 00 6d UNICODE..K.o.m.m
    0010: 00 65 00 6e 00 74 00 61 00 72 00 20 00 55 00 6e .e.n.t.a.r. .U.n
    0020: 00 69 00 63 00 6f 00 64 00 65 00 20 00 2a 00 e4 .i.c.o.d.e. .*..
    0030: 00 f6 00 fc 00 c4 00 d6 00 dc 00 df 00 2a 00 0d .............*..
    0040: 00 0a 00 0d 00 0a                               ......

    I had to cut off the beginning, which is "UNICODE\x00".
    The remainder means "Kommentar Unicode *äöüÄÖÜß*"
    (this contains German umlauts at the end)

    Now I have a string
    ustring = "\x00K\x00o\x00m....."

    us2 = unicode(ustring, "utf_16")
    yields: UnicodeDecodeError: 'utf16' codec can't decode bytes in
    position 48-49: illegal encoding

    Strange, because that position is at "00 dc" and not earlier!?
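
    A plausible explanation, sketched in Python 3 and assuming the machine
    was little-endian (the thread does not say): without a byte order
    mark, the "utf_16" codec falls back to the platform's byte order, so
    each pair is read with its bytes swapped. Most swapped pairs still
    form valid code points, but "00 dc" becomes U+DC00, an unpaired low
    surrogate, and that is the first pair the decoder has to reject. The
    sketch forces "utf-16-le" so it behaves the same on any machine.

```python
# The 62 payload bytes after the "UNICODE\x00" header, as in the hex dump.
payload = "Kommentar Unicode *\xe4\xf6\xfc\xc4\xd6\xdc\xdf*\r\n\r\n".encode("utf-16-be")

# Everything before byte 48 decodes without complaint when the byte order
# is wrong; the result is just the wrong characters (K becomes U+4B00).
prefix = payload[:48].decode("utf-16-le")

try:
    payload.decode("utf-16-le")
    failure = None
except UnicodeDecodeError as exc:
    # Record where the decoder gave up and which bytes it rejected.
    failure = (exc.start, payload[exc.start:exc.end])

print(repr(prefix[0]), failure)
```

    The reported start offset should match the "position 48-49" in the
    error message above, pointing at the bytes for the letter Ü.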

    According to your tips I stripped off all remaining \x00 and got
    "Kommentar Unicode *\xe4\xf6\xfc\xc4\xd6\xdc\xdf*\r\n\r\n"

    I can go on with that string now :))
    But what would have been the "right" way?

    Thanks again
    Bernd
     
    Bernd Preusing, Sep 7, 2003
    #5
  6. Erik Max Francis writes:

    > >>> u = unicode(codecs.BOM_UTF16_BE + u, 'utf-16')
    > >>> u
    > u'Kommentar Unicode *\xe4\xf6\xfc\xc4\xd6\xdc\xdf*\r\n\r\n'
    >
    > ... which I can convert to Latin-1 and print to then see the umlauts and
    > the double S.


    It is better to use "utf-16-be" as the codec name in the first place,
    instead of artificially prepending a BOM, and letting the UTF-16 codec
    determine byte order.
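
    A minimal sketch of that approach in Python 3 terms, where the
    unicode(..., "utf-16-be") call corresponds to bytes.decode("utf-16-be").
    The helper name decode_comment and the HEADER constant are made up for
    illustration; the sample bytes are rebuilt from the hex dump earlier
    in the thread.

```python
HEADER = b"UNICODE\x00"  # the 8-byte prefix seen in the hex dump


def decode_comment(data):
    """Hypothetical helper: strip the header, decode the rest as UTF-16 big-endian."""
    if not data.startswith(HEADER):
        raise ValueError("missing UNICODE header")
    return data[len(HEADER):].decode("utf-16-be")


# Sample input reconstructed from the dump (header plus 62 payload bytes).
raw = HEADER + "Kommentar Unicode *\xe4\xf6\xfc\xc4\xd6\xdc\xdf*\r\n\r\n".encode("utf-16-be")

text = decode_comment(raw)

# Re-encode for a Windows-1252 or Latin-1 consumer; both give the same
# bytes here because these umlauts have identical codes in the two sets.
as_cp1252 = text.encode("cp1252")
as_latin1 = text.encode("latin-1")
print(as_cp1252)
```

    Naming the byte order explicitly keeps the result independent of the
    machine the script runs on.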

    Regards,
    Martin
     
    Martin v. Löwis, Sep 7, 2003
    #6