Need help on UNICODE conversion


Bernd Preusing

Hi,

today I (a Python beginner) ran into a problem:

I have a JPG file which contains some comment as unicode.

After reading in the string with s=file.read(70) from file offset 4
I get a string which is shown as
'UNICODE\0x00\ox00K\0x00o' and so forth in the debugger
(using Komodo).

How do I convert such a string to a real Unicode string, and then to
windows_1252 or latin1? I know it's a text with
German umlauts.

I tried this:
if rawdata[:7] == "UNICODE":
    ustring = rawdata[7:]
    us2 = unicode(ustring, "windows_1252")
    as2 = us2.encode("windows_1252")
    self.dic["ComUNI"] = rawdata

But all I get at each stage is a normal string with lots of \0x00.

TIA
Bernd
 

Martin v. Löwis

Bernd Preusing said:
After reading in the string with s=file.read(70) from file offset 4
I get a string which is shown as
'UNICODE\0x00\ox00K\0x00o' and so forth in the debugger
(using Komodo).

Can you find out what the real value of that string is? I very much
doubt that it contains literal backslashes. Also, I find it strange
that it has the letter 'o' after one backslash, but the number '0'
after all other backslashes.

Regards,
Martin
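
One way to get at the real value is to print the repr() of the string
instead of relying on the debugger's display. A minimal sketch in Python 2
(the file name is only a placeholder; offset and length are taken from the
original post):

f = open("picture.jpg", "rb")   # placeholder name for the JPG in question
f.seek(4)                       # offset 4, as in the original post
s = f.read(70)
f.close()
print repr(s)                   # e.g. 'UNICODE\x00\x00K\x00o\x00m...'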
 

Peter Otten

Bernd said:
I have a JPG file which contains some comment as unicode.

After reading in the string with s=file.read(70) from file offset 4
I get a string which is shown as
'UNICODE\0x00\ox00K\0x00o' and so forth in the debugger
(using Komodo).

Seems that this is not properly cut and pasted :-(

I suppose that "\0x00" is just a complicated replacement for "\x00" used by
the debugger. As long as all characters are in the range 0..255, you could
simply remove every other character:
"XHXeXlXlXoX XWXoXrXlXd"[1::2] 'Hello World'

Use 8 instead of 1 as start index to also remove "UNICODE".
That might eliminate the need for a unicode string, or you could easily
create one from the "normal" string.


Peter
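
Applied to the bytes Bernd posts further down (an 8-byte "UNICODE\x00"
header, so the text itself starts at index 9 rather than 8), Peter's
slicing trick might look like this sketch; the sample data is shortened:

rawdata = "UNICODE\x00" + "\x00K\x00o\x00m\x00m\x00e\x00n\x00t\x00a\x00r"
print rawdata[9::2]    # every other byte after the header -> Kommentar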
 

Erik Max Francis

Bernd said:
I have a JPG file which contains some comment as unicode.

After reading in the string with s=file.read(70) from file offset 4
I get a string which is shown as
'UNICODE\0x00\ox00K\0x00o' and so forth in the debugger
(using Komodo).

As others have pointed out, this seems to be an unfaithful cut and
paste; to really tell what it is we'd have to see the actual contents of
the string. If it is really Unicode, however, it looks like it might be
a UTF-16 encoding. Try 'utf-16' for the encoding name.
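
A rough sketch of that attempt in Python 2; the 8-byte "UNICODE\x00" header
and the sample bytes are taken from Bernd's follow-up below, and since there
is no BOM the plain 'utf-16' codec has to guess the byte order, so a
big-endian fallback is included:

rawdata = ("UNICODE\x00"
           "\x00K\x00o\x00m\x00m\x00e\x00n\x00t\x00a\x00r"
           "\x00 \x00*\x00\xe4\x00\xdc\x00*")
payload = rawdata[8:]                    # drop the "UNICODE\x00" header
try:
    us = unicode(payload, "utf-16")      # no BOM: codec assumes native byte order
except UnicodeDecodeError:
    us = unicode(payload, "utf-16-be")   # the bytes are actually big-endian
print repr(us)                           # u'Kommentar *\xe4\xdc*'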
 

Bernd Preusing

Erik Max Francis said:
As others have pointed out, this seems to be an unfaithful cut and
paste; to really tell what it is we'd have to see the actual contents of
the string. If it is really Unicode, however, it looks like it might be
a UTF-16 encoding. Try 'utf-16' for the encoding name.

Yes, sorry. Cut & paste was not possible, so I wrote it down
with some errors, very tired and frustrated :-(
I had tried to attach a small screenshot, but this is not a binary
newsgroup...

My first mistake was to cut off the first 7 bytes, when I actually had
to remove 8.

The byte array is
0000: 55 4e 49 43 4f 44 45 00 00 4b 00 6f 00 6d 00 6d UNICODE..K.o.m.m
0010: 00 65 00 6e 00 74 00 61 00 72 00 20 00 55 00 6e .e.n.t.a.r. .U.n
0020: 00 69 00 63 00 6f 00 64 00 65 00 20 00 2a 00 e4 .i.c.o.d.e. .*..
0030: 00 f6 00 fc 00 c4 00 d6 00 dc 00 df 00 2a 00 0d
0040: 00 0a 00 0d 00 0a

I had to cut off the beginning, which is "UNICODE\x00".
The remainder means "Kommentar Unicode *äöüÄÖÜß*"
(it contains German umlauts at the end).

Now I have a string
ustring = "\x00K\x00o\x00m....."

us2 = unicode(ustring, "utf_16")
yields: UnicodeDecodeError: 'utf16' codec can't decode bytes in
position 48-49: illegal encoding

Strange, because that position is at "00 dc" and not earlier!?

Following your tips I stripped off all remaining \x00 and got
"Kommentar Unicode *\xe4\xf6\xfc\xc4\xd6\xdc\xdf*\r\n\r\n"

I can go on with that string now :))
But what would have been the "right" way?

Thanks again
Bernd
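
The position 48-49 does make sense: without a BOM, the 'utf_16' codec falls
back to the machine's native byte order, which is little-endian on Windows.
Read little-endian, pairs like "\x00K" become valid (if wrong) characters
such as U+4B00, while "\x00\xdc" becomes U+DC00, a lone surrogate, which is
the first illegal value in the data. A small sketch of the effect in
Python 2, forcing little-endian explicitly:

print repr("\x00K".decode("utf-16-le"))     # u'\u4b00', wrong but legal
try:
    "\x00\xdc".decode("utf-16-le")          # U+DC00 is an unpaired surrogate
except UnicodeDecodeError, e:
    print e                                 # ... illegal encoding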
 

Martin v. Löwis

Erik Max Francis said:
u'Kommentar Unicode *\xe4\xf6\xfc\xc4\xd6\xdc\xdf*\r\n\r\n'

... which I can convert to Latin-1 and print to then see the umlauts and
the double S.

It is better to use "utf-16-be" as the codec name in the first place,
instead of artificially prepending a BOM, and letting the UTF-16 codec
determine byte order.

Regards,
Martin
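
A minimal sketch of that approach in Python 2, reusing the names from
Bernd's original snippet (the sample data is shortened):

rawdata = ("UNICODE\x00"
           "\x00K\x00o\x00m\x00m\x00e\x00n\x00t\x00a\x00r"
           "\x00 \x00U\x00n\x00i\x00c\x00o\x00d\x00e")
if rawdata[:8] == "UNICODE\x00":
    us2 = unicode(rawdata[8:], "utf-16-be")   # no BOM needed, no byte-order guessing
    as2 = us2.encode("windows-1252")          # or "latin-1" for printing
    print as2                                 # -> Kommentar Unicode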
 
