unknown encoding problem


Uwe Mayer

Hi,

I need to read in a text file which seems to be stored in some unknown
encoding. Opening the file and reading its contents returns:
'\x00 \x00 \x00<\x00l\x00o\x00g\x00E\x00n\x00t\x00r\x00y\x00...

Each character has a \x00 prepended to it. I suspect it's some kind of
unicode. How do I get rid of it?

str.replace('\x00', '') "works" but is not really nice. I haven't quite got
the hang of str.encode / str.decode.

Any ideas?
Thanks
Ciao
Uwe
 

Peter Otten

Uwe said:
I need to read in a text file which seems to be stored in some unknown
encoding. Opening the file and reading its contents returns:

'\x00 \x00 \x00<\x00l\x00o\x00g\x00E\x00n\x00t\x00r\x00y\x00...

Each character has a \x00 prepended to it. I suspect it's some kind of
unicode. How do I get rid of it?

Interleaved '\x00' bytes are indeed strong evidence for unicode. Use
codecs.open() to access the data in such a file; reading it then returns a
unicode string:
u' <logEntry'

If you don't want unicode, convert back to str with encode(), which gives:
' <logEntry'

Note that the last step may fail if the file contains characters not
available in the string encoding you specify.
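
The two steps above can be sketched as follows. This is only an illustration under assumptions that are not in the thread: it is written for Python 3 (where the thread's unicode/str pair corresponds to str/bytes), it uses io.open() as a stand-in for codecs.open() because both take an encoding argument, it guesses 'utf-16-be' from the zero-byte-first pattern in the quoted output, and it creates its own temporary sample file rather than reading the real log.

```python
import io
import os
import tempfile

# Sample bytes modelled on the output quoted in the thread (zero byte first).
raw = ' <logEntry'.encode('utf-16-be')

# Write them to a temporary file so the example is self-contained.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(raw)

# Decode while reading. 'utf-16-be' is an assumption based on the byte pattern.
with io.open(path, 'r', encoding='utf-16-be') as f:
    text = f.read()
os.remove(path)

print(repr(text))

# Converting back to a byte string works when every character fits
# into the target encoding.
as_latin1 = text.encode('latin-1')

# A character outside the target encoding raises UnicodeEncodeError
# unless an error handler such as 'replace' is given.
try:
    u'\u041c'.encode('latin-1')
    failed = False
except UnicodeEncodeError:
    failed = True
lossy = u'\u041c'.encode('latin-1', errors='replace')
```

Whether 'replace', 'ignore' or a hard failure is appropriate depends on what the log data is used for afterwards.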

Peter
 

Leif K-Brooks

Uwe said:
Hi,

I need to read in a text file which seems to be stored in some unknown
encoding. Opening the file and reading its contents returns:

'\x00 \x00 \x00<\x00l\x00o\x00g\x00E\x00n\x00t\x00r\x00y\x00...

Each character has a \x00 prepended to it. I suspect it's some kind of
unicode. How do I get rid of it?

f.read().decode('utf-16-be')
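
A short sketch of why the byte order matters when decoding in memory. The bytes are copied from the quoted output up to the "..."; the Python 3 bytes literal and the variable names are assumptions made for illustration. The zero byte comes first in each pair, which points to big-endian UTF-16 without a byte order mark.

```python
import codecs

# Bytes copied from the output quoted in the thread (up to the "...").
raw = b'\x00 \x00 \x00<\x00l\x00o\x00g\x00E\x00n\x00t\x00r\x00y'

# Zero byte first means the high byte comes first: big-endian UTF-16.
text = raw.decode('utf-16-be')
print(repr(text))

# The little-endian codec accepts the same bytes without complaint,
# but pairs them the wrong way round and yields unrelated characters.
wrong = raw.decode('utf-16-le')
print(repr(wrong))

# If the data starts with a byte order mark, the plain 'utf-16' codec
# reads the mark and picks the byte order itself.
with_bom = codecs.BOM_UTF16_BE + raw
detected = with_bom.decode('utf-16')
```

Without a byte order mark, the plain 'utf-16' codec has to assume a byte order, so naming it explicitly is the safer choice for data like this.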
 

John Machin

Uwe said:
Hi,

I need to read in a text file which seems to be stored in some unknown
encoding. Opening the file and reading its contents returns:

'\x00 \x00 \x00<\x00l\x00o\x00g\x00E\x00n\x00t\x00r\x00y\x00...

Each character has a \x00 prepended to it. I suspect it's some kind of
unicode. How do I get rid of it?

Interesting attitude. Why do you want to "get rid of it"? Have you
considered investigating the source of this suspicious text? You never
know, there could be something really interesting in there, like
'\x00v\x00o\x00n\x00 \x04\x1c\x04>\x04A\x04:\x042\x040\x00 \x00m\x00i\x00t\x00 \x00L\x00i\x00e\x00b' :)

Uwe said:
str.replace('\x00', '')

Why not go the whole hog:

''.join([c for c in foreign_text if 32 <= ord(c) <= 126 or c in
'\t\r\n'])

Alternatively, try embracing Unicode -- it's the way forward, and it's
not that difficult.
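
To make the point concrete, here is a sketch that runs the three approaches on the sample string above. It assumes Python 3 bytes literals and big-endian UTF-16; the variable names are invented for illustration.

```python
# The sample from this post as a bytes literal, joined across the line wrap.
raw = (b'\x00v\x00o\x00n\x00 \x04\x1c\x04>\x04A\x04:\x042\x040\x00 '
       b'\x00m\x00i\x00t\x00 \x00L\x00i\x00e\x00b')

# Decoding as big-endian UTF-16 keeps the Cyrillic word intact.
decoded = raw.decode('utf-16-be')
print(decoded)

# Stripping the zero bytes leaves the non-zero high bytes (\x04) behind,
# so the Cyrillic characters turn into control bytes plus stray ASCII.
stripped = raw.replace(b'\x00', b'')
print(repr(stripped))

# The "whole hog" filter then drops the control bytes as well and keeps
# only the printable ASCII leftovers.
foreign_text = stripped.decode('latin-1')
filtered = ''.join([c for c in foreign_text
                    if 32 <= ord(c) <= 126 or c in '\t\r\n'])
print(filtered)
```

Only the decoded version preserves the text; the other two silently discard or mangle every character outside the ASCII range.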
 
