Newbie Q: Extra spaces after conversion from utf-8 to utf-16-le ?

Arifi Koseoglu

Hello everyone.

I am an absolute newbie who has done a good amount of googling with
the keywords utf-8, utf-16, python, convert and has reasoned that the
following code could be used to convert a utf-8 text file to
utf-16-le (I believe this is what Windows uses for Unicode):

s1 = open("utf8_file_generated_with_perl.txt", "r").read()
s2 = unicode(s1, "utf-8")
s3 = s2.encode("utf-16-le")
open ("new_file_supposedly_in_utf16le", "w").write(s3)

Well, this code kind of works (meaning I do not get any errors), but
the produced file contains an extra space after every character
(l i k e t h i s) and Windows believes this is an ANSI (i.e.
non-Unicode) file. Clearly, what I think is working is actually not.

What do I need to do?

Many thanks in advance,
-arifi
 
Josiah Carlson

The "extra spaces" are expected. utf-16-le encodes every standard
/ASCII/ character as two bytes: the ASCII byte followed by a 'null'
byte. So "hello" becomes 'h\x00e\x00l\x00l\x00o\x00', and a viewer that
reads the file as single-byte text shows each null as a gap.
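
To see those null bytes directly, here is a minimal sketch. The string "hello" is just sample data, and the `u""` / `b""` literals are used so the snippet should run under both Python 2 and Python 3.

```python
# Encode a short ASCII-only string as UTF-16 little-endian.
text = u"hello"
encoded = text.encode("utf-16-le")

# Every ASCII character occupies two bytes: the character, then a null.
print(repr(encoded))
print(len(text), len(encoded))
```

The encoded form is twice as long as the text, which is the "space after every character" effect when the file is viewed as single-byte text.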

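Generally, Windows programs do not detect the encoding of a file on
their own; unless something tells them otherwise, many of them treat
it as single-byte (ANSI) text. What many (not all) systems do to tell
the app which encoding is in use is place what is known as a 'BOM'
(byte order mark) at the beginning of the file. For utf-16-le the BOM
is the two bytes FF FE. Check unicode.org for more information.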
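
As a rough sketch of what adding a BOM looks like in Python: the standard `codecs` module exposes the constant `codecs.BOM_UTF16_LE`, which can be put in front of the encoded bytes. The sample string is an assumption for illustration only.

```python
import codecs

text = u"hello"

# The utf-16-le codec does not emit a BOM by itself, so prepend it by hand.
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The first two bytes now mark the data as little-endian UTF-16.
print(repr(data[:2]))

# The generic "utf-16" decoder reads the BOM, picks the byte order and drops it.
print(data.decode("utf-16") == text)
```

Whether a given Windows program honours the BOM depends on that program, so treat this as a hint to the reader of the file rather than a guarantee.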

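You will also likely find that opening files as 'binary' ("rb" and
"wb") in Windows, when working with unicode, goes a long way towards
making correct output. In text mode Windows rewrites every "\n" byte
as "\r\n" on output, which inserts a stray byte into the 2-byte
utf-16 sequence and shifts everything after it.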
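
Putting the pieces together, the conversion from the question could be sketched roughly as below. The helper name `utf8_file_to_utf16le` and its `add_bom` parameter are hypothetical, made up for this example, and writing a BOM by default is an assumption rather than something the original code did.

```python
import codecs


def utf8_file_to_utf16le(src_path, dst_path, add_bom=True):
    """Hypothetical helper: re-encode a UTF-8 file as UTF-16-LE."""
    # "rb" returns the raw bytes with no newline translation.
    with open(src_path, "rb") as src:
        text = src.read().decode("utf-8")

    data = text.encode("utf-16-le")
    if add_bom:
        # Optional marker so readers can recognise the encoding.
        data = codecs.BOM_UTF16_LE + data

    # "wb" writes the bytes exactly as encoded.
    with open(dst_path, "wb") as dst:
        dst.write(data)
```

With the file names from the question, a call would look like `utf8_file_to_utf16le("utf8_file_generated_with_perl.txt", "new_file_supposedly_in_utf16le")`.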

- Josiah
 
Arifi Koseoglu

Many thanks Josiah,

Reading and writing in binary form did the trick.

Very much appreciated.
Cheers
-arifi
 