Unicode string handling problem


Richard Schulman

The following program fragment works correctly with an ascii input
file.

But the file I actually want to process is Unicode (utf-16 encoding).
The file must be Unicode rather than ASCII or Latin-1 because it
contains mixed Chinese and English characters.

When I run the program below I get an attribute_count of zero, which
is incorrect for the input file, which should give a value of fifteen
or sixteen. In other words, the count function isn't recognizing the
", characters in the line being read. Here's the program:

in_file = open("c:\\pythonapps\\in-graf1.my","rU")
try:
    # Skip the first line; make the second available for processing
    in_file.readline()
    in_line = readline()
    attribute_count = in_line.count('",')
    print attribute_count
finally:
    in_file.close()

Any suggestions?

Richard Schulman
(For email reply, delete the 'xx' characters)
 

John Machin

Richard said:
The following program fragment works correctly with an ascii input
file.

But the file I actually want to process is Unicode (utf-16 encoding).
The file must be Unicode rather than ASCII or Latin-1 because it
contains mixed Chinese and English characters.

When I run the program below I get an attribute_count of zero, which
is incorrect for the input file, which should give a value of fifteen
or sixteen. In other words, the count function isn't recognizing the
", characters in the line being read. Here's the program:

in_file = open("c:\\pythonapps\\in-graf1.my","rU")
try:
# Skip the first line; make the second available for processing
in_file.readline()
in_line = readline()

You mean in_line = in_file.readline(), I hope. Do please copy/paste
actual code, not what you think you ran.
attribute_count = in_line.count('",')
print attribute_count

Insert
print type(in_line)
print repr(in_line)
here [also make the appropriate changes to get the same info from the
first line], run it again, copy/paste what you get, show us what you
see.

If you're coy about that, then you'll have to find out yourself if it
has a BOM at the front, and if not whether it's little/big/endian.
finally:
in_file.close()

Any suggestions?

1. Read the Unicode HOWTO.
2. Read the docs on the codecs module ...

You'll need to use

in_file = codecs.open(filepath, mode, encoding="utf16???????")

It would also be a good idea to get into the habit of using unicode
constants like u'",'
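
Putting those two together, a minimal sketch might look like the
following (assuming the file does turn out to have a BOM, so the plain
"utf_16" codec can work out the byte order itself; filename taken from
your post):

import codecs

in_file = codecs.open("c:\\pythonapps\\in-graf1.my", "r", encoding="utf_16")
try:
    in_file.readline()                      # skip the SQL INSERT line
    in_line = in_file.readline()            # first data row
    print in_line.count(u'",')              # note the unicode constant
finally:
    in_file.close()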

HTH,
John
 

John Roth

Richard said:
The following program fragment works correctly with an ascii input
file.

But the file I actually want to process is Unicode (utf-16 encoding).
The file must be Unicode rather than ASCII or Latin-1 because it
contains mixed Chinese and English characters.

When I run the program below I get an attribute_count of zero, which
is incorrect for the input file, which should give a value of fifteen
or sixteen. In other words, the count function isn't recognizing the
", characters in the line being read. Here's the program:

in_file = open("c:\\pythonapps\\in-graf1.my","rU")
try:
    # Skip the first line; make the second available for processing
    in_file.readline()
    in_line = readline()
    attribute_count = in_line.count('",')
    print attribute_count
finally:
    in_file.close()

Any suggestions?

Richard Schulman
(For email reply, delete the 'xx' characters)

You're not detecting the file encoding and then
using it in the open statement. If you know this is
utf-16le or utf-16be, you need to say so in the
open. If you don't, then you should read it into
a string, go through some autodetect logic, and
then decode it with the <string>.decode(encoding)
method.

A clue: a properly formatted utf-16 or utf-32
file MUST have a BOM as the first character.
That's mandated in the unicode standard. If
it doesn't have a BOM, then try ascii and
utf-8 in that order. The first
one that succeeds is correct. If neither succeeds,
you're on your own in guessing the file encoding.
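
A rough sketch of that autodetect logic (untested, and handling only
the cases described above):

import codecs

def guess_decode(raw):
    # BOM-based detection first; a well-formed UTF-16 file starts with one
    if raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return raw.decode("utf_16")     # the utf_16 codec honours the BOM
    # No BOM: try ascii, then utf-8; the first that succeeds wins
    for enc in ("ascii", "utf_8"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            pass
    raise ValueError("could not guess the file encoding")

raw = open("c:\\pythonapps\\in-graf1.my", "rb").read()
text = guess_decode(raw)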

John Roth
 

Richard Schulman

Thanks for your excellent debugging suggestions, John. See below for
my follow-up:

Richard Schulman:
John Machin:
Insert
print type(in_line)
print repr(in_line)
here [also make the appropriate changes to get the same info from the
first line], run it again, copy/paste what you get, show us what you
see.

Here's the revised program, per your suggestion:

=====================================================

# This program processes a UTF-16 input file that is
# to be loaded later into a mySQL table. The input file
# is not yet ready for prime time. The purpose of this
# program is to ready it.

in_file = open("c:\\pythonapps\\in-graf1.my","rU")
try:
    # The first line read is a SQL INSERT statement; no
    # processing will be required.
    in_line = in_file.readline()
    print type(in_line) #For debugging
    print repr(in_line) #For debugging

    # The second line read is the first data row.
    in_line = in_file.readline()
    print type(in_line) #For debugging
    print repr(in_line) #For debugging

    # For this and subsequent rows, we must count all
    # the < ", > character-pairs in a given line/row.
    # This will provide an n-1 measure of the attributes
    # for a SQL insert of this row. All rows must have
    # sixteen attributes, but some don't yet.
    attribute_count = in_line.count('",')
    print attribute_count
finally:
    in_file.close()

=====================================================

The output of this program, which I ran at the command line, had to
be copied by hand and abridged, but I think I have included the
relevant information:

C:\pythonapps>python graf_correction.py
<type 'str'>
'\xff\xfeI\x00N\x00S... [the beginning of a SQL INSERT statement]
....\x00U\x00E\x00S\x00\n' [the VALUES keyword at the end of the row,
followed by an end-of-line]
<type 'str'>
'\x00\n' [oh-oh! For the second row, all we're seeing
is an end-of-line character. Is that from
the first row? Wasn't the "rU" mode
supposed to handle that]
0 [the counter value. It's hardly surprising
it's only zero, given that most of the row
never got loaded, just an eol mark]

J.M.:
If you're coy about that, then you'll have to find out yourself if it
has a BOM at the front, and if not whether it's little/big/endian.

The BOM is little-endian, I believe.
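
One quick way to confirm that, for what it's worth, is to peek at the
first two raw bytes of the file:

first_two = open("c:\\pythonapps\\in-graf1.my", "rb").read(2)
print repr(first_two)   # '\xff\xfe' means little-endian, '\xfe\xff' big-endian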

R.S.:
J.M.
1. Read the Unicode HOWTO.
2. Read the docs on the codecs module ...

You'll need to use

in_file = codecs.open(filepath, mode, encoding="utf16???????")

Right you are. Here is the output produced by so doing:

<type 'unicode'>
u'\ufeffINSERT INTO [...] VALUES\n'
<type 'unicode'>
u'\n'
0 [The counter value]
J.M.:
It would also be a good idea to get into the habit of using unicode
constants like u'",'

Right.

HTH,
John

Yes, it did. Many thanks! Now I've got to figure out the best way to
handle that \n\n at the end of each row, which the program is
interpreting as two rows. That represents two surprises: first, I
thought that Microsoft files ended as \n\r ; second, I thought that
Python mode "rU" was supposed to be the universal eol handler and
would handle the \n\r as one mark.

Richard Schulman
 

Richard Schulman

[T]he file I actually want to process is Unicode (utf-16 encoding).
...
in_file = open("c:\\pythonapps\\in-graf1.my","rU")
...

John Roth:
You're not detecting the file encoding and then
using it in the open statement. If you know this is
utf-16le or utf-16be, you need to say so in the
open. If you don't, then you should read it into
a string, go through some autodetect logic, and
then decode it with the <string>.decode(encoding)
method.

A clue: a properly formatted utf-16 or utf-32
file MUST have a BOM as the first character.
That's mandated in the unicode standard. If
it doesn't have a BOM, then try ascii and
utf-8 in that order. The first
one that succeeds is correct. If neither succeeds,
you're on your own in guessing the file encoding.

Thanks for this further information. I'm now using the codec with
improved results, but am still puzzled as to how to handle the row
termination of \n\n, which is being interpreted as two rows instead of
one.
 

Richard Schulman

...I'm now using the codec with
improved results, but am still puzzled as to how to handle the row
termination of \n\n, which is being interpreted as two rows instead of
one.

Of course, I could do a double read on each row and ignore the second
read, which merely fetches the final of the two u'\n' characters. But
that's not very elegant, and I'm sure there's a better way to do it
(hint, hint someone).

Richard Schulman (for email, drop the 'xx' in the reply-to)
 

John Machin

Richard Schulman wrote:
[big snip]
The BOM is little-endian, I believe.

Correct.


Right you are. Here is the output produced by so doing:

You don't say which encoding you used, but I guess that you used
utf_16_le.
<type 'unicode'>
u'\ufeffINSERT INTO [...] VALUES\n'

Use utf_16 -- it will strip off the BOM for you.
<type 'unicode'>
u'\n'
0 [The counter value]
[snip]
Yes, it did. Many thanks! Now I've got to figure out the best way to
handle that \n\n at the end of each row, which the program is
interpreting as two rows.

Well we don't know yet exactly what you have there. We need a byte dump
of the first few bytes of your file. Get into the interactive
interpreter and do this:

open('yourfile', 'rb').read(200)
(the 'b' is for binary, in case you are on Windows)
That will show us exactly what's there, without *any* EOL
interpretation at all.

That represents two surprises: first, I
thought that Microsoft files ended as \n\r ;

Nah. Wrong on two counts. In text mode, Microsoft *lines* end in \r\n
(not \n\r); *files* may end in ctrl-Z aka chr(26) -- an inheritance
from CP/M.

Ummmm ... are you saying the file has \n\r at the end of each row?? How
did you know that if you didn't know what if any BOM it had??? Who
created the file????
second, I thought that
Python mode "rU" was supposed to be the universal eol handler and
would handle the \n\r as one mark.

Nah again. It contemplates only \n, \r, and \r\n as end of line. See
the docs. Thus \n\r becomes *two* newlines when read with "rU".

Having "\n\r" at the end of each row does fit with your symptoms:

| >>> bom = u"\ufeff"
| >>> guff = '\n\r'.join(['abc', 'def', 'ghi'])
| >>> guffu = unicode(guff)
| >>> import codecs
| >>> f = codecs.open('guff.utf16le', 'wb', encoding='utf_16_le')
| >>> f.write(bom+guffu)
| >>> f.close()

| >>> open('guff.utf16le', 'rb').read() #### see exactly what we've got

| '\xff\xfea\x00b\x00c\x00\n\x00\r\x00d\x00e\x00f\x00\n\x00\r\x00g\x00h\x00i\x00'

| >>> codecs.open('guff.utf16le', 'r', encoding='utf_16').read()
| u'abc\n\rdef\n\rghi' ######### Look, Mom, no BOM!

| >>> codecs.open('guff.utf16le', 'rU', encoding='utf_16').read()
| u'abc\n\ndef\n\nghi' #### U means \r -> \n

| >>> codecs.open('guff.utf16le', 'rU', encoding='utf_16_le').read()
| u'\ufeffabc\n\ndef\n\nghi' ######### reproduces your second experience

| >>> open('guff.utf16le', 'rU').readlines()
| ['\xff\xfea\x00b\x00c\x00\n', '\x00\n', '\x00d\x00e\x00f\x00\n', '\x00\n', '\x00g\x00h\x00i\x00']
| >>> f = open('guff.utf16le', 'rU')
| >>> f.readline()
| '\xff\xfea\x00b\x00c\x00\n'
| >>> f.readline()
| '\x00\n' ######### reproduces your first experience
| >>> f.readline()
| '\x00d\x00e\x00f\x00\n'
| >>>

If that file is a one-off, you can obviously fix it by
throwing away every second line. Otherwise, if it's an ongoing
exercise, you need to talk sternly to the file's creator :)
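
If you do go the throwaway route, a sketch (assuming every second line
really is one of those phantom u'\n' rows):

import codecs

f = codecs.open('guff.utf16le', 'rU', encoding='utf_16')
lines = f.readlines()
f.close()
real_lines = lines[::2]    # keep lines 0, 2, 4, ...; drop the phantom blanks

A slightly more defensive variant is to drop any line that is nothing
but whitespace: [line for line in lines if line.strip()].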

HTH,
John
 

Richard Schulman

Many thanks for your help, John, in giving me the tools to work
successfully in Python with Unicode from here on out.

It turns out that the Unicode input files I was working with (from MS
Word and MS Notepad) were indeed creating eol sequences of \r\n, not
\n\n as I had originally thought. The file reading statement that I
was using, with unpredictable results, was

#in_file = codecs.open("c:\\pythonapps\\in-graf2.my","rU",encoding="utf-16LE")

This was reading to the \n on first read (outputting the whole line,
including the \n but, weirdly, not the preceding \r). Then, also
weirdly, the next readline would read the same \n again, interpreting
that as the entirety of a phantom second line. So each input file line
ended up producing two output lines.

Once the mode string "rU" was dropped, as in

in_file = codecs.open("c:\\pythonapps\\in-graf2.my",encoding="utf-16LE")

all suddenly became well: no more doubled readlines, and one could see
the \r\n termination of each line.
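
A minimal sketch of the resulting read loop (not my production code,
just the shape of it; the u'\r\n' has to be stripped explicitly, since
the codec no longer does any newline translation):

import codecs

in_file = codecs.open("c:\\pythonapps\\in-graf2.my", encoding="utf-16LE")
try:
    in_file.readline()                       # skip the SQL INSERT line
    for in_line in in_file:
        in_line = in_line.rstrip(u'\r\n')    # strip the CR LF terminator
        attribute_count = in_line.count(u'",')
        print attribute_count
finally:
    in_file.close()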

This behavior of "rU" was not at all what I had expected from the
brief discussion of it in _Python Cookbook_. Which all goes to point
out how difficult it is to cook challenging dishes with sketchy
recipes alone. There is no substitute for the helpful advice of an
experienced chef.

-Richard Schulman
(remove "xx" for email reply)

 

John Machin

Richard said:
It turns out that the Unicode input files I was working with (from MS
Word and MS Notepad) were indeed creating eol sequences of \r\n, not
\n\n as I had originally thought. The file reading statement that I
was using, with unpredictable results, was

#in_file = codecs.open("c:\\pythonapps\\in-graf2.my","rU",encoding="utf-16LE")

This was reading to the \n on first read (outputting the whole line,
including the \n but, weirdly, not the preceding \r). Then, also
weirdly, the next readline would read the same \n again, interpreting
that as the entirety of a phantom second line. So each input file line
ended up producing two output lines.

Once the mode string "rU" was dropped, as in

in_file = codecs.open("c:\\pythonapps\\in-graf2.my",encoding="utf-16LE")

all suddenly became well: no more doubled readlines, and one could see
the \r\n termination of each line.

You are on Windows. I would *not* describe it as "well" when lines read
in (the default) text mode still end in u"\r\n". I would expect it to
convert the line endings to u"\n". At best, this should be documented.
Perhaps someone with some knowledge of the intended treatment of line
endings by codecs.open() in text mode could comment? The two problems
are succinctly described below:

File created in Windows Notepad and saved with "Unicode" encoding.
Results in UTF-16LE encoding, line terminator is CR LF, has BOM (LE) at
front -- as shown below.

| Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
| Type "help", "copyright", "credits" or "license" for more information.
| >>> open('notepad_uc.txt', 'rb').read()
| '\xff\xfea\x00b\x00c\x00\r\x00\n\x00d\x00e\x00f\x00\r\x00\n\x00g\x00h\x00i\x00\r\x00\n\x00'
| >>> import codecs
| >>> codecs.open('notepad_uc.txt', 'r', encoding='utf_16_le').readlines()
| [u'\ufeffabc\r\n', u'def\r\n', u'ghi\r\n']
| >>> codecs.open('notepad_uc.txt', 'r', encoding='utf_16').readlines()
| [u'abc\r\n', u'def\r\n', u'ghi\r\n']
### presence of u'\r' was *not* expected
| >>> codecs.open('notepad_uc.txt', 'rU', encoding='utf_16_le').readlines()
| [u'\ufeffabc\n', u'\n', u'def\n', u'\n', u'ghi\n', u'\n']
| >>> codecs.open('notepad_uc.txt', 'rU', encoding='utf_16').readlines()
| [u'abc\n', u'\n', u'def\n', u'\n', u'ghi\n', u'\n']
### 'U' flag does change the behaviour, but *not* as expected.
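
One way to sidestep the whole question (a sketch, not a claim about how
codecs.open() is supposed to behave) is to read the raw bytes, decode,
and let unicode.splitlines() deal with the CR LF:

raw = open('notepad_uc.txt', 'rb').read()
text = raw.decode('utf_16')     # BOM stripped, byte order handled
lines = text.splitlines()       # -> [u'abc', u'def', u'ghi']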

Cheers,
John
 
