read(1) returns string of length 2

  • Thread starter wolfgang haefelinger

wolfgang haefelinger

Greetings,

I'm trying to read (japanese) chars from a file. While doing so
I encounter that a char with length 2 is returned. Is this to be
expected or is there something wrong?

Basically, this is what I'm doing:

import codecs
f = codecs.open("ident.in",'rb','Shift-JIS') ## Japanese codecs installed

c = f.read(1)
while c:
    if len(c)==1:
        print hex(ord(c)),
    else:
        print "{",
        for x in c: print hex(ord(x)),
        print "}",
    c = f.read(1)

This is my input (file is also attached):

$ od -tx1 ident.in
0000000 8d 87 8c 76 8e 9e 8a d4 3b 0d 0a
0000013

This is what I'm getting:

$ python ident.py ## python 2.3.4 on Windows
0x5408 0x8a08 0x6642 0x9593 { 0x3b 0xd } 0xa

"Python" believes that there are 6 chars on the stream while there are
actually 7 chars.
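
As a cross-check of the character count, the dump above can be decoded in memory without any stream reader. This is only a sketch in Python 3 syntax and assumes the interpreter's built-in shift_jis codec, which is not the Python 2.3.4 plus JapaneseCodecs setup described in this post:

```python
# The 11 bytes from the od dump above.
raw = bytes.fromhex("8d 87 8c 76 8e 9e 8a d4 3b 0d 0a")

# Decode the whole buffer at once; no stream reader is involved.
text = raw.decode("shift_jis")

code_points = [hex(ord(ch)) for ch in text]
print(len(raw), "bytes ->", len(text), "characters")
print(" ".join(code_points))
```

Four two-byte characters plus three single-byte ones account for the 11 bytes, so 7 characters is the count a reader should deliver.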

My naive assumption was that f.read(1) returns always a char of length 1 (or
zero).

Remark:
The input is believed to be "SJIS", but I haven't found a Python codec by
that name, so I'm using Shift-JIS. Of course, this could be the problem.
Note that Java, when fed my input and told to decode it as SJIS, spits out
the "correct" chars:

c=21512 c=35336 c=26178 c=38291 c=59 c=13 c=10 : 7 char(s)

References:
I downloaded Japanese codecs from here (version: 1.4.10)
http://www.asahi-net.or.jp/~rd6t-kjym/python/

Thanks for any hints,
Wolfgang.
 
Skip Montanaro

wolfgang> I'm trying to read (japanese) chars from a file. While doing
wolfgang> so I encounter that a char with length 2 is returned. Is this
wolfgang> to be expected or is there something wrong?

I believe it's to be expected. You opened the file with codecs.open(), so
your basic unit of operation will be a Unicode character, not a byte.

wolfgang> My naive assumption was that f.read(1) returns always a char
wolfgang> of length 1 (or zero).

If you simply used the builtin open() to open the file that would be true.

Skip
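
The byte-versus-character distinction can be sketched as follows. This is an illustration in Python 3 syntax, where the built-in open() accepts an encoding argument (an assumption that differs from the Python 2.3 setup in this thread); the temporary file is a hypothetical stand-in for ident.in, and read_units is a helper name made up for this example:

```python
import os
import tempfile

raw = bytes.fromhex("8d 87 8c 76 8e 9e 8a d4 3b 0d 0a")

# Write the sample bytes to a scratch file (a stand-in for ident.in).
with tempfile.NamedTemporaryFile(delete=False, suffix=".in") as tmp:
    tmp.write(raw)
    path = tmp.name

def read_units(f):
    """Call f.read(1) until EOF and collect what each call returned."""
    units = []
    unit = f.read(1)
    while unit:
        units.append(unit)
        unit = f.read(1)
    return units

# Binary mode: every read(1) returns one byte.
with open(path, "rb") as f:
    byte_units = read_units(f)

# Text mode: every read(1) returns one decoded character.
# newline="" keeps "\r" and "\n" as separate characters.
with open(path, "r", encoding="shift_jis", newline="") as f:
    char_units = read_units(f)

os.remove(path)

print(len(byte_units), "byte reads")
print(len(char_units), "character reads")
```

In binary mode the loop runs once per byte; in text mode it runs once per character, which is the unit of operation described above.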
 
wolfgang haefelinger

Hey Skip,

That's exactly the point. What I'm expecting to be returned is
a unicode string of length 1, i.e. something I'm calling a unicode
character.

Note that I do not count the number of bytes at all.

Btw, you can see that the first unicode string returned
by f.read(1) is

0x5408 (21512)

The length of this unicode string is 1, i.e. we got a char (but
we need 2 bytes to represent it).

Actually, everything is fine until the codecs reader is about
to read '3b'. Instead of delivering this as next unicode char,
I'm getting '3b' and '0d' as string of length 2.

Anyway, my question can also be written like this:

f = codecs.open(...)
c = f.read(1)
if c:
    assert len(c)==1

I was thinking that this assertion should hold in
general.

Cheers,
Wolfgang.
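
One way to get the "exactly one character per step" guarantee regardless of how a stream reader interprets its size argument is to drive an incremental decoder by hand. The following is a minimal sketch in Python 3 syntax, not the behaviour of any particular codecs release; iter_chars is a hypothetical helper name, and the BytesIO object stands in for the file:

```python
import codecs
import io

def iter_chars(stream, encoding):
    """Yield decoded characters one at a time from a binary stream."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = stream.read(1)          # one byte; b"" means end of file
        if not chunk:
            break
        # decode() returns "" while a multi-byte sequence is still incomplete.
        for ch in decoder.decode(chunk):
            yield ch
    # final=True makes the decoder raise on a truncated trailing sequence.
    for ch in decoder.decode(b"", final=True):
        yield ch

sample = io.BytesIO(bytes.fromhex("8d 87 8c 76 8e 9e 8a d4 3b 0d 0a"))
chars = list(iter_chars(sample, "shift_jis"))
assert all(len(c) == 1 for c in chars)
print([hex(ord(c)) for c in chars])
```

Feeding single bytes is slow for large files, but it makes the unit of iteration explicit: the caller only ever sees whole characters.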
 
George Yoshida

wolfgang said:
Actually, everything is fine until the codecs reader is about
to read '3b'. Instead of delivering this as next unicode char,
I'm getting '3b' and '0d' as string of length 2.

I tried this out with Python 2.3 and 2.4 and noticed that they
handle input streams differently.
With 2.4 I get the same result as Java:

0x5408 0x8a08 0x6642 0x9593 0x3b 0xd 0xa

(There are no {} marks.)

This makes me wonder where the difference comes from.
Is this a bug in 2.3 or a new feature in 2.4?

-- george
 
Bengt Richter

wolfgang> My naive assumption was that f.read(1) returns always a char
wolfgang> of length 1 (or zero).

On my 2.4b1 it does, see below.

I added a print line and dropped the ending commas on your print chunks,
but otherwise didn't (I think ;-) change your code:

Python 2.4b1 (#56, Nov 3 2004, 01:47:27)
[GCC 3.2.3 (mingw special 20030504-1)] on win32
Type "help", "copyright", "credits" or "license" for more information.
...     print repr(c), len(c), '=>',
...     if len(c)==1:
...         print hex(ord(c))
...     else:
...         print "{",
...         for x in c: print hex(ord(x)),
...         print "}"
...     c = f.read(1)
...
u'\u5408' 1 => 0x5408
u'\u8a08' 1 => 0x8a08
u'\u6642' 1 => 0x6642
u'\u9593' 1 => 0x9593
u';' 1 => 0x3b
u'\r' 1 => 0xd
u'\n' 1 => 0xa

I reproduced your binary file:
...
8d 87 8c 76 8e 9e 8a d4 3b 0d 0a

What version/platform are you using? Perhaps you can upgrade?

Regards,
Bengt Richter
 
wolfgang haefelinger

Hi,

It works fine for me with 2.4c1! I don't even need to install the
Japanese codecs now, as they are already included. It's a shame that
this is not mentioned.
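
Whether an interpreter already ships a codec under a given label can be checked with codecs.lookup(). The sketch below is in Python 3 syntax; "Shift-JIS" is the spelling used in this thread, while the other labels in the list are assumptions about commonly seen names for the same encoding:

```python
import codecs

# Spellings to probe; only "Shift-JIS" comes from the thread itself.
candidates = ["Shift-JIS", "shift_jis", "sjis", "s_jis"]

resolved = {}
for label in candidates:
    try:
        resolved[label] = codecs.lookup(label).name
    except LookupError:
        resolved[label] = None   # no codec registered under that label

for label, name in resolved.items():
    print(label, "->", name)
```

A None entry would mean that label still needs a separately installed codec package on that interpreter.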

I believe it's a bug, but perhaps one in the installed
Japanese Codecs.

Thanks to all who provided feedback,
Wolfgang.
 
