Some questions about decode/encode

glacier

I use Chinese characters as an example here.

My first question is: what strategy does 'decode' use to tell the way
to separate the words? I mean, since s1 is a multi-byte-char string,
how does it determine whether to separate the string every 2 bytes or 1 byte?


My second question is: has anyone tested very long MBCS decoding? I tried
to decode a long (20+ MB) XML file yesterday, which turned out very strange
and caused SAX to fail to parse the decoded string. However, when I used
another text editor to convert the file to UTF-8, SAX parsed the content
successfully.

I'm not sure whether some special byte sequence or the sheer length of the
text caused this problem. Or maybe that's a bug in Python 2.5?
 
Ben Finney

glacier said:
I use Chinese characters as an example here.


My first question is: what strategy does 'decode' use to tell the way
to separate the words? I mean, since s1 is a multi-byte-char string,
how does it determine whether to separate the string every 2 bytes or 1 byte?

The codec you specified ("GBK") is, like any character-encoding codec,
a precise mapping between characters and bytes. It's almost certainly
not aware of "words", only character-to-byte mappings.
 
Ben Finney

Ben Finney said:
The codec you specified ("GBK") is, like any character-encoding
codec, a precise mapping between characters and bytes. It's almost
certainly not aware of "words", only character-to-byte mappings.

To be clear, I should point out that I didn't mean to imply static
tabular mappings only. The mappings in a character encoding are often
more complex and algorithmic.

That doesn't make them any less precise, of course; and the core point
is that a character-mapping codec is *only* about getting between
characters and bytes, nothing else.
 
bbtestingbb

I use Chinese characters as an example here.


My first question is: what strategy does 'decode' use to tell the way
to separate the words.

decode() uses the GBK strategy you specified to determine what
constitutes a character in your string.
My second question is: has anyone tested very long MBCS decoding? I tried
to decode a long (20+ MB) XML file yesterday, which turned out very strange
and caused SAX to fail to parse the decoded string. However, when I used
another text editor to convert the file to UTF-8, SAX parsed the content
successfully.

I'm not sure whether some special byte sequence or the sheer length of the
text caused this problem. Or maybe that's a bug in Python 2.5?

That's probably too vague a description to determine why SAX isn't
doing what you expect it to.
 
glacier

To be clear, I should point out that I didn't mean to imply static
tabular mappings only. The mappings in a character encoding are often
more complex and algorithmic.

That doesn't make them any less precise, of course; and the core point
is that a character-mapping codec is *only* about getting between
characters and bytes, nothing else.

--
\ "He who laughs last, thinks slowest." -- Anonymous |
`\ |
_o__) |
Ben Finney

Thanks for your response :)

When I mentioned 'word' in the previous post, I meant character.
According to your reply, what will happen if I try to decode a long
string piece by piece?
I mean:
######################################
a = '你好吗' * 100000
s1 = u''
cur = 0
while cur < len(a):
    d = min(len(a) - cur, 1023)
    s1 += a[cur:cur+d].decode('mbcs')
    cur += d
######################################

May the code above produce any bogus characters in s1?


Thanks :)
 
glacier

decode() uses the GBK strategy you specified to determine what
constitutes a character in your string.



That's probably too vague a description to determine why SAX isn't
doing what you expect it to.

Do you mean I should post a copy of the XML document?
 
Gabriel Genellina

According to your reply, what will happen if I try to decode a long
string piece by piece?
I mean:
######################################
a = '你好吗' * 100000
s1 = u''
cur = 0
while cur < len(a):
    d = min(len(a) - cur, 1023)
    s1 += a[cur:cur+d].decode('mbcs')
    cur += d
######################################

May the code above produce any bogus characters in s1?

Don't do that. You might be splitting the input string at a point that is
not a character boundary. You won't get bogus output, decode will raise a
UnicodeDecodeError instead.
You can control how errors are handled, see
http://docs.python.org/lib/string-methods.html#l2h-237
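To make both behaviours concrete, here is a small sketch (in Python 3 bytes/str naming, with GBK standing in for the Windows-only 'mbcs' codec): a slice that lands mid-character raises UnicodeDecodeError, while an incremental decoder buffers the dangling lead byte between chunks, so arbitrary split points become safe.

```python
import codecs

data = '你好吗'.encode('gbk') * 5      # a well-formed GBK byte string

# Slicing at byte 3 lands in the middle of the second character:
try:
    data[:3].decode('gbk')
except UnicodeDecodeError as exc:
    print('split failed:', exc.reason)

# An incremental decoder keeps a trailing partial character in its
# internal buffer, so chunk boundaries need not align with characters:
dec = codecs.getincrementaldecoder('gbk')()
out = ''.join(dec.decode(data[i:i + 3]) for i in range(0, len(data), 3))
out += dec.decode(b'', final=True)    # flush; raises if bytes are left over
assert out == '你好吗' * 5
```

`codecs.iterdecode` wraps the same incremental machinery for iterables of chunks.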
 
Marc 'BlackJack' Rintsch

My second question is: has anyone tested very long MBCS decoding? I tried
to decode a long (20+ MB) XML file yesterday, which turned out very strange
and caused SAX to fail to parse the decoded string.

That's because SAX wants bytes, not a decoded string. Don't decode it
yourself.
However, when I used another text editor to convert the file to UTF-8,
SAX parsed the content successfully.

Because now you feed SAX with bytes instead of a unicode string.

Ciao,
Marc 'BlackJack' Rintsch
 
John Machin

I use Chinese characters as an example here.


My first question is: what strategy does 'decode' use to tell the way
to separate the words? I mean, since s1 is a multi-byte-char string,
how does it determine whether to separate the string every 2 bytes or 1 byte?

The usual strategy for encodings like GBK is:
1. If the current byte is less than 0x80, then it's a 1-byte
character.
2. Current byte 0x81 to 0xFE inclusive: current byte and the next byte
make up a two-byte character.
3. Current byte 0x80: undefined (or used e.g. in cp936 for the 1-byte
euro character)
4. Current byte 0xFF: undefined
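The lead-byte rules above can be sketched as a small scanner over a Python 3 bytes object (treating 0x80 and 0xFF as errors here, even though cp936 assigns 0x80 to the euro sign):

```python
def gbk_char_lengths(data):
    """Return the byte length of each character in a GBK byte string,
    following the lead-byte rules described above."""
    lengths = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                # rule 1: 1-byte (ASCII-range) character
            lengths.append(1)
            i += 1
        elif 0x81 <= b <= 0xFE:     # rule 2: lead byte of a 2-byte character
            lengths.append(2)
            i += 2
        else:                       # rules 3 and 4: 0x80 / 0xFF
            raise ValueError('invalid GBK lead byte 0x%02X at offset %d' % (b, i))
    return lengths

print(gbk_char_lengths('A你好B'.encode('gbk')))   # [1, 2, 2, 1]
```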

Cheers,
John
 
7stud

That's because SAX wants bytes, not a decoded string.  Don't decode it
yourself.

encode() converts a unicode string to a regular string. decode() converts
a regular string to a unicode string. So I think what Marc is saying is
that SAX needs a regular string (i.e. bytes), not a decoded string (i.e.
a unicode string).
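In Python 3 the "regular string" type is spelled bytes, which makes the direction of each call easier to see; a tiny round-trip sketch:

```python
raw = b'\xc4\xe3\xba\xc3'     # GBK bytes for the two characters 你好
text = raw.decode('gbk')      # bytes -> unicode text: "decode"
back = text.encode('gbk')     # unicode text -> bytes: "encode"
assert text == '你好'
assert back == raw
```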
 
glacier

According to your reply, what will happen if I try to decode a long
string piece by piece?
I mean:
######################################
a = '你好吗' * 100000
s1 = u''
cur = 0
while cur < len(a):
    d = min(len(a) - cur, 1023)
    s1 += a[cur:cur+d].decode('mbcs')
    cur += d
######################################
May the code above produce any bogus characters in s1?

Don't do that. You might be splitting the input string at a point that is
not a character boundary. You won't get bogus output, decode will raise a
UnicodeDecodeError instead.
You can control how errors are handled, see http://docs.python.org/lib/string-methods.html#l2h-237

Thanks Gabriel,

I guess I understand what will happen if I don't split the string at a
character boundary.
What I'm not sure about is whether the decode method itself can mis-split
a boundary. Can you tell me then?

Thanks a lot.
 
glacier

That's because SAX wants bytes, not a decoded string. Don't decode it
yourself.


Because now you feed SAX with bytes instead of a unicode string.

Ciao,
Marc 'BlackJack' Rintsch

Yep. I fed SAX the unicode string, since SAX doesn't support my
encoding (GBK).

Is there any way to solve this better?
I mean, if I shouldn't convert the GBK string to a unicode string, what
should I do to make SAX work?

Thanks , Marc.
:)
 
glacier

The usual strategy for encodings like GBK is:
1. If the current byte is less than 0x80, then it's a 1-byte
character.
2. Current byte 0x81 to 0xFE inclusive: current byte and the next byte
make up a two-byte character.
3. Current byte 0x80: undefined (or used e.g. in cp936 for the 1-byte
euro character)
4: Current byte 0xFF: undefined

Cheers,
John

Thanks John, I will try to write a function to test whether the strategy
above caused the problem I described in the first post :)
 
Marc 'BlackJack' Rintsch

Yep. I fed SAX the unicode string, since SAX doesn't support my
encoding (GBK).

If the `decode()` method supports it, IMHO SAX should too.
Is there any way to solve this better?
I mean if I shouldn't convert the GBK string to unicode string, what
should I do to make SAX work?

Decode it and then encode it to utf-8 before feeding it to the parser.

Ciao,
Marc 'BlackJack' Rintsch
 
John Machin

Yep. I fed SAX the unicode string, since SAX doesn't support my
encoding (GBK).

Let's go back to the beginning. What is "SAX"? Show us exactly what
command or code you used.

How did you let this SAX know that the file was encoded in GBK? An
argument to SAX? An encoding declaration in the first few lines of the
file? Some other method? ... precise answer please. Or did you expect
that this SAX would guess correctly what the encoding was without
being told?

What does "didn't support my encoding system" mean? Have you actually
tried pushing raw undecoded GBK at SAX using a suitable documented
method of telling SAX that the file is in fact encoded in GBK? If so,
what was the error message that you got?

How do you know that it's GBK, anyway? Have you considered these
possible scenarios:
(1) It's GBK but you are telling SAX that it's GB2312
(2) It's GB18030 but you are telling SAX it's GBK

HTH,
John
 
John Machin

On Thu, 24 Jan 2008 04:52:22 -0200, glacier <[email protected]> wrote:
According to your reply, what will happen if I try to decode a long
string piece by piece?
I mean:
######################################
a = '你好吗' * 100000
s1 = u''
cur = 0
while cur < len(a):
    d = min(len(a) - cur, 1023)
    s1 += a[cur:cur+d].decode('mbcs')
    cur += d
######################################
May the code above produce any bogus characters in s1?
Don't do that. You might be splitting the input string at a point that is
not a character boundary. You won't get bogus output, decode will raise a
UnicodeDecodeError instead.
You can control how errors are handled, see http://docs.python.org/lib/string-methods.html#l2h-237

Thanks Gabriel,

I guess I understand what will happen if I don't split the string at a
character boundary.
What I'm not sure about is whether the decode method itself can mis-split
a boundary. Can you tell me then?

Thanks a lot.

*IF* the file is well-formed GBK, then the codec will not mess up when
decoding it to Unicode. The usual cause of mess is a combination of a
human and a text editor :)
 
glacier

Let's go back to the beginning. What is "SAX"? Show us exactly what
command or code you used.
SAX is the package 'xml.sax' distributed with Python 2.5 :)
1. I read the text from a GBK-encoded XML file, then skip the first line,
which declares the encoding.
2. I convert the string to unicode by calling decode('mbcs').
3. I use xml.sax.parseString to parse the string.

########################################################################
f = file('e:/temp/456.xml', 'rb')
s = f.read()
f.close()
n = 0
for i in xrange(len(s)):
    if s[i] == '\n':
        n += 1
    if n == 1:
        s = s[i+1:]
        break
s = '<root>' + s + '</root>'
s = s.decode('mbcs')
xml.sax.parseString(s, handler, handler)
########################################################################

How did you let this SAX know that the file was encoded in GBK? An
argument to SAX? An encoding declaration in the first few lines of the
file? Some other method? ... precise answer please. Or did you expect
that this SAX would guess correctly what the encoding was without
being told?
I didn't tell SAX that the file is encoded in GBK, since I used the
'parseString' method.
What does "didn't support my encoding system" mean? Have you actually
tried pushing raw undecoded GBK at SAX using a suitable documented
method of telling SAX that the file is in fact encoded in GBK? If so,
what was the error message that you got?
I mean SAX only supports a limited number of encodings, such as UTF-8,
UTF-16, etc., which don't include GBK.
How do you know that it's GBK, anyway? Have you considered these
possible scenarios:
(1) It's GBK but you are telling SAX that it's GB2312
(2) It's GB18030 but you are telling SAX it's GBK
Frankly speaking, I can't tell whether the file contains any GB18030
characters... ^______^
 
glacier

On Thu, 24 Jan 2008 04:52:22 -0200, glacier <[email protected]> wrote:
According to your reply, what will happen if I try to decode a long
string piece by piece?
I mean:
######################################
a = '你好吗' * 100000
s1 = u''
cur = 0
while cur < len(a):
    d = min(len(a) - cur, 1023)
    s1 += a[cur:cur+d].decode('mbcs')
    cur += d
######################################
May the code above produce any bogus characters in s1?
Don't do that. You might be splitting the input string at a point that is
not a character boundary. You won't get bogus output, decode will raise a
UnicodeDecodeError instead.
You can control how errors are handled, see http://docs.python.org/lib/string-methods.html#l2h-237
Thanks Gabriel,
I guess I understand what will happen if I don't split the string at a
character boundary.
What I'm not sure about is whether the decode method itself can mis-split
a boundary. Can you tell me then?
Thanks a lot.

*IF* the file is well-formed GBK, then the codec will not mess up when
decoding it to Unicode. The usual cause of mess is a combination of a
human and a text editor :)

I guess the first thing I should do is check whether the file I used for
testing is well-formed GBK :)
 
Martin v. Löwis

Is there any way to solve this better?
Decode it and then encode it to utf-8 before feeding it to the parser.

The tricky part is that you also need to change the encoding declaration
in doing so, but in this case, it should be fairly simple:

unicode_doc = original_doc.decode("gbk")
unicode_doc = unicode_doc.replace('gbk','utf-8',1)
utf8_doc = unicode_doc.encode("utf-8")

This assumes that the string "gbk" occurs in the encoding declaration
as

<?xml version="1.0" encoding="gbk"?>

If the encoding name has a different spelling (e.g. GBK), you need to
cater for that as well. You might want to try replacing the entire
XML declaration (i.e. everything between <? and ?>), or just the
encoding= parameter. Notice that the encoding declaration may include
' instead of ", and may have additional spaces, e.g.

<?xml version = '1.0'
encoding= 'gbK' ?>
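One way to cater for those variants (a hypothetical helper, not part of any library) is a case-insensitive regex that tolerates either quote style and stray whitespace; a Python 3 sketch:

```python
import re

# Matches the encoding= pseudo-attribute inside the XML declaration,
# whichever quote character and spacing it uses.
decl_re = re.compile(r'''(<\?xml[^?]*encoding\s*=\s*)(['"])[^'"]+\2''',
                     re.IGNORECASE)

def to_utf8(original_doc, source_encoding='gbk'):
    """Decode a document and rewrite its declared encoding to utf-8."""
    unicode_doc = original_doc.decode(source_encoding)
    unicode_doc = decl_re.sub(r'\1\2utf-8\2', unicode_doc, count=1)
    return unicode_doc.encode('utf-8')

# Works on the awkwardly spelled declaration from the example above:
doc = "<?xml version = '1.0'\nencoding= 'gbK' ?><root>你好</root>".encode('gbk')
print(to_utf8(doc).decode('utf-8'))
```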

HTH,
Martin
 
Mark Tolonen

*IF* the file is well-formed GBK, then the codec will not mess up when
decoding it to Unicode. The usual cause of mess is a combination of a
human and a text editor :)

SAX uses the expat parser. From the pyexpat module docs:

Expat doesn't support as many encodings as Python does, and its repertoire
of encodings can't be extended; it supports UTF-8, UTF-16, ISO-8859-1
(Latin1), and ASCII. If encoding is given it will override the implicit or
explicit encoding of the document.
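So the workaround amounts to what Marc and Martin described: decode the GBK bytes, patch the declaration, and re-encode to one of expat's supported encodings before parsing. A minimal Python 3 sketch:

```python
import xml.sax
from xml.sax.handler import ContentHandler

class TextCollector(ContentHandler):
    """Collects all character data seen by the parser."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.chunks = []
    def characters(self, content):
        self.chunks.append(content)

gbk_doc = '<?xml version="1.0" encoding="gbk"?><root>你好吗</root>'.encode('gbk')

# expat cannot read GBK, but it can read UTF-8 -- so re-encode first,
# fixing the declaration so it no longer lies about the encoding:
utf8_doc = gbk_doc.decode('gbk').replace('gbk', 'utf-8', 1).encode('utf-8')

handler = TextCollector()
xml.sax.parseString(utf8_doc, handler)
print(''.join(handler.chunks))   # 你好吗
```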

--Mark
 
