Some questions about decode/encode

glacier

I use Chinese characters as an example here.

My first question is: what strategy does 'decode' use to tell the way
to separate the words? I mean, since s1 is a multi-byte-char string,
how does it determine whether to separate the string every 2 bytes or 1 byte?


My second question is: has anyone tested very long MBCS decoding? I tried
to decode a long (20+ MB) XML file yesterday, which turned out very strange
and caused SAX to fail to parse the decoded string. However, when I used
another text editor to convert the file to UTF-8, SAX parsed the content
successfully.

I'm not sure whether some special byte sequence or the sheer length of the
text caused this problem. Or maybe that's a bug in Python 2.5?
 
Ben Finney

glacier said:
I use Chinese characters as an example here.


My first question is: what strategy does 'decode' use to tell the way
to separate the words? I mean, since s1 is a multi-byte-char string,
how does it determine whether to separate the string every 2 bytes or 1 byte?

The codec you specified ("GBK") is, like any character-encoding codec,
a precise mapping between characters and bytes. It's almost certainly
not aware of "words", only character-to-byte mappings.
 
Ben Finney

Ben Finney said:
The codec you specified ("GBK") is, like any character-encoding
codec, a precise mapping between characters and bytes. It's almost
certainly not aware of "words", only character-to-byte mappings.

To be clear, I should point out that I didn't mean to imply static
tabular mappings only. The mappings in a character encoding are often
more complex and algorithmic.

That doesn't make them any less precise, of course; and the core point
is that a character-mapping codec is *only* about getting between
characters and bytes, nothing else.
 
bbtestingbb

I use Chinese characters as an example here.


My first question is: what strategy does 'decode' use to tell the way
to separate the words.

decode() uses the GBK strategy you specified to determine what
constitutes a character in your string.
My second question is: has anyone tested very long MBCS decoding? I tried
to decode a long (20+ MB) XML file yesterday, which turned out very strange
and caused SAX to fail to parse the decoded string. However, when I used
another text editor to convert the file to UTF-8, SAX parsed the content
successfully.

I'm not sure whether some special byte sequence or the sheer length of the
text caused this problem. Or maybe that's a bug in Python 2.5?

That's probably too vague a description to determine why SAX isn't
doing what you expect it to.
 
glacier

To be clear, I should point out that I didn't mean to imply static
tabular mappings only. The mappings in a character encoding are often
more complex and algorithmic.

That doesn't make them any less precise, of course; and the core point
is that a character-mapping codec is *only* about getting between
characters and bytes, nothing else.

--
\ "He who laughs last, thinks slowest." -- Anonymous |
`\ |
_o__) |
Ben Finney

Thanks for your response :)

When I mentioned 'word' in the previous post, I meant character.
According to your reply, what will happen if I try to decode a long
string piece by piece?
I mean:
######################################
a = '你好吗' * 100000
s1 = u''
cur = 0
while cur < len(a):
    d = min(len(a) - cur, 1023)
    s1 += a[cur:cur+d].decode('mbcs')
    cur += d
######################################

May the code above produce any bogus characters in s1?


Thanks :)
 
glacier

decode() uses the GBK strategy you specified to determine what
constitutes a character in your string.



That's probably too vague a description to determine why SAX isn't
doing what you expect it to.

Do you mean I should post a copy of the XML document?
 
Gabriel Genellina

According to your reply, what will happen if I try to decode a long
string piece by piece?
I mean:
######################################
a = '你好吗' * 100000
s1 = u''
cur = 0
while cur < len(a):
    d = min(len(a) - cur, 1023)
    s1 += a[cur:cur+d].decode('mbcs')
    cur += d
######################################

May the code above produce any bogus characters in s1?

Don't do that. You might be splitting the input string at a point that is
not a character boundary. You won't get bogus output, decode will raise a
UnicodeDecodeError instead.
You can control how errors are handled, see
http://docs.python.org/lib/string-methods.html#l2h-237
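To make both behaviours concrete, here is a small sketch (in Python 3 bytes/str naming, with GBK standing in for the Windows-only 'mbcs' codec): a slice that lands mid-character raises UnicodeDecodeError, while an incremental decoder buffers the dangling lead byte between chunks, so arbitrary split points become safe.

```python
import codecs

data = '你好吗'.encode('gbk') * 5      # a well-formed GBK byte string

# Slicing at byte 3 lands in the middle of the second character:
try:
    data[:3].decode('gbk')
except UnicodeDecodeError as exc:
    print('split failed:', exc.reason)

# An incremental decoder keeps a trailing partial character in its
# internal buffer, so chunk boundaries need not align with characters:
dec = codecs.getincrementaldecoder('gbk')()
out = ''.join(dec.decode(data[i:i + 3]) for i in range(0, len(data), 3))
out += dec.decode(b'', final=True)    # flush; raises if bytes are left over
assert out == '你好吗' * 5
```

`codecs.iterdecode` wraps the same incremental machinery for iterables of chunks.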
 
Marc 'BlackJack' Rintsch

My second question is: has anyone tested very long MBCS decoding? I tried
to decode a long (20+ MB) XML file yesterday, which turned out very strange
and caused SAX to fail to parse the decoded string.

That's because SAX wants bytes, not a decoded string. Don't decode it
yourself.
However, when I used another text editor to convert the file to UTF-8,
SAX parsed the content successfully.

Because now you feed SAX with bytes instead of a unicode string.

Ciao,
Marc 'BlackJack' Rintsch
 
John Machin

I use Chinese characters as an example here.


My first question is: what strategy does 'decode' use to tell the way
to separate the words? I mean, since s1 is a multi-byte-char string,
how does it determine whether to separate the string every 2 bytes or 1 byte?

The usual strategy for encodings like GBK is:
1. If the current byte is less than 0x80, then it's a 1-byte
character.
2. Current byte 0x81 to 0xFE inclusive: current byte and the next byte
make up a two-byte character.
3. Current byte 0x80: undefined (or used e.g. in cp936 for the 1-byte
euro character)
4. Current byte 0xFF: undefined
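The lead-byte rules above can be sketched as a small scanner over a Python 3 bytes object (treating 0x80 and 0xFF as errors here, even though cp936 assigns 0x80 to the euro sign):

```python
def gbk_char_lengths(data):
    """Return the byte length of each character in a GBK byte string,
    following the lead-byte rules described above."""
    lengths = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                # rule 1: 1-byte (ASCII-range) character
            lengths.append(1)
            i += 1
        elif 0x81 <= b <= 0xFE:     # rule 2: lead byte of a 2-byte character
            lengths.append(2)
            i += 2
        else:                       # rules 3 and 4: 0x80 / 0xFF
            raise ValueError('invalid GBK lead byte 0x%02X at offset %d' % (b, i))
    return lengths

print(gbk_char_lengths('A你好B'.encode('gbk')))   # [1, 2, 2, 1]
```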

Cheers,
John
 
7stud

That's because SAX wants bytes, not a decoded string.  Don't decode it
yourself.

encode() converts a unicode string to a regular string. decode() converts
a regular string to a unicode string. So I think what Marc is saying is
that SAX needs a regular string (i.e. bytes), not a decoded string (i.e.
a unicode string).
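In Python 3 the "regular string" type is spelled bytes, which makes the direction of each call easier to see; a tiny round-trip sketch:

```python
raw = b'\xc4\xe3\xba\xc3'     # GBK bytes for the two characters 你好
text = raw.decode('gbk')      # bytes -> unicode text: "decode"
back = text.encode('gbk')     # unicode text -> bytes: "encode"
assert text == '你好'
assert back == raw
```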
 
glacier

According to your reply, what will happen if I try to decode a long
string piece by piece?
I mean:
######################################
a = '你好吗' * 100000
s1 = u''
cur = 0
while cur < len(a):
    d = min(len(a) - cur, 1023)
    s1 += a[cur:cur+d].decode('mbcs')
    cur += d
######################################
May the code above produce any bogus characters in s1?

Don't do that. You might be splitting the input string at a point that is
not a character boundary. You won't get bogus output, decode will raise a
UnicodeDecodeError instead.
You can control how errors are handled, see http://docs.python.org/lib/string-methods.html#l2h-237

Thanks Gabriel,

I guess I understand what will happen if I don't split the string at a
character boundary.
What I'm not sure about is whether the decode method itself can mis-split
a boundary. Can you tell me then?

Thanks a lot.
 
glacier

That's because SAX wants bytes, not a decoded string. Don't decode it
yourself.


Because now you feed SAX with bytes instead of a unicode string.

Ciao,
Marc 'BlackJack' Rintsch

Yep. I fed SAX the unicode string, since SAX doesn't support my
encoding (GBK).

Is there any way to solve this better?
I mean, if I shouldn't convert the GBK string to a unicode string, what
should I do to make SAX work?

Thanks , Marc.
:)
 
glacier

The usual strategy for encodings like GBK is:
1. If the current byte is less than 0x80, then it's a 1-byte
character.
2. Current byte 0x81 to 0xFE inclusive: current byte and the next byte
make up a two-byte character.
3. Current byte 0x80: undefined (or used e.g. in cp936 for the 1-byte
euro character)
4: Current byte 0xFF: undefined

Cheers,
John

Thanks John, I will try to write a function to test whether the strategy
above caused the problem I described in the first post :)
 
Marc 'BlackJack' Rintsch

Yep. I fed SAX the unicode string, since SAX doesn't support my
encoding (GBK).

If the `decode()` method supports it, IMHO SAX should too.
Is there any way to solve this better?
I mean if I shouldn't convert the GBK string to unicode string, what
should I do to make SAX work?

Decode it and then encode it to utf-8 before feeding it to the parser.

Ciao,
Marc 'BlackJack' Rintsch
 
John Machin

Yep. I fed SAX the unicode string, since SAX doesn't support my
encoding (GBK).

Let's go back to the beginning. What is "SAX"? Show us exactly what
command or code you used.

How did you let this SAX know that the file was encoded in GBK? An
argument to SAX? An encoding declaration in the first few lines of the
file? Some other method? ... precise answer please. Or did you expect
that this SAX would guess correctly what the encoding was without
being told?

What does "didn't support my encoding system" mean? Have you actually
tried pushing raw undecoded GBK at SAX using a suitable documented
method of telling SAX that the file is in fact encoded in GBK? If so,
what was the error message that you got?

How do you know that it's GBK, anyway? Have you considered these
possible scenarios:
(1) It's GBK but you are telling SAX that it's GB2312
(2) It's GB18030 but you are telling SAX it's GBK

HTH,
John
 
John Machin

On Thu, 24 Jan 2008 04:52:22 -0200, glacier <[email protected]> wrote:
According to your reply, what will happen if I try to decode a long
string piece by piece?
I mean:
######################################
a = '你好吗' * 100000
s1 = u''
cur = 0
while cur < len(a):
    d = min(len(a) - cur, 1023)
    s1 += a[cur:cur+d].decode('mbcs')
    cur += d
######################################
May the code above produce any bogus characters in s1?
Don't do that. You might be splitting the input string at a point that is
not a character boundary. You won't get bogus output, decode will raise a
UnicodeDecodeError instead.
You can control how errors are handled, see http://docs.python.org/lib/string-methods.html#l2h-237

Thanks Gabriel,

I guess I understand what will happen if I don't split the string at a
character boundary.
What I'm not sure about is whether the decode method itself can mis-split
a boundary. Can you tell me then?

Thanks a lot.

*IF* the file is well-formed GBK, then the codec will not mess up when
decoding it to Unicode. The usual cause of mess is a combination of a
human and a text editor :)
 
glacier

Let's go back to the beginning. What is "SAX"? Show us exactly what
command or code you used.
SAX is the package 'xml.sax' distributed with Python 2.5 :)
1. I read the text from a GBK-encoded XML file, then skip the first line,
which declares the encoding.
2. I convert the string to unicode by calling decode('mbcs').
3. I use xml.sax.parseString to parse the string.

########################################################################
f = file('e:/temp/456.xml', 'rb')
s = f.read()
f.close()
n = 0
for i in xrange(len(s)):
    if s[i] == '\n':
        n += 1
    if n == 1:
        s = s[i+1:]
        break
s = '<root>' + s + '</root>'
s = s.decode('mbcs')
xml.sax.parseString(s, handler, handler)
########################################################################

How did you let this SAX know that the file was encoded in GBK? An
argument to SAX? An encoding declaration in the first few lines of the
file? Some other method? ... precise answer please. Or did you expect
that this SAX would guess correctly what the encoding was without
being told?
I didn't tell SAX that the file is encoded in GBK, since I used the
'parseString' method.
What does "didn't support my encoding system" mean? Have you actually
tried pushing raw undecoded GBK at SAX using a suitable documented
method of telling SAX that the file is in fact encoded in GBK? If so,
what was the error message that you got?
I mean SAX only supports a limited number of encodings, such as UTF-8,
UTF-16, etc., which don't include GBK.
How do you know that it's GBK, anyway? Have you considered these
possible scenarios:
(1) It's GBK but you are telling SAX that it's GB2312
(2) It's GB18030 but you are telling SAX it's GBK
Frankly speaking, I can't tell whether the file contains any GB18030
characters... ^______^
 
glacier

On Thu, 24 Jan 2008 04:52:22 -0200, glacier <[email protected]> wrote:
According to your reply, what will happen if I try to decode a long
string piece by piece?
I mean:
######################################
a = '你好吗' * 100000
s1 = u''
cur = 0
while cur < len(a):
    d = min(len(a) - cur, 1023)
    s1 += a[cur:cur+d].decode('mbcs')
    cur += d
######################################
May the code above produce any bogus characters in s1?
Don't do that. You might be splitting the input string at a point that is
not a character boundary. You won't get bogus output, decode will raise a
UnicodeDecodeError instead.
You can control how errors are handled, see http://docs.python.org/lib/string-methods.html#l2h-237
Thanks Gabriel,
I guess I understand what will happen if I don't split the string at a
character boundary.
What I'm not sure about is whether the decode method itself can mis-split
a boundary. Can you tell me then?
Thanks a lot.

*IF* the file is well-formed GBK, then the codec will not mess up when
decoding it to Unicode. The usual cause of mess is a combination of a
human and a text editor :)

I guess the first thing I should do is check whether the file I used for
testing is well-formed GBK :)
 
Martin v. Löwis

Is there any way to solve this better?
Decode it and then encode it to utf-8 before feeding it to the parser.

The tricky part is that you also need to change the encoding declaration
in doing so, but in this case, it should be fairly simple:

unicode_doc = original_doc.decode("gbk")
unicode_doc = unicode_doc.replace('gbk','utf-8',1)
utf8_doc = unicode_doc.encode("utf-8")

This assumes that the string "gbk" occurs in the encoding declaration
as

<?xml version="1.0" encoding="gbk"?>

If the encoding name has a different spelling (e.g. GBK), you need to
cater for that as well. You might want to try replacing the entire
XML declaration (i.e. everything between <? and ?>), or just the
encoding= parameter. Notice that the encoding declaration may include
' instead of ", and may have additional spaces, e.g.

<?xml version = '1.0'
encoding= 'gbK' ?>
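One way to cater for those variants (a hypothetical helper, not part of any library) is a case-insensitive regex that tolerates either quote style and stray whitespace; a Python 3 sketch:

```python
import re

# Matches the encoding= pseudo-attribute inside the XML declaration,
# whichever quote character and spacing it uses.
decl_re = re.compile(r'''(<\?xml[^?]*encoding\s*=\s*)(['"])[^'"]+\2''',
                     re.IGNORECASE)

def to_utf8(original_doc, source_encoding='gbk'):
    """Decode a document and rewrite its declared encoding to utf-8."""
    unicode_doc = original_doc.decode(source_encoding)
    unicode_doc = decl_re.sub(r'\1\2utf-8\2', unicode_doc, count=1)
    return unicode_doc.encode('utf-8')

# Works on the awkwardly spelled declaration from the example above:
doc = "<?xml version = '1.0'\nencoding= 'gbK' ?><root>你好</root>".encode('gbk')
print(to_utf8(doc).decode('utf-8'))
```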

HTH,
Martin
 
Mark Tolonen

*IF* the file is well-formed GBK, then the codec will not mess up when
decoding it to Unicode. The usual cause of mess is a combination of a
human and a text editor :)

SAX uses the expat parser. From the pyexpat module docs:

Expat doesn't support as many encodings as Python does, and its repertoire
of encodings can't be extended; it supports UTF-8, UTF-16, ISO-8859-1
(Latin1), and ASCII. If encoding is given it will override the implicit or
explicit encoding of the document.
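So the workaround amounts to what Marc and Martin described: decode the GBK bytes, patch the declaration, and re-encode to one of expat's supported encodings before parsing. A minimal Python 3 sketch:

```python
import xml.sax
from xml.sax.handler import ContentHandler

class TextCollector(ContentHandler):
    """Collects all character data seen by the parser."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.chunks = []
    def characters(self, content):
        self.chunks.append(content)

gbk_doc = '<?xml version="1.0" encoding="gbk"?><root>你好吗</root>'.encode('gbk')

# expat cannot read GBK, but it can read UTF-8 -- so re-encode first,
# fixing the declaration so it no longer lies about the encoding:
utf8_doc = gbk_doc.decode('gbk').replace('gbk', 'utf-8', 1).encode('utf-8')

handler = TextCollector()
xml.sax.parseString(utf8_doc, handler)
print(''.join(handler.chunks))   # 你好吗
```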

--Mark
 
