how to convert my utf8 text saved in a file to gbk


higer

My file contains strings such as:
\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a

I want to read the content of this file and convert it to the
corresponding GBK code, a Chinese character encoding.
Every time I try to convert it, the output is the same no matter
which method I use.
It seems that when Python reads the file, it takes '\' as an ordinary
character, so the string ends up represented as
"\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a"; the "\" is then
output 'correctly', but that is not what I want.

Can anyone help me?


Thanks in advance.
 

R. David Murray

higer said:
My file contains strings such as:
\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a

If those bytes are what is in the file (and it sounds like they are),
then the data in your file is not in UTF8 encoding, it is in ASCII
encoded as hexadecimal escape codes.
I want to read the content of this file and convert it to the
corresponding GBK code, a Chinese character encoding.

You'll have to convert it from hex-escape into UTF8 first, then.

Perhaps better would be to write the original input files in UTF8,
since it sounds like that is what you were intending to do.
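For instance, a minimal Python 2 sketch along these lines (the file name
is only an example) writes the characters as real UTF-8 bytes instead of
escape codes:

# -*- coding: utf-8 -*-
import codecs

# The escape codes in your sample decode to the text 日期： (u'\u65e5\u671f\uff1a').
text = u'\u65e5\u671f\uff1a'

# codecs.open encodes the unicode text to UTF-8 bytes on write, so the
# file ends up holding the 9 real bytes instead of 36 characters of escapes.
f = codecs.open('input.txt', 'w', encoding='utf-8')
f.write(text)
f.close()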
 

John Machin

My file contains strings such as:
\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a

Are you sure? Does that occupy 9 bytes in your file or 36 bytes?
I want to read the content of this file and convert it to the
corresponding GBK code, a Chinese character encoding.
Every time I try to convert it, the output is the same no matter
which method I use.
It seems that when Python reads the file, it takes '\' as an ordinary
character, so the string ends up represented as
"\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a"; the "\" is then
output 'correctly', but that is not what I want.

Can anyone help me?

try this:

utf8_data = your_data.decode('string-escape')
unicode_data = utf8_data.decode('utf8')
# unicode derived from your sample looks like this: 日期： ... is that what you expected?
gbk_data = unicode_data.encode('gbk')

If that "doesn't work", do three things:
(1) give us some unambiguous hard evidence about the contents of your
data:
e.g. # assuming Python 2.x
your_data = open('your_file.txt', 'rb').read(36)
print repr(your_data)
print len(your_data)
print your_data.count('\\')
print your_data.count('x')

(2) show us the source of the script that you used
(3) Tell us what "doesn't work" means in this case

Cheers,
John
 

John Machin

If those bytes are what is in the file (and it sounds like they are),
then the data in your file is not in UTF8 encoding, it is in ASCII
encoded as hexadecimal escape codes.

OK, I'll bite: what *ASCII* character is encoded as either "\xe6" or
r"\xe6" by what mechanism in which parallel universe?
 

MRAB

John said:
OK, I'll bite: what *ASCII* character is encoded as either "\xe6" or
r"\xe6" by what mechanism in which parallel universe?

Maybe he means that the file itself is in ASCII.
 

John Machin

Maybe he means that the file itself is in ASCII.

Maybe indeed, but only so because hex escape codes are by design in
ASCII. "in ASCII" is redundant ... I can't imagine how the OP parsed
"ASCII <omitted 'because it is'> encoded" given that his native
tongue's grammar varies from that of English in several interesting
ways :)
 

higer

Are you sure? Does that occupy 9 bytes in your file or 36 bytes?

It was saved in a file, so it occupies 36 bytes. If I just put this
string in a variable, I can certainly work out the correct result, but
how do I get the right answer when reading it from a file?
try this:

utf8_data = your_data.decode('string-escape')
unicode_data = utf8_data.decode('utf8')
# unicode derived from your sample looks like this: 日期： ... is that what you expected?

You are right, the result is 日期, which is just what I expected. If
you save the string in a variable, you surely get the correct result.
But that was just a sample, so I gave a short string; what about a file
with many characters in it?
gbk_data = unicode_data.encode('gbk')

I have tried the method you just told me, but unfortunately it does
not work (the output is garbled).

If that "doesn't work", do three things:
(1) give us some unambiguous hard evidence about the contents of your
data:
e.g. # assuming Python 2.x

My Python version is 2.5.2
your_data = open('your_file.txt', 'rb').read(36)
print repr(your_data)
print len(your_data)
print your_data.count('\\')
print your_data.count('x')

The result is:

'\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a'
36
9
9
(2) show us the source of the script that you used

def UTF8ToChnWords():
    f = open("123.txt", "rb")
    content = f.read()
    print repr(content)
    print len(content)
    print content.count("\\")
    print content.count("x")

if __name__ == '__main__':
    UTF8ToChnWords()
(3) Tell us what "doesn't work" means in this case

It doesn't work because, no matter how we deal with it, we keep
getting a 36-byte string instead of 9 bytes, so we cannot get the
correct answer.
Cheers,
John

Thank you very much,
higer
 

R. David Murray

John Machin said:
OK, I'll bite: what *ASCII* character is encoded as either "\xe6" or
r"\xe6" by what mechanism in which parallel universe?

Well, you are correct that the OP might have had trouble parsing my
English. My English is more or less valid ("[the file] is _in_ ASCII",
i.e. consists of ASCII characters, "encoded as hexadecimal escape codes",
which specifies the encoding used). But perhaps it would have been better
to just say that the data is encoded as hexadecimal escape sequences.

--David
 

Mark Tolonen

higer said:
It was saved in a file, so it occupies 36 bytes. If I just put this
string in a variable, I can certainly work out the correct result, but
how do I get the right answer when reading it from a file?

Did you create this file? If it is 36 characters, it contains literal
backslash characters, not the 9 bytes that would correctly encode as UTF-8.
If you created the file yourself, show us the code.
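The difference is easy to demonstrate in Python 2 (the file names here
are only examples):

# with the r prefix the backslashes are literal text, so this writes 36 bytes:
open('escaped.txt', 'wb').write(r'\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a')
print len(open('escaped.txt', 'rb').read())    # 36

# without the r prefix the escapes become real bytes, so this writes 9 bytes:
open('utf8.txt', 'wb').write('\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a')
print len(open('utf8.txt', 'rb').read())       # 9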
You are right, the result is 日期, which is just what I expected. If
you save the string in a variable, you surely get the correct result.
But that was just a sample, so I gave a short string; what about a file
with many characters in it?


I have tried the method you just told me, but unfortunately it does
not work (the output is garbled).

How are you determining that the output is garbled? How are you viewing
the result? You'll need to use a viewer that understands GBK encoding,
such as Notepad on a Chinese Windows system.
My Python version is 2.5.2


The result is:

'\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a'
36
9
9


def UTF8ToChnWords():
    f = open("123.txt", "rb")
    content = f.read()
    print repr(content)
    print len(content)
    print content.count("\\")
    print content.count("x")

Try:

utf8data = content.decode('string-escape')
unicodedata = utf8data.decode('utf8')
gbkdata = unicodedata.encode('gbk')
print len(gbkdata),repr(gbkdata)
open("456.txt","wb").write(gbkdata)

The print should give:

6 '\xc8\xd5\xc6\xda\xa3\xba'

This is correct for GBK encoding. 456.txt should contain the 6 bytes of GBK
data. View the file with a program that understands GBK encoding.
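If you don't have such a viewer handy, you can also check the round trip
from Python itself (a small sketch, assuming the 456.txt written above):

import codecs

# read the GBK bytes back and decode them; you should get the original text
check = codecs.open('456.txt', 'rb', encoding='gbk').read()
print repr(check)    # u'\u65e5\u671f\uff1a'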

-Mark
 

higer

Thank you, Mark, that works.

Using 'string-escape' to decode the content first is the key point, so
I can get the Chinese characters now.
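For anyone who finds this thread later, the whole conversion now boils
down to something like the following sketch (put together from Mark's
snippet; the function name and file names are just my test setup):

def escaped_utf8_to_gbk(src='123.txt', dst='456.txt'):
    content = open(src, 'rb').read()               # 36 bytes of text like '\xe6\x97...'
    utf8_data = content.decode('string-escape')    # the real 9 UTF-8 bytes
    gbk_data = utf8_data.decode('utf8').encode('gbk')
    open(dst, 'wb').write(gbk_data)                # 6 bytes of GBK

if __name__ == '__main__':
    escaped_utf8_to_gbk()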




Regards,
-higer
 
