how to convert my utf8 text saved in a file to gbk


higer

My file contains strings such as:
\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a

I want to read the content of this file and convert it to the
corresponding GBK code, a Chinese character encoding.
Every time I try to convert it, the output is the same no matter
which method I use.
It seems that when Python reads the file, it takes '\' as an ordinary
character, so the string ends up represented as
"\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a"; the "\" is then
output 'correctly', but that is not what I want.

Can anyone help me?


Thanks in advance.
 

R. David Murray

higer said:
My file contains strings such as:
\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a

If those bytes are what is in the file (and it sounds like they are),
then the data in your file is not in UTF8 encoding, it is in ASCII
encoded as hexadecimal escape codes.
I want to read the content of this file and convert it to the
corresponding GBK code, a Chinese character encoding.

You'll have to convert it from hex-escape into UTF8 first, then.

Perhaps better would be to write the original input files in UTF8,
since it sounds like that is what you were intending to do.
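For instance, a minimal Python 2 sketch along these lines (the file name
is only an example) writes the characters as real UTF-8 bytes instead of
escape codes:

# -*- coding: utf-8 -*-
import codecs

# The escape codes in your sample decode to the text 日期： (u'\u65e5\u671f\uff1a').
text = u'\u65e5\u671f\uff1a'

# codecs.open encodes the unicode text to UTF-8 bytes on write, so the
# file ends up holding the 9 real bytes instead of 36 characters of escapes.
f = codecs.open('input.txt', 'w', encoding='utf-8')
f.write(text)
f.close()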
 

John Machin

My file contains strings such as:
\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a

Are you sure? Does that occupy 9 bytes in your file or 36 bytes?
I want to read the content of this file and convert it to the
corresponding GBK code, a Chinese character encoding.
Every time I try to convert it, the output is the same no matter
which method I use.
It seems that when Python reads the file, it takes '\' as an ordinary
character, so the string ends up represented as
"\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a"; the "\" is then
output 'correctly', but that is not what I want.

Can anyone help me?

try this:

utf8_data = your_data.decode('string-escape')
unicode_data = utf8_data.decode('utf8')
# unicode derived from your sample looks like this: 日期： ... is that what you expected?
gbk_data = unicode_data.encode('gbk')

If that "doesn't work", do three things:
(1) give us some unambiguous hard evidence about the contents of your
data:
e.g. # assuming Python 2.x
your_data = open('your_file.txt', 'rb').read(36)
print repr(your_data)
print len(your_data)
print your_data.count('\\')
print your_data.count('x')

(2) show us the source of the script that you used
(3) Tell us what "doesn't work" means in this case

Cheers,
John
 

John Machin

If those bytes are what is in the file (and it sounds like they are),
then the data in your file is not in UTF8 encoding, it is in ASCII
encoded as hexadecimal escape codes.

OK, I'll bite: what *ASCII* character is encoded as either "\xe6" or
r"\xe6" by what mechanism in which parallel universe?
 

MRAB

John said:
OK, I'll bite: what *ASCII* character is encoded as either "\xe6" or
r"\xe6" by what mechanism in which parallel universe?

Maybe he means that the file itself is in ASCII.
 

John Machin

Maybe he means that the file itself is in ASCII.

Maybe indeed, but only so because hex escape codes are by design in
ASCII. "in ASCII" is redundant ... I can't imagine how the OP parsed
"ASCII <omitted 'because it is'> encoded" given that his native
tongue's grammar varies from that of English in several interesting
ways :)
 

higer

Are you sure? Does that occupy 9 bytes in your file or 36 bytes?

It was saved in a file, so it occupies 36 bytes. If I just put this
string in a variable, I can certainly work out the correct result, but
how do I get the right answer when reading it from a file?
try this:

utf8_data = your_data.decode('string-escape')
unicode_data = utf8_data.decode('utf8')
# unicode derived from your sample looks like this: 日期： ... is that what you expected?

You are right, the result is 日期, which is just what I expected. If
you save the string in a variable, you surely get the correct result.
But that was just a sample, so I gave a short string; what about a file
with many characters in it?
gbk_data = unicode_data.encode('gbk')

I have tried the method you just told me, but unfortunately it does
not work (the output is garbled).

If that "doesn't work", do three things:
(1) give us some unambiguous hard evidence about the contents of your
data:
e.g. # assuming Python 2.x

My Python version is 2.5.2
your_data = open('your_file.txt', 'rb').read(36)
print repr(your_data)
print len(your_data)
print your_data.count('\\')
print your_data.count('x')

The result is:

'\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a'
36
9
9
(2) show us the source of the script that you used

def UTF8ToChnWords():
    f = open("123.txt", "rb")
    content = f.read()
    print repr(content)
    print len(content)
    print content.count("\\")
    print content.count("x")

if __name__ == '__main__':
    UTF8ToChnWords()
(3) Tell us what "doesn't work" means in this case

It doesn't work because, no matter how we deal with it, we keep
getting a 36-byte string instead of 9 bytes, so we cannot get the
correct answer.
Cheers,
John

Thank you very much,
higer
 

R. David Murray

John Machin said:
OK, I'll bite: what *ASCII* character is encoded as either "\xe6" or
r"\xe6" by what mechanism in which parallel universe?

Well, you are correct that the OP might have had trouble parsing my
English. My English is more or less valid ("[the file] is _in_ ASCII",
i.e. consists of ASCII characters, "encoded as hexadecimal escape codes",
which specifies the encoding used). But perhaps it would have been better
to just say that the data is encoded as hexadecimal escape sequences.

--David
 

Mark Tolonen

higer said:
It was saved in a file, so it occupies 36 bytes. If I just put this
string in a variable, I can certainly work out the correct result, but
how do I get the right answer when reading it from a file?

Did you create this file? If it is 36 characters, it contains literal
backslash characters, not the 9 bytes that would correctly encode as UTF-8.
If you created the file yourself, show us the code.
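The difference is easy to demonstrate in Python 2 (the file names here
are only examples):

# with the r prefix the backslashes are literal text, so this writes 36 bytes:
open('escaped.txt', 'wb').write(r'\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a')
print len(open('escaped.txt', 'rb').read())    # 36

# without the r prefix the escapes become real bytes, so this writes 9 bytes:
open('utf8.txt', 'wb').write('\xe6\x97\xa5\xe6\x9c\x9f\xef\xbc\x9a')
print len(open('utf8.txt', 'rb').read())       # 9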
You are right, the result is 日期, which is just what I expected. If
you save the string in a variable, you surely get the correct result.
But that was just a sample, so I gave a short string; what about a file
with many characters in it?


I have tried the method you just told me, but unfortunately it does
not work (the output is garbled).

How are you determining that the output is garbled? How are you viewing
the result? You'll need to use a viewer that understands GBK encoding,
such as Notepad on a Chinese Windows system.
My Python version is 2.5.2


The result is:

'\\xe6\\x97\\xa5\\xe6\\x9c\\x9f\\xef\\xbc\\x9a'
36
9
9


def UTF8ToChnWords():
    f = open("123.txt", "rb")
    content = f.read()
    print repr(content)
    print len(content)
    print content.count("\\")
    print content.count("x")

Try:

utf8data = content.decode('string-escape')
unicodedata = utf8data.decode('utf8')
gbkdata = unicodedata.encode('gbk')
print len(gbkdata),repr(gbkdata)
open("456.txt","wb").write(gbkdata)

The print should give:

6 '\xc8\xd5\xc6\xda\xa3\xba'

This is correct for GBK encoding. 456.txt should contain the 6 bytes of GBK
data. View the file with a program that understands GBK encoding.
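If you don't have such a viewer handy, you can also check the round trip
from Python itself (a small sketch, assuming the 456.txt written above):

import codecs

# read the GBK bytes back and decode them; you should get the original text
check = codecs.open('456.txt', 'rb', encoding='gbk').read()
print repr(check)    # u'\u65e5\u671f\uff1a'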

-Mark
 

higer

Thank you, Mark, that works.

Using 'string-escape' to decode the content first is the key point, so
I can get the Chinese characters now.
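For anyone who finds this thread later, the whole conversion now boils
down to something like the following sketch (put together from Mark's
snippet; the function name and file names are just my test setup):

def escaped_utf8_to_gbk(src='123.txt', dst='456.txt'):
    content = open(src, 'rb').read()               # 36 bytes of text like '\xe6\x97...'
    utf8_data = content.decode('string-escape')    # the real 9 UTF-8 bytes
    gbk_data = utf8_data.decode('utf8').encode('gbk')
    open(dst, 'wb').write(gbk_data)                # 6 bytes of GBK

if __name__ == '__main__':
    escaped_utf8_to_gbk()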




Regards,
-higer
 
