Validate string as UTF-8?

Tony Nelson

I'd like to have a fast way to validate large amounts of string data as
being UTF-8.

I don't see a fast way to do it in Python, though:

unicode(s, 'utf-8').encode('utf-8')

seems to notice at least some of the time (the unicode() part works but
the encode() part bombs). I don't consider an RE-based solution to be
fast. GLib provides a routine to do this, and I am using GTK so it's
included in there somewhere, but I don't see a way to call GLib
routines. I don't want to write another extension module.

Is there a (fast) Python function to validate UTF-8 data?

Is there some other fast way to validate UTF-8 data?

Is there a general way to call GLib functions?
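
A rough, untested ctypes sketch of the GLib route, assuming the shared-library
name below and that the routine meant here is g_utf8_validate():

# Load GLib through ctypes and call g_utf8_validate() directly.
# The library name and the use of c_ssize_t are assumptions; adjust
# for your platform and ctypes version.
import ctypes
import ctypes.util

_libname = ctypes.util.find_library("glib-2.0") or "libglib-2.0.so.0"
_glib = ctypes.CDLL(_libname)
_glib.g_utf8_validate.restype = ctypes.c_int           # gboolean
_glib.g_utf8_validate.argtypes = (ctypes.c_char_p,     # const gchar *str
                                  ctypes.c_ssize_t,    # gssize max_len
                                  ctypes.c_void_p)     # const gchar **end

def is_valid_utf8(s):
    # Pass an explicit length so embedded NUL bytes are covered; the
    # "end" out-parameter is not needed here, so pass NULL.
    return bool(_glib.g_utf8_validate(s, len(s), None))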
________________________________________________________________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>
 
Fredrik Lundh

Tony said:
I'd like to have a fast way to validate large amounts of string data as
being UTF-8.

define "validate".
I don't see a fast way to do it in Python, though:

unicode(s, 'utf-8').encode('utf-8')

if "validate" means "make sure the byte stream doesn't use invalid
sequences", a plain

unicode(s, "utf-8")

should be sufficient.
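
A minimal sketch of that suggestion wrapped up as a reusable check (Python 2
spelling, to match the thread; in Python 3 one would call bytes.decode('utf-8')
instead):

# Wrap the decode attempt in a boolean validator.
def is_valid_utf8(s):
    """Return True if the byte string s is well-formed UTF-8."""
    try:
        unicode(s, 'utf-8')
    except UnicodeDecodeError:
        return False
    return True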

</F>
 
Diez B. Roggisch

Tony said:
I'd like to have a fast way to validate large amounts of string data as
being UTF-8.

I don't see a fast way to do it in Python, though:

unicode(s, 'utf-8').encode('utf-8')

seems to notice at least some of the time (the unicode() part works but
the encode() part bombs). I don't consider an RE-based solution to be
fast. GLib provides a routine to do this, and I am using GTK so it's
included in there somewhere, but I don't see a way to call GLib
routines. I don't want to write another extension module.

I somehow doubt that the encode bombs. Can you provide some more
details? Maybe of some allegedly not working strings?

Besides that, it's unnecessary - the unicode(s, "utf-8") should be
sufficient. If there are any undecodable byte sequences in there, that
should find them.
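
For illustration, a quick check with two made-up byte strings (Python 2
syntax) showing that the decode step does reject malformed input:

# One valid UTF-8 string and one truncated multi-byte sequence.
good = 'caf\xc3\xa9'   # e-acute correctly encoded as UTF-8
bad = 'caf\xe9'        # bare Latin-1 e-acute: not valid UTF-8

unicode(good, 'utf-8')            # decodes without complaint
try:
    unicode(bad, 'utf-8')
except UnicodeDecodeError:
    print 'rejected as expected'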

Regards,

Diez
 
Waitman Gobble

I have done this using a system call to the program "recode". Recode a
file to UTF-8 and do a diff on the original and recoded files. Not an
elegant solution, but it did seem to function properly.
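
A rough sketch of the same run-it-through-an-external-tool idea, using iconv
rather than recode (iconv's -f/-t flags are well known, whereas the exact
recode invocation is not something this sketch can vouch for); since iconv
exits non-zero on invalid input, no diff step is needed in this variant:

# Validate a file by piping it through iconv; a non-zero exit status
# means the input was not valid UTF-8.
import os
import subprocess

def file_is_valid_utf8(path):
    devnull = open(os.devnull, 'w')
    try:
        rc = subprocess.call(['iconv', '-f', 'UTF-8', '-t', 'UTF-8', path],
                             stdout=devnull, stderr=devnull)
    finally:
        devnull.close()
    return rc == 0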

Take care,

Waitman Gobble
 
Tony Nelson

"Fredrik Lundh said:
define "validate".

All data conforms to the UTF-8 encoding format. I can stand it if someone
has made data that impersonates UTF-8 but isn't really Unicode.

if "validate" means "make sure the byte stream doesn't use invalid
sequences", a plain

unicode(s, "utf-8")

should be sufficient.

You are correct. I misunderstood what was happening in my code. I
apologise for wasting bandwidth and your time (and I wasted my own time
as well).

Indeed, unicode(s, 'utf-8') will catch the problem and is fast enough
for my purpose, adding about 25% to the time to load a file.
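
For anyone who wants to measure the same overhead on their own data, a rough
sketch (the file name is made up; Python 2 syntax):

# Time only the validation decode of an already-loaded file.
import time

data = open('somefile.txt', 'rb').read()   # hypothetical file name

start = time.time()
unicode(data, 'utf-8')                     # the validation step
elapsed = time.time() - start
print 'decoded %d bytes in %.3f seconds' % (len(data), elapsed)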
________________________________________________________________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>
 
