Validate string as UTF-8?

Tony Nelson

I'd like to have a fast way to validate large amounts of string data as
being UTF-8.

I don't see a fast way to do it in Python, though:

unicode(s, 'utf-8').encode('utf-8')

seems to notice at least some of the time (the unicode() part works but
the encode() part bombs). I don't consider an RE-based solution to be
fast. GLib provides a routine to do this, and I am using GTK so it's
included in there somewhere, but I don't see a way to call GLib
routines. I don't want to write another extension module.

Is there a (fast) Python function to validate UTF-8 data?

Is there some other fast way to validate UTF-8 data?

Is there a general way to call GLib functions?
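
A rough, untested ctypes sketch of the GLib route, assuming the shared-library
name below and that the routine meant here is g_utf8_validate():

# Load GLib through ctypes and call g_utf8_validate() directly.
# The library name and the use of c_ssize_t are assumptions; adjust
# for your platform and ctypes version.
import ctypes
import ctypes.util

_libname = ctypes.util.find_library("glib-2.0") or "libglib-2.0.so.0"
_glib = ctypes.CDLL(_libname)
_glib.g_utf8_validate.restype = ctypes.c_int           # gboolean
_glib.g_utf8_validate.argtypes = (ctypes.c_char_p,     # const gchar *str
                                  ctypes.c_ssize_t,    # gssize max_len
                                  ctypes.c_void_p)     # const gchar **end

def is_valid_utf8(s):
    # Pass an explicit length so embedded NUL bytes are covered; the
    # "end" out-parameter is not needed here, so pass NULL.
    return bool(_glib.g_utf8_validate(s, len(s), None))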
________________________________________________________________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>
 
Fredrik Lundh

Tony said:
I'd like to have a fast way to validate large amounts of string data as
being UTF-8.

define "validate".
I don't see a fast way to do it in Python, though:

unicode(s, 'utf-8').encode('utf-8')

if "validate" means "make sure the byte stream doesn't use invalid
sequences", a plain

unicode(s, "utf-8")

should be sufficient.
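
A minimal sketch of that suggestion wrapped up as a reusable check (Python 2
spelling, to match the thread; in Python 3 one would call bytes.decode('utf-8')
instead):

# Wrap the decode attempt in a boolean validator.
def is_valid_utf8(s):
    """Return True if the byte string s is well-formed UTF-8."""
    try:
        unicode(s, 'utf-8')
    except UnicodeDecodeError:
        return False
    return True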

</F>
 
Diez B. Roggisch

Tony said:
I'd like to have a fast way to validate large amounts of string data as
being UTF-8.

I don't see a fast way to do it in Python, though:

unicode(s, 'utf-8').encode('utf-8')

seems to notice at least some of the time (the unicode() part works but
the encode() part bombs). I don't consider an RE-based solution to be
fast. GLib provides a routine to do this, and I am using GTK so it's
included in there somewhere, but I don't see a way to call GLib
routines. I don't want to write another extension module.

I somehow doubt that the encode bombs. Can you provide some more
details? Maybe of some allegedly not working strings?

Besides that, it's unnecessary - the unicode(s, "utf-8") should be
sufficient. If there are any undecodable byte sequences in there, that
should find them.
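
For illustration, a quick check with two made-up byte strings (Python 2
syntax) showing that the decode step does reject malformed input:

# One valid UTF-8 string and one truncated multi-byte sequence.
good = 'caf\xc3\xa9'   # e-acute correctly encoded as UTF-8
bad = 'caf\xe9'        # bare Latin-1 e-acute: not valid UTF-8

unicode(good, 'utf-8')            # decodes without complaint
try:
    unicode(bad, 'utf-8')
except UnicodeDecodeError:
    print 'rejected as expected'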

Regards,

Diez
 
Waitman Gobble

I have done this using a system call to the program "recode". Recode a
file to UTF-8 and do a diff on the original and recoded files. Not an
elegant solution, but it did seem to function properly.
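
A rough sketch of the same run-it-through-an-external-tool idea, using iconv
rather than recode (iconv's -f/-t flags are well known, whereas the exact
recode invocation is not something this sketch can vouch for); since iconv
exits non-zero on invalid input, no diff step is needed in this variant:

# Validate a file by piping it through iconv; a non-zero exit status
# means the input was not valid UTF-8.
import os
import subprocess

def file_is_valid_utf8(path):
    devnull = open(os.devnull, 'w')
    try:
        rc = subprocess.call(['iconv', '-f', 'UTF-8', '-t', 'UTF-8', path],
                             stdout=devnull, stderr=devnull)
    finally:
        devnull.close()
    return rc == 0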

Take care,

Waitman Gobble
 
Tony Nelson

"Fredrik Lundh said:
define "validate".

All data conforms to the UTF-8 encoding format. I can stand it if someone
has made data that impersonates UTF-8 but isn't really Unicode.

if "validate" means "make sure the byte stream doesn't use invalid
sequences", a plain

unicode(s, "utf-8")

should be sufficient.

You are correct. I misunderstood what was happening in my code. I
apologise for wasting bandwidth and your time (and I wasted my own time
as well).

Indeed, unicode(s, 'utf-8') will catch the problem and is fast enough
for my purpose, adding about 25% to the time to load a file.
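
For anyone who wants to measure the same overhead on their own data, a rough
sketch (the file name is made up; Python 2 syntax):

# Time only the validation decode of an already-loaded file.
import time

data = open('somefile.txt', 'rb').read()   # hypothetical file name

start = time.time()
unicode(data, 'utf-8')                     # the validation step
elapsed = time.time() - start
print 'decoded %d bytes in %.3f seconds' % (len(data), elapsed)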
________________________________________________________________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>
 
