How to clean XML files of non-UTF-8 chars?

Krzysieq · Sep 17, 2008

[Note: parts of this message were removed to make it a legal post.]

Hi,

I have a problem. I'm trying to parse some JMeter test results with Ruby;
they are stored in XML files. Unfortunately, while they should be UTF-8,
some of them aren't, probably because some of the database data isn't. In
any case, this breaks other tools, such as XSLT transformation and anything
else that relies on the XML files being UTF-8.

Does anyone know how to get rid of such characters? When opened in an
editor like Kate, they are displayed as a white question mark in a black
square. I don't really care much about the data: if it's missing some chars,
nobody will care. The point is to avoid destroying the XML structure and to
let the other tools work. Any help will be greatly appreciated.

Cheers,
Chris

Brian Candler · Sep 17, 2008

If you really don't care about the content:
str.gsub(/[\x80-\xff]/,'?')

Rob Biedenharn · Sep 17, 2008

If you really don't care about the content:
str.gsub(/[\x80-\xff]/,'?')

You can have bytes in that range as the first byte of a well-formed
UTF-8 byte sequence. They just can't stand on their own as a single-byte
character. It's just not that simple.
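
A minimal sketch of the point above, assuming Ruby 2.0 or later (for String#b); the helper names are hypothetical. It shows that a perfectly valid UTF-8 character is made entirely of bytes in the 0x80-0xFF range, so the blanket gsub replaces it as well:

```ruby
# Return the bytes of +str+ that fall in the 0x80-0xFF range.
def high_bytes(str)
  str.bytes.select { |b| b >= 0x80 }
end

# Replace every high byte with "?", working on raw bytes.
# The /n flag makes the pattern a binary (ASCII-8BIT) regexp.
def blank_high_bytes(str)
  str.b.gsub(/[\x80-\xff]/n, "?")
end

valid = "caf\u00E9"           # "café", valid UTF-8
p high_bytes(valid)           # the two bytes that encode "é"
p blank_high_bytes(valid)     # the valid "é" is lost too
```

So the gsub is safe only when losing every non-ASCII character is acceptable.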

-Rob

Rob Biedenharn http://agileconsultingllc.com

James Gray · Sep 17, 2008

I have a problem. I'm trying to parse with ruby some test results from
jmeter, that are stored in xml files. Unfortunately, while they
should be utf-8, some of them aren't. Probably because some db data
isn't. In any case, this makes other toys break down, like xslt
transformation and
anything else that relies on the xml files being utf-8.

Does anyone know, how to get rid of such characters?

If you can figure out the encoding they are actually in, I recommend
using Iconv's transliterate mode:

require "iconv"
Iconv.conv("UTF-8//TRANSLIT", old_encoding_name, data)
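
For readers on a newer Ruby: Iconv was removed from the standard library in Ruby 2.0, and String#encode covers the plain conversion case. The sketch below assumes the source encoding is known (ISO-8859-1 is only an example) and the method name is made up. Note that String#encode does not transliterate the way //TRANSLIT does.

```ruby
# Convert raw bytes from a known source encoding to UTF-8.
# +old_encoding_name+ plays the same role as in the Iconv call above.
def to_utf8(data, old_encoding_name)
  data.encode("UTF-8", old_encoding_name)
end

latin1_bytes = "caf\xE9".b    # "café" as ISO-8859-1 bytes
utf8 = to_utf8(latin1_bytes, "ISO-8859-1")
p utf8.encoding
p utf8.valid_encoding?
```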

James Edward Gray II

Krzysieq · Sep 17, 2008

Hey,

Thanks for the input. So do you have another suggestion?

Cheers,
Chris

2008/9/17 Rob Biedenharn said:
str.gsub(/[\x80-\xff]/,'?')

You can have bytes in that range as the first byte of a well-formed UTF-8
Byte Sequence. They just can't represent a single byte. It's just not that
simple.

-Rob

Rob Biedenharn http://agileconsultingllc.com

Mark Thomas · Sep 17, 2008

Hi,

I have a problem. I'm trying to parse with ruby some test results from
jmeter, that are stored in xml files. Unfortunately, while they should be
utf-8, some of them aren't. Probably because some db data isn't. In any
case, this makes other toys break down, like xslt transformation and
anything else that relies on the xml files being utf-8.

Look at http://www.botvector.net/2007/11/encoding-problems.html,
particularly the "iconvert" method which attempts conversion to UTF-8,
but in the case where the string cannot be converted to UTF-8 (e.g.
double-byte chars) then it replaces the chars with "?".

-- Mark.

Brian Candler · Sep 17, 2008

Rob said:
If you really don't care about the content:
str.gsub(/[\x80-\xff]/,'?')

You can have bytes in that range as the first byte of a well-formed
UTF-8 Byte Sequence. They just can't represent a single byte. It's
just not that simple.

That's why I said "if you really don't care" ... it strips all valid
non-ASCII UTF8 as well as invalid.

There is a nice table at http://en.wikipedia.org/wiki/UTF-8 which would
let you build something more accurate. Ruby quiz perhaps?
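
One possible take on "something more accurate", as a sketch only: a binary regexp that follows the well-formed byte ranges from the UTF-8 table (no overlong forms, no surrogates, nothing above U+10FFFF) and keeps only the sequences that match. The constant and method names are hypothetical, and it assumes Ruby 2.0 or later for String#b.

```ruby
# Well-formed UTF-8 byte sequences, one alternative per row of the table.
WELL_FORMED_UTF8 = /
    [\x00-\x7F]                          # ASCII
  | [\xC2-\xDF][\x80-\xBF]               # 2-byte sequences
  | \xE0[\xA0-\xBF][\x80-\xBF]           # 3-byte, excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3-byte, general case
  | \xED[\x80-\x9F][\x80-\xBF]           # 3-byte, excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4-byte, excluding overlongs
  | [\xF1-\xF3][\x80-\xBF]{3}            # 4-byte, general case
  | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4-byte, up to U+10FFFF
/nx

# Keep only the well-formed sequences; every other byte is dropped.
def keep_well_formed_utf8(bytes)
  bytes.b.scan(WELL_FORMED_UTF8).join.force_encoding("UTF-8")
end

cleaned = keep_well_formed_utf8("a\xC3\xA9\xFFb\x80".b)
p cleaned.valid_encoding?
```

Unlike the blanket gsub, this leaves valid non-ASCII characters alone and removes only the stray bytes.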

Jeremy Hinegardner · Sep 17, 2008

If you can figure out the encoding they are actually in, I recommend using
Iconv's transliterate mode:

require "iconv"
Iconv.conv("UTF-8//TRANSLIT", old_encoding_name, data)

This is the approach we have taken in some of our code; basically we wanted
to replicate the 'iconv -c' behavior. Does TRANSLIT do this? I've never used
that mode before.

require "iconv"     # standard library in Ruby 1.8
require "stringio"

module UTF8
  module Cleanable
    #
    # Converts the string representation of this object to a clean UTF-8
    # string. This assumes that #to_s on the object will result in a UTF-8
    # string. All bytes that are not part of a valid UTF-8 sequence will be
    # silently dropped.
    #
    def utf8_clean
      Iconv.open( "UTF-8", "UTF-8" ) do |iconv|
        output  = StringIO.new
        working = self.to_s
        loop do
          begin
            output.print iconv.iconv( working )
            break
          rescue Iconv::IllegalSequence => is
            # Keep what converted cleanly, then skip the offending byte.
            output.print is.success
            working = is.failed[1..-1]
          end
        end
        return output.string
      end
    end
  end
end

class String
  include UTF8::Cleanable
end
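
On Ruby 2.1 or later, where Iconv is no longer in the standard library, String#scrub gives roughly the same "drop what is not valid UTF-8" behavior. This is a sketch under that assumption, not a drop-in replacement for the module above; the method name is hypothetical.

```ruby
# Return a copy of obj.to_s with every invalid UTF-8 byte sequence removed.
def scrub_utf8(obj)
  # dup so the caller's string is not re-tagged by force_encoding
  obj.to_s.dup.force_encoding("UTF-8").scrub("")
end

dirty = "ab\xFFcd".b          # one stray byte in the middle
clean = scrub_utf8(dirty)
p clean.valid_encoding?
```

Passing "?" instead of "" to scrub substitutes a visible placeholder rather than dropping the bytes.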

enjoy,

-jeremy

Gregory Brown · Sep 17, 2008

This is the approach we have taken in some of our code; basically we wanted
to replicate the 'iconv -c' behavior. Does TRANSLIT do this? I've never used
that mode before.

module UTF8
module Cleanable
#
# Converts the string representation of this class to a utf8 clean
# string. This assumes that #to_s on the object will result in a utf8
# string. All chars that are not valid utf8 char sequences will be
# silently dropped.

To silently drop chars with Iconv, you'd want to do:

Iconv.conv("UTF-8//IGNORE", old_encoding_name, data)

TRANSLIT just works a little harder and tries to convert your
characters into a series of UTF-8 chars if possible.
I'm not sure if it drops chars that can't be transliterated...

-greg

James Gray · Sep 17, 2008

This is the approach we have taken in some of our code; basically we wanted
to replicate the 'iconv -c' behavior. Does TRANSLIT do this? I've never used
that mode before.

//TRANSLIT is better than that. It tries to translate the
characters. Thus a UTF-8 ellipsis would become three periods if
converted to ISO-8859-1 with //TRANSLIT.

You can mimic -c though, just use //IGNORE instead of //TRANSLIT. You
can even do //TRANSLIT//IGNORE which transliterates what it can and
discards the rest.
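
A rough imitation of //TRANSLIT//IGNORE without Iconv, as a sketch only. String#encode has no transliteration table of its own, so the tiny TRANSLIT hash below is a hypothetical stand-in for iconv's much larger one, and the input is assumed to be valid UTF-8.

```ruby
# Hypothetical mini transliteration table (iconv's real table is far larger).
TRANSLIT = {
  "\u2026" => "...",   # ellipsis
  "\u201C" => '"',     # left double quotation mark
  "\u201D" => '"'      # right double quotation mark
}.freeze

# Transliterate what the table knows, then discard whatever still cannot be
# represented in the target encoding (the //IGNORE part).
def translit_then_ignore(text, target = "ISO-8859-1")
  replaced = text.gsub(Regexp.union(TRANSLIT.keys)) { |ch| TRANSLIT[ch] }
  replaced.encode(target, undef: :replace, replace: "")
end

p translit_then_ignore("Wait\u2026 ok").encoding
```

Leaving out the gsub step gives plain //IGNORE behavior: anything the target encoding cannot hold is simply dropped.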

James Edward Gray II

Krzysieq · Sep 18, 2008

Unfortunately, there's no way of telling the original encoding. I would
rather go for some method of removing or substituting the chars that don't
belong there, but the method first suggested by Brian doesn't seem to work
for some reason. Does anyone have another option? I'm investigating the
reasons for the failure and will write more when I know more. Thanks for all
the help anyway.

Cheers,
Chris

Gregory Brown · Sep 18, 2008

Unfortunately, there's no way of telling the original encoding. I would rather
go for some method of removing / substituting the chars that don't belong
there, but the method first suggested by Brian doesn't seem to work for some
reason. Does anyone have another option? I'm investigating the reasons of
failure, I will write more when I know something more. Thanks for all help
anyways

If there is no way of telling the original encoding, the input data
may not have valid unicode in it at all, right?
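
One way to find out how much of the input is valid, sketched with String#valid_encoding? (Ruby 1.9 or later, so an assumption relative to this thread; String#b needs 2.0). The helper names are hypothetical. The second helper reports which lines contain bad bytes, which helps to see whether the damage is limited to a few text nodes.

```ruby
# True if the raw bytes form well-formed UTF-8.
def valid_utf8?(bytes)
  bytes.dup.force_encoding("UTF-8").valid_encoding?
end

# 1-based numbers of the lines that contain invalid UTF-8.
def invalid_utf8_lines(bytes)
  bad = []
  bytes.b.each_line.with_index(1) do |line, number|
    bad << number unless line.force_encoding("UTF-8").valid_encoding?
  end
  bad
end

sample = "ok\nbad \xE9 byte\ncaf\xC3\xA9\n".b
p valid_utf8?(sample)
p invalid_utf8_lines(sample)
```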

-greg

Mark Thomas · Sep 18, 2008

Unfortunately, there's no way of telling the original encoding. I would rather
go for some method of removing / substituting the chars that don't belong
there, but the method first suggested by Brian doesn't seem to work for some
reason. Does anyone have another option?

Try the iconv solutions with latin-1 (iso-8859-1) as the From. That's
as close as you can get to a one-byte "anything-goes" encoding.
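
A small sketch of why Latin-1 works as an "anything-goes" source, written with String#encode rather than Iconv (an assumption: Ruby 1.9 or later; the method name is hypothetical). Every byte value maps to some character, so the conversion cannot fail. The catch is that bytes which already were valid UTF-8 get converted a second time and turn into two characters each.

```ruby
# Treat the raw bytes as ISO-8859-1 and convert them to UTF-8.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

every_byte = (0..255).to_a.pack("C*")            # all 256 byte values
p latin1_to_utf8(every_byte).valid_encoding?     # never raises

already_utf8 = "caf\u00E9"
p latin1_to_utf8(already_utf8).length            # one character longer than the input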

-Mark.

Krzysieq · Sep 19, 2008

Ok, I tried all the previous suggestions and none of them worked (the gsub
idea, TRANSLIT, IGNORE, or the one from the link posted by Mark Thomas). In
fact, the last two don't seem to have done anything, while gsub seems to do
too much: it seems to have damaged the XML structure in some way, which
seems very strange to me. I don't really care about the data inside, but I
need the XML to remain valid.

@Gregory - that's true, it may not. However, the places where I found the
funny characters are text nodes inside the XML documents, and there aren't
that many of them. Sure, one is enough to break the whole thing, but
typically there are very few, and it looks more like corrupted database
data. I think they store newspaper articles or pieces of news there. I
learned from the team who maintain that database in their app that it should
normally all be ISO-8859-1, but for some reason that's not always the case.
Hence the idea of corrupted data seems quite likely.
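
If the file really is mostly UTF-8 with a few ISO-8859-1 bytes leaking in from the database, one option is to keep the valid UTF-8 as it is and reinterpret only the invalid bytes as Latin-1. This is a sketch under exactly that assumption, using String#scrub with a block (Ruby 2.1 or later); the method name is hypothetical.

```ruby
# Keep valid UTF-8 untouched; convert each invalid chunk from ISO-8859-1.
def repair_mixed(bytes)
  bytes.dup.force_encoding("UTF-8").scrub do |bad|
    bad.encode("UTF-8", "ISO-8859-1")
  end
end

mixed = "caf\xC3\xA9 na\xEFve".b   # UTF-8 "é" next to a Latin-1 "ï"
p repair_mixed(mixed).valid_encoding?
```

A Latin-1 byte run that happens to form a valid UTF-8 sequence would be left alone, so the result is a best guess rather than a guaranteed repair.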

Thanks for any help you can provide.

Cheers,
Chris

2008/9/18 Mark Thomas said:
Unfortunately, there's no way of telling the original encoding. I would rather
go for some method of removing / substituting the chars that don't belong
there, but the method first suggested by Brian doesn't seem to work for some
reason. Does anyone have another option?

Try the iconv solutions with latin-1 (iso-8859-1) as the From. That's
as close as you can get to a one-byte "anything-goes" encoding.

-Mark.

Gregory Brown · Sep 19, 2008

Thanks for any help You can provide me with

Silly question, but did you set $KCODE = "U" while processing your data?

-greg

Krzysieq · Sep 19, 2008

Silly answer, but what is $KCODE? I'm relatively new to Ruby, so this tells
me nothing... And as you might have guessed, no, I haven't set it. What does
it do?

Cheers,
Chris

Mark Thomas · Sep 19, 2008

How is the XML file created? If you know in advance which parts of the
XML come from the database, wrap those sections in CDATA blocks and
your XML will remain valid.

Gregory Brown · Sep 19, 2008

Silly answer, but what is $KCODE ?? I'm relatively new to Ruby, so this tells
me nothing... And as You might have guessed, no, I haven't set it. What's it
do?

It tells Ruby that you are working with UTF-8.

-greg

James Gray · Sep 19, 2008

Silly answer, but what is $KCODE ??

It's a global variable that affects how Ruby 1.8 handles characters.

And as You might have guessed, no, I haven't set it.

Does your code run inside of a recent version of Rails? I'm just
asking because it sets $KCODE for you.

James Edward Gray II
