how to convert all invalid UTF-8 sequences to numeric equivalent?

Shambo

Hey folks,

I've been grappling with this for days, and can see no option but to
use brute force.

We have a ton of text files from all over the world, oftentimes
including invalid UTF-8 characters such as ø or £ (that was an o with
a line thru it, a la Scandinavian letters, and a British pound
sterling symbol). When I convert these text files to XML, the
resulting XML is not valid because it contains these characters. I can
map individual characters to their numerical equivalent (&#248; and
&#163; in this case), but I'm wary about performing such a conversion
for each and every non-UTF-8-valid sequence I may find.

So my question is, has someone found a way to automate conversion of
these characters to their numerical equivalent without having to list
every single character? I searched for scripts and modules that might
do this, but didn't see any that jumped out at me.

Secondly, I had been doing brute-force checking for every non-UTF-8
valid sequence, and I might be doing it incorrectly. For example, if I
searched for the hex string \xA3, I was expecting to match on the £
symbol. Not so. I have to explicitly search for the £ symbol, not the
hex equivalent, because that's how it is in the text file.

To reiterate:

$line =~ s/\xA3/\&#163\;/g;
does not work when the literal symbol £ is in the text. I thought
forcing Perl to find the hex version of any character would work. I
guess I'm missing something.

Any insight would be most appreciated.

thanks very much,
Shambo
 
Alan J. Flavell

On Wed, Jun 25, Shambo inscribed on the eternal scroll:

[Oh dear, this _is_ getting to be more like some
hypothetical comp.encoding group...]
> We have a ton of text files from all over the world, oftentimes
> including invalid UTF-8 characters such as ø or £

Well, your posting was encoded in iso-8859-1, so if that's to be
taken seriously, then you haven't got utf-8. So what's the point of
trying to read it as utf-8? It doesn't even remotely resemble it
(aside from the characters that are us-ascii anyway...).
> (that was an o with
> a line thru it, a la Scandinavian letters, and a British pound
> sterling symbol).

In iso-8859-1 (or Windows-1252, not that I'd encourage that), they
would indeed be.
> When I convert these text files to XML, the
> resulting XML is not valid because it contains these characters.

This is because you're not telling XML what your character coding is.
> I can
> map individual characters to their numerical equivalent (&#248; and
> &#163; in this case),

It's a valid choice. But why the hell? If you want to represent them
in utf-8, then do so.

In Perl 5.8 you just tell the input file handle that its encoding is
iso-8859-1, and the output file handle that its encoding is utf-8, and
the job is done.
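
A minimal sketch of what I mean (the file names here are just
placeholders, not anything you've shown us):

    use strict;
    use warnings;

    # Decode iso-8859-1 bytes on input; encode utf-8 bytes on output.
    open(my $in,  '<:encoding(iso-8859-1)', 'in.txt')  or die "in.txt: $!";
    open(my $out, '>:encoding(utf-8)',      'out.xml') or die "out.xml: $!";

    while (my $line = <$in>) {
        print {$out} $line;   # the I/O layers do the transcoding
    }

    close $out or die $!;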

In earlier Perls you'd use the Encode module explicitly...
> but I'm wary about performing such a conversion
> for each and every non-UTF-8-valid sequence I may find.

Your mental model is way adrift, I'm afraid. This talk of "non utf-8
valid sequences" strikes me as a bit like counting what you've been
told is a stack of pound notes and then being surprised that the stack
doesn't contain US dollars.
> So my question is, has someone found a way to automate conversion of
> these characters to their numerical equivalent without having to list
> every single character?

Well yes, it's called an XML normaliser, and it's got nothing to do
with Perl. You'd tell it that it was getting iso-8859-1 input, and
that you wanted us-ascii output, and that's what it would do.

But why would you want to do that, when XML likes to get utf-8 anyway?

You have the choice of either delivering utf-8 as XML likes it as
default, or telling XML that it's getting iso-8859-1. Nothing to do
with Perl there, though.
 
Shambo

> Your mental model is way adrift, I'm afraid. This talk of "non utf-8
> valid sequences" strikes me as a bit like counting what you've been
> told is a stack of pound notes and then being surprised that the stack
> doesn't contain US dollars.

You're sort of correct. I am believing what I'm being told. After
checking the converted XML against the Xerces parser, it reports
errors as "invalid utf-8 sequence". When I look at the character it's
referring to, it's something along the lines of £.
> You have the choice of either delivering utf-8 as XML likes it as
> default, or telling XML that it's getting iso-8859-1. Nothing to do
> with Perl there, though.

It has everything to do with Perl since I'm using Perl to convert the
text files to XML. I'd like to take care of all my needs in this one
script instead of having to run all the files thru several steps.

I will take your advice and figure out how to tell Perl to write the
proper encoding on output.

thanks,
S
 
Shambo

File disciplines, encode_utf8 and Encode::String functions don't seem
to work. They will simply remove any character they don't like, or
replace it with a question mark.

The reason I asked about numeric equivalents (&#163;) is 'cause the
character gets properly represented when viewed in a web browser, and
the XML validates.

After MUCH education about character sets, encoding and modules, I see
why my previous post could be confusing.

Still, the problem remains. I need to preserve these characters
somehow.

many thanks for your help.
-S
 
Alan J. Flavell

> File disciplines, encode_utf8 and Encode::String functions don't seem
> to work.

That doesn't get us anywhere. Sure they work.
> They will simply remove any character they don't like, or
> replace it with a question mark.

Where's your simple test script to demonstrate that assertion?
> The reason I asked about numeric equivalents (&#163;) is 'cause the
> character gets properly represented when viewed in a web browser, and
> the XML validates.

Sure, but the reason I didn't encourage you to follow that approach
and only that approach, was that you've given no clear idea of what
material you're going to be dealing with, and that could be a very
inefficient representation, even though, as you imply (and as my
character coding checklist points out), it's the safest way for people
who don't really understand what they're doing.
> Still, the problem remains. I need to preserve these characters
> somehow.

Isn't that what we've been working at all this time?

You don't need me to tell you that you can concatenate a & with a #
with ord($_) with a ; - that's elementary stuff. But if you didn't
tell Perl what you were reading-in in the first place (maybe it's
sometimes iso-8859-2, or koi8-r, we just don't know because you're
keeping us guessing) then you'll get the wrong answer. And if you
_do_ tell Perl correctly what you got, there should be no problem with
outputting utf-8 if that's what you wanted.
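
Spelled out as a sketch (assuming, purely for the sake of argument,
that your input really is iso-8859-1, since you won't tell us):

    use strict;
    use warnings;

    open(my $in, '<:encoding(iso-8859-1)', 'in.txt') or die "in.txt: $!";
    while (my $line = <$in>) {
        # Once the input has been decoded, ord() yields the Unicode
        # code point, so every non-ASCII character becomes &#nnn;
        $line =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
        print $line;
    }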

So do you want to make any progress with this or not?
> many thanks for your help.

You don't seem to have used much of it yet, but I'm hopeful that it
might be of some use to the occasional lurkers anyway.
 
Alan J. Flavell

> You must have either told it, or at least implied, that it was to
> expect utf-8 on input.

If you're still reading this thread:

http://xml.apache.org/xerces-c/faq-parse.html#faq-20

| I keep getting an error: "invalid UTF-8 character". What's wrong?

Sounds rather applicable, doesn't it?
> As I said, you're not correctly describing the input that you're
> giving it.

The FAQ says:

| Most commonly, the XML encoding= declaration is either incorrect or
| missing. Without a declaration, XML defaults to the use of utf-8
| character encoding, which is not compatible with the default text
| file encoding on most systems.
|
| The XML declaration should look something like this:
|
| <?xml version="1.0" encoding="iso-8859-1"?>
|
| Make sure to specify the encoding that is actually used by the file.
| The encoding for "plain" text files depends both on the operating
| system and the locale (country and language) in use.

Clear?

Didn't I say that it wasn't Perl-related? _Now_ would you believe me?

FAQs are good for you: take some frequently, and especially when
the symptoms occur. (SCNR).

have fun.
 
Shambo

I guess I should start over.

When we try to validate our XML, it tells us it doesn't like
characters like £, calling them "invalid UTF-8 sequences." I thought
if I could get Perl to translate characters like that to their numeric
equivalents, the XML parser would not complain. These files will
eventually be displayed as HTML, so those characters would need to be
represented as numeric equivalents anyway.

So I was trying to identify the character set for all characters like
these, and I assumed that stuff like £ was out of the UTF-8 character
set range. I admit I was getting confused on the encoding issue.

And to answer one of your questions, I was telling Perl to output
utf8.

open(FILE, ">:outf8", "$myfile");

Using this method would simply remove any character like £, leading me
to believe something like £ is a non-UTF-8 character.

I have no idea what the input format is, and after lots of
experimentation with :latin1, :text and the like, I let it go to the
default.

I now think I'll simply have to build my own mapping table to convert
these characters to their numeric equivalent so they will validate.
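
Something like this sketch is what I have in mind (only the two
characters discussed so far; the table would grow as I find more):

    # Hand-built mapping table; iso-8859-1 code points assumed.
    my %entity = (
        "\xA3" => '&#163;',   # pound sterling
        "\xF8" => '&#248;',   # o with slash
    );

    my $class = join '', map { quotemeta } keys %entity;
    $string =~ s/([$class])/$entity{$1}/g;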
> You don't seem to have used much of it yet, but I'm hopeful that it
> might be of some use to the occasional lurkers anyway.

I'm not sure why you say that, I've been reading your replies over and
over to make sure I get what you're saying. This experience has been
very informative, and I do sincerely appreciate it.

best,
S
 
Shambo

After MUCH self-educating on encoding, XML and good old Perl, I've
gained a lot of ground. Since these XML files will ultimately be
displayed in a web browser, I realized that ASCII was the best
encoding, and all non-ASCII characters would have to be mapped to
their numeric equivalent.

I did find a module which would do exactly what I was looking for
(more on that below), but could not get it to work properly, so I've
resorted to searching for all non-ASCII characters, and mapping them
myself. Not that hard. Still will try to get those modules working.

Alan J. Flavell said:
> - convert the data to utf-8 coding before feeding it to the parser,
> since that's evidently what the parser expects by default.

This is where I was getting hung up first, not knowing really what
encoding meant, and completely missing the fact that symbols such as £
can be represented in UTF-8.
> | Unknown open() mode '>:outf8' [...]
>
> Something wrong, see?

Ouch, duh, yes I do see it. Should be "utf8" instead of "outf8".
> Did you ever confirm that you really _are_ using Perl 5.8 ?

Perl 5.8 is in use. All modules are up to date as well.
> I'm confident that Perl already has the mapping table waiting for you
> to use it, if only you'd try to focus in on the issues.

I've found this to be true with the XML::UM module. It will take an
input stream and convert what it can to ASCII. Whatever doesn't
convert to ASCII, it converts to the numeric equivalent, based on the
XML::Encoding maps.

From the XML::UM synopsis:

    # Create the encoding routine
    my $encode = XML::UM::get_encode(
        Encoding       => 'US-ASCII',
        EncodeUnmapped => \&XML::UM::encode_unmapped_dec);

    # Convert a string from UTF-8 to the specified Encoding
    my $encoded_str = $encode->($utf8_str);

However, the module seemed to have difficulty finding the paths to the
XML::Encoding maps, even tho I declared it in the script just as the
module instructed. I will continue to troubleshoot that particular
problem.
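
For the record, the declaration I mean is the package variable the
module's docs describe for locating the maps; the path below is just a
placeholder for wherever XML-Encoding actually lives on my machine:

    use XML::UM;

    # Point XML::UM at the XML::Encoding map files.
    # (Placeholder path; adjust to the real install location.)
    $XML::UM::ENCDIR = '/usr/local/lib/perl5/XML-Encoding-1.01/maps/';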
> You're just not giving us enough concrete detail here to be able to
> advise you with actual code. Can't you put a sample of your input on
> a web page or something, so that we at least know what we're talking
> about?

So the code I've resorted to using looks like:

$string =~ s/\xA3/\&#163\;/g;

which would convert a £ to its numeric equivalent. This gets past the
parser, and also allows the character to be displayed in a web
browser.

I found a vastly helpful tutorial on encoding within Perl at
http://www.xml.com/pub/a/2000/04/26/encodings/index.html. Along with
explaining lots and lots about encoding, and how to encode within
Perl, it highlights modules such as XML::DOM, XML::UM and XML::Code,
all of which seem to be able to do what I (think I) want to do.

From the XML::Code synopsis:

    This module is an experimental module, encoding various XML strings
    from UTF-8 to ASCII + Unicode entities. Everything that is not pure
    ASCII (US) is encoded as &#<nnn>;

Still trying to get these modules to work, but I at least have a
solution to work with. I do intend to get these modules working.
 
Alan J. Flavell

This is a total non-sequitur. Web browsers support a whole range of
document codings; while it's certainly a _legal_ option to represent
all characters by means of &-notation (e.g. &#163;) using nothing
more interesting than us-ascii, there is surely no _need_ to do so.
Indeed, XML is perfectly happy with utf-8, and so is any halfways
decent current web browser.
> One of the big advantages of XML is that it's completely independent
> of display format. Optimising for one presentation format might well
> make it more difficult to implement another later on.

I've no argument with that, but I don't see what relevance it has to
the above. The hon Usenaut is talking about how individual unicode
characters might be represented in source code, not about any detail
of their visual presentation.

Come to that, neither of the issues are closely on-topic for
comp.lang.perl.misc, so I won't pursue that avenue.

cheers
 
Alan J. Flavell

> So the code I've resorted to using looks like:
>
> $string =~ s/\xA3/\&#163\;/g;

You haven't addressed the question, though. Here you're showing what
you reckon to be part of a solution, but you still haven't shown us
what your input data is like.

Is it encoded in utf-8? iso-8859-1? (Windows-1252, shudder),
utf-16LE, or what?? If you won't show us, and you're not sure
yourself, it's hard to advise.
> I found a vastly helpful tutorial on encoding within Perl at
> http://www.xml.com/pub/a/2000/04/26/encodings/index.html. Along with
> explaining lots and lots about encoding, and how to encode within
> Perl,

But that's targeted at Perl 5.6, where you still had to invoke
the encoding modules explicitly. You're only making things (a bit)
more complicated for yourself by doing that, when with Perl 5.8
you can do it with the i/o encoding layers.

As the article says: both XML and Perl are quite happy to work
with unicode characters. The possible motivation for resorting to
&-notations would be when you have to tangle with non-XML applications
which might not be unicode-capable. If you have such a constraint, I
must admit I don't recall you saying so. And XML-based tools can map
between unicode characters and &-notation for you without fuss, if the
need arises.
> However, the module seemed to have difficulty finding the paths to
> the XML::Encoding maps, even tho I declared it in the script just as
> the module instructed.

I'm not personally familiar with that module, but in the 3-year-old
article that you cited, there are some notes on that very problem, did
you see?
> it highlights modules such as XML::DOM, XML::UM and XML::Code,
> all of which seem to be able to do what I (think I) want to do.
>
> From the XML::Code synopsis:
>
>     This module is an experimental module, encoding various XML strings
>     from UTF-8 to ASCII + Unicode entities. Everything that is not pure
>     ASCII (US) is encoded as &#<nnn>;

Well, if you're more comfortable with that, and can get it to work,
it's not technically wrong. I just don't think it's the way I'd want
to do it myself, and particularly with the features that 5.8 contains.

But maybe there's still features of your situation that you haven't
shown yet, that makes it a preferable approach for you.

good luck
 
