Shambo
Hey folks,
I've been grappling with this for days, and can see no option but to
use brute force.
We have a ton of text files from all over the world, often
including characters that are not valid UTF-8, such as ø or £ (that was an o with
a line through it, a la Scandinavian letters, and a British pound
sterling symbol). When I convert these text files to XML, the
resulting XML is not valid because it contains these characters. I can
map individual characters to their numeric equivalents (&#248; and
&#163; in this case), but I'm wary about performing such a conversion
for each and every invalid UTF-8 sequence I may find.
So my question is, has someone found a way to automate conversion of
these characters to their numeric equivalents without having to list
every single character? I searched for scripts and modules that might
do this, but didn't see any that jumped out at me.
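To make the question concrete, here is a minimal sketch of the kind of automation being asked about. It assumes the input bytes are ISO-8859-1 (Latin-1), which may well not hold for every file, and the helper name to_numeric_refs is hypothetical. It decodes the bytes with the core Encode module and then replaces every non-ASCII character with its decimal numeric character reference, so no per-character table is needed.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn raw bytes into ASCII-only text in which every
# non-ASCII character is written as a numeric character reference.
# ASSUMPTION: the bytes are ISO-8859-1 unless another encoding is passed in.
sub to_numeric_refs {
    my ($bytes, $encoding) = @_;
    $encoding //= 'ISO-8859-1';

    # Bytes -> Perl characters, so ord() yields Unicode code points.
    my $text = decode($encoding, $bytes);

    # Replace each character above 0x7F with &#NNN; (decimal code point).
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

print to_numeric_refs("price: \xA3 5, s\xF8n"), "\n";
# prints: price: &#163; 5, s&#248;n
```

This only handles non-ASCII characters; the XML-special characters &, < and > would still need escaping separately, and a file in some other encoding would need the matching encoding name passed as the second argument.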
Secondly, I had been doing brute-force checking for every invalid UTF-8
sequence, and I might be doing it incorrectly. For example, if I
searched for the hex string \xA3, I was expecting to match on the £
symbol. Not so. I have to explicitly search for the £ symbol, not the
hex equivalent, because that's how it is in the text file.
To re-iterate:
$line =~ s/\xA3/\&#163\;/g;
does not work when the literal symbol £ is in the text. I thought
forcing Perl to find the hex version of any character would work. I
guess I'm missing something.
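One way to see what is actually in the file, sketched here as a diagnostic and not as a fix, is to dump the raw bytes of a line in hex. The helper name hex_bytes is hypothetical. The two sample strings show how the same pound sign looks under two common encodings: a single byte A3 in ISO-8859-1, but the two bytes C2 A3 in UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: show each byte of a raw string as two hex digits.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# A pound sign stored as ISO-8859-1 is one byte.
print hex_bytes("\xA3"), "\n";        # prints: A3

# The same pound sign stored as UTF-8 is two bytes.
print hex_bytes("\xC2\xA3"), "\n";    # prints: C2 A3
```

Running this on a line read from a filehandle opened with the '<:raw' layer would show whether the byte A3 is really present, or whether the file uses some other encoding in which the pound sign has a different byte value.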
Any insight would be most appreciated.
Thanks very much,
Shambo