how to convert all invalid UTF-8 sequences to numeric equivalent?

Discussion in 'Perl Misc' started by Shambo, Jun 25, 2003.

  1. Shambo

    Shambo Guest

    Hey folks,

    I've been grappling with this for days, and can see no option but to
    use brute force.

    We have a ton of text files from all over the world, oftentimes
    including invalid UTF-8 characters such as ø or £ (that was an o with
    a line thru it, a la Scandinavian letters, and a British pound
    sterling symbol). When I convert these text files to XML, the
    resulting XML is not valid because it contains these characters. I can
    map individual characters to their numerical equivalent (&#248; and
    &#163; in this case), but I'm wary about performing such a conversion
    for each and every non-UTF-8-valid sequence I may find.

    So my question is, has someone found a way to automate conversion of
    these characters to their numerical equivalents without having to list
    every single character? I searched for scripts and modules that might
    do this, but didn't see any that jumped out at me.

    Secondly, I had been doing brute-force checking for every non-UTF-8
    valid sequence, and I might be doing it incorrectly. For example, if I
    searched for the hex string \xA3, I was expecting to match on the £
    symbol. Not so. I have to explicitly search for the £ symbol, not the
    hex equivalent, because that's how it is in the text file.

    To re-iterate:

    $line =~ s/\xA3/\&#163\;/g;
    does not work when the literal symbol £ is in the text. I thought
    forcing Perl to find the hex version of any character would work. I
    guess I'm missing something.

    Any insight would be most appreciated.

    thanks very much,
    Shambo
    Shambo, Jun 25, 2003
    #1

  2. On Wed, Jun 25, Shambo inscribed on the eternal scroll:

    [Oh dear, this _is_ getting to be more like some
    hypothetical comp.encoding group...]

    > We have a ton of text files from all over the world, oftentimes
    > including invalid UTF-8 characters such as ø or £


    Well, your posting was encoded in iso-8859-1, so if that's to be
    taken seriously, then you haven't got utf-8. So what's the point of
    trying to read it as utf-8? It doesn't even remotely resemble it
    (aside from the characters that are us-ascii anyway...).

    > (that was an o with
    > a line thru it, a la Scandinavian letters, and a British pound
    > sterling symbol).


    In iso-8859-1 (or Windows-1252, not that I'd encourage that), they
    would indeed be.

    > When I convert these text files to XML, the
    > resulting XML is not valid because it contains these characters.


    This is because you're not telling XML what your character coding is.

    > I can
    > map individual characters to their numerical equivalent (&#248; and
    > &#163; in this case),


    It's a valid choice. But why the hell? If you want to represent them
    in utf-8, then do so.

    In Perl 5.8 you just tell the input file handle that its encoding is
    iso-8859-1, and the output file handle that its encoding is utf-8, and
    the job is done.

    In earlier Perls you'd use the Encode module explicitly...
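    For instance (an untested sketch; the file names are made up, and it
    assumes the input really is iso-8859-1):

```perl
use strict;
use warnings;

# Copy a file, decoding iso-8859-1 on input and encoding utf-8 on
# output; Perl transcodes each line as it passes through the layers.
sub transcode {
    my ($infile, $outfile) = @_;
    open my $in,  '<:encoding(iso-8859-1)', $infile
        or die "can't read $infile: $!";
    open my $out, '>:encoding(utf-8)', $outfile
        or die "can't write $outfile: $!";
    print {$out} $_ while <$in>;
    close $in;
    close $out or die "can't close $outfile: $!";
}

# transcode('input.txt', 'output.xml');
```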

    > but I'm wary about performing such a conversion
    > for each and every non UTF-8 valid sequence I may find.


    Your mental model is way adrift, I'm afraid. This talk of "non utf-8
    valid sequences" strikes me as a bit like counting what you've been
    told is a stack of pound notes and then being surprised that the stack
    doesn't contain US dollars.

    > So my question is, has someone found a way to automate conversion of
    > these characters to their numerical equivalents without having to list
    > every single character?


    Well yes, it's called an XML normaliser, and it's got nothing to do
    with Perl. You'd tell it that it was getting iso-8859-1 input, and
    that you wanted us-ascii output, and that's what it would do.
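    As it happens, the Encode module that ships with Perl 5.8 can play
    the normaliser too: its FB_XMLCREF fallback turns any character the
    target encoding can't hold into an XML character reference. A rough
    sketch (the sub name is made up, and it assumes iso-8859-1 input):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode iso-8859-1 bytes to characters, then re-encode as us-ascii;
# anything outside ascii falls back to a hex character reference.
sub latin1_to_ascii_xml {
    my ($bytes) = @_;
    return encode('ascii', decode('iso-8859-1', $bytes),
                  Encode::FB_XMLCREF);
}

# latin1_to_ascii_xml("\xA3 sterling") produces something like
# "&#xa3; sterling"
```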

    But why would you want to do that, when XML likes to get utf-8 anyway?

    You have the choice of either delivering utf-8 as XML likes it as
    default, or telling XML that it's getting iso-8859-1. Nothing to do
    with Perl there, though.
    Alan J. Flavell, Jun 25, 2003
    #2

  3. Shambo

    Shambo Guest

    > Your mental model is way adrift, I'm afraid. This talk of "non utf-8
    > valid sequences" strikes me as a bit like counting what you've been
    > told is a stack of pound notes and then being surprised that the stack
    > doesn't contain US dollars.


    You're sort of correct. I am believing what I'm being told. After
    checking the converted XML against the Xerces parser, it reports
    errors as "invalid utf-8 sequence". When I look at the character it's
    referring to, it's something along the lines of £.

    > You have the choice of either delivering utf-8 as XML likes it as
    > default, or telling XML that it's getting iso-8859-1. Nothing to do
    > with Perl there, though.


    It has everything to do with Perl since I'm using Perl to convert the
    text files to XML. I'd like to take care of all my needs in this one
    script instead of having to run all the files thru several steps.

    I will take your advice and figure out how to tell Perl to write the
    proper encoding on output.

    thanks,
    S
    Shambo, Jun 26, 2003
    #3
  4. Shambo

    Shambo Guest

    File disciplines, encode_utf8 and Encode::String functions don't seem
    to work. They will simply remove any character they don't like, or
    replace it with a question mark.

    The reason I asked about numeric equivalents (&#163;) is 'cause the
    character gets properly represented when viewed in a web browser, and
    the XML validates.

    After MUCH education about character sets, encoding and modules, I see
    why my previous post could be confusing.

    Still, the problem remains. I need to preserve these characters
    somehow.

    many thanks for your help.
    -S
    Shambo, Jun 26, 2003
    #4
  5. On Thu, Jun 26, Shambo inscribed on the eternal scroll:

    > File disciplines, encode_utf8 and Encode::String functions don't seem
    > to work.


    That doesn't get us anywhere. Sure they work.

    > They will simply remove any character they don't like, or
    > replace it with a question mark.


    Where's your simple test script to demonstrate that assertion?

    > The reason I asked about numeric equivalents (&#163;) is 'cause the
    > character gets properly represented when viewed in a web browser, and
    > the XML validates.


    Sure, but the reason I didn't encourage you to follow that approach
    and only that approach, was that you've given no clear idea of what
    material you're going to be dealing with, and that could be a very
    inefficient representation, even though, as you imply (and as my
    character coding checklist points out), it's the safest way for people
    who don't really understand what they're doing.

    > Still, the problem remains. I need to preserve these characters
    > somehow.


    Isn't that what we've been working at all this time?

    You don't need me to tell you that you can concatenate a & with a #
    with ord($_) with a ; - that's elementary stuff. But if you didn't
    tell Perl what you were reading-in in the first place (maybe it's
    sometimes iso-8859-2, or koi8-r, we just don't know because you're
    keeping us guessing) then you'll get the wrong answer. And if you
    _do_ tell Perl correctly what you got, there should be no problem with
    outputting utf-8 if that's what you wanted.
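    Spelled out, that elementary construction looks like this (untested
    sketch; it assumes the string already holds correctly decoded
    characters):

```perl
use strict;
use warnings;

# Replace every non-ascii *character* with a decimal character
# reference; this only gives the right answer if Perl decoded the
# input correctly in the first place.
sub to_numeric_refs {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7f])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# to_numeric_refs("\x{A3}5 off") returns "&#163;5 off"
```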

    So do you want to make any progress with this or not?

    > many thanks for your help.


    You don't seem to have used much of it yet, but I'm hopeful that it
    might be of some use to the occasional lurkers anyway.
    Alan J. Flavell, Jun 26, 2003
    #5
  6. On Thu, Jun 26, Alan J. Flavell inscribed on the eternal scroll:

    > > You're sort of correct. I am believing what I'm being told. After
    > > checking the converted XML against the Xerces parser, it reports
    > > errors as "invalid utf-8 sequence".

    >
    > You must have either told it, or at least implied, that it was to
    > expect utf-8 on input.


    If you're still reading this thread:

    http://xml.apache.org/xerces-c/faq-parse.html#faq-20

    | I keep getting an error: "invalid UTF-8 character". What's wrong?

    Sounds rather applicable, doesn't it?

    > > When I look at the character it's
    > > referring to, it's something along the lines of £.

    >
    > As I said, you're not correctly describing the input that you're
    > giving it.


    The FAQ says:

    Most commonly, the XML encoding declaration is either incorrect or
    missing. Without a declaration, XML defaults to the UTF-8 character
    encoding, which is not compatible with the default text
    file encoding on most systems.

    The XML declaration should look something like this:

    <?xml version="1.0" encoding="iso-8859-1"?>

    Make sure to specify the encoding that is actually used by the file. The
    encoding for "plain" text files depends both on the operating system
    and the locale (country and language) in use.

    Clear?

    > > It has everything to do with Perl since I'm using Perl to convert the
    > > text files to XML.


    Didn't I say that it wasn't Perl-related? _Now_ would you believe me?

    FAQs are good for you: take some frequently, and especially when
    the symptoms occur. (SCNR).

    have fun.
    Alan J. Flavell, Jun 27, 2003
    #6
  7. Shambo

    Shambo Guest

    I guess I should start over.

    When we try to validate our XML, it tells us it doesn't like
    characters like £, calling them "invalid UTF-8 sequences." I thought
    if I could get Perl to translate characters like that to numeric
    equivalent, the XML parser would not complain. These files will
    eventually be displayed as HTML, so those characters would need to be
    represented as numeric equivalent anyway.

    So I was trying to identify the character set for all characters like
    these, and I assumed that stuff like £ was out of the UTF-8 character
    set range. I admit I was getting confused on the encoding issue.

    And to answer one of your questions, I was telling Perl to output
    utf8.

    open(FILE, ">:outf8", "$myfile");

    Using this method would simply remove any character like £, leading me
    to believe something like £ is a non-UTF-8 character.

    I have no idea what the input format is, and after lots of
    experimentation with :latin1, :text and the like, I let it go to the
    default.

    I now think I'll simply have to build my own mapping table to convert
    these characters to their numeric equivalents so they will validate.

    >>thanks for all your help.

    > You don't seem to have used much of it yet, but I'm hopeful that it
    > might be of some use to the occasional lurkers anyway.


    I'm not sure why you say that, I've been reading your replies over and
    over to make sure I get what you're saying. This experience has been
    very informative, and I do sincerely appreciate it.

    best,
    S
    Shambo, Jun 27, 2003
    #7
  8. Shambo

    Shambo Guest

    After MUCH self-educating on encoding, XML and good old Perl, I've
    gained a lot of ground. Since these XML files will ultimately be
    displayed in a web browser, I realized that ASCII was the best
    encoding, and all non-ASCII characters would have to be mapped to
    their numeric equivalent.

    I did find a module which would do exactly what I was looking for
    (more on that below), but could not get it to work properly, so I've
    resorted to searching for all non-ASCII characters, and mapping them
    myself. Not that hard. Still will try to get those modules working.

    "Alan J. Flavell" <> wrote in message
    > - convert the data to utf-8 coding before feeding it to the parser,
    > since that's evidently what the parser expects by default.


    This is where I was getting hung up first, not knowing really what
    encoding meant, and completely missing the fact that symbols such as £
    can be represented in UTF-8.

    > | Unknown open() mode '>:outf8' [...]
    >
    > Something wrong, see?


    Ouch, duh, yes I do see it. Should be "utf8" instead of "outf8".

    > Did you ever confirm that you really _are_ using Perl 5.8 ?


    Perl 5.8 is in use. All modules are up to date as well.

    > I'm confident that Perl already has the mapping table waiting for you
    > to use it, if only you'd try to focus in on the issues.


    I've found this to be true with the XML::UM module. It will take an
    input stream and convert what it can to ASCII. Whatever doesn't
    convert to ASCII, it converts to the numeric equivalent, based on the
    XML::Encoding maps.

    From the XML::UM synopsis:
        # Create the encoding routine
        my $encode = XML::UM::get_encode(
            Encoding       => 'US-ASCII',
            EncodeUnmapped => \&XML::UM::encode_unmapped_dec,
        );

        # Convert a string from UTF-8 to the specified Encoding
        my $encoded_str = $encode->($utf8_str);

    However, the module seemed to have difficulty finding the paths to the
    XML::Encoding maps, even though I declared it in the script just as the
    module instructed. I will continue to troubleshoot that particular
    problem.

    > You're just not giving us enough concrete detail here to be able to
    > advise you with actual code. Can't you put a sample of your input on
    > a web page or something, so that we at least know what we're talking
    > about?


    So the code I've resorted to using looks like:

    $string =~ s/\xA3/\&#163\;/g;

    which would convert a £ to its numeric equivalent. This gets past the
    parser, and also allows the character to be displayed in a web
    browser.

    I found a vastly helpful tutorial on encoding within Perl at
    http://www.xml.com/pub/a/2000/04/26/encodings/index.html. Along with
    explaining lots and lots about encoding, and how to encode within
    Perl, it highlights modules such as XML::DOM, XML::UM and XML::Code,
    all of which seem to be able to do what I (think I) want to do.

    From the XML::Code synopsis:
    This module is an experimental module, encoding various XML strings
    from UTF-8 to ASCII + Unicode entities. Everything that is not pure
    ASCII (US) is encoded as &#<nnn>;

    Still trying to get these modules to work, but I at least have a
    solution to work with. I do intend to get these modules working.
    Shambo, Jul 9, 2003
    #8
  9. On Wed, Jul 9, Eric Schwartz inscribed on the eternal scroll:

    > (Shambo) writes:
    > > After MUCH self-educating on encoding, XML and good old Perl, I've
    > > gained a lot of ground. Since these XML files will ultimately be
    > > displayed in a web browser, I realized that ASCII was the best
    > > encoding, and all non-ASCII characters would have to be mapped to
    > > their numeric equivalent.


    This is a total non-sequitur. Web browsers support a whole range of
    document codings; while it's certainly a _legal_ option to represent
    all characters by means of &-notation (e.g ) using nothing
    more interesting than us-ascii, there is surely no _need_ to do so.
    Indeed, XML is perfectly happy with utf-8, and so is any halfways
    decent current web browser.

    > One of the big advantages of XML is that it's completely independent
    > of display format. Optimising for one presentation format might well
    > make it more difficult to implement another later on.


    I've no argument with that, but I don't see what relevance it has to
    the above. The hon Usenaut is talking about how individual unicode
    characters might be represented in source code, not about any detail
    of their visual presentation.

    Come to that, neither of the issues are closely on-topic for
    comp.lang.perl.misc, so I won't pursue that avenue.

    cheers
    Alan J. Flavell, Jul 9, 2003
    #9
  10. On Wed, Jul 9, Shambo inscribed on the eternal scroll:

    > > You're just not giving us enough concrete detail here to be able to
    > > advise you with actual code. Can't you put a sample of your input on
    > > a web page or something, so that we at least know what we're talking
    > > about?

    >
    > So the code I've resorted to using looks like:
    >
    > $string =~ s/\xA3/\&#163\;/g;


    You haven't addressed the question, though. Here you're showing what
    you reckon to be part of a solution, but you still haven't shown us
    what your input data is like.

    Is it encoded in utf-8 ? iso-8859-1 ? (Windows-1252, shudder),
    utf-16LE or what?? If you won't show us, and you're not sure
    yourself, it's hard to advise.

    > I found a vastly helpful tutorial on encoding within Perl at
    > http://www.xml.com/pub/a/2000/04/26/encodings/index.html. Along with
    > explaining lots and lots about encoding, and how to encode within
    > Perl,


    But that's targeted at Perl 5.6, where you still had to invoke
    the encoding modules explicitly. You're only making things (a bit)
    more complicated for yourself by doing that, when with Perl 5.8
    you can do it with the i/o encoding layers.
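    To see the 5.8 layers at work without even touching disk, you can
    open handles on in-memory scalars (the sample bytes here are
    hypothetical, assumed to be iso-8859-1):

```perl
use strict;
use warnings;

# Latin-1 bytes in, utf-8 bytes out, purely via I/O layers on
# in-memory handles (a Perl 5.8 feature).
my $latin1_bytes = "price \xA3 5\n";   # \xA3 is the pound sign in iso-8859-1
my $utf8_bytes   = '';

open my $in,  '<:encoding(iso-8859-1)', \$latin1_bytes or die $!;
open my $out, '>:encoding(utf-8)',      \$utf8_bytes   or die $!;
print {$out} $_ while <$in>;
close $in;
close $out;

# $utf8_bytes now holds "price \xC2\xA3 5\n"
```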

    As the article says: both XML and Perl are quite happy to work
    with unicode characters. The possible motivation for resorting to
    &-notations would be when you have to tangle with non-XML applications
    which might not be unicode-capable. If you have such a constraint, I
    must admit I don't recall you saying so. And XML-based tools can map
    between unicode characters and &-notation for you without fuss, if the
    need arises.

    > However, the module seemed to have difficulty finding the paths to
    > the XML::Encoding maps, even tho I declared it in the script just as
    > the module instructed.


    I'm not personally familiar with that module, but in the 3-year-old
    article that you cited, there are some notes on that very problem, did
    you see?

    > it highlights modules such as XML::DOM, XML::UM and XML::Code,
    > all of which seem to be able to do what I (think I) want to do.
    >
    > From the XML::Code synopsis:
    > This module is an experimental module, encoding various XML strings
    > from UTF-8 to ASCII + Unicode entities. Everything that is not pure
    > ASCII (US) is encoded as &#<nnn>;


    Well, if you're more comfortable with that, and can get it to work,
    it's not technically wrong. I just don't think it's the way I'd want
    to do it myself, and particularly with the features that 5.8 contains.

    But maybe there's still features of your situation that you haven't
    shown yet, that makes it a preferable approach for you.

    good luck
    Alan J. Flavell, Jul 9, 2003
    #10
