Finding and replacing Invalid Tokens in an XML document

Discussion in 'Perl' started by Ben Holness, Jan 6, 2006.

  1. Ben Holness

    Hi all,

    I have a system which allows users to enter a message on a (PHP) website.
    This message is then put into a (MySQL) Database.

    A Perl script then picks up the message and creates an XML document.

    The webpages, database and XML are all UTF-8; however, every now and then I
    get an error from the XML parser telling me I have an invalid token. This
    occurs when the message contains particular characters, although I don't
    know which characters: all I can see in the logs is the ANSI
    representation (e.g. @^C). If I copy and paste it into Word, I get a square
    box after the @ that takes two right-cursor presses to get past.
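
    One way to see which characters are involved is to scan the message and print the code point of anything XML 1.0 does not allow (only tab, LF and CR are permitted below U+0020). The following is a minimal sketch under assumptions, not part of the original script: describe_invalid_xml_chars is a hypothetical helper name, and it assumes the text is a decoded character string, i.e. it is called before encode().

```perl
use strict;
use warnings;

# Hypothetical helper: report every character that falls outside the
# XML 1.0 "Char" production, with its offset and code point.
# Expects a decoded (character) string, not UTF-8 bytes.
sub describe_invalid_xml_chars {
    my ($text) = @_;
    my @found;
    while ($text =~ /([^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}])/g) {
        # pos() points just past the match, so step back one character
        push @found, sprintf('offset %d: U+%04X', pos($text) - 1, ord $1);
    }
    return @found;
}

# "@" followed by Ctrl-C (U+0003), as in the log excerpt "@^C"
print "$_\n" for describe_invalid_xml_chars("user\@\x{03}example");
```

    For the sample string this reports the control character at offset 5 as U+0003, which is what a log shows as ^C.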

    My script catches that there is an invalid token, but rather than fail the
    message completely, I would like to replace the bad characters with a
    space. Is there a simple way to find these characters, or do I have to
    write a function that looks at the output of $@ from the eval and works out
    where the character is from the line/column/byte information in order to
    fix it?
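
    One possible approach, sketched here under assumptions rather than taken from the original script, is to clean the text before the XML is built instead of locating the character from the parser error. The hypothetical helper sanitize_for_xml below replaces every character outside the XML 1.0 "Char" production with a space; it assumes the message is a decoded character string, so it would run before encode().

```perl
use strict;
use warnings;

# Hypothetical helper: replace each character that XML 1.0 forbids
# with a single space. Tab, LF and CR are kept; other C0 controls,
# surrogates, U+FFFE and U+FFFF are replaced.
# Expects a decoded (character) string, not UTF-8 bytes.
sub sanitize_for_xml {
    my ($text) = @_;
    $text =~ s/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/ /g;
    return $text;
}

my $clean = sanitize_for_xml("Hello\x{03}World");
print "$clean\n";    # prints "Hello World"
```

    In the snippet at the end of this post, that would mean passing sanitize_for_xml($MessageText) to encode() rather than $MessageText itself. Legitimate non-ASCII text is left untouched, since only code points that XML 1.0 excludes are replaced.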

    FYI, the XML is created and parsed with XML::Simple and UTF-8 encoded with
    encode. I have included a simplified snippet (written into this post, so
    may contain typos) at the end of the post.

    Cheers,

    Ben

    -- Snippet of Code --

    use Encode qw(encode);
    use XML::Simple;

    # $MessageText is pulled from the database and may contain bad characters.

    # Build a hash of the elements
    my %arr;
    $arr{'Message'} = encode("UTF-8", $MessageText);

    # Convert the hash into an XML document with XMLout
    my $tempxml = XML::Simple->new(NoAttr => 1, RootName => 'WebMessage');
    my $xmldoc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
    $xmldoc .= $tempxml->XMLout(\%arr);

    # Parse the XML document
    my $tempxml2 = XML::Simple->new(ForceArray => 1);
    eval { $tempxml2->XMLin($xmldoc); };
    if ($@)
    {
        # An error occurred. Usually an invalid token due to a bad character
        # in $MessageText
    }
