Ruby method to strip out XML codes?

Discussion in 'Ruby' started by Michael W. Ryder, Dec 6, 2007.

  1. I am trying to process an XML file that includes various codes. The
    problem I am running into is that some of these codes are inserted into
    the middle of an encrypted string. If I display the file in a
    browser, these codes do not show up, and copying and pasting the string
    works fine. The problem occurs when I try to extract the string in a
    program and these "extraneous" XML codes are included. This, of course,
    makes the decryption routine crash.
    What I am looking for is a simple way to read through the file and
    remove all the XML codes, leaving just plain text. I could probably
    write a series of regular expressions to remove each code that I can
    find in my text, but I am afraid I might miss some and it will come back to
    haunt me later.
    Michael W. Ryder, Dec 6, 2007

  2. Phrogz Guest

    On Dec 5, 6:13 pm, "Michael W. Ryder" <>
    wrote:
    > I am trying to process an XML file that includes various codes. The
    > problem I am running into is that some of these codes are inserted into
    > the middle of an encrypted string. If I display the file using a
    > browser these codes do not show up and copying and pasting the string
    > work fine. The problem occurs when I try to strip out the string in a
    > program and these "extraneous" XML codes are included. This of course
    > makes the decryption routine crash.
    > What I am looking for is a simple way to read through the file and
    > remove all the XML codes leaving just plain text. I could probably
    > write a series of regular expressions to remove each code that I can
    > find in my text but am afraid I might miss some and it will come back to
    > haunt me at a later time.


    str.gsub(/<\/?[^>]+>/, '')

    This will only be a problem if your XML file is legal and has a CDATA
    section which has a literal < character (not &lt;), like:

    for ( var i=0, len=a.length; i<len; ++i )

    In that case you likely want a proper XML parser (like REXML) and to
    use it.
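
    To make the CDATA caveat concrete, here is a minimal sketch that runs the
    tag-stripping regex and a REXML traversal over the same input. The helper
    names (strip_tags_naive, collect_text, text_via_rexml) and the sample
    document are made up for illustration, and it assumes the rexml library
    is available (it ships with standard Ruby installs as a bundled gem).

```ruby
require 'rexml/document'

# Hypothetical sample: a CDATA section containing a literal '<'.
SAMPLE = '<script><![CDATA[for ( var i=0, len=a.length; i<len; ++i )]]></script>'

# Naive approach: delete anything that looks like a tag.
def strip_tags_naive(str)
  str.gsub(/<\/?[^>]+>/, '')
end

# Walk the parsed tree and join the value of every text node.
# REXML::CData is a subclass of REXML::Text, so CDATA content is included.
def collect_text(node)
  node.children.map do |child|
    case child
    when REXML::Text    then child.value
    when REXML::Element then collect_text(child)
    else ''             # comments, declarations, and so on
    end
  end.join
end

# Parser-based approach: let REXML decide what is markup and what is text.
def text_via_rexml(xml)
  collect_text(REXML::Document.new(xml))
end

puts strip_tags_naive(SAMPLE).inspect
puts text_via_rexml(SAMPLE).inspect
```

    On this input the regex treats everything from "<![CDATA[" up to the
    closing "]]>" as one tag, so the loop text is discarded along with the
    markup, while the parser-based version keeps it.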

    Do you really want to remove the XML, or would it suffice to just:

    str.gsub! '&', '&amp;'
    str.gsub! '<', '&lt;'
    str.gsub! '>', '&gt;'
    (and maybe even)
    str.gsub! '"', '&quot;'
    str.gsub! "'", '&apos;'

    to make your string valid and escaped for use in an HTML context?
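
    The same escaping can be sketched as a single pass with a lookup table,
    which avoids depending on the order of the substitutions (with separate
    calls, '&' has to be replaced first or the later entities get escaped
    twice). The constant and method names here are hypothetical, and unlike
    gsub! this version returns a new string rather than modifying str.

```ruby
# Characters to escape and their entity replacements.
MARKUP_ESCAPES = {
  '&' => '&amp;',
  '<' => '&lt;',
  '>' => '&gt;',
  '"' => '&quot;',
  "'" => '&apos;'
}.freeze

# Replace each special character in one pass; gsub looks up every
# match in the hash, so nothing is escaped twice.
def escape_markup(str)
  str.gsub(/[&<>"']/, MARKUP_ESCAPES)
end

puts escape_markup(%q{a < b & "c"})
```

    One thing to keep in mind with the gsub! form: it returns nil when
    nothing was replaced, so it should not be chained.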
    Phrogz, Dec 6, 2007

  3. Phrogz wrote:
    > On Dec 5, 6:13 pm, "Michael W. Ryder" <>
    > wrote:
    >> I am trying to process an XML file that includes various codes. The
    >> problem I am running into is that some of these codes are inserted into
    >> the middle of an encrypted string. If I display the file using a
    >> browser these codes do not show up and copying and pasting the string
    >> work fine. The problem occurs when I try to strip out the string in a
    >> program and these "extraneous" XML codes are included. This of course
    >> makes the decryption routine crash.
    >> What I am looking for is a simple way to read through the file and
    >> remove all the XML codes leaving just plain text. I could probably
    >> write a series of regular expressions to remove each code that I can
    >> find in my text but am afraid I might miss some and it will come back to
    >> haunt me at a later time.

    >
    > str.gsub(/<\/?[^>]+>/, '')
    >
    > This will only be a problem if your XML file is legal and has a CDATA
    > section which has a literal < character (not &lt;), like:
    >
    > for ( var i=0, len=a.length; i<len; ++i )
    >
    > In that case you likely want a proper XML parser (like REXML) and to
    > use it.
    >
    > Do you really want to remove the XML, or would it suffice to just:
    >
    > str.gsub! '&', '&amp;'
    > str.gsub! '<', '&lt;'
    > str.gsub! '>', '&gt;'
    > (and maybe even)
    > str.gsub! '"', '&quot;'
    > str.gsub! "'", '&apos;'
    >
    > to make your string valid and escaped for use in an HTML context?


    My problem is that the XML file includes a code string in the middle of a
    couple of fields, especially in the encrypted fields. If I just extract
    the encrypted field and try to decrypt it, the program crashes because the
    key is invalid. I have to remove the "bad" character strings before
    sending it to my decryption program. I would prefer to do this removal
    before sending the file to my programs so that I don't have to deal with
    these codes.
    I assume that the string I am seeing is XML's way of saying CR/LF, as 0D 0A
    in hex is CR/LF and the output in a browser shows the field being broken
    at that point. The problem is that these are only the ones I have noticed,
    and there may be others hiding in the data. The XML file is being
    parsed for conversion to our accounts.
    Michael W. Ryder, Dec 6, 2007