Whitespace-preserving Search & Replace in multiple XML documents

Discussion in 'XML' started by François Robert, Jul 18, 2005.

  1. Dear Newsgroup,

    I am looking for a way to search and replace some strings inside various
    XML documents while at the same time binary-preserving all the
    whitespace of each document (in particular the line ending convention,
    the white space both *inside* the markup and inside the content).

    So far, this sounds more like a plain text search-and-replace, but the
    twist is that the strings should only be replaced if they match a
    certain XML context (say: replace attribute name "jarfile" in any
    element <jar> with attribute name "destfile", or change the entire
    content of element <value>, but only when <value> immediately
    follows an element <key> with a content of "OutputFile" etc...)
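
    The first case (renaming an attribute only inside <jar> tags) can be
    sketched without a parser at all. This is only an illustration, not a
    general solution: the two regular expressions are assumptions of this
    sketch, and they ignore comments, CDATA sections and any '>' or
    ' jarfile=' text that might appear inside an attribute value. Working
    on bytes keeps line endings and in-tag whitespace untouched.

```python
import re

# Hypothetical pattern: a <jar ...> start tag or empty-element tag.
# Assumes no '>' inside attribute values; comments and CDATA are not handled.
JAR_TAG = re.compile(rb"<jar(?=[\s/>])[^>]*>")
# The attribute name, only when preceded by whitespace and followed by '='.
JARFILE_ATTR = re.compile(rb"(?<=\s)jarfile(?=\s*=)")


def rename_jarfile(data: bytes) -> bytes:
    """Rename jarfile -> destfile inside <jar> tags; copy every other byte as-is."""
    return JAR_TAG.sub(
        lambda m: JARFILE_ATTR.sub(b"destfile", m.group(0)),
        data,
    )


sample = (
    b'<project>\r\n'
    b'  <jar\tjarfile = "a.jar"\r\n'
    b'       basedir="x"/>\r\n'
    b'  <zip jarfile="b"/>\r\n'
    b'</project>\r\n'
)
result = rename_jarfile(sample)
```

    Only the <jar> tag changes; the <zip> element, the CRLF line endings and
    the tab inside the tag come through byte for byte. When applying this to
    files, reading and writing in binary mode avoids any newline translation.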

    I don't even know if my problem has a "canonical" name, which pretty
    much precludes a meaningful search on Google...

    I know XSLT can do some (all ?) of that, but :
    a) These substitutions need to occur on many different XML files and the
    XML contexts / search strings may differ from file to file, so I will
    need many different stylesheets. (which could be generated
    automatically, I suppose)
    b) What guarantee do I have on binary-preservation of all whitespace ?
    (BTW this "weird" requirement arises from the need to keep the ability
    to make plain textual diff of those XML documents which are stored
    inside a source control system)

    I have also looked at SAX parsers, thinking that maybe I could rely on
    event notifications, but it seems that the events are not granular
    enough for my situation (eg : AFAICT, no notification will tell that I
    have encountered a block of contiguous whitespace inside an element tag
    and how such a block is made up, for instance 3 SPC + LF + LF + TAB + TAB).
    Also, the SAX parser does not seem to be able to tell me the exact
    'slices' of input characters that it identified as element name,
    attribute name, attribute value, whitespace, entity reference, etc...
    AFAICT, SAX will not tell me the difference between 'attr="!"' and
    'attr="!"' ?
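
    The loss of detail described above is easy to reproduce with Python's
    standard xml.sax module. In this small sketch the second input is an
    assumption chosen for illustration: it spells the same attribute value
    as a character reference (&#33;) and adds whitespace inside the tag.
    Both inputs produce identical events, so the handler cannot tell them
    apart.

```python
import xml.sax


class Recorder(xml.sax.ContentHandler):
    """Collects (element name, attributes) pairs as SAX reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append((name, dict(attrs.items())))


def sax_events(doc: bytes):
    handler = Recorder()
    xml.sax.parseString(doc, handler)
    return handler.events


plain = b'<e attr="!"/>'
# Assumed variant: same value as a character reference, plus in-tag whitespace.
messy = b'<e \n\t attr = "&#33;" />'

same = sax_events(plain) == sax_events(messy)
```

    Here same ends up True: the in-tag whitespace and the original spelling
    of the attribute value never reach the application.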

    Pointers, suggestions & comments appreciated.
    Regards
    _______________________________________________________
    François Robert
    (to mail me, reverse character order in reply address)
    François Robert, Jul 18, 2005
    #1

  2. François Robert wrote:
    > Pointers, suggestions & comments appreciated.
    > Regards


    This might be useful for you:
    http://www.devx.com/xml/Article/22219/0/page/1
    "VTD-XML retains the XML message in memory intact and un-decoded, and
    represents tokens using starting offset and length exclusively."
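
    The idea in that quote (leave the buffer intact, describe tokens by
    offset and length) can be sketched as follows. This is not the VTD-XML
    API: the Token type, the lexer and the helper names are all hypothetical,
    and the lexer is deliberately naive (no comments, no CDATA, no '>' inside
    attribute values).

```python
import re
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    kind: str    # "tag" or "text"
    offset: int  # start position in the original buffer
    length: int  # number of bytes covered


# Naive lexer: a tag is '<' ... '>', anything else is text.
LEXER = re.compile(rb"<[^>]*>|[^<]+")


def index(buf: bytes) -> List[Token]:
    """Describe the buffer as offset/length records without copying content."""
    return [
        Token("tag" if m.group(0).startswith(b"<") else "text",
              m.start(), m.end() - m.start())
        for m in LEXER.finditer(buf)
    ]


def raw(buf: bytes, tok: Token) -> bytes:
    return buf[tok.offset:tok.offset + tok.length]


def splice(buf: bytes, edits: List[Tuple[Token, bytes]]) -> bytes:
    """Apply replacements back to front so earlier offsets stay valid."""
    out = bytearray(buf)
    for tok, new in sorted(edits, key=lambda e: e[0].offset, reverse=True):
        out[tok.offset:tok.offset + tok.length] = new
    return bytes(out)


buf = b"<a>\r\n\t<b>old</b>  <c>old</c>\r\n</a>"
tokens = index(buf)
# Replace text only where it directly follows a <c> start tag.
edits = [
    (tokens[i], b"new")
    for i in range(1, len(tokens))
    if tokens[i].kind == "text" and raw(buf, tokens[i - 1]) == b"<c>"
]
patched = splice(buf, edits)
```

    Because only the selected slices are rewritten, everything outside them
    (including the CRLF and tab characters) is carried over unchanged.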

    But I personally don't recommend processing XML in a way that it "isn't
    meant to be processed". I would use real XML diff apps etc. because
    things might get more complicated when for example entity expansion and
    DTDs get involved...

    Hope this helps,
    Toni Uusitalo
    Toni Uusitalo, Jul 18, 2005
    #2

  3. Toni Uusitalo wrote :

    > This might be useful for you:
    > http://www.devx.com/xml/Article/22219/0/page/1


    Thanks ! Quite interesting, I must say. Is "non-extractive parsing"
    another name for "indexing" ?

    BTW, for various reasons, Perl was chosen in my project, so that's the
    realm I have explored most thoroughly so far. I stumbled on XML::PYX and
    xml_grep (which comes with XML::Twig). I think that PYX could easily be
    extended to become whitespace-preserving.
    In fact, the key seems to be an intermediate serialized form which lends
    itself to substitution (and queries). In VTD-XML, this form is made of
    in-memory VTD records. In PYX, it is a line-based text format.
    The VTD article also mentions an "XMLCursor" Java API (which I suppose
    is org.apache.xmlbeans.XmlCursor ? If it is, then the concept seems to
    be rather close to VTD)
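
    A whitespace-preserving, line-based intermediate form along those lines
    might look like the sketch below. It is not the real PYX notation: the
    record prefixes ('M' for raw markup, '-' for raw text), the escaping
    scheme and the function names are assumptions made for this example,
    and the lexer ignores comments, CDATA and '>' inside attribute values.
    The point is only that decode(encode(x)) gives back x exactly, so a
    substitution on the lines leaves everything else intact.

```python
import re

LEXER = re.compile(r"<[^>]*>|[^<]+")  # naive: tag or run of text

_ESC = {"\\": "\\\\", "\n": "\\n", "\r": "\\r", "\t": "\\t"}
_UNESC = {v: k for k, v in _ESC.items()}


def encode(xml: str) -> list:
    """One record per line: 'M' + raw markup or '-' + raw text, escaped."""
    lines = []
    for m in LEXER.finditer(xml):
        piece = m.group(0)
        kind = "M" if piece.startswith("<") else "-"
        lines.append(kind + re.sub(r"[\\\n\r\t]",
                                   lambda c: _ESC[c.group(0)], piece))
    return lines


def decode(lines) -> str:
    """Inverse of encode: drop the prefix, undo the escaping, concatenate."""
    return "".join(
        re.sub(r"\\[\\nrt]", lambda c: _UNESC[c.group(0)], ln[1:])
        for ln in lines
    )


def replace_value_after_key(lines, key_text, new_value):
    """Replace the text of a <value> that directly follows <key>key_text</key>.

    new_value is assumed to contain no characters that need escaping.
    """
    out, armed = list(lines), False
    for i, ln in enumerate(lines):
        if ln.startswith("M</key") and i >= 1 and lines[i - 1] == "-" + key_text:
            armed = True
        elif armed and re.match(r"M<value[\s/>\\]", ln):
            has_text = i + 1 < len(lines) and lines[i + 1].startswith("-")
            if has_text and not ln.endswith("/>"):
                out[i + 1] = "-" + new_value
            armed = False
        elif armed and ln.startswith("M"):
            armed = False  # some other element came first
    return out


doc = ("<dict>\r\n\t<key>OutputFile</key>\r\n\t<value>a.out</value>\r\n"
       "\t<key>Other</key> <value>a.out</value>\r\n</dict>\r\n")
lines = encode(doc)
patched = decode(replace_value_after_key(lines, "OutputFile", "b.out"))
```

    Only the first <value> is rewritten; the second one follows a different
    key and is left alone. If the text comes from a file, opening it with
    newline="" (or working on bytes) keeps Python from translating the line
    endings.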

    > But I personally don't recommend processing XML in a way that it
    > "isn't meant to be processed".

    I can't agree more but...
    > I would use real XML diff apps etc.

    ...but unfortunately I am stuck with the built-in diff / merge tool
    that's part of our source control. When it cannot cope with XML (because
    of too many differences, for instance), we have to merge as text. Hence
    the requirement on whitespace.

    _______________________________________________________
    François Robert
    (to mail me, reverse character order in reply address)
    François Robert, Jul 19, 2005
    #3
  4. François Robert wrote:
    > ...but unfortunately I am stuck with the built-in diff / merge tool
    > that's part of our source control. When it cannot cope with XML (because
    > of too many differences, for instance), we have to merge as text. Hence
    > the requirement on whitespace.
    >


    Ok. Tasks vary and tools vary. I'm not familiar with VTD-XML myself
    (apart from reading that article and about the non-extractive processing
    principle). I'm not familiar with XML::Twig either, but at a quick
    look it seems to be a very extensive XML processing framework. Maybe
    you can use it if it really gives you accurate location information
    for the line-based, "intact" original XML input.

    I hope I never have to process XML that way ;-)
    But good luck to you anyway.

    Toni Uusitalo
    Toni Uusitalo, Jul 19, 2005
    #4
