Whitespace-preserving Search & Replace in multiple XML documents

François Robert · Jul 18, 2005

Dear Newsgroup,

I am looking for a way to search and replace some strings inside various
XML documents while at the same time binary-preserving all the
whitespace of each document (in particular the line ending convention,
the white space both *inside* the markup and inside the content).

So far, this sounds more like a plain text search-and-replace, but the
twist is that the strings should only be replaced if they match a
certain XML context (say: replace attribute name "jarfile" in any
element <jar> with attribute name "destfile", or change the entire
content of element <value>, but only when <value> immediately
follows an element <key> with a content of "OutputFile", etc.)
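
To make the first example concrete (rename attribute "jarfile" to "destfile", but only inside <jar> tags), here is a minimal sketch that works on the raw text instead of a parsed tree, so every byte outside the attribute name is copied through unchanged. It is an illustration in Python, not a full XML tokenizer: the function names are hypothetical, and it assumes no DOCTYPE internal subset or processing instruction contains text that looks like a <jar> tag.

```python
import re

# One pattern for the constructs we care about, tried in this order:
# comments and CDATA sections (left untouched), then <jar ...> start tags.
# Quoted attribute values are consumed whole, so a '>' inside them is safe.
TOKEN = re.compile(
    r'<!--.*?-->'
    r'|<!\[CDATA\[.*?\]\]>'
    r'|<jar(?=[\s/>])(?:[^>"\']|"[^"]*"|\'[^\']*\')*>',
    re.DOTALL,
)

# One attribute inside a start tag: leading whitespace, name, '=' with any
# surrounding whitespace, quoted value. Each part is captured verbatim.
ATTR = re.compile(r'(\s+)([^\s=/>"\']+)(\s*=\s*)("[^"]*"|\'[^\']*\')')


def rename_attr_in_jar(text, old="jarfile", new="destfile"):
    """Rename attribute `old` to `new` in every <jar> start tag of `text`."""

    def fix_attr(a):
        name = new if a.group(2) == old else a.group(2)
        # Whitespace, '=' and the quoted value are re-emitted untouched.
        return a.group(1) + name + a.group(3) + a.group(4)

    def fix_token(m):
        token = m.group(0)
        if token.startswith("<!"):  # comment or CDATA: copy unchanged
            return token
        return ATTR.sub(fix_attr, token)

    return TOKEN.sub(fix_token, text)


def rewrite_file(path, encoding="utf-8"):
    # newline="" disables newline translation, so CRLF / LF survive as-is.
    with open(path, encoding=encoding, newline="") as f:
        original = f.read()
    updated = rename_attr_in_jar(original)
    if updated != original:
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(updated)
```

Because only the matched attribute name is replaced, a textual diff of the file before and after shows nothing but the renamed attribute. The encoding default is an assumption; a file in another encoding would need it passed explicitly.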

I don't even know if my problem has a "canonical" name, which pretty
much precludes a meaningful search on Google...

I know XSLT can do some (all ?) of that, but :
a) These substitutions need to occur on many different XML files and the
XML contexts / search strings may differ from file to file, so I will
need many different stylesheets. (which could be generated
automatically, I suppose)
b) What guarantee do I have on binary-preservation of all whitespace ?
(BTW this "weird" requirement arises from the need to keep the ability
to make plain textual diffs of those XML documents, which are stored
inside a source control system)

I have also looked at SAX parsers, thinking that maybe I could rely on
event notifications, but it seems that the events are not granular
enough for my situation (e.g. : AFAICT, no notification will tell me that I
have encountered a block of contiguous whitespace inside an element tag,
or what such a block is made of, for instance 3 SPC + LF + LF + TAB + TAB).
Also, the SAX parser does not seem to be able to tell me the exact
'slices' of input characters that it identified as element name,
attribute name, attribute value, whitespaces, entity reference, etc...
AFAICT, SAX will not tell me the difference between 'attr="!"' and
'attr="&#33;"' ?
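
The point about SAX granularity can be checked with a small sketch. It uses Python's standard xml.sax module purely for illustration (other SAX implementations may expose more through extension handlers), and the character reference &#33; is just an assumed example of an escaped "!". The handler records what a start-element event carries: names and decoded values, with no trace of the whitespace or escaping used in the source.

```python
import xml.sax


class AttrCollector(xml.sax.ContentHandler):
    """Records (element name, attributes) for every start-element event."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        # attrs holds decoded values only; the raw source slice is not exposed.
        self.seen.append((name, dict(attrs.items())))


def sax_view(xml_text):
    handler = AttrCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.seen


# Same events for two different spellings of the same tag.
compact = sax_view('<e attr="!"/>')
spaced = sax_view('<e   attr\t=\n"&#33;"  />')
print(compact == spaced)
```

Both inputs produce identical events, which matches the suspicion above: at this level the original character slices are already gone.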

Pointers, suggestions & comments appreciated.
Regards
_______________________________________________________
François Robert
(to mail me, reverse character order in reply address)

Toni Uusitalo · Jul 18, 2005

This might be useful for you:
http://www.devx.com/xml/Article/22219/0/page/1
"VTD-XML retains the XML message in memory intact and un-decoded, and
represents tokens using starting offset and length exclusively."
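
The offset-and-length idea in that quote can be sketched in a few lines. This is not VTD-XML's record layout or API; it is a hypothetical Python illustration of the general principle: keep the original bytes untouched, describe each token as (kind, offset, length), and apply an edit by splicing the buffer. The tag matcher is deliberately simplified and would be confused by '>' inside attribute values, comments or CDATA.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    kind: str    # 'markup' or 'text'
    offset: int  # start position in the original buffer
    length: int  # number of bytes


MARKUP = re.compile(rb"<[^>]*>")  # simplified: no '>' inside quoted values


def index_tokens(buf: bytes) -> List[Token]:
    """Describe buf as a list of offset/length records, copying nothing."""
    tokens, pos = [], 0
    for m in MARKUP.finditer(buf):
        if m.start() > pos:
            tokens.append(Token("text", pos, m.start() - pos))
        tokens.append(Token("markup", m.start(), m.end() - m.start()))
        pos = m.end()
    if pos < len(buf):
        tokens.append(Token("text", pos, len(buf) - pos))
    return tokens


def text_of(buf: bytes, tok: Token) -> bytes:
    return buf[tok.offset:tok.offset + tok.length]


def splice(buf: bytes, tok: Token, replacement: bytes) -> bytes:
    """Replace exactly the bytes covered by tok; everything else is kept."""
    return buf[:tok.offset] + replacement + buf[tok.offset + tok.length:]


def replace_value_after_key(buf: bytes, key: bytes, new_value: bytes) -> bytes:
    """Change the content of the first <value> that follows <key>key</key>."""
    toks = index_tokens(buf)
    for i in range(len(toks) - 2):
        if (text_of(buf, toks[i]) == b"<key>"
                and text_of(buf, toks[i + 1]) == key
                and text_of(buf, toks[i + 2]) == b"</key>"):
            j = i + 3
            # Skip whitespace-only text between </key> and <value>.
            if (j < len(toks) and toks[j].kind == "text"
                    and not text_of(buf, toks[j]).strip()):
                j += 1
            if (j + 1 < len(toks) and text_of(buf, toks[j]) == b"<value>"
                    and toks[j + 1].kind == "text"):
                return splice(buf, toks[j + 1], new_value)
    return buf
```

Joining the slices of all tokens gives back the input byte for byte, which is what makes this kind of index suitable for whitespace-preserving edits.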

But I personally don't recommend processing XML in a way that it "isn't
meant to be processed". I would use real XML diff apps etc. because
things might get more complicated when for example entity expansion and
DTDs get involved...

Hope this helps,
Toni Uusitalo

François Robert · Jul 19, 2005

Toni Uusitalo wrote :

This might be useful for you:
http://www.devx.com/xml/Article/22219/0/page/1

Thanks ! Quite interesting, I must say. Is "non-extractive parsing"
another name for "indexing" ?

BTW, for various reasons, Perl was chosen in my project, so that's the
realm I have explored most thoroughly so far. I stumbled on XML::PYX and
xml_grep (which comes with XML::Twig). I think that PYX could easily be
extended to become whitespace preserving.
In fact, the key seems to be an intermediate serialized form which lends
itself to substitution (and queries). In VTD-XML, this form is made of
in-memory VTD records. In PYX, it is a line-based text format.
The VTD article also mentions an "XMLCursor" Java API (which I suppose
is org.apache.xmlbeans.XmlCursor ? If it is, then the concept seems to
be rather close to VTD)
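
For readers who have not seen PYX, here is a rough sketch of the line-based form, written in Python on top of SAX events and not taken from XML::PYX itself. Each line starts with a type character: "(" for a start tag, "A" for an attribute, "-" for character data, ")" for an end tag, "?" for a processing instruction. The escaping shown is a simplification, and the class and function names are hypothetical.

```python
import xml.sax


def _escape(text):
    # PYX keeps one event per line, so backslashes and newlines are escaped.
    return text.replace("\\", "\\\\").replace("\n", "\\n")


class PyxWriter(xml.sax.ContentHandler):
    """Collects PYX-style lines from SAX events."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._text = []  # pending character data, possibly in several chunks

    def _flush_text(self):
        if self._text:
            self.lines.append("-" + _escape("".join(self._text)))
            self._text = []

    def startElement(self, name, attrs):
        self._flush_text()
        self.lines.append("(" + name)
        for key, value in attrs.items():
            self.lines.append("A" + key + " " + _escape(value))

    def endElement(self, name):
        self._flush_text()
        self.lines.append(")" + name)

    def characters(self, content):
        self._text.append(content)

    def processingInstruction(self, target, data):
        self._flush_text()
        self.lines.append("?" + target + " " + data)


def to_pyx(xml_text):
    handler = PyxWriter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.lines
```

In this form, whitespace inside tags and the CRLF / LF distinction do not appear at all, so a whitespace-preserving variant would need extra line types (or extra fields) to carry them.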

Toni Uusitalo wrote :

But I personally don't recommend processing XML in a way that it
"isn't meant to be processed".

I can't agree more but...

Toni Uusitalo wrote :

I would use real XML diff apps etc.

...but unfortunately I am stuck with the built-in diff / merge tool
that's part of our source control. When it cannot cope with XML (because
of too many differences, for instance), we have to merge as text. Hence
the requirement on whitespace.

_______________________________________________________
François Robert
(to mail me, reverse character order in reply address)

Toni Uusitalo · Jul 19, 2005

Ok. Tasks vary and tools vary. I'm not familiar with VTD-XML myself
(apart from reading that article and about the non-extractive processing
principle). I'm not familiar with XML::Twig either, but at a quick look
it seems to be a very extensive XML processing framework. Maybe
you can use that if it really gives you accurate location information
for the line-based, "intact", original XML input.

I hope I don't ever have to process xml that way ;-)
But good luck to you anyway.

Toni Uusitalo
