Whitespace-preserving Search & Replace in multiple XML documents


François Robert

Dear Newsgroup,

I am looking for a way to search and replace some strings inside various
XML documents while at the same time binary-preserving all the
whitespace of each document (in particular the line ending convention,
the white space both *inside* the markup and inside the content).

So far, this sounds more like a plain text search-and-replace, but the
twist is that the strings should only be replaced if they match a
certain XML context (say: replace attribute name "jarfile" in any
element <jar> with attribute name "destfile", or change the entire
content of element <value>, but only when <value> immediately
follows an element <key> with a content of "OutputFile" etc...)
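
The second case described above (change the content of <value> only when it directly follows <key>OutputFile</key>) can be sketched as a context-anchored substitution on the raw bytes. This is only a rough illustration, not a real XML parser: the function name `replace_output_file` is made up here, and the pattern assumes that nothing but whitespace sits between the two elements, that neither element carries attributes, and that the replacement text is already XML-escaped. Comments, CDATA sections and entity-encoded content would need real tokenizing.

```python
import re

# Three groups: the untouched context before the content, the old content,
# and the closing tag. Groups 1 and 3 are copied back byte for byte.
KEY_THEN_VALUE = re.compile(
    rb'(<key\s*>OutputFile</key\s*>\s*<value\s*>)(.*?)(</value\s*>)',
    re.DOTALL,
)

def replace_output_file(data: bytes, new_content: bytes) -> bytes:
    """Replace the content of <value> only when it follows <key>OutputFile</key>."""
    # A function is used as the replacement so that backslashes in
    # new_content are never interpreted as regex escapes.
    return KEY_THEN_VALUE.sub(
        lambda m: m.group(1) + new_content + m.group(3), data
    )
```

Working on bytes read in binary mode means CR LF, tabs and every other byte outside the replaced slice come back unchanged, which is what keeps a textual diff small.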

I don't even know if my problem has a "canonical" name, which pretty
much precludes a meaningful search on Google...

I know XSLT can do some (all?) of that, but:
a) These substitutions need to occur on many different XML files and the
XML contexts / search strings may differ from file to file, so I will
need many different stylesheets (which could be generated
automatically, I suppose).
b) What guarantee do I have on binary preservation of all whitespace?
(BTW this "weird" requirement arises from the need to keep the ability
to make plain textual diffs of those XML documents, which are stored
inside a source control system.)

I have also looked at SAX parsers, thinking that maybe I could rely on
event notifications, but it seems that the events are not granular
enough for my situation (e.g.: AFAICT, no notification will tell me that I
have encountered a block of contiguous whitespace inside an element tag,
or how such a block is made up, for instance 3 SPC + LF + LF + TAB + TAB).
Also, the SAX parser does not seem to be able to tell me the exact
'slices' of input characters that it identified as element name,
attribute name, attribute value, whitespaces, entity reference, etc...
AFAICT, SAX will not tell me the difference between 'attr="!"' and
'attr="&#33;"' ?
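
The limitation described above can be checked with a small experiment. The sketch below uses Python's standard `xml.sax` module purely as an illustration (the thread itself is about Perl); the handler class `AttrCollector` and the helper `sax_view` are names invented for this example. It feeds the parser two inputs that differ in whitespace inside the tag and in how the attribute value is spelled (literal character versus numeric character reference), then compares what the handler receives.

```python
import xml.sax

class AttrCollector(xml.sax.ContentHandler):
    """Records every start-element event as (name, attributes)."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        # attrs holds already-decoded values; the raw source slice is gone.
        self.seen.append((name, dict(attrs.items())))

def sax_view(xml_bytes):
    handler = AttrCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.seen

literal = b'<jar attr="!"/>'
char_ref = b'<jar   attr="&#33;"\n\t/>'

# Both inputs produce identical events, so the original spelling and the
# whitespace inside the tag cannot be recovered from the events alone.
same_events = sax_view(literal) == sax_view(char_ref)
```

If `same_events` is true, anything built only on these events has to re-serialize the tag and cannot promise byte-identical output.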

Pointers, suggestions & comments appreciated.
Regards
_______________________________________________________
François Robert
(to mail me, reverse character order in reply address)
 

Toni Uusitalo

This might be useful for you:
http://www.devx.com/xml/Article/22219/0/page/1
"VTD-XML retains the XML message in memory intact and un-decoded, and
represents tokens using starting offset and length exclusively."
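
To make the "starting offset and length" idea concrete, here is a minimal sketch of the principle. It is not the VTD-XML API: the `Token` record and the functions `index_attr_names` and `rename_attr` are hypothetical names, and the regular expressions only handle plain start tags with ASCII names. Comments, CDATA sections and DTDs that happen to contain tag-like text would need a real tokenizer. The document stays untouched in memory as bytes, tokens are recorded as offset and length, and an edit is a splice at those offsets.

```python
import re
from typing import List, NamedTuple

class Token(NamedTuple):
    kind: str
    offset: int   # position of the token in the original bytes
    length: int   # number of bytes the token covers

ATTR = rb'[A-Za-z_:][\w:.\-]*\s*=\s*(?:"[^"]*"|\'[^\']*\')'
START_TAG = re.compile(rb'<([A-Za-z_][\w.\-]*)((?:\s+' + ATTR + rb')*)\s*/?>')
ATTR_NAME = re.compile(rb'([A-Za-z_:][\w:.\-]*)\s*=\s*(?:"[^"]*"|\'[^\']*\')')

def index_attr_names(data: bytes, element: bytes) -> List[Token]:
    """Record offset/length of each attribute name in <element ...> start tags."""
    tokens = []
    for tag in START_TAG.finditer(data):
        if tag.group(1) != element:
            continue
        base = tag.start(2)  # where the attribute list begins in data
        for attr in ATTR_NAME.finditer(tag.group(2)):
            tokens.append(Token("attr_name", base + attr.start(1),
                                len(attr.group(1))))
    return tokens

def rename_attr(data: bytes, element: bytes, old: bytes, new: bytes) -> bytes:
    """Splice `new` over every attribute name equal to `old` in `element`."""
    hits = [t for t in index_attr_names(data, element)
            if data[t.offset:t.offset + t.length] == old]
    # Splice from the end so earlier offsets stay valid.
    for t in reversed(hits):
        data = data[:t.offset] + new + data[t.offset + t.length:]
    return data
```

For example, `rename_attr(data, b"jar", b"jarfile", b"destfile")` would touch only the bytes of the attribute name in <jar> elements and leave spacing, line endings and all other elements as they were.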

But I personally don't recommend processing XML in a way that it "isn't
meant to be processed". I would use real XML diff apps etc. because
things might get more complicated when for example entity expansion and
DTDs get involved...

Hope this helps,
Toni Uusitalo
 

François Robert

Toni Uusitalo wrote :

Thanks ! Quite interesting, I must say. Is "non-extractive parsing"
another name for "indexing" ?

BTW, for various reasons, Perl was chosen in my project, so that's the
realm I have explored most thoroughly so far. I stumbled on XML::PYX and
xml_grep (which comes with XML::Twig). I think that PYX could easily be
extended to become whitespace-preserving.
In fact, the key seems to be an intermediate serialized form which lends
itself to substitution (and queries). In VTD-XML, this form is made of
in-memory VTD records. In PYX, it is a line-based format.
The VTD article also mentions an "XMLCursor" Java API (which I suppose
is org.apache.xmlbeans.XmlCursor ? If it is, then the concept seems to
be rather close to VTD)
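
As a rough illustration of what a whitespace-preserving, line-based intermediate form could look like, here is a small sketch. It is not PYX and not XML::PYX output: the `M`/`T` line prefixes and the functions `to_lines` and `from_lines` are invented for this example. Each chunk of the document becomes one line, with line breaks, tabs and backslashes escaped, so the lines can be filtered with ordinary line tools and then joined back into the exact original text.

```python
import re

# Naive split into markup and text; a '>' inside an attribute value or a
# comment would be misclassified, but no characters are ever lost.
MARKUP = re.compile(r'(<[^>]*>)')

_UNESCAPE = {'n': '\n', 'r': '\r', 't': '\t', '\\': '\\'}

def escape(chunk: str) -> str:
    # Backslash first, so the escapes added afterwards are not doubled.
    return (chunk.replace('\\', '\\\\').replace('\r', '\\r')
                 .replace('\n', '\\n').replace('\t', '\\t'))

def unescape(line: str) -> str:
    return re.sub(r'\\(.)', lambda m: _UNESCAPE[m.group(1)], line)

def to_lines(text: str) -> list:
    """One line per chunk: 'M' for markup, 'T' for text between markup."""
    lines = []
    for i, chunk in enumerate(MARKUP.split(text)):
        if chunk:
            # With a capturing group, odd indices are the markup pieces.
            lines.append(('M' if i % 2 else 'T') + escape(chunk))
    return lines

def from_lines(lines) -> str:
    """Rebuild the document from the line form."""
    return ''.join(unescape(line[1:]) for line in lines)
```

The input should be read without newline translation (in Python, `open(path, newline='')`), otherwise CR LF line endings are already lost before the split.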

Toni Uusitalo wrote :
But I personally don't recommend processing XML in a way that it
"isn't meant to be processed".

I can't agree more but...

Toni Uusitalo wrote :
I would use real XML diff apps etc.

...but unfortunately I am stuck with the built-in diff / merge tool
that's part of our source control. When it cannot cope with XML (because
of too many differences, for instance), we have to merge as text. Hence
the requirement on whitespace.

_______________________________________________________
François Robert
(to mail me, reverse character order in reply address)
 

Toni Uusitalo

Ok. Tasks vary and tools vary. I'm not familiar with VTD-XML myself
(apart from reading that article and about the non-extractive processing
principle). I'm not familiar with XML::Twig either, but from a quick
look it seems to be a very extensive XML processing framework. Maybe
you can use that if it really gives you accurate location information
for the line-based, "intact" original XML input.

I hope I don't ever have to process XML that way ;-)
But good luck to you anyway.

Toni Uusitalo
 
