high-performance alternative to xsl:number

ajfish

Hi,

I am trying to allocate a unique ID to every instance of tag 'foo' in a
large XML document. Currently I'm doing this:

<xsl:variable name="UniqueId">
<xsl:number count="foo" level="any"/>
</xsl:variable>

But with .NET Framework 1.1 (using XPathDocument) it is very slow for
large documents (say 100 MB with 100,000 foo tags in it). When I say
very slow, I am talking days, and I would like it to take minutes!

The only pure XSL alternative I've seen is to use position(). However,
the <foo> tags can occur at different levels within the document (and
might be nested), so I'm thinking that position() would be difficult to
use. There are also other templates within the XSLT which perform other
processing.

The IDs I generate don't have to be contiguous, but they must increase
the further you go down the document.

Is there any simple, reliable solution, or should I just bite the bullet
and pre-process the document with C# to put in these IDs before running
the rest of the transform?

Thanks

Andy
 
p.lepin

Bjoern said:
* (e-mail address removed) wrote in comp.text.xml:

Use <http://www.w3.org/TR/xslt#function-generate-id>.

generate-id() generates names, not numbers. The OP seems
to want numbers; moreover, he elaborates that he wants
the IDs to increase in document order. I don't believe
generate-id() guarantees that, not according to the spec.

Assuming using a faster external XSLT processor is not an
option, I'd say pre-processing is the best bet (if it is
fast enough). For that matter, I'm not sure there are XSLT
processors that would be much faster at this sort of task,
although I seem to remember that Saxon8 includes lots of
nifty optimizations.
 
Joseph Kesselman

The fact that XSLT is nonprocedural does mean that numbering tasks are
generally best handled either by the built-in numbering mechanism or by
doing something based on position(); trying to do any other form of
counter requires either deep recursion (which is likely to be slow,
though not always) or extension functions (which is nonportable at best,
and may not be reliable for this purpose since XSLT may execute
out-of-order) or leveraging nonportable quirks of a particular processor
(Xalan's generate-id() strings do happen to correspond to a
document-order numbering, so one could kluge from them if you never need
to run on another processor).

But in this case, a hardcoded procedural processor -- I'd suggest a SAX
filter -- might indeed be worth considering. This is a classic example
where stream-editing is a good approach -- no persistent model beyond a
few state variables (the counter, in this case), no navigation needed
other than next-node-in-document-order, and large documents which could
be expensive to persist.
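
A minimal sketch of the stream-editing idea, written in Python's standard xml.sax module purely for illustration (the thread itself targets .NET). The class name SeqNumberFilter, the helper number_elements, and the attribute name "seq" are all made up for this example; the only state kept is the counter.

```python
import io
from xml.sax import make_parser
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl


class SeqNumberFilter(XMLFilterBase):
    """SAX filter that adds an increasing attribute to every `tag` element."""

    def __init__(self, parent, tag="foo", attr="seq"):
        super().__init__(parent)
        self._tag = tag
        self._attr = attr
        self._counter = 0  # the only state the filter keeps

    def startElement(self, name, attrs):
        if name == self._tag:
            # Start tags arrive in document order, so the counter does too.
            self._counter += 1
            new_attrs = dict(attrs.items())
            new_attrs[self._attr] = str(self._counter)
            attrs = AttributesImpl(new_attrs)
        # Pass the (possibly modified) event on to the downstream handler.
        super().startElement(name, attrs)


def number_elements(xml_text, tag="foo"):
    """Run the filter over a string and return the rewritten XML (hypothetical helper)."""
    out = io.StringIO()
    reader = SeqNumberFilter(make_parser(), tag=tag)
    reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    reader.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()
```

For a real 100 MB input one would presumably parse from a file and write to a file rather than going through strings; the filter itself would not change.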
 
ajfish

Joseph said:
The fact that XSLT is nonprocedural does mean that numbering tasks are
generally best handled either by the built-in numbering mechanism or by
doing something based on position(); trying to do any other form of
counter requires either deep recursion (which is likely to be slow,
though not always) or extension functions (which is nonportable at best,
and may not be reliable for this purpose since XSLT may execute
out-of-order) or leveraging nonportable quirks of a particular processor
(Xalan's generate-id() strings do happen to correspond to a
document-order numbering, so one could kluge from them if you never need
to run on another processor).

But in this case, a hardcoded procedural processor -- I'd suggest a SAX
filter -- might indeed be worth considering. This is a classic example
where stream-editing is a good approach -- no persistent model beyond a
few state variables (the counter, in this case), no navigation needed
other than next-node-in-document-order, and large documents which could
be expensive to persist.

Thanks for all the replies.

Of course, the fact that XSLT is declarative means that any given
pattern can be implemented in any way the processor sees fit. I would
have thought the processor could spot that <xsl:number count="..."
level="any"/> can be optimized down to incrementing a counter.
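
To make the cost difference concrete, here is a rough Python sketch (not XSLT, and not a claim about how any particular processor is implemented). It assumes a naive implementation recounts every matching element up to the current one, and compares that with a single pass that keeps one running counter. The function names are invented for this illustration.

```python
import xml.etree.ElementTree as ET


def number_naive(root, tag="foo"):
    """For each matching element, recount all matches up to it: O(n^2)."""
    elements = list(root.iter())  # iter() yields elements in document order
    numbers = []
    for index, element in enumerate(elements):
        if element.tag == tag:
            # Count this element plus every earlier match, from scratch.
            count = sum(1 for e in elements[: index + 1] if e.tag == tag)
            numbers.append(count)
    return numbers


def number_with_counter(root, tag="foo"):
    """One pass in document order with a single running counter: O(n)."""
    numbers = []
    counter = 0
    for element in root.iter():
        if element.tag == tag:
            counter += 1
            numbers.append(counter)
    return numbers
```

Both functions return the same numbers; only the amount of work differs, and with 100,000 matching elements the gap between the two would be large.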

But anyway, I think the effort of finding an alternative XSLT processor
which might be faster is probably less than putting in my own
pre-processing step. As you suggest, it's an ideal candidate for SAX.
 
Peter Flynn

ajfish said:
the Id's I generate don't have to be contiguous but they must increase
the further you go down the document

Others have provided a number of solutions to the original request,
but I feel I should take issue with this one. It's probably A Bad Idea
to trespass on the ID space by adding another meaning to it. An ID is
just an ID, nothing more: all it says is "This Is Me, I'm Unique".

Trying to make an ID value mean something in addition is almost always
wrong, and almost always the hallmark of poor data design. It's like the
traditional way of creating customer numbers: two digits for the area,
three digits for the industry code, then a dash because that company we
took over in 1954 always used them, then one digit for this and four
digits for that, then a check digit, and finally a "unique" sequence
number. Accounting offices and marketing offices *love* doing this, when
what they should be doing is recording all that information elsewhere
and assigning an arbitrary unique ID to the customer.

If your customer needs a sequence indicator, create an attribute and
make it reflect the numeric sequence position of each foo element in the
document. If the data is long-term important, just keep the ID as an ID
and your successors will thank you for it.

On the other hand, if like a lot of business data it's only important
for 10-15 minutes while a decision is made, then any old junk will do so
long as it satisfies the immediate conditions :)

///Peter
 
Richard Tobin

ajfish said:
I am trying to allocate a unique ID to every instance of tag 'foo' in a
large XML document. currently I'm doing this:

<xsl:variable name="UniqueId">
<xsl:number count="foo" level="any"/>
</xsl:variable>

The only reasonably efficient XSLT solution I can come up with is a 3
step process. First assign ids (which won't be numbers, and could be
in any format) to the elements using generate-id(). Then generate a
file mapping ids to sequence numbers (using position() on a node-set
containing all the desired elements). Then use key() to look up the
sequence numbers in the map file. I think this should be order N (or
close) for most stylesheet processors.

Here are the stylesheets, which assume that you want to operate on all
"foo" elements, that you want to call the sequence number attribute
"seq", and that you don't already have attributes called "id".

(1) Assign ids:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="foo">
    <xsl:copy>
      <xsl:attribute name="id"><xsl:value-of select="generate-id()"/></xsl:attribute>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

(2) Create map file from the result of step 1:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/">
    <junk>
      <xsl:apply-templates select="//foo"/>
    </junk>
  </xsl:template>

  <xsl:template match="foo">
    <map id="{@id}" seq="{position()}"/>
  </xsl:template>

</xsl:stylesheet>

(3) Map ids to sequence numbers (pass in the URL of the map file as the
"mapfile" parameter, and use the file generated in step 1 as the input):

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:param name="mapfile"/>

  <xsl:key name="id" match="map" use="@id"/>

  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="foo">
    <xsl:copy>
      <xsl:apply-templates select="@*[name() != 'id']"/>
      <xsl:attribute name="seq">
        <xsl:variable name="id" select="@id"/>
        <!-- the for-each is just to set the context node for key() -->
        <xsl:for-each select="document($mapfile)">
          <xsl:value-of select="key('id', $id)/@seq"/>
        </xsl:for-each>
      </xsl:attribute>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

-- Richard
 
Joe Kesselman

Peter said:
It's probably A Bad Idea
to trespass on the ID space by adding another meaning to it. An ID is
just an ID, nothing more: all it says is "This Is Me, I'm Unique".

Granted. I was assuming that what was wanted here was just a sequence
identifier, not an ID in the ID/IDREF sense, since the request was
specifically that it be a monotonically increasing numeric value.
 
Gadget

Although it seems to be relatively unknown, you can actually
include C# code in your XSLT.
For example, add the following at the bottom of your XSLT:

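<ms:script language="C#" implements-prefix="ext">
<![CDATA[
    // Both prefixes must be declared on xsl:stylesheet: "ms" bound to
    // urn:schemas-microsoft-com:xslt, and "ext" to a namespace URI of your choice.

    // Running counter shared by every call during one transform.
    int currentPosition = 0;

    public string GetPosition()
    {
        currentPosition = currentPosition + 1;
        return currentPosition.ToString();
    }
]]>
</ms:script>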

Now you can get an incrementing ID using something like:
<xsl:value-of select="ext:GetPosition()"/>

Nothing is going to be as fast as opening an XmlTextReader/XmlTextWriter
pair and iterating through the document, adding the attributes when you
read a node.Name='foo', as the whole file has to be parsed and rewritten
anyway.

Cheers,
Gadget
 
Joe Kesselman

Gadget said:
Actually, although it seems to be relatively unknown, you can actually
include C# code in your XSLT.

Uhm... Only in MSXSL. This is an implementation-specific, nonportable
feature (which is why it's in Microsoft's own namespace). This is
basically the "extension functions" solution, except that MS has
provided a way to inline them. (Personally, I don't like it -- but I'm a
firm believer in sticking with portable solutions unless there is
absolutely no alternative.)

As I said, stateful extensions may work but they have issues. A
sufficiently smart XSLT processor may re-order code as part of its
optimization, counting on the fact that XSLT is a functional language;
extensions break that assumption and thus may either prevent
optimization (if the processor is smart and cautious) or fail to execute
in the way you expected them to (if the processor is assuming that the
extensions will also be functional and have no persistent state).
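
A toy Python sketch of the failure mode, under the purely hypothetical assumption that a processor evaluates the calls in some order other than document order (no claim that any real processor does this). The names make_counter, number_nodes, and the node labels are invented for illustration.

```python
import itertools


def make_counter():
    """Return a stateful function, like an extension with a hidden counter."""
    counter = itertools.count(1)

    def get_position():
        return next(counter)

    return get_position


def number_nodes(nodes, evaluation_order):
    """Call the stateful function once per node in `evaluation_order`,
    then report the assigned numbers in document order."""
    get_position = make_counter()
    assigned = {}
    for index in evaluation_order:
        assigned[nodes[index]] = get_position()
    return [assigned[node] for node in nodes]


nodes = ["foo-a", "foo-b", "foo-c"]  # listed in document order

in_order = number_nodes(nodes, [0, 1, 2])   # evaluated in document order
reordered = number_nodes(nodes, [2, 0, 1])  # hypothetical reordering
```

Here in_order is [1, 2, 3] while reordered is [2, 3, 1]: the numbers are still unique, but they no longer increase down the document.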

I honestly think the preprocessor approach is architecturally cleaner.
But, yeah, this may work.
 
Gadget

Joe said:
Uhm... Only in MSXSL. This is an implementation-specific, nonportable
feature (which is why it's in Microsoft's own namespace). This is
basically the "extension functions" solution, except that MS has
provided a way to inline them. (Personally, I don't like it -- but I'm a
firm believer in sticking with portable solutions unless there is
absolutely no alternative.)

As I said, stateful extensions may work but they have issues. A
sufficiently smart XSLT processor may re-order code as part of its
optimization, counting on the fact that XSLT is a functional language;
extensions break that assumption and thus may either prevent
optimization (if the processor is smart and cautious) or fail to execute
in the way you expected them to (if the processor is assuming that the
extensions will also be functional and have no persistent state).

I honestly think the preprocessor approach is architecturally cleaner.
But, yeah, this may work.

Well, we obviously avoid vendor-specific code when doing anything, but in
this case we're in a dotnet.xml group, and he's asking for a 'high
performance' solution, so that rules out native XSLT :)
The advantage of XSLT in this case is that it is the most flexible way to
manipulate XML, and does not require recompiling every time a change is
made, which is why the inclusion of the code in the XSLT is almost
certainly going to be his best flexible 'high performance' option.

If this is a single requirement that does not require flexibility, use an
XmlTextReader and XmlTextWriter, and manipulate the data as you copy it
from one stream to another. This is a 'one shot' solution that requires
compilation but is the fastest 'structured' method.

Insisting on using platform-independent XSLT for code that will be running
under MSXML is a bit of an 'ivory tower' practise. It is ideal if you believe
your solution might one day be ported to a Linux box, run another vendor's
engine, or be posted for the scrutiny of the open-source community, but the
chances are that this would require redesigning 90% of your application
anyway, in which case this small part becomes negligible :)

It would be interesting to see if the XSLT processor did reorder any of the
code, but given that this solution was provided by Microsoft, and given
that there are standards for the order in which nodes are traversed, this
is rather unlikely.

I guess it's just a case of prioritizing speed, flexibility, and
standardization.

Cheers,
Gadget
 
Peter Flynn

Joe said:
Granted. I was assuming that what was wanted here was just a sequence
identifier, not an ID in the ID/IDREF sense, since the request was
specifically that it be a monotonically increasing numeric value.

Yep, but in that case it would be better to call it SEQ or something,
just in case it gets accidentally misinterpreted as being an ID in the
XML sense of the term.

///Peter
 
Joe Kesselman

Gadget said:
this case we're in a dotnet.xml group, and he's asking for a 'high
performance' solution, so that rules out native XSLT :)

Some of you are in a dotnet.xml group; the discussion's being crossposted.
 
W. Jordan

I agree with Gadget.
The cross-platform thing without code rewriting is still like a dream.
Yet I suggest that the file mapping solution is a better choice.
 
Joe Kesselman

W. Jordan said:
The cross-platform thing without code rewriting is still like a dream.

I'm not surprised to hear that opinion expressed in a Microsoft-specific
group. The rest of the industry seems to be managing it pretty well.

There are certainly times that a portable solution doesn't matter and
extensions are the right answer. Up to the developer to decide whether
this is such a case. I'll let it rest at that.
 
