Getting kind of abstract text snippets from text nodes

Andreas W. Wylach · Mar 8, 2007

Hi everybody,

I am implementing a little search engine that searches for a phrase
in XML text nodes. I have that all working fine, but what I want as the
result is not the complete text of the text node. I would like an
abstract-like result list (the kind of output you get with Google
searches).

For example:

.... I am the <b>substring</b> from a complete text node ...

where "substring" is the search term.

The problem is simple (I think): I want to extract every part of the
complete text node where the search term occurs, with the term
highlighted and surrounded by about 30 characters of text.

I found an interesting post, "cut down text", which is almost what I am
looking for, but there the text is just trimmed to x characters.

Does anybody here have an "elegant" way to solve this, or some hints
that would get me to a solution? I am not able to use regex (it would
be nice, though). My parser is Sablotron, so I am restricted to the
functions that I get (1.0).

Any help is greatly appreciated.

regards,
Andreas W Wylach

Joe Kesselman · Mar 8, 2007

Think about dividing the text into three parts: before your target, the
target itself, and after the target. Process each appropriately. If you
want to report multiple instances within the same block of text, look at
the standard examples of recursive text processing.
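
One way this three-part split might look in plain XSLT 1.0 (no regex and
no extension functions, so it should also suit an XSLT 1.0 processor such
as Sablotron) is sketched below. The parameter names `term` and `width`
and the template name `snippets` are made up for illustration, and the
sketch assumes a case-sensitive match.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <!-- Hypothetical parameters: the search term and the number of
       context characters to keep on each side of it. -->
  <xsl:param name="term" select="'substring'"/>
  <xsl:param name="width" select="30"/>

  <!-- Hand every text node to the recursive template; nodes that do
       not contain the term produce no output. -->
  <xsl:template match="text()">
    <xsl:call-template name="snippets">
      <xsl:with-param name="text" select="string(.)"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Emits one snippet per occurrence of $term in $text. -->
  <xsl:template name="snippets">
    <xsl:param name="text"/>
    <!-- The empty-term guard avoids endless recursion, because
         contains($text, '') is always true. -->
    <xsl:if test="$term != '' and contains($text, $term)">
      <!-- Part 1: everything before the first occurrence. -->
      <xsl:variable name="before"
                    select="substring-before($text, $term)"/>
      <!-- Part 3: everything after the first occurrence. -->
      <xsl:variable name="after"
                    select="substring-after($text, $term)"/>
      <p>
        <xsl:text>... </xsl:text>
        <!-- Last $width characters of the leading part; a start
             position below 1 simply yields the whole string. -->
        <xsl:value-of
            select="substring($before,
                              string-length($before) - $width + 1)"/>
        <!-- Part 2: the target itself, highlighted. -->
        <b><xsl:value-of select="$term"/></b>
        <!-- First $width characters of the trailing part. -->
        <xsl:value-of select="substring($after, 1, $width)"/>
        <xsl:text> ...</xsl:text>
      </p>
      <!-- Recurse on the remainder to report further occurrences. -->
      <xsl:call-template name="snippets">
        <xsl:with-param name="text" select="$after"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

Because the recursion continues on the text after each match, the leading
context of a second or later snippet in the same node is cut off at the
previous match. The cut points also ignore word boundaries, so a snippet
may begin or end in the middle of a word.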

Dimitre Novatchev · Mar 10, 2007

Andreas W. Wylach said:
Hi everybody,

I am implementing a little search engine that searches for a phrase
in XML text nodes. I have that all working fine, but what I want as the
result is not the complete text of the text node. I would like an
abstract-like result list (the kind of output you get with Google
searches).

For example:

... I am the <b>substring</b> from a complete text node ...

where "substring" is the search term.

The problem is simple (I think): I want to extract every part of the
complete text node where the search term occurs, with the term
highlighted and surrounded by about 30 characters of text.

FXSL gives you exactly that (look for testConcordance.xsl).

As first shown here a year and a half ago:

http://www.stylusstudio.com/xsllist/200511/post00560.html

this was used to create a concordance of the text of the New Testament for
any word longer than three characters with frequency count in the document
not exceeding a given frequency count parameter (1280, which practically
leaves out mainly pronouns).

The code itself is 95 lines and, on a 3GHz, 2GB Pentium IV PC with Saxon 8.6
(at that time), needed less than 92 seconds to produce the complete (huge)
concordance. The source xml document: "ot Ending Spaces.xml" is almost
50,000 (fifty thousand) lines long.

This is just one illustration of the reality of what can be done with XSLT,
dispelling the myths that "XSLT cannot do this or that
efficiently/elegantly".

Hope this helped.

Cheers,
Dimitre Novatchev
