XQuerying material between elements

patrik.nyman · Apr 25, 2007

I am marking up the text of old books, and need to be
able to present the result page by page. The problem is
that a page break sometimes occurs in the middle of a
paragraph (or of some other element).
See the following example.

<p>I shall not describe it to you, for in-
<lb/>deed I cannot. To delineate the truly aw-
<lb/>ful locality of Trollhättan, would
<lb/>baffle the powers of poetic fancy, and mock
<pb n="15" urn="urn:nbn:se:kb:digark-7886"/>
<lb/>the painter's daring pencil. I ran only af-
<lb/>ford you a faint idea of its characteristic
<lb/>features, and even that will he found
<lb/>arduous. Come, and see it, and you will
<lb/>applaud my modesty.
</p>
<p>
[...]
<lb/>of gold." Subscribing to the old Swedish
<lb/>proverb: When it rains down milk, the poor
<lb/>has no spoon," I silently dropped the theme,
<lb/>and would not have rementioned it now,
<pb n="16" urn="urn:nbn:se:kb:digark-7887"/>
<lb/>if I were not anxious to dis-play to you, what
<lb/>an able minister of state I might possibly
<lb/>be, if His Majesty should be pleased to
<lb/>invest me with that honor, which, you
<lb/>know, is as distant from me as the mitre
<lb/>and the slipper of the Pope of Rome.
</p>

Simply extracting the material between the <pb/> elements
gives XML that is not well-formed.

So, is it possible to write an XQuery expression that
can fix this, i.e. 'detect' that the <pb/> occurs in
the middle of another element and take the appropriate
action? The result would have to look something like

<p>
<pb n="15" urn="urn:nbn:se:kb:digark-7886"/>
the painter's daring pencil. I ran only af-
<lb/>ford you a faint idea of its characteristic
<lb/>features, and even that will he found
<lb/>arduous. Come, and see it, and you will
<lb/>applaud my modesty.
</p>
<p>
[...]
<lb/>of gold." Subscribing to the old Swedish
<lb/>proverb: When it rains down milk, the poor
<lb/>has no spoon," I silently dropped the theme,
<lb/>and would not have rementioned it now,
</p>

Thanks.

Joseph Kesselman · Apr 25, 2007

I'm sure XQuery can do it, though I'm not sure of the syntax offhand.

In XPath, I would set up a template that matches on p[pb] (a paragraph
that contains a page break) and rewrites it appropriately by first
outputting a p containing the pb's preceding siblings, then the pb,
then a p containing the following siblings. Very straightforward.
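
For readers who need this in XQuery rather than in a template language,
the split described above could be sketched roughly as follows. It is
only an illustration: it assumes the interrupted elements are p
elements, that each contains at most one pb, and that the query runs
with the document as its context. Attributes on the original p are not
copied.

```xquery
(: Sketch: split every paragraph that contains a page break into
   the part before the break, the break itself, and the part after it.
   Assumes at most one pb per p; attributes of the p are not copied. :)
for $p in //p[pb]
let $pb := $p/pb[1]
return (
  <p>{ $pb/preceding-sibling::node() }</p>,
  $pb,
  <p>{ $pb/following-sibling::node() }</p>
)
```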

Pavel Lepin · Apr 25, 2007

Joseph Kesselman said:
So, is it possible to write an XQuery expression that
can fix this, i.e. 'detect' that the <pb/> occurs in
the middle of another element and take the appropriate
action? The result would have to look something like
I'm sure XQuery can do it, though I'm not sure of the
syntax offhand.

In XPath, I would set up a template that matches on p[pb]
(a paragraph that contains a page break) and rewrites it
appropriately by first outputting a p containing the pb's
preceding siblings, then the pb, then a p containing the
following siblings. Very straightforward.

XSLT does indeed seem like a better bet than XQuery in this
case, but if you try to generalize the problem a bit
(multiple page breaks and more than one level of ancestor
elements to be spliced) it gets kinda messy with XSLT1. On
the other hand, an XSLT2 solution would be fairly elegant
thanks to sequences--may FSM touch with his noodly
appendage whoever on XSLT WG came up with those.
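
For what it's worth, the generalized case (breaks nested at any depth)
can also be sketched in plain XQuery 1.0 with a recursive function.
This is a rough sketch under assumptions, not a tested solution: the
names local:page, $start and $end are made up for illustration, each
URN is assumed to match exactly one pb, and the query is assumed to run
with the document as its context. Elements that straddle a break are
rebuilt with their attributes, so attributes such as IDs will appear on
both halves if the function is called for neighbouring pages.

```xquery
(: Sketch: copy the part of a tree that lies between two milestone
   nodes, rebuilding any element that a milestone cuts through. :)
declare function local:page($node as node(), $start as node(), $end as node())
  as node()*
{
  (: keep the opening page break itself :)
  if ($node is $start) then $node
  (: an element that contains either milestone is only partly on the
     page: rebuild it and recurse into its children :)
  else if (exists($node/descendant::node() intersect ($start, $end))) then
    element { node-name($node) } {
      $node/@*,
      for $child in $node/node()
      return local:page($child, $start, $end)
    }
  (: anything lying wholly between the milestones is copied as-is :)
  else if ($node >> $start and $node << $end) then $node
  (: everything else is outside the page :)
  else ()
};

let $start := //pb[@urn = 'urn:nbn:se:kb:digark-7886'],
    $end   := //pb[@urn = 'urn:nbn:se:kb:digark-7887']
return <hit>{ local:page(/*, $start, $end) }</hit>
```

The function is called on the root element rather than the document
node, because a document node has no name to rebuild.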

Joseph Kesselman · Apr 25, 2007

Joseph said:
In XPath,

Meant to write XSLT, obviously. Sigh. Engage mind, THEN put fingers in
gear...

patrik.nyman · Apr 26, 2007

Thanks for the replies. I forgot to mention that the texts
are stored in the eXist database, hence the need for XQuery.
What I've managed to come up with is this.

1 <hit>
2 Check if the initial <pb> is the child of another element,
3 and print the name of that element.
4 {
5 let $i1 := //pb[@urn='urn:nbn:se:kb:digark-7886']
6 return
7 if ($i1[parent::p]) then
8 '<p>'
9 else
10 if ($i1[parent::lg]) then
11 '<lg>'
12 else()
13 }
14 Print the material between the pagebreaks.
15 {
16 let $i1 := //pb[@urn='urn:nbn:se:kb:digark-7886'],
17 $i2 := //pb[@urn='urn:nbn:se:kb:digark-7887']
18 for $n in //text()
19 where $n >> $i1 and $n << $i2
20 return $n
21 }
22 Check if the final <pb> is the child of another element,
23 and print the name of that element.
24 {
25 let $i2 := //pb[@urn='urn:nbn:se:kb:digark-7887']
26 return
27 if ($i2[parent::p]) then
28 '</p>'
29 else
30 if ($i2[parent::lg]) then
31 '</lg>'
32 else()
33 }
34 </hit>

This works fine, except of course for the 'text()' on line 18.
This outputs only the text, not the text and markup, which is what I
want.
Switching 'text()' for 'node()' or 'element()' doesn't give the
desired result either, naturally.

Any suggestions are welcome. Thanks.

Pierrick Brihaye · Apr 26, 2007

patrik.nyman wrote:

Out of curiosity, are these string literals on lines 8, 11, 28 and 31:

8 '<p>'
11 '<lg>'
28 '</p>'
31 '</lg>'

... supposed to be mark-up in the resulting sequence?

p.b.

Joseph Kesselman · Apr 26, 2007

Pierrick said:
For my curiosity, is :
this :
... supposed to be mark-up in the resulting sequence ?

I certainly hope not, because if so I'd consider it an abuse of XQuery,
akin to trying to hand-construct tags in XSLT.

If the goal is to construct document structure, construct structure, not
text that looks like structure.
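
To make the distinction concrete, here is a minimal sketch of building
the enclosing element as a real node with a computed element
constructor, rather than returning the strings '<p>' or '<lg>'. It
assumes the URN matches exactly one pb and that the query runs with the
document as its context; the variable $parent is just an illustrative
name.

```xquery
(: Sketch: rebuild the interrupted parent (p or lg) as an element node
   that holds the page break and everything after it in that parent. :)
let $i1 := //pb[@urn = 'urn:nbn:se:kb:digark-7886']
let $parent := $i1/parent::*[self::p or self::lg]
return
  if ($parent)
  then element { node-name($parent) } { $i1, $i1/following-sibling::node() }
  else $i1
```

Because the result is a node rather than a string, the serializer
produces the start and end tags and the output stays well-formed.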

Priscilla Walmsley · Apr 27, 2007

Hi,

How about something like this:

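let $i1 := //pb[@urn='urn:nbn:se:kb:digark-7886'],
$i2 := //pb[@urn='urn:nbn:se:kb:digark-7887']
return <hit>{
(: rest of the paragraph that the first page break interrupts :)
if ($i1[parent::p])
then <p>{$i1/following-sibling::node()}</p>
else ()
,
(: whole paragraphs between the breaks that contain neither break :)
for $n in //p
where $n >> $i1 and $n << $i2
and not($n/*[. is $i1]) and not($n/*[. is $i2])
return $n
,
(: start of the paragraph that the second page break interrupts :)
if ($i2[parent::p])
then <p>{$i2/preceding-sibling::node()}</p>
else ()

}</hit>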

Hope that helps,
Priscilla

patrik.nyman · Apr 29, 2007

Thanks a lot for this. I cannot test it until Wednesday, but then I'll
let you know.

/Patrik Nyman

patrik.nyman · May 3, 2007

Yes, it works, and is much better than my version!
Thanks a lot,
Patrik
