Mixed Content XML pattern matching

phaeton123

I was trying to use XQuery to do pattern matching over mixed
structured and unstructured content. For example, consider the following
XML fragment:

.....
<article id="777">
<title>Massachusetts to use ODF Through Microsoft Office</title>
<body>
In a move sure to be unpopular with <organization>Sun</organization>
and <organization>IBM</organization>, <state>Mass.</state> has decided
to use the ODF plug-ins recently released for Microsoft Office rather
than move to OpenOffice.
</body>
</article>
.....

Suppose I wanted to find all articles that contain a reference to a
state, such as Massachusetts, followed later in the text by Microsoft. In
other words, something like "<type=state>.*Microsoft".
What would be the easiest way to accomplish this with XQuery or XPath,
if it is possible at all? If it is, can these "mixed regular
expressions" incorporate arbitrarily nested structures and regular
text? The problem I am having is conceptualizing how one can naturally
combine searches over structure and content jointly, as opposed to
doing it in two passes: one pass to search for paths of the form
//state and extract all of the matching bodies, and a second pass to
search those bodies for plain-text regular expression matches of
"Microsoft".
 
Martin Honnen

phaeton123 wrote:

<article id="777">
<title>Massachusetts to use ODF Through Microsoft Office</title>
<body>
In a move sure to be unpopular with <organization>Sun</organization>
and <organization>IBM</organization>, <state>Mass.</state> has decided
to use the ODF plug-ins recently released for Microsoft Office rather
than move to OpenOffice.
</body>
</article>
....

Suppose I wanted to find all articles that contain a reference to a
state, such as Massachusetts, followed later in the text by Microsoft.

//article[body[state[. = 'Mass.' and (some $sibling in
following-sibling::node() satisfies contains($sibling, 'Microsoft'))]]]

should select those article elements that have a body child element
with a state child element whose content is 'Mass.' and that is
followed by some sibling node containing 'Microsoft'. Note that the
"some ... satisfies" expression requires XPath 2.0 or XQuery.
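
For anyone who wants to try the same single-pass idea outside an XQuery processor, here is a rough Python sketch using only the standard library. It is not a translation of the expression above. ElementTree's XPath subset has no following-sibling axis, so the walk over the siblings is done by hand. The helper names (text_after, articles_with_state_then), the <articles> wrapper element, and the second article (id 778) are made up for illustration. Unlike the XPath, this version joins all of the following text before searching, so it also matches a word that only appears inside a later child element's nested content.

```python
import xml.etree.ElementTree as ET

# The first article is the fragment from the question; the wrapper element
# and the second article are hypothetical, added only to show a non-match.
SAMPLE = """<articles>
<article id="777">
<title>Massachusetts to use ODF Through Microsoft Office</title>
<body>
In a move sure to be unpopular with <organization>Sun</organization>
and <organization>IBM</organization>, <state>Mass.</state> has decided
to use the ODF plug-ins recently released for Microsoft Office rather
than move to OpenOffice.
</body>
</article>
<article id="778">
<title>Hypothetical counter-example</title>
<body>Microsoft opened an office in <state>Mass.</state> last year.</body>
</article>
</articles>"""


def text_after(body, index):
    """Concatenate all text that follows the child at `index` inside `body`."""
    children = list(body)
    parts = [children[index].tail or ""]
    for sibling in children[index + 1:]:
        parts.append("".join(sibling.itertext()))  # text inside the sibling
        parts.append(sibling.tail or "")           # text right after it
    return "".join(parts)


def articles_with_state_then(root, word, state=None):
    """Return ids of articles whose body has a <state> child followed by `word`.

    If `state` is given, only <state> elements with exactly that text count.
    """
    hits = []
    for article in root.iter("article"):
        body = article.find("body")
        if body is None:
            continue
        for i, child in enumerate(body):
            if child.tag != "state":
                continue
            if state is not None and (child.text or "") != state:
                continue
            if word in text_after(body, i):
                hits.append(article.get("id"))
                break  # one match per article is enough
    return hits


root = ET.fromstring(SAMPLE)
print(articles_with_state_then(root, "Microsoft"))
print(articles_with_state_then(root, "Microsoft", state="Mass."))
```

Article 778 is not reported because "Microsoft" appears before the state element rather than after it. The sketch only looks at state elements that are direct children of body, as the XPath above does; handling arbitrarily nested markup would need a document-order walk instead.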
 
