Searching My XML File Using Keyword Searches?


pbd22

Hi.

I am somewhat new to this and would like some advice.
I want to search my XML file using a "keyword" search and
return results based on "proximity matching". In other words,
since the search string will often not produce a direct match,
the results will be based on proximity (50%, 20%, 100%, etc.).

Are there any good examples out there of how to do keyword
searches on XML data? How should I set up my XML file so
as to make a tag as likely as possible to match a related
search term?

And finally, how is proximity determined?

I know that this is a heady question, but I am hoping that some
answers will at least put me on the right track, possibly with
links, sample code, or examples.

Thanks in advance.
 

Joe Kesselman

pbd22 said:
I want to search my xml file using "keyword" search and
return results based on "proximity matching"

I don't know of any off-the-shelf code for this purpose, so you may be
stuck with implementing it yourself based on basic XML APIs and/or as a
complicated stylesheet.
 

pbd22

Hi.

Thanks.

I figured this. I have done a bit of searching and it seems that XPath,
XML/XSLT and CSS are the way to go.

I am very new to this and have a follow-up question. I am just trying to
get going and am having trouble getting an example to work.

I was attempting to load categories.xml from my server using the
commented-out code below (at the bottom of the page). It looks right, but it
fails at xmlDoc = new ActiveXObject(...

So it seems that the ActiveXObject part is causing it to fail. I then
added the "testing" code below, from another web site, to explore what to
use for my xmlhttp variable, and I get the categories.xml file from the
server just fine.

But I need to be able to use the MSXML API as in the commented code.
How do I access "Msxml2.DOMDocument.6.0"? What am I doing wrong?

Thanks.

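<script type="text/javascript">

function BuildDocument() {

    var xmlhttp = false;

    // Older Internet Explorer: create the request object through ActiveX.
    // A request object needs an XMLHTTP ProgID; "Msxml2.DOMDocument.6.0"
    // is a parser and has no open()/send() methods.
    try {
        xmlhttp = new ActiveXObject("Msxml2.XMLHTTP");
    } catch (e) {
        try {
            xmlhttp = new ActiveXObject("Microsoft.XMLHTTP");
        } catch (E) {
            xmlhttp = false;
        }
    }

    // Other browsers: native XMLHttpRequest.
    if (!xmlhttp && typeof XMLHttpRequest != 'undefined') {
        try {
            xmlhttp = new XMLHttpRequest();
        } catch (e) {
            xmlhttp = false;
        }
    }
    // Last resort for browsers that expose window.createRequest.
    if (!xmlhttp && window.createRequest) {
        try {
            xmlhttp = window.createRequest();
        } catch (e) {
            xmlhttp = false;
        }
    }

    // Give up quietly if no request object could be created.
    if (!xmlhttp) {
        return;
    }

    xmlhttp.open("GET", "categories.xml", true);
    xmlhttp.onreadystatechange = function() {
        // readyState 4 means the response has fully arrived.
        if (xmlhttp.readyState == 4) {
            document.getElementById('results').innerHTML =
                xmlhttp.responseText;
        }
    };

    xmlhttp.send(null);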

    // ----------------------------------
    // COMMENTED OUT CODE:
    // ----------------------------------

    /************************************************

    // Load XML
    var xmlDoc = new ActiveXObject("Msxml2.DOMDocument.6.0");
    xmlDoc.async = false;
    xmlDoc.validateOnParse = false;
    xmlDoc.load("categories.xml");

    // Load XSL
    var xsl = new ActiveXObject("Msxml2.DOMDocument.6.0");
    xsl.async = false;
    xsl.load("categories.xsl");

    // Transform (uses xmlDoc, the document loaded above)
    document.write(xmlDoc.transformNode(xsl));

    *************************************************/

}

</script>
 

Peter Flynn

pbd22 said:
Hi.

Thanks.

I figured this. I have done a bit of searching and it seems that XPath,
XML/XSLT and CSS are the way to go.

For a single file this will probably work, but for anything bigger
(eg a folder-full) you really need an indexing engine, otherwise it
will take forever.

The problem with proximity search in marked text is to decide what
"proximate" means. If you allow proximity to bleed over markup
boundaries, you increase the number of hits but you risk them being
inaccurate or misleading. For example if you search for "character
function" with proximity set to more than 12 words, the text

<para>...stuff...and his character was by far the strongest
in the play.</para>
</section>
</chapter>
<chapter>
<head>Set Design</head>
<para>The function of set design in Restoration drama...</para>

will produce a hit which computer scientists may not expect. IMHE
the acceptable limit is to allow proximity to bleed across markup
in mixed content plus the first higher level of element content.
This would allow it to operate across (for example) adjacent
paragraphs, but not across adjacent sections or chapters.

This has implications for the indexing engine, as it needs to store
not only the character offsets of words but also their markup depth
and adjacency. Very few manage to do this correctly, despite the
original technique having been implemented a long time ago (PAT).
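
A minimal sketch of that boundary rule in plain JavaScript, for illustration only. It assumes the indexer has already flattened the document into word records that carry a running word offset and the id of the enclosing section-level element. The record shape (word, pos, section), the sample data and the function name proximityHits are all hypothetical and not taken from any real indexing engine.

```javascript
// Hypothetical index records: one per word, with its running word offset
// and the id of the enclosing section-level element (an assumption).
var index = [
  { word: "character", pos: 40, section: "ch1-s3" },
  { word: "strongest", pos: 45, section: "ch1-s3" },
  { word: "function",  pos: 52, section: "ch2-s1" },
  { word: "character", pos: 90, section: "ch2-s1" },
  { word: "function",  pos: 95, section: "ch2-s1" }
];

// Return pairs of (a, b) occurrences that lie within maxDistance words of
// each other AND inside the same section. Pairs that are close only because
// a section or chapter boundary falls between them are rejected.
function proximityHits(index, a, b, maxDistance) {
  var hits = [];
  for (var i = 0; i < index.length; i++) {
    if (index[i].word !== a) continue;
    for (var j = 0; j < index.length; j++) {
      if (index[j].word !== b) continue;
      var distance = Math.abs(index[i].pos - index[j].pos);
      var sameSection = index[i].section === index[j].section;
      if (distance <= maxDistance && sameSection) {
        hits.push({ first: index[i].pos, second: index[j].pos, distance: distance });
      }
    }
  }
  return hits;
}

var hits = proximityHits(index, "character", "function", 12);
```

With this sample data the pair at offsets 40 and 52 is within 12 words but straddles a chapter boundary, so it is dropped; only the pair at 90 and 95 is reported. Paragraph boundaries inside one section are deliberately ignored here, which matches the "adjacent paragraphs but not adjacent sections" limit.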

///Peter
 

pbd22

Hi Peter.

OK, thanks. Well, I guess then I am in luck (kind of).
I am only doing a search on a single file (categories.xml).
The file, however, is very large and quite detailed: there are
subcategories of subcategories of subcategories, and so on.

The good news is that the file does not take user input or,
for that matter, any text at all. It simply serves as a way for
users to search for a term, say "Hard Drive", and find which
of the many available categories match that term. The response
from the server should be as many (even remotely) related paths
as possible, with their associated relevancy rank:

1) Technology > Hardware > Hard Drives                100%
2) Cinema > Movies > Features > "Hard Drive"           93%
3) Books > Politics > Elections > "Hard Drive"         90%
4) Books > Sports > Swimming > Biography               36%
5) Media > News > International > Art                  30%
6) Music > New Age                                      4%

So my example doesn't really match yours, where paragraphs with
massive contextual differences could produce very misleading results.

What I do need to understand is how to rank such a search. How would
the logic work for scoring number (2) as 93% and (3) as 90%, say? Should I be
including a series of related words in the XML for each topic, so that those
with "more" related words get a higher rank? That seems very crude. I'll do
research on indexing engines but, based on what you said, it seems like that
may be overkill, since I am working with a single file (categories.xml and
categories.xsl) and am not dealing with wordy paragraphs.

Thanks again.
 

Joe Kesselman

pbd22 said:
What I do need to understand is how to rank such a search. How would
the logic work for scoring number (2) as 93% and (3) as 90%, say?

That's an application design issue, not an XML issue per se.
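
Purely as an illustration of what such application logic might look like, here is a small JavaScript sketch of the "related words per category" idea. Everything in it is an assumption: the category data, the keywords lists, the function names, and the weights (1 for a query word that begins a title word, 0.5 for one found only among the keywords). It does not try to reproduce the 93% and 90% figures above.

```javascript
// Hypothetical category tree; titles and keyword lists are made-up data.
var categories = [
  { title: "Technology", children: [
    { title: "Hardware", children: [
      { title: "Hard Drives", keywords: ["disk", "storage"], children: [] }
    ] }
  ] },
  { title: "Music", children: [
    { title: "New Age", keywords: ["drive"], children: [] }
  ] }
];

// Lower-case a string and split it into words.
function words(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(function (w) {
    return w.length > 0;
  });
}

// Score one node as a percentage: each query word counts 1 if it begins a
// word of the title, 0.5 if it only appears among the keywords, else 0.
function scoreNode(node, queryWords) {
  var titleWords = words(node.title);
  var keywordWords = words((node.keywords || []).join(" "));
  var total = 0;
  queryWords.forEach(function (q) {
    var inTitle = titleWords.some(function (t) { return t.indexOf(q) === 0; });
    var inKeywords = keywordWords.indexOf(q) !== -1;
    total += inTitle ? 1 : (inKeywords ? 0.5 : 0);
  });
  return Math.round(100 * total / queryWords.length);
}

// Walk the tree and collect every path whose own node scores above zero.
function search(nodes, queryWords, path, results) {
  nodes.forEach(function (node) {
    var here = path.concat(node.title);
    var score = scoreNode(node, queryWords);
    if (score > 0) results.push({ path: here.join(" > "), score: score });
    search(node.children || [], queryWords, here, results);
  });
  return results;
}

var ranked = search(categories, words("Hard Drive"), [], [])
  .sort(function (a, b) { return b.score - a.score; });
```

For the query "Hard Drive" this ranks "Technology > Hardware > Hard Drives" at 100, "Technology > Hardware" at 50 (only "hard" matches, as the start of "hardware") and "Music > New Age" at 25 (a keyword-only match). The same data could just as well live in the XML as a keywords attribute on each category element; the weights are where the design decisions are.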
 

pbd22

OK, fair enough.

I was just hoping that somebody could give me some ideas
about how to structure my categories.xml file for the kind of
search I am trying to do.

Another poster provided some useful code for the XSL file (below).
But now I would appreciate it if somebody could show me how to pass the
value from the user's search string on the client to the XSL file, and how to
structure the XML file for the kind of "proximity searching" that
I was discussing with Peter. Should each node contain a string
of keywords?

If this is an application design issue and not an XML problem, fair
enough. Otherwise, advice appreciated.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
  <!-- To take the search string from the caller, change the variable
       below into: <xsl:param name="data"/> -->
  <xsl:variable name="data" select="'met sport baseball'"/>
  <xsl:variable name="upperCase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
  <xsl:variable name="lowerCase" select="'abcdefghijklmnopqrstuvwxyz'"/>
  <xsl:variable name="test" select="translate($data,$upperCase,$lowerCase)"/>
  <xsl:template match="/">
    <xsl:apply-templates select="*/*">
      <xsl:with-param name="search" select="$test"/>
    </xsl:apply-templates>
  </xsl:template>
  <xsl:template match="*">
    <xsl:param name="search"/>
    <xsl:variable name="result">
      <xsl:call-template name="searching">
        <xsl:with-param name="Sdeb" select="$search"/>
        <xsl:with-param name="Send" />
        <xsl:with-param name="val" select="."/>
      </xsl:call-template>
    </xsl:variable>
    <xsl:if test="string($result)=''">
      found <xsl:value-of select="."/>
    </xsl:if>
  </xsl:template>
  <xsl:template match="*[@title]">
    <xsl:param name="search"/>
    <xsl:variable name="result">
      <xsl:call-template name="searching">
        <xsl:with-param name="Sdeb" select="$search"/>
        <xsl:with-param name="Send" />
        <xsl:with-param name="val" select="@title"/>
      </xsl:call-template>
    </xsl:variable>
    <xsl:choose>
      <xsl:when test="string($result)=''">
        found <xsl:value-of select="@title"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates select="*">
          <xsl:with-param name="search" select="string($result)"/>
        </xsl:apply-templates>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
  <xsl:template name="searching">
    <xsl:param name="Sdeb"/>
    <xsl:param name="Send"/>
    <xsl:param name="val"/>
    <xsl:variable name="trans">
      <xsl:choose>
        <xsl:when test="contains($Sdeb,' ')">
          <xsl:value-of select="substring-before($Sdeb,' ')"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="$Sdeb"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:variable>
    <xsl:variable name="word" select="string($trans)"/>
    <xsl:choose>
      <xsl:when test="contains(translate($val,$upperCase,$lowerCase),$word)">
        <xsl:choose>
          <xsl:when test="$Sdeb=$word">
            <xsl:value-of select="$Send"/>
          </xsl:when>
          <xsl:otherwise>
            <xsl:call-template name="searching">
              <xsl:with-param name="Sdeb" select="substring-after($Sdeb,' ')"/>
              <xsl:with-param name="Send" select="$Send"/>
              <xsl:with-param name="val" select="$val"/>
            </xsl:call-template>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:when>
      <xsl:otherwise>
        <xsl:choose>
          <xsl:when test="$Sdeb=$word">
            <xsl:value-of select="concat($Send,' ',$word)"/>
          </xsl:when>
          <xsl:otherwise>
            <xsl:call-template name="searching">
              <xsl:with-param name="Sdeb" select="substring-after($Sdeb,' ')"/>
              <xsl:with-param name="Send" select="concat($Send,' ',$word)"/>
              <xsl:with-param name="val" select="$val"/>
            </xsl:call-template>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
 

Joe Kesselman

pbd22 said:
but now, if somebody could show me how to pass the value from
the user's search string on the client to the XSL file

Look up "stylesheet parameters". The exact syntax for passing them in
varies from one XSLT processor to another, but the XSL syntax is the
same in all processors.

Getting it from the client to a server is, presumably, standard client
forms and server programming.
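
As a rough client-side sketch of the parameter-passing step, assuming a browser that provides the standard XSLTProcessor object (MSXML in Internet Explorer uses a different API) and assuming the stylesheet declares a top-level xsl:param named "data", as the comment in the stylesheet above suggests. The function name transformWithSearch is made up for the example, and loading the two documents is left out.

```javascript
// Apply an already-loaded stylesheet to an already-loaded XML document,
// handing the user's search string to the stylesheet as parameter "data".
// xmlDoc and xslDoc are DOM Documents; targetDoc is the page's document,
// which will own the resulting fragment.
function transformWithSearch(xmlDoc, xslDoc, searchString, targetDoc) {
  var processor = new XSLTProcessor();
  processor.importStylesheet(xslDoc);
  // First argument is the parameter's namespace URI; null means none.
  processor.setParameter(null, "data", searchString);
  return processor.transformToFragment(xmlDoc, targetDoc);
}
```

In a page this might be called as transformWithSearch(xmlDoc, xslDoc, "hard drive", document), and the returned fragment appended to the results element. The stylesheet side stays the same whichever processor is used.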

pbd22 said:
and, how to
structure the XML file for the kind of "proximity searching" that
I was discussing with Peter.

As I say, I think that's drifting off from XML into basic programming
and data-structure design. Others may, of course, disagree.
 
