Xml search

Jongware

Hi,
Can you please show the way to quickly search such big Xml file, in a Visual C++ project?
http://dl.dropbox.com/u/40211031/List.zip

Did you generate these 1,000,002 lines of XML data, or is this from the
real world?

In case someone does not like downloading 57 megs of zipped file, or
expanding it into 722 megs of rather pointless example lines: here is an
abbreviated version:

<?xml version="1.0" encoding="UTF-16"?>
<Appdata>
<Data Attr0="00" Attr1="01" Attr2="02" Attr3="03" Attr4="04" Attr5="05"
Attr6="06" Attr7="07" Attr8="08" Attr9="09" Attr10="010" Attr11="011"
Attr12="012" Attr13="013" Attr14="014" Attr15="015" Attr16="016"
Attr17="017" Attr18="018" Attr19="019">Node_Number0</Data>
<Data Attr0="10" Attr1="11" Attr2="12" Attr3="13" Attr4="14" Attr5="15"
Attr6="16" Attr7="17" Attr8="18" Attr9="19" Attr10="110" Attr11="111"
Attr12="112" Attr13="113" Attr14="114" Attr15="115" Attr16="116"
Attr17="117" Attr18="118" Attr19="119">Node_Number1</Data>
... (999,998 similar lines omitted) ...
<Data Attr0="9999980" Attr1="9999981" Attr2="9999982" Attr3="9999983"
Attr4="9999984" Attr5="9999985" Attr6="9999986" Attr7="9999987"
Attr8="9999988" Attr9="9999989" Attr10="99999810" Attr11="99999811"
Attr12="99999812" Attr13="99999813" Attr14="99999814" Attr15="99999815"
Attr16="99999816" Attr17="99999817" Attr18="99999818"
Attr19="99999819">Node_Number999998</Data>
<Data Attr0="9999990" Attr1="9999991" Attr2="9999992" Attr3="9999993"
Attr4="9999994" Attr5="9999995" Attr6="9999996" Attr7="9999997"
Attr8="9999998" Attr9="9999999" Attr10="99999910" Attr11="99999911"
Attr12="99999912" Attr13="99999913" Attr14="99999914" Attr15="99999915"
Attr16="99999916" Attr17="99999917" Attr18="99999918"
Attr19="99999919">Node_Number999999</Data>
</Appdata>

I'm assuming you *generated* this file by way of example. If not, well,
it's so extremely structured that you could throw it away and use a
simple algorithm to generate the "data" for any line immediately. (And
then it would not be "data", it would be a calculation.)

Anyway, XML is a poor choice for this particular set of data. Write a
program to convert it into a binary format, where each "line" uses 20
integers (one per attribute) and one string of a fixed length of 20
bytes. With 4-byte ints that takes up no more than
1,000,000 x (20 * sizeof(int) + 20) ~ 100 MB of memory. Small
enough to be loaded into the RAM of today's computers.

How to search "quickly" depends on what you want to search for. If, for
example, you need to find a single digit anywhere in an attribute or in
the content (say, a '9' that can occur in the middle of 'Attr2="4593252"'),
you are better off storing everything as strings. You could also sort the
list on one or more of the Attr fields, and, if you prefer lookup speed
over memory usage, you could even keep a sorted list for *each* of the
attribute fields plus the data field, each one holding pointers to the
'actual' data.

[Jw]
 
wmedwardchan

Many thanks Jong. Could I have the details in Visual C++ code, to search the binary format in the way you suggested?

Many Thanks & Best Regards,
HuaMin
 
Jongware

That would be

qsort (...);
result = bsearch (..);

-- you can look up the correct syntax for both qsort and bsearch
elsewhere. (It's beyond the scope of c.t.xml anyway.)

[Jw]
 
wmedwardchan

Thanks. But did you see my Xml file above? Qsort is to sort a list of items. How is it applicable to my Xml file?

Many Thanks & Best Regards,
Edward Chan
 
huamin_chen

JW,
Furthermore, do you think it is feasible to load the very long list (shown above) into an array, as you suggested?

Many Thanks & Best Regards,
HuaMin
 
Jongware

On 01-Nov-12 5:41 AM, (e-mail address removed) wrote:
[..] did you see my Xml file above? Qsort is to sort a list of items.
How is it applicable to my Xml file?

bsearch is a function for very quickly looking up any item, but the
items have to be sorted first.
That's also the reason you have to pick a single key to sort on -- the
key you want to look up 'quickly'. If you want to be able to look up
*any* value of the 20 attributes, plus the content string, make 21
sorted lists.
To be able to give a less generic answer, we'd need to know much more of
the data set and what data item(s) need to be looked up.

Furthermore, do you think it is feasible to load the very long list
(shown above) into an array, like what you said?

Why would it not be feasible? It seems a very simple data array, with 20
integers and a string content (possibly of a limited length).

I advise you to ask on one of the comp.programming groups; preferably
NOT on one dealing with 'Windows', because the requirement for Visual C
is virtually unimportant here, but on one of the generic C/C++ groups.

[Jw]
 
wmedwardchan

Thanks a lot. What is the algorithm to sort my sample Xml file above? Which other group would be better for further questions on this issue?
 
Joe Kesselman

Can you please show the way to quickly search such big Xml file, in a Visual C++ project?

Depends entirely on what kind of search you're doing and how often
you're going to be searching the same document.

A simple SAX parser feeding a SAX handler which discards everything but
the data you're interested in would be one solution.

Or XPath/XQuery if you need a serious search language. (XQuery or XSLT
if your goal is to generate an XML report document.)

Or use the SAX parser to load the XML into an in-memory data structure
optimized for whatever kinds of searches you're performing, and run the
search against that. Which is what most full implementations of
XPath/XSLT/XQuery do under the covers.


--
Joe Kesselman,
http://www.love-song-productions.com/people/keshlam/index.html

{} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
/\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
 
huamin_chen

Thanks a lot. Can you please provide me with a C++ sample project that uses a SAX parser?
 
Joe Kesselman

Most SAX parsers come with sample programs. Pick your favorite (I'm
biased toward Apache Xerces) and look at those. And/or try checking the
many tutorials and articles on http://www.ibm.com/DeveloperWorks/xml --
those are mostly slanted toward Java, but the same principles apply.

(I cite DeveloperWorks for several reasons. I admit to an association
with IBM... but DeveloperWorks really is run almost as an independent
web magazine, and the content is fairly extensive, better than average,
and not noticeably biased. In fact, I've had arguments with the editors
on occasion when they've included something that I thought conflicted
with IBM's interests; their response was to keep the article I was
complaining about up and invite a separate article disagreeing with it.)

 
wmedwardchan

Many thanks Joe. Is it possible to load the very big Xml file (the one I originally showed in this thread) into a SAX parser? Will it lead to bad processing speed? Have a great weekend!
 
Joe Kesselman

Depends entirely on what you need to do with the document. SAX is just a
parser; it produces events, and it's up to you to decide what to do in
response to those events. One obvious thing you can do is build a
complete in-memory model such as the DOM. On the other hand, for some
tasks you may be able to note and discard most of the data as it goes
by, keeping only the parts your task actually needs -- and possibly
doing the computation as you go, to further minimize how much you need
to keep.

This is sometimes referred to as "streaming" processing. There are
streaming subset implementations of XPath (Xerces-J comes with one; I'm
not sure about Xerces-C), or of course you can hand-code the search logic.

 
huamin_chen

Thanks. Is there a sample that builds a DOM from a given Xml file?
 
Joe Kesselman

Most parsers come with sample programs. Start with that. See also the
many XML tutorials and articles on the web -- standard citation here for
the resources at http://developerworks.ibm.com/xml, which I consider
better than most.


 
Joe Kesselman

(Reminder: if the document is "very big", you may find that the standard
DOM is not the best answer. See my previous comments, and the resources
mentioned.)
 
wmedwardchan

Thanks Joe. Did you ever open my Xml file? Is it possible to work with it using the DOM?
 
Joe Kesselman

No, and yes assuming it's well-formed XML, respectively. Whether the DOM
is the *best* way to work with it depends on what you're doing and on
what kind of machine.

 
