Xml search

Discussion in 'XML' started by huamin_chen@ymail.com, Nov 1, 2012.

  1. Guest

    , Nov 1, 2012
    #1

  2. Jongware Guest

    On 01-Nov-12 5:41 AM, wrote:
    > Hi,
    > Can you please show the way to quickly search such big Xml file, in a Visual C++ project?
    > http://dl.dropbox.com/u/40211031/List.zip


    Did you generate these 1,000,002 lines of XML data, or is this from the
    real world?

    In case someone does not like downloading 57 megs of zipped file, or
    expanding it into 722 megs of rather pointless example lines: here is an
    abbreviated version:

    <?xml version="1.0" encoding="UTF-16"?>
    <Appdata>
    <Data Attr0="00" Attr1="01" Attr2="02" Attr3="03" Attr4="04" Attr5="05"
    Attr6="06" Attr7="07" Attr8="08" Attr9="09" Attr10="010" Attr11="011"
    Attr12="012" Attr13="013" Attr14="014" Attr15="015" Attr16="016"
    Attr17="017" Attr18="018" Attr19="019">Node_Number0</Data>
    <Data Attr0="10" Attr1="11" Attr2="12" Attr3="13" Attr4="14" Attr5="15"
    Attr6="16" Attr7="17" Attr8="18" Attr9="19" Attr10="110" Attr11="111"
    Attr12="112" Attr13="113" Attr14="114" Attr15="115" Attr16="116"
    Attr17="117" Attr18="118" Attr19="119">Node_Number1</Data>
    .... (999,998 similar lines omitted) ...
    <Data Attr0="9999980" Attr1="9999981" Attr2="9999982" Attr3="9999983"
    Attr4="9999984" Attr5="9999985" Attr6="9999986" Attr7="9999987"
    Attr8="9999988" Attr9="9999989" Attr10="99999810" Attr11="99999811"
    Attr12="99999812" Attr13="99999813" Attr14="99999814" Attr15="99999815"
    Attr16="99999816" Attr17="99999817" Attr18="99999818"
    Attr19="99999819">Node_Number999998</Data>
    <Data Attr0="9999990" Attr1="9999991" Attr2="9999992" Attr3="9999993"
    Attr4="9999994" Attr5="9999995" Attr6="9999996" Attr7="9999997"
    Attr8="9999998" Attr9="9999999" Attr10="99999910" Attr11="99999911"
    Attr12="99999912" Attr13="99999913" Attr14="99999914" Attr15="99999915"
    Attr16="99999916" Attr17="99999917" Attr18="99999918"
    Attr19="99999919">Node_Number999999</Data>
    </Appdata>

    I'm assuming you *generated* this file by way of example. If not, well,
    it's so extremely structured that you could throw it away and use a
    simple algorithm to generate the "data" for any line immediately. (And
    then it would not be "data", it would be a calculation.)
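To illustrate the "calculation" remark: judging only from the abbreviated sample above, attribute k of row n appears to be the digits of n followed by the digits of k, and the content appears to be "Node_Number" plus n. A minimal sketch under that assumption (the function names are made up for illustration, and the real file may not follow the pattern everywhere):

```cpp
#include <cassert>
#include <string>

// Assumption read off the abbreviated sample: in row `row`, attribute k holds
// the decimal digits of the row number followed by the decimal digits of k.
std::string attr_value(unsigned row, unsigned k) {
    assert(k < 20);  // the sample shows Attr0 .. Attr19
    return std::to_string(row) + std::to_string(k);
}

// Assumption: the element content is "Node_Number" followed by the row number.
std::string node_name(unsigned row) {
    return "Node_Number" + std::to_string(row);
}
```

If the pattern holds, any "line" can be produced on demand and nothing has to be stored or searched at all.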

    Anyway, XML is a poor choice for this particular set of data. Write a
    program to convert it into a binary format, where each "line" uses 20
    integers (one per attribute) and one string of a fixed length of 20
    bytes. That takes up no more than 1,000,000 x (20 * sizeof(int) + 20)
    ~ 100 MB of memory with 4-byte ints. Small enough to be loaded into the
    RAM of today's computers.
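A fixed-size record along these lines could look as follows. This is only a sketch: it assumes one 32-bit integer per attribute (the sample shows Attr0 to Attr19), a 20-byte name field, and a file that is written and read on the same machine (no byte-order handling). The names Record, set_name, save_records and load_records are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

struct Record {
    std::int32_t attr[20];  // Attr0 .. Attr19 as numbers
    char name[20];          // element content, zero-padded
};
static_assert(sizeof(Record) == 100, "expected 20 * 4 + 20 bytes without padding");

// Copy a name into the fixed-size field; names longer than 19 chars are cut off.
void set_name(Record& r, const std::string& name) {
    std::memset(r.name, 0, sizeof r.name);
    std::memcpy(r.name, name.data(), std::min(name.size(), sizeof r.name - 1));
}

// Write all records as one raw block; returns false on I/O failure.
bool save_records(const std::string& path, const std::vector<Record>& recs) {
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(recs.data()),
              static_cast<std::streamsize>(recs.size() * sizeof(Record)));
    return static_cast<bool>(out);
}

// Read the whole file back into memory; returns an empty vector if it cannot be opened.
std::vector<Record> load_records(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return {};
    const std::streamsize bytes = in.tellg();
    assert(bytes % static_cast<std::streamsize>(sizeof(Record)) == 0);  // whole records only
    std::vector<Record> recs(static_cast<std::size_t>(bytes) / sizeof(Record));
    in.seekg(0);
    in.read(reinterpret_cast<char*>(recs.data()),
            static_cast<std::streamsize>(recs.size() * sizeof(Record)));
    return recs;
}
```

With this layout a million records occupy 100 bytes each, so the whole table is one contiguous block that can be read in a single call.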

    How to search "quickly" depends on what you want to search for. If, for
    example, you may need to grab a single digit out of any attribute or
    content (say, a '9' that can occur in the middle of 'Attr2="4593252"'),
    you are better off storing everything as strings. You could also sort the
    list on one or more of the Attr fields, and, if you prefer lookup speed
    over memory usage, you could even sort on *all* of the attribute fields
    plus the data field, and save pointers to the 'actual' data.
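For the "digit in the middle of a value" case, a plain linear scan over string fields may already be enough. The sketch below assumes the rows have been loaded into memory as strings; TextRecord and rows_containing are hypothetical names, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct TextRecord {
    std::string attr[20];  // attribute values kept as text
    std::string content;   // element content
};

// Return the indexes of all rows in which attribute k contains `needle`.
std::vector<std::size_t> rows_containing(const std::vector<TextRecord>& rows,
                                         std::size_t k, const std::string& needle) {
    assert(k < 20);
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < rows.size(); ++i)
        if (rows[i].attr[k].find(needle) != std::string::npos)
            hits.push_back(i);
    return hits;
}
```

A substring search like this cannot use a sorted list, so it visits every row; sorting only pays off for exact-value or prefix lookups.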

    [Jw]
    Jongware, Nov 1, 2012
    #2

  3. Guest

    Many thanks, Jong. Could you give the details in Visual C++ code for searching the binary format in the way you suggested?

    Many Thanks & Best Regards,
    HuaMin
    , Nov 2, 2012
    #3
  4. Jongware Guest

    On 02-Nov-12 3:20 AM, wrote:
    > Many thanks Jong. Can I have the details in Visual C++ codes? To search the binary format in the way you suggested.


    That would be

    qsort (...);
    result = bsearch (..);

    -- you can look up the correct syntax for both qsort and bsearch
    elsewhere. (It's beyond the scope of c.t.xml anyway.)
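For reference, the two standard library calls fit together roughly as below. The Row layout and the choice of Attr0 as the search key are assumptions made for the sake of the example; only std::qsort and std::bsearch themselves come from the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

struct Row {
    int attr[20];
    char name[20];
};

const int kKey = 0;  // hypothetical choice: sort and search by Attr0

// Comparison callback shared by qsort and bsearch.
int compare_by_key(const void* a, const void* b) {
    const int x = static_cast<const Row*>(a)->attr[kKey];
    const int y = static_cast<const Row*>(b)->attr[kKey];
    return (x > y) - (x < y);  // avoids the overflow that x - y could cause
}

// Sort the array in place on the key attribute.
void sort_rows(Row* rows, std::size_t count) {
    assert(rows != nullptr || count == 0);
    std::qsort(rows, count, sizeof(Row), compare_by_key);
}

// Returns a pointer to a row whose key attribute equals `value`, or nullptr.
// The array must already be sorted with sort_rows.
const Row* find_row(const Row* rows, std::size_t count, int value) {
    Row probe = {};
    probe.attr[kKey] = value;
    return static_cast<const Row*>(
        std::bsearch(&probe, rows, count, sizeof(Row), compare_by_key));
}
```

The sort is done once after loading; every lookup after that is a binary search, which needs about 20 comparisons for a million rows.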

    [Jw]
    Jongware, Nov 2, 2012
    #4
  5. Guest

    Thanks. But did you see my Xml file above? Qsort is to sort a list of items. How is it applicable to my Xml file?

    Many Thanks & Best Regards,
    Edward Chan
    , Nov 3, 2012
    #5
  6. Guest

    JW,
    Furthermore, do you think it is feasible to load the very long list (shown above) into an array, as you suggested?

    Many Thanks & Best Regards,
    HuaMin
    , Nov 3, 2012
    #6
  7. Jongware Guest

    > [..] did you see my Xml file above? Qsort is to sort a list of items.
    > How is it applicable to my Xml file?

    bsearch is a function for very quickly looking up any item, but the
    items have to be sorted first.
    That's also the reason you have to pick a single key to sort on -- the
    key you want to look up 'quickly'. If you want to be able to look up
    *any* value of the 20 attributes, plus the content string, make 21
    sorted lists.
    To be able to give a less generic answer, we'd need to know much more of
    the data set and what data item(s) need to be looked up.
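One way to get several "sorted lists" without copying the rows is to keep one array of row numbers per attribute and sort only those. The sketch below swaps qsort/bsearch for std::sort and std::lower_bound, covers only the 20 numeric attributes (the content string would need a 21st index of its own), and uses made-up names throughout.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

using Attrs = std::array<std::int32_t, 20>;
using Indexes = std::array<std::vector<std::uint32_t>, 20>;

// order[k] lists the row numbers sorted by attribute k; the rows themselves stay put.
Indexes build_indexes(const std::vector<Attrs>& rows) {
    Indexes order;
    for (std::size_t k = 0; k < 20; ++k) {
        order[k].resize(rows.size());
        std::iota(order[k].begin(), order[k].end(), 0u);
        std::sort(order[k].begin(), order[k].end(),
                  [&rows, k](std::uint32_t a, std::uint32_t b) {
                      return rows[a][k] < rows[b][k];
                  });
    }
    return order;
}

// Binary search in the index for attribute k; returns a matching row number or -1.
long find_by_attr(const std::vector<Attrs>& rows,
                  const std::vector<std::uint32_t>& order_k,
                  std::size_t k, std::int32_t value) {
    assert(k < 20);
    auto it = std::lower_bound(order_k.begin(), order_k.end(), value,
                               [&rows, k](std::uint32_t row, std::int32_t v) {
                                   return rows[row][k] < v;
                               });
    if (it == order_k.end() || rows[*it][k] != value) return -1;
    return static_cast<long>(*it);
}
```

Each index costs 4 bytes per row, so 20 of them add about 80 MB for a million rows; that is the memory-for-speed trade mentioned earlier in the thread.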


    > Furthermore, do you think it is feasible to load the very long list
    > (shown above) into an array, like what you said


    Why would it not be feasible? It seems a very simple data array, with 20
    integers and a string content (possibly of a limited length).
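Filling such an array from the XML could be sketched as below. This is not a real XML parser: it assumes the text has already been converted from UTF-16 to a narrow string, that each element has exactly the attributes Attr0 to Attr19 in that order, and that no value contains an escaped quote. ParsedRow and parse_data_element are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

struct ParsedRow {
    long attr[20];
    std::string content;
};

// Parse one element of the form <Data Attr0="..." ... Attr19="...">text</Data>.
// Returns false if an expected piece is missing.
// Note: converting to numbers drops leading zeros, so "010" and "10" both become 10.
bool parse_data_element(const std::string& xml, ParsedRow& out) {
    std::size_t pos = 0;
    for (int k = 0; k < 20; ++k) {
        const std::string key = "Attr" + std::to_string(k) + "=\"";
        const std::size_t start = xml.find(key, pos);
        if (start == std::string::npos) return false;
        const std::size_t value = start + key.size();
        const std::size_t end = xml.find('"', value);  // closing quote of the value
        if (end == std::string::npos) return false;
        out.attr[k] = std::strtol(xml.substr(value, end - value).c_str(), nullptr, 10);
        pos = end + 1;
    }
    const std::size_t open = xml.find('>', pos);          // end of the start tag
    const std::size_t close = xml.find("</Data>", pos);   // start of the end tag
    if (open == std::string::npos || close == std::string::npos || close < open)
        return false;
    out.content = xml.substr(open + 1, close - open - 1);
    return true;
}
```

For real-world input a streaming XML parser would be the safer choice; the point here is only that each element maps directly onto one array entry.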

    I advise you to ask on one of the comp.programming groups; preferably
    NOT on one dealing with 'Windows', because the requirement for Visual C
    is virtually unimportant here, but on one of the generic C/C++ groups.

    [Jw]
    Jongware, Nov 5, 2012
    #7
  8. Guest

    On Monday, November 5, 2012 5:40:28 PM UTC+8, Jongware wrote:
    > On 03-Nov-12 16:27 PM, wrote:> On Friday, November
    >
    > 2, 2012 5:22:05 PM UTC+8, Jongware wrote:
    >
    > >> On 02-Nov-12 3:20 AM, wrote:

    >
    > >>

    >
    > >>> On Thursday, November 1, 2012 7:15:09 PM UTC+8, Jongware wrote:

    >
    > >>

    >
    > >>>> On 01-Nov-12 5:41 AM, wrote:

    >
    > >>>>> Hi,

    >
    > >>>>> Can you please show the way to quickly search such big Xml file,

    >
    > in a Visual C++ project?
    >
    > >>>>> http://dl.dropbox.com/u/40211031/List.zip

    >
    > >>>>

    >
    > >>>> Did you generate these 1,000,002 lines of XML data, or is this

    >
    > from the
    >
    > >>>> real world?

    >
    > >>>>

    >
    > >>>> In case someone does not like downloading 57 megs of zipped file, or

    >
    > >>>> expanding it into 722 megs of rather pointless example lines: here

    >
    > is an
    >
    > >>>> abbreviated version:

    >
    > >>>>

    >
    > >>>> <?xml version="1.0" encoding="UTF-16"?>

    >
    > >>>> <Appdata>

    >
    > >>>> <Data Attr0="00" Attr1="01" Attr2="02" Attr3="03" Attr4="04"

    >
    > Attr5="05"
    >
    > >>>> Attr6="06" Attr7="07" Attr8="08" Attr9="09" Attr10="010" Attr11="011"

    >
    > >>>> Attr12="012" Attr13="013" Attr14="014" Attr15="015" Attr16="016"

    >
    > >>>> Attr17="017" Attr18="018" Attr19="019">Node_Number0</Data>

    >
    > >>>> <Data Attr0="10" Attr1="11" Attr2="12" Attr3="13" Attr4="14"

    >
    > Attr5="15"
    >
    > >>>> Attr6="16" Attr7="17" Attr8="18" Attr9="19" Attr10="110" Attr11="111"

    >
    > >>>> Attr12="112" Attr13="113" Attr14="114" Attr15="115" Attr16="116"

    >
    > >>>> Attr17="117" Attr18="118" Attr19="119">Node_Number1</Data>

    >
    > >>>> ... (999,998 similar lines omitted) ...

    >
    > >>>> <Data Attr0="9999980" Attr1="9999981" Attr2="9999982" Attr3="9999983"

    >
    > >>>> Attr4="9999984" Attr5="9999985" Attr6="9999986" Attr7="9999987"

    >
    > >>>> Attr8="9999988" Attr9="9999989" Attr10="99999810" Attr11="99999811"

    >
    > >>>> Attr12="99999812" Attr13="99999813" Attr14="99999814"

    >
    > Attr15="99999815"
    >
    > >>>> Attr16="99999816" Attr17="99999817" Attr18="99999818"

    >
    > >>>> Attr19="99999819">Node_Number999998</Data>

    >
    > >>>> <Data Attr0="9999990" Attr1="9999991" Attr2="9999992" Attr3="9999993"

    >
    > >>>> Attr4="9999994" Attr5="9999995" Attr6="9999996" Attr7="9999997"

    >
    > >>>> Attr8="9999998" Attr9="9999999" Attr10="99999910" Attr11="99999911"

    >
    > >>>> Attr12="99999912" Attr13="99999913" Attr14="99999914"

    >
    > Attr15="99999915"
    >
    > >>>> Attr16="99999916" Attr17="99999917" Attr18="99999918"

    >
    > >>>> Attr19="99999919">Node_Number999999</Data>

    >
    > >>>> </Appdata>

    >
    > >>>>

    >
    > >>>> I'm assuming you *generated* this file by way of example. If not, well,
    > >>>> it's so extremely structured that you could throw it away and use a
    > >>>> simple algorithm to generate the "data" for any line immediately. (And
    > >>>> then it would not be "data", it would be a calculation.)
    > >>>>
    > >>>> Anyway, XML is a poor choice for this particular set of data. Write a
    > >>>> program to convert it into a binary format, where each "line" uses 10
    > >>>> integers and one string of a fixed length of 20 bytes. That takes up no
    > >>>> more than 1,000,000 x (10 * sizeof(int) + 20) ~ 60 MB of memory. Small
    > >>>> enough to be loaded into the RAM of today's computers.
    > >>>>
    > >>>> Search "quickly" depends on what you want to search for. If, for
    > >>>> example, you may need to grab a single digit out of any attribute or
    > >>>> content (say, a '9' that can occur in the middle of 'Attr2="4593252"'),
    > >>>> you are better off storing everything as string. You could also sort the
    > >>>> list on one or more of the Attr fields, and, if you prefer lookup speed
    > >>>> over memory usage, you could even sort on *all* of the attribute fields
    > >>>> plus the data field, and save pointers to the 'actual' data.
    > >>>>
    > >>>> [Jw]
    >
    > >>
    > >>> Many thanks Jong. Can I have the details in Visual C++ codes? To
    > >>> search the binary format in the way you suggested.
    > >>
    > >> That would be
    > >>
    > >> qsort (...);
    > >>
    > >> result = bsearch (..);
    > >>
    > >> -- you can look up the correct syntax for both qsort and bsearch
    > >> elsewhere. (It's beyond the scope of c.t.xml anyway.)
    >
    > >>>> On 01-Nov-12 5:41 AM, wrote:
    >
    > >> [..] did you see my Xml file above? Qsort is to sort a list of items.
    > >> How is it applicable to my Xml file?
    >
    > bsearch is a function for very quickly looking up any item, but the
    > items have to be sorted first.
    > That's also the reason you have to pick a single key to sort on -- the
    > key you want to look up 'quickly'. If you want to be able to look up
    > *any* value of the 20 attributes, plus the content string, make 21
    > sorted lists.
    > To be able to give a less generic answer, we'd need to know much more of
    > the data set and what data item(s) need to be looked up.
    >
    > > Furthermore, do you think it is feasible to load the very long list
    > > (shown above) into an array, like what you said
    >
    > Why would it not be feasible? It seems a very simple data array, with 20
    > integers and a string content (possibly of a limited length).
    >
    > I advise you to ask on one of the comp.programming groups; preferably
    > NOT on one dealing with 'Windows', because the requirement for Visual C
    > is virtually unimportant here, but on one of the generic C/C++ groups.
    >
    > [Jw]
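
For readers following the qsort/bsearch suggestion quoted above, here is a minimal C++ sketch of how the two calls fit together. The `Record` layout (20 integers plus a 20-byte text field), the choice of Attr0 as the lookup key, and the helper names `sortByAttr0`, `findByAttr0` and `makeRecord` are assumptions made for illustration; none of them were posted in the thread. With a 4-byte `int`, one such record is about 100 bytes, so a million records would need roughly 100 MB.

```cpp
#include <cstdlib>
#include <cstring>
#include <vector>

// One fixed-size record per <Data> element (hypothetical layout).
struct Record {
    int  attr[20];   // Attr0 .. Attr19, parsed as integers
    char text[20];   // element content, truncated and NUL-padded
};

// Comparison callback shared by qsort and bsearch: orders records by Attr0.
static int compareByAttr0(const void* a, const void* b) {
    const int x = static_cast<const Record*>(a)->attr[0];
    const int y = static_cast<const Record*>(b)->attr[0];
    return (x > y) - (x < y);   // avoids the overflow risk of x - y
}

// Sorts the records in place on Attr0; must be done once before any lookup.
void sortByAttr0(std::vector<Record>& records) {
    if (records.empty()) return;
    std::qsort(records.data(), records.size(), sizeof(Record), compareByAttr0);
}

// Binary search on the sorted records; returns nullptr when nothing matches.
const Record* findByAttr0(const std::vector<Record>& sorted, int wanted) {
    if (sorted.empty()) return nullptr;
    Record key{};
    key.attr[0] = wanted;
    return static_cast<const Record*>(std::bsearch(
        &key, sorted.data(), sorted.size(), sizeof(Record), compareByAttr0));
}

// Helper for building a record with only Attr0 and the text filled in.
Record makeRecord(int attr0, const char* text) {
    Record r{};   // zero-initialised, so text stays NUL-terminated
    r.attr[0] = attr0;
    std::strncpy(r.text, text, sizeof(r.text) - 1);
    return r;
}
```

To look up a different attribute, one would sort a separate copy (or an array of pointers into the records) with a comparison function for that attribute, which is the "21 sorted lists" idea from the quoted reply.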


    Thanks a lot. What algorithm should I use to sort my sample XML file above? Which other group would be better for follow-up questions on this issue?
    , Nov 6, 2012
    #8
  9. Guest

    On Thursday, November 1, 2012 12:41:11 PM UTC+8, wrote:
    > Hi,
    > Can you please show the way to quickly search such big Xml file, in a Visual C++ project?
    > http://dl.dropbox.com/u/40211031/List.zip
    >
    > Many Thanks & Best Regards,
    > HuaMin


    JW,
    Any advice on this?
    , Nov 7, 2012
    #9
  10. Manuel Collado, Nov 7, 2012
    #10
  11. On 11/1/2012 12:41 AM, wrote:
    > Can you please show the way to quickly search such big Xml file, in a Visual C++ project?


    Depends entirely on what kind of search you're doing and how often
    you're going to be searching the same document.

    A simple SAX parser feeding a SAX handler which discards everything but
    the data you're interested in would be one solution.

    Or XPath/XQuery if you need a serious search language. (XQuery or XSLT
    if your goal is to generate an XML report document.)

    Or use the SAX parser to load the XML into an in-memory data structure
    optimized for whatever kinds of searches you're performing, and run the
    search against that. Which is what most full implementations of
    XPath/XSLT/XQuery do under the covers.
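
A minimal sketch of that last option, assuming a handler with SAX-style callbacks. The `IndexingHandler` class, its method signatures and the `Attributes` alias are hypothetical stand-ins: a real SAX library such as Xerces-C++ supplies its own handler base class and string types, so this would need adapting. The idea shown is only the shape of it: keep one attribute as the key, keep the element text as the value, and drop everything else as it goes by.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>

// Attribute name -> value, as a parser callback might deliver them (hypothetical).
using Attributes = std::map<std::string, std::string>;

// Builds a hash index from one chosen attribute of each <Data> element
// to that element's text content.
class IndexingHandler {
public:
    explicit IndexingHandler(std::string keyAttr) : keyAttr_(std::move(keyAttr)) {
        assert(!keyAttr_.empty());
    }

    void startElement(const std::string& name, const Attributes& attrs) {
        inData_ = false;
        if (name != "Data") return;
        const auto it = attrs.find(keyAttr_);
        if (it == attrs.end()) return;   // no key attribute: skip this element
        inData_ = true;
        currentKey_ = it->second;
        text_.clear();
    }

    // Parsers may deliver the text of one element in several pieces.
    void characters(const std::string& chunk) {
        if (inData_) text_ += chunk;
    }

    void endElement(const std::string& name) {
        // A later element with the same key overwrites an earlier one.
        if (inData_ && name == "Data") index_[currentKey_] = text_;
        inData_ = false;
    }

    // Average constant-time lookup; returns nullptr when the key is absent.
    const std::string* lookup(const std::string& key) const {
        const auto it = index_.find(key);
        return it == index_.end() ? nullptr : &it->second;
    }

private:
    std::string keyAttr_;
    std::string currentKey_;
    std::string text_;
    bool inData_ = false;
    std::unordered_map<std::string, std::string> index_;
};
```

A hash map suits exact-match lookups; for range queries or substring searches a sorted container or a different structure would be the better fit.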


    --
    Joe Kesselman,
    http://www.love-song-productions.com/people/keshlam/index.html

    {} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
    /\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
    Joe Kesselman, Nov 7, 2012
    #11
  12. Guest

    On Thursday, November 8, 2012 6:13:12 AM UTC+8, Joe Kesselman wrote:
    > On 11/1/2012 12:41 AM, wrote:
    > > Can you please show the way to quickly search such big Xml file, in a Visual C++ project?
    >
    > Depends entirely on what kind of search you're doing and how often
    > you're going to be searching the same document.
    >
    > A simple SAX parser feeding a SAX handler which discards everything but
    > the data you're interested in would be one solution.
    >
    > Or XPath/XQuery if you need a serious search language. (XQuery or XSLT
    > if your goal is to generate an XML report document.)
    >
    > Or use the SAX parser to load the XML into an in-memory data structure
    > optimized for whatever kinds of searches you're performing, and run the
    > search against that. Which is what most full implementations of
    > XPath/XSLT/XQuery do under the covers.
    >
    > --
    > Joe Kesselman,
    > http://www.love-song-productions.com/people/keshlam/index.html
    >
    > {} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
    > /\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."


    Thanks a lot. Can you please provide me with one C++ sample project using a SAX parser?
    , Nov 13, 2012
    #12
  13. On 11/13/2012 3:29 AM, wrote:
    > Thanks a lot. Can you please provide me with one C++ sample project using a SAX parser?


    Most SAX parsers come with sample programs. Pick your favorite (I'm
    biased toward Apache Xerces) and look at those. And/or try checking the
    many tutorials and articles on http://www.ibm.com/DeveloperWorks/xml --
    those are mostly slanted toward Java, but the same principles apply.

    (I cite DeveloperWorks for several reasons. I admit to an association
    with IBM... but DeveloperWorks really is run almost as an independent
    web magazine, and the content is fairly extensive, better than average,
    and not noticeably biased. In fact, I've had arguments with the editors
    on occasion when they've included something that I thought conflicted
    with IBM's interests; their response was to keep the article I was
    complaining about up and invite a separate article disagreeing with it.)

    --
    Joe Kesselman,
    http://www.love-song-productions.com/people/keshlam/index.html

    {} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
    /\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
    Joe Kesselman, Nov 22, 2012
    #13
  14. Guest

    On Friday, November 23, 2012 6:35:16 AM UTC+8, Joe Kesselman wrote:
    > On 11/13/2012 3:29 AM, wrote:
    > > Thanks a lot. Can you please provide me with one C++ sample project using a SAX parser?
    >
    > Most SAX parsers come with sample programs. Pick your favorite (I'm
    > biased toward Apache Xerces) and look at those. And/or try checking the
    > many tutorials and articles on http://www.ibm.com/DeveloperWorks/xml --
    > those are mostly slanted toward Java, but the same principles apply.
    >
    > (I cite DeveloperWorks for several reasons. I admit to an association
    > with IBM... but DeveloperWorks really is run almost as an independent
    > web magazine, and the content is fairly extensive, better than average,
    > and not noticeably biased. In fact, I've had arguments with the editors
    > on occasion when they've included something that I thought conflicted
    > with IBM's interests; their response was to keep the article I was
    > complaining about up and invite a separate article disagreeing with it.)
    >
    > --
    > Joe Kesselman,
    > http://www.love-song-productions.com/people/keshlam/index.html
    >
    > {} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
    > /\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."


    Many thanks Joe. Is it possible to load the very big XML file (the one I originally showed in this thread) into a SAX parser? Will it lead to bad processing speed? Have a great weekend!
    , Nov 23, 2012
    #14
  15. On 11/22/2012 10:23 PM, wrote:
    > Many thanks Joe. Is it possible to load the very big XML file (the one I
    > originally showed in this thread) into a SAX parser? Will it lead to
    > bad processing speed? Have a great weekend!


    Depends entirely on what you need to do with the document. SAX is just a
    parser; it produces events, and it's up to you to decide what to do in
    response to those events. One obvious thing you can do is build a
    complete in-memory model such as the DOM. On the other hand, for some
    tasks you may be able to note and discard most of the data as it goes
    by, keeping only the parts your task actually needs -- and possibly
    doing the computation as you go, to further minimize how much you need
    to keep.

    This is sometimes referred to as "streaming" processing. There are
    streaming subset implementations of XPath (Xerces-J comes with one; I'm
    not sure about Xerces-C), or of course you can hand-code the search logic.
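
As a rough illustration of hand-coded streaming search, here is a sketch that reads the document one tag at a time and stops at the first match, so only the current tag is ever held in memory. It is not a real XML parser and makes several assumptions: the input has the flat `<Data ...>text</Data>` shape of the sample, it has already been converted from UTF-16 to a byte-oriented encoding such as UTF-8, attribute values use double quotes with no spaces around `=`, and no attribute value contains `>`. The function names are made up for this example.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// True if `tag` contains `needle` (e.g. Attr0="10") preceded by whitespace,
// so that a name such as XAttr0 is not mistaken for Attr0.
static bool tagHasAttribute(const std::string& tag, const std::string& needle) {
    for (std::string::size_type pos = tag.find(needle);
         pos != std::string::npos;
         pos = tag.find(needle, pos + 1)) {
        if (pos > 0 && std::isspace(static_cast<unsigned char>(tag[pos - 1]))) {
            return true;
        }
    }
    return false;
}

// Scans the stream tag by tag and stops at the first <Data> element whose
// attribute `attr` equals `wanted`; its text content is stored in `textOut`.
bool findFirstData(std::istream& in, const std::string& attr,
                   const std::string& wanted, std::string& textOut) {
    assert(!attr.empty());
    const std::string needle = attr + "=\"" + wanted + "\"";
    std::string skipped, tag;
    // Discard text up to the next '<', then read the tag body up to '>'.
    while (std::getline(in, skipped, '<') && std::getline(in, tag, '>')) {
        const bool isData = tag.compare(0, 4, "Data") == 0 && tag.size() > 4 &&
                            std::isspace(static_cast<unsigned char>(tag[4]));
        if (isData && tagHasAttribute(tag, needle)) {
            std::getline(in, textOut, '<');   // text runs up to the closing tag
            return true;
        }
    }
    return false;
}

// Convenience wrapper for trying the scan on a small in-memory document.
bool findFirstDataInString(const std::string& xml, const std::string& attr,
                           const std::string& wanted, std::string& textOut) {
    std::istringstream in(xml);
    return findFirstData(in, attr, wanted, textOut);
}
```

For a one-off search this avoids building any model of the document at all; for repeated searches over the same file, loading the data into an indexed structure once is likely to pay off.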

    --
    Joe Kesselman,
    http://www.love-song-productions.com/people/keshlam/index.html

    {} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
    /\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
    Joe Kesselman, Nov 24, 2012
    #15
  16. Guest

    On Saturday, November 24, 2012 2:06:26 PM UTC+8, Joe Kesselman wrote:
    > On 11/22/2012 10:23 PM, wrote:
    > > Many thanks Joe. Is it possible to load the very big XML file (the one I
    > > originally showed in this thread) into a SAX parser? Will it lead to
    > > bad processing speed? Have a great weekend!
    >
    > Depends entirely on what you need to do with the document. SAX is just a
    > parser; it produces events, and it's up to you to decide what to do in
    > response to those events. One obvious thing you can do is build a
    > complete in-memory model such as the DOM. On the other hand, for some
    > tasks you may be able to note and discard most of the data as it goes
    > by, keeping only the parts your task actually needs -- and possibly
    > doing the computation as you go, to further minimize how much you need
    > to keep.
    >
    > This is sometimes referred to as "streaming" processing. There are
    > streaming subset implementations of XPath (Xerces-J comes with one; I'm
    > not sure about Xerces-C), or of course you can hand-code the search logic.
    >
    > --
    > Joe Kesselman,
    > http://www.love-song-productions.com/people/keshlam/index.html
    >
    > {} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
    > /\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."


    Thanks. Is there a sample that builds a DOM from a given XML file?
    , Nov 27, 2012
    #16
  17. On 11/26/2012 11:32 PM, wrote:
    > Thanks. Is there a sample that builds a DOM from a given XML file?


    Most parsers come with sample programs. Start with that. See also the
    many XML tutorials and articles on the web -- standard citation here for
    the resources at http://developerworks.ibm.com/xml, which I consider
    better than most.


    --
    Joe Kesselman,
    http://www.love-song-productions.com/people/keshlam/index.html

    {} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
    /\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
    Joe Kesselman, Nov 28, 2012
    #17
  18. (Reminder: if the document is "very big", you may find that the standard
    DOM is not the best answer. See my previous comments, and the resources
    mentioned.)
    Joe Kesselman, Nov 28, 2012
    #18
  19. Guest

    On Wednesday, November 28, 2012 12:29:59 PM UTC+8, Joe Kesselman wrote:
    > (Reminder: if the document is "very big", you may find that the standard
    > DOM is not the best answer. See my previous comments, and the resources
    > mentioned.)


    Thanks Joe. Did you ever open my XML file? Is it possible to work with it using the DOM?
    , Nov 29, 2012
    #19
  20. On 11/28/2012 10:57 PM, wrote:
    > Thanks Joe. Did you ever open my XML file? Is it possible to work with it using the DOM?


    No, and yes assuming it's well-formed XML, respectively. Whether the DOM
    is the *best* way to work with it depends on what you're doing and on
    what kind of machine.

    --
    Joe Kesselman,
    http://www.love-song-productions.com/people/keshlam/index.html

    {} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
    /\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
    Joe Kesselman, Nov 29, 2012
    #20