Data analysis of a large collection of XML files

Discussion in 'XML' started by Sengly, Nov 23, 2008.

  1. Sengly

    Sengly Guest

    Dear all,

    I have a large collection of about 50 million XML files. I would like
    to do some basic statistics on the data. For instance: what is the
    average value of the tag <A>, how often do some values co-occur
    inside the data, etc.

    Are there any automated, or at least semi-automatic, tools that allow
    us to define some rules to understand our data? Tools with
    visualization capabilities would be best.

    Do you have any ideas?

    PS: It was suggested that I look at SPSS. Does anyone have experience
    with it?

    Thanks,

    Sengly
    Sengly, Nov 23, 2008
    #1

  2. Sengly wrote:
    >
    > I have a large collection of about 50 million XML files. I would like
    > to do some basic statistics on the data. For instance: what is the
    > average value of the tag <A>, how often do some values co-occur
    > inside the data, etc.
    >


    50 million files is quite a number. How big are they, on average?

    If the source data in XML format is not going to change (that often), I would suggest:
    - getting the values out of the XML files into CSV files or a similar text format
    - calculating your statistics from the generated text files
    - using an appropriate application or scripting language for that step; a rough sketch follows below
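
    Just to illustrate (the element name <A>, the path, and the assumption
    that each value sits on a line of its own are all mine, not a statement
    about your data), the whole job can be as small as:

    # 1. Pull the <A> values out of all files into one plain text file
    find /data/xml -name '*.xml' -print0 |
      xargs -0 grep -h '<A>' |
      sed -E 's+.*<A>([^<]*)</A>.*+\1+' > a-values.txt

    # 2. Average them with awk, or load the file into any stats tool
    awk '{ sum += $1; n++ } END { if (n) print sum / n }' a-values.txt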

    > PS: It was suggested that I look at SPSS. Does anyone have experience with it?


    I vaguely remember having used SPSS in the mid-80s. At the time, it was a rather complex statistical application which I used for linear and non-linear regression, multivariate analysis (principal components, factor and cluster analysis), and the like. IMHO, a bit too heavyweight for calculating an average value.

    Hermann
    Hermann Peifer, Nov 23, 2008
    #2

  3. Sengly

    Sengly Guest

    On Nov 23, 8:11 pm, Hermann Peifer <> wrote:
    > Sengly wrote:
    >
    > > I have a large collection of about 50 million XML files. I would like
    > > to do some basic statistics on the data. For instance: what is the
    > > average value of the tag <A>, how often do some values co-occur
    > > inside the data, etc.

    >
    > 50 million files is quite a number. How big are they, on average?
    >
    > If the source data in XML format is not going to change (that often), I would suggest:
    > - getting the values out of the XML files into CSV files or a similar text format
    > - calculating your statistics from the generated text files
    > - using an appropriate application or scripting language for that step
    >
    > > PS: It was suggested that I look at SPSS. Does anyone have experience with it?

    >
    > I vaguely remember having used SPSS in the mid-80s. At the time, it was a rather complex statistical application which I used for linear and non-linear regression, multivariate analysis (principal components, factor and cluster analysis), and the like. IMHO, a bit too heavyweight for calculating an average value.
    >
    > Hermann


    Thanks.

    The XML files are not big, about 3 KB each. I was also thinking of
    transforming those files into CSV and then using Excel to analyse
    the data.
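
    Roughly along these lines (A and B are just placeholders for the
    elements I actually need):

    # Emit one CSV row per file, then open data.csv in Excel
    echo 'file,A,B' > data.csv
    for f in *.xml; do
        a=$(sed -nE 's+.*<A>([^<]*)</A>.*+\1+p' "$f" | head -n 1)
        b=$(sed -nE 's+.*<B>([^<]*)</B>.*+\1+p' "$f" | head -n 1)
        echo "$f,$a,$b" >> data.csv
    done

    Though I realise Excel will not open anywhere near 50 million rows,
    so this could only ever be run on a sample.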

    Any other ideas are welcome.

    Sengly
    Sengly, Nov 23, 2008
    #3
  4. Ken Starks

    Ken Starks Guest

    Sengly wrote:
    > Dear all,
    >
    > I have a large collection of about 50 million XML files. I would like
    > to do some basic statistics on the data. For instance: what is the
    > average value of the tag <A>, how often do some values co-occur
    > inside the data, etc.
    >
    > Are there any automated, or at least semi-automatic, tools that allow
    > us to define some rules to understand our data? Tools with
    > visualization capabilities would be best.
    >
    > Do you have any ideas?
    >
    > PS: It was suggested that I look at SPSS. Does anyone have experience
    > with it?
    >
    > Thanks,
    >
    > Sengly


    This looks like more of a problem in statistics than XML, so perhaps you
    are asking in the wrong place. But, ...

    For an open-source statistics package, try R (a derivative of S):
    http://www.r-project.org/

    For R wrapped up with many other mathematical tools (using Python),
    try Sage:
    http://www.sagemath.org/index.html


    <quote>

    Sage is a free open-source mathematics software system licensed under
    the GPL. It combines the power of many existing open-source packages
    into a common Python-based interface.
    Mission: Creating a viable free open source alternative to Magma, Maple,
    Mathematica and Matlab.

    </quote>

    Bye for now,
    Ken.
    Ken Starks, Nov 23, 2008
    #4
  5. Ken Starks

    Ken Starks Guest

    Sengly wrote:
    > On Nov 23, 8:11 pm, Hermann Peifer <> wrote:
    >> Sengly wrote:
    >>
    >>> I have a large collection of about 50 million XML files. I would like
    >>> to do some basic statistics on the data. For instance: what is the
    >>> average value of the tag <A>, how often do some values co-occur
    >>> inside the data, etc.

    >> 50 million files is quite a number. How big are they, on average?
    >>
    >> If the source data in XML format is not going to change (that often), I would suggest:
    >> - getting the values out of the XML files into CSV files or a similar text format
    >> - calculating your statistics from the generated text files
    >> - using an appropriate application or scripting language for that step
    >>
    >>> PS: It was suggested that I look at SPSS. Does anyone have experience with it?

    >> I vaguely remember having used SPSS in the mid-80s. At the time, it was a rather complex statistical application which I used for linear and non-linear regression, multivariate analysis (principal components, factor and cluster analysis), and the like. IMHO, a bit too heavyweight for calculating an average value.
    >>
    >> Hermann

    >
    > Thanks.
    >
    > The XML files are not big, about 3 KB each. I was also thinking of
    > transforming those files into CSV and then using Excel to analyse
    > the data.
    >
    > Any other ideas are welcome.
    >
    > Sengly


    The problem is not in finding an average, but in __sampling__ the
    50 million files first.
    You won't gain much by using the whole collection of 50 million
    compared with using a much smaller random sample, as long as
    the sample is a valid, representative random one. That is where you
    need proper statistics rather than just arithmetic.
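
    Drawing such a sample is cheap, by the way. A sketch, assuming GNU
    coreutils' shuf is available and a flat list of file names is feasible:

    # Draw a simple random sample of 10,000 file names to work on
    find /data/xml -name '*.xml' > all-files.txt
    shuf -n 10000 all-files.txt > sample-files.txt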

    How big is 50 million? Well, supposing each file takes one second to
    parse and convert to CSV, you might like to know that

    50 million seconds = 578.703704 days
    http://www.google.co.uk/search?hl=en&q=50000000+seconds+in+days
    Ken Starks, Nov 23, 2008
    #5
  6. Peyo

    Peyo Guest

    Sengly wrote:

    > Are there any automated, or at least semi-automatic, tools that allow
    > us to define some rules to understand our data? Tools with
    > visualization capabilities would be best.
    >
    > Do you have any ideas?


    Use an XML database and build your statistics, or even SVG
    visualizations, using XQuery?

    Cheers,

    p.
    Peyo, Nov 23, 2008
    #6
  7. Sengly

    Sengly Guest

    On Nov 23, 8:32 pm, Peyo <> wrote:

    > 50 million seconds = 578.703704 days
    > http://www.google.co.uk/search?hl=en&q=50000000+seconds+in+days


    Thanks, Ken, for pointing out this fact. Now I really don't know how
    to cope with this.

    > Use an XML database and build your statistics, or even SVG
    > visualizations, using XQuery?


    P., which XML database can produce statistics or even SVG
    visualizations? And can it handle this very large number of files?

    Once again, everybody, thank you!

    Sengly
    Sengly, Nov 23, 2008
    #7
  8. Peyo

    Peyo Guest

    Sengly wrote:

    >> Use an XML database and build your statistics, or even SVG
    >> visualizations, using XQuery?

    >
    > P., which XML database can produce statistics or even SVG
    > visualizations? And can it handle this very large number of files?


    Personally, I use eXist. I'm not sure it can handle 50M files easily,
    at least not without a good collection design and fine-tuning of the
    indexes. 50M x 3 KB should fit, though... but do not expect
    tremendous performance.

    Regarding statistics and visualization, it's just a matter of
    application design. XQuery can do a great deal.

    Cheers,

    p.
    Peyo, Nov 23, 2008
    #8
  9. Peter Flynn

    Peter Flynn Guest

    Ken Starks wrote:
    [...]
    > The problem is not in finding an average, but in __sampling__ the
    > 50 million files first.


    Right. If the value of the A element is all you want, and it occupies a
    single line of each XML file, it would probably be much faster to use
    the standard text tools to extract the data without using an XML parser,
    but you would need to be *very* certain of the location and nature of
    the element; e.g.:

    for f in *.xml; do grep '<A>' "$f" | sed -E 's+<[/]?A>++g' >> a.dat; done

    If the element is embedded within other markup (per line) it gets more
    complex, and a formal parse-and-extract would be preferable; but that's
    what creates the bigger time overhead.
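
    For completeness, such a parse-and-extract could look like the sketch
    below, assuming a reasonably recent xmllint (libxml2) is on the path;
    the one-parser-invocation-per-file loop is exactly where the time
    overhead comes from:

    find . -name '*.xml' -print0 |
      while IFS= read -r -d '' f; do
          # proper parse: print the string value of the first A element,
          # exactly one value per line
          printf '%s\n' "$(xmllint --xpath 'string(//A)' "$f")"
      done >> a.dat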

    > You won't gain much by using the whole collection of 50 million
    > compared with using a much smaller random sample, so long as
    > your sample is a valid random representative. That is where you
    > need proper statistics rather than just arithmetic.


    If you use a formally constructed sample, then you would certainly need a
    stats package to process the data reliably. Spreadsheets really aren't
    much good for reliable stats beyond simple means, and then only to one
    or two significant places.

    P-Stat is a good general-purpose command-line package (the fourth of the
    "big four" which include SAS, SPSS, and BMDP) and is available for all
    platforms (with a free downloadable demo, limited to small files) from
    www.pstat.com [1]

    > How big is 50 million ? Well, suppose each file takes one second to
    > parse and convert to csv, you might like to know that
    >
    > 50 million seconds = 578.703704 days
    > http://www.google.co.uk/search?hl=en&q=50000000 seconds in days


    I think that no matter which way you do it, it's going to take a
    significant number of days to wade through them.

    ///Peter
    --
    [1] [Dis]claimer: I'm a long-time customer.

    XML FAQ: http://xml.silmaril.ie/
    Peter Flynn, Nov 23, 2008
    #9
  10. Sengly

    Sengly Guest

    Thank you very much, everyone. Actually, I would like more than the
    average value of an element: some basic statistics over the whole
    dataset. Also, the structure of the XML files is not very simple, since
    there are nested elements and attributes that need great care.

    For your information, I have just been suggested to use this tool:
    http://simile.mit.edu/wiki/Gadget

    Please let me know if you have other ideas.

    Thank you.

    Sengly
    Sengly, Nov 24, 2008
    #10
  11. Well, I did a small experiment. The W3C XML Schema Test Suite [1] has
    approximately 40000 XML files in it (test schemas and files). The
    following UN*X pipeline

    > find *Data -type f -regex '.*\.\(xml\|xsd\)$' | xargs -n 1 wc -c | cut -d ' ' -f 1 | stats


    run at the root of the suite produced the following output:

    n = 39377
    NA = 0
    min = 7
    max = 878065
    sum = 3.57162e+07
    ss = 3.35058e+12
    mean = 907.033
    var = 8.42692e+07
    sd = 9179.83
    se = 46.2608
    skew = 89.0122
    kurt = 8280.51

    and took 1 minute 18 seconds real time on a modest Sun.

    How much overhead does XML parsing add to this? I used lxprintf, one
    of the down-translation tools in the LT XML toolkit [2] [3]:

    > time find *Data -type f -regex '.*\.\(xml\|xsd\)$' | xargs -n 1 lxprintf -e '*[@name]' '%s\n' '@name' 2>/dev/null | while read l; do echo $l | wc -c; done | stats


    (average the length of every 'name' attribute anywhere in the corpus)

    with the following output:

    n = 57418
    NA = 0
    min = 1
    max = 2004
    sum = 700156
    ss = 3.29944e+07
    mean = 12.194
    var = 425.949
    sd = 20.6385
    se = 0.0861301
    skew = 39.7485
    kurt = 3428.03

    and this took 8 minutes 50 seconds real time on the same machine.

    So, to parse and traverse 39377 XML documents, averaging 900 bytes
    long, took at most a factor of 6.8 longer than to run them all through
    wc. Or, it took 0.0135 seconds per document, on average, to do the
    statistics using a fast XML tool to do the data extraction.

    Maybe this helps you plan.

    ht

    [1] http://www.w3.org/XML/2004/xml-schema-test-suite/index.html
    [2] http://www.ltg.ed.ac.uk/~richard/ltxml2/ltxml2.html
    [3] http://www.ltg.ed.ac.uk/software/ltxml2
    --
    Henry S. Thompson, School of Informatics, University of Edinburgh
    Half-time member of W3C Team
    10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
    Fax: (44) 131 651-1426, e-mail:
    URL: http://www.ltg.ed.ac.uk/~ht/
    [mail really from me _always_ has this .sig -- mail without it is forged spam]
    Henry S. Thompson, Nov 24, 2008
    #11
  12. On Nov 24, 3:39 pm, (Henry S. Thompson) wrote:
    >
    > So, to parse and traverse 39377 XML documents, averaging 900 bytes
    > long, took at most a factor of 6.8 longer than to run them all through wc.
    >


    This more or less confirms my personal (unofficial and undocumented)
    statistics, which say that on average, XML documents contain 10% data
    and 90% "packaging waste". Often enough, one just has to get rid of
    the packaging in order to do something meaningful with the actual
    values.

    On the other hand, processing data in XML format with dedicated XML
    tools also has its advantages, no doubt about that.

    Hermann
    Hermann Peifer, Nov 24, 2008
    #12
  13. ht writes:

    > So, to parse and traverse 39377 XML documents, averaging 900 bytes
    > long, took at most a factor of 6.8 longer than to run them all through
    > wc. Or, it took 00.0135 seconds per document, on average, to do the
    > statistics using a fast XML tool to do the data extraction.


    Well, I decided the two pipelines I used were too different, so I did a
    more nearly equivalent comparison:

    > time find *Data -type f -regex '.*\.\(xml\|xsd\)$' | \
    >   xargs -n 1 wc -c | \
    >   cut -d ' ' -f 1 > /tmp/fsize

    [count the length in chars of each file]

    > time find *Data -type f -regex '.*\.\(xml\|xsd\)$' | \
    >   xargs -n 1 lxprintf -e '*[@name]' '%s\n' '@name' 2>/dev/null > /tmp/names

    [just extract the value of all the 'name' attributes anywhere in any
    of the files]

    The first (no parse, just wc) case took 1min17secs, the second, XML
    parse and XPath evaluate, took 3mins40secs, so the relevant measures
    are

    2.857 times as long for the XML condition
    0.006 seconds per XML file, i.e. roughly 300,000 seconds or about 3.5
    days for your 50M file collection (so we're down to realistic times,
    given that a) this machine is slow and b) the files are coming in
    via NFS)

    ht
    --
    Henry S. Thompson, School of Informatics, University of Edinburgh
    Half-time member of W3C Team
    10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
    Fax: (44) 131 651-1426, e-mail:
    URL: http://www.ltg.ed.ac.uk/~ht/
    [mail really from me _always_ has this .sig -- mail without it is forged spam]
    Henry S. Thompson, Nov 24, 2008
    #13
  14. On Nov 23, 4:17 pm, Peter Flynn <> wrote:
    > Ken Starks wrote:
    >
    > > How big is 50 million? Well, supposing each file takes one second to
    > > parse and convert to CSV, you might like to know that

    >
    > > 50 million seconds = 578.703704 days
    > > http://www.google.co.uk/search?hl=en&q=50000000+seconds+in+days

    >
    > I think that no matter which way you do it, it's going to take a
    > significant number of days to wade through them.
    >


    Well, I think that the estimate of 1 second for processing a 3K file
    is not very realistic. It is far too high, and consequently it
    wouldn't take a significant number of days to process 50M XML files.

    I just took a 3K sample XML file and copied it 1M times: 1000
    directories, with 1000 XML files each. Calculating the average value
    of some element across all 1000000 files took me:

    - 10 minutes with text processing tools (grep and awk)
    - 20 minutes with XML parsing tools (xmlgawk)

    Extrapolated to 50M files, this would mean a processing time of 8 and
    17 hours respectively.
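
    The grep/awk pipeline was roughly of the following shape (the element
    name and the one-value-per-line layout reflect my sample file, so
    treat this as a sketch rather than a recipe):

    find . -name '*.xml' -exec grep -h '<A>' {} + |
      awk -F'[<>]' '{ sum += $3; n++ }
                    END { if (n) printf "mean %g over %d values\n", sum / n, n }'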

    Hermann
    Hermann Peifer, Nov 24, 2008
    #14