Data analysis of a large collection of XML files


Sengly

Dear all,

I have a large collection of about 50 million XML files. I would like
to do some basic statistics on the data: for instance, find the
average value of the tag <A>, discover the co-occurrence of some values
inside the data, etc.

Are there any automated, or at least semi-automatic, tools that allow us
to define some rules to understand our data? Tools with visualizations
would be best.

Do you have any ideas?

PS: It was suggested that I look at SPSS. Does anyone have experience
with it?

Thanks,

Sengly
 

Hermann Peifer

Sengly said:
I have a large collection of about 50 million XML files. I would like
to do some basic statistics on the data: for instance, find the
average value of the tag <A>, discover the co-occurrence of some values
inside the data, etc.

50 million files is quite a number. How big are they, on average?

If the source data in XML format is not supposed to change (that often), I would suggest:
- getting the values out of the XML files and into CSV files or a similar text format
- calculating your statistics from the generated text files, using an appropriate application or scripting language
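
For example, a rough sketch of that pipeline might look like this (assuming an
XML-aware extractor such as xmlstarlet is available; /data/xml and the element
<A> are just placeholders for your real layout):

# pull every <A> value into a plain text file, one value per line
find /data/xml -name '*.xml' -print0 |
  xargs -0 xmlstarlet sel -t -m '//A' -v '.' -n 2>/dev/null > a-values.txt

# then let a scripting language (awk here) do the statistics
awk '{ sum += $1; n++ } END { if (n) print "mean =", sum/n }' a-values.txt
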
PS: It was suggested that I look at SPSS. Does anyone have experience with it?

I vaguely remember having used SPSS in the mid-80s. At the time, it was a rather complex statistical application, which I used for linear and non-linear regression, multivariate analysis (principal components, factor and cluster analysis), and the like. IMHO, a bit too heavyweight for calculating an average value.

Hermann
 

Sengly

50 million files is quite a number. How big are they, on average?

If the source data in XML format is not supposed to change (that often), I would suggest:
- getting the values out of the XML files and into CSV files or a similar text format
- calculating your statistics from the generated text files, using an appropriate application or scripting language


I vaguely remember having used SPSS in the mid-80s. At the time, it was a rather complex statistical application, which I used for linear and non-linear regression, multivariate analysis (principal components, factor and cluster analysis), and the like. IMHO, a bit too heavyweight for calculating an average value.

Hermann

Thanks.

The XML files are not big, about 3 KB each. I was also thinking of
transforming those files into CSV files and then using Excel to analyse
the data.

Any other ideas are welcome,

Sengly
 

Ken Starks

Sengly said:
Dear all,

I have a large collection of about 50 million XML files. I would like
to do some basic statistics on the data: for instance, find the
average value of the tag <A>, discover the co-occurrence of some values
inside the data, etc.

Are there any automated, or at least semi-automatic, tools that allow us
to define some rules to understand our data? Tools with visualizations
would be best.

Do you have any ideas?

PS: It was suggested that I look at SPSS. Does anyone have experience
with it?

Thanks,

Sengly

This looks like more of a statistics problem than an XML one, so perhaps you
are asking in the wrong place. But, ...

For an open-source statistics package, try R (a derivative of S):
http://www.r-project.org/

For R wrapped up with many other mathematical tools (using Python),
try Sage:
http://www.sagemath.org/index.html


<quote>

Sage is a free open-source mathematics software system licensed under
the GPL. It combines the power of many existing open-source packages
into a common Python-based interface.
Mission: Creating a viable free open source alternative to Magma, Maple,
Mathematica and Matlab.

</quote>

Bye for now,
Ken.
 

Ken Starks

Sengly said:
Thanks.

The XML files are not big, about 3 KB each. I was also thinking of
transforming those files into CSV files and then using Excel to analyse
the data.

Any other ideas are welcome,

Sengly

The problem is not in finding an average, but in __sampling__ the
50 million files first.
You won't gain much by using the whole collection of 50 million
compared with using a much smaller random sample, so long as
your sample is a valid, representative random one. That is where you
need proper statistics rather than just arithmetic.
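
Drawing the sample itself can stay fairly simple. A rough sketch, assuming
GNU coreutils are available (the directory and the sample size of 100000
are arbitrary examples, not recommendations):

# list every file once, then pick file names uniformly at random
find /data/xml -name '*.xml' > all-files.txt
shuf -n 100000 all-files.txt > sample-files.txt
# any later extraction or statistics step then reads only sample-files.txt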

How big is 50 million? Well, suppose each file takes one second to
parse and convert to CSV; you might like to know that

50 million seconds = 578.703704 days
http://www.google.co.uk/search?hl=en&q=50000000+seconds+in+days
 

Peyo

Sengly wrote:
Are there any automated, or at least semi-automatic, tools that allow us
to define some rules to understand our data? Tools with visualizations
would be best.

Do you have any ideas?

Use an XML database and build your statistics, or even SVG
visualizations, using XQuery?

Cheers,

p.
 

Sengly


Thanks, Ken, for pointing out this fact. Now I really don't know how to
cope with this.

Peyo said:
Use an XML database and build your statistics, or even SVG visualizations,
using XQuery?

P., which XML database can produce statistics or even SVG
visualizations? And can it handle this very large number of files?

Once again, everybody, thank you!

Sengly
 

Peyo

Sengly wrote:
P., which XML database can produce statistics or even SVG
visualizations? And can it handle this very large number of files?

Personally, I use eXist. I'm not sure it can handle 50M files easily,
at least not without a good collection design and careful fine-tuning of
indexes. 50M x 3 KB should make it, though... but do not expect
tremendous performance.

Regarding statistics and visualization, it's just a matter of
application design. XQuery can do a great deal.

Cheers,

p.
 

Peter Flynn

Ken Starks wrote:
[...]
The problem is not in finding an average, but in __sampling__ the
50 million files first.

Right. If the value of the A element is all you want, and it occupies a
single line of each XML file, it would probably be much faster to use
the standard text tools to extract the data without using an XML parser,
but you would need to be *very* certain of the location and nature of
the element; e.g.

for f in *.xml; do grep '<A>' "$f" | sed 's+</*A>++g' >> a.dat; done

If the element is embedded within other markup (on the same line), it gets more
complex, and a formal parse-and-extract would be preferable; but that's
what creates the bigger time overhead.
You won't gain much by using the whole collection of 50 million
compared with using a much smaller random sample, so long as
your sample is a valid, representative random one. That is where you
need proper statistics rather than just arithmetic.

If you use a formally constructed sample, then you would certainly need a
stats package to process the data reliably. Spreadsheets really aren't
much good for reliable stats beyond simple means, and then only to one
or two significant figures.

P-Stat is a good general-purpose command-line package (the fourth of the
"big four", alongside SAS, SPSS, and BMDP) and is available for all
platforms (with a free downloadable demo, limited to small files) from
www.pstat.com
How big is 50 million? Well, suppose each file takes one second to
parse and convert to CSV; you might like to know that

50 million seconds = 578.703704 days
http://www.google.co.uk/search?hl=en&q=50000000+seconds+in+days

I think that no matter which way you do it, it's going to take a
significant number of days to wade through them.

///Peter
 

Sengly

Thank you very much, everyone. Actually, I would like more than the
average value of an element: some basic statistics over the whole
dataset. Also, the structure of the XML files is not very simple, since
there are some nested elements and attributes that need great care.

For your information, someone has just suggested this tool to me:
http://simile.mit.edu/wiki/Gadget

Please let me know if you have other ideas.

Thank you.

Sengly
 

Henry S. Thompson

Well, I did a small experiment. The W3C XML Schema Test Suite [1] has
approximately 40000 XML files in it (test schemas and files). The
following UN*X pipeline
> find *Data -type f -regex '.*\.\(xml\|xsd\)$' | xargs -n 1 wc -c | cut -d ' ' -f 1 | stats

run at the root of the suite produced the following output:

n = 39377
NA = 0
min = 7
max = 878065
sum = 3.57162e+07
ss = 3.35058e+12
mean = 907.033
var = 8.42692e+07
sd = 9179.83
se = 46.2608
skew = 89.0122
kurt = 8280.51

and took 1 minute 18 seconds real time on a modest Sun.

How much overhead does XML parsing add to this? I used lxprintf, one
of the down-translation tools in the LT XML toolkit [2] [3]:
> time find *Data -type f -regex '.*\.\(xml\|xsd\)$' | xargs -n 1 lxprintf -e '*[@name]' '%s\n' '@name' 2>/dev/null | while read l; do echo $l | wc -c; done | stats

(average the length of every 'name' attribute anywhere in the corpus)

with the following output:

n = 57418
NA = 0
min = 1
max = 2004
sum = 700156
ss = 3.29944e+07
mean = 12.194
var = 425.949
sd = 20.6385
se = 0.0861301
skew = 39.7485
kurt = 3428.03

and this took 8 minutes 50 seconds real time on the same machine.

So, to parse and traverse 39377 XML documents, averaging 900 bytes
long, took at most a factor of 6.8 longer than to run them all through
wc. Or, it took 0.0135 seconds per document, on average, to do the
statistics using a fast XML tool to do the data extraction.

Maybe this helps you plan.

ht

[1] http://www.w3.org/XML/2004/xml-schema-test-suite/index.html
[2] http://www.ltg.ed.ac.uk/~richard/ltxml2/ltxml2.html
[3] http://www.ltg.ed.ac.uk/software/ltxml2
--
Henry S. Thompson, School of Informatics, University of Edinburgh
Half-time member of W3C Team
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 651-1426, e-mail: (e-mail address removed)
URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]
 

Hermann Peifer

So, to parse and traverse 39377 XML documents, averaging 900 bytes
long, took at most a factor of 6.8 longer than to run them all through wc.

This more or less confirms my personal (unofficial and undocumented)
statistics, which say that on average, XML documents contain 10% data
and 90% "packaging waste". Often enough, one just has to get rid of
the packaging in order to do something meaningful with the actual
values.

On the other hand, processing data in XML format with dedicated XML
tools also has its advantages, no doubt about that.

Hermann
 

Henry S. Thompson

ht said:
So, to parse and traverse 39377 XML documents, averaging 900 bytes
long, took at most a factor of 6.8 longer than to run them all through
wc. Or, it took 0.0135 seconds per document, on average, to do the
statistics using a fast XML tool to do the data extraction.

Well, I decided the two pipes I used were too different, so I did a
more nearly equivalent comparison:
> time find *Data -type f -regex '.*\.\(xml\|xsd\)$' | \
xargs -n 1 wc -c | \
cut -d ' ' -f 1 > /tmp/fsize

[count the length in chars of each file]
> time find *Data -type f -regex '.*\.\(xml\|xsd\)$' | \
xargs -n 1 lxprintf -e '*[@name]' '%s\n' '@name' 2>/dev/null > /tmp/names

[just extract the value of all the 'name' attributes anywhere in any
of the files]

The first (no parse, just wc) case took 1min17secs, the second, XML
parse and XPath evaluate, took 3mins40secs, so the relevant measures
are

2.857 times as long for the XML condition
0.006 seconds per XML file (so we're down to realistic times for
your 50M file collection, given that a) this machine is slow and b)
the files are coming in via NFS)
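
(If that per-file cost carried over to your collection: 50,000,000 files
x 0.006 seconds is about 300,000 seconds, i.e. roughly three and a half
days of elapsed time on one slow machine; faster hardware with local
disks should do noticeably better.)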

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
Half-time member of W3C Team
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 651-1426, e-mail: (e-mail address removed)
URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]
 

Hermann Peifer

I think that no matter which way you do it, it's going to take a
significant number of days to wade through them.

Well, I think that the estimate of 1 second for processing a 3K file
is not very realistic. It is far too high, and consequently it
wouldn't take a significant number of days to process 50M XML files.

I just took a 3K sample XML file and copied it 1M times: 1000
directories, with 1000 XML files each. Calculating the average value
of some element across all 1000000 files took me:

- 10 minutes with text processing tools (grep and awk)
- 20 minutes with XML parsing tools (xmlgawk)

Extrapolated to 50M files, this would mean a processing time of 8 and
17 hours respectively.
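
For what it's worth, the text-tool pass was roughly of this shape (a
reconstruction rather than my exact commands; <A> stands in for the
element I averaged, and it assumes the element sits alone on its line,
as in <A>42</A>):

# one streaming pass: grep pulls the relevant lines, awk splits each line
# on the tag brackets and averages field 3 (the element content)
find . -name '*.xml' -exec grep -h '<A>' {} + |
  awk -F'[<>]' '{ sum += $3; n++ } END { if (n) print sum/n }'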

Hermann
 
