Data analysis of a large collection of XML files


Sengly

Dear all,

I have a large collection of about 50 million XML files. I would like
to do some basic statistics on the data: for instance, find the
average value of the tag <A>, discover the co-occurrence of some values
inside the data, etc.

Are there any automated, or at least semi-automatic, tools that allow us
to define some rules to understand our data? Tools with visualizations
would be best.

Do you have any ideas?

PS: It was suggested that I look at SPSS. Does anyone have experience
with it?

Thanks,

Sengly
 

Hermann Peifer

Sengly said:
I have a large collection of about 50 million XML files. I would like
to do some basic statistics on the data: for instance, find the
average value of the tag <A>, discover the co-occurrence of some values
inside the data, etc.

50 million files is quite a number. How big are they, on average?

If the source data in XML format is not supposed to change (that often), I would suggest:
- getting the values out of the XML files and into CSV files or a similar text format
- calculating your statistics from the generated text files, using an appropriate application or scripting language
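
For example, a rough sketch of that pipeline might look like this (assuming an
XML-aware extractor such as xmlstarlet is available; /data/xml and the element
<A> are just placeholders for your real layout):

# pull every <A> value into a plain text file, one value per line
find /data/xml -name '*.xml' -print0 |
  xargs -0 xmlstarlet sel -t -m '//A' -v '.' -n 2>/dev/null > a-values.txt

# then let a scripting language (awk here) do the statistics
awk '{ sum += $1; n++ } END { if (n) print "mean =", sum/n }' a-values.txt
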
PS: It was suggested that I look at SPSS. Does anyone have experience with it?

I vaguely remember having used SPSS in the mid-80s. At the time, it was a rather complex statistical application, which I used for linear and non-linear regression, multivariate analysis (principal components, factor and cluster analysis), and the like. IMHO, a bit too heavyweight for calculating an average value.

Hermann
 

Sengly

50 million files is quite a number. How big are they, on average?

If the source data in XML format is not supposed to change (that often), I would suggest:
- getting the values out of the XML files and into CSV files or a similar text format
- calculating your statistics from the generated text files, using an appropriate application or scripting language


I vaguely remember having used SPSS in the mid-80s. At the time, it was a rather complex statistical application, which I used for linear and non-linear regression, multivariate analysis (principal components, factor and cluster analysis), and the like. IMHO, a bit too heavyweight for calculating an average value.

Hermann

Thanks.

The XML files are not big, about 3 KB each. I was also thinking of
transforming those files into CSV files and then using Excel to analyse
the data.

Any other ideas are welcome,

Sengly
 

Ken Starks

Sengly said:
Dear all,

I have a large collection of about 50 million XML files. I would like
to do some basic statistics on the data: for instance, find the
average value of the tag <A>, discover the co-occurrence of some values
inside the data, etc.

Are there any automated, or at least semi-automatic, tools that allow us
to define some rules to understand our data? Tools with visualizations
would be best.

Do you have any ideas?

PS: It was suggested that I look at SPSS. Does anyone have experience
with it?

Thanks,

Sengly

This looks like more of a statistics problem than an XML one, so perhaps you
are asking in the wrong place. But, ...

For an open-source statistics package, try R (a derivative of S):
http://www.r-project.org/

For R wrapped up with many other mathematical tools (using Python),
try Sage:
http://www.sagemath.org/index.html


<quote>

Sage is a free open-source mathematics software system licensed under
the GPL. It combines the power of many existing open-source packages
into a common Python-based interface.
Mission: Creating a viable free open source alternative to Magma, Maple,
Mathematica and Matlab.

</quote>

Bye for now,
Ken.
 

Ken Starks

Sengly said:
Thanks.

The XML files are not big, about 3 KB each. I was also thinking of
transforming those files into CSV files and then using Excel to analyse
the data.

Any other ideas are welcome,

Sengly

The problem is not in finding an average, but in __sampling__ the
50 million files first.
You won't gain much by using the whole collection of 50 million
compared with using a much smaller random sample, so long as
your sample is a valid, representative random one. That is where you
need proper statistics rather than just arithmetic.
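
Drawing the sample itself can stay fairly simple. A rough sketch, assuming
GNU coreutils are available (the directory and the sample size of 100000
are arbitrary examples, not recommendations):

# list every file once, then pick file names uniformly at random
find /data/xml -name '*.xml' > all-files.txt
shuf -n 100000 all-files.txt > sample-files.txt
# any later extraction or statistics step then reads only sample-files.txt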

How big is 50 million? Well, suppose each file takes one second to
parse and convert to CSV; you might like to know that

50 million seconds = 578.703704 days
http://www.google.co.uk/search?hl=en&q=50000000+seconds+in+days
 

Peyo

Sengly wrote:
Are there any automated, or at least semi-automatic, tools that allow us
to define some rules to understand our data? Tools with visualizations
would be best.

Do you have any ideas?

Use an XML database and build your statistics, or even SVG
visualizations, using XQuery?

Cheers,

p.
 

Sengly


Thanks, Ken, for pointing out this fact. Now I really don't know how to
cope with this.

Peyo said:
Use an XML database and build your statistics, or even SVG visualizations,
using XQuery?

P., which XML database can produce statistics or even SVG
visualizations? And can it handle this very large number of files?

Once again, everybody, thank you!

Sengly
 

Peyo

Sengly wrote:
P., which XML database can produce statistics or even SVG
visualizations? And can it handle this very large number of files?

Personally, I use eXist. I'm not sure it can handle 50M files easily,
at least not without a good collection design and careful fine-tuning of
indexes. 50M x 3 KB should make it, though... but do not expect
tremendous performance.

Regarding statistics and visualization, it's just a matter of
application design. XQuery can do a great deal.

Cheers,

p.
 

Peter Flynn

Ken Starks wrote:
[...]
The problem is not in finding an average, but in __sampling__ the
50 million files first.

Right. If the value of the A element is all you want, and it occupies a
single line of each XML file, it would probably be much faster to use
the standard text tools to extract the data without using an XML parser,
but you would need to be *very* certain of the location and nature of
the element; e.g.

for f in *.xml; do grep '<A>' "$f" | sed 's+</*A>++g' >> a.dat; done

If the element is embedded within other markup (on the same line), it gets more
complex, and a formal parse-and-extract would be preferable; but that's
what creates the bigger time overhead.
You won't gain much by using the whole collection of 50 million
compared with using a much smaller random sample, so long as
your sample is a valid, representative random one. That is where you
need proper statistics rather than just arithmetic.

If you use a formally constructed sample, then you would certainly need a
stats package to process the data reliably. Spreadsheets really aren't
much good for reliable stats beyond simple means, and then only to one
or two significant figures.

P-Stat is a good general-purpose command-line package (the fourth of the
"big four", alongside SAS, SPSS, and BMDP) and is available for all
platforms (with a free downloadable demo, limited to small files) from
www.pstat.com
How big is 50 million? Well, suppose each file takes one second to
parse and convert to CSV; you might like to know that

50 million seconds = 578.703704 days
http://www.google.co.uk/search?hl=en&q=50000000+seconds+in+days

I think that no matter which way you do it, it's going to take a
significant number of days to wade through them.

///Peter
 

Sengly

Thank you very much, everyone. Actually, I would like more than the
average value of an element: some basic statistics over the whole
dataset. Also, the structure of the XML files is not very simple, since
there are some nested elements and attributes that need great care.

For your information, someone has just suggested this tool to me:
http://simile.mit.edu/wiki/Gadget

Please let me know if you have other ideas.

Thank you.

Sengly
 

Henry S. Thompson

Well, I did a small experiment. The W3C XML Schema Test Suite [1] has
approximately 40000 XML files in it (test schemas and files). The
following UN*X pipeline
> find *Data -type f -regex '.*\.\(xml\|xsd\)$' | xargs -n 1 wc -c | cut -d ' ' -f 1 | stats

run at the root of the suite produced the following output:

n = 39377
NA = 0
min = 7
max = 878065
sum = 3.57162e+07
ss = 3.35058e+12
mean = 907.033
var = 8.42692e+07
sd = 9179.83
se = 46.2608
skew = 89.0122
kurt = 8280.51

and took 1 minute 18 seconds real time on a modest Sun.

How much overhead does XML parsing add to this? I used lxprintf, one
of the down-translation tools in the LT XML toolkit [2] [3]:
> time find *Data -type f -regex '.*\.\(xml\|xsd\)$' | xargs -n 1 lxprintf -e '*[@name]' '%s\n' '@name' 2>/dev/null | while read l; do echo $l | wc -c; done | stats

(average the length of every 'name' attribute anywhere in the corpus)

with the following output:

n = 57418
NA = 0
min = 1
max = 2004
sum = 700156
ss = 3.29944e+07
mean = 12.194
var = 425.949
sd = 20.6385
se = 0.0861301
skew = 39.7485
kurt = 3428.03

and this took 8 minutes 50 seconds real time on the same machine.

So, to parse and traverse 39377 XML documents, averaging 900 bytes
long, took at most a factor of 6.8 longer than to run them all through
wc. Or, it took 0.0135 seconds per document, on average, to do the
statistics using a fast XML tool to do the data extraction.

Maybe this helps you plan.

ht

[1] http://www.w3.org/XML/2004/xml-schema-test-suite/index.html
[2] http://www.ltg.ed.ac.uk/~richard/ltxml2/ltxml2.html
[3] http://www.ltg.ed.ac.uk/software/ltxml2
--
Henry S. Thompson, School of Informatics, University of Edinburgh
Half-time member of W3C Team
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 651-1426, e-mail: (e-mail address removed)
URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]
 

Hermann Peifer

So, to parse and traverse 39377 XML documents, averaging 900 bytes
long, took at most a factor of 6.8 longer than to run them all through wc.

This more or less confirms my personal (unofficial and undocumented)
statistics, which say that on average, XML documents contain 10% data
and 90% "packaging waste". Often enough, one just has to get rid of
the packaging in order to do something meaningful with the actual
values.

On the other hand, processing data in XML format with dedicated XML
tools also has its advantages, no doubt about that.

Hermann
 

Henry S. Thompson

ht said:
So, to parse and traverse 39377 XML documents, averaging 900 bytes
long, took at most a factor of 6.8 longer than to run them all through
wc. Or, it took 0.0135 seconds per document, on average, to do the
statistics using a fast XML tool to do the data extraction.

Well, I decided the two pipes I used were too different, so I did a
more nearly equivalent comparison:
> time find *Data -type f -regex '.*\.\(xml\|xsd\)$' | \
xargs -n 1 wc -c | \
cut -d ' ' -f 1 > /tmp/fsize

[count the length in chars of each file]
> time find *Data -type f -regex '.*\.\(xml\|xsd\)$' | \
xargs -n 1 lxprintf -e '*[@name]' '%s\n' '@name' 2>/dev/null > /tmp/names

[just extract the value of all the 'name' attributes anywhere in any
of the files]

The first (no parse, just wc) case took 1min17secs, the second, XML
parse and XPath evaluate, took 3mins40secs, so the relevant measures
are

2.857 times as long for the XML condition
0.006 seconds per XML file (so we're down to realistic times for
your 50M file collection, given that a) this machine is slow and b)
the files are coming in via NFS)
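
(If that per-file cost carried over to your collection: 50,000,000 files
x 0.006 seconds is about 300,000 seconds, i.e. roughly three and a half
days of elapsed time on one slow machine; faster hardware with local
disks should do noticeably better.)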

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
Half-time member of W3C Team
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 651-1426, e-mail: (e-mail address removed)
URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]
 

Hermann Peifer

I think that no matter which way you do it, it's going to take a
significant number of days to wade through them.

Well, I think that the estimate of 1 second for processing a 3K file
is not very realistic. It is far too high, and consequently it
wouldn't take a significant number of days to process 50M XML files.

I just took a 3K sample XML file and copied it 1M times: 1000
directories, with 1000 XML files each. Calculating the average value
of some element across all 1000000 files took me:

- 10 minutes with text processing tools (grep and awk)
- 20 minutes with XML parsing tools (xmlgawk)

Extrapolated to 50M files, this would mean a processing time of 8 and
17 hours respectively.
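
For what it's worth, the text-tool pass was roughly of this shape (a
reconstruction rather than my exact commands; <A> stands in for the
element I averaged, and it assumes the element sits alone on its line,
as in <A>42</A>):

# one streaming pass: grep pulls the relevant lines, awk splits each line
# on the tag brackets and averages field 3 (the element content)
find . -name '*.xml' -exec grep -h '<A>' {} + |
  awk -F'[<>]' '{ sum += $3; n++ } END { if (n) print sum/n }'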

Hermann
 
