Searching and storing large quantities of XML

dads

I work in 1st line support and Python is one of my hobbies. We get
quite a few requests for XML from our website and it's a long,
strung-out process, so I thought I'd try to create a system that deals
with it, for fun.

I've been tidying up the archived XML and have been thinking about the
best way to approach this, as it took a long time to deal with big
quantities of XML. We have 5-6 years' worth of archives, with 26,000+
XML files of 5-20k each per year. The archived stuff is zipped, but
which is better: 26,000 files in one big zip file, 26,000 files in one
big zip file but organised into folders by month and day, or zip files
inside zip files?

I created an app in wxPython to search the unzipped XML files by
modified date: it opens each one and does a plain string search,
something like l.find('>%s<' % fiveDigitNumber) != -1. Is this quicker
than parsing the XML?

Generally the requests are less than 3 months old, so that got me
thinking: should I create a script that finds all the file names and
corresponding web numbers of the old XML and bungs them into a db
table, one for each year, and another script that archives the XML
every day and, after 3 months, zips it up, bungs the info into the
table, etc.? Sorry for the ramble, I just want other people's opinions
on the matter. =)
 
Steve Holden

dads said:
I work in 1st line support and Python is one of my hobbies. We get
quite a few requests for XML from our website and it's a long,
strung-out process, so I thought I'd try to create a system that deals
with it, for fun.

I've been tidying up the archived XML and have been thinking about the
best way to approach this, as it took a long time to deal with big
quantities of XML. We have 5-6 years' worth of archives, with 26,000+
XML files of 5-20k each per year. The archived stuff is zipped, but
which is better: 26,000 files in one big zip file, 26,000 files in one
big zip file but organised into folders by month and day, or zip files
inside zip files?

I created an app in wxPython to search the unzipped XML files by
modified date: it opens each one and does a plain string search,
something like l.find('>%s<' % fiveDigitNumber) != -1. Is this quicker
than parsing the XML?

Generally the requests are less than 3 months old, so that got me
thinking: should I create a script that finds all the file names and
corresponding web numbers of the old XML and bungs them into a db
table, one for each year, and another script that archives the XML
every day and, after 3 months, zips it up, bungs the info into the
table, etc.? Sorry for the ramble, I just want other people's opinions
on the matter. =)

The first question I'd ask is what library you are using for the XML
processing. If you aren't using cElementTree, it would definitely be
worth checking whether it improves your processing speed. You can test
with ElementTree if you want, but cElementTree is a C extension module,
and therefore much faster.
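
A common idiom is to try the C accelerator and fall back to the
pure-Python version, so the rest of your code doesn't care which one
is available ('example.xml' below is just a placeholder):

try:
    import xml.etree.cElementTree as ET   # fast C implementation
except ImportError:
    import xml.etree.ElementTree as ET    # pure-Python fallback

tree = ET.parse('example.xml')
print(tree.getroot().tag)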

Fredrik Lundh wrote it, so it's pretty solid stuff (he was one of the
minds behind the re engine).

regards
Steve
 
Paul Rubin

dads said:
I've been tidying up the archived XML and have been thinking about the
best way to approach this, as it took a long time to deal with big
quantities of XML. We have 5-6 years' worth of archives, with 26,000+
XML files of 5-20k each per year. The archived stuff is zipped, but
which is better: 26,000 files in one big zip file, 26,000 files in one
big zip file but organised into folders by month and day, or zip files
inside zip files?

If I'm reading that properly, you have 5-6 years' worth of files,
26,000 files per year, 5-20k bytes per file? At 10k bytes/file that's
about 1.3GB, which isn't all that much data by today's standards.

Generally the requests are less than 3 months old, so that got me
thinking: should I create a script that finds all the file names and
corresponding web numbers of the old XML and bungs them into a db
table, one for each year, and another script that archives the XML
every day and, after 3 months, zips it up, bungs the info into the
table, etc.? Sorry for the ramble, I just want other people's opinions
on the matter. =)

Extract all the files and put them into some kind of indexed database
or search engine. I've used Solr (http://lucene.apache.org/solr) for
this purpose, and while it has limitations, it's fairly easy to set up
and use for basic purposes.
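
As a rough sketch, using the third-party pysolr client (the client
choice, core URL and field names here are all assumptions, not part of
your data):

import pysolr

solr = pysolr.Solr('http://localhost:8983/solr/orders', timeout=10)

# index one document per XML file
solr.add([{'id': '12345',
           'order_date': '2010-01-16',
           'path': '2010/201001/16/12345.xml'}])

# later: query by field instead of scanning files
for result in solr.search('order_date:2010-01-16'):
    print(result['path'])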
 
Stefan Behnel

dads, 16.01.2010 19:10:
I work in 1st line support and Python is one of my hobbies. We get
quite a few requests for XML from our website and it's a long,
strung-out process, so I thought I'd try to create a system that deals
with it, for fun.

I've been tidying up the archived XML and have been thinking about the
best way to approach this, as it took a long time to deal with big
quantities of XML. We have 5-6 years' worth of archives, with 26,000+
XML files of 5-20k each per year. The archived stuff is zipped, but
which is better: 26,000 files in one big zip file, 26,000 files in one
big zip file but organised into folders by month and day, or zip files
inside zip files?

As Paul suggested, that doesn't sound like a lot of data at all.

Personally, I guess I'd keep separate folders of .gz files, but that also
depends on platform characteristics (e.g. I/O performance and file system).
If you already have them archived in .zip files, that's fine too. Your
files are fairly small, so combined .zip archival will likely yield better
space efficiency.
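
Note that individually gzipped files can be parsed straight from the
archive, without unpacking to disk first (the path below is made up):

import gzip
import xml.etree.cElementTree as ET

f = gzip.open('2010/201001/16/12345.xml.gz')
try:
    tree = ET.parse(f)    # parse() accepts any file-like object
finally:
    f.close()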

I created an app in wxPython to search the unzipped XML files by
modified date: it opens each one and does a plain string search,
something like l.find('>%s<' % fiveDigitNumber) != -1. Is this quicker
than parsing the XML?

Sounds like all you need is an index. Look at the various dbm database
modules that come with Python; they are basically persistent dictionaries.
To achieve the above, all you have to do is parse the XML files once (use
cElementTree's iterparse() for that, as Steve suggested) and store a path
(i.e. zip file name and archive entry name) for each 'modified date' that
you find in the file. When your app then needs to access data for a
specific modification date, it can look up the file path in the dbm and
parse the file directly from the open zip file.
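
A rough sketch of that, assuming each file carries a single date
element (called 'OrderDate' here, which is an assumed name) and
storing only one entry per date for brevity; a real index would keep a
list of paths per date:

import anydbm                        # plain 'dbm' on Python 3
import zipfile
import xml.etree.cElementTree as ET

def build_index(zip_path, index_path='xml_index.db'):
    '''Map each date found to "zip file name|archive entry name".'''
    index = anydbm.open(index_path, 'c')
    zf = zipfile.ZipFile(zip_path)
    for name in zf.namelist():
        if not name.endswith('.xml'):
            continue
        f = zf.open(name)
        # iterparse() lets us stop as soon as the date element appears
        for event, elem in ET.iterparse(f):
            if elem.tag == 'OrderDate':
                index[elem.text[:10]] = '%s|%s' % (zip_path, name)
                break
        f.close()
    zf.close()
    index.close()

def lookup(date, index_path='xml_index.db'):
    '''Parse the file recorded for a date, straight from the zip.
    Raises KeyError if the date was never indexed.'''
    index = anydbm.open(index_path, 'r')
    try:
        zip_path, name = index[date].split('|')
    finally:
        index.close()
    zf = zipfile.ZipFile(zip_path)
    try:
        return ET.parse(zf.open(name))
    finally:
        zf.close()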

That's a simple solution, at least. Depending on your expected load and
additional feature set (e.g. full-text indexing), a more complex database
setup (potentially even copying all the data into the DB) may still be
worth going for.

Stefan
 
dads

Thanks all, I took your advice and have been playing all weekend, which
has been great fun. ElementTree is awesome. I created a script that
organises the XML, as it's stored in year blocks, and I didn't realise
the required XML is mixed up with other XML. Plus the volumes are much
greater than I realised; I checked when back at work and it was
something like 600,000 files in a year, just over a gig for each
year.

I'm going to add zipping up of the files, and getting the required info
into a db, this week hopefully. The script has been completely
overhauled: originally I used the modified date, but now it gets the
date from the parsed XML, which is safer. The code is below, but a word
of caution: it's hobbyist code, so it'll probably make your eyes bleed
=). Thanks again:

There was one thing that I forgot about - when ElementTree fails to
parse due to an element not being closed, why doesn't it close the
file-like object? Later on I would get 'WindowsError: [Error
32] ...file being used by other process' when using shutil.move(). I
got round this by using a 'try except' block.

from __future__ import print_function
import xml.etree.cElementTree as ET
import calendar
import zipfile
import os.path
import shutil
import zlib
import os


class Xmlorg(object):

    def __init__(self):
        self.cwd = os.getcwd()
        self.year = os.path.basename(self.cwd)

    def _mkMonthAndDaysDirs(self):
        '''Creates dirs for every month and day of a specified year.
        Works for leap years as well.

        (specified)year/(year)month/day

        ...2010/201001/01
        ...2010/201001/02
        ...2010/201001/03 '''

        def addZero(n):
            if len(str(n)) < 2:
                return '0' + str(n)
            else:
                return str(n)

        # number of days in each month of the year
        dim = [calendar.monthrange(int(self.year), month)[1]
               for month in range(1, 13)]

        count = 1
        for n in dim:
            month = addZero(count)
            count += 1
            ym = os.path.join(self.cwd, self.year + month)
            os.mkdir(ym)
            for x in range(1, n + 1):
                os.mkdir(os.path.join(ym, addZero(x)))

    def ParseAndOrg(self):
        '''Requires dir and zip struct:

        .../(year)/(year).zip - example .../2008/2008.zip '''

        def movef(fp1, fp2):
            '''Moves files with exception handling.'''
            try:
                shutil.move(fp1, fp2)
            except IOError as e:
                print(e)
            except WindowsError as e:
                print(e)

        self._mkMonthAndDaysDirs()
        os.mkdir(os.path.join(self.cwd, 'otherFileType'))

        # dir struct .../(year)/(year).zip - ex. .../2008/2008.zip
        zf = zipfile.ZipFile(os.path.join(self.cwd, self.year + '.zip'))
        zf.extractall()
        ld = os.listdir(self.cwd)
        for i in ld:
            if os.path.isfile(i) and i.endswith('.xml'):
                try:
                    tree = ET.parse(i)
                except:
                    print('%s np' % i)  # not parsed
                    continue            # skip files that failed to parse
                root = tree.getroot()
                if root.findtext('Summary/FileType') == 'Order':
                    date = root.findtext('OrderHeader/OrderDate')[:10]
                    # dd/mm/yyyy
                    dc = date.split('/')
                    fp1 = os.path.join(self.cwd, i)
                    fp2 = os.path.join(self.cwd, dc[2] + dc[1], dc[0])
                    movef(fp1, fp2)
                else:
                    fp1 = os.path.join(self.cwd, i)
                    fp2 = os.path.join(self.cwd, 'otherFileType')
                    movef(fp1, fp2)


if __name__ == '__main__':
    os.chdir('c:/sv_zip_test/2010/')  # remove
    xo = Xmlorg()
    xo.ParseAndOrg()
 
MRAB

dads wrote:
[snip]
import os.path
import shutil
import zlib
import os

There's no need to import both os.path and os. Import just os; you can
still refer to os.path in the rest of the code.

[snip]
def _mkMonthAndDaysDirs(self):

    '''Creates dirs for every month and day of a specified year.
    Works for leap years as well.

    (specified)year/(year)month/day

    ...2010/201001/01
    ...2010/201001/02
    ...2010/201001/03 '''
As posted, the method body had lost its indentation, so it looked as
though there was nothing except a docstring in this method!
def addZero(n):

    if len(str(n)) < 2:
        return '0' + str(n)
    else:
        return str(n)

A shorter and quicker way is:

    return "%02d" % n

which means "pad with leading zeros to be at least 2 characters".

[snip]
# dir struct .../(year)/(year).zip - ex. .../2008/2008.zip
zf = zipfile.ZipFile(os.path.join(self.cwd, self.year + '.zip'))
zf.extractall()

You might want to close the zipfile after use.
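
For example:

zf = zipfile.ZipFile(os.path.join(self.cwd, self.year + '.zip'))
try:
    zf.extractall()
finally:
    zf.close()    # release the file handle even if extraction fails

(On Python 2.7+ a with-statement does the same job.)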

[snip]
if __name__ == '__main__':
    os.chdir('c:/sv_zip_test/2010/')  # remove

I recommend that you work with the full paths instead of changing the
current directory and then using relative paths.
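
For example, the constructor could take the directory to work on, so
nothing depends on the process-wide current directory (the rest of the
class would then need to join self.cwd onto the names it gets back
from os.listdir()):

class Xmlorg(object):

    def __init__(self, workdir):
        self.cwd = os.path.abspath(workdir)   # full path, no chdir() needed
        self.year = os.path.basename(self.cwd)

xo = Xmlorg('c:/sv_zip_test/2010')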
 
Stefan Behnel

dads, 18.01.2010 22:39:
There was one thing that I forgot about - when ElementTree fails to
parse due to an element not being closed, why doesn't it close the
file-like object?

Because it didn't open it?
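
One way to keep control of the handle is to open the file yourself and
pass the object to parse(); then you can close it whatever happens
(xml_path stands for whichever file you're processing):

import xml.etree.cElementTree as ET

f = open(xml_path)
try:
    tree = ET.parse(f)    # may raise on malformed XML
finally:
    f.close()             # closed either way, so shutil.move() won't fail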

Stefan
 
