Fast method for accessing large, simple structured data

agc

Hi,

I'm looking for a fast way of accessing some simple (structured) data.

The data is like this:
Approx. 6-10 GB of simple XML files, in which the only elements
I really care about are the <title> and <article> ones.

So what I'm hoping to do is put this data in a format so
that I can access it as fast as possible for a given request
(HTTP request, Python web server) that specifies just the title,
and I return the article content.

Is there some good format that is optimized for searching on
just one attribute (title) and then returning the corresponding article?

I've thought about putting this data in a SQLite database because,
from what I know, SQLite has very fast reads (no network latency, etc.)
but not as fast writes, which is fine because I probably won't be doing
much writing (I won't ever care about the speed of any writes).

So is a database the way to go, or is there some other,
more specialized format that would be better?

Thanks,
Alex
 
Diez B. Roggisch

agc said:
[...]
Is there some good format that is optimized for searching on
just one attribute (title) and then returning the corresponding article?
[...]


Database it is. Make sure you have proper indexing.
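
The indexed-lookup approach can be sketched with Python's built-in sqlite3 module; the table and column names below are illustrative, and a file path would replace ":memory:" for the real 6-10 GB data set:

```python
import sqlite3

# ":memory:" keeps this sketch self-contained; use a file path for real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, body TEXT)")
# The index is what makes exact-title lookups fast: a B-tree search
# instead of a sequential scan of all rows.
conn.execute("CREATE INDEX idx_title ON articles (title)")

conn.executemany(
    "INSERT INTO articles (title, body) VALUES (?, ?)",
    [("Python", "Python is a programming language."),
     ("SQLite", "SQLite is an embedded database.")],
)
conn.commit()

row = conn.execute(
    "SELECT body FROM articles WHERE title = ?", ("Python",)
).fetchone()
print(row[0])  # -> Python is a programming language.
```

With the index in place, each HTTP request costs one keyed B-tree lookup rather than a scan of two million rows.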


Diez
 
John Machin

agc said:
[...]
So is a database the way to go, or is there some other,
more specialized format that would be better?
[...]

"Database" without any further qualification indicates exact matching,
which doesn't seem to be very practical in the context of titles of
articles. There is an enormous body of literature on inexact/fuzzy
matching, and lots of deployed applications -- it's not a Python-related
question, really.
 
M.-A. Lemburg

agc said:
[...]
Is there some good format that is optimized for searching on
just one attribute (title) and then returning the corresponding article?
[...]

Depends on what you want to search and how, e.g. whether
a search for title substrings should give results, whether
stemming is needed, etc.

If all you want is a simple mapping of full title to article
string, an on-disk dictionary is probably the way to go,
e.g. mxBeeBase (part of the eGenix mx Base Distribution).
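
As a stand-in for such an on-disk dictionary, the standard library's dbm module behaves comparably (keyed reads from disk, nothing loaded wholesale into memory); this is a sketch only, and mxBeeBase's own API differs:

```python
import dbm
import os
import tempfile

# Build the on-disk mapping once ("c" creates the file if missing).
path = os.path.join(tempfile.mkdtemp(), "articles")
with dbm.open(path, "c") as db:
    db["Python"] = "Python is a programming language."
    db["SQLite"] = "SQLite is an embedded database."

# Each lookup is a single keyed read from disk; the 6-10 GB of
# article bodies never need to fit in memory.
with dbm.open(path, "r") as db:
    print(db["Python"].decode())  # -> Python is a programming language.
```

Note that dbm stores and returns values as bytes, hence the decode() on the way out.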

For more complex searches, you're better off with a tool that
indexes the titles based on words, i.e. a full-text search
engine such as Lucene.

Databases can also handle this, but they often have problems when
it comes to more complex queries where their indexes no longer
help them to speed up the query and they have to resort to
doing a table scan - a sequential search of all rows.

Some databases provide special full-text extensions, but
those are of varying quality. Better use a specialized
tool such as Lucene for this.
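
One example of such a database full-text extension is SQLite's FTS5, reachable from Python's sqlite3; whether FTS5 is available depends on how the underlying SQLite library was compiled, and the table below is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 builds an inverted word index over the listed columns.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [("History of Python", "A short history."),
     ("Python packaging", "Notes on packaging tools.")],
)
# MATCH consults the word index, so this finds the row containing
# the word "packaging" without scanning every title.
rows = conn.execute(
    "SELECT title FROM docs WHERE docs MATCH 'packaging'"
).fetchall()
print(rows)  # -> [('Python packaging',)]
```

This gives word-level matching, not the exact-title lookup discussed above, which is exactly the distinction being drawn here.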

For more background on the problems of full-text search, see e.g.

http://www.ibm.com/developerworks/opensource/library/l-pyind.html

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Feb 03 2008)

eGenix.com Software, Skills and Services GmbH, Pastor-Loeh-Str. 48,
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
 
agc

"Database" without any further qualification indicates exact matching,
which doesn't seem to be very practical in the context of titles of
articles. There is an enormous body of literature on inexact/fuzzy
matching, and lots of deployed applications -- it's not a Python-related
question, really.

Yes, you are right that in some sense this question is not truly
Python related, but I am looking to solve this problem in a way
that plays as nicely as possible with Python:

I guess an important feature of what I'm looking for is
some kind of mapping from *exact* title to corresponding article,
i.e. if my data set wasn't so large, I would just keep all my
data in an in-memory Python dictionary, which would be very fast.

But I have about 2 million article titles mapping to approx. 6-10 GB
of article bodies, so I think this would be just too big for a
simple Python dictionary.

Does anyone have any advice on the feasibility of using
just an in-memory dictionary? The dataset just seems too big,
but maybe there is a related method?

Thanks,
Alex
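
One related method worth sketching (an illustration not proposed in the thread): keep only title -> (offset, length) in a Python dict, and leave the article bodies in a flat file on disk. Two million small dict entries fit comfortably in RAM even when 6-10 GB of bodies do not:

```python
import tempfile

data = [("Python", "Python is a programming language."),
        ("SQLite", "SQLite is an embedded database.")]

# Write bodies sequentially to one file, recording where each starts.
index = {}
store = tempfile.NamedTemporaryFile("w+", delete=False)
for title, body in data:
    start = store.tell()
    store.write(body)
    index[title] = (start, len(body))
store.close()

def lookup(title):
    start, length = index[title]   # O(1) in-memory dict lookup
    with open(store.name) as f:
        f.seek(start)              # one seek ...
        return f.read(length)      # ... and one read on disk

print(lookup("SQLite"))  # -> SQLite is an embedded database.
```

The dict holds only titles and a pair of integers per article, so memory use scales with the number of titles, not with the gigabytes of article text.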
 
Stefan Behnel

agc said:
[...]
But I have about 2 million article titles mapping to approx. 6-10 GB
of article bodies, so I think this would be just too big for a
simple Python dictionary.

Then use a database table that maps titles to articles, and make sure you
create an index over the title column.

Stefan
 
