Fast method for accessing large, simple structured data

agc

Hi,

I'm looking for a fast way of accessing some simple (structured) data.

The data is like this:
Approx. 6-10 GB of simple XML files, in which the only elements
I really care about are the <title> and <article> ones.

So what I'm hoping to do is put this data in a format so
that I can access it as fast as possible for a given request
(HTTP request, Python web server) that specifies just the title,
and I return the article content.

Is there some good format that is optimized for searching on
just one attribute (title) and then returning the corresponding article?

I've thought about putting this data in a SQLite database because,
from what I know, SQLite has very fast reads (no network latency, etc.)
but not as fast writes, which is fine because I probably won't be doing
much writing (I won't ever care about the speed of any writes).

So is a database the way to go, or is there some other,
more specialized format that would be better?

Thanks,
Alex
 
Diez B. Roggisch

agc said:
[...]
Is there some good format that is optimized for searching on
just one attribute (title) and then returning the corresponding article?
[...]


Database it is. Make sure you have proper indexing.
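
The indexed-lookup approach can be sketched with Python's built-in sqlite3 module; the table and column names below are illustrative, and a file path would replace ":memory:" for the real 6-10 GB data set:

```python
import sqlite3

# ":memory:" keeps this sketch self-contained; use a file path for real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, body TEXT)")
# The index is what makes exact-title lookups fast: a B-tree search
# instead of a sequential scan of all rows.
conn.execute("CREATE INDEX idx_title ON articles (title)")

conn.executemany(
    "INSERT INTO articles (title, body) VALUES (?, ?)",
    [("Python", "Python is a programming language."),
     ("SQLite", "SQLite is an embedded database.")],
)
conn.commit()

row = conn.execute(
    "SELECT body FROM articles WHERE title = ?", ("Python",)
).fetchone()
print(row[0])  # -> Python is a programming language.
```

With the index in place, each HTTP request costs one keyed B-tree lookup rather than a scan of two million rows.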


Diez
 
John Machin

agc said:
[...]
So is a database the way to go, or is there some other,
more specialized format that would be better?
[...]

"Database" without any further qualification indicates exact matching,
which doesn't seem to be very practical in the context of titles of
articles. There is an enormous body of literature on inexact/fuzzy
matching, and lots of deployed applications -- it's not a Python-related
question, really.
 
M.-A. Lemburg

agc said:
[...]
Is there some good format that is optimized for searching on
just one attribute (title) and then returning the corresponding article?
[...]

Depends on what you want to search and how, e.g. whether
a search for title substrings should give results, whether
stemming is needed, etc.

If all you want is a simple mapping of full title to article
string, an on-disk dictionary is probably the way to go,
e.g. mxBeeBase (part of the eGenix mx Base Distribution).
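
As a stand-in for such an on-disk dictionary, the standard library's dbm module behaves comparably (keyed reads from disk, nothing loaded wholesale into memory); this is a sketch only, and mxBeeBase's own API differs:

```python
import dbm
import os
import tempfile

# Build the on-disk mapping once ("c" creates the file if missing).
path = os.path.join(tempfile.mkdtemp(), "articles")
with dbm.open(path, "c") as db:
    db["Python"] = "Python is a programming language."
    db["SQLite"] = "SQLite is an embedded database."

# Each lookup is a single keyed read from disk; the 6-10 GB of
# article bodies never need to fit in memory.
with dbm.open(path, "r") as db:
    print(db["Python"].decode())  # -> Python is a programming language.
```

Note that dbm stores and returns values as bytes, hence the decode() on the way out.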

For more complex searches, you're better off with a tool that
indexes the titles based on words, i.e. a full-text search
engine such as Lucene.

Databases can also handle this, but they often have problems when
it comes to more complex queries where their indexes no longer
help them to speed up the query and they have to resort to
doing a table scan - a sequential search of all rows.

Some databases provide special full-text extensions, but
those are of varying quality. Better use a specialized
tool such as Lucene for this.
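
One example of such a database full-text extension is SQLite's FTS5, reachable from Python's sqlite3; whether FTS5 is available depends on how the underlying SQLite library was compiled, and the table below is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS5 builds an inverted word index over the listed columns.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [("History of Python", "A short history."),
     ("Python packaging", "Notes on packaging tools.")],
)
# MATCH consults the word index, so this finds the row containing
# the word "packaging" without scanning every title.
rows = conn.execute(
    "SELECT title FROM docs WHERE docs MATCH 'packaging'"
).fetchall()
print(rows)  # -> [('Python packaging',)]
```

This gives word-level matching, not the exact-title lookup discussed above, which is exactly the distinction being drawn here.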

For more background on the problems of full-text search, see e.g.

http://www.ibm.com/developerworks/opensource/library/l-pyind.html

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Feb 03 2008)

eGenix.com Software, Skills and Services GmbH, Pastor-Loeh-Str. 48,
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
 
agc

"Database" without any further qualification indicates exact matching,
which doesn't seem to be very practical in the context of titles of
articles. There is an enormous body of literature on inexact/fuzzy
matching, and lots of deployed applications -- it's not a Python-related
question, really.

Yes, you are right that in some sense this question is not truly
Python related, but I am looking to solve this problem in a way
that plays as nicely as possible with Python:

I guess an important feature of what I'm looking for is
some kind of mapping from *exact* title to corresponding article,
i.e. if my data set wasn't so large, I would just keep all my
data in an in-memory Python dictionary, which would be very fast.

But I have about 2 million article titles mapping to approx. 6-10 GB
of article bodies, so I think this would be just too big for a
simple Python dictionary.

Does anyone have any advice on the feasibility of using
just an in-memory dictionary? The dataset just seems too big,
but maybe there is a related method?

Thanks,
Alex
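
One related method worth sketching (an illustration not proposed in the thread): keep only title -> (offset, length) in a Python dict, and leave the article bodies in a flat file on disk. Two million small dict entries fit comfortably in RAM even when 6-10 GB of bodies do not:

```python
import tempfile

data = [("Python", "Python is a programming language."),
        ("SQLite", "SQLite is an embedded database.")]

# Write bodies sequentially to one file, recording where each starts.
index = {}
store = tempfile.NamedTemporaryFile("w+", delete=False)
for title, body in data:
    start = store.tell()
    store.write(body)
    index[title] = (start, len(body))
store.close()

def lookup(title):
    start, length = index[title]   # O(1) in-memory dict lookup
    with open(store.name) as f:
        f.seek(start)              # one seek ...
        return f.read(length)      # ... and one read on disk

print(lookup("SQLite"))  # -> SQLite is an embedded database.
```

The dict holds only titles and a pair of integers per article, so memory use scales with the number of titles, not with the gigabytes of article text.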
 
Stefan Behnel

agc said:
[...]
But I have about 2 million article titles mapping to approx. 6-10 GB
of article bodies, so I think this would be just too big for a
simple Python dictionary.

Then use a database table that maps titles to articles, and make sure you
create an index over the title column.

Stefan
 
