Data/File Structure and Algorithm for Retrieving a Sorted Data Chunk Efficiently

Jane Austine

Hello.

I have student_names and their scores, which are between 0 and 100.

I want to retrieve, sorted by score, the Nth through N+10th students'
names along with their scores. (Imagine a web page showing the top 10
students, with a "next page" link that leads to the next 10 students,
which in turn has "prev" and "next" page links.)

The list is dynamically resized -- elements are added or deleted. The
list could grow long enough that pulling all the data into memory at
once is not a good choice.

What algorithm and data/file structure are most efficient? (The
environment is a CGI-based web environment.)

The simplest approach would be using the btree from bsddb. But it only
offers "first"/"last" and "next"/"prev" moves. If I have to retrieve
the 20th..29th students, I have to go to "first" and then call "next"
19 times before I finally reach the 20th student, which is a waste of
time (and memory). I could keep the cursor position across sessions if
I were using a standalone server, but this is a CGI environment.

What can I do instead? What data structure lets you access the
nth..n+10th items of a sorted list in constant time?

Jane
 
Peter L Hansen

Jane said:
I have student_names and their scores, which are between 0 and 100.

I want to retrieve, sorted by score, the Nth through N+10th students'
names along with their scores. (Imagine a web page showing the top 10
students, with a "next page" link that leads to the next 10 students,
which in turn has "prev" and "next" page links.)

The list is dynamically resized -- elements are added or deleted. The
list could grow long enough that pulling all the data into memory at
once is not a good choice.

"Fairly long enough"? I imagine you'd have to have thousands of
students before just doing the simplest thing (ignoring performance
and reading the whole list) would become a noticeable problem.

It's rare in an environment like the web, with all the user and
network delays that are present, that optimizing at the level
you are trying to do is worth the effort.
What algorithm and data/file structure are most efficient? (The
environment is a CGI-based web environment.)

And unless mod_python is involved, CGI adds yet another significant
delay, masking any slight difference in performance between the
optimal approach and the simplistic one.
The simplest approach would be using the btree from bsddb.

Nope, the simplest would be to just use a list and ignore the
theoretical but unnoticeable performance issues. Remember, the actual
memory allocation and manipulation of the list elements happens in C
code, not in Python, so it is very fast.
What can I do instead? What data structure lets you access the
nth..n+10th items of a sorted list in constant time?

If you've got more than 1000 students, it _might_ be worth
analyzing this further. :)
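
For concreteness, the "just read and sort the whole list" approach
Peter describes might be sketched like this (a minimal sketch; the
data layout, helper name, and sample data are illustrative, not from
the thread):

```python
# Naive approach: hold all records in memory, sort, and slice one page.
# For a few thousand students this is fast; the sort itself runs in C.
def get_page(records, page, page_size=10):
    """Return one page of (score, name) pairs, highest score first."""
    ordered = sorted(records, reverse=True)
    start = page * page_size
    return ordered[start:start + page_size]

students = [(88, "alice"), (95, "bob"), (72, "carol"), (60, "dave")]
print(get_page(students, 0, 2))  # -> [(95, 'bob'), (88, 'alice')]
```

Re-sorting on every request is the "theoretical but unnoticeable"
cost: O(n log n) per page, which stays well under a network round-trip
for modest n.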

-Peter
 
Alex Martelli

Jane Austine said:
I could keep the cursor position across sessions if I were using a
standalone server, but this is a CGI environment.

You can arrange things so that a daemon is always running in the
background, keeping whatever state you wish, and your CGI script
basically just sends the form data over to the daemon and gets the
response back from it. Webware comes with this kind of arrangement as
one of the ways it can work, for example.

What can I do instead? What data structure lets you access the
nth..n+10th items of a sorted list in constant time?

Assuming you can find a way to amortize DB connection time (by a
daemon-based arrangement), I suggest a MySQL database and the use of a
SELECT query with ORDER BY and LIMIT clauses. I don't normally suggest
MySQL over other DB engines, but for your specific problem it appears to
me that it may be most appropriate. PostgreSQL also has LIMIT and
OFFSET clauses that work similarly to MySQL's LIMIT clause (iff you also
have an ORDER BY), but standard SQL doesn't support such a "sliding
window on the results" concept, unfortunately.
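
A self-contained sketch of this "sliding window" query (stdlib sqlite3
is used here purely so the example runs on its own; the table name and
sample data are invented, and MySQL's LIMIT and PostgreSQL's
LIMIT/OFFSET behave the same way given an ORDER BY):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [("alice", 88), ("bob", 95), ("carol", 72)])

# Fetch rows N..N+page_size-1 of the sorted result, without pulling
# the whole table into the client.
page, page_size = 0, 2
rows = conn.execute(
    "SELECT name, score FROM students "
    "ORDER BY score DESC LIMIT ? OFFSET ?",
    (page_size, page * page_size)).fetchall()
print(rows)  # -> [('bob', 95), ('alice', 88)]
```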

This suggestion is partly based on your follow-up post, where you
mention your dataset size as typically several tens of thousands of
records.


Alex
 
Paul Rubin

What can I do instead? What data structure lets you access the
nth..n+10th items of a sorted list in constant time?

If you're building a real application you probably should use an SQL
database that handles all the concurrency issues, etc. as well as the
problem you're describing. If you're asking out of theoretical
interest (a homework problem?), then think in terms of secondary
indexes.
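
In SQL terms, a secondary index here is simply an index on the score
column, so the engine can walk rows in score order instead of sorting
the table on every query (a sketch using stdlib sqlite3; the schema is
invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT PRIMARY KEY, score INTEGER)")
conn.execute("CREATE INDEX ix_score ON students (score)")  # secondary index
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [("alice", 88), ("bob", 95), ("carol", 72)])

# ORDER BY score can now be served by walking the index.
top = conn.execute(
    "SELECT name FROM students ORDER BY score DESC LIMIT 2").fetchall()
print(top)  # -> [('bob',), ('alice',)]
```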
 
William Park

Jane Austine said:
Hello.

I have student_names and their scores, which are between 0 and 100.

I want to retrieve, sorted by score, the Nth through N+10th students'
names along with their scores. (Imagine a web page showing the top 10
students, with a "next page" link that leads to the next 10 students,
which in turn has "prev" and "next" page links.)

The list is dynamically resized -- elements are added or deleted. The
list could grow long enough that pulling all the data into memory at
once is not a good choice.

What algorithm and data/file structure are most efficient? (The
environment is a CGI-based web environment.)

The simplest approach would be using the btree from bsddb. But it only
offers "first"/"last" and "next"/"prev" moves. If I have to retrieve
the 20th..29th students, I have to go to "first" and then call "next"
19 times before I finally reach the 20th student, which is a waste of
time (and memory). I could keep the cursor position across sessions if
I were using a standalone server, but this is a CGI environment.

What can I do instead? What data structure lets you access the
nth..n+10th items of a sorted list in constant time?

This is an example of "premature optimization". How many students do
you have, anyway? 10,000 students with 100 bytes for name and score
comes to 1MB. Just keep the list as a text file, and read a block at
a time.
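
William's text-file idea might be sketched like this (StringIO stands
in for a real file on disk, and the "name score" line format is an
assumption):

```python
import io
import itertools

# A sorted "name score" text file, highest score first.
f = io.StringIO("bob 95\nalice 88\ncarol 72\ndave 60\n")

# Read only one block (page) of lines, never the whole file.
page, page_size = 1, 2
start = page * page_size
block = [line.split() for line in
         itertools.islice(f, start, start + page_size)]
print(block)  # -> [['carol', '72'], ['dave', '60']]
```

Note that islice still consumes lines from the start of the file, so
seeking to page N is linear in N; for files of this size that cost is
negligible, which is William's point.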
 
Jane Austine

William Park said:
This is an example of "premature optimization". How many students do
you have, anyway? 10,000 students with 100 bytes for name and score
comes to 1MB. Just keep the list as a text file, and read a block at
a time.

Thank you for reminding me of Knuth's words. However, there is other
information on each record, and the whole list could grow up to
10MB~100MB.

Another problem is that the list update action could take a long time
and a lot of memory. The list could shrink and expand. Do I have to
sort them all in memory and then update the text file _every time_
it's changed?
 
Paul Rubin

Another problem is that the list update action could take a long time
and a lot of memory. The list could shrink and expand. Do I have to
sort them all in memory and then update the text file _every time_
it's changed?

I don't understand this thread. If you're trying to implement a real
system, why not use MySQL? You're nowhere near the traffic level where
that approach runs out of gas.
 
Josiah Carlson

I don't understand this thread. If you're trying to implement a real
system, why not use MySQL? You're nowhere near the traffic level where
that approach runs out of gas.

Or even the included bsddb btree database (remember to translate
to/from strings; open using bsddb.btopen()).

- Josiah
 
William Park

Jane Austine said:
Thank you for reminding me of Knuth's words. However, there is other
information on each record, and the whole list could grow up to
10MB~100MB.

Another problem is that the list update action could take a long time
and a lot of memory. The list could shrink and expand. Do I have to
sort them all in memory and then update the text file _every time_
it's changed?

There are so many ways. A text file is one. SQL is another. Only you
can answer that question.
 
Alex Martelli

Josiah Carlson said:
Or even the included bsddb btree database (remember to translate
to/from strings; open using bsddb.btopen()).

But, given a bsddb btree with, say, 10,000 entries in it (in virtually
sorted order), how does one speedily say "get entries 4020 to 4030",
for example? That was the original poster's problem as originally
posed. If you call somedb.set_location(4020), you learn that integer
keys are only allowed for Recno and Queue DBs, not for BTree ones...

I think the crucial step (with current versions of bsddb) is to pass
btflags=bsddb.db.DB_RECNUM at creation time; then, if xx is the
variable holding your database, xx.dbc.set_recno(4019) should position
the cursor as required. Sleepycat does warn that DB_RECNUM can degrade
performance for some applications, though, so be sure to do some
benchmarks if you want to do things this way.
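
The record-number idea can be illustrated in pure Python (a toy
stand-in using the stdlib bisect module, not bsddb code): once the
keys are kept sorted, "record number" access is plain indexing, so
entries 4020..4030 become one slice instead of thousands of cursor
steps.

```python
import bisect

scores = []  # kept sorted at all times, like the keys in a btree

def insert(score):
    # O(log n) search plus an insertion -- no full re-sort on update.
    bisect.insort(scores, score)

for s in [72, 95, 88, 60, 99]:
    insert(s)

# "Record number" access is just indexing into the sorted sequence.
print(scores[1:4])  # -> [72, 88, 95]
```

This is what DB_RECNUM provides for a disk-based btree: the tree
tracks subtree sizes so it can map a logical record number to a key
without walking every entry.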


Alex
 
Josiah Carlson

But, given a bsddb btree with, say, 10,000 entries in it (in virtually
sorted order), how does one speedily say "get entries 4020 to 4030",
for example? That was the original poster's problem as originally
posed. If you call somedb.set_location(4020), you learn that integer
keys are only allowed for Recno and Queue DBs, not for BTree ones...

I think the crucial step (with current versions of bsddb) is to pass
btflags=bsddb.db.DB_RECNUM at creation time; then, if xx is the
variable holding your database, xx.dbc.set_recno(4019) should position
the cursor as required. Sleepycat does warn that DB_RECNUM can degrade
performance for some applications, though, so be sure to do some
benchmarks if you want to do things this way.

I would hope that bsddb keeps track of tree location information in an
intelligent way, so even if it is slower, it still beats a flat text
file hands-down.

It would likely not beat a real DBMS, but it all depends on whether
the original poster finds that necessary.

- Josiah
 
Alex Martelli

On 2004 Oct 08, at 17:38, Josiah Carlson wrote:
...
I would hope that bsddb keeps track of tree location information in an
intelligent way, so even if it is slower, it still beats a flat text
file hands-down.

It would likely not beat a real DBMS, but it all depends on whether
the original poster finds that necessary.

Hmmmm -- I could be wrong, but doesn't MySQL _use_ bsddb as its lower
level, at least for some kinds of tables? Yet (for simple queries) the
vox populi has it "faster" than "real DBMSs"... just wondering aloud...


Alex
 
Dennis Lee Bieber

Hmmmm -- I could be wrong, but doesn't MySQL _use_ bsddb as its lower
level, at least for some kinds of tables? Yet (for simple queries) the
vox populi has it "faster" than "real DBMSs"... just wondering aloud...
The default table type is "MyISAM". BDB and InnoDB have to be
explicitly requested (and built into the executable, in some cases).

--
 
Alex Martelli

Dennis Lee Bieber said:
The default table type is "MyISAM". BDB and InnoDB have to be
explicitly requested (and built into the executable, in some cases).

Ah, I see, thanks -- so, performance relationships between these remain
unclear...


Alex
 
Dennis Lee Bieber

Ah, I see, thanks -- so, performance relationships between these remain
unclear...
Let's see what I can find in the hardcopy MySQL reference...


MyISAM (and the older ISAM) tables are not transaction-safe; InnoDB
and "BDB" are...

MyISAM runs faster, since it doesn't have the transaction-journal
overhead. BLOB and TEXT columns can be indexed. Each table consists of
a definition file, a data file, and an index file.

InnoDB supports "foreign key" definitions and does row-level locking.
Apparently all tables share space within predefined files, which don't
grow in size (along with MySQL's definition file -- InnoDB runs as a
separate back-end to MySQL, so it has its own internal definition
overhead along with the MySQL definition).

BDB uses a patched version of the Sleepycat code to fit better with
MySQL. It is slower, as the data is stored in B-tree format (MyISAM
keeps the indexes separate, with a straight data file).

--
 
