Data/File Structure and Algorithm for Retrieving Sorted Data Chunk Efficiently

Discussion in 'Python' started by Jane Austine, Oct 5, 2004.

  1. Jane Austine

    Jane Austine Guest

    Hello.

    I have student_names and their scores, which are between 0..100.

    I want to retrieve in the sorted order in terms of score, Nth
    student's name to N+10th student's name along with their scores.
    (suppose a web page showing top 10 students and there are "next page"
    link and if you follow the link there shows next top 10 students with
    "prev" and "next" page links.)

    The list is dynamically resized -- elements are added or deleted. The
    list could go fairly long enough that pulling all the data on memory
    at once is not a good choice.

    What algorithm and data/file structure are most efficient? (the
    environment is "cgi" based web environment)

    Simplest approach would be using btree from bsddb. But there are only
    "first"/"last" and "next"/"prev" moves. If I have to retrieve 20..29th
    students, I have got to "first" and then "next" 19 times before I
    finally get to 20th student, which is a waste of time(and memory).
    Maybe I could keep the cursor position among sessions if I were using
    a standalone server. However, it is in CGI environment.

    What can I do instead? What data structure allows you to access
    nth..n+10th data from the sorted list in a constant time?

    Jane
     
    Jane Austine, Oct 5, 2004
    #1
    1. Advertising

  2. Re: Data/File Structure and Algorithm for Retrieving Sorted DataChunk Efficiently

    Jane Austine wrote:
    > I have student_names and their scores, which are between 0..100.
    >
    > I want to retrieve in the sorted order in terms of score, Nth
    > student's name to N+10th student's name along with their scores.
    > (suppose a web page showing top 10 students and there are "next page"
    > link and if you follow the link there shows next top 10 students with
    > "prev" and "next" page links.)
    >
    > The list is dynamically resized -- elements are added or deleted. The
    > list could go fairly long enough that pulling all the data on memory
    > at once is not a good choice.


    "Fairly long enough"? I imagine you'd have to have thousands of
    students before just doing the simplest thing (ignore performance
    and read the whole list) would become a noticable problem.

    It's rare in an environment like the web, with all the user and
    network delays that are present, that optimizing at the level
    you are trying to do is worth the effort.

    > What algorithm and data/file structure are most efficient? (the
    > environment is "cgi" based web environment)


    And unless mod_python is involved, CGI adds yet another significant
    delay, masking any slight difference in performance from the optimal
    approach and the simplistic one.

    > Simplest approach would be using btree from bsddb.


    Nope, simplest would be just use a list and ignore the theoretical
    but unnoticable performance issues. Remember, the actual memory
    allocation and manipulations of the list elements happens in C code,
    not in Python, so it is very fast.

    > What can I do instead? What data structure allows you to access
    > nth..n+10th data from the sorted list in a constant time?


    If you've got more than 1000 students, it _might_ be worth
    analyzing this further. :)

    -Peter
     
    Peter L Hansen, Oct 5, 2004
    #2
    1. Advertising

  3. Jane Austine <> wrote:
    ...
    > Maybe I could keep the cursor position among sessions if I were using
    > a standalone server. However, it is in CGI environment.


    You can arrange things so that a daemon is always running in the
    background, keeping whatever state you wish, and you CGI script
    basically just send over the form data to the daemon and get the
    response back from it. Webware comes with this kind of arrangement as
    one of the ways it can work, for example.


    > What can I do instead? What data structure allows you to access
    > nth..n+10th data from the sorted list in a constant time?


    Assuming you can find a way to amortize DB connection time (by a
    daemon-based arrangement), I suggest a MySQL database and the use of a
    SELECT query with ORDER BY and LIMIT clauses. I don't normally suggest
    MySQL over other DB engines, but for your specific problem it appears to
    me that it may be most appropriate. PostgreSQL also has LIMIT and
    OFFSET clauses that work similarly to MySQL's LIMIT clause (iff you also
    have an ORDER BY), but standard SQL doesn't support such a "sliding
    window on the results" concept, unfortunately.

    This suggestion is partly based on your followup post where you mention
    your dataset size as being typically of several 10s of 1000s of records.


    Alex
     
    Alex Martelli, Oct 5, 2004
    #3
  4. Jane Austine

    Paul Rubin Guest

    (Jane Austine) writes:
    > What can I do instead? What data structure allows you to access
    > nth..n+10th data from the sorted list in a constant time?


    If you're building a real application you probably should use an SQL
    database that handles all the concurrency issues, etc. as well as the
    problem you're describing. If you're asking out of theoretical
    interest (a homework problem?), then think in terms of secondary
    indexes.
     
    Paul Rubin, Oct 5, 2004
    #4
  5. Jane Austine

    William Park Guest

    Jane Austine <> wrote:
    > Hello.
    >
    > I have student_names and their scores, which are between 0..100.
    >
    > I want to retrieve in the sorted order in terms of score, Nth
    > student's name to N+10th student's name along with their scores.
    > (suppose a web page showing top 10 students and there are "next page"
    > link and if you follow the link there shows next top 10 students with
    > "prev" and "next" page links.)
    >
    > The list is dynamically resized -- elements are added or deleted. The
    > list could go fairly long enough that pulling all the data on memory
    > at once is not a good choice.
    >
    > What algorithm and data/file structure are most efficient? (the
    > environment is "cgi" based web environment)
    >
    > Simplest approach would be using btree from bsddb. But there are only
    > "first"/"last" and "next"/"prev" moves. If I have to retrieve 20..29th
    > students, I have got to "first" and then "next" 19 times before I
    > finally get to 20th student, which is a waste of time(and memory).
    > Maybe I could keep the cursor position among sessions if I were using
    > a standalone server. However, it is in CGI environment.
    >
    > What can I do instead? What data structure allows you to access
    > nth..n+10th data from the sorted list in a constant time?


    This is an example of "premature optimization". How many student do you
    have anyways? 10,000 students with 100 bytes for name and score comes
    to 1MB. Just keep the list as textfile, and read a block at time.
     
    William Park, Oct 5, 2004
    #5
  6. Jane Austine

    Jane Austine Guest

    William Park <> wrote in message news:<>...
    > Jane Austine <> wrote:
    > > Hello.
    > >
    > > I have student_names and their scores, which are between 0..100.
    > >
    > > I want to retrieve in the sorted order in terms of score, Nth
    > > student's name to N+10th student's name along with their scores.
    > > (suppose a web page showing top 10 students and there are "next page"
    > > link and if you follow the link there shows next top 10 students with
    > > "prev" and "next" page links.)
    > >
    > > The list is dynamically resized -- elements are added or deleted. The
    > > list could go fairly long enough that pulling all the data on memory
    > > at once is not a good choice.
    > >
    > > What algorithm and data/file structure are most efficient? (the
    > > environment is "cgi" based web environment)
    > >
    > > Simplest approach would be using btree from bsddb. But there are only
    > > "first"/"last" and "next"/"prev" moves. If I have to retrieve 20..29th
    > > students, I have got to "first" and then "next" 19 times before I
    > > finally get to 20th student, which is a waste of time(and memory).
    > > Maybe I could keep the cursor position among sessions if I were using
    > > a standalone server. However, it is in CGI environment.
    > >
    > > What can I do instead? What data structure allows you to access
    > > nth..n+10th data from the sorted list in a constant time?

    >
    > This is an example of "premature optimization". How many student do you
    > have anyways? 10,000 students with 100 bytes for name and score comes
    > to 1MB. Just keep the list as textfile, and read a block at time.


    Thank you for reminding me Knuth's words. However, there are other
    information on each records, and the whole list could grow upto
    10MB~100MB.

    Another problem is, the list update action could take a long time and
    big memory. The list could shrink and expand. Do I have to sort them
    all in memory and then update the textfile _everytime_ it's changed?
     
    Jane Austine, Oct 8, 2004
    #6
  7. Jane Austine

    Paul Rubin Guest

    (Jane Austine) writes:
    > Another problem is, the list update action could take a long time and
    > big memory. The list could shrink and expand. Do I have to sort them
    > all in memory and then update the textfile _everytime_ it's changed?


    I don't understand this thread. If you're trying to implement a real
    system, why don't you use MySQL? You're at nowhere near the traffic
    level where that approach runs out of gas.
     
    Paul Rubin, Oct 8, 2004
    #7
  8. Re: Data/File Structure and Algorithm for Retrieving Sorted DataChunk Efficiently


    > > Another problem is, the list update action could take a long time and
    > > big memory. The list could shrink and expand. Do I have to sort them
    > > all in memory and then update the textfile _everytime_ it's changed?

    >
    > I don't understand this thread. If you're trying to implement a real
    > system, why don't you use MySQL? You're at nowhere near the traffic
    > level where that approach runs out of gas.


    Or even the included bsddb.btree database (remember to translate to/from
    strings, open using bsddb.btopen() ).

    - Josiah
     
    Josiah Carlson, Oct 8, 2004
    #8
  9. Jane Austine

    William Park Guest

    Jane Austine <> wrote:
    > Thank you for reminding me Knuth's words. However, there are other
    > information on each records, and the whole list could grow upto
    > 10MB~100MB.
    >
    > Another problem is, the list update action could take a long time and
    > big memory. The list could shrink and expand. Do I have to sort them
    > all in memory and then update the textfile _everytime_ it's changed?


    There are so many ways. Textfile is one. SQL is another. Only you can
    answer that question.

    --
    William Park <>
    Open Geometry Consulting, Toronto, Canada
     
    William Park, Oct 8, 2004
    #9
  10. Re: Data/File Structure and Algorithm for Retrieving Sorted Data Chunk Efficiently

    Josiah Carlson <> wrote:

    > > > Another problem is, the list update action could take a long time and
    > > > big memory. The list could shrink and expand. Do I have to sort them
    > > > all in memory and then update the textfile _everytime_ it's changed?

    > >
    > > I don't understand this thread. If you're trying to implement a real
    > > system, why don't you use MySQL? You're at nowhere near the traffic
    > > level where that approach runs out of gas.

    >
    > Or even the included bsddb.btree database (remember to translate to/from
    > strings, open using bsddb.btopen() ).


    But, given a bsdsb btree with, say, 10,000 entries in it (in virtually
    sorted order), how does one speedily say "get entries 4020 to 4030", for
    example? That was the original poster's problem as originally posed.
    If you call somedb.set_location(4020), you learn that integer keys are
    only allowed for Recno and Queue DBs, not for BTree ones...

    I think the crucial step (with current versions of bsddb) is to have
    btflags=bsddb.db.DB_RECNUM at creation time; then, say xx is the
    variable that's your database, xx.dbc.set_recno(4019) should position
    the cursor as required. Sleepycat does warn that DB_RECNUM can degrade
    performance for some applications, though, so be sure to do some
    benchmarks if you want to do things this way.


    Alex
     
    Alex Martelli, Oct 8, 2004
    #10
  11. Re: Data/File Structure and Algorithm for Retrieving SortedData Chunk Efficiently


    > > Or even the included bsddb.btree database (remember to translate to/from
    > > strings, open using bsddb.btopen() ).

    >
    > But, given a bsdsb btree with, say, 10,000 entries in it (in virtually
    > sorted order), how does one speedily say "get entries 4020 to 4030", for
    > example? That was the original poster's problem as originally posed.
    > If you call somedb.set_location(4020), you learn that integer keys are
    > only allowed for Recno and Queue DBs, not for BTree ones...
    >
    > I think the crucial step (with current versions of bsddb) is to have
    > btflags=bsddb.db.DB_RECNUM at creation time; then, say xx is the
    > variable that's your database, xx.dbc.set_recno(4019) should position
    > the cursor as required. Sleepycat does warn that DB_RECNUM can degrade
    > performance for some applications, though, so be sure to do some
    > benchmarks if you want to do things this way.


    I would hope that bsddb keeps track of tree location information in an
    intelligent way, so even if it is slower, it still beats a flat text
    file hands-down.

    It would likely not beat a real dbms, but it all depends on if the
    original poster finds that necessary.

    - Josiah
     
    Josiah Carlson, Oct 8, 2004
    #11
  12. Re: Data/File Structure and Algorithm for Retrieving SortedData Chunk Efficiently

    On 2004 Oct 08, at 17:38, Josiah Carlson wrote:
    ...
    > I would hope that bsddb keeps track of tree location information in an
    > intelligent way, so even if it is slower, it still beats a flat text
    > file hands-down.
    >
    > It would likely not beat a real dbms, but it all depends on if the
    > original poster finds that necessary.


    Hmmmm -- I could be wrong, but doesn't MySql _use_ bsddb as its lower
    level at least for some kinds of tables? Yet (for simple query) the
    vox populi has it "faster" than "real dbms"s... just wondering aloud...


    Alex
     
    Alex Martelli, Oct 8, 2004
    #12
  13. Re: Data/File Structure and Algorithm for Retrieving Sorted Data Chunk Efficiently

    On Fri, 8 Oct 2004 17:52:37 +0200, Alex Martelli <>
    declaimed the following in comp.lang.python:

    > Hmmmm -- I could be wrong, but doesn't MySql _use_ bsddb as its lower
    > level at least for some kinds of tables? Yet (for simple query) the
    > vox populi has it "faster" than "real dbms"s... just wondering aloud...
    >

    The default table is "MyISAM". BSDDB and INNO have to be
    explicitly requested (and built into the executable, in some cases)

    --
    > ============================================================== <
    > | Wulfraed Dennis Lee Bieber KD6MOG <
    > | Bestiaria Support Staff <
    > ============================================================== <
    > Home Page: <http://www.dm.net/~wulfraed/> <
    > Overflow Page: <http://wlfraed.home.netcom.com/> <
     
    Dennis Lee Bieber, Oct 9, 2004
    #13
  14. Re: Data/File Structure and Algorithm for Retrieving Sorted Data Chunk Efficiently

    Dennis Lee Bieber <> wrote:

    > On Fri, 8 Oct 2004 17:52:37 +0200, Alex Martelli <>
    > declaimed the following in comp.lang.python:
    >
    > > Hmmmm -- I could be wrong, but doesn't MySql _use_ bsddb as its lower
    > > level at least for some kinds of tables? Yet (for simple query) the
    > > vox populi has it "faster" than "real dbms"s... just wondering aloud...
    > >

    > The default table is "MyISAM". BSDDB and INNO have to be
    > explicitly requested (and built into the executable, in some cases)


    Ah, I see, thanks -- so, performance relationships between these remain
    unclear...


    Alex
     
    Alex Martelli, Oct 9, 2004
    #14
  15. Re: Data/File Structure and Algorithm for Retrieving Sorted Data Chunk Efficiently

    On Sat, 9 Oct 2004 08:44:40 +0200, (Alex Martelli)
    declaimed the following in comp.lang.python:

    >
    > Ah, I see, thanks -- so, performance relationships between these remain
    > unclear...
    >

    Let's see what I can find in the hardcopy MySQL reference...


    MyISAM (and the older ISAM) tables are not transaction safe;
    InnoDB and "BDB" are...

    MyISAM runs faster, since it doesn't have the transaction
    journal overhead. BLOB and TEXT columns can be indexed. Each table
    consists of a definition file, the data file, and index file.

    InnoDB supports "foreign key" definitions, and does row-level
    locking. Apparently all tables share space within predefined files,
    which don't grow in size (along with MySQL's definition file -- InnoDB
    runs as a separate back-end to MySQL, so it has its internal definition
    overhead along with the MySQL definition).

    BDB uses a patched version of the Sleepycat code to fit better
    with MySQL. Slower as the data is stored in B-Tree format (MyISAM has
    the indexes separate, with a straight data file).

    >
    > Alex


    --
    > ============================================================== <
    > | Wulfraed Dennis Lee Bieber KD6MOG <
    > | Bestiaria Support Staff <
    > ============================================================== <
    > Home Page: <http://www.dm.net/~wulfraed/> <
    > Overflow Page: <http://wlfraed.home.netcom.com/> <
     
    Dennis Lee Bieber, Oct 9, 2004
    #15
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Jane Austine
    Replies:
    2
    Views:
    458
    Changjune Kim
    Oct 5, 2004
  2. Hunk
    Replies:
    4
    Views:
    376
  3. Replies:
    6
    Views:
    2,019
  4. Sanjeeb
    Replies:
    3
    Views:
    426
    Ryan Kelly
    Aug 3, 2010
  5. bwv549
    Replies:
    3
    Views:
    427
    Eleanor McHugh
    Jun 17, 2009
Loading...

Share This Page