Datastructure design

Discussion in 'C Programming' started by Santosh, Nov 19, 2003.

  1. Santosh

    Santosh Guest

    Hello,

    I would like some input on choosing a datastructure and a algorithm. I
    have a text file which contains three strings(say name, phonenumber
    and city). The file contains a about a billion records.

    I need to choose a datastructure which will sort efficienctly based on
    any of the strings(keys) which may be any one of the three or a
    combination of the three in which case we will need to sort with
    multiple keys.

    What is the best datastructure to store this data?

    the problem here is that the key is not fixed. It could be the name,
    phonenumber or the city and sometimes we nmight also need to sort
    first by name and then by city.

    I was thinking we could use multi-key quicksort but I am a little
    confused as to how to store the data. I could use a B-Tree to store
    the data but how I dont know how to implement the compare function,
    because the keys are not fixed ?

    Any suggestions?

    Thanks in advance
     
    Santosh, Nov 19, 2003
    #1
    1. Advertising

  2. On Tue, 18 Nov 2003 16:13:49 -0800, Santosh wrote:

    > Hello,
    >
    > I would like some input on choosing a datastructure and a algorithm. I
    > have a text file which contains three strings(say name, phonenumber
    > and city). The file contains a about a billion records.


    Your question is off topic here, but I have to wonder a bit... You
    have a billion records in a file? How interesting. You know there
    are these things called databases. They're quite useful in fact.

    > I was thinking we could use multi-key quicksort but I am a little
    > confused as to how to store the data.


    No kidding... Do a google on disk sort algorithms.
     
    Sheldon Simms, Nov 19, 2003
    #2
    1. Advertising

  3. Santosh

    Mac Guest

    On Tue, 18 Nov 2003 16:13:49 +0000, Santosh wrote:

    > Hello,
    >
    > I would like some input on choosing a datastructure and a algorithm. I
    > have a text file which contains three strings(say name, phonenumber
    > and city). The file contains a about a billion records.
    >
    > I need to choose a datastructure which will sort efficienctly based on
    > any of the strings(keys) which may be any one of the three or a
    > combination of the three in which case we will need to sort with
    > multiple keys.
    >
    > What is the best datastructure to store this data?
    >
    > the problem here is that the key is not fixed. It could be the name,
    > phonenumber or the city and sometimes we nmight also need to sort
    > first by name and then by city.
    >
    > I was thinking we could use multi-key quicksort but I am a little
    > confused as to how to store the data. I could use a B-Tree to store
    > the data but how I dont know how to implement the compare function,
    > because the keys are not fixed ?
    >
    > Any suggestions?
    >
    > Thanks in advance


    Your question is off-topic here, but make sure you understand that sorting
    any data set which is bigger than your computer's real memory size is going
    to be a specialized task. You can't just read it into memory and then
    start swapping things around, because this will lead to thrashing.

    You're going to have to read a little, sort a little, read a little, sort
    a little, merge, merge, merge. Or something like that.

    Of course, maybe you have enough RAM to read in a billion records, I don't
    know.

    Good luck.

    Mac
    --
     
    Mac, Nov 19, 2003
    #3
  4. Santosh

    Santosh Guest

    I am sorry that the question was not offtopic for this group. But I
    felt most of the C Programmers would definitely be able to suggest
    something. Well, I know we can use a Database. But the question was
    more intended from a design perspective.

    Since there are a billion records, I need to use for sure some kind of
    external sorting algorithm like mergesort. I could use a B-Tree to
    store the data with the phone number(or some unique identifier) as the
    key. But what I wanted to know what how can I sort this tree based on
    any other field stored in the node. Lets say a name, something like in
    a Database where we can sort by multiple fields of a record at the
    same time.

    Thanks



    "Mac" <> wrote in message news:<>...
    > On Tue, 18 Nov 2003 16:13:49 +0000, Santosh wrote:
    >
    > > Hello,
    > >
    > > I would like some input on choosing a datastructure and a algorithm. I
    > > have a text file which contains three strings(say name, phonenumber
    > > and city). The file contains a about a billion records.
    > >
    > > I need to choose a datastructure which will sort efficienctly based on
    > > any of the strings(keys) which may be any one of the three or a
    > > combination of the three in which case we will need to sort with
    > > multiple keys.
    > >
    > > What is the best datastructure to store this data?
    > >
    > > the problem here is that the key is not fixed. It could be the name,
    > > phonenumber or the city and sometimes we nmight also need to sort
    > > first by name and then by city.
    > >
    > > I was thinking we could use multi-key quicksort but I am a little
    > > confused as to how to store the data. I could use a B-Tree to store
    > > the data but how I dont know how to implement the compare function,
    > > because the keys are not fixed ?
    > >
    > > Any suggestions?
    > >
    > > Thanks in advance

    >
    > Your question is off-topic here, but make sure you understand that sorting
    > any data set which is bigger than your computer's real memory size is going
    > to be a specialized task. You can't just read it into memory and then
    > start swapping things around, because this will lead to thrashing.
    >
    > You're going to have to read a little, sort a little, read a little, sort
    > a little, merge, merge, merge. Or something like that.
    >
    > Of course, maybe you have enough RAM to read in a billion records, I don't
    > know.
    >
    > Good luck.
    >
    > Mac
    > --
     
    Santosh, Nov 19, 2003
    #4
  5. Santosh

    Eric Sosman Guest

    Santosh wrote:
    >
    > I am sorry that the question was not offtopic for this group. But I
    > felt most of the C Programmers would definitely be able to suggest
    > something. Well, I know we can use a Database. But the question was
    > more intended from a design perspective.
    >
    > Since there are a billion records, I need to use for sure some kind of
    > external sorting algorithm like mergesort. I could use a B-Tree to
    > store the data with the phone number(or some unique identifier) as the
    > key. But what I wanted to know what how can I sort this tree based on
    > any other field stored in the node. Lets say a name, something like in
    > a Database where we can sort by multiple fields of a record at the
    > same time.


    It's still off-topic, and C programmers are not ipso
    facto experts on managing large data collections. Maybe
    there will eventually be some C questions as you move ahead
    with the implementation, but as yet no C question has been
    asked.

    <off-topic>

    Start with Knuth "The Art of Computer Programming, Volume
    III: Sorting and Searching," section 6.5 "Retrieval on
    secondary keys." Follow bibliographical links as appropriate.

    </off-topic>

    --
     
    Eric Sosman, Nov 19, 2003
    #5
  6. Santosh

    Mike Wahler Guest

    "Santosh" <> wrote in message
    news:...
    > I am sorry that the question was not offtopic for this group. But I
    > felt most of the C Programmers would definitely be able to suggest
    > something. Well, I know we can use a Database. But the question was
    > more intended from a design perspective.
    >
    > Since there are a billion records, I need to use for sure some kind of
    > external sorting algorithm like mergesort. I could use a B-Tree to
    > store the data with the phone number(or some unique identifier) as the
    > key. But what I wanted to know what how can I sort this tree based on
    > any other field stored in the node. Lets say a name, something like in
    > a Database where we can sort by multiple fields of a record at the
    > same time.
    >
    > Thanks


    Any sort implies a comparison of elements. Factor this
    comparison out into a function (or several if different
    comparisons are needed), and have your sort function call
    the appropriate comparison function(s) (Do this by providing
    a function pointer parameter for your sort function). This
    is the way the standard library's 'qsort()' function works
    (but 'qsort()' sorts an array, not a 'tree').

    -Mike
     
    Mike Wahler, Nov 19, 2003
    #6
  7. Santosh

    pandy Guest

    (Santosh) wrote in message news:<>...
    > I am sorry that the question was not offtopic for this group. But I
    > felt most of the C Programmers would definitely be able to suggest
    > something. Well, I know we can use a Database. But the question was
    > more intended from a design perspective.
    >
    > Since there are a billion records, I need to use for sure some kind of
    > external sorting algorithm like mergesort. I could use a B-Tree to
    > store the data with the phone number(or some unique identifier) as the
    > key. But what I wanted to know what how can I sort this tree based on
    > any other field stored in the node. Lets say a name, something like in
    > a Database where we can sort by multiple fields of a record at the
    > same time.


    You can achieve it by using SQL. Easy!

    >
    > Thanks
    >
    >
    >
    > "Mac" <> wrote in message news:<>...
    > > On Tue, 18 Nov 2003 16:13:49 +0000, Santosh wrote:
    > >
    > > > Hello,
    > > >
    > > > I would like some input on choosing a datastructure and a algorithm. I
    > > > have a text file which contains three strings(say name, phonenumber
    > > > and city). The file contains a about a billion records.
    > > >
    > > > I need to choose a datastructure which will sort efficienctly based on
    > > > any of the strings(keys) which may be any one of the three or a
    > > > combination of the three in which case we will need to sort with
    > > > multiple keys.
    > > >
    > > > What is the best datastructure to store this data?
    > > >
    > > > the problem here is that the key is not fixed. It could be the name,
    > > > phonenumber or the city and sometimes we nmight also need to sort
    > > > first by name and then by city.
    > > >
    > > > I was thinking we could use multi-key quicksort but I am a little
    > > > confused as to how to store the data. I could use a B-Tree to store
    > > > the data but how I dont know how to implement the compare function,
    > > > because the keys are not fixed ?
    > > >
    > > > Any suggestions?
    > > >
    > > > Thanks in advance

    > >
    > > Your question is off-topic here, but make sure you understand that sorting
    > > any data set which is bigger than your computer's real memory size is going
    > > to be a specialized task. You can't just read it into memory and then
    > > start swapping things around, because this will lead to thrashing.
    > >
    > > You're going to have to read a little, sort a little, read a little, sort
    > > a little, merge, merge, merge. Or something like that.
    > >
    > > Of course, maybe you have enough RAM to read in a billion records, I don't
    > > know.
    > >
    > > Good luck.
    > >
    > > Mac
    > > --
     
    pandy, Nov 20, 2003
    #7
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Anony!

    Set DataStructure

    Anony!, Aug 12, 2004, in forum: Java
    Replies:
    2
    Views:
    399
    Anony!
    Aug 13, 2004
  2. Sharp

    Index-based datastructure

    Sharp, Mar 14, 2005, in forum: Java
    Replies:
    1
    Views:
    328
    Chris Uppal
    Mar 14, 2005
  3. Replies:
    3
    Views:
    388
    shakah
    Jun 23, 2005
  4. Prateek Basu

    Datastructure and Algorithms

    Prateek Basu, Jan 23, 2004, in forum: C Programming
    Replies:
    4
    Views:
    2,209
    Prateek Basu
    Jan 24, 2004
  5. jason
    Replies:
    6
    Views:
    366
    jason
    Nov 10, 2007
Loading...

Share This Page