Datastructure design

S

Santosh

Hello,

I would like some input on choosing a datastructure and a algorithm. I
have a text file which contains three strings(say name, phonenumber
and city). The file contains a about a billion records.

I need to choose a datastructure which will sort efficienctly based on
any of the strings(keys) which may be any one of the three or a
combination of the three in which case we will need to sort with
multiple keys.

What is the best datastructure to store this data?

the problem here is that the key is not fixed. It could be the name,
phonenumber or the city and sometimes we nmight also need to sort
first by name and then by city.

I was thinking we could use multi-key quicksort but I am a little
confused as to how to store the data. I could use a B-Tree to store
the data but how I dont know how to implement the compare function,
because the keys are not fixed ?

Any suggestions?

Thanks in advance
 
S

Sheldon Simms

Hello,

I would like some input on choosing a datastructure and a algorithm. I
have a text file which contains three strings(say name, phonenumber
and city). The file contains a about a billion records.

Your question is off topic here, but I have to wonder a bit... You
have a billion records in a file? How interesting. You know there
are these things called databases. They're quite useful in fact.
I was thinking we could use multi-key quicksort but I am a little
confused as to how to store the data.

No kidding... Do a google on disk sort algorithms.
 
M

Mac

Hello,

I would like some input on choosing a datastructure and a algorithm. I
have a text file which contains three strings(say name, phonenumber
and city). The file contains a about a billion records.

I need to choose a datastructure which will sort efficienctly based on
any of the strings(keys) which may be any one of the three or a
combination of the three in which case we will need to sort with
multiple keys.

What is the best datastructure to store this data?

the problem here is that the key is not fixed. It could be the name,
phonenumber or the city and sometimes we nmight also need to sort
first by name and then by city.

I was thinking we could use multi-key quicksort but I am a little
confused as to how to store the data. I could use a B-Tree to store
the data but how I dont know how to implement the compare function,
because the keys are not fixed ?

Any suggestions?

Thanks in advance

Your question is off-topic here, but make sure you understand that sorting
any data set which is bigger than your computer's real memory size is going
to be a specialized task. You can't just read it into memory and then
start swapping things around, because this will lead to thrashing.

You're going to have to read a little, sort a little, read a little, sort
a little, merge, merge, merge. Or something like that.

Of course, maybe you have enough RAM to read in a billion records, I don't
know.

Good luck.

Mac
--
 
S

Santosh

I am sorry that the question was not offtopic for this group. But I
felt most of the C Programmers would definitely be able to suggest
something. Well, I know we can use a Database. But the question was
more intended from a design perspective.

Since there are a billion records, I need to use for sure some kind of
external sorting algorithm like mergesort. I could use a B-Tree to
store the data with the phone number(or some unique identifier) as the
key. But what I wanted to know what how can I sort this tree based on
any other field stored in the node. Lets say a name, something like in
a Database where we can sort by multiple fields of a record at the
same time.

Thanks
 
E

Eric Sosman

Santosh said:
I am sorry that the question was not offtopic for this group. But I
felt most of the C Programmers would definitely be able to suggest
something. Well, I know we can use a Database. But the question was
more intended from a design perspective.

Since there are a billion records, I need to use for sure some kind of
external sorting algorithm like mergesort. I could use a B-Tree to
store the data with the phone number(or some unique identifier) as the
key. But what I wanted to know what how can I sort this tree based on
any other field stored in the node. Lets say a name, something like in
a Database where we can sort by multiple fields of a record at the
same time.

It's still off-topic, and C programmers are not ipso
facto experts on managing large data collections. Maybe
there will eventually be some C questions as you move ahead
with the implementation, but as yet no C question has been
asked.

<off-topic>

Start with Knuth "The Art of Computer Programming, Volume
III: Sorting and Searching," section 6.5 "Retrieval on
secondary keys." Follow bibliographical links as appropriate.

</off-topic>
 
M

Mike Wahler

Santosh said:
I am sorry that the question was not offtopic for this group. But I
felt most of the C Programmers would definitely be able to suggest
something. Well, I know we can use a Database. But the question was
more intended from a design perspective.

Since there are a billion records, I need to use for sure some kind of
external sorting algorithm like mergesort. I could use a B-Tree to
store the data with the phone number(or some unique identifier) as the
key. But what I wanted to know what how can I sort this tree based on
any other field stored in the node. Lets say a name, something like in
a Database where we can sort by multiple fields of a record at the
same time.

Thanks

Any sort implies a comparison of elements. Factor this
comparison out into a function (or several if different
comparisons are needed), and have your sort function call
the appropriate comparison function(s) (Do this by providing
a function pointer parameter for your sort function). This
is the way the standard library's 'qsort()' function works
(but 'qsort()' sorts an array, not a 'tree').

-Mike
 
P

pandy

I am sorry that the question was not offtopic for this group. But I
felt most of the C Programmers would definitely be able to suggest
something. Well, I know we can use a Database. But the question was
more intended from a design perspective.

Since there are a billion records, I need to use for sure some kind of
external sorting algorithm like mergesort. I could use a B-Tree to
store the data with the phone number(or some unique identifier) as the
key. But what I wanted to know what how can I sort this tree based on
any other field stored in the node. Lets say a name, something like in
a Database where we can sort by multiple fields of a record at the
same time.

You can achieve it by using SQL. Easy!
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,774
Messages
2,569,596
Members
45,142
Latest member
arinsharma
Top