reading tabulated files

davario

Hi everyone, I'm doing a project that requires comparing entries in a
file (the entries are separated by \r). I need to compare the first
entry to the second, then to the third, etc. The same thing needs to be
done with the second entry (compared to the third, fourth, etc.). I want
to do it straight from the file, since it is a large amount of data. How
do I "remember" the place I last read in the file? Or perhaps, how can I
remember the locations of all entries?
Thanks in advance!
 
John Harrison

tellg and seekg are the stream member functions for working with the
read position of a file. tellg tells you where you are in the file, and
seekg moves the read position to a new place.
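
As a rough illustration of "remembering" locations with tellg and seekg, here is a minimal sketch. It assumes the file is opened in binary mode and that entries really are separated by a single '\r'. The function names index_entries and read_entry are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Scan the file once and remember where each '\r'-separated entry starts.
std::vector<std::streampos> index_entries(std::ifstream& in) {
    std::vector<std::streampos> offsets;
    std::string entry;
    in.clear();
    in.seekg(0);
    for (;;) {
        std::streampos start = in.tellg();          // position before the entry
        if (!std::getline(in, entry, '\r')) break;  // no more entries
        offsets.push_back(start);
    }
    in.clear();  // drop eofbit/failbit so later seekg calls work
    return offsets;
}

// Jump back to a remembered offset and read that one entry again.
std::string read_entry(std::ifstream& in,
                       const std::vector<std::streampos>& offsets,
                       std::size_t i) {
    assert(i < offsets.size());
    in.clear();
    in.seekg(offsets[i]);
    std::string entry;
    std::getline(in, entry, '\r');
    return entry;
}
```

With the offsets in hand, read_entry(in, offsets, i) and read_entry(in, offsets, j) fetch any pair without keeping the whole file in memory, although every call still goes back to the disk.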

But really this sounds horrendous. If the amount of data is so big that
you can't load it into memory, then this is going to take days to
execute. If the amount of data is small enough to load into memory, you
should load it.

But the real problem is the algorithm. Comparing every entry with every
later entry takes n(n-1)/2 comparisons for n items. Suppose you have
10,000 data items: then you are going to have to do approximately
50,000,000 comparisons. Suppose you have 100,000 data items: that rises
to approximately 5,000,000,000 comparisons.

Since I don't know what you are comparing and why, it's hard to suggest
improvements, but you might consider sorting the data before you start
doing comparisons.

john
 
davario

I am sorting DNA sequences, and I have around 4000 sequences to
compare.
They are each quite big, and when I tried to load them all into memory
for use in MATLAB it took ages and didn't work too well, so I thought it
might be better to load them two at a time.
 
Marcin Kalicinski

If your comparison involves only checking for equality/inequality, you may
precalculate hash values for the sequences and compare those. You can easily
hold 4000 hash values in memory and it will work in a snap.

If two hash values differ, the sequences are different. If they are equal,
the sequences _might_ be equal, so then and only then do you compare the
sequences themselves. The hash function you use is almost irrelevant; it can
probably be very simple, like the sum of all bytes in the sequence.
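
A minimal sketch of that idea follows, assuming that "compare" means testing for exact equality. It uses std::hash instead of a byte sum. The Loader callback and the name find_equal_pairs are hypothetical: Loader stands in for whatever reads sequence number i from the file.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical callback: returns sequence number i (e.g. read from the file).
using Loader = std::function<std::string(std::size_t)>;

// Return all index pairs (i, j), i < j, whose sequences are identical.
std::vector<std::pair<std::size_t, std::size_t>>
find_equal_pairs(std::size_t count, const Loader& load) {
    // Pass 1: one hash per sequence; only the hashes stay in memory.
    std::vector<std::size_t> hashes(count);
    std::hash<std::string> hasher;
    for (std::size_t i = 0; i < count; ++i)
        hashes[i] = hasher(load(i));

    // Pass 2: compare hashes; load the real data only when hashes match.
    std::vector<std::pair<std::size_t, std::size_t>> equal;
    for (std::size_t i = 0; i < count; ++i) {
        for (std::size_t j = i + 1; j < count; ++j) {
            if (hashes[i] != hashes[j]) continue;  // definitely different
            if (load(i) == load(j))                // confirm a possible match
                equal.emplace_back(i, j);
        }
    }
    return equal;
}
```

The pairwise loop is still quadratic, but it only touches small integers, so the expensive sequence reads happen once per sequence plus once per candidate match.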

cheers,
Marcin
 
John Harrison

davario said:
I am sorting DNA sequences, and I have around 4000 sequences to
compare.
They are each quite big, and when I tried to load them all into memory
for use in MATLAB it took ages and didn't work too well, so I thought it
might be better to load them two at a time.

You are sorting using the process you described in your first post? I
would say that the reason it took ages when you loaded them all into
memory was that you are using the wrong algorithm. Loading them two at a
time is not going to make things any better (in fact it will be worse).

This is the wrong way to do it. There are vastly more efficient ways to
sort data. If you really want to do this without reading the data into
memory, then you should look at an algorithm called merge sort. If you
can read the data into memory, then use an algorithm called quick sort.
Sorting in memory is the preferable option, but both of these will be
hugely more efficient than what you are proposing.

There is a great deal of literature on sorting techniques, so a little
research should turn up something very quickly. C++ even has a
general-purpose sort, std::sort, as part of its standard library
(typically implemented as a quick sort variant), so you won't even have
to code the algorithm.
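
As a minimal sketch of the in-memory route, assuming the sequences fit in a std::vector<std::string> and that the goal is to spot identical sequences (the thread never says exactly what the comparison is): after std::sort, equal sequences sit next to each other, so one linear pass over neighbours replaces the all-pairs loop. The function name sort_and_count_duplicates is made up for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Sort the sequences in place, then count neighbours that are identical.
std::size_t sort_and_count_duplicates(std::vector<std::string>& seqs) {
    std::sort(seqs.begin(), seqs.end());  // O(n log n) comparisons
    std::size_t duplicates = 0;
    for (std::size_t i = 1; i < seqs.size(); ++i) {
        if (seqs[i] == seqs[i - 1])       // equal entries are now adjacent
            ++duplicates;
    }
    return duplicates;
}
```

For 4000 sequences this needs on the order of tens of thousands of comparisons, rather than the roughly 8,000,000 that comparing every pair would take.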

John
 
