Reading a large number of text files into an array

Matthew Crema

Hello,

Say I have 1000 text files and each is a list of 32768 integers.

I have written a C program to read this data into a large matrix. I am
using fopen in combination with fscanf to read the data in. However, it
takes about 20 seconds to complete and I wonder if there is a faster way.

For example, I found that I could use 'fread' to read the data into a
string that looks like this:

91\n212\n34\n40\n25\n100\n300\n ... \0

and it is nearly instantaneous. However, is there a quick way to
convert this string into an array of doubles?

Thanks.
-Matt

Here is my existing code. Sorry if it is ugly. It is the first C code
I've written in a long time:

#include <stdio.h>
#include <stdlib.h>

#define nrows 32768L
#define nfiles 1000L

int main(void)
{
    double *data;
    unsigned long filenum, pos;
    char filename[20];
    FILE *fp;
    int i;

    /* Create output matrix */
    data = (double *) malloc((size_t)((nrows*nfiles)*sizeof(double)));

    for(filenum=1; filenum<=nfiles; ++filenum) {
        // Determine current file name
        sprintf(filename, "data%lu.dat", filenum);

        // Open the file
        fp = fopen (filename,"r");

        pos = nrows*(filenum-1L);

        for (i=0; i<nrows; ++i)
            fscanf(fp, "%lf", data+i+pos);

        fclose(fp);
    }

    // De-allocate memory
    free((char*) (data));

    return 0;
}
 
Christian Kandeler

Matthew said:
> Hello,
>
> Say I have 1000 text files and each is a list of 32768 integers.
>
> I have written a C program to read this data into a large matrix. I am
> using fopen in combination with fscanf to read the data in. However, it
> takes about 20 seconds to complete and I wonder if there is a faster way.
>
> For example, I found that I could use 'fread' to read the data into a
> string that looks like this:
>
> 91\n212\n34\n40\n25\n100\n300\n ... \0
>
> and it is nearly instantaneous. However, is there a quick way to
> convert this string into an array of doubles?

sscanf() might just be what you are looking for.

> Here is my existing code. Sorry if it is ugly. It is the first C code
> I've written in a long time:

I'll comment on it, if you don't mind.

> #include <stdio.h>
> #include <stdlib.h>
>
> #define nrows 32768L
> #define nfiles 1000L

Symbolic constants use all-capital letters by convention. Also, if you add a
suffix, why not UL instead of L? Neither value can ever be negative.

> int main(void)
> {
>     double *data;
>     unsigned long filenum, pos;
>     char filename[20];

This is somewhat unsafe. You should think about a way to make the size of
the array dependent on the maximum length of all components from which the
actual filename is constructed. (This requires dynamic allocation or a VLA
if you have C99).

>     FILE *fp;
>     int i;
>
>     /* Create output matrix */
>     data = (double *) malloc((size_t)((nrows*nfiles)*sizeof(double)));

No need for either of the casts here.

>     for(filenum=1; filenum<=nfiles; ++filenum) {
>         // Determine current file name
>         sprintf(filename, "data%lu.dat", filenum);
>
>         // Open the file
>         fp = fopen (filename,"r");

You should always check the return value of fopen().

>         pos = nrows*(filenum-1L);
>
>         for (i=0; i<nrows; ++i)
>             fscanf(fp, "%lf", data+i+pos);
>
>         fclose(fp);
>     }
>
>     // De-allocate memory
>     free((char*) (data));

Absolutely no need to cast here.

>     return 0;
> }


Christian
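
For reference, the review comments above could be applied roughly as follows. This is only a sketch, not the original poster's final code: load_files is a hypothetical function name, the 32-byte filename buffer assumes an unsigned long of at most 64 bits, and snprintf assumes a C99 library.

```c
#include <stdio.h>
#include <stdlib.h>

#define NROWS  32768UL
#define NFILES 1000UL

/* Fill data[] from data1.dat .. data<nfiles>.dat, nrows values each.
 * Returns 0 on success, -1 on the first file that cannot be opened
 * or does not hold nrows numbers. load_files is a hypothetical name. */
int load_files(double *data, unsigned long nfiles, unsigned long nrows)
{
    char filename[32];   /* "data" + up to 20 digits + ".dat" + '\0' */

    for (unsigned long f = 1; f <= nfiles; ++f) {
        /* snprintf cannot write past the end of the buffer */
        snprintf(filename, sizeof filename, "data%lu.dat", f);

        FILE *fp = fopen(filename, "r");
        if (fp == NULL) {
            perror(filename);
            return -1;
        }

        double *col = data + nrows * (f - 1);
        for (unsigned long i = 0; i < nrows; ++i) {
            if (fscanf(fp, "%lf", col + i) != 1) {
                fprintf(stderr, "%s: bad or missing value %lu\n",
                        filename, i);
                fclose(fp);
                return -1;
            }
        }
        fclose(fp);
    }
    return 0;
}
```

A caller would allocate with `double *data = malloc(NROWS * NFILES * sizeof *data);` (no casts), check the result for NULL, call `load_files(data, NFILES, NROWS)`, and finish with a plain `free(data);`.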
 
Alex Fraser

Matthew Crema said:
> Say I have 1000 text files and each is a list of 32768 integers.
>
> I have written a C program to read this data into a large matrix. I am
> using fopen in combination with fscanf to read the data in. However, it
> takes about 20 seconds to complete and I wonder if there is a faster way.
>
> For example, I found that I could use 'fread' to read the data into a
> string that looks like this:
>
> 91\n212\n34\n40\n25\n100\n300\n ... \0
>
> and it is nearly instantaneous. However, is there a quick way to
> convert this string into an array of doubles?

It is likely that most of the difference (between calling fscanf()
repeatedly and reading the whole file with fread()) comes from the
conversion. In other words, if you read everything in and then convert it,
it will probably take about the same total time as converting as you read.

Note that fread() does not terminate what it reads to make a string (i.e. it
does not append '\0' as you showed above).
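
A minimal sketch of reading a whole file with fread() and adding the terminator by hand might look like this. The helper name read_whole_file and the 4096-byte starting capacity are assumptions made for illustration; nothing here comes from the original program.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read everything from fp into a freshly allocated buffer and append
 * '\0' so the result can be passed to string functions.
 * Returns NULL on failure; the caller frees the buffer.
 * read_whole_file is a hypothetical helper name. */
char *read_whole_file(FILE *fp, size_t *len_out)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        /* Keep one byte in reserve for the terminator. */
        size_t n = fread(buf + len, 1, cap - len - 1, fp);
        len += n;
        if (len < cap - 1)      /* short read: end of file or error */
            break;
        char *tmp = realloc(buf, cap * 2);
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;
        cap *= 2;
    }
    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';            /* fread() never does this for us */
    if (len_out != NULL)
        *len_out = len;
    return buf;
}
```

Growing the buffer as data arrives avoids having to know the file size in advance.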

You say the files contain integers, but the code (that I've snipped)
converted (in a loose sense) the files to an array of double. Is that really
what you meant? (I can see why that might be just what you want.) If so, it
may be faster to do something like the following:

fscanf(fp, "%d", &temp);
array[pos] = temp;

The bottom line is that the C language itself provides no guarantees about
the speed (or relative speed) of code sequences. If speed is an issue, try
several approaches that seem reasonable and measure to see which is fastest.
But bear in mind that your results are valid only on the system you tested.
Changes to (for example) the compiler, compiler options, the standard
library, the operating system, or the hardware could give different results
and possibly a different conclusion.
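
One simple way to take such measurements with only the standard library is clock(). The sketch below uses hypothetical names (time_call, count_call), and count_call is only a stand-in workload. Note that clock() reports processor time, so for a job dominated by waiting on the disk the elapsed wall-clock time may be noticeably larger.

```c
#include <time.h>

/* Run fn(arg) once and return the processor time it used, in seconds.
 * time_call is a hypothetical helper name. */
double time_call(void (*fn)(void *), void *arg)
{
    clock_t start = clock();
    fn(arg);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Stand-in workload: counts how often it has been called.
 * Replace it with a wrapper around the reading code under test. */
void count_call(void *arg)
{
    ++*(int *)arg;
}
```

Timing each candidate several times and comparing the results is more trustworthy than a single run, since file caching can make the first pass much slower than later ones.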

Alex
 
Lawrence Kirby

Matthew Crema said:
> Hello,
>
> Say I have 1000 text files and each is a list of 32768 integers.
>
> I have written a C program to read this data into a large matrix. I am
> using fopen in combination with fscanf to read the data in. However, it
> takes about 20 seconds to complete and I wonder if there is a faster way.
>
> For example, I found that I could use 'fread' to read the data into a
> string that looks like this:
>
> 91\n212\n34\n40\n25\n100\n300\n ... \0

Note that fread() doesn't read strings, i.e. it doesn't write a
terminating null character. It is also likely to split a line between the
end of one read and the beginning of the next.

> and it is nearly instantaneous. However, is there a quick way to
> convert this string into an array of doubles?

You'll have to sort out the end of the buffer issues yourself but the
"simple" function to convert a string representation to a double is
strtod(). Well there is also atof() but that isn't very good at error
checking. These are likely to be the fastest ways of converting character
data to a double in the standard library.
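
As a rough illustration of the strtod() approach, the sketch below walks a null-terminated buffer and converts one number after another. It assumes the whole input is already in memory and terminated, so the split-line problem does not arise. parse_doubles is a hypothetical helper name.

```c
#include <stdlib.h>

/* Convert whitespace-separated numbers in the null-terminated string s
 * into out[], storing at most max values. Returns the count converted.
 * parse_doubles is a hypothetical helper name. */
size_t parse_doubles(const char *s, double *out, size_t max)
{
    size_t n = 0;
    while (n < max) {
        char *end;
        double v = strtod(s, &end);
        if (end == s)       /* no conversion: end of data or bad input */
            break;
        out[n++] = v;
        s = end;            /* the next call skips leading whitespace */
    }
    return n;
}
```

The end pointer is what makes the error check possible: when strtod() converts nothing it leaves end equal to the start, which atof() has no way to report.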

Try reading your file in a line at a time using fgets(). You may find that
this isn't much slower than using fread() and it makes the rest of your
task easier.
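
A line-at-a-time version could be sketched like this. The name read_lines is hypothetical, and the 64-byte line buffer is an assumption about the longest line in the data files.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read one number per line from fp into out[], at most max values.
 * Lines that do not start with a number are skipped.
 * read_lines is a hypothetical helper name; the 64-byte buffer is an
 * assumption about the maximum line length. */
size_t read_lines(FILE *fp, double *out, size_t max)
{
    char line[64];
    size_t n = 0;
    while (n < max && fgets(line, sizeof line, fp) != NULL) {
        char *end;
        double v = strtod(line, &end);
        if (end != line)    /* something was converted */
            out[n++] = v;
    }
    return n;
}
```

Because fgets() always delivers a terminated string that ends at a line boundary, there is no buffer-end bookkeeping to do.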

It was suggested that if the numbers in your file data are always integers
then you might convert to an integer and then to a double. That's worth
trying too. There is a strtol() function to do that. You could even try an
inline conversion loop in that case, assuming that performance is really
that much of an issue.
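
An inline conversion loop for the integer-only case might look roughly like the following. It is a sketch under strong assumptions: plain decimal digits with an optional leading minus sign, no overflow checking, and input that is already null-terminated. parse_ints is a hypothetical name.

```c
#include <stddef.h>

/* Hand-rolled decimal parser: converts whitespace-separated integers
 * in the null-terminated string s straight into out[] as doubles.
 * No overflow checks; parse_ints is a hypothetical name. */
size_t parse_ints(const char *s, double *out, size_t max)
{
    size_t n = 0;
    while (n < max) {
        while (*s == ' ' || *s == '\n' || *s == '\r' || *s == '\t')
            ++s;
        int neg = 0;
        if (*s == '-') {
            neg = 1;
            ++s;
        }
        if (*s < '0' || *s > '9')
            break;          /* end of string or unexpected character */
        long v = 0;
        while (*s >= '0' && *s <= '9')
            v = v * 10 + (*s++ - '0');
        out[n++] = neg ? -(double)v : (double)v;
    }
    return n;
}
```

Whether this beats strtol() or strtod() is something only a measurement on the target system can settle.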

Lawrence
 
Matthew Crema

Thanks to all for your responses.

I think I agree with Alex's post that the time-consuming part of this
whole thing is the conversion. So 'fread'ing the data in and then
converting the whole thing would likely take a similar amount of time as
'fscanf'ing the data into a double array. Eventually I'll play with
strtod and others, but I'm going to leave my code as it is for now.

Several of you pointed out (and I have verified) that fread does not
append the '\0' as I assumed.

Also, sorry for the confusion about ints vs. floats. My data is
generally double-precision floating point.

As an aside, using fgets (to read each line) instead of fread (to read the
entire file) seems to take much longer with my large data sets. For
smaller data sets there is not much difference.

Thanks for the other tips on bugfixes as well. I will implement them
immediately.

-Matt
 
