Fastest way to fill a structure.

simon

From my previous post...

If I have a structure,

struct sFileData
{
    char *sSomeString1;
    char *sSomeString2;
    int iSomeNum1;
    int iSomeNum2;
    sFileData(){...};
    ~sFileData(){...};
    sFileData(const sFileData&){...};
    const sFileData operator=( const sFileData &s ){...}
};

I read the file as follows

FILE *f = fopen( szPath, "rb" );

const int nLineSize = 190;
BYTE b[nLineSize+1];

fread( b, sizeof(BYTE), nLineSize, f );
b[nLineSize] = 0;
int numofrecords = atoi( (char*)b ); // first line is num of records only

// read the data itself.
while( fread( b, sizeof(BYTE), nLineSize, f ) == (size_t)nLineSize )
{
    // fill data
    // The location of each item is known:
    // sString1 = 0->39,   with blank-space filler after the data
    // sString2 = 40->79,  with blank-space filler after the data
    // iNum1    = 80->99,  with blank-space filler after the data
    // iNum2    = 100->end, with blank-space filler after the data
}

What would be the best way to fill the data into an array (vector)?

Many thanks.

Simon.
 
Victor Bazarov

simon said:
If I have a structure,

struct sFileData
{
char*sSomeString1;
char*sSomeString2;
int iSomeNum1;
int iSomeNum2;
sFileData(){...};
~sFileData(){...};
sFileData(const sFileData&){...};
const sFileData operator=( const sFileData &s ){...}
};

I read the file as follows

FILE *f = fopen( szPath, "rb" );

int nLineSize = 190;
BYTE b[nLineSize+1];

fread( b, sizeof(BYTE), nLineSize, f );
int numofrecords = atoi( b ); // first line is num of records only,

// read the data itself.
while( fread( b, sizeof(BYTE), nLineSize, f ) == nLineSize )
{
// fill data
// The locations of each items is known
// sString1 = 0->39, with blank spaces filler after data
// sString2 = 40->79, with blank spaces filler after data
// iNum1 = 80->99, with blank spaces filler after data
// iNum2 = 100->end, with blank spaces filler after data
}

what would be the best way to fill the data into an array, (vector)?

I presume nLineSize is greater than 100. Then, something along the lines of

// as soon as you know the number of structures
yourvector.reserve(numofrecords);

// read the data themselves
while ( fread( ... ) )
{
    yourvector.push_back(
        sFileData(
            std::string(b, b + 40).c_str(),
            std::string(b + 40, b + 80).c_str(),
            strtol(std::string(b + 80, b + 100).c_str(), 0, 10),
            strtol(std::string(b + 100, b + nLineSize).c_str(), 0, 10)
        )
    );
}

You will need to create another constructor for your 'sFileData',
which will take two pointers to const char, and two ints (or longs):

sFileData(char const*, char const*, int, int);

Take those pointers and extract the C strings from them to create your
members.

In general, I think it's better to have 'std::string' as members instead
of 'char*'. You may need to fix the rest of your class if you make that
switch.
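For what it's worth, that std::string version might be sketched like this (the `FromRecord` helper is a made-up name, and the field offsets are taken from the original post; trimming the blank padding is left out for brevity):

```cpp
#include <cstdlib>
#include <string>

// sFileData with std::string members, built straight from one
// 190-byte fixed-width record (offsets from the original post).
struct sFileData
{
    std::string sSomeString1;
    std::string sSomeString2;
    int iSomeNum1;
    int iSomeNum2;

    static sFileData FromRecord(const char* b)
    {
        sFileData d;
        d.sSomeString1.assign(b, b + 40);        // still blank-padded
        d.sSomeString2.assign(b + 40, b + 80);
        d.iSomeNum1 = std::atoi(std::string(b + 80, b + 100).c_str());
        d.iSomeNum2 = std::atoi(std::string(b + 100, b + 190).c_str());
        return d;
    }
};
```

The compiler-generated copy constructor, destructor, and assignment operator then do the right thing for free.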

V
 
titancipher

This way is _not_ fast, as there are loads of unnecessary memory
allocations. Simon, you had the right idea from the start, but the
data structure can be modified to:

struct sFileData
{
    char sSomeString1[40];
    char sSomeString2[40];
    int iSomeNum1;
    int iSomeNum2;
    ....
};

Then, you can use either an array or a vector. Since you know the size
ahead of time, you can create an array:

sFileData *array = new sFileData[ numofrecords ];
// read the data itself.
int i = 0;
while( fread( b, sizeof(BYTE), nLineSize, f ) == (size_t)nLineSize )
{
    array[ i ] = *(sFileData *)b;
    ++i;
}
 
Simon

I presume nLineSize is greater than 100. Then, something in line with

Why would it have to be > 100? Or are you saying that because of my
definition?
// as soon as you know the number of structures
yourvector.reserve(numofrecords);

Ok, it does speed things up a bit.
// read the data themselves
while ( fread( ... ) )
{
    yourvector.push_back(
        sFileData(
            std::string(b, b + 40).c_str(),
            std::string(b + 40, b + 80).c_str(),
            strtol(std::string(b + 80, b + 100).c_str(), 0, 10),
            strtol(std::string(b + 100, b + nLineSize).c_str(), 0, 10)
        )
    );
}

I still think that I am doing something wrong here.
To read a file with 100000 lines takes 0.66 sec (Windows machine).

But filling the structures above takes over 28 seconds.

Is that normal?

Simon
 
Jeff Flinn

Simon said:
Why would it have to be > 100? or are you saying that because of my
definition?


Ok, it does speed things up a bit.


I still think that I am doing something wrong here.
To read a file with 100000 lines takes 0.66 sec, (windows machine).

But filling the structure above takes +28 seconds.

Is that normal?

You won't know until you profile and see where the time is spent.

Jeff Flinn
 
Victor Bazarov

Simon said:
Why would it have to be > 100? or are you saying that because of my
definition?




Ok, it does speed things up a bit.




I still think that I am doing something wrong here.
To read a file with 100000 lines takes 0.66 sec, (windows machine).

But filling the structure above takes +28 seconds.

Is that normal?

May not be. You may want to change the structure and make it contain
arrays of char instead of pointers to dynamically allocated arrays.

Then the construction will be a bit faster, you could simply drop the
'string' thing there. Also, if you're sure about the source of the
data, and their format, you could avoid constructing temporaries. Play
with making 'sFileData' look like

char s1[41]; // if it's a C string, reserve the room for the null char
char s2[41];
int one, two;

and then you could construct it a bit faster. You will still need to
convert the third and the fourth fields since they can't be memcpy'ed.
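A sketch of that fixed-array layout (the `Fill` helper is a made-up name; the offsets are the ones from the original post):

```cpp
#include <cstdlib>
#include <cstring>

// Fixed-array version of sFileData: the string fields are copied
// straight out of the record buffer with memcpy, and only the two
// numeric fields need converting.
struct sFileData
{
    char s1[41]; // room reserved for the terminating null char
    char s2[41];
    int one, two;

    void Fill(const char* b) // b points at one 190-byte record
    {
        memcpy(s1, b, 40);      s1[40] = '\0';
        memcpy(s2, b + 40, 40); s2[40] = '\0';
        one = atoi(b + 80);     // fields are blank-padded ASCII, so
        two = atoi(b + 100);    // atoi stops at the first blank
    }
};
```

With no pointers inside, copying one of these into a vector is a plain member-wise copy and involves no heap traffic at all.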

V
 
Simon

May not be. You may want to change the structure and make it contain
arrays of char instead of pointers to dynamically allocated arrays.

Then the construction will be a bit faster, you could simply drop the
'string' thing there. Also, if you're sure about the source of the
data, and their format, you could avoid constructing temporaries. Play
with making 'sFileData' look like

char s1[41]; // if it's a C string, reserve the room for the null char
char s2[41];
int one, two;

I know I am going to be told I am too difficult, but the reason why I
dynamically create the strings is that they are almost never longer than 5
chars.
So by declaring s1[41] I know that I am wasting around 36 chars. (The sizes
vary; there could be a string of 40 chars.)

I know that we are only talking about 36 chars here, but I load 100000s of
lines and the waste really seems unnecessary to me (and I don't like
wasting memory).
It seems to defeat the object of dynamic memory allocation.

Simon
 
Steven T. Hatton

Simon said:
I know I am going to be told I am too difficult, but the reason why I
dynamically create the string is because they are almost never longer than
5 chars.
So by declaring s1[41] I know that I am wasting around 36 chars, (The
sizes are different, there could be a string of 40 chars).

I know that we are only talking about 36 chars here, but I load 100000's
of lines and the waste really seems unnecessary to me, (and I don't like
wasting memory).
It seems to defeat the object dynamic memory allocations.

Simon
What about using std::string with std::string::reserve(5), or something
close to the maximum "normal" value? That way you have a minimum
preallocated, but it can still grow.
 
Victor Bazarov

Simon said:
May not be. You may want to change the structure and make it contain
arrays of char instead of pointers to dynamically allocated arrays.

Then the construction will be a bit faster, you could simply drop the
'string' thing there. Also, if you're sure about the source of the
data, and their format, you could avoid constructing temporaries. Play
with making 'sFileData' look like

char s1[41]; // if it's a C string, reserve the room for the null char
char s2[41];
int one, two;


I know I am going to be told I am too difficult, but the reason why I
dynamically create the string is because they are almost never longer than 5
chars.
So by declaring s1[41] I know that I am wasting around 36 chars, (The sizes
are different, there could be a string of 40 chars).

I know that we are only talking about 36 chars here, but I load 100000's of
lines and the waste really seems unnecessary to me, (and I don't like
wasting memory).
It seems to defeat the object dynamic memory allocations.

Perhaps then you need to invent a smarter scheme for storing those strings
than keeping a pointer to a dynamic array of chars. Did you know that most
heap managers, when asked to allocate 1 char, slap 2*sizeof(void*) on top
of it to make a dynamic array? So you're still wasting plenty of memory
(not to mention all the CPU cycles spent allocating and then deallocating
it along with the other objects).

Imagine that your 'sFileData' class has a static storage for all its
strings, from which all individual strings are cut out (or, rather, in
which all individual strings are stacked up). If your objects never
change, and only get allocated once and deallocated together at some
point, then it might be the simple custom memory manager you need. You
can allocate that static storage in large chunks and give your class some
mechanism to account for allocations... Well, as you can see, all you may
need to improve the performance is a custom memory manager. You can
probably use an open source one, if you can find it.
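One way to sketch that idea (the names and chunk size are assumptions; individual strings are never freed, everything dies together with the arena):

```cpp
#include <cstddef>
#include <cstring>
#include <deque>
#include <vector>

// Bump-pointer arena: all strings are stacked up inside large
// shared chunks, so there is one heap allocation per chunk
// instead of one per string, and everything is released at once
// when the arena is destroyed.
class StringArena
{
    static const size_t kChunkSize = 1 << 20; // 1 MB per chunk

    std::deque<std::vector<char> > chunks_;   // deque: old chunks never move
    size_t used_;                             // bytes used in current chunk

public:
    StringArena() : used_(kChunkSize) {}      // force a chunk on first Store

    // Copy [s, s + len) into the arena, nul-terminate it, and return
    // a pointer that stays valid for the arena's whole lifetime.
    // (Assumes len + 1 <= kChunkSize.)
    const char* Store(const char* s, size_t len)
    {
        if (used_ + len + 1 > kChunkSize)
        {
            chunks_.push_back(std::vector<char>(kChunkSize));
            used_ = 0;
        }
        char* dst = &chunks_.back()[used_];
        memcpy(dst, s, len);
        dst[len] = '\0';
        used_ += len + 1;
        return dst;
    }
};
```

A deque is used for the chunk list because pushing a new chunk never moves the old ones, so previously returned pointers stay valid.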

V
 
Steven T. Hatton

Victor said:
Simon said:
May not be. You may want to change the structure and make it contain
arrays of char instead of pointers to dynamically allocated arrays.

Then the construction will be a bit faster, you could simply drop the
'string' thing there. Also, if you're sure about the source of the
data, and their format, you could avoid constructing temporaries. Play
with making 'sFileData' look like

char s1[41]; // if it's a C string, reserve the room for the null
char char s2[41];
int one, two;


I know I am going to be told I am too difficult, but the reason why I
dynamically create the string is because they are almost never longer
than 5 chars.
So by declaring s1[41] I know that I am wasting around 36 chars, (The
sizes are different, there could be a string of 40 chars).

I know that we are only talking about 36 chars here, but I load 100000's
of lines and the waste really seems unnecessary to me, (and I don't like
wasting memory).
It seems to defeat the object dynamic memory allocations.

Perhaps then you need to invent a smarter scheme for storing those strings
than keeping a pointer to a dynamic array of chars. Do you know that most
heap managers when you need to allocate 1 char would slap 2*sizeof(void*)
on top of it to make a dynamic array? So, you're still wasting enough
memory (not to say all the CPU cycles to allocate and then deallocate them
along with other objects).

Imagine that your 'sFileData' class has a static storage for all its
strings, from which all individual strings are cut out (or, rather, in
which all individual strings are stacked up). If your objects never
change, and only get allocated once and deallocated together at some
point, then it might be the simple custom memory manager you need. You
can allocate that static storage in large chunks and give your class some
mechanism to account for allocations... Well, as you can see, all you may
need to improve the performance is a custom memory manager. You can
probably use an open source one, if you can find it.

V

I just thought of another possible source of the slowness. It may
be a question of jumping back and forth between I/O and other operations.
I'd suggest using C++ I/O rather than C, and trying a buffered stream. I
don't know much about those, but I do know the counterpart in Java can
make a big difference. Alternatively, the file could be read into memory
explicitly in one slurp, and then processed with some kind of input stream
that reads from memory.
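That last suggestion might look something like this (a sketch; `SlurpFile` is a made-up name, and error handling is omitted):

```cpp
#include <fstream>
#include <iterator>
#include <vector>

// Read an entire file into one in-memory buffer in a single pass;
// records can then be parsed straight out of the buffer with no
// further I/O calls.
std::vector<char> SlurpFile(const char* path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}
```

Record i of Simon's file would then start at `&buf[recLen * (i + 1)]` (skipping the count record), so the parsing loop touches only memory.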
 
Jeff Flinn

Simon said:
May not be. You may want to change the structure and make it contain
arrays of char instead of pointers to dynamically allocated arrays.

Then the construction will be a bit faster, you could simply drop the
'string' thing there. Also, if you're sure about the source of the
data, and their format, you could avoid constructing temporaries. Play
with making 'sFileData' look like

char s1[41]; // if it's a C string, reserve the room for the null
char
char s2[41];
int one, two;

I know I am going to be told I am too difficult, but the reason why I
dynamically create the string is because they are almost never longer than
5 chars.
So by declaring s1[41] I know that I am wasting around 36 chars, (The
sizes are different, there could be a string of 40 chars).

I know that we are only talking about 36 chars here, but I load 100000's
of lines and the waste really seems unnecessary to me, (and I don't like
wasting memory).
It seems to defeat the object dynamic memory allocations.

If you're concerned about 36 chars, why not avoid all of this memory
allocation to begin with? Considering your file consists of fixed-length
records in a known quantity, you shouldn't need to copy/process
anything until it's needed. I've successfully used a version of this
approach on a memory-limited handheld PC (33 MHz, no less) to access
several multi-megabyte files.

For example (simplified, incomplete, and untested):

class Record
{
    std::string mString;

public:

    Record( const std::string& rec ) : mString(rec) {}

    std::string S1() const { ... } // extract S1 from the record
    std::string S2() const { ... }
    int N1() const { ... }
    int N2() const { ... }
};

class structured_file
{
    class memory_mapped_file
    {
        // use an OS-specific implementation

        const char* mBegin;

    public:

        memory_mapped_file( const std::string& name )
        : ...
        , mBegin( ... )
        {}

        const char* operator[]( size_t idx ) const
        {
            return mBegin + idx;
        }

        ...
    };

    memory_mapped_file mData;
    size_t mRecSize;

public:

    structured_file( const std::string& name, size_t rec_size )
    : mData(name), mRecSize(rec_size) {}

    Record operator[]( size_t idx ) const
    {
        const char* lBeg = mData[ mRecSize * idx ];

        return Record( std::string( lBeg, lBeg + mRecSize ) );
    }
};

int main()
{
    structured_file lData( "data.dat", 190 );

    Record r1 = lData[ 123];
    Record r2 = lData[2456];

    int n1n1 = r1.N1();
    std::string s2s2 = r2.S2();

    return 0;
}

Jeff Flinn
 
Larry I Smith

Simon said:
May not be. You may want to change the structure and make it contain
arrays of char instead of pointers to dynamically allocated arrays.

Then the construction will be a bit faster, you could simply drop the
'string' thing there. Also, if you're sure about the source of the
data, and their format, you could avoid constructing temporaries. Play
with making 'sFileData' look like

char s1[41]; // if it's a C string, reserve the room for the null char
char s2[41];
int one, two;

I know I am going to be told I am too difficult, but the reason why I
dynamically create the string is because they are almost never longer than 5
chars.
So by declaring s1[41] I know that I am wasting around 36 chars, (The sizes
are different, there could be a string of 40 chars).

I know that we are only talking about 36 chars here, but I load 100000's of
lines and the waste really seems unnecessary to me, (and I don't like
wasting memory).
It seems to defeat the object dynamic memory allocations.

Simon

Dynamic memory allocation of many small segments causes
extremely poor memory utilization.

malloc and new (new often uses malloc) get memory from
the operating system in pages (4k, 8k, etc). They use
part of the obtained memory to implement control structures
(for keeping track of allocated and freed/reusable chunks).
Each allocation also includes bookkeeping overhead (typically
8 bytes on a 32-bit OS), and normally no less than 16 bytes
is used per allocation - even if the user only asked for
one byte, malloc(1). So, in general, at least 16 bytes plus
a pointer in the malloc control structure (a linked list)
is allocated for each request.

Your program with the 100000 calls to allocate 5 bytes and
another 100000 calls to allocate 8 bytes will use AT LEAST
twice as much memory as you think - because of the hidden
extra memory used to keep track of everything.
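That back-of-the-envelope arithmetic can be written down directly (a sketch using the figures assumed above, an 8-byte header and a 16-byte minimum block, which vary by allocator and platform):

```cpp
// Approximate bytes the heap actually consumes for `nallocs`
// allocations of `payload` bytes each, using the assumed costs:
// an 8-byte bookkeeping header and a 16-byte minimum usable
// block per allocation.
unsigned long HeapCost(unsigned long nallocs, unsigned long payload)
{
    const unsigned long kOverhead = 8;   // per-block bookkeeping
    const unsigned long kMinBlock = 16;  // minimum usable block
    unsigned long usable = payload < kMinBlock ? kMinBlock : payload;
    return nallocs * (usable + kOverhead);
}
```

With these numbers, 100000 five-byte strings request 500000 bytes but consume roughly 2400000, nearly five times as much.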

These two articles explain it in detail (your OS may vary,
but the generalities apply):

http://www.cs.utk.edu/~plank/plank/classes/cs360/360/notes/Malloc2/lecture.html
http://www.cs.utk.edu/~plank/plank/classes/cs360/360/notes/Fragmentation/lecture.html

Regards,
Larry
 
Larry I Smith

simon said:
From my previous post...

If I have a structure,

struct sFileData
{
char*sSomeString1;
char*sSomeString2;
int iSomeNum1;
int iSomeNum2;
sFileData(){...};
~sFileData(){...};
sFileData(const sFileData&){...};
const sFileData operator=( const sFileData &s ){...}
};

I read the file as follows

FILE *f = fopen( szPath, "rb" );

int nLineSize = 190;
BYTE b[nLineSize+1];

fread( b, sizeof(BYTE), nLineSize, f );
int numofrecords = atoi( b ); // first line is num of records only,

// read the data itself.
while( fread( b, sizeof(BYTE), nLineSize, f ) == nLineSize )
{
// fill data
// The locations of each items is known
// sString1 = 0->39, with blank spaces filler after data
// sString2 = 40->79, with blank spaces filler after data
// iNum1 = 80->99, with blank spaces filler after data
// iNum2 = 100->end, with blank spaces filler after data
}

what would be the best way to fill the data into an array, (vector)?

Many thanks.

Simon.

You state that each line in the file (including the first one) is
190 bytes long: 'int nLineSize = 190;' Yet your data items are all
ascii, and they occupy the first 100+ bytes. Is the ascii data
followed by additional (possibly binary) data that fills out the
record to a length of 190 bytes? Does each 190 byte record include
a trailing newline (Windows style "\r\n" or non-Windows "\n")?

Since you open the file in binary mode ("rb"), we might infer
at least 4 things:

1) the file contains mixed ascii/binary data records;
each of which is 190 bytes long with NO delimiting
newlines (aka Fixed Block in IBM parlance).

2) the file contains mixed ascii/binary data records;
each of which is 190 bytes long INCLUDING a delimiting
newline (Windows style "\r\n" or non-Windows "\n").

3) the file contains ascii-only data records with
a fixed length of 190 bytes with NO delimiting
newlines.

4) the file contains ascii-only data records with
a fixed length of 190 bytes INCLUDING a delimiting
newline (Windows style "\r\n" or non-Windows "\n").

It will be much easier for us to suggest efficient coding
approaches if you would please describe the EXACT layout
of the 190 byte records - including what follows 'iNum2',
and whether or not each of the 190 byte records includes
a trailing newline.

I have some ideas, but knowing the complete layout of the
190 byte records is key to picking the best approach.

Regards,
Larry
 
Larry I Smith

Larry said:
simon said:
From my previous post...

If I have a structure,

struct sFileData
{
char*sSomeString1;
char*sSomeString2;
int iSomeNum1;
int iSomeNum2;
sFileData(){...};
~sFileData(){...};
sFileData(const sFileData&){...};
const sFileData operator=( const sFileData &s ){...}
};

I read the file as follows

FILE *f = fopen( szPath, "rb" );

int nLineSize = 190;
BYTE b[nLineSize+1];

fread( b, sizeof(BYTE), nLineSize, f );
int numofrecords = atoi( b ); // first line is num of records only,

// read the data itself.
while( fread( b, sizeof(BYTE), nLineSize, f ) == nLineSize )
{
// fill data
// The locations of each items is known
// sString1 = 0->39, with blank spaces filler after data
// sString2 = 40->79, with blank spaces filler after data
// iNum1 = 80->99, with blank spaces filler after data
// iNum2 = 100->end, with blank spaces filler after data
}

what would be the best way to fill the data into an array, (vector)?

Many thanks.

Simon.

You state that each line in the file (including the first one) is
190 bytes long: 'int nLineSize = 190;' Yet your data items are all
ascii, and they occupy the first 100+ bytes. Is the ascii data
followed by additional (possibly binary) data that fills out the
record to a length of 190 bytes? Does each 190 byte record include
a trailing newline (Windows style "\r\n" or non-Windows "\n")?

Since you open the file in binary mode ("rb"), we might infer
at least 4 things:

1) the file contains mixed ascii/binary data records;
each of which is 190 bytes long with NO delimiting
newlines (aka Fixed Block in IBM parlance).

2) the file contains mixed ascii/binary data records;
each of which is 190 bytes long INCLUDING a delimiting
newline (Windows style "\r\n" or non-Windows "\n").

3) the file contains ascii-only data records with
a fixed length of 190 bytes with NO delimiting
newlines.

4) the file contains ascii-only data records with
a fixed length of 190 bytes INCLUDING a delimiting
newline (Windows style "\r\n" or non-Windows "\n").

It will be much easier for us to suggest efficient coding
approaches if you would please describe the EXACT layout
of the 190 byte records - including what follows 'iNum2',
and whether or not each of the 190 byte records includes
a trailing newline.

I have some ideas, but knowing the complete layout of the
190 byte records is key to picking the best approach.

Regards,
Larry

Two more questions:

5) do you wish to have leading/trailing whitespace
stripped from the first 2 string fields before
they are put into the structure?

6) might the first 2 string fields contain embedded
whitespace (e.g. sSomeString1 could be "hello there")?

Regards,
Larry
 
simon

That's part of the problem: the line is 190 chars long + '\n',
but the only meaningful data to me is 0->110.

It is all ASCII + '\n'. I open it "rb" because that's what I usually do,
but it is a flat text file.

What follows is more text and numeric data (but all in ASCII).
5) do you wish to have leading/trailing whitespace
stripped from the first 2 string fields before
they are put into the structure?

Yes, but only the trailing spaces. The data is leftmost in its section.
6) might the first 2 string fields contain embedded
whitespace (e.g. sSomeString1 could be "hello there")?

Yes, if that makes it faster to load; the data is 'protected' and I use
functions to return the values.
The problem is the numbers: it would not be very efficient to call
something like atoi("1234") all the time.
Regards,
Larry

many thanks for your help.
Simon
 
simon

You state that each line in the file (including the first one) is
That's part of the problem, the line is 190 char long+'\n'
but the only meaningful data to me is 0->110

Sorry, I made a mistake: the file is Windows style, 190 + '\r\n'.

Simon
 
Larry I Smith

simon said:
Sorry, I made a mistake, the file is 'windows style' 190+ '\r\n'

Simon

Ok, so the lines are each 192 bytes long (including the \r\n).

If you use fread() to read the data, then fread needs to read
192 bytes - NOT 190. "\r\n" is not special to fread() - it reads
raw bytes. So, if you read only 190 bytes when each 'line'
is actually 192 bytes long, then the fields for all records
except the first one will each be off by 2 bytes from the previous
record, e.g. by the time you get to record 40, your data fields
will be off by 80 bytes from where you think they are. This will
cause your sFileData structs to NOT have the contents you expect,
and may be contributing to the terrible performance that you
are seeing.

I have written 3 small programs that I will post in a few
minutes. I wrote them using a 190 byte line length (including
the trailing "\r\n"). As soon as I change them to use 192
byte lines, I'll post them. They are:

simondat.c: to create a test input data file named "simon.dat"
with 100000 records for use by the other 2 programs.

simon.cpp: uses 'char *' with new/delete for the string
fields in sFileData.

simon2.cpp: uses std::string for the string fields in
sFileData.

On my PC (an old Gateway PII 450 MHz with 384 MB of RAM):

simon.cpp runs in 2.20 seconds and uses 5624KB of memory.

simon2.cpp runs in 2.22 seconds and uses 6272KB of memory.

Your mileage may vary. I'm running SuSE Linux v9.3 and
using the GCC "g++" compiler v3.3.5.

Regards,
Larry
 
Larry I Smith

Larry said:
Ok, so the lines are each 192 bytes long (including the \r\n).

If you use fread() to read the data, then fread needs to read
192 bytes - NOT 190. "\r\n" is not special to fread() - it reads
raw bytes. So, if you read only 190 bytes when each 'line'
is actually 192 bytes long, then the fields for all records
except the first one will each be off by 2 bytes from the previous
record, e.g. by the time you get to record 40, your data fields
will be off by 80 bytes from where you think they are. This will
cause your sFileData structs to NOT have the contents you expect,
and may be contributing to the terrible performance that you
are seeing.

I have written 3 small programs that I will post in a few
minutes. I wrote them using a 190 byte line length (including
the trailing "\r\n"). As soon as I change them to use 192
byte lines, I'll post them. They are:

simondat.c: to create a test input data file named "simon.dat"
with 100000 records for use by the other 2 programs.

simon.cpp: uses 'char *' with new/delete for the string
fields in sFileData.

simon2.cpp: uses std::string for the string fields in
sFileData.

On my pc (an old Gateway PII 450MHZ with 384MB of RAM):

simon.cpp runs in 2.20 seconds and uses 5624KB of memory.

simon2.cpp runs in 2.22 seconds and uses 6272KB of memory.

Your mileage may vary. I'm running SuSE Linux v9.3 and
using the GCC "g++" compiler v3.3.5.

Regards,
Larry

Here are the 3 programs:

----------------------------------------

/* simondat.c Builds an MS-Windows format data file 'simon.dat' for
* use as input to the various simon* test programs.
* To compile:
* MS Windows: cl simondat.c
* Linux: gcc -o simondat simondat.c
*
* The file created, "simon.dat", is:
* Composed of 192 byte fixed length records containing ascii text.
* A trailing MS-Windows newline (the 2 char string "\r\n")
* comprises the last 2 bytes of each of the 192 byte records.
* The first 192 byte record contains the count of remaining
* records in the file as the only field in that record.
* All subsequent records follow the printf() format:
* "%-40s%-40s%-20d%-20d%70s\r\n"
* the trailing 70 bytes are all blanks.
*/

#include <stdio.h>

int main()
{
    FILE * fp;
    unsigned int i;
    unsigned int recs = 100000;

    fp = fopen("simon.dat", "wb");

    fprintf(fp, "%-190u\r\n", recs); /* this MUST be 192 chars total */

    for (i = 0; i < recs; i++)
        fprintf(fp, "%-40s%-40s%-20d%-20d%70c\r\n",
                "Helo", "Goodbye", i + 1, recs - i, ' ');

    fclose(fp);

    return 0;
}

--------------------------------------------

// simon.cpp uses 'char *' for the string fields in sFileData.
// to compile:
// MS Windows: cl simon.cpp
// Linux: g++ -o simon simon.cpp
#include <vector>
#include <iostream>
#include <fstream>
#include <stdlib.h> // for atoi/atol
#include <string.h> // for strcpy
#include <time.h> // for clock


using namespace std;

struct sFileData
{
char * sSomeString1;
char * sSomeString2;
int iSomeNum1;
int iSomeNum2;

sFileData()
{
NullAll();
}

~sFileData()
{
CleanAll();
}

sFileData(const sFileData&sfd)
{
NullAll();
*this = sfd;
}

const sFileData& operator=( const sFileData &sfd )
{
if( this != &sfd)
{
CleanAll();
iSomeNum1 = sfd.iSomeNum1;
iSomeNum2 = sfd.iSomeNum2;

if( sfd.sSomeString1 )
{
sSomeString1 = new char[strlen(sfd.sSomeString1)+1];
strcpy( sSomeString1, sfd.sSomeString1 );
}

if( sfd.sSomeString2 )
{
sSomeString2 = new char[strlen(sfd.sSomeString2)+1];
strcpy( sSomeString2, sfd.sSomeString2 );
}
}

return *this;
}

void CleanAll()
{
if (sSomeString1)
{
delete [] sSomeString1;
sSomeString1 = 0;
}

if (sSomeString2)
{
delete [] sSomeString2;
sSomeString2 = 0;
}
}

void NullAll()
{
sSomeString1 = 0;
sSomeString2 = 0;
iSomeNum1 = 0;
iSomeNum2 = 0;
}

};

std::vector< sFileData, std::allocator<sFileData> > address_;

// removes leading/trailing whitespace chars from a buffer.
// buf[] does not have to be nul-terminated.
// the resulting content in buf[] is NOT nul-terminated,
// starts at buf[0], and comprises ONLY the number of bytes
// returned by this function - leftover 'garbage' may follow
// those valid bytes.
// leading whitespace is removed by moving the contents of
// buf[] to lower indexes
// e.g if buf[] contains " hello hi " on entry, it will
// contain "hello hi" on exit.
// returns the final trimmed length, zero if buf[] is all
// whitespace.
unsigned int Trim(char * buf, unsigned int sz)
{
const char *white = " \t\r\n";
unsigned int pos1, pos2, len;

// if invalid input
if (!buf || 0 == sz)
return 0;

// find first non-whitespace char in buf[]
for (pos1 = 0; pos1 < sz; pos1++)
if (NULL == strchr(white, buf[pos1]))
break;

// if buf[] is all whitespace
if (pos1 >= sz)
return 0;

// find last non-whitespace char in buf[]
for (pos2 = sz; pos2 > pos1; pos2--)
if (NULL == strchr(white, buf[pos2 - 1]))
break;

// buf[] length less any leading/trailing whitespace
len = pos2 - pos1;

// if leading whitespace, move buf[] contents 'left'
// to eliminate the leading whitespace
if (pos1 > 0)
memmove(buf, buf + pos1, len);

return len;
}

int main()
{
char c;
char buf[196];
clock_t cstart, cend;
double elapsed;
int reclen = 192;
unsigned long recs, i;

std::cerr << "check initial memory usage now, then" << std::endl;
std::cerr << "press any alpha key followed by Enter to start"
<< std::endl;
std::cin >> c;

cstart = clock();

std::ifstream in("simon.dat",
std::ios_base::in | std::ios_base::binary);

// read the file record count from the 1st field of the 1st record
if (in.read(buf, reclen))
{
recs = atol(buf);
}
else
{
std::cerr << "Unable to read record count from 1st record"
<<std::endl;

in.close();

return 1;
}

// read/process all the records in the file
for (i = 0; i < recs; ++i)
{
// if we read a 192 byte record into buf[]
if (in.read(buf, reclen))
{
unsigned int len;
sFileData sfd;

// trim lead/trail whitespace from 40 bytes at buf[0]
len = Trim(buf, 40);
if (len) // if it was not all whitespace
{
// dup the trimmed buf into sSomeString1
sfd.sSomeString1 = new char[len + 1];
memcpy(sfd.sSomeString1, buf, len);
sfd.sSomeString1[len] = '\0';
}

// trim lead/trail whitespace from 40 bytes at buf[40]
len = Trim(buf + 40, 40);
if (len) // if it was not all whitespace
{
// dup the trimmed buf into sSomeString2
sfd.sSomeString2 = new char[len + 1];
memcpy(sfd.sSomeString2, buf + 40, len);
sfd.sSomeString2[len] = '\0';
}

// assign the int values from the approp locs in buf[]
sfd.iSomeNum1 = atoi(buf + 80);
sfd.iSomeNum2 = atoi(buf + 100);

address_.push_back(sfd);
#if 0
// DEBUG buf[] parsing
std::cout << sfd.sSomeString1 << ", " << sfd.sSomeString2
<< ", " << sfd.iSomeNum1 << ", " << sfd.iSomeNum2
<< std::endl;
#endif
}
}

in.close();

cend = clock();
elapsed = cend - cstart;

std::cerr << "processed " << i << " records in "
<< elapsed / CLOCKS_PER_SEC << " seconds."
<< std::endl;

std::cerr << "check final memory usage now, then" << std::endl;
std::cerr << "press any alpha key followed by Enter to finish"
<< std::endl;
std::cin >> c;

return 0;
}

----------------------------------------------

// simon2.cpp uses std::string for the string fields in sFileData.
// to compile:
// MS Windows: cl simon2.cpp
// Linux: g++ -o simon2 simon2.cpp
#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <stdlib.h> // for atoi/atol
#include <time.h> // for clock

using namespace std;

struct sFileData
{
std::string sSomeString1;
std::string sSomeString2;
int iSomeNum1;
int iSomeNum2;

sFileData()
{
NullAll();
}

~sFileData()
{

}

sFileData(const sFileData&sfd)
{
NullAll();
*this = sfd;
}

const sFileData& operator=( const sFileData &sfd )
{
if( this != &sfd)
{
iSomeNum1 = sfd.iSomeNum1;
iSomeNum2 = sfd.iSomeNum2;
sSomeString1 = sfd.sSomeString1;
sSomeString2 = sfd.sSomeString2;
}

return *this;
}


void NullAll()
{
iSomeNum1 = 0;
iSomeNum2 = 0;
}

};

std::vector< sFileData, std::allocator<sFileData> > address_;


int main()
{
char c;
char buf[196];
clock_t cstart, cend;
double elapsed;
int reclen = 192;
unsigned long recs, i;
std::string::size_type pos1, pos2;

// whitespace chars to strip from file records
const char * white = " \t\r\n";

std::cerr << "check initial memory usage now, then" << std::endl;
std::cerr << "press any alpha key followed by Enter to start"
<< std::endl;
std::cin >> c;

cstart = clock();

std::ifstream in("simon.dat",
std::ios_base::in | std::ios_base::binary);

// read the file record count from the 1st field of the 1st record
if (in.read(buf, reclen))
{
recs = atol(buf);
}
else
{
std::cerr << "Unable to read record count from 1st record"
<<std::endl;

in.close();

return 1;
}

// read/process all the records in the file
for (i = 0; i < recs; ++i)
{
// if we read a 192 byte record
if (in.read(buf, reclen))
{
std::string str;
sFileData sfd;

// make a string from buf[0] thru buf[39]
str = std::string(buf, 40);

// find FIRST non-whitespace char in the string
pos1 = str.find_first_not_of(white);

// if the string is NOT all whitespace
if (pos1 != std::string::npos)
{
// find the LAST non-whitespace char in the string
pos2 = str.find_last_not_of(white);

// copy the inclusive range [pos1..pos2] from 'str'
// to 'sSomeString1'
sfd.sSomeString1 = str.substr(pos1, pos2 - pos1 + 1);
}

// make a string from buf[40] thru buf[79]
str = std::string(buf + 40, 40);

// find FIRST non-whitespace char in the string
pos1 = str.find_first_not_of(white);

// if the string is NOT all whitespace
if (pos1 != std::string::npos)
{
// find the LAST non-whitespace char in the string
pos2 = str.find_last_not_of(white);

// copy the inclusive range [pos1..pos2] from 'str'
// to 'sSomeString2'
sfd.sSomeString2 = str.substr(pos1, pos2 - pos1 + 1);
}

// assign the int values from the approp locs in buf[]
sfd.iSomeNum1 = atoi(buf + 80);
sfd.iSomeNum2 = atoi(buf + 100);

address_.push_back(sfd);
#if 0
// DEBUG buf[] parsing
std::cout << sfd.sSomeString1 << ", " << sfd.sSomeString2
<< ", " << sfd.iSomeNum1 << ", " << sfd.iSomeNum2
<< std::endl;
#endif
}
}

in.close();

cend = clock();
elapsed = cend - cstart;

std::cerr << "processed " << i << " records in "
<< elapsed / CLOCKS_PER_SEC << " seconds."
<< std::endl;

std::cerr << "check final memory usage now, then" << std::endl;
std::cerr << "press any alpha key followed by Enter to finish"
<< std::endl;
std::cin >> c;

return 0;
}
 
simon

Thanks for that, I get 1.24 sec and 6 MB.
I just need to check what the difference is with my code.
Here are the 3 programs:

Regards,
Larry

Thanks for that, this is great.
I wonder if my Trim(...) function was not part of the problem.

After profiling I noticed that delete [] (or even free(...)) takes around
50% of the whole time.

Maybe I should get rid of the dynamic allocation altogether.

Simon
 
