Dealing with large CSV files


brzak

hi

I was hoping for a few pointers on how to best go about
processing large csv files.

The files:
a typical file has 100K-500K records
approx 150 chars per line (10 'fields')
so file sizes of 15MB-75MB.

The processing:
summarise numerical fields based on conditions applied to other
fields

the resulting summary is a table;
the column headers of which are the list of unique values of one
field
the row headers are decided upon by conditions on other fields
(this may include lookups, exclusions, reclassifications)

Taking into account the size of the files, and number of operations
required on each record...

Which, if any, of these considerations do I need to take into account:

-is the file read in line by line / or in one go?
+if in one go, there would be issues with available memory?
+if it's line by line, is there a significant difference in time
taken to process? (i.e. from my limited personal experience with
VBA, reading/writing a cell at a time in a spreadsheet is far
slower than reading/writing in 'batches')
+or would it be an idea to read a limited number in one go?
e.g. deal with 20,000 at a time in memory
I suppose this question demonstrates a lack of experience with C++
but hey, that's why I'm posting in the learner's forum :)

-however much of the file is read, is it worth writing a bespoke
solution, or looking for a parser/class that's already been written for
csv files?
+perhaps there is a module that I can import?
+since the csv files are *supposed* to be of a standard format,
would there be much to gain in writing something specific to this? This
would be done with the aim of reducing processing time

-data types... should I read the value fields as floating point
numbers (range approx. +/- 500000.00)
+will using floating point data types save memory?


As anyone reading would be able to tell, I'm still quite new to this
language, and am missing some of the basics which I've had a bit of
trouble locating solutions to.

Any advice would be much appreciated!

Brz
 

LR

brzak said:
hi

I was hoping for a few pointers on how to best go about
processing large csv files.

I don't think that you've given enough information about your particular
problem, but I'll give it a try.
The files:
a typical file has 100K-500K records
approx 150 chars per line (10 'fields')
so file sizes of 15MB-75MB.

The processing:
summarise numerical fields based on conditions applied to other
fields

the resulting summary is a table;
the column headers of which are the list of unique values of one
field
the row headers are decided upon by conditions on other fields
(this may include lookups, exclusions, reclassifications)

Do you mean a hierarchical file, or some other structure?

Taking into account the size of the files, and number of operations
required on each record...

Which, if any, of these considerations do I need to take into account:

-is the file read in line by line / or in one go?

I'd probably try to do it line by line first and get the processing
right, and then if you need to, switch to all in one go.

+if in one go, there would be issues with available memory?

There will be issues no matter how you write your code, but the question
is whether you have enough memory to store the entire text, plus the
space required for the variables you're going to manipulate.

+if it's line by line, is there a significant difference in time
taken to process? (i.e. from my limited personal experience with
VBA, reading/writing a cell at a time in a spreadsheet is far
slower than reading/writing in 'batches')

Very system dependent. And "batch" size dependent. You'll have to try it
and see.

+or would it be an idea to read a limited number in one go?
e.g. deal with 20,000 at a time in memory

Depends on what your data is like and what you have to do with it.
I suppose this question demonstrates a lack of experience with C++
but hey, that's why I'm posting in the learner's forum :)

-however much of the file is read, is it worth writing a bespoke
solution
or look for a parser/class that's been written for csv files?

Reading in CSV files isn't difficult. You do mean comma-separated
values? What have you tried so far?

+perhaps there is a module that i can import?

Have you looked at std::getline? You can use it not only to read a
line at a time; once you have a line in a std::string, you can also use
it with a std::istringstream to easily pull out the values that are
separated by commas. Unless you have quoted fields with commas in them,
in which case you might be more interested in something that can deal
with regular expressions. You might want to look at www.boost.org, or,
if your compiler has support for it, TR1. I did a quick google(tm) and
found this, but I don't know how good it is: http://www.johndcook.com/cpp_regex.html

But I think that std::getline should be the first thing to take a look at.
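To make that concrete, here is a minimal sketch of the std::getline
approach (the file name "data.csv" is just a placeholder, and it assumes
plain comma-separated fields with no quoting):

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    std::ifstream in("data.csv");               // hypothetical input file
    std::string line;
    while (std::getline(in, line)) {            // one record per line
        std::istringstream ss(line);
        std::vector<std::string> fields;
        std::string field;
        while (std::getline(ss, field, ','))    // split the line on commas
            fields.push_back(field);
        // fields[0] .. fields[9] now hold the record's columns
    }
    return 0;
}

It won't cope with quoted fields containing commas; for those you'd need
something smarter, as noted above.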
+since the csv files are *supposed* to be of a standard format,

What standard are they following? And do you suspect they don't follow a
standard?
would there be much to gain in writing something specific to this? This
would be done with the aim of reducing processing time

Have you tried this already and found that your processing time is too long?
-data types... should I read the value fields as floating point
numbers (range approx. +/- 500000.00)

Probably that or double. I think double would be better. But as you
point out...
+will using floating point data types save memory?

Yes, well probably, but I suspect it will take more time to use float.
System dependent. But I don't know. Best to benchmark if you're
concerned. Also, I don't recall, but I think that sizeof(float) can be
equal to sizeof(double), although I can't think of a system where it is.

I think that most problems have a space time trade off. You may have to
decide which one is more important to you.

As anyone reading would be able to tell, I'm still quite new to this
language, and am missing some of the basics which I've had a bit of
trouble locating solutions to.

Any advice would be much appreciated!

Sounds like you might need a good book. You might want to check out the
reviews at www.accu.org.

LR
 

James Kanze

I was hoping for a few pointers on how to best go about
processing large csv files.
The files:
    a typical file has 100K-500K records
    approx 150 chars per line (10 'fields')
    so file sizes of 15MB-75MB.
The processing:
    summarise numerical fields based on conditions applied to other
fields
    the resulting summary is a table;
      the column headers of which are the list of unique values of one
field
      the row headers are decided upon by conditions on other fields
        (this may include lookups, exclusions, reclassifications)
Taking into account the size of the files, and number of operations
required on each record...
Which, if any, of these considerations do I need to take into account:
  -is the file read in line by line / or in one go?
    +if in one go, there would be issues with available memory?
    +if it's line by line, is there a significant difference in time
         taken to process? (i.e. from my limited personal experience with
         VBA, reading/writing a cell at a time in a spreadsheet is far
         slower than reading/writing in 'batches')
    +or would it be an idea to read a limited number in one go?
      e.g. deal with 20,000 at a time in memory
  I suppose this question demonstrates a lack of experience with C++
   but hey, that's why I'm posting in the learner's forum :)

For such a small file, it probably doesn't matter. Reading one
character at a time might, or using a unit-buffered stream, but
otherwise the buffering in ifstream should be largely adequate. If you
do find that I/O is a bottleneck, you can try memory mapping the file,
but there's no guarantee that that will improve anything.
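If you do want to experiment with memory mapping, something along these
lines is a rough, POSIX-only sketch (mmap is not standard C++; on
Windows you'd use MapViewOfFile instead, and whether it beats a buffered
ifstream is something you'd have to measure):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main()
{
    int fd = open("data.csv", O_RDONLY);        // hypothetical file name
    if (fd < 0) return 1;
    struct stat sb;
    if (fstat(fd, &sb) < 0) return 1;
    void* p = mmap(0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;
    const char* data = static_cast<const char*>(p);
    // data[0] .. data[sb.st_size - 1] is the whole file; scan it for
    // '\n' and ',' just as you would scan a buffer already in memory.
    munmap(p, sb.st_size);
    close(fd);
    return 0;
}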
  -however much of the file is read, is it worth writing a bespoke
solution
   or look for a parser/class that's been written for csv files?
    +perhaps there is a module that I can import?
    +since the csv files are *supposed* to be of a standard format,
would
         there be much to gain in writing something specific to this - this
         would be done with the aim of reducing processing time

If you can find a generalized CSV parser, use it. I suspect,
however, that CSV is so simple that most people just do it by
hand; if you know up front which fields contain what types, it's
a lot easier (and faster).
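For example, if you happen to know that (say) the fourth field is always
the numeric value you want, a hand-written extraction can be as simple
as this sketch (the field index and its meaning are made up):

#include <string>
#include <cstdlib>

// Return field number 'wanted' (0-based) of a comma separated line,
// converted to double. No quoting support; malformed lines yield 0.0.
double value_of_field(const std::string& line, int wanted)
{
    std::string::size_type begin = 0;
    for (int i = 0; i < wanted; ++i) {          // skip the earlier fields
        begin = line.find(',', begin);
        if (begin == std::string::npos) return 0.0;
        ++begin;
    }
    std::string::size_type end = line.find(',', begin);
    std::string field = line.substr(begin,
        end == std::string::npos ? std::string::npos : end - begin);
    return std::strtod(field.c_str(), 0);       // text to double
}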
  -data types... should I read the value fields as floating point
   numbers (range approx. +/- 500000.00)
    +will using floating point data types save memory?

Compared to what? A float takes less space than a double,
but again, we're talking about a fairly small data set, so it
probably doesn't matter.
 

Erik Wikström

hi

I was hoping for a few pointers on how to best go about
processing large csv files.

The files:
a typical file has 100K-500K records
approx 150 chars per line (10 'fields')
so file sizes of 15MB-75MB.

The processing:
summarise numerical fields based on conditions applied to other
fields

the resulting summary is a table;
the column headers of which are the list of unique values of one
field
the row headers are decided upon by conditions on other fields
(this may include lookups, exclusions, reclassifications)

Taking into account the size of the files, and number of operations
required on each record...

It all very much depends on what kind of operations you want to perform,
and whether you can do them "on the fly" or whether you need access to
the whole table before you begin.
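If the summary can be built on the fly, one pass over the file with a
std::map keyed on the grouping field is often enough. Here is a minimal
sketch; the file name and the field positions (key in column 0, value in
column 4) are made up:

#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#include <map>
#include <cstdlib>
#include <iostream>

int main()
{
    std::ifstream in("data.csv");                       // hypothetical file
    std::map<std::string, double> totals;               // the summary table
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        std::vector<std::string> f;
        std::string field;
        while (std::getline(ss, field, ','))
            f.push_back(field);
        if (f.size() < 5) continue;                     // skip malformed rows
        totals[f[0]] += std::strtod(f[4].c_str(), 0);   // sum value by key
    }
    for (std::map<std::string, double>::const_iterator it = totals.begin();
         it != totals.end(); ++it)
        std::cout << it->first << "\t" << it->second << "\n";
    return 0;
}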
Which, if any, of these considerations do I need to take into account:

-is the file read in line by line / or in one go?

If memory usage is important, read line by line; if performance is
more important, then you probably want to do it in one go. You might
also want to consider memory mapping the file.
+if in one go, there would be issues with available memory?

Depends on your target environment, 75MB is not very much in a modern
desktop or server, but for other targets it might be.
+if it's line by line, is there a significant difference in time
taken to process? (i.e. from my limited personal experience with
VBA, reading/writing a cell at a time in a spreadsheet is far
slower than reading/writing in 'batches')

Not necessarily, but I'd say probably.
+or would it be an idea to read a limited number in one go?
e.g. deal with 20,000 at a time in memory

Since it's hard to know where the content of a field starts and ends
without some processing, it is probably not possible to read anything
other than full lines (one or more) without risking getting only half
the contents of a field.
I suppose this question demonstrates a lack of experience with C++
but hey, that's why I'm posting in the learner's forum :)

-however much of the file is read, is it worth writing a bespoke
solution
or look for a parser/class that's been written for csv files?

It's really not that difficult to write it yourself, and if you do you
can customise it to your needs. Just a tip though, try not to limit
yourself by hardcoding the number of columns or something.
-data types... should i read the value fields as floating point
numbers (range approx. +/- 500000.00)
+will using floating point data types save memory?

Float will probably save you memory; the question is whether it is worth
the loss in precision.
As anyone reading would be able to tell, I'm still quite new to this
language, and am missing some of the basics which I've had a bit of
trouble locating solutions to.

If performance is critical you should profile your solution to see where
the bottlenecks are. You might also check out the difference in
performance between C++ streams and the C I/O functions; some
implementations of the C++ standard library have poor performance.
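A crude sketch of that kind of benchmark is below; the file name and
buffer size are arbitrary, and the results will be very implementation
dependent, so measure on your own system:

#include <fstream>
#include <string>
#include <cstdio>
#include <ctime>
#include <iostream>

int main()
{
    std::clock_t t0 = std::clock();
    {
        std::ifstream in("data.csv");               // pass 1: iostreams
        std::string line;
        long n = 0;
        while (std::getline(in, line)) ++n;
        std::cout << "iostream lines: " << n << "\n";
    }
    std::clock_t t1 = std::clock();
    {
        std::FILE* f = std::fopen("data.csv", "r"); // pass 2: C stdio
        char buf[4096];
        long n = 0;
        while (f && std::fgets(buf, sizeof buf, f)) ++n;
        if (f) std::fclose(f);
        std::cout << "stdio lines: " << n << "\n";
    }
    std::clock_t t2 = std::clock();
    std::cout << "iostream: " << double(t1 - t0) / CLOCKS_PER_SEC << "s, "
              << "stdio: " << double(t2 - t1) / CLOCKS_PER_SEC << "s\n";
    return 0;
}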
 

Raymond Häb

brzak said:
I was hoping for a few pointers on how to best go about
processing large csv files.

The files:
a typical file has 100K-500K records
approx 150 chars per line (10 'fields')
so file sizes of 15MB-75MB.
[...]

I just want to add a pointer to boost::spirit, boost::xpressive
and the file_iterator:

http://www.boost.org/doc/libs/1_37_0/libs/spirit/classic/index.html
http://www.boost.org/doc/libs/1_37_0/libs/spirit/classic/doc/file_iterator.html
http://www.boost.org/doc/libs/1_37_0/doc/html/xpressive.html

Maybe learning all this is "overkill" for such an easy task, but if
you are frequently confronted with similar problems you can produce very
efficient solutions with it (both in development time and runtime, though
sadly not compile time).


Raymond
 

Boogie

I just want to add a pointer to boost::spirit, boost::xpressive
and the file_iterator

Maybe learning all this is "overkill" for such an easy task, but if
you are frequently confronted with similar problems you can produce very
efficient solutions with it (both in development time and runtime, though
sadly not compile time).

Raymond

I think that would be overkill.
I would suggest:

Tokenizer for parsing (read line of text and then tokenize it):
http://www.boost.org/doc/libs/1_37_0/libs/tokenizer/escaped_list_separator.htm

Lexical cast for string to float (double) conversion:
http://www.boost.org/doc/libs/1_37_0/libs/conversion/lexical_cast.htm

Easy to use, and it doesn't require writing a lot of code ;)
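Something like this minimal sketch, assuming Boost is installed (the
example line and the choice of field are made up):

#include <boost/tokenizer.hpp>
#include <boost/lexical_cast.hpp>
#include <string>
#include <vector>
#include <iostream>

int main()
{
    std::string line = "widgets,\"Smith, J\",1234.56";
    typedef boost::tokenizer<boost::escaped_list_separator<char> > Tok;
    Tok tok(line);                                  // handles quoted commas
    std::vector<std::string> fields(tok.begin(), tok.end());
    double value = boost::lexical_cast<double>(fields[2]);
    std::cout << fields[0] << ": " << value << "\n";
    return 0;
}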

Boogie
 
