dealing with large csv files

Discussion in 'C++' started by brzak, Nov 30, 2008.

  1. brzak

    brzak Guest

    hi

    I was hoping for a few pointers on how to best go about
    processing large csv files.

    The files:
    a typical file has 100K-500K records
    approx 150 chars per line (10 'fields')
    so file sizes of 15MB-75MB.

    The processing:
    summarise numerical fields based on conditions applied to other
    fields

    the resulting summary is a table;
    the column headers of which are the list of unique values of one
    field
    the row headers are decided upon by conditions on other fields
    (this may include lookups, exclusions, reclassifications)

    Taking into account the size of the files, and the number of operations
    required on each record...

    Which, if any, of these considerations do I need to take into account:

    -is the file read in line by line / or in one go?
    +if in one go, there would be issues with available memory?
    +if it's line by line, is there a significant difference in time
    taken to process? (i.e. from my limited personal experience with
    VBA, reading/writing a cell at a time in a spreadsheet is far
    slower than reading/writing in 'batches')
    +or would it be an idea to read a limited number in one go?
    e.g. deal with 20,000 at a time in memory
    i suppose this question demonstrates a lack of experience with C++
    but hey, that's why i'm posting in the learner's forum :)

    -however much of the file is read, is it worth writing a bespoke
    solution
    or looking for a parser/class that's been written for csv files?
    +perhaps there is a module that i can import?
    +since the csv files are *supposed* to be of a standard format,
    would there be much to gain in writing something specific to this - this
    would be done with the aim of reducing processing time

    -data types... should i read the value fields as floating point
    numbers (range approx. +/- 500000.00)
    +will using floating point data types save memory?


    As anyone reading would be able to tell, I'm still quite new to this
    language, and am missing some of the basics which I've had a bit of
    trouble locating solutions to.

    Any advice would be much appreciated!

    Brz
     
    brzak, Nov 30, 2008
    #1

  2. LR

    LR Guest

    brzak wrote:
    > hi
    >
    > I was hoping for a few pointers on how to best go about
    > processing large csv files.


    I don't think that you've given enough information about your particular
    problem, but I'll give it a try.

    >
    > The files:
    > a typical file has 100K-500K records
    > approx 150 chars per line (10 'fields')
    > so file sizes of 15MB-75MB.
    >
    > The processing:
    > summarise numerical fields based on conditions applied to other
    > fields
    >
    > the resulting summary is a table;
    > the column headers of which are the list of unique values of one
    > field
    > the row headers are decided upon by conditions on other fields
    > (this may include lookups, exclusions, reclassifications)


    Do you mean a hierarchical file, or some other structure?


    >
    > Taking into account the size of the files, and the number of operations
    > required on each record...
    >
    > Which, if any, of these considerations do I need to take into account:
    >
    > -is the file read in line by line / or in one go?


    I'd probably try to do it line by line first and get the processing
    right, and then if you need to, switch to all in one go.


    > +if in one go, there would be issues with available memory?


    There will be issues no matter how you write your code, but the question
    is whether you have enough memory to store the entire text, plus the
    space required for the variables you're going to manipulate.


    > +if it's line by line, is there a significant difference in time
    > taken to process? (i.e. from my limited personal experience with
    > VBA, reading/writing a cell at a time in a spreadsheet is far
    > slower than reading/writing in 'batches')


    Very system dependent. And "batch" size dependent. You'll have to try it
    and see.


    > +or would it be an idea to read a limited number in one go?
    > e.g. deal with 20,000 at a time in memory


    Depends on what your data is like and what you have to do with it.

    > i suppose this question demonstrates a lack of experience with C++
    > but hey, that's why i'm posting in the learner's forum :)
    >
    > -however much of the file is read, is it worth writing a bespoke
    > solution
    > or looking for a parser/class that's been written for csv files?


    Reading in csv files isn't difficult. You do mean comma separated
    values? What have you tried so far?


    > +perhaps there is a module that i can import?


    Have you looked at std::getline? You can use that not only to read a
    line at a time, but also, once you have a line in a std::string, to
    easily get values that are separated by commas using a
    std::istringstream. Unless you have quoted fields with commas in them,
    in which case you might be more interested in something that can deal
    with regular expressions. You might want to look at www.boost.org, or,
    if your compiler has support for it, TR1. I did a quick google(tm) and
    found this, but I don't know how good it is: http://www.johndcook.com/cpp_regex.html

    But I think that std::getline should be something to take a look at first.
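
    Something along these lines (a minimal sketch, assuming plain
    comma-separated fields with no quoted commas; the file name is made up):

        #include <fstream>
        #include <sstream>
        #include <string>
        #include <vector>

        int main()
        {
            std::ifstream in("data.csv");            // illustrative file name
            std::string line;
            while (std::getline(in, line)) {         // one record per line
                std::istringstream ss(line);
                std::vector<std::string> fields;
                std::string field;
                while (std::getline(ss, field, ',')) // split the line on commas
                    fields.push_back(field);
                // ... apply your conditions and summarise the fields here ...
            }
        }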

    > +since the csv files are *supposed* to be of a standard format,


    What standard are they following? And do you suspect they don't follow a
    standard?

    > would there be much to gain in writing something specific to this - this
    > would be done with the aim of reducing processing time


    Have you tried this already and found that your processing time is too long?

    >
    > -data types... should i read the value fields as floating point
    > numbers (range approx. +/- 500000.00)


    Probably that or double. I think double would be better. But as you
    point out...

    > +will using floating point data types save memory?


    Yes, well probably, but I suspect it will take more time to use float.
    System dependent. But I don't know. Best to benchmark if you're
    concerned. Also, I don't recall, but I think that sizeof(float) can be
    equal to sizeof(double), although I can't think of a system where it is.
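
    If you're curious, a two-line check (the sizes are implementation-defined;
    4 and 8 bytes are the common case):

        #include <iostream>

        int main()
        {
            std::cout << "sizeof(float)  = " << sizeof(float)  << '\n'
                      << "sizeof(double) = " << sizeof(double) << '\n';
        }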

    I think that most problems have a space time trade off. You may have to
    decide which one is more important to you.


    > As anyone reading would be able to tell, I'm still quite new to this
    > language, and am missing some of the basics which I've had a bit of
    > trouble locating solutions to.
    >
    > Any advice would be much appreciated!


    Sounds like you might need a good book. You might want to check out the
    reviews at www.accu.org.

    LR
     
    LR, Nov 30, 2008
    #2

  3. James Kanze

    James Kanze Guest

    On Nov 30, 3:04 am, brzak <> wrote:

    > I was hoping for a few pointers on how to best go about
    > processing large csv files.


    > The files:
    >     a typical file has 100K-500K records
    >     approx 150 chars per line (10 'fields')
    >     so file sizes of 15MB-75MB.


    > The processing:
    >     summarise numerical fields based on conditions applied to other
    > fields


    >     the resulting summary is a table;
    >       the column headers of which are the list of unique values of one
    > field
    >       the row headers are decided upon by conditions on other fields
    >         (this may include lookups, exclusions, reclassifications)


    > Taking into account the size of the files, and the number of operations
    > required on each record...


    > Which, if any, of these considerations do I need to take into account:


    >   -is the file read in line by line / or in one go?
    >     +if in one go, there would be issues with available memory?
    >     +if it's line by line, is there a significant difference in time
    >          taken to process? (i.e. from my limited personal experience with
    >          VBA, reading/writing a cell at a time in a spreadsheet is far
    >          slower than reading/writing in 'batches')
    >     +or would it be an idea to read a limited number in one go?
    >       e.g. deal with 20,000 at a time in memory
    >   i suppose this question demonstrates a lack of experience with C++
    >    but hey, that's why i'm posting in the learner's forum :)


    For such a small file, it probably doesn't matter. Reading one
    character at a time might matter, as might unit-buffered streams, but
    otherwise the buffering in ifstream should be largely adequate. If you
    do find that I/O is a bottleneck, you can try memory mapping the file,
    but there's no guarantee that that will improve anything.

    >   -however much of the file is read, is it worth writing a bespoke
    > solution
    >    or looking for a parser/class that's been written for csv files?
    >     +perhaps there is a  module that i can import?
    >     +since the csv files are *supposed* to be of a standard format,
    > would
    >          there be much to gain in writing something specific to this - this
    >          would be done with the aim of reducing processing time


    If you can find a generalized CSV parser, use it. I suspect,
    however, that CSV is so simple that most people just do it by
    hand; if you know up front which fields contain what types, it's
    a lot easier (and faster).
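
    As a rough sketch of the hand-rolled approach (the file name, the field
    positions and the choice of grouping fields are all illustrative
    assumptions, not something given in this thread):

        #include <cstdlib>
        #include <fstream>
        #include <iostream>
        #include <map>
        #include <sstream>
        #include <string>
        #include <vector>

        int main()
        {
            std::ifstream in("data.csv");
            std::string line;
            // summary[row key][column key] = running total
            std::map<std::string, std::map<std::string, double> > summary;

            while (std::getline(in, line)) {
                std::vector<std::string> fields;
                std::istringstream ss(line);
                std::string field;
                while (std::getline(ss, field, ','))   // naive split: no quoted commas
                    fields.push_back(field);
                if (fields.size() < 5)
                    continue;                          // skip malformed records

                const std::string& row = fields[0];    // e.g. a classification field
                const std::string& col = fields[1];    // field whose unique values form the columns
                double value = std::strtod(fields[4].c_str(), 0);
                summary[row][col] += value;
            }

            // Dump the totals; a real report would lay this out as a table.
            std::map<std::string, std::map<std::string, double> >::const_iterator r;
            for (r = summary.begin(); r != summary.end(); ++r) {
                std::map<std::string, double>::const_iterator c;
                for (c = r->second.begin(); c != r->second.end(); ++c)
                    std::cout << r->first << '\t' << c->first << '\t'
                              << c->second << '\n';
            }
        }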

    >   -data types... should i read the value fields as floating point
    >    numbers (range approx. +/- 500000.00)
    >     +will using floating point data types save memory?


    Compared to what? A float takes less space than a double, but again,
    we're talking about a fairly small data set, so it probably doesn't
    matter.

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
     
    James Kanze, Nov 30, 2008
    #3
  4. Erik Wikström

    On 2008-11-30 03:04, brzak wrote:
    > hi
    >
    > I was hoping for a few pointers on how to best go about
    > processing large csv files.
    >
    > The files:
    > a typical file has 100K-500K records
    > approx 150 chars per line (10 'fields')
    > so file sizes of 15MB-75MB.
    >
    > The processing:
    > summarise numerical fields based on conditions applied to other
    > fields
    >
    > the resulting summary is a table;
    > the column headers of which are the list of unique values of one
    > field
    > the row headers are decided upon by conditions on other fields
    > (this may include lookups, exclusions, reclassifications)
    >
    > Taking into account the size of the files, and the number of operations
    > required on each record...


    It all very much depends on what kind of operations you want to perform,
    and whether you can do the operations "on the fly" or whether you need
    access to the whole table before you begin.

    > Which, if any, of these considerations do I need to take into account:
    >
    > -is the file read in line by line / or in one go?


    If memory usage is important you read line by line; if performance is
    more important then you probably want to do it in one go. You might also
    want to consider memory mapping the file.
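
    In case memory mapping is of interest, a POSIX-only sketch (Windows would
    use CreateFileMapping/MapViewOfFile instead; the file name is made up and
    error handling is minimal):

        #include <fcntl.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>
        #include <iostream>

        int main()
        {
            int fd = open("data.csv", O_RDONLY);
            if (fd == -1)
                return 1;

            struct stat sb;
            if (fstat(fd, &sb) == -1 || sb.st_size == 0) {
                close(fd);
                return 1;
            }

            // Map the whole file read-only into the address space.
            void* p = mmap(0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p == MAP_FAILED) {
                close(fd);
                return 1;
            }
            const char* data = static_cast<const char*>(p);

            // Walk the mapped bytes; a real program would split each line
            // on commas here instead of just counting newlines.
            long lines = 0;
            for (off_t i = 0; i < sb.st_size; ++i)
                if (data[i] == '\n')
                    ++lines;
            std::cout << lines << " lines\n";

            munmap(p, sb.st_size);
            close(fd);
        }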

    > +if in one go, there would be issues with available memory?


    Depends on your target environment; 75MB is not very much on a modern
    desktop or server, but for other targets it might be.

    > +if it's line by line, is there a significant difference in time
    > taken to process? (i.e. from my limited personal experience with
    > VBA, reading/writing a cell at a time in a spreadsheet is far
    > slower than reading/writing in 'batches')


    Not necessarily, but I'd say probably.

    > +or would it be an idea to read a limited number in one go?
    > e.g. deal with 20,000 at a time in memory


    Since it's hard to know where the contents of a field start and end
    without some processing, it is probably not possible to read anything
    other than full lines (one or more) without risking getting only half
    the contents of a field.

    > i suppose this question demonstrates a lack of experience with C++
    > but hey, that's why i'm posting in the learner's forum :)
    >
    > -however much of the file is read, is it worth writing a bespoke
    > solution
    > or looking for a parser/class that's been written for csv files?


    It's really not that difficult to write it yourself, and if you do you
    can customise it to your needs. Just a tip though, try not to limit
    yourself by hardcoding the number of columns or something.

    > -data types... should i read the value fields as floating point
    > numbers (range approx. +/- 500000.00)
    > +will using floating point data types save memory?


    Float will probably save you memory; the question is whether it is worth
    the loss in precision.

    > As anyone reading would be able to tell, I'm still quite new to this
    > language, and am missing some of the basics which I've had a bit of
    > trouble locating solutions to.


    If performance is critical you should profile your solution to see where
    the bottlenecks are. You might also check out the difference in
    performance between C++ streams and the C I/O functions; some
    implementations of the C++ standard library have poor iostream performance.
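
    A very crude way to compare the two on your own data (a sketch using
    std::clock; a proper profiler and repeated runs would give more reliable
    numbers, and the file name is made up):

        #include <cstdio>
        #include <ctime>
        #include <fstream>
        #include <iostream>
        #include <string>

        int main()
        {
            std::clock_t t0 = std::clock();
            {
                std::ifstream in("data.csv");
                std::string line;
                while (std::getline(in, line)) { /* just read */ }
            }
            std::clock_t t1 = std::clock();
            {
                std::FILE* f = std::fopen("data.csv", "r");
                char buf[4096];
                while (f && std::fgets(buf, sizeof buf, f)) { /* just read */ }
                if (f)
                    std::fclose(f);
            }
            std::clock_t t2 = std::clock();

            std::cout << "iostreams: " << double(t1 - t0) / CLOCKS_PER_SEC << " s\n"
                      << "stdio:     " << double(t2 - t1) / CLOCKS_PER_SEC << " s\n";
        }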

    --
    Erik Wikström
     
    Erik Wikström, Nov 30, 2008
    #4
  5. Raymond Häb

    Raymond Häb Guest

    brzak schrieb:
    > I was hoping for a few pointers on how to best go about
    > processing large csv files.
    >
    > The files:
    > a typical file has 100K-500K records
    > approx 150 chars per line (10 'fields')
    > so file sizes of 15MB-75MB.
    > [...]


    I just want to add a hint about boost::spirit, boost::xpressive
    and the file_iterator

    http://www.boost.org/doc/libs/1_37_0/libs/spirit/classic/index.html
    http://www.boost.org/doc/libs/1_37_0/libs/spirit/classic/doc/file_iterator.html
    http://www.boost.org/doc/libs/1_37_0/doc/html/xpressive.html

    But maybe learning all this is "overkill" for such an easy task. But if
    you are frequently confronted with similar problems you can produce very
    efficient solutions with it (both for development time and runtime (but
    sadly not compile time)).


    Raymond
     
    Raymond Häb, Dec 1, 2008
    #5
  6. Boogie

    Boogie Guest

    On Dec 1, 4:45 am, Raymond Häb <> wrote:
    > I just want to add a hint to boost::spirit, boost::xpressive
    > and the fileiterator
    >
    > But maybe learning all this is "overkill" for such an easy task. But if
    > you are frequently confronted with similar problems you can produce very
    > efficient solutions with it (both for development time and runtime (but
    > sadly not compile time)).
    >
    > Raymond


    I think that would be overkill.
    I would suggest:

    Tokenizer for parsing (read line of text and then tokenize it):
    http://www.boost.org/doc/libs/1_37_0/libs/tokenizer/escaped_list_separator.htm

    Lexical cast for string to float (double) conversion:
    http://www.boost.org/doc/libs/1_37_0/libs/conversion/lexical_cast.htm

    Easy to use, and it doesn't require writing a lot of code ;)
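
    Roughly like this (a sketch assuming Boost is installed; the default
    escaped_list_separator handles quoted fields and embedded commas, and the
    file name and field choice are made up):

        #include <boost/lexical_cast.hpp>
        #include <boost/tokenizer.hpp>
        #include <fstream>
        #include <iostream>
        #include <string>
        #include <vector>

        int main()
        {
            std::ifstream in("data.csv");
            std::string line;
            typedef boost::tokenizer<boost::escaped_list_separator<char> > Tok;

            while (std::getline(in, line)) {
                Tok tok(line);   // default: '\\' escape, ',' separator, '"' quote
                std::vector<std::string> fields(tok.begin(), tok.end());
                if (fields.empty())
                    continue;

                // lexical_cast throws boost::bad_lexical_cast if the field
                // isn't a valid number.
                double value = boost::lexical_cast<double>(fields.back());
                std::cout << value << '\n';
            }
        }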

    Boogie
     
    Boogie, Dec 1, 2008
    #6
