Large Data files and sizes - what are you doing about them?

Discussion in 'Perl Misc' started by Rich_Elswick, Mar 9, 2006.

  1. Rich_Elswick (Guest)

    Hi all,

    I am parsing large data sets (62 GB in one file). I can parse them out
    into smaller files fine with Perl, which is what we have to do anyway
    (i.e. hex data becomes an ASCII .csv file of different decoded
    variables). I am working with CAN data, for those that know about
    Controller Area Networks, collected by Vector CANalyzer.

    After they are parsed out, the largest data file (1 file becomes ~100
    smaller files) is about 2 GB as of right now, but who knows how large
    it could become in the future. I then use GD::Graph to parse through
    the data files and rapidly generate some .png files for review (I have
    issues with this as well and will post those questions some other
    time). I run this on the whole batch of 100 files, going through each
    file one at a time, using a batch program to call a separate Perl
    program for each GD::Graph run, because GD::Graph loads the entire
    data set into memory before graphing it. This limits me to using this
    method on data files smaller than ~20 MB, based on system memory. I
    suppose I could up the memory of the individual machine, but that
    1. costs money, 2. makes me request it from IT (not easy), and
    3. still doesn't work with a 2 GB file.

    I was wondering 2 things.

    1. Is there a better way of graphing this data, which uses less memory?
    2. What is everyone else out there using?

    Please, no comments about just sampling the data (say, once every 5
    lines) and graphing the sampled data; we have already considered this,
    and it may end up being our method of resolving the issue.

    Thanks,
    Rich Elswick
    Test Engineer
    Cobasys LLC
    http://www.cobasys.com
     
    Rich_Elswick, Mar 9, 2006
    #1

  2. Dr.Ruud (Guest)

    Dr.Ruud, Mar 9, 2006
    #2

  3. Xho (Guest)

    "Rich_Elswick" <> wrote:
    > [snip]
    >
    > 1. Is there a better way of graphing this data, which uses less memory?


    It seems to me that if you are trying to plot 2 GB worth of data, then
    at least one of two things is probably the case. Either most of the
    data points fall almost exactly on top of each other, and therefore you
    can get the same image by plotting less than all of them. Or the
    resulting image is a blob of partially or nearly overlapping symbols,
    which would convey little information other than blobbiness, and thus
    by plotting less than all of them you get a graph that is more
    informative than plotting all of them.

    Since you don't want to hear about sampling, I would suggest two
    alternatives which are related to sampling but aren't the same. One would
    be filtering, where you exclude points if you know that they are
    effectively on top of a previous, included, point. The other would be
    summarization: instead of taking every 500th point to plot, as in
    sampling, you take the mean of all 500 and plot that, or you take the
    min, max, and median of each group of 500 and plot those three values
    rather than all 500.
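
    The summarization pass above can be done in one pass over the series.
    A minimal Perl sketch (the function name and the chunk size of 500 are
    just illustrative, using min/mean/max rather than the median variant):

    ```perl
    use strict;
    use warnings;
    use List::Util qw(min max sum);

    # Reduce a long numeric series to one "min,mean,max" row per chunk of
    # $size values, so a plot of millions of points collapses to a few
    # thousand summary points. Names here are invented for the sketch.
    sub summarize_chunks {
        my ($size, @vals) = @_;
        my @rows;
        while (@vals) {
            my @chunk = splice @vals, 0, $size;
            push @rows, sprintf "%g,%g,%g",
                min(@chunk), sum(@chunk) / @chunk, max(@chunk);
        }
        return @rows;
    }
    ```

    For a 2 GB file you would of course read line by line and flush each
    chunk as it fills, rather than holding the whole series in an array,
    but the per-chunk reduction is the same.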

    > 2. What is everyone else out there using?


    I use GD::Graph with sampling and summarization techniques.

    Sometimes I use GD::Graph to set up my axes, labels, titles, and such
    on a dummy data set, but then use GD directly to draw the actual data
    points on the canvas provided by GD::Graph. This way all the data
    doesn't need to be in memory at once. However, you need to use the
    internal methods of GD::Graph to figure out what coordinates to supply
    to GD, so this is a lot of work and is fragile.

    I also use R and/or gnuplot to draw some types of images (e.g. contour
    plots) which summarize very large datasets without actually drawing
    each point. These are stand-alone programs, and I only use Perl to
    massage their inputs, but I think there are modules which will help
    interface Perl with both of them.
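
    For the gnuplot route, the Perl side can be as small as writing out a
    data file plus a matching command file and letting gnuplot do the
    drawing. Everything below (file names, terminal settings, the helper
    name) is a made-up sketch, not anything from the posts above:

    ```perl
    use strict;
    use warnings;

    # Write a tab-separated data file and a gnuplot script that plots it;
    # gnuplot, run separately (or via the commented system() call), then
    # renders the PNG without Perl ever holding the whole data set.
    sub write_gnuplot_job {
        my ($datafile, $plotfile, $pngfile, $rows) = @_;

        open my $dat, '>', $datafile or die "open $datafile: $!";
        print {$dat} join("\t", @$_), "\n" for @$rows;
        close $dat or die "close $datafile: $!";

        open my $plt, '>', $plotfile or die "open $plotfile: $!";
        print {$plt} <<"END";
set terminal png size 1024,768
set output "$pngfile"
plot "$datafile" using 1:2 with lines title "signal"
END
        close $plt or die "close $plotfile: $!";

        # Then, if gnuplot is installed:
        # system('gnuplot', $plotfile) == 0 or die "gnuplot failed";
    }
    ```

    Keeping the data in a plain file also means you can re-plot with
    different ranges or styles without re-running the Perl parser.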

    Xho

     
    , Mar 9, 2006
    #3
