Large Data files and sizes - what are you doing about them?


Rich_Elswick

Hi all,

I am parsing large data sets (62 GB in one file). I can parse them out
into smaller files fine with Perl, which is what we have to do anyway
(i.e. the hex data becomes ASCII .csv files of different decoded
variables). I am working with CAN data, for those who know about
Controller Area Networks, collected by Vector CANalyzer.
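For reference, the splitting pass looks roughly like this (a simplified
sketch; the real CANalyzer log has more fields than the hypothetical
"timestamp id byte byte ..." layout assumed here):

#!/usr/bin/perl
use strict;
use warnings;

# Read the big log one line at a time and append each record to a
# per-message-ID .csv, so memory stays flat regardless of input size.
my %out;    # message ID => open output filehandle

while (my $line = <STDIN>) {
    chomp $line;
    my ($time, $id, @bytes) = split ' ', $line;
    next unless defined $id;

    # Open the output file for this ID on first sight, then reuse it.
    $out{$id} ||= do {
        open my $fh, '>', "decoded_$id.csv" or die "decoded_$id.csv: $!";
        $fh;
    };
    # Decode the hex payload bytes to decimal for the CSV row.
    print { $out{$id} } join(',', $time, map { hex } @bytes), "\n";
}
close $_ for values %out;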

After they are parsed out (1 file becomes ~100 smaller files), the
largest data file is about 2 GB as of right now, but who knows how
large it could become in the future. I then use GD::Graph to parse
through the data files and rapidly generate some .png files for review
(I have issues with this as well and will post those questions some
other time). I run this on the whole batch of 100 files, going through
each file one at a time, using a batch program to call a separate Perl
program for each GD::Graph run, because GD::Graph loads the entire data
set into memory before graphing it. This limits me to using this method
on data files smaller than ~20 MB, based on system memory. I suppose I
could up the memory on the individual machine, but that 1. costs money,
2. means requesting it from IT (not easy), and 3. still doesn't work
with a 2 GB file.

I was wondering 2 things.

1. Is there a better way of graphing this data, which uses less memory?
2. What is everyone else out there using?

Please, no comments about just sampling the data (once every 5 lines or
something like that) and graphing the sampled data; we have already
considered this, and it may end up being our method of resolving the
issue.

Thanks,
Rich Elswick
Test Engineer
Cobasys LLC
http://www.cobasys.com
 

xhoster

Rich_Elswick said:
[snip - post quoted in full above]

I was wondering 2 things.

1. Is there a better way of graphing this data, which uses less memory?

It seems to me that if you are trying to plot 2 GB worth of data, then
at least one of two things is probably the case. Either most of the
data points fall almost exactly on top of each other, and therefore you
can get the same image by plotting fewer of them; or the resulting
image is a blob of partially or nearly overlapping symbols, which would
convey little information other than blobbiness, and thus by plotting
fewer of them you get a graph that is more informative than plotting
all of them.

Since you don't want to hear about sampling, I would suggest two
alternatives which are related to sampling but aren't the same. One
would be filtering, where you exclude points if you know they are
effectively on top of a previous, included point. The other would be
summarization: instead of taking every 500th point to plot, as in
sampling, you take the mean of all 500 and plot that, or you take the
min, max, and median of each group of 500 and plot those 3 values
rather than all 500.
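For example, a streaming pass like this (just a sketch; it assumes
"time,value" CSV rows on STDIN with no header, and buckets of 500
points) emits the min, median, and max of each bucket, so the file you
hand to GD::Graph is a small fraction of the original and memory never
holds more than one bucket:

#!/usr/bin/perl
use strict;
use warnings;

# Bucket summarization sketch: for each group of $N points, emit three
# rows (min, median, max), each stamped with the bucket's first
# timestamp. Memory use is bounded by a single bucket.
my $N = 500;
my (@vals, $t0);

while (<STDIN>) {
    chomp;
    my ($t, $v) = split /,/;
    $t0 = $t unless @vals;
    push @vals, $v;
    emit_bucket() if @vals == $N;
}
emit_bucket() if @vals;    # flush the final partial bucket

sub emit_bucket {
    my @s = sort { $a <=> $b } @vals;
    print "$t0,$s[0]\n";                 # min
    print "$t0,$s[ int($#s / 2) ]\n";    # median
    print "$t0,$s[-1]\n";                # max
    @vals = ();
}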
2. What is everyone else out there using?

I use GD::Graph with sampling and summarization techniques.

Sometimes I use GD::Graph to set up my axes, labels, titles, and such
on a dummy data set, but then use GD directly to draw the actual data
points on the canvas provided by GD::Graph. This way, all the data
doesn't need to be in memory at once. However, you need to use the
internal methods of GD::Graph to figure out what coordinates to supply
to GD, so this is a lot of work and is fragile.
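A rough sketch of that trick follows. Caveats: val_to_pixel() is an
internal, undocumented GD::Graph method, so this can break between
versions, and the point count, y range, and "time,value" input format
are all assumptions for illustration.

#!/usr/bin/perl
use strict;
use warnings;
use GD::Graph::lines;

# Assumed for this sketch: the real file has at most $npoints rows and
# its values fall in 0..100; adjust both for real data.
my $npoints = 1000;

my $graph = GD::Graph::lines->new(800, 600);
$graph->set(
    x_label      => 'sample',
    y_label      => 'value',
    y_min_value  => 0,
    y_max_value  => 100,
    x_label_skip => 100,
) or die $graph->error;

# Plot a dummy series that lies flat on the x axis, just so GD::Graph
# lays out the axes, labels, and plot area for us.
my @dummy_x = (1 .. $npoints);
my @dummy_y = (0) x $npoints;
my $gd = $graph->plot([ \@dummy_x, \@dummy_y ]) or die $graph->error;

# Now stream the real data and draw it point by point with GD itself,
# so the full data set is never in memory.
my $red = $gd->colorAllocate(255, 0, 0);
my $i   = 0;
while (my $line = <STDIN>) {
    chomp $line;
    my (undef, $v) = split /,/, $line;    # "time,value" rows assumed
    # For GD::Graph's axestype charts the first argument is the point's
    # position along the x axis, not a data value.
    my ($xp, $yp) = $graph->val_to_pixel(++$i, $v, 1);
    $gd->setPixel($xp, $yp, $red);
}

open my $png, '>', 'graph.png' or die "graph.png: $!";
binmode $png;
print {$png} $gd->png;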

I also use R and/or gnuplot to draw some types of images (e.g. contour
plots) which summarize very large data sets without actually drawing
each point. These are stand-alone programs, and I only use Perl to
massage their inputs, but I think there are modules which will help
interface Perl with both of them.
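For gnuplot, a plain pipe is often all the interface needed; this
sketch (which assumes gnuplot is on the PATH and "time,value" CSV rows
arrive on STDIN) just forwards the rows, so Perl never holds the data:

#!/usr/bin/perl
use strict;
use warnings;

# Drive gnuplot through a pipe; '-' tells gnuplot to read the data
# inline, and the lone "e" line marks the end of that inline data.
open my $gp, '|-', 'gnuplot' or die "can't run gnuplot: $!";
print {$gp} <<'EOT';
set terminal png size 800,600
set output 'graph.png'
set datafile separator ","
plot '-' using 1:2 with dots title 'signal'
EOT
print {$gp} $_ while <STDIN>;    # forward the CSV rows verbatim
print {$gp} "e\n";
close $gp or die "gnuplot exited with an error: $!";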

Xho
 
