Aggregating/Sorting Large files

Worky Workerson

I have many large files (100M to 1GB) that are stored in CSV format,
and I would like to efficiently 1) sort each file individually based on
the first 10 "columns" of the CSV file and 2) aggregate all of the
files into a single huge file, merging columns 11-13 together based on
a little bit of logic (i.e. updating count fields and first/last seen
timestamps).

Does anyone know of the best way to go about doing this or a couple of
good modules to look at? I was looking at File::Sort on CPAN and it
looks like it might be able to efficiently sort each file, and then it
would be up to me to aggregate them, but I was kinda hoping that I
could save some processing and do them both at the same time.

Here are a couple of possibly tricky things that I am worried about:
- Number of open file descriptors: I will sometimes want to sort/merge
thousands of files.
- Memory: the files are too big to just snarf up the whole thing into
memory.

Basically, this stuff comes from a variety of different sources and I
output the normalized CSV files, which I eventually plan on putting
into a database (aggregated). I've written some functions in the
database to merge newly inserted records; however, I am pretty sure
that, if I first optimize/sort/aggregate the files, I can get much
better performance. Since I also generate the initial CSV, there is some
opportunity for me to sort it at that time, but I can't do it all
in memory as the input and output files are too large, so I was
thinking that a 3-stage process was the best way to go:

1) Convert input format into normalized CSV format
2) Sort each CSV file
3) Aggregate many CSV files together

Data Format:

col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,date_first_seen,date_last_seen,number_seen

Any insight/pointers would be greatly appreciated.

Thanks!
 
A. Sinan Unur

Worky Workerson said:
I have many large files (100M to 1GB) that are stored in CSV format,
and I would like to efficiently 1) sort each file individually based
on the first 10 "columns" of the CSV file and 2) aggregate all of the
files into a single huge file, merging columns 11-13 together based on
a little bit of logic (i.e. updating count fields and first/last seen
timestamps).

Does anyone know of the best way to go about doing this or a couple of
good modules to look at?
....

Basically, this stuff comes from a variety of different sources and I
output the normalized CSV files, which I eventually plan on putting
into a database (aggregated). I've written some functions in the
database to merge newly inserted records, however I am pretty sure
that, if I first optimize/sort/aggregate the files, I can get much
better performance.
....

col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,date_first_seen,date_last_seen,number_seen


Since all the files have the same fields, this would be simple to deal
with by using a proper database application.

I would not waste time trying to massage these files before inserting
data into the database.

Sinan

--
A. Sinan Unur

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
 
Worky Workerson

A. Sinan Unur said:
Since all the files have the same fields, this would be simple to deal
with by using a proper database application.

Yes, the CSV files that I produce almost exactly match my DB schema.
And I do have the aggregation implemented as INSERT triggers on the DB.
However, the INSERT trigger does the following for each row:

1. Figure out which partition table to store the data in (based on IP)
2. Attempt to update the row that matches the first 11 columns.
3. If the update fails (i.e. no such row), insert the row.

If there are indexes on the correct columns, it is reasonably fast as
well as being clean and reasonably elegant. The problem is that it's
not fast enough for the amounts of data that I am looking at (hundreds
of GB). If the files are sorted and aggregated before they go into the
DB, I can remove the trigger completely, so I get *much* better insert
performance and I can afford to reload the whole DB instead of just
loading incremental deltas through the trigger. Or at least I can load
incremental sorted, aggregated deltas, which would also reduce my
runtime.
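
(To be concrete, the load itself would then just be a bulk COPY driven
from Perl, something like this sketch, where the database, table, and
file names are all made up:)

# Sketch only: bulk-load an already aggregated CSV file via psql's \copy.
# "flows", "flow_agg", and "aggregated.csv" are placeholder names.
system('psql', '-d', 'flows', '-c',
       q{\copy flow_agg from 'aggregated.csv' with csv}) == 0
    or die "psql \\copy failed: $?";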

Another (side) reason to aggregate in application space is that I can
decrease the size of the raw data that I need to store. And yes, I do
have to store it all ... business requirement.

As an aside, I'm using PostgreSQL, but that shouldn't matter much, as
I'm looking for a perl solution.
 
xhoster

Worky Workerson said:
I have many large files (100M to 1GB) that are stored in CSV format,
and I would like to efficiently 1) sort each file individually based on
the first 10 "columns" of the CSV file and

If the CSV files cannot have embedded quoted or escaped commas, I'd just
use the system sort on each file. In fact, even if they could have
embedded quoted or escaped commas, you could probably still use the system
sort, as it seems like you mostly want to group equal keys together and the
exact order isn't all that important. You may need to ensure that the locale
is set appropriately, which in my experience means:

setenv LC_ALL C
unsetenv LANG
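
For one file, that boils down to something like the following untested
sketch (it assumes a GNU/POSIX sort on the PATH, no embedded commas, and
takes the input and output file names as arguments):

#!/usr/bin/perl
use strict;
use warnings;

# Untested sketch: sort one CSV file on its first 10 comma-separated
# fields using the system sort. Assumes no embedded commas/quotes.
die "usage: $0 in.csv out.csv\n" unless @ARGV == 2;
my ($in, $out) = @ARGV;

$ENV{LC_ALL} = 'C';          # byte-wise comparisons are much faster
delete $ENV{LANG};

system('sort', '-t,', '-k1,10', '-o', $out, $in) == 0
    or die "sort failed: $?";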

2) aggregate all of the
files into a single huge file,

Again, system sort most likely, depending on how exactly the next step
is to be done.

merging columns 11-13 together based on
a little bit of logic (i.e. updating count fields and first/last seen
timestamps).

Er, um, well, just do it, I guess.

Does anyone know of the best way to go about doing this or a couple of
good modules to look at? I was looking at File::Sort on CPAN and it
looks like it might be able to efficiently sort each file, and then it
would be up to me to aggregate them, but I was kinda hoping that I
could save some processing and do them both at the same time.

Since your aggregation needs are somewhat vaguely specified, I don't see
how anyone could recommend a module specifically to help with it. Besides
which, segregation of duties is generally a *good* thing. What advantage
are you hoping to get by combining a perfectly generic sort with a highly
customized aggregation?

Here are a couple of possibly tricky things that I am worried about:
-Number of open file descriptors - I will sometimes want to sort/merge
thousands of files.

Again, depends on the details of your aggregation method. If you need
to maintain separate streams, that will be a problem. But if you don't
you could sort-merge 10 files at a time from 1000 files down to 100, then
merge-only 10 at a time (out of the new 100) down to 10, then merge-only
those 10 down to 1, which you pipe into perl. You would never need more
than 11 file handles open at once for your data files.
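
A rough, untested sketch of that funnel (it assumes every input file was
already sorted with the same key and LC_ALL=C, and the intermediate
file names are made up):

#!/usr/bin/perl
use strict;
use warnings;

# Untested sketch: repeatedly merge already-sorted CSV files 10 at a
# time with "sort -m" until a single file remains. Intermediate files
# are left on disk; clean them up as needed.
$ENV{LC_ALL} = 'C';

my @files = @ARGV;                   # the already-sorted CSV files
my $pass  = 0;
while (@files > 1) {
    $pass++;
    my @next;
    my $n = 0;
    while (my @batch = splice(@files, 0, 10)) {
        my $out = sprintf 'merge_pass%d_%03d.csv', $pass, ++$n;
        system('sort', '-m', '-t,', '-k1,10', '-o', $out, @batch) == 0
            or die "sort -m failed: $?";
        push @next, $out;
    }
    @files = @next;
}
print "final merged file: $files[0]\n";

The last pass could just as well be piped into the aggregation step
instead of going to a file.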
-Memory - the files are too big to just snarf up the whole thing into
memory

Sure.

....
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,date_first_seen,date_last_seen,number_seen

So then is the aggregation nothing more than, for each set of equal keys,
keeping the least date_first_seen, the greatest date_last_seen, and the sum
of number_seen? If so, then you could do the aggregation while doing the
merging in the funnel method I outlined above, decreasing overall IO.
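
In that case the final merge can be piped straight into the aggregation,
roughly like this untested sketch (it assumes no embedded commas,
timestamps that compare correctly as strings such as ISO-8601, and
placeholder file names):

sort -m -t, -k1,10 sorted_*.csv | perl aggregate.pl > aggregated.csv

#!/usr/bin/perl
# aggregate.pl (sketch): collapse runs of rows with equal keys, keeping
# the least date_first_seen, greatest date_last_seen, summed number_seen.
use strict;
use warnings;

my ($key, $first, $last, $count);
while (<STDIN>) {
    chomp;
    my @f = split /,/, $_, -1;
    my $k = join ',', @f[0 .. 9];            # first 10 columns = key
    if (defined $key && $k eq $key) {
        $first  = $f[10] if $f[10] lt $first;
        $last   = $f[11] if $f[11] gt $last;
        $count += $f[12];
    }
    else {
        print "$key,$first,$last,$count\n" if defined $key;
        ($key, $first, $last, $count) = ($k, @f[10 .. 12]);
    }
}
print "$key,$first,$last,$count\n" if defined $key;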

Xho
 
rajeev

If the goal is to aggregate the data, can't you dump it all into a
database and then take a database dump based on the ordering you need?
That might be the fastest way, I guess.
 
Charles DeRykus

Worky said:
I have many large files (100M to 1GB) that are stored in CSV format,
and I would like to efficiently 1) sort each file individually based on
the first 10 "columns" of the CSV file and 2) aggregate all of the
files into a single huge file, merging columns 11-13 together based on
a little bit of logic (i.e. updating count fields and first/last seen
timestamps).

Does anyone know of the best way to go about doing this or a couple of
good modules to look at? I was looking at File::Sort on CPAN and it
looks like it might be able to efficiently sort each file, and then it
would be up to me to aggregate them, but I was kinda hoping that I
could save some processing and do them both at the same time.

Here are a couple of possibly tricky things that I am worried about:
-Number of open file descriptors - I will sometimes want to sort/merge
thousands of files.
-Memory - the files are too big to just snarf up the whole thing into
memory

....

There's a Sort::Merge on CPAN. I haven't looked closely but maybe in
combo with File::Sort, that'd be an option.

Another alternative would be a DBM which takes care of memory issues.
Perl's DB_File interface to Berkeley DB enables a custom sort key sub
via the `btree` format. You could serially process input files or, if
that's too slow, run a fork loop over the input files. Basically, then
you'd just normalize and then insert each file stream into the DBM via
your custom key lookup. The output file would be ready for PostgreSQL
entry.
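
A bare-bones, untested sketch of that idea (the DBM file name, the
first-10-columns key layout, and string-comparable timestamps are all
assumptions; the fork-loop variant is left out):

#!/usr/bin/perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT);

# Untested sketch: aggregate normalized CSV rows into a btree DBM.
# Key = first 10 columns; value = "first_seen,last_seen,count".
$DB_BTREE->{'compare'} = sub { $_[0] cmp $_[1] };   # custom key order goes here

tie my %agg, 'DB_File', 'aggregate.db', O_RDWR | O_CREAT, 0644, $DB_BTREE
    or die "tie failed: $!";

while (<>) {                      # normalized CSV on stdin or in @ARGV
    chomp;
    my @f   = split /,/, $_, -1;
    my $key = join ',', @f[0 .. 9];
    if (defined(my $old = $agg{$key})) {
        my ($first, $last, $count) = split /,/, $old;
        $first = $f[10] if $f[10] lt $first;
        $last  = $f[11] if $f[11] gt $last;
        $agg{$key} = join ',', $first, $last, $count + $f[12];
    }
    else {
        $agg{$key} = join ',', @f[10 .. 12];
    }
}

# A btree-tied hash iterates in key order, so this dumps the aggregated,
# sorted result ready for loading:
while (my ($k, $v) = each %agg) {
    print "$k,$v\n";
}
untie %agg;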
 
Dr.Ruud

A. Sinan Unur wrote:
I would not waste time trying to massage these files before inserting
data into the database.

That is also how I like to do these things. Get the data as unchanged as
possible into a database and create the final tables (in the same or
rather in a different database) with some SQL queries.

If the files are really big and a simple step is feasible to shrink them
considerably, I would do that. Replacing the field separator by a tab,
combined with removing surrounding quotes and trailing spaces from
fields, could be such a step.
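
Something along these lines could do that shrinking step (a rough
sketch; it assumes Text::CSV_XS is installed and that no field ever
contains a tab or an embedded newline):

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;

# Rough sketch: read CSV on STDIN, let the parser strip surrounding
# quotes, trim trailing spaces, and write tab-separated output.
my $csv = Text::CSV_XS->new({ binary => 1 });

while (my $row = $csv->getline(\*STDIN)) {
    s/\s+\z// for @$row;          # trailing spaces
    print join("\t", @$row), "\n";
}
$csv->eof or die "CSV parse error in input\n";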
 
