Worky Workerson
I have many large files (100 MB to 1 GB) that are stored in CSV format,
and I would like to efficiently 1) sort each file individually based on
the first 10 "columns" of the CSV file and 2) aggregate all of the
files into a single huge file, merging columns 11-13 together based on
a little bit of logic (i.e. updating count fields and first/last seen
timestamps).
Does anyone know of the best way to go about doing this or a couple of
good modules to look at? I was looking at File::Sort on CPAN and it
looks like it might be able to efficiently sort each file, and then it
would be up to me to aggregate them, but I was kinda hoping that I
could save some processing and do them both at the same time.
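For what it's worth, I realize that for the per-file sort alone I could probably
just shell out to GNU sort, which already does an external (disk-backed) sort;
something like the snippet below, assuming no quoted/embedded commas in the data
and that a lexical sort on the first 10 fields is acceptable. That still leaves
the aggregation as a separate pass, which is the part I'd like to fold in.

    use strict;
    use warnings;

    my $file = 'input.csv';    # placeholder name
    # External sort on the first 10 comma-separated fields via GNU sort
    # (assumes no quoted/embedded commas in the data).
    system('sort', '-t,', '-k1,10', '-o', "$file.sorted", $file) == 0
        or die "sort of $file failed: $?";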
Here are a couple of possibly tricky things that I am worried about:
- Number of open file descriptors: I will sometimes want to sort/merge
thousands of files (a rough batching sketch is below).
- Memory: the files are too big to just snarf whole into memory.
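One way I can imagine keeping the file descriptor count bounded is to merge in
rounds: merge, say, 100 sorted files into one intermediate file, then merge the
intermediates, and so on. A rough Perl sketch of just the batching loop
(merge_sorted_files is a hypothetical routine that does the actual k-way
merge/aggregation, and the batch size is arbitrary):

    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # Keep at most $BATCH input handles open at once by merging in rounds:
    # every $BATCH sorted files become one intermediate, and the
    # intermediates are merged again until few enough remain for one pass.
    my $BATCH = 100;    # comfortably under typical ulimit -n limits

    sub reduce_to_final_batch {
        my (@files) = @_;
        while (@files > $BATCH) {
            my @next_round;
            while (my @batch = splice(@files, 0, $BATCH)) {
                my ($fh, $tmp) = tempfile(SUFFIX => '.csv', UNLINK => 0);
                close $fh;
                merge_sorted_files($tmp, @batch);   # hypothetical k-way merge/aggregate
                push @next_round, $tmp;
            }
            @files = @next_round;
        }
        return @files;    # small enough set for one final merge pass
    }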
Basically, this stuff comes from a variety of different sources and I
output the normalized CSV files, which I eventually plan on putting
into a database (aggregated). I've written some functions in the
database to merge newly inserted records; however, I am pretty sure
that, if I first sort/aggregate the files, I can get much
better performance. Since I also generate the initial CSV, there is some
opportunity to sort it at that time, but I can't do it all
in memory, as the input and output files are too large, so I was
thinking that a three-stage process was the best way to go:
1) Convert the input format into the normalized CSV format
2) Sort each CSV file on the first 10 columns (rough sketch below)
3) Aggregate the sorted CSV files together
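For stage 2, what I have in mind is a chunked sort, since a whole file won't
fit in memory: read a bounded number of lines, sort that chunk on the first 10
columns, write it out as a sorted run, and leave the runs for stage 3 to merge.
A rough sketch, assuming plain comma-separated lines with no quoted/embedded
commas (otherwise something like Text::CSV would be needed) and an arbitrary
chunk size:

    use strict;
    use warnings;

    # Stage 2 sketch: break one large CSV into sorted runs of at most
    # $CHUNK lines, each sorted on the first 10 comma-separated columns.
    my $CHUNK = 500_000;    # tune to available memory

    sub sort_into_runs {
        my ($infile) = @_;
        open my $in, '<', $infile or die "open $infile: $!";
        my (@lines, @runs);
        my $run = 0;
        while (1) {
            my $line = <$in>;
            push @lines, $line if defined $line;
            if (@lines and (@lines == $CHUNK or !defined $line)) {
                my $out = sprintf '%s.run%03d', $infile, ++$run;
                open my $fh, '>', $out or die "open $out: $!";
                # Schwartzian transform: build the 10-column key once per line
                print {$fh} map  { $_->[1] }
                            sort { $a->[0] cmp $b->[0] }
                            map  { [ join("\0", (split /,/)[0 .. 9]), $_ ] } @lines;
                close $fh;
                push @runs, $out;
                @lines = ();
            }
            last unless defined $line;
        }
        close $in;
        return @runs;    # sorted runs, ready for the stage-3 merge
    }

Stage 3 would then do a streaming merge over these runs (and over the runs from
the other files), collapsing duplicate keys with the logic sketched after the
data format below.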
Data Format:
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,date_first_seen,date_last_seen,number_seen
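To be concrete about the column 11-13 logic: when two rows agree on all of
columns 1-10, I want to keep the earliest date_first_seen, the latest
date_last_seen, and the sum of number_seen. Roughly like this (assuming the
timestamps are in a format that compares correctly as strings, e.g. ISO 8601;
anything else would need real date parsing):

    use strict;
    use warnings;

    # Combine two rows (arrayrefs of the 13 fields) whose first 10 columns match.
    sub merge_rows {
        my ($left, $right) = @_;
        my @merged = @{$left}[0 .. 9];    # key columns are identical by assumption
        push @merged, ($left->[10] lt $right->[10] ? $left->[10] : $right->[10]);  # earliest date_first_seen
        push @merged, ($left->[11] gt $right->[11] ? $left->[11] : $right->[11]);  # latest date_last_seen
        push @merged, $left->[12] + $right->[12];                                  # summed number_seen
        return \@merged;
    }

During the stage-3 merge, whenever two adjacent records come out with the same
10-column key, they would be collapsed with something like this before being
written out (or inserted into the database).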
Any insight/pointers would be greatly appreciated.
Thanks!