Appending to the middle of a file



scottmf

I am parsing very large data files (4 million lines or more) to reorder
the data and eliminate unnecessary information. Unfortunately because
of how the file is arranged I have to read the entire file before
processing the data. Currently everything is written to 2-d arrays and
takes about 3Gb of memory to process. I would like to start using a
temp file so that machines with less memory can still complete the
process, but in order to do so I need to be able to append data to the
middle of the file.

eg:
starting with data file:
Line: order read:
load1 1
ID1 a b c 2
ID2 d e f 3
ID3 g h i 4
load2 5
ID1 j k l 6
ID2 m n o 7
ID3 p q r 8

temp file becomes:
Line: order wrote:
ID1 1
load1 a b c 2
load2 j k l 7
ID2 3
load1 d e f 4
load2 m n o 8
ID3 5
load1 g h i 6
load2 p q r 9


any suggestions are much appreciated.
 

xhoster

scottmf said:
I am parsing very large data files (4 million lines or more) to reorder
the data and eliminate unnecessary information. Unfortunately because
of how the file is arranged I have to read the entire file before
processing the data. Currently everything is written to 2-d arrays and
takes about 3Gb of memory to process.

I have no idea what this means. "written to" implies disk, while
"2-d arrays" suggests perl's in-memory data structures.
I would like to start using a
temp file so that machines with less memory can still complete the
process, but in order to do so I need to be able to append data to the
middle of the file.

Appending is, by definition, done at the end of the file, not the middle.
There are ways to insert into the middle of a large file (see Tie::File),
but they all are either hideously inefficient or hideously complicated, if
not both.

eg:
starting with data file:
Line: order read:
load1 1
ID1 a b c 2
ID2 d e f 3
ID3 g h i 4
load2 5
ID1 j k l 6
ID2 m n o 7
ID3 p q r 8

temp file becomes:
Line: order wrote:
ID1 1
load1 a b c 2
load2 j k l 7
ID2 3
load1 d e f 4
load2 m n o 8
ID3 5
load1 g h i 6
load2 p q r 9

any suggestions are much appreciated.

If you are doing what I think you are doing, then I would suggest
a perl script to convert the input file to something like:

ID1 load1 a b c
ID2 load1 d e f
ID3 load1 g h i
ID1 load2 j k l
ID2 load2 m n o
ID3 load2 p q r

And then using your OS's sort program to sort by ID so that equal IDs are
grouped together, and then another perl program to process that file into
what you want.
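A minimal sketch of that conversion step (the record layout is assumed from the example above; for 4 million lines you would read from a filehandle rather than an array, and hand the result to the OS sort):

```perl
use strict;
use warnings;

# Turn the load-grouped input into one self-describing line per record,
# so an external sort on the whole file groups equal IDs together.
# Record layout is assumed from the example: a "loadN" header line,
# then "ID x y z" data lines.
sub reorder_lines {
    my @out;
    my $current_load;
    for my $line (@_) {
        if ($line =~ /^load\S*$/) {
            $current_load = $line;              # entering a new load block
        }
        elsif (defined $current_load) {
            my ($id, $rest) = split ' ', $line, 2;
            push @out, "$id $current_load $rest";
        }
    }
    return @out;
}

my @converted = reorder_lines(
    'load1', 'ID1 a b c', 'ID2 d e f',
    'load2', 'ID1 j k l', 'ID2 m n o',
);
print "$_\n" for sort @converted;   # the OS sort would do this for real data
```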

Alternatively, you could make one temp file for each different ID value,
and then combine all of these temp files together at the end.

Xho
 

Mark Clements

scottmf said:
I am parsing very large data files (4 million lines or more) to reorder
the data and eliminate unnecessary information. Unfortunately because
of how the file is arranged I have to read the entire file before
processing the data. Currently everything is written to 2-d arrays and
takes about 3Gb of memory to process. I would like to start using a
temp file so that machines with less memory can still complete the
process, but in order to do so I need to be able to append data to the
middle of the file.

eg:
starting with data file:
Line: order read:
load1 1
ID1 a b c 2
ID2 d e f 3
ID3 g h i 4
load2 5
ID1 j k l 6
ID2 m n o 7
ID3 p q r 8

temp file becomes:
Line: order wrote:
ID1 1
load1 a b c 2
load2 j k l 7
ID2 3
load1 d e f 4
load2 m n o 8
ID3 5
load1 g h i 6
load2 p q r 9


any suggestions are much appreciated.

Xho has already answered this, but I would be tempted to throw the data into
an RDBMS of some description and let that do the hard work, though this
complicates matters and requires access to e.g. MySQL and knowledge of SQL.
Process line-by-line in Perl to get the data into the db, then pull it out
again as required. Drop indexes before insertion and rebuild them afterwards.
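A rough sketch of that route, using an in-memory SQLite database via DBD::SQLite (an assumption for the example; MySQL through DBI looks nearly identical):

```perl
use strict;
use warnings;
use DBI;

# Sketch of the RDBMS approach with DBD::SQLite; the table and column
# names are invented for illustration.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do('CREATE TABLE forces (id TEXT, load_name TEXT, data TEXT)');

# Insert line-by-line as you parse; for millions of rows, wrap the inserts
# in a transaction and create any indexes only after loading.
my $ins = $dbh->prepare(
    'INSERT INTO forces (id, load_name, data) VALUES (?, ?, ?)');
$ins->execute(@$_) for (
    [ 'ID1', 'load1', 'a b c' ],
    [ 'ID2', 'load1', 'd e f' ],
    [ 'ID1', 'load2', 'j k l' ],
);

# Pull the rows back out grouped by ID, which is exactly the reordering
# the original poster wants.
my $rows = $dbh->selectall_arrayref(
    'SELECT id, load_name, data FROM forces ORDER BY id, load_name');
print "@$_\n" for @$rows;
```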

Mark
 

scottmf

I have no idea what this means. "written to" implies disk, while
"2-d arrays" suggests perl's in-memory data structures.

I meant the information was stored in 2-d arrays (in-memory).

Thanks for the ideas.

~Scott
 

scottmf

Because of other file formats I also have to be able to parse, and the fact
that I am using Windows XP (I don't know of any sort programs that come with
Windows), using one temp file for each ID value seems much easier. In that
case, is there any way I can automatically generate the filehandle from the
ID value? I.e., given ID1, ID2, and ID3, can I automatically do:
my $fh_ID1 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
my $fh_ID2 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
my $fh_ID3 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");

~Scott
 

xhoster

scottmf said:
Because of other file formats I also have to be able to parse and the
fact that I am using Windows XP (don't know of any sort programs that
come with windows), using one temp file for each ID value seems much
easier. In that case is there any way I can automatically generate the
filehandle from the ID value; i.e. given ID1, ID2, and ID3 can I
automatically do:
my $fh_ID1 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
my $fh_ID2 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
my $fh_ID3 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");

You should hold the file handles in an array or a hash.
I'd probably do something like this:

my %fh;   ## holds a hash (by ID) of filehandles

while (<INPUT_DATA>) {
    ## some stuff which sets $id and $to_print
    unless (exists $fh{$id}) {
        $fh{$id} = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
    }
    print {$fh{$id}} $to_print;
}

Except probably I'd control the naming of the files myself, rather than
letting a module do it, because I would probably want to control the order
in which they are combined when I'm done.

Xho
 

Brian McCauley

scottmf said:
my $fh_ID1 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
my $fh_ID2 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
my $fh_ID3 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");

Always remember: if you are doing it three times, you are probably doing
it wrong.

Use a loop and put the filehandles in an array (or more likely a hash).
 

Brian McCauley

unless (exists $fh{$id}) {
    $fh{$id} = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
}

There is no need for exists() so this is more simply

$fh{$id} ||= tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
 

scottmf

Thanks for the help, that does exactly what I needed.

As far as naming the files; since I want the final output to basically
be the contents of all of the temp files sorted by the ID value which I
can do using something similar to:

my ($key, $line);
open(OUTPUT, ">>", "sorted.dat");
foreach $key (sort keys %fh) {
    while($line = {$fh{$key}}) {
        print OUTPUT $line;
    }
}

is there any reason to control the naming of the files (there will be
several hundred) myself?
~Scott
 

skye.shaw

XP does have a sort program; just check your command line. As for the
problem.....

So all you want to do is eliminate redundant data? The meat of the problem
is that this redundant data is scattered everywhere (in the file, that is),
and you can't read all that data at once to sort it out on smaller machines,
correct?

OK, let's start with what's already built for us: the sort program. Running
sort /? gives this key chunk of info that already partly tackles your problem:

By default the sort will be done with one pass (no temporary file) if it
fits in the default maximum memory size, otherwise the sort will be done
in two passes (with the partially sorted data being stored in a temporary
file) such that the amounts of memory used for both the sort and merge
passes are equal. The default maximum memory size is 90% of available
main memory if both the input and output are files, and 45% of main
memory otherwise.


Now, it says two passes, although a large amount of data on a small-memory
system might seem like it could take more than two passes. So we'll save
plan B for later; for now, assume it works, use sort to do the "hard" work
for you, and work from its sorted output file.

Then, in Perl, create your master output file for the data you're about to
parse. Read in all the "A" data (i.e. the first set of in-order data) and
eliminate the doubles as usual. After this is done, write the results to
the master file. Then read the "B" data and do the same.

As for plan B: let me know how this goes, or if I missed anything in the
scope of the problem, or if some other poster thinks this idea sucks, and
I will work my carpals a little more. Also, as a side note, you can use the
split program to chop your files into chunks, but for Windows you have to
get Cygwin or coLinux or something. I'm sure there is a Win32 version too.
 

skye.shaw

Thanks for the help, that does exactly what I needed.
Oh well I guess mine got in a little late.
 

scottmf

It looks like this would work, although since the files I'm being asked
to parse keep getting larger (I just got one that is 1.7 GB!), I think
splitting the data into many temp files will be more stable for future
versions. I always appreciate having more than one way to solve a
problem though, so thanks for the reply.

~Scott
 

skye.shaw

Oh yeah, what are you using these files for? And why are they getting so
large? Are these files all text? I would drop the database idea on them.
If they don't like that, then tell them you have this great new way to
index large amounts of data in binary files for fast retrieval. Then ask
for a raise of 5 cents per ASCII char per file.
 

xhoster

scottmf said:
Thanks for the help, that does exactly what I needed.

As far as naming the files; since I want the final output to basically
be the contents of all of the temp files sorted by the ID value which I
can do using something similar to:

my ($key, $line);

Don't declare them there, declare them in the smallest scope
open(OUTPUT, ">>", "sorted.dat");
foreach $key (sort keys %fh) {

should instead be

foreach my $key (sort keys %fh) {

Does the tempfile routine you used return a handle that is open for
both reading and writing? If so, you probably still need to rewind
the file pointer before you start reading. Something like:

seek $fh{$key}, 0,0; ## test for failure? Does it work for windows?

while($line = {$fh{$key}}) {

You probably want angle rather than curly brackets there, but that
still won't work, because angle brackets require a simple scalar, not
a hash element.

while (my $line=readline($fh{$key})) {
print OUTPUT $line;
}
}

is there any reason to control the naming of the files

During the combine stage, I would just reopen the files for reading rather
than messing around with "seek" and making sure the originals were
read/write. As long as you don't mind messing around with seek and
read/write, then there is no reason to control the naming. Well, maybe
one: do your tempfiles disappear once their handles are closed? If so,
then what happens if your program bombs out during the last stage? All of
your computer's work (potentially hours of it) in making those files would
be lost. If you named them by hand, it would be a simple matter to restart
at the combine stage. It is a trade-off between recoverability and leaving
a mess behind.
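A sketch of that split-then-reopen flow with self-named files (the names and sample data are invented for illustration); reopening each file read-only for the combine avoids seek and read/write handles entirely:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Split stage appends each record to a file named after its ID; combine
# stage reopens each file read-only, in sorted ID order.
my $tempdir = tempdir(CLEANUP => 1);

my %path;   # ID => its temp file's path
for my $rec (['ID2', 'd e f'], ['ID1', 'a b c'], ['ID1', 'j k l']) {
    my ($id, $data) = @$rec;
    $path{$id} ||= "$tempdir/$id.dat";
    # Open-append-close per record is simple but slow; caching open
    # handles is faster but runs into the open-filehandle limit.
    open my $out, '>>', $path{$id} or die "append $path{$id}: $!";
    print {$out} "$data\n";
    close $out;
}

my @combined;
for my $id (sort keys %path) {
    open my $in, '<', $path{$id} or die "read $path{$id}: $!";
    while (my $line = <$in>) {
        chomp $line;
        push @combined, "$id $line";
    }
    close $in;
}
print "$_\n" for @combined;
```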
(there will be
several hundred) myself?

With several hundred, you might run into problems with limits on the number
of open file handles.

Xho
 

scottmf

The files contain information from finite element analysis (all single
precision floats); basically all the forces each element is exposed to
for all different cases. Thousands of cases * thousands of elements =
Very large files. They can also be formatted two different ways, which
makes my job even more of a pain:

format 1:
case #1
element1 forcex forcey forcexy
element2 forcex forcey forcexy
case #2
element1 forcex forcey forcexy
element2 forcex forcey forcexy

format2:
case #1 x
element1 forcex
element2 forcex
case #1 y
element1 forcey
element2 forcey
case #1 xy
element1 forcexy
element2 forcexy
case #2 x
element1 forcex
element2 forcex
case #2 y
element1 forcey
element2 forcey
case #2 xy
element1 forcexy
element2 forcexy
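For what it's worth, a hypothetical sketch of folding format 2 back into format-1-style rows (the field names and values are invented from the example, not the real file spec):

```perl
use strict;
use warnings;

# Fold format 2 (one force component per "case" block) into
# format-1-style rows (element => x y xy per case).
my @format2 = (
    'case #1 x',  'element1 1.0', 'element2 2.0',
    'case #1 y',  'element1 3.0', 'element2 4.0',
    'case #1 xy', 'element1 5.0', 'element2 6.0',
);

my %force;             # $force{case}{element}{component} = value
my ($case, $comp);
for my $line (@format2) {
    if ($line =~ /^case\s+(\S+)\s+(\S+)$/) {
        ($case, $comp) = ($1, $2);         # e.g. case '#1', component 'x'
    }
    elsif (defined $case) {
        my ($elem, $val) = split ' ', $line;
        $force{$case}{$elem}{$comp} = $val;
    }
}

# Emit one format-1-style line per element for case #1.
for my $elem (sort keys %{ $force{'#1'} }) {
    my $f = $force{'#1'}{$elem};
    print "$elem $f->{x} $f->{y} $f->{xy}\n";
}
```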

Unfortunately there is no way to get them to change how the data is saved.
Before I started writing my code they were entering the data into Excel
by hand!
 

scottmf

Is there an easy way to find out the max number of open filehandles? I
know that with Perl v5.8 the FileCache module can be used to manage the
number of simultaneous open filehandles, but I cannot get that for
several weeks. I tried the following, and if I increase the max value of
$i one at a time I can tell when the creation fails, but if I just set
the max very high I get the following error:

use strict;
use Carp::Heavy;
use File::Temp qw(tempfile tempdir);
my $tempdir = tempdir(CLEANUP => 1);
my %fh;

for (my $i = 1; $i <= 550; ++$i) {
    $fh{$i} = tempfile() or die "Could not create filehandle #$i\n";
}

returns:
Error in tempfile() using C:\DOCUME~1\user\LOCALS~1\Temp\XXXXXXXXXX:
Could not create temp file C:\DOCUME~1\user\LOCALS~1\Temp\QSksT8zrbh:
Too many open files at file_handles.pl line 12

rather than the message following the or die, which would have told me
the max number of filehandles.
 

xhoster

scottmf said:
Is there an easy way to find out the max number of open filehandles? I
know that with perl v5.8 the Filecache package can be used to manage
the number of simultanious open filehandles, but I cannot get that for
several weeks.

You can do a fairly decent job yourself with something like:

unless (exists $fh{$id}) {
    %fh = () if keys %fh >= $max_open_handles;
    open $fh{$id}, ">>/tmp/foo/$id.dat" or die $!;
}

(Of course, it requires you to manage the naming yourself, so that
you can open for appending to the correct file.)
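A runnable version of that idea, with a made-up helper name (`append_for`) and a deliberately tiny cap, just to exercise the reset:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Cache append handles in a hash and close the whole cache when it hits
# the cap; because the files are opened in append mode, reopening later
# picks up where the last write left off.
my $tempdir = tempdir(CLEANUP => 1);
my $max_open_handles = 3;
my %fh;

sub append_for {
    my ($id, $text) = @_;
    unless (exists $fh{$id}) {
        # Emptying the hash drops the last references to the handles,
        # which flushes and closes them.
        %fh = () if keys(%fh) >= $max_open_handles;
        open $fh{$id}, '>>', "$tempdir/$id.dat" or die "open $id: $!";
    }
    print { $fh{$id} } $text;
}

append_for("ID$_", "line for ID$_\n") for 1 .. 10;
%fh = ();   # close everything before reading the files back
```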


I tried the following, and if I change the max value of
$i 1 at a time I can tell when the creation fails, but if I just set
the max very high I get the following error:

use strict;
use Carp::Heavy;
use File::Temp qw(tempfile tempdir);
my $tempdir = tempdir(CLEANUP => 1);
my %fh;

for (my $i = 1; $i <=550; ++$i)
{
$fh{$i} = tempfile() or die "Could not create filehandle #$i\n";

tempfile() dies on failure rather than returning false, so the or die
after it never gets a chance to run. Trap the failure with eval instead:

eval { $fh{$i} = tempfile() }
    or die "Could not create filehandle #$i\n";


Xho
 
