Appending to the middle of a file



scottmf

I am parsing very large data files (4 million lines or more) to reorder
the data and eliminate unnecessary information. Unfortunately because
of how the file is arranged I have to read the entire file before
processing the data. Currently everything is written to 2-d arrays and
takes about 3Gb of memory to process. I would like to start using a
temp file so that machines with less memory can still complete the
process, but in order to do so I need to be able to append data to the
middle of the file.

eg:
starting with data file:
Line: order read:
load1 1
ID1 a b c 2
ID2 d e f 3
ID3 g h i 4
load2 5
ID1 j k l 6
ID2 m n o 7
ID3 p q r 8

temp file becomes:
Line: order wrote:
ID1 1
load1 a b c 2
load2 j k l 7
ID2 3
load1 d e f 4
load2 m n o 8
ID3 5
load1 g h i 6
load2 p q r 9


any suggestions are much appreciated.
 

xhoster

scottmf said:
I am parsing very large data files (4 million lines or more) to reorder
the data and eliminate unnecessary information. Unfortunately because
of how the file is arranged I have to read the entire file before
processing the data. Currently everything is written to 2-d arrays and
takes about 3Gb of memory to process.

I have no idea what this means. "written to" implies disk, while
"2-d arrays" suggests perl's in-memory data structures.
I would like to start using a
temp file so that machines with less memory can still complete the
process, but in order to do so I need to be able to append data to the
middle of the file.

Appending is, by definition, done at the end of the file, not the middle.
There are ways to insert into the middle of a large file (see Tie::File),
but they all are either hideously inefficient or hideously complicated, if
not both.

eg:
starting with data file:
Line: order read:
load1 1
ID1 a b c 2
ID2 d e f 3
ID3 g h i 4
load2 5
ID1 j k l 6
ID2 m n o 7
ID3 p q r 8

temp file becomes:
Line: order wrote:
ID1 1
load1 a b c 2
load2 j k l 7
ID2 3
load1 d e f 4
load2 m n o 8
ID3 5
load1 g h i 6
load2 p q r 9

any suggestions are much appreciated.

If you are doing what I think you are doing, then I would suggest
a perl script to convert the input file to something like:

ID1 load1 a b c
ID2 load1 d e f
ID3 load1 g h i
ID1 load2 j k l
ID2 load2 m n o
ID3 load2 p q r

And then using your OS's sort program to sort by ID so that equal IDs are
grouped together, and then another perl program to process that file into
what you want.
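A minimal sketch of that conversion step (the record layout is assumed from the example above; for 4 million lines you would read from a filehandle rather than an array, and hand the result to the OS sort):

```perl
use strict;
use warnings;

# Turn the load-grouped input into one self-describing line per record,
# so an external sort on the whole file groups equal IDs together.
# Record layout is assumed from the example: a "loadN" header line,
# then "ID x y z" data lines.
sub reorder_lines {
    my @out;
    my $current_load;
    for my $line (@_) {
        if ($line =~ /^load\S*$/) {
            $current_load = $line;              # entering a new load block
        }
        elsif (defined $current_load) {
            my ($id, $rest) = split ' ', $line, 2;
            push @out, "$id $current_load $rest";
        }
    }
    return @out;
}

my @converted = reorder_lines(
    'load1', 'ID1 a b c', 'ID2 d e f',
    'load2', 'ID1 j k l', 'ID2 m n o',
);
print "$_\n" for sort @converted;   # the OS sort would do this for real data
```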

Alternatively, you could make one temp file for each different ID value,
and then combine all of these temp files together at the end.

Xho
 

Mark Clements

scottmf said:
I am parsing very large data files (4 million lines or more) to reorder
the data and eliminate unnecessary information. Unfortunately because
of how the file is arranged I have to read the entire file before
processing the data. Currently everything is written to 2-d arrays and
takes about 3Gb of memory to process. I would like to start using a
temp file so that machines with less memory can still complete the
process, but in order to do so I need to be able to append data to the
middle of the file.

eg:
starting with data file:
Line: order read:
load1 1
ID1 a b c 2
ID2 d e f 3
ID3 g h i 4
load2 5
ID1 j k l 6
ID2 m n o 7
ID3 p q r 8

temp file becomes:
Line: order wrote:
ID1 1
load1 a b c 2
load2 j k l 7
ID2 3
load1 d e f 4
load2 m n o 8
ID3 5
load1 g h i 6
load2 p q r 9


any suggestions are much appreciated.

Xho has already answered this, but I would be tempted to throw the data into
an RDBMS of some description and let that do the hard work, though this
complicates matters and requires access to e.g. MySQL and knowledge of SQL.
Process line-by-line in Perl to get the data into the db, then pull it out
again as required. Drop indexes before insertion and rebuild them afterwards.
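A rough sketch of that route, using an in-memory SQLite database via DBD::SQLite (an assumption for the example; MySQL through DBI looks nearly identical):

```perl
use strict;
use warnings;
use DBI;

# Sketch of the RDBMS approach with DBD::SQLite; the table and column
# names are invented for illustration.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do('CREATE TABLE forces (id TEXT, load_name TEXT, data TEXT)');

# Insert line-by-line as you parse; for millions of rows, wrap the inserts
# in a transaction and create any indexes only after loading.
my $ins = $dbh->prepare(
    'INSERT INTO forces (id, load_name, data) VALUES (?, ?, ?)');
$ins->execute(@$_) for (
    [ 'ID1', 'load1', 'a b c' ],
    [ 'ID2', 'load1', 'd e f' ],
    [ 'ID1', 'load2', 'j k l' ],
);

# Pull the rows back out grouped by ID, which is exactly the reordering
# the original poster wants.
my $rows = $dbh->selectall_arrayref(
    'SELECT id, load_name, data FROM forces ORDER BY id, load_name');
print "@$_\n" for @$rows;
```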

Mark
 

scottmf

I have no idea what this means. "written to" implies disk, while
"2-d arrays" suggests perl's in-memory data structures.

I meant the information was stored in 2-d arrays (in-memory).

Thanks for the ideas.

~Scott
 

scottmf

Because of other file formats I also have to be able to parse, and the fact
that I am using Windows XP (I don't know of any sort programs that come with
Windows), using one temp file for each ID value seems much easier. In that
case, is there any way I can automatically generate the filehandle from the
ID value? I.e., given ID1, ID2, and ID3, can I automatically do:
my $fh_ID1 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
my $fh_ID2 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
my $fh_ID3 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");

~Scott
 

xhoster

scottmf said:
Because of other file formats I also have to be able to parse and the
fact that I am using Windows XP (don't know of any sort programs that
come with windows), using one temp file for each ID value seems much
easier. In that case is there any way I can automatically generate the
filehandle from the ID value; i.e. given ID1, ID2, and ID3 can I
automatically do:
my $fh_ID1 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
my $fh_ID2 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
my $fh_ID3 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");

You should hold the file handles in an array or a hash.
I'd probably do something like this:

my %fh;   ## holds a hash (by ID) of filehandles

while (<INPUT_DATA>) {
    ## some stuff which sets $id and $to_print
    unless (exists $fh{$id}) {
        $fh{$id} = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
    }
    print {$fh{$id}} $to_print;
}

Except probably I'd control the naming of the files myself, rather than
letting a module do it, because I would probably want to control the order
in which they are combined when I'm done.

Xho
 

Brian McCauley

scottmf said:
my $fh_ID1 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
my $fh_ID2 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
my $fh_ID3 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");

Always remember: if you are doing it three times, you are probably doing
it wrong.

Use a loop and put the filehandles in an array (or more likely a hash).
 

Brian McCauley

unless (exists $fh{$id}) {
    $fh{$id} = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
}

There is no need for exists() so this is more simply

$fh{$id} ||= tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
 

scottmf

Thanks for the help, that does exactly what I needed.

As far as naming the files; since I want the final output to basically
be the contents of all of the temp files sorted by the ID value which I
can do using something similar to:

my ($key, $line);
open(OUTPUT, ">>", "sorted.dat");
foreach $key (sort keys %fh) {
    while($line = {$fh{$key}}) {
        print OUTPUT $line;
    }
}

is there any reason to control the naming of the files (there will be
several hundred) myself?
~Scott
 

skye.shaw

XP does have a sort program; just check your command line. As for the
problem.....

So all you want to do is eliminate redundant data? The meat of the problem
is that this redundant data is scattered everywhere (in the file, that is),
and you can't read all that data at once to sort it out on smaller machines,
correct?

OK, let's start with what's already built for us: the sort program. Running
sort /? gives this key chunk of info that already partly tackles your problem:

By default the sort will be done with one pass (no temporary file) if it
fits in the default maximum memory size, otherwise the sort will be done
in two passes (with the partially sorted data being stored in a temporary
file) such that the amounts of memory used for both the sort and merge
passes are equal. The default maximum memory size is 90% of available
main memory if both the input and output are files, and 45% of main
memory otherwise.


Now, it says two passes, although a large amount of data on a small-memory
system might seem like it could take more than two passes. So we'll save
plan B for later; for now, assume it works, use sort to do the "hard" work
for you, and work from its sorted output file.

Then, in Perl, create your master output file for the data you're about to
parse. Read in all the "A" data (i.e. the first set of in-order data) and
eliminate the doubles as usual. After this is done, write the results to
the master file. Then read the "B" data and do the same.

As for plan B: let me know how this goes, or if I missed anything in the
scope of the problem, or if some other poster thinks this idea sucks, and
I will work my carpals a little more. Also, as a side note, you can use the
split program to chop your files into chunks, but for Windows you have to
get Cygwin or coLinux or something. I'm sure there is a Win32 version too.
 

skye.shaw

Thanks for the help, that does exactly what I needed.
Oh well I guess mine got in a little late.
 

scottmf

It looks like this would work, although since the files I'm being asked
to parse keep getting larger (I just got one that is 1.7 GB!), I think
splitting the data into many temp files will be more stable for future
versions. I always appreciate having more than one way to solve a
problem though, so thanks for the reply.

~Scott
 

skye.shaw

Oh yeah, what are you using these files for? And why are they getting so
large? Are these files all text? I would drop the database idea on them.
If they don't like that, then tell them you have this great new way to
index large amounts of data in binary files for fast retrieval. Then ask
for a raise of 5 cents per ASCII char per file.
 

xhoster

scottmf said:
Thanks for the help, that does exactly what I needed.

As far as naming the files; since I want the final output to basically
be the contents of all of the temp files sorted by the ID value which I
can do using something similar to:

my ($key, $line);

Don't declare them there, declare them in the smallest scope
open(OUTPUT, ">>", "sorted.dat");
foreach $key (sort keys %fh) {

should instead be

foreach my $key (sort keys %fh) {

Does the tempfile routine you used return a handle that is open for
both reading and writing? If so, you probably still need to rewind
the file pointer before you start reading. Something like:

seek $fh{$key}, 0,0; ## test for failure? Does it work for windows?

while($line = {$fh{$key}}) {

You probably want angle rather than curly brackets there, but that
still won't work, because angle brackets require a simple scalar, not
a hash element.

while (my $line=readline($fh{$key})) {
print OUTPUT $line;
}
}

is there any reason to control the naming of the files

During the combine stage, I would just reopen the files for reading rather
than messing around with "seek" and making sure the originals were
read/write. As long as you don't mind messing around with seek and
read/write, then there is no reason to control the naming. Well, maybe
one: do your tempfiles disappear once their handles are closed? If so,
then what happens if your program bombs out during the last stage? All of
your computer's work (potentially hours of it) in making those files would
be lost. If you named them by hand, it would be a simple matter to restart
at the combine stage. It is a trade-off between recoverability and leaving
a mess behind.
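A sketch of that split-then-reopen flow with self-named files (the names and sample data are invented for illustration); reopening each file read-only for the combine avoids seek and read/write handles entirely:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Split stage appends each record to a file named after its ID; combine
# stage reopens each file read-only, in sorted ID order.
my $tempdir = tempdir(CLEANUP => 1);

my %path;   # ID => its temp file's path
for my $rec (['ID2', 'd e f'], ['ID1', 'a b c'], ['ID1', 'j k l']) {
    my ($id, $data) = @$rec;
    $path{$id} ||= "$tempdir/$id.dat";
    # Open-append-close per record is simple but slow; caching open
    # handles is faster but runs into the open-filehandle limit.
    open my $out, '>>', $path{$id} or die "append $path{$id}: $!";
    print {$out} "$data\n";
    close $out;
}

my @combined;
for my $id (sort keys %path) {
    open my $in, '<', $path{$id} or die "read $path{$id}: $!";
    while (my $line = <$in>) {
        chomp $line;
        push @combined, "$id $line";
    }
    close $in;
}
print "$_\n" for @combined;
```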
(there will be
several hundred) myself?

With several hundred, you might run into problems with limits on the number
of open file handles.

Xho
 

scottmf

The files contain information from finite element analysis (all single
precision floats); basically all the forces each element is exposed to
for all different cases. Thousands of cases * thousands of elements =
Very large files. They can also be formatted two different ways, which
makes my job even more of a pain:

format 1:
case #1
element1 forcex forcey forcexy
element2 forcex forcey forcexy
case #2
element1 forcex forcey forcexy
element2 forcex forcey forcexy

format2:
case #1 x
element1 forcex
element2 forcex
case #1 y
element1 forcey
element2 forcey
case #1 xy
element1 forcexy
element2 forcexy
case #2 x
element1 forcex
element2 forcex
case #2 y
element1 forcey
element2 forcey
case #2 xy
element1 forcexy
element2 forcexy
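For what it's worth, a hypothetical sketch of folding format 2 back into format-1-style rows (the field names and values are invented from the example, not the real file spec):

```perl
use strict;
use warnings;

# Fold format 2 (one force component per "case" block) into
# format-1-style rows (element => x y xy per case).
my @format2 = (
    'case #1 x',  'element1 1.0', 'element2 2.0',
    'case #1 y',  'element1 3.0', 'element2 4.0',
    'case #1 xy', 'element1 5.0', 'element2 6.0',
);

my %force;             # $force{case}{element}{component} = value
my ($case, $comp);
for my $line (@format2) {
    if ($line =~ /^case\s+(\S+)\s+(\S+)$/) {
        ($case, $comp) = ($1, $2);         # e.g. case '#1', component 'x'
    }
    elsif (defined $case) {
        my ($elem, $val) = split ' ', $line;
        $force{$case}{$elem}{$comp} = $val;
    }
}

# Emit one format-1-style line per element for case #1.
for my $elem (sort keys %{ $force{'#1'} }) {
    my $f = $force{'#1'}{$elem};
    print "$elem $f->{x} $f->{y} $f->{xy}\n";
}
```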

Unfortunately there is no way to get them to change how the data is saved.
Before I started writing my code they were entering the data into Excel
by hand!
 

scottmf

Is there an easy way to find out the max number of open filehandles? I
know that with Perl v5.8 the FileCache module can be used to manage the
number of simultaneous open filehandles, but I cannot get that for
several weeks. I tried the following, and if I increase the max value of
$i one at a time I can tell when the creation fails, but if I just set
the max very high I get the following error:

use strict;
use Carp::Heavy;
use File::Temp qw(tempfile tempdir);
my $tempdir = tempdir(CLEANUP => 1);
my %fh;

for (my $i = 1; $i <= 550; ++$i) {
    $fh{$i} = tempfile() or die "Could not create filehandle #$i\n";
}

returns:
Error in tempfile() using C:\DOCUME~1\user\LOCALS~1\Temp\XXXXXXXXXX:
Could not create temp file C:\DOCUME~1\user\LOCALS~1\Temp\QSksT8zrbh:
Too many open files at file_handles.pl line 12

rather than the message following the or die, which would have told me
the max number of filehandles.
 

xhoster

scottmf said:
Is there an easy way to find out the max number of open filehandles? I
know that with perl v5.8 the Filecache package can be used to manage
the number of simultanious open filehandles, but I cannot get that for
several weeks.

You can do a fairly decent job yourself with something like:

unless (exists $fh{$id}) {
    %fh = () if keys %fh >= $max_open_handles;
    open $fh{$id}, ">>/tmp/foo/$id.dat" or die $!;
}

(Of course, it requires you to manage the naming yourself, so that
you can open for appending to the correct file.)
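A runnable version of that idea, with a made-up helper name (`append_for`) and a deliberately tiny cap, just to exercise the reset:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Cache append handles in a hash and close the whole cache when it hits
# the cap; because the files are opened in append mode, reopening later
# picks up where the last write left off.
my $tempdir = tempdir(CLEANUP => 1);
my $max_open_handles = 3;
my %fh;

sub append_for {
    my ($id, $text) = @_;
    unless (exists $fh{$id}) {
        # Emptying the hash drops the last references to the handles,
        # which flushes and closes them.
        %fh = () if keys(%fh) >= $max_open_handles;
        open $fh{$id}, '>>', "$tempdir/$id.dat" or die "open $id: $!";
    }
    print { $fh{$id} } $text;
}

append_for("ID$_", "line for ID$_\n") for 1 .. 10;
%fh = ();   # close everything before reading the files back
```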


I tried the following, and if I change the max value of
$i 1 at a time I can tell when the creation fails, but if I just set
the max very high I get the following error:

use strict;
use Carp::Heavy;
use File::Temp qw(tempfile tempdir);
my $tempdir = tempdir(CLEANUP => 1);
my %fh;

for (my $i = 1; $i <=550; ++$i)
{
$fh{$i} = tempfile() or die "Could not create filehandle #$i\n";

tempfile() dies on failure rather than returning false, so the or die
after it never gets a chance to run. Trap the failure with eval instead:

eval { $fh{$i} = tempfile() }
    or die "Could not create filehandle #$i\n";


Xho
 
