Appending to the middle of a file

Discussion in 'Perl Misc' started by scottmf, Jun 1, 2005.

  1. scottmf

    scottmf Guest

    I am parsing very large data files (4 million lines or more) to reorder
    the data and eliminate unnecessary information. Unfortunately because
    of how the file is arranged I have to read the entire file before
    processing the data. Currently everything is written to 2-d arrays and
    takes about 3Gb of memory to process. I would like to start using a
    temp file so that machines with less memory can still complete the
    process, but in order to do so I need to be able to append data to the
    middle of the file.

    eg:
    starting with data file:
    Line: order read:
    load1 1
    ID1 a b c 2
    ID2 d e f 3
    ID3 g h i 4
    load2 5
    ID1 j k l 6
    ID2 m n o 7
    ID3 p q r 8

    temp file becomes:
    Line: order wrote:
    ID1 1
    load1 a b c 2
    load2 j k l 7
    ID2 3
    load1 d e f 4
    load2 m n o 8
    ID3 5
    load1 g h i 6
    load2 p q r 9


    any suggestions are much appreciated.
     
    scottmf, Jun 1, 2005
    #1

  2. scottmf

    Guest

    "scottmf" <> wrote:
    > I am parsing very large data files (4 million lines or more) to reorder
    > the data and eliminate unnecessary information. Unfortunately because
    > of how the file is arranged I have to read the entire file before
    > processing the data. Currently everything is written to 2-d arrays and
    > takes about 3Gb of memory to process.


    I have no idea what this means. "written to" implies disk, while
    "2-d arrays" suggests perl's in-memory data structures.

    >I would like to start using a
    > temp file so that machines with less memory can still complete the
    > process, but in order to do so I need to be able to append data to the
    > middle of the file.


    Appending is, by definition, done at the end of the file, not the middle.
    There are ways to insert into the middle of a large file (see Tie::File),
    but they are all either hideously inefficient or hideously complicated, if
    not both.
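    For what it's worth, a minimal Tie::File sketch of such an insert (the
    file name and line number are made up), mostly to show why it hurts on a
    big file: everything after the splice point has to be rewritten on disk.

    use strict;
    use warnings;
    use Tie::File;

    # Present the file as an array of lines; edits write through to disk.
    tie my @lines, 'Tie::File', 'big.dat' or die "Cannot tie big.dat: $!";

    # Insert one record before (0-based) line 2_000_000. Tie::File has to
    # shift every later line down, so on a 4-million-line file a single
    # insert rewrites roughly half the file.
    splice @lines, 2_000_000, 0, 'ID2 new data here';

    untie @lines;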


    > eg:
    > starting with data file:
    > Line: order read:
    > load1 1
    > ID1 a b c 2
    > ID2 d e f 3
    > ID3 g h i 4
    > load2 5
    > ID1 j k l 6
    > ID2 m n o 7
    > ID3 p q r 8
    >
    > temp file becomes:
    > Line: order wrote:
    > ID1 1
    > load1 a b c 2
    > load2 j k l 7
    > ID2 3
    > load1 d e f 4
    > load2 m n o 8
    > ID3 5
    > load1 g h i 6
    > load2 p q r 9
    >
    > any suggestions are much appreciated.


    If you are doing what I think you are doing, then I would suggest
    a perl script to convert the input file to something like:

    ID1 load1 a b c
    ID2 load1 d e f
    ID3 load1 g h i
    ID1 load2 j k l
    ID2 load2 m n o
    ID3 load2 p q r

    Then use your OS's sort program to sort by ID so that equal IDs are
    grouped together, and another perl program to process that file into
    what you want.
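    For what it's worth, a minimal sketch of that first conversion pass,
    assuming the layout in your example (a loadN line starts a block, and
    every IDn line below it belongs to that block); the file names below
    are made up:

    use strict;
    use warnings;

    my $load;    # which loadN block we are currently inside

    while (my $line = <>) {
        chomp $line;
        if ($line =~ /^(load\S+)/) {
            $load = $1;               # remember the block label
        } elsif ($line =~ /^(ID\S+)\s+(.+)/) {
            print "$1 $load $2\n";    # one sortable record per line
        }
    }

    Run it as "perl convert.pl input.dat > unsorted.dat", sort
    unsorted.dat, and equal IDs come out adjacent.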

    Alternatively, you could make one temp file for each different ID value,
    and then combine all of these temp files together at the end.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Jun 1, 2005
    #2

  3. scottmf wrote:

    > I am parsing very large data files (4 million lines or more) to reorder
    > the data and eliminate unnecessary information. Unfortunately because
    > of how the file is arranged I have to read the entire file before
    > processing the data. Currently everything is written to 2-d arrays and
    > takes about 3Gb of memory to process. I would like to start using a
    > temp file so that machines with less memory can still complete the
    > process, but in order to do so I need to be able to append data to the
    > middle of the file.
    >
    > eg:
    > starting with data file:
    > Line: order read:
    > load1 1
    > ID1 a b c 2
    > ID2 d e f 3
    > ID3 g h i 4
    > load2 5
    > ID1 j k l 6
    > ID2 m n o 7
    > ID3 p q r 8
    >
    > temp file becomes:
    > Line: order wrote:
    > ID1 1
    > load1 a b c 2
    > load2 j k l 7
    > ID2 3
    > load1 d e f 4
    > load2 m n o 8
    > ID3 5
    > load1 g h i 6
    > load2 p q r 9
    >
    >
    > any suggestions are much appreciated.


    Xho has already answered this, but I would be tempted to throw the data
    into an RDBMS of some description and let that do the hard work, though
    this complicates matters and requires access to e.g. MySQL and knowledge
    of SQL. Process line-by-line in Perl to get the data into the db, then
    pull it out again as required. Drop indexes before insertion and rebuild
    them afterwards.
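    For what it's worth, a minimal sketch of that approach using DBI with
    DBD::SQLite (no server to install; the table and column names are made
    up, and it assumes each input line has already been reduced to id, load,
    and values):

    use strict;
    use warnings;
    use DBI;

    # One file-based database; AutoCommit off so the bulk insert is a
    # single transaction.
    my $dbh = DBI->connect('dbi:SQLite:dbname=forces.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });

    $dbh->do('CREATE TABLE forces (id TEXT, load TEXT, vals TEXT)');

    my $sth = $dbh->prepare('INSERT INTO forces (id, load, vals) VALUES (?, ?, ?)');
    while (my $line = <>) {
        my ($id, $load, @vals) = split ' ', $line;
        $sth->execute($id, $load, "@vals");
    }
    $dbh->commit;

    # Build the index only after the bulk insert, as suggested above.
    $dbh->do('CREATE INDEX forces_id ON forces (id)');
    $dbh->commit;
    $dbh->disconnect;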

    Mark
     
    Mark Clements, Jun 1, 2005
    #3
  4. scottmf

    scottmf Guest

    > I have no idea what this means. "written to" implies disk, while
    > "2-d arrays" suggests perl's in-memory data structures.


    I meant the information was stored in 2-d arrays (in-memory).

    Thanks for the ideas.

    ~Scott
     
    scottmf, Jun 1, 2005
    #4
  5. scottmf

    scottmf Guest

    Because of other file formats I also have to be able to parse, and the
    fact that I am using Windows XP (I don't know of any sort programs that
    come with Windows), using one temp file for each ID value seems much
    easier. In that case, is there any way I can automatically generate the
    filehandle from the ID value? I.e., given ID1, ID2, and ID3, can I
    automatically do:
    my $fh_ID1 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
    my $fh_ID2 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
    my $fh_ID3 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");

    ~Scott
     
    scottmf, Jun 1, 2005
    #5
  6. scottmf

    Guest

    "scottmf" <> wrote:
    > Because of other file formats I also have to be able to parse, and the
    > fact that I am using Windows XP (I don't know of any sort programs that
    > come with Windows), using one temp file for each ID value seems much
    > easier. In that case, is there any way I can automatically generate the
    > filehandle from the ID value? I.e., given ID1, ID2, and ID3, can I
    > automatically do:
    > my $fh_ID1 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
    > my $fh_ID2 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
    > my $fh_ID3 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");


    You should hold the file handles in an array or a hash.
    I'd probably do something like this:

    my %fh;    ## holds hash (by ID) of filehandles

    while (<INPUT_DATA>) {
        ## some stuff which sets $id and $to_print
        unless (exists $fh{$id}) {
            $fh{$id} = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
        }
        print {$fh{$id}} $to_print;
    }

    Except probably I'd control the naming of the files myself, rather than
    letting a module do it, because I would probably want to control the order
    in which they are combined when I'm done.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Jun 1, 2005
    #6
  7. scottmf wrote:

    > my $fh_ID1 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
    > my $fh_ID2 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
    > my $fh_ID3 = tempfile($template, DIR => $tempdir, SUFFIX => ".dat");


    Always remember: if you are doing it three times, you are probably doing
    it wrong.

    Use a loop and put the filehandles in an array (or more likely a hash).
     
    Brian McCauley, Jun 1, 2005
    #7
  8. wrote:

    > unless (exists $fh{$id}) {
    > $fh{$id}=tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
    > };


    There is no need for exists(), so this is more simply:

    $fh{$id} ||= tempfile($template, DIR => $tempdir, SUFFIX => ".dat");
     
    Brian McCauley, Jun 1, 2005
    #8
  9. scottmf

    scottmf Guest

    Thanks for the help, that does exactly what I needed.

    As far as naming the files: since I want the final output to basically
    be the contents of all of the temp files sorted by the ID value, which I
    can do using something similar to:

    my ($key, $line);
    open(OUTPUT, ">>", "sorted.dat");
    foreach $key (sort keys %fh) {
        while($line = {$fh{$key}}) {
            print OUTPUT $line;
        }
    }

    is there any reason to control the naming of the files (there will be
    several hundred) myself?
    ~Scott
     
    scottmf, Jun 1, 2005
    #9
  10. scottmf

    Guest

    XP does have a sort program. Just holla at your command line. As for the
    problem...

    So all you want to do is eliminate redundant data? The meat of the
    problem is that this redundant data is scattered everywhere (in the
    file, that is), and you can't read that data all at once to sort it out
    on smaller machines, correct?

    OK, well then, let's see here... let's start with what's already built
    for us, the sort program. Doing sort /? shows this key chunk of info
    that kind of already tackles your problem:

    By default the sort will be done with one pass (no temporary file) if
    it fits in the default maximum memory size, otherwise the sort will be
    done in two passes (with the partially sorted data being stored in a
    temporary file) such that the amounts of memory used for both the sort
    and merge passes are equal. The default maximum memory size is 90% of
    available main memory if both the input and output are files, and 45%
    of main memory otherwise.


    Now it says 2 passes, although a large amount of data on a small-memory
    system might seem like it could take more than 2 passes. So we'll save
    plan B for later. But for now just as(s)ume it works, use sort to do the
    "hard" work for you, and work from its sorted output file.

    Then, in Perl, create your master output file for the data you're about
    to parse. Read in all the "A" data (i.e. the first set of in-order data)
    and eliminate the doubles as usual. After this is done, write the
    results to the master file. Then read the "B" data and do the same.

    As for plan B, let me know how this goes, or if I missed anything in the
    scope of the problem, or if some other thugged-out poster thinks this
    idea sucks, then I will work on my carpal a little more. Also, as a
    side note, you can use the split program to chop up your files into
    chunks. But for Windows you gotta get Cygwin or coLinux or something.
    I'm sure there is a Win32 version too, maybe.
     
    , Jun 1, 2005
    #10
  11. scottmf

    Guest

    > Thanks for the help, that does exactly what I needed.

    Oh well, I guess mine got in a little late.
     
    , Jun 1, 2005
    #11
  12. scottmf

    scottmf Guest

    It looks like this would work, although since the files I'm being asked
    to parse keep getting larger (just got one that is 1.7 GB!!!) I think
    splitting the data into many temp files will be more stable for future
    versions. I always appreciate having more than one way to solve a
    problem, though, so thanks for the reply.

    ~Scott
     
    scottmf, Jun 1, 2005
    #12
  13. scottmf

    Guest

    Oh yeah, what purposes are you using these files for? And why are
    they getting so large? Are these files all text? I would drop the
    DBomb idea on them. If they don't like that, then tell them you have
    this great new way to index large amounts of data in binary files for
    fast retrieval. Then ask for a raise to 5 cents per ASCII char per file.
     
    , Jun 1, 2005
    #13
  14. scottmf

    Guest

    "scottmf" <> wrote:
    > Thanks for the help, that does exactly what I needed.
    >
    > As far as naming the files: since I want the final output to basically
    > be the contents of all of the temp files sorted by the ID value, which I
    > can do using something similar to:
    >
    > my ($key, $line);


    Don't declare them there; declare them in the smallest scope.

    > open(OUTPUT, ">>", "sorted.dat");
    > foreach $key (sort keys %fh) {

    foreach my $key (sort keys %fh) {

    Does the tempfile routine you used return a handle that is open for
    both reading and writing? If so, you probably still need to rewind
    the file pointer before you start reading. Something like:

    seek $fh{$key}, 0, 0;    ## test for failure? Does it work on Windows?


    > while($line = {$fh{$key}}) {


    You probably want angle rather than curly brackets there, but that
    still won't work, because angle brackets require a simple scalar, not
    a hash element. Use readline instead:

    while (my $line = readline($fh{$key})) {

    > print OUTPUT $line;
    > }
    > }
    >
    > is there any reason to control the naming of the files


    During the combine stage, I would just reopen the files for reading rather
    than messing around with "seek" and making sure the originals were
    read/write. As long as you don't mind messing around with seek and
    read/write, then there is no reason to control the naming. Well, maybe
    one: do your tempfiles disappear once their handles are closed? If so,
    then what happens if your program bombs out during the last stage? All of
    your computer's (potentially hours of) work in making those files would be
    lost. If you named them by hand, it would be a simple matter to restart at
    the combine stage. It is a trade-off between recoverability and leaving a
    mess behind.
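    For what it's worth, a sketch of that combine stage with self-named
    files (it assumes one "$id.dat" file per ID in a known directory, as in
    the writer loop upthread, so a crashed run can restart here; the output
    file name is made up):

    # Close the write handles first so everything is flushed to disk,
    # then reopen each per-ID file read-only instead of seek()ing on
    # read/write handles.
    close $_ for values %fh;

    open my $out, '>', 'sorted.dat' or die "sorted.dat: $!";
    for my $id (sort keys %fh) {
        open my $in, '<', "$tempdir/$id.dat" or die "$id.dat: $!";
        print {$out} $_ while <$in>;
        close $in;
    }
    close $out;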

    > (there will be
    > several hundred) myself?


    With several hundred, you might run into problems with limits on the number
    of open file handles.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Jun 1, 2005
    #14
  15. scottmf

    scottmf Guest

    The files contain information from finite element analysis (all single
    precision floats); basically all the forces each element is exposed to
    for all different cases. Thousands of cases * thousands of elements =
    Very large files. They can also be formatted two different ways, which
    makes my job even more of a pain:

    format 1:
    case #1
    element1 forcex forcey forcexy
    element2 forcex forcey forcexy
    case #2
    element1 forcex forcey forcexy
    element2 forcex forcey forcexy

    format 2:
    case #1 x
    element1 forcex
    element2 forcex
    case #1 y
    element1 forcey
    element2 forcey
    case #1 xy
    element1 forcexy
    element2 forcexy
    case #2 x
    element1 forcex
    element2 forcex
    case #2 y
    element1 forcey
    element2 forcey
    case #2 xy
    element1 forcexy
    element2 forcexy
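    For what it's worth, a minimal sketch that normalizes both layouts to
    one "element case fx fy fxy" line per record (the field names are mine,
    and it keeps everything in memory for brevity; for the real multi-GB
    files you would stream the records out to the per-ID temp files
    instead):

    use strict;
    use warnings;

    my %force;    # $force{element}{case}{component} = value
    my ($case, $component);

    while (my $line = <>) {
        if ($line =~ /^case\s+(\S+)(?:\s+(x|y|xy))?\s*$/) {
            ($case, $component) = ($1, $2);    # $component undef => format 1
        } elsif ($line =~ /^(element\S+)\s+(.+)/) {
            my ($elem, @vals) = ($1, split ' ', $2);
            if (defined $component) {          # format 2: one component per block
                $force{$elem}{$case}{$component} = $vals[0];
            } else {                           # format 1: x, y, and xy on one line
                @{ $force{$elem}{$case} }{qw(x y xy)} = @vals;
            }
        }
    }

    for my $elem (sort keys %force) {
        for my $c (sort keys %{ $force{$elem} }) {
            print join(' ', $elem, $c, @{ $force{$elem}{$c} }{qw(x y xy)}), "\n";
        }
    }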

    Unfortunately there is no way to get them to change how the data is
    saved.
    Before I started writing my code they were entering the data into Excel
    by hand!!
     
    scottmf, Jun 1, 2005
    #15
  16. scottmf

    scottmf Guest

    Is there an easy way to find out the max number of open filehandles? I
    know that with Perl v5.8 the FileCache module can be used to manage
    the number of simultaneous open filehandles, but I cannot get that for
    several weeks. I tried the following, and if I increase the max value of
    $i one at a time I can tell when the creation fails, but if I just set
    the max very high I get the following error:

    use strict;
    use Carp::Heavy;
    use File::Temp qw(tempfile tempdir);

    my $tempdir = tempdir(CLEANUP => 1);
    my %fh;

    for (my $i = 1; $i <= 550; ++$i) {
        $fh{$i} = tempfile() or die "Could not create filehandle #$i\n";
    }

    returns:
    Error in tempfile() using C:\DOCUME~1\user\LOCALS~1\Temp\XXXXXXXXXX:
    Could not create temp file C:\DOCUME~1\user\LOCALS~1\Temp\QSksT8zrbh:
    Too many open files at file_handles.pl line 12

    rather than the message following the or die, which would have contained
    the max number of file handles.
     
    scottmf, Jun 2, 2005
    #16
  17. scottmf

    Guest

    "scottmf" <> wrote:
    > Is there an easy way to find out the max number of open filehandles? I
    > know that with Perl v5.8 the FileCache module can be used to manage
    > the number of simultaneous open filehandles, but I cannot get that for
    > several weeks.


    You can do a fairly decent job yourself with something like:

    unless (exists $fh{$id}) {
        %fh = () if keys %fh >= $max_open_handles;
        open $fh{$id}, '>>', "/tmp/foo/$id.dat" or die $!;
    }

    (Of course, it requires you to manage the naming yourself, so that
    you can open for appending to the correct file.)



    > I tried the following, and if I increase the max value of
    > $i one at a time I can tell when the creation fails, but if I just set
    > the max very high I get the following error:
    >
    > use strict;
    > use Carp::Heavy;
    > use File::Temp qw(tempfile tempdir);
    > my $tempdir = tempdir(CLEANUP => 1);
    > my %fh;
    >
    > for (my $i = 1; $i <=550; ++$i)
    > {
    > $fh{$i} = tempfile() or die "Could not create filehandle #$i\n";


    tempfile() croaks on failure rather than returning false (that's why you
    see its own error message instead of yours), so your "or die" never gets
    a chance to run. Wrap the call in an eval instead:

    eval { $fh{$i} = tempfile() }
        or die "Could not create filehandle #$i\n";


    > };


    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Jun 2, 2005
    #17
