Binary file manipulation

Discussion in 'Perl Misc' started by Monty, Apr 11, 2006.

  1. Monty

    Monty Guest

    I have very large image files that I need to search for consecutive
    values of zero. The files are around 800 MB in size and don't lend
    themselves to being loaded into memory for manipulation.

    I thought Tie::File might do the trick as it ties an array directly to
    a file, but it's written to expect some sort of end-of-line marker,
    whereas none of my data has that. Tie::File also won't let me set the
    end-of-line marker to empty or null, so I can't use that module.
    According to the Tie::File manpage, there doesn't seem to be a way of
    connecting an array to a file without these EOL markers, and I didn't
    see any options for binary files in the documentation.

    Can someone recommend a method for parsing through this much data,
    array style, that would let me compare values as though they were
    adjacent members of a two-dimensional array?

    Thanks
     
    Monty, Apr 11, 2006
    #1

  2. "Monty" <> wrote:

    > I have very large image files that I need to search for consecutive
    > values of zero. The files are around 800 MB in size and don't lend
    > themselves to being loaded into memory for manipulation.


    There is no need to load the whole file into memory.

    Use sysread to read in chunks, then find consecutive zero bytes.

    perldoc -f sysread
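
    A minimal sketch of the chunked approach, assuming one byte per value.
    The function name find_zero_runs, the 1 MiB default chunk size, and the
    [offset, length] return format are illustrative choices, not anything
    specified in this thread. A run of zeros that straddles a chunk
    boundary is carried over to the next chunk rather than being split.

```perl
use strict;
use warnings;

# Scan a binary file in fixed-size chunks and report every run of
# consecutive zero bytes that is at least $min_len bytes long.
# Returns a list of [offset, length] pairs (hypothetical format).
sub find_zero_runs {
    my ($path, $min_len, $chunk_size) = @_;
    $chunk_size ||= 1 << 20;    # 1 MiB per sysread call (arbitrary default)

    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode $fh;

    my @runs;
    my ($run_start, $run_len) = (0, 0);   # run of zeros currently open
    my $base = 0;                         # file offset of the current chunk

    while (1) {
        my $got = sysread($fh, my $buf, $chunk_size);
        die "Read error on $path: $!" unless defined $got;
        last if $got == 0;                # end of file

        # Walk the chunk as alternating segments of zero and non-zero bytes.
        while ($buf =~ /\G(\0+|[^\0]+)/g) {
            if (substr($1, 0, 1) eq "\0") {
                # A zero segment either opens a run or extends one that
                # was still open at the end of the previous chunk.
                $run_start = $base + $-[1] if $run_len == 0;
                $run_len  += length $1;
            }
            else {
                # A non-zero segment closes any open run.
                push @runs, [$run_start, $run_len] if $run_len >= $min_len;
                $run_len = 0;
            }
        }
        $base += $got;
    }

    # A run that reaches the end of the file is still open here.
    push @runs, [$run_start, $run_len] if $run_len >= $min_len;
    close $fh;
    return @runs;
}
```

    This finds runs along the byte stream only. Whether a run wraps from
    the end of one image row into the next depends on the row width, which
    the function does not know about.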

    If you make an attempt, we will be able to help you better.

    Please do read the posting guidelines for this group.

    Sinan
    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Apr 11, 2006
    #2

  3. Monty

    Guest

    "Monty" <> wrote:
    > I have very large image files that I need to search for consecutive
    > values of zero. The files are around 800 MB in size and don't lend
    > themselves to being loaded into memory for manipulation.


    Why don't they lend themselves to that? Because you don't have 800 MB of
    memory (plus overhead) to spare, or for some other reason?

    > I thought Tie::File might do the trick as it ties an array directly to
    > a file, but it's written to expect some sort of end-of-line marker,
    > whereas none of my data has that.


    Yep. That is inherently what Tie::File does. The whole module is centered
    around line-oriented, variable line length files.

    > Tie::File also won't let me set the
    > end-of-line marker to empty or null, so I can't use that module.
    > According to the Tie::File manpage, there doesn't seem to be a way of
    > connecting an array to a file without these EOL markers, and I didn't
    > see any options for binary files in the documentation.


    Tie::File is not the only tying module. I don't know of a tying module,
    off the top of my head, that would serve your purposes, but if you look
    under the Tie::* hierarchy on CPAN you might get something. But it seems
    so easy to implement what you want with seek and read, that I wouldn't
    spend much time searching around for a ready-made module.
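
    As a rough illustration of the seek-and-read idea, the sketch below
    treats the file as a two-dimensional array of bytes. It assumes one
    byte per pixel, rows stored one after another, and no header; the name
    pixel_at and its argument order are made up for this example.

```perl
use strict;
use warnings;

# Return the numeric value (0..255) of the byte at ($row, $col), or undef
# if that position lies past the end of the file. $width is the number of
# bytes in one image row.
sub pixel_at {
    my ($fh, $width, $row, $col) = @_;
    seek($fh, $row * $width + $col, 0)    # 0 = offset from start of file
        or die "seek failed: $!";
    my $got = read($fh, my $byte, 1);
    die "read failed: $!" unless defined $got;
    return $got == 1 ? ord($byte) : undef;
}
```

    One seek and one read per pixel is slow for a full scan of 800 MB, so
    this is better suited to spot checks than to searching the whole image.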

    But really, this problem seems to just be begging for C, rather than
    Perl.

    >
    > Can someone recommend a method for parsing through this much data,
    > array style, that would let me compare values as though they were
    > adjacent members of a two-dimensional array?


    What does "adjacent" mean to you in a two-dimensional array?

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Apr 11, 2006
    #3
  4. Monty

    Guest

    Monty wrote:
    >
    > I have very large image files that I need to search for consecutive
    > values of zero. The files are around 800 MB in size and don't lend
    > themselves to being loaded into memory for manipulation.


    If your binary file is just a list of fixed-length records, you can
    loop through the records one at a time (provided that you know the
    length of the records) by setting the $/ variable, like this:

    open(IN, "<file.binary") or die $!;
    binmode(IN);
    $/ = \1020;  # each record is 1020 bytes long
    # Loop through the file, 1020 bytes at a time:
    while (<IN>)
    {
        # The binary record is now in $_
    }
    close(IN);

    If, by chance, you know the pack-string that corresponds to these
    records (assuming they are fixed-length records), you can easily view
    the data inside the file, like this:

    open(IN, "<file.binary") or die $!;
    binmode(IN);
    # Set the $packString for each record:
    my $packString = "i4 d2 Z12 Z256";
    # Use the $packString to find the length of each record:
    $/ = \(length(pack($packString)));
    # Loop through the file, one record at a time:
    while (<IN>)
    {
        # The binary record is now in $_
        my @values = unpack($packString, $_);
        print "*** Record found:\n @values\n";
    }
    close(IN);

    A few notes to keep in mind:

    1. Since you are dealing with binary data, you really want to call
    binmode() on your filehandle. Not doing so prevents your code from
    being portable and may create some hard-to-find bugs.

    2. Since this method uses the pack() and unpack() functions, you
    really want to turn on warnings and strictures (with "use warnings;"
    and "use strict;" near the top of your file). Not doing so will make
    simple bugs extremely difficult to find. If you use them, they will
    often point out the exact line number where an error occurs right away
    (eliminating the need for you to hunt down the error's exact spot).

    3. If you are not familiar with pack(), unpack(), and how to compose
    pack strings, I encourage you to read the perldocs by typing "perldoc
    -f pack" at any prompt.

    4. If you are not familiar with the $/ variable, look it up in
    "perldoc perlvar".


    > Can someone recommend a method for parsing through this much data,
    > array style, that would let me compare values as though they were
    > adjacent members of a two-dimensional array?


    I'm not quite sure what you mean here, but if your files contain
    fixed length records, you would probably benefit by using the $/
    variable. And if you know the data-types of the fixed-length records,
    the pack() and unpack() functions are extremely useful.

    I hope this helps, Monty.

    -- Jean-Luc Romano
     
    , Apr 11, 2006
    #4
  5. Monty wrote:
    > I have very large image files that I need to search for consecutive
    > values of zero. The files are around 800 MB in size and don't lend
    > themselves to being loaded into memory for manipulation.


    Use the Mmap module http://search.cpan.org/~micb/Mmap-a2/ or the Sys::Mmap
    module http://search.cpan.org/~swalters/Sys-Mmap-0.13/ to access the file as a
    scalar and use index() to search for the consecutive values of zero.
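
    The index() part of this suggestion can be sketched on any scalar,
    whether it was filled by a memory map or by an ordinary read. The
    helper name zero_runs_in_buffer and its [offset, length] return format
    are assumptions made for this example; the buffer is passed by
    reference so that a large scalar is not copied.

```perl
use strict;
use warnings;

# Find every maximal run of at least $min_len zero bytes in the string
# referenced by $buf_ref. Returns a list of [offset, length] pairs.
sub zero_runs_in_buffer {
    my ($buf_ref, $min_len) = @_;
    my $needle  = "\0" x $min_len;
    my $buf_len = length $$buf_ref;
    my @runs;
    my $pos = 0;

    while ((my $start = index($$buf_ref, $needle, $pos)) >= 0) {
        # index() found the first $min_len zeros; extend to the end of the run.
        my $end = $start + $min_len;
        $end++ while $end < $buf_len && substr($$buf_ref, $end, 1) eq "\0";
        push @runs, [$start, $end - $start];
        $pos = $end;                      # resume after this run
    }
    return @runs;
}
```

    With a memory-mapped file, a reference to the mapped variable would
    take the place of the in-memory string.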


    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Apr 12, 2006
    #5
  6. Monty

    Monty Guest

    First off, thanks to the respondents. Secondly, I may have posted
    erroneously here or somehow committed a breach of etiquette, for which I
    apologize. I've read what I could find on guidelines for posting and
    thought I was within parameters.

    I left out a few things maybe I should have mentioned: the file has no
    individual records--it's one, huge, 800 MB file of image data that gets
    displayed in a rectangular format. As bytes get read in they're used
    to populate an image of predetermined size in both width and height.

    To A. Sinan Unur: I contemplated using sysread, but I'm not experienced
    enough with programming, and I think I may miss adjacent bytes of data
    across different chunks of the file. For instance, if I find a data
    hole (value 0) in a byte at the end of a chunk, I would need to somehow
    remember that so that in the ensuing chunk I can look for another hole
    that would coincide with being just 'below' the value I found in the
    previous chunk. While this isn't completely out of the question, I
    thought there might be another way. Also, I'll check the guidelines you
    listed.

    To Xho: Our system has 32 GB RAM, most of which often goes unused. I
    haven't quite figured out how to up the individual user memory
    allocation limit, but it still seems like there should be a better way,
    one that's not dependent on having a slew of memory to toss around.
    Also, 'adjacent' in a two dimensional array would be those bytes that
    are next to each other in the same row, or directly above or below each
    other in the same column.

    To Jean-Luc Romano: thanks, but there are no individual records in
    this file.

    To John: I'll check those links out. They sound promising.

    Thanks again all!
     
    Monty, Apr 12, 2006
    #6
  7. Monty

    Dr.Ruud Guest

    Monty schreef:

    > one, huge, 800 MB file of image data that
    > gets displayed in a rectangular format. As bytes get read in they're
    > used to populate an image of predetermined size in both width and
    > height. [...]
    > 'adjacent' in a two dimensional array would be those
    > bytes that are next to each other in the same row, or directly above
    > or below each other in the same column.


    You are (almost) implying there that there is 1 byte per pixel (or per
    pixel.color or pixel.channel or pixel.layer, etc.).

    Have you checked GD yet?
    http://search.cpan.org/~lds/GD/

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Apr 12, 2006
    #7
  8. Monty

    Guest

    "Monty" <> wrote:
    > First off, thanks to the respondents. Secondly, I may have posted
    > erroneously here or somehow committed a breach of etiquette, for which I
    > apologize. I've read what I could find on guidelines for posting and
    > thought I was within parameters.
    >
    > I left out a few things maybe I should have mentioned: the file has no
    > individual records--it's one, huge, 800 MB file of image data that gets
    > displayed in a rectangular format. As bytes get read in they're used
    > to populate an image of predetermined size in both width and height.


    There is actually a hierarchy of records. You have one record type,
    containing some fixed number of bytes, which represents a row (or is it a
    column?) in the image. Within that, you have another record type,
    presumably one byte (or is it one bit? Or something else?) representing a
    pixel within that row in the image.


    >
    > To A. Sinan Unur: I contemplated using sysread, but I'm not experienced
    > enough with programming, and I think I may miss adjacent bytes of data
    > across different chunks of the file.


    I would use read rather than sysread and make each chunk exactly equal
    to one row of the image (or just set $/ to a reference to the
    number of bytes in a row, then use <$fh>).

    > For instance, if I find a data
    > hole (value 0) in a byte at the end of a chunk, I would need to somehow
    > remember that so that in the ensuing chunk I can look for another hole
    > that would coincide with being just 'below' the value I found in the
    > previous chunk.


    If the next chunk was the start of the next image-row, then you wouldn't
    need to worry about it. (assuming your space is flat, like a chessboard.
    If it is really a torus represented as a flat image, like a pac-man screen,
    then that is different.)


    > While this isn't completely out of the question, I
    > thought there might be another way. Also, I'll check the guidelines you
    > listed.
    >
    > To Xho: Our system has 32 GB RAM, most of which often goes unused. I
    > haven't quite figured out how to up the individual user memory
    > allocation limit, but it still seems like there should be a better way,
    > one that's not dependent on having a slew of memory to toss around.
    > Also, 'adjacent' in a two dimensional array would be those bytes that
    > are next to each other in the same row, or directly above or below each
    > other in the same column.


    So at any one time you only need two rows worth of bytes in memory.

    binmode $fh;
    $/ = \$how_ever_many_bytes_in_a_row;

    my $old_row;
    while (<$fh>) {
        find_adjacent_in_row($_);
        find_adjacent_between_rows($old_row, $_) if defined $old_row;
        $old_row = $_;
    }
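
    The two helpers in that loop are left undefined. One possible way to
    fill them in, assuming one byte per pixel and rows of equal length, is
    sketched below. Returning lists of column numbers is a choice made
    here, not something the skeleton above prescribes, and the caller
    would need to collect those return values.

```perl
use strict;
use warnings;

# Columns c where both byte c and byte c+1 of the row are zero.
sub find_adjacent_in_row {
    my ($row) = @_;
    my @cols;
    # The lookahead leaves the second zero unconsumed, so overlapping
    # pairs (for example three zeros in a row) are all reported.
    push @cols, $-[0] while $row =~ /\0(?=\0)/g;
    return @cols;
}

# Columns where the byte is zero in both the previous and the current row.
sub find_adjacent_between_rows {
    my ($old_row, $row) = @_;
    my @cols;
    while ($row =~ /\0/g) {
        my $col = $-[0];
        push @cols, $col
            if $col < length($old_row) && substr($old_row, $col, 1) eq "\0";
    }
    return @cols;
}
```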

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Apr 12, 2006
    #8
  9. Monty

    Monty Guest

    I see what you're saying, but the number of adjacent rows containing a
    zero value in a particular column could be very large (minimum of 10
    vertically or horizontally adjacent bytes is considered a hole, and
    often becomes 100 or more).

    Secondly, are you saying (with $/=\$how_ever_many_bytes_in_a_row) that
    the end-of-line delimiter can be set to a particular number of bytes
    instead of an actual end-of-line value?
     
    Monty, Apr 12, 2006
    #9
  10. Monty

    Guest Guest

    Monty <> wrote:
    : I see what you're saying, but the number of adjacent rows containing a
    : zero value in a particular column could be very large (minimum of 10
    : vertically or horizontally adjacent bytes is considered a hole, and
    : often becomes 100 or more).

    100 lines is still easy to handle, you can push them into a "gliding stack"
    (FIFO, first in, first out) realized via an array. The array combines well
    with the map function (perldoc -f map), you can perform checks of virtually
    arbitrary complexity on any desired number of rows (simply by indicating
    the size of the list passed to map).
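
    A sketch of such a gliding window, under the assumption that every row
    has the same length and holds one byte per pixel. The closure-based
    interface (make_window_checker) is invented for this example. Once the
    window holds $depth rows, each call reports the columns that are zero
    in all of them, that is, vertical runs of at least $depth zeros ending
    at the newest row.

```perl
use strict;
use warnings;

# Build a checker that remembers the most recent $depth rows in a FIFO.
sub make_window_checker {
    my ($depth) = @_;
    my @window;                           # oldest row first

    return sub {
        my ($row) = @_;
        push @window, $row;
        shift @window while @window > $depth;   # drop rows that fell out
        return () if @window < $depth;          # window not full yet

        my @cols;
        # Only columns that are zero in the newest row can qualify.
        while ($row =~ /\0/g) {
            my $col   = $-[0];
            my @bytes = map { substr($_, $col, 1) } @window;
            push @cols, $col unless grep { $_ ne "\0" } @bytes;
        }
        return @cols;
    };
}
```

    The checker would be called once per row inside the read loop, for
    example `my @hits = $check->($row);` with `$check` built by
    `make_window_checker(10)`.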

    : Secondly, are you saying (with $/=\$how_ever_many_bytes_in_a_row) that
    : the end-of-line delimiter can be set to a particular number of bytes
    : instead of an actual end-of-line value?

    Yes, Perl allows for reading fixed-length "records" without any visible
    eol character. Check read and sysread. If you have any knowledge of your
    data _before_ you run your program, you can hard-code the record length
    into your program, but you can also set the record length dynamically,
    e.g. by reading specific bytes from your file.
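
    To illustrate setting the record length at run time, the sketch below
    assumes a made-up layout in which the first four bytes of the file
    hold the row width as a 32-bit big-endian integer. Nothing in this
    thread says the image files have such a header; the layout and the
    name for_each_row are purely hypothetical.

```perl
use strict;
use warnings;

# Read the row width from a 4-byte header, then pass each fixed-length
# row to $callback. Returns the row width that was read.
sub for_each_row {
    my ($fh, $callback) = @_;
    binmode $fh;

    my $got = read($fh, my $header, 4);
    die "Could not read 4-byte header" unless defined $got && $got == 4;
    my $width = unpack('N', $header);     # 32-bit big-endian unsigned
    die "Row width must be positive" unless $width > 0;

    local $/ = \$width;                   # read $width bytes per record
    while (defined(my $row = <$fh>)) {
        $callback->($row);                # last row may be short at EOF
    }
    return $width;
}
```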

    Oliver.

    --
    Dr. Oliver Corff e-mail: -berlin.de
     
    Guest, Apr 12, 2006
    #10
  11. Monty

    Anno Siegel Guest

    <> wrote in comp.lang.perl.misc:
    > "Monty" <> wrote:


    [finding adjacent zeros in a matrix of bytes]

    > So at any one time you only need two rows worth of bytes in memory.
    >
    > binmode $fh;
    > $/=\$how_ever_many_bytes_in_a_row;
    >
    > my $old_row;
    > while (<$fh>) {
    > find_adjacent_in_row($_);
    > find_adjacent_between_rows($old_row,$_) if defined $old_row;
    > $old_row=$_;
    > };


    If the majority of bytes are non-zero (as the term "hole" might suggest)
    one could simply record the positions of the zeros.

    my %by_lines;
    while ( <$fh> ) {
        next unless /\0/; # don't record lines without zeros
        push @{ $by_lines{ $.} }, $-[ 0] while /\0/g;
    }

    Finding chains of adjacent zeros in a line means searching a (sorted)
    list of integers for runs of consecutive integers. That's not hard
    to do.
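
    One way to do that search, as a sketch: walk the sorted list once and
    keep track of the current run. The function name consecutive_runs, the
    minimum-length parameter, and the [first, length] return format are
    assumptions added for this example.

```perl
use strict;
use warnings;

# Given a sorted list of distinct integers, return [first, length] for
# every run of consecutive values that is at least $min_len long.
sub consecutive_runs {
    my ($min_len, @sorted) = @_;
    my @runs;
    my ($first, $len);

    for my $n (@sorted) {
        if (defined $first && $n == $first + $len) {
            $len++;                       # $n extends the current run
            next;
        }
        # $n breaks the run: keep the old one if long enough, start anew.
        push @runs, [$first, $len] if defined $first && $len >= $min_len;
        ($first, $len) = ($n, 1);
    }
    push @runs, [$first, $len] if defined $first && $len >= $min_len;
    return @runs;
}
```

    Applied to one entry of the hash of zero positions, a call such as
    `consecutive_runs(10, @positions)` would report the horizontal holes
    in that line.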

    To do the same thing for columns, "invert" the %by_lines hash

    my %by_columns;
    for my $li ( sort { $a <=> $b } keys %by_lines ) {
        push @{ $by_columns{ $_} }, $li for @{ $by_lines{ $li} };
    }

    Then apply the same procedure to find adjacent zeros in each column.

    Other patterns of zeros could be detected, but may involve using both
    tables at once.

    Anno
    --
    If you want to post a followup via groups.google.com, don't use
    the broken "Reply" link at the bottom of the article. Click on
    "show options" at the top of the article, then click on the
    "Reply" at the bottom of the article headers.
     
    Anno Siegel, Apr 12, 2006
    #11
  12. Monty

    Anno Siegel Guest

    <> wrote in comp.lang.perl.misc:
    > "Monty" <> wrote:


    [finding adjacent zeros in a matrix of bytes]

    > So at any one time you only need two rows worth of bytes in memory.
    >
    > binmode $fh;
    > $/=\$how_ever_many_bytes_in_a_row;
    >
    > my $old_row;
    > while (<$fh>) {
    > find_adjacent_in_row($_);
    > find_adjacent_between_rows($old_row,$_) if defined $old_row;
    > $old_row=$_;
    > };


    If the majority of bytes are non-zero (as the term "hole" might suggest)
    one could simply record the positions of the zeros.

    my %by_lines;
    while ( <$fh> ) {
        next unless /\0/; # don't record lines without zeros
        push @{ $by_lines{ $.} }, $-[ 0] while /\0/g;
    }

    Finding chains of adjacent zeros in a line means searching a (sorted)
    list of integers for runs of consecutive integers. That's not hard
    to do.

    To do the same thing for columns, "invert" the %by_lines hash

    my %by_columns;
    for my $li ( sort { $a <=> $b } keys %by_lines ) {
        push @{ $by_columns{ $_} }, $li for @{ $by_lines{ $li} };
    }

    Then apply the same procedure to find adjacent zeros in each column.

    Other patterns of zeros could be detected, but may involve using both
    tables at once. (Code untested)

    Anno
    --
    If you want to post a followup via groups.google.com, don't use
    the broken "Reply" link at the bottom of the article. Click on
    "show options" at the top of the article, then click on the
    "Reply" at the bottom of the article headers.

     
    Anno Siegel, Apr 12, 2006
    #12
  13. Monty

    Guest

    "Monty" <> wrote:
    > I see what you're saying,


    But I don't :)

    Please quote some of the text you are replying to. That way I can more
    easily see which part of what I said you are responding to.

    > but the number of adjacent rows containing a
    > zero value in a particular column could be very large (minimum of 10
    > vertically or horizontally adjacent bytes is considered a hole, and
    > often becomes 100 or more).


    Ah, this is different. I thought you meant a pair of adjacent things,
    not a whole run of them. What if it is a 9 by 9 square of zero values? No
    direction is a minimum of 10, yet overall there are 81 missing pixels.

    What other wrinkles are there that you haven't described yet? Once you
    find these runs or streaks (or blotches or squares or whatever), what are
    you going to do with them?

    >
    > Secondly, are you saying (with $/=\$how_ever_many_bytes_in_a_row) that
    > the end-of-line delimiter can be set to a particular number of bytes
    > instead of an actual end-of-line value?


    Yes

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Apr 12, 2006
    #13
  14. Monty

    Monty Guest

    I'm still researching the perl 'map' function...are you referring to
    Sys::Mmap?
     
    Monty, Apr 13, 2006
    #14
  15. Monty

    Monty Guest

    >> I see what you're saying,
    >
    > But I don't :)

    I meant in general, I got the drift of your advice.

    > I thought you meant a pair of adjacent things,
    > not a whole run of them.

    It starts with adjacent members :)

    > What if it is a 9 by 9 square of zero values?

    That may come later. For now, finding a horizontal or vertical run of
    zeroes will be a good start, and we've yet to decide what to do with
    them.
     
    Monty, Apr 13, 2006
    #15
  16. Monty

    Monty Guest

    To all:

    Please disregard my previous post. I'm still learning how these
    newsgroups and their protocols work.

    Many thanks for the good advice. Let's end this thread before I spend
    more time maintaining it instead of implementing these suggestions.
     
    Monty, Apr 13, 2006
    #16