search for hex characters in a binary file and remove them

Discussion in 'Perl Misc' started by venkateshwar D, Aug 18, 2009.

  1. Hi All,

    I need to look for a sequence of hex characters in a binary file and
    remove them. The binary file has the sequence 00 00 02 02 01 00
    somewhere in the file.
    The script should open the file, look for the sequence 00 00 02 02
    01 00 <18 variable bytes>, and remove those 6 + 18 = 24 bytes from the
    file. Can someone please help? I can open the binary file and buffer it
    byte by byte, but since the pattern can be anywhere in the file I don't
    know how to proceed.

    regards
    venkat
    venkateshwar D, Aug 18, 2009
    #1

  2. venkateshwar D

    Guest

    On Tue, 18 Aug 2009 07:14:34 -0700 (PDT), venkateshwar D <> wrote:

    >Hi All,
    >
    >I need to look for a sequence of hex characters in a binary file and
    >remove them. the binary file has 00 00 02 02 01 00 sequence somewhere
    >in the file.
    >The script should open the file and look for this sequence 00 00 02 02
    >01 00 <18 variable bytes> and remove the 18 + 6 = 24 bytes from the
    >file.can someone please help. I can open the binary file and buffer
    >byte by byte but since the pattern can be anywhere in the file i dont
    >know how to proceed
    >
    >regards
    >venkat


    Hex characters? Like [a-f0-9] ? Or integers?

    $sequence = " 00 00 02 02 01 00";

    open my $fin, '<:raw', 'filename.in' or die "can't open input file: $!";
    open my $fout, '>:raw', 'filename.out' or die "can't open output file: $!";
    {
        local $/;
        $buf = <$fin>;
        $buf =~ s/$sequence//;
        print $fout $buf;
    }
    close $fout;
    close $fin;

    -sln
    , Aug 18, 2009
    #2

  3. venkateshwar D

    Guest

    On Tue, 18 Aug 2009 07:36:32 -0700, wrote:

    >On Tue, 18 Aug 2009 07:14:34 -0700 (PDT), venkateshwar D <> wrote:
    >
    >>Hi All,
    >>
    >>I need to look for a sequence of hex characters in a binary file and
    >>remove them. the binary file has 00 00 02 02 01 00 sequence somewhere
    >>in the file.
    >>The script should open the file and look for this sequence 00 00 02 02
    >>01 00 <18 variable bytes> and remove the 18 + 6 = 24 bytes from the
    >>file.can someone please help. I can open the binary file and buffer
    >>byte by byte but since the pattern can be anywhere in the file i dont
    >>know how to proceed
    >>
    >>regards
    >>venkat

    >
    >Hex characters? Like [a-f0-9] ? Or integers?
    >
    >$sequence = " 00 00 02 02 01 00";
    >
    >open my $fin, '<:raw', 'filename.in' or die "can't open input file: $!";
    >open my $fout, '>:raw', 'filename.out' or die "can't open output file: $!";
    >{
    > local $/;
    > $buf = <$fin>;
    > $buf =~ s/$sequence//;

    ^^
    $buf =~ s/$sequence.{6}//gs;

    The 6 bytes after the sequence as well?
    -sln
    , Aug 18, 2009
    #3
  4. On Aug 18, 7:42 pm, wrote:
    > On Tue, 18 Aug 2009 07:36:32 -0700, wrote:
    > >On Tue, 18 Aug 2009 07:14:34 -0700 (PDT), venkateshwar D <> wrote:

    >
    > >>Hi All,

    >
    > >>I need to look for a sequence of hex characters in a binary file and
    > >>remove them. the binary file has 00 00 02 02 01 00 sequence somewhere
    > >>in the file.
    > >>The script should open the file and look for this sequence 00 00 02 02
    > >>01 00 <18 variable bytes> and remove the 18 + 6 = 24 bytes from the
    > >>file.can someone please help. I can open the binary file and buffer
    > >>byte by byte but since the pattern can be anywhere in the file i dont
    > >>know how to proceed

    >
    > >>regards
    > >>venkat

    >
    > >Hex characters? Like [a-f0-9] ? Or integers?

    >
    > >$sequence = " 00 00 02 02 01 00";

    >
    > >open my $fin, '<:raw', 'filename.in' or die "can't open input file: $!";
    > >open my $fout, '>:raw', 'filename.out' or die "can't open output file: $!";
    > >{
    > >   local $/;
    > >   $buf = <$fin>;
    > >   $buf =~ s/$sequence//;

    >
    >                   ^^
    >    $buf =~ s/$sequence.{6}//gs;
    >
    > The 6 bytes after the sequence as well?
    > -sln


    Hi

    Thanks a lot. This does not seem to be working; it just does a binary
    file copy.

    I want to search for that pattern in the binary file (000002020100)
    (it is a hex character file) and remove this pattern plus the next 18
    bytes in the file.
    Thanks,
    venkat
    venkateshwar D, Aug 18, 2009
    #4
  5. venkateshwar D

    Guest

    On Tue, 18 Aug 2009 08:34:49 -0700 (PDT), venkateshwar D <> wrote:

    >On Aug 18, 7:42 pm, wrote:
    >> On Tue, 18 Aug 2009 07:36:32 -0700, wrote:
    >> >On Tue, 18 Aug 2009 07:14:34 -0700 (PDT), venkateshwar D <> wrote:

    >>
    >>
    >> >Hex characters? Like [a-f0-9] ? Or integers?

    >>
    >> >$sequence = " 00 00 02 02 01 00";

    >>
    >> >open my $fin, '<:raw', 'filename.in' or die "can't open input file: $!";
    >> >open my $fout, '>:raw', 'filename.out' or die "can't open output file: $!";

    >>

    >
    >Hi
    >
    >Thanks a lot. This does not seem to be working. it is doing a binary
    >file copy.
    >
    >I want to search for that pattern in the binary file (000002020100)
    >(it is hex character file) and remove this pattern + the next 18 bytes
    >in the file.
    >thanks
    >venkat


    I don't understand what you mean. Opening the file in ':raw' mode
    disables any CRLF translation and/or encoding layers.
    You're free to read it as bytes then.

    Surely "000002020100" as text can be represented in a regular expression.
    Regular expressions are all about 'text'.
    Each character there has an ordinal value that you would consider binary.

    If you are instead looking for binary values, a 0 byte would be the \x{0}
    character and a 2 byte would be \x{2}.

    If it's text, just look for the sequence plus the next 18 bytes:

    =~ s/000002020100.{18}//s

    After the buffer is modified, write it out to a different ':raw' file,
    where no translations will take place.

    You can get the same effect in translated mode; just make sure the buffer
    isn't upgraded to utf8.
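
    To make the text-versus-binary distinction concrete, here is a minimal
    sketch that works on an in-memory buffer rather than a file. The sample
    buffer in $data is made up, and remove_marker is a hypothetical helper
    name, not something from the posts in this thread. It shows that the
    twelve-character text pattern does not match six raw bytes, while a
    pattern built with pack does.

```perl
use strict;
use warnings;

# Six raw bytes 00 00 02 02 01 00, as opposed to the twelve ASCII
# characters "000002020100".
my $binary_marker = pack('C*', 0x00, 0x00, 0x02, 0x02, 0x01, 0x00);
my $text_marker   = '000002020100';

# Hypothetical helper: remove the first marker plus the 18 bytes after it.
# Returns the new string and a flag saying whether anything was removed.
sub remove_marker {
    my ($data, $marker) = @_;
    # \Q...\E quotes the marker so no byte is read as a regex metacharacter;
    # /s lets "." match any byte, including newline (0x0a).
    my $removed = ($data =~ s/\Q$marker\E.{18}//s) ? 1 : 0;
    return ($data, $removed);
}

# Made-up sample: 10 filler bytes, the binary marker, 18 payload bytes
# (all newlines, to exercise /s), then 5 filler bytes.
my $data = ("\xAA" x 10) . $binary_marker . ("\x0A" x 18) . ("\xBB" x 5);

# unpack 'H*' shows what is really in the buffer, two hex digits per byte.
print unpack('H*', substr($data, 10, 6)), "\n";   # prints 000002020100

my ($by_text,   $hit_text)   = remove_marker($data, $text_marker);
my ($by_binary, $hit_binary) = remove_marker($data, $binary_marker);

printf "text pattern:   removed=%d, length %d\n", $hit_text,   length $by_text;
printf "binary pattern: removed=%d, length %d\n", $hit_binary, length $by_binary;
```

    Dumping a few bytes with unpack 'H*' is a quick way to check which of
    the two cases a given file falls into before choosing the pattern.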


    -sln
    , Aug 18, 2009
    #5
  6. venkateshwar D

    Guest

    On Tue, 18 Aug 2009 16:47:08 +0100, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:

    > wrote:
    >
    >> Hex characters? Like [a-f0-9] ? Or integers?
    >>
    >> $sequence = " 00 00 02 02 01 00";
    >>
    >> open my $fin, '<:raw', 'filename.in' or die "can't open input file: $!";
    >> open my $fout, '>:raw', 'filename.out' or die "can't open output file: $!";
    >> {
    >> local $/;
    >> $buf = <$fin>;
    >> $buf =~ s/$sequence//;
    >> print $fout $buf;
    >> }
    >> close $fout;
    >> close $fin;

    >
    >For extra merit, make it work without reading
    >the whole file into ram at once ;-)
    >
    > BugBear


    Double buffer, something like this then:

    open my $fin, '<:raw', 'filename.in' or die "can't open input file: $!";
    open my $fout, '>:raw', 'filename.out' or die "can't open output file: $!";

    $keep = 50;
    {
        local $/ = \4092;
        ($buf,$block) = ('','');

        while (defined ($block = <$fin>))
        {
            $buf .= $block;
            $keep = 0 if ($keep and $buf =~ s/000002020100.{18}//s);
            print $fout substr( $buf, 0, length($buf)-$keep, "");
        }
    }
    close $fout;
    close $fin;
    ======================

    Or a variant that skips the extra buffering once the match has been found,
    though it may actually be slower:

    $keep = 50;
    {
        local $/ = \4092;
        ($buf,$block) = ('','');
        $bref = \$block;

        while (defined ($$bref = <$fin>))
        {
            if ($keep)
            {
                $buf .= $block;
                if ($buf =~ s/000002020100.{18}//s) {
                    $keep = 0;
                    $bref = \$buf;
                }
                print $fout substr( $buf, 0, length($buf)-$keep, "");
                next;
            }
            print $fout $buf;
        }
    }
    close $fout;
    close $fin;

    ======================
    -sln
    , Aug 18, 2009
    #6
  7. venkateshwar D

    Guest

    On Tue, 18 Aug 2009 12:31:48 -0700, wrote:

    >On Tue, 18 Aug 2009 16:47:08 +0100, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    >
    >> wrote:
    >>
    >>For extra merit, make it work without reading
    >>the whole file into ram at once ;-)
    >>
    >> BugBear

    >
    >Double buffer, something like this then:
    >
    >
    > $keep = 50;
    > {
    > }

    Of course, you have to flush what is left in $buf here in case
    nothing was found. Note that if nothing was found, the output is
    identical to the input file, which is not the result you want:

    print $fout $buf if $keep;

    > close $fout;
    > close $fin;
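
    Putting the chunked read, the held-back tail and the final flush together,
    a minimal sketch might look like the following. It is an illustration under
    stated assumptions, not the code from the posts above: strip_record is a
    hypothetical helper name, it uses read() instead of setting $/, it holds
    back exactly one byte less than the record length (23 here) rather than 50,
    and the demo runs on in-memory filehandles. With real files, open them
    with '<:raw' and '>:raw' as in the earlier posts.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in to $out in chunks, removing the first
# occurrence of $marker plus the $tail_len bytes that follow it.
# Returns 1 if the record was found and removed, 0 otherwise.
sub strip_record {
    my ($in, $out, $marker, $tail_len, $chunk_size) = @_;
    my $record_len = length($marker) + $tail_len;
    # Hold back enough bytes that a record split across two chunks is
    # complete in the buffer once the next chunk arrives.
    my $keep  = $record_len - 1;
    my $buf   = '';
    my $found = 0;

    while (read($in, my $chunk, $chunk_size)) {
        $buf .= $chunk;
        if (!$found && $buf =~ s/\Q$marker\E.{$tail_len}//s) {
            $found = 1;
        }
        # After a hit nothing needs holding back; before it, keep the tail.
        my $emit = $found ? length($buf) : length($buf) - $keep;
        print {$out} substr($buf, 0, $emit, '') if $emit > 0;
    }
    print {$out} $buf;    # final flush of whatever was held back
    return $found;
}

# Demo on made-up data; the 16-byte chunk size forces the record to
# straddle several chunk boundaries.
my $marker = pack('C*', 0x00, 0x00, 0x02, 0x02, 0x01, 0x00);
my $input  = ('A' x 30) . $marker . ('x' x 18) . ('B' x 30);

open my $in,  '<', \$input     or die "can't open in-memory input: $!";
open my $out, '>', \my $output or die "can't open in-memory output: $!";
my $found = strip_record($in, $out, $marker, 18, 16);
close $out;
close $in;

print "found=$found, output length ", length($output), "\n";
```

    Holding back record length minus one bytes is the smallest tail that
    still guarantees an incomplete record at the end of the buffer is never
    written out early; a larger value such as 50 works as well.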


    -sln
    , Aug 18, 2009
    #7
  8. venkateshwar D

    Guest

    On Tue, 18 Aug 2009 16:47:08 +0100, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:

    > wrote:
    >
    >> Hex characters? Like [a-f0-9] ? Or integers?
    >>
    >> $sequence = " 00 00 02 02 01 00";
    >>
    >> open my $fin, '<:raw', 'filename.in' or die "can't open input file: $!";
    >> open my $fout, '>:raw', 'filename.out' or die "can't open output file: $!";
    >> {
    >> local $/;
    >> $buf = <$fin>;
    >> $buf =~ s/$sequence//;
    >> print $fout $buf;
    >> }
    >> close $fout;
    >> close $fin;

    >
    >For extra merit, make it work without reading
    >the whole file into ram at once ;-)
    >
    > BugBear


    Haha, I can do that. Special buffering, and an algorithm.
    -sln
    , Aug 18, 2009
    #8
  9. venkateshwar D

    Guest

    On Tue, 18 Aug 2009 08:34:49 -0700 (PDT), venkateshwar D <> wrote:

    >Thanks a lot. This does not seem to be working. it is doing a binary
    >file copy.
    >
    >I want to search for that pattern in the binary file (000002020100)
    >(it is hex character file) and remove this pattern + the next 18 bytes
    >in the file.
    >thanks
    >venkat


    How did you make out, any luck?
    Try this sample and see if it is similar to what you have.

    -sln
    ------------------------
    use strict;
    use warnings;


    open my $ftest, '>', 'dummy.txt' or die "can't create dummy.txt: $!";
    for (1 .. 2_000)
    {
        print $ftest "$_ 0000000000000000000 111111111111111111111\n";
    }
    print $ftest "sequence line: 0000000000000000000 <000002020100555555555555555555>111\n";
    for (2_001 .. 4_000)
    {
        print $ftest "$_ 0000000000000000000 111111111111111111111\n";
    }
    close $ftest;


    open my $fin, '<:raw', 'dummy.txt' or die "can't open input file: $!";
    open my $fout, '>:raw', 'dummy_o.txt' or die "can't open output file: $!";

    my ($chunksize, $found) = (4096,0);
    {
        local $/ = \$chunksize;

        my ($keep, $buf, $data) = (50,'','');

        while (defined ($data = <$fin>))
        {
            $buf .= $data;
            $found = 1 if (not $found and $buf =~ s/000002020100.{18}//s);
            print $fout substr( $buf, 0, -$keep, "");
        }
        print $fout $buf;
    }
    if (!$found) {
        print "Did not match sequence: '000002020100.{18}'\n";
    }

    close $fout;
    close $fin;

    __END__
    , Aug 19, 2009
    #9
  10. venkateshwar D wrote:
    > Hi All,
    >
    > I need to look for a sequence of hex characters in a binary file and
    > remove them. the binary file has 00 00 02 02 01 00 sequence somewhere
    > in the file.
    > The script should open the file and look for this sequence 00 00 02 02
    > 01 00 <18 variable bytes> and remove the 18 + 6 = 24 bytes from the
    > file.can someone please help. I can open the binary file and buffer
    > byte by byte but since the pattern can be anywhere in the file i dont
    > know how to proceed


    I've done this a couple of times in order to find some embedded files in
    some documents (most often to find images in xls, doc, ppt, ...),
    although I usually discard whatever is not of interest to me.

    You have to read the file byte-by-byte and check for the header:
    (Untested Code follows!)

    my $special = pack('C*', 0x00, 0x00, 0x02, 0x02, 0x01, 0x00);
    open(my $src, '<', $srcname) or die "$0: cannot open $srcname: $!\n";
    open(my $dst, '>', $dstname) or die "$0: Cannot create $dstname: $!\n";
    binmode $src;
    binmode $dst;
    my $buf = '';
    read($src, $buf, length($special));      # prime a 6-byte sliding window
    while (1) {
        if ($buf eq $special) {
            seek($src, 18, 1);               # skip 18 bytes; whence 1 = relative to current position
            last if read($src, $buf, length($special)) != length($special);
            next;
        }
        print $dst substr($buf, 0, 1, '');   # emit the oldest byte and drop it from the window
        last if read($src, $buf, 1, length($buf)) != 1;   # append the next byte
    }
    print $dst $buf;
    close($src);
    close($dst);

    HTH,

    Josef
    --
    These are my personal views and not those of Fujitsu Technology Solutions!
    Josef Möllers (Pinguinpfleger bei FTS)
    If failure had no penalty success would not be a prize (T. Pratchett)
    Company Details: http://de.ts.fujitsu.com/imprint.html
    Josef Moellers, Aug 20, 2009
    #10
  11. venkateshwar D

    Guest

    On Thu, 20 Aug 2009 14:11:27 +0200, Josef Moellers <> wrote:

    >venkateshwar D wrote:
    >> Hi All,
    >>
    >> I need to look for a sequence of hex characters in a binary file and
    >> remove them. the binary file has 00 00 02 02 01 00 sequence somewhere
    >> in the file.

    >
    >I've done this a couple of times in order to find some embedded files in
    >some documents (most often to find images in xls, doc, ppt, ...),
    >although I usually discard whatever is not of interest to me.
    >
    >You have to read the file byte-by-byte and check for the header:
    >(Untested Code follows!)
    >
    >my $special = pack('C*', 0x00, 0x00, 0x02, 0x02, 0x01, 0x00);


    Why would you have to read the file a byte at a time and check
    for the header? You store binary (byte) data in a buffer and then use 'eq'
    as if it were character data, but you won't trust a regular expression,
    which would do the same thing.

    The file could be slurped into a buffer and then checked with a regular
    expression, or it could be read in a chunk at a time: check the buffer,
    then roll the chunk out of the buffer minus the width of the sequence
    plus 18 bytes. The next chunk is appended, and the process repeats until
    the sequence is found.

    I put up an example of how to do this.
    The proof that this works is the example below, which uses the same
    method you use, but instead of "read 1 byte, is it 'eq'", etc., it uses
    a regular expression on a chunk of bytes.

    Perl regexes default to byte semantics; Perl will upgrade the context to
    utf8 only if something in the expression forces it to. In this case
    nothing does: the sequence is in byte context (i.e. every value is less
    than 0x100). The file is opened in binary mode, so it is byte context too.

    -sln

    use strict;
    use warnings;

    my $special = pack('C*', 0x00, 0x00, 0x02, 0x02, 0x01, 0x00);
    my $bytes = '';

    for (1 .. 12_000) {
        if ($_ == 6000) {
            $bytes .= $special;
        } else {
            $bytes .= chr(int(rand(256)) & 0xff);
        }
    }
    print "buf len = ".length($bytes)."\n";
    my $posn = 0;
    if ($bytes =~ s/($special)(.{18})/$posn = pos($bytes); ''/es) {
        print "Found special at position ".$posn.": ".ordsplit($1)."\n";
        print "Next 18 bytes : ".ordsplit($2)."\n";
        print "Special + 18 bytes, removed!\n";
    }
    print "buf len = ".length($bytes)."\n";
    sub ordsplit
    {
        my $string = shift;
        my $buf = '';
        for (map {ord $_} split //, $string) {
            $buf.= sprintf ("%02x ",$_);
        }
        return $buf;
    }
    __END__

    buf len = 12005
    Found special at position 5999: 00 00 02 02 01 00
    Next 18 bytes : de b9 70 b9 4b b9 4c 9f 1d f3 de 33 52 00 26 a7 50 41
    Special + 18 bytes, removed!
    buf len = 11981
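
    As a variation on the example above, the match offset can also be read
    from Perl's @- and @+ arrays after a plain match, and the record can
    then be cut out with four-argument substr. This is only a sketch:
    cut_record is a hypothetical helper name, and the filler bytes are
    fixed values rather than random data so the result is repeatable.

```perl
use strict;
use warnings;

my $special = pack('C*', 0x00, 0x00, 0x02, 0x02, 0x01, 0x00);

# Hypothetical helper: locate the marker plus the 18 bytes after it, cut
# the whole 24-byte record out of the buffer in place, and return the
# offset where it started (or -1 if there is no complete record).
sub cut_record {
    my ($bufref, $marker) = @_;
    return -1 unless $$bufref =~ /\Q$marker\E.{18}/s;
    my ($start, $end) = ($-[0], $+[0]);   # start and end offsets of the match
    substr($$bufref, $start, $end - $start, '');
    return $start;
}

# Made-up data: 5999 filler bytes, the marker, then 6000 more filler bytes.
my $bytes  = ("\xFF" x 5999) . $special . ("\x7F" x 6000);
my $before = length $bytes;
my $offset = cut_record(\$bytes, $special);

printf "offset %d, %d bytes removed, buf len = %d\n",
    $offset, $before - length($bytes), length($bytes);
```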
    , Aug 20, 2009
    #11
  12. venkateshwar D

    Guest

    On Tue, 18 Aug 2009 07:14:34 -0700 (PDT), venkateshwar D <> wrote:

    >Hi All,
    >
    >I need to look for a sequence of hex characters in a binary file and
    >remove them. the binary file has 00 00 02 02 01 00 sequence somewhere
    >in the file.
    >The script should open the file and look for this sequence 00 00 02 02
    >01 00 <18 variable bytes> and remove the 18 + 6 = 24 bytes from the
    >file.can someone please help. I can open the binary file and buffer
    >byte by byte but since the pattern can be anywhere in the file i dont
    >know how to proceed
    >
    >regards
    >venkat


    Here's the same example in binary mode (i.e. the dummy file
    is random binary, with the binary sequence embedded).
    If this doesn't work for you, something else is wrong.

    -sln
    -------------------------

    use strict;
    use warnings;

    my $sequence = "\x{00}\x{00}\x{02}\x{02}\x{01}\x{00}";
    # or = pack('C*', 0x00, 0x00, 0x02, 0x02, 0x01, 0x00);

    # Create dummy random binary file with embedded sequence
    # ##
    open my $ftest, '>:raw', 'dummy.bin' or die "can't create dummy.bin: $!";
    for (1 .. 12_000) {
        if ($_ == 2000) {
            print $ftest $sequence;
        } else {
            print $ftest chr(int(rand(256)) & 0xff);
        }
    }
    close $ftest;

    # Read in binary, look for sequence, remove then write to file
    # ##
    open my $fin, '<:raw', 'dummy.bin' or die "can't open input file: $!";
    open my $fout, '>:raw', 'dummy_o.bin' or die "can't open output file: $!";
    my ($chunksize, $found) = (1024,0);
    {
        local $/ = \$chunksize;
        my ($keep, $buf, $data) = (50,'','');
        while (defined ($data = <$fin>)) {
            $buf .= $data;
            if (!$found) {
                if ($buf =~ s/($sequence)(.{18})//s) {
                    print "Found sequence: ".ordsplit($1)."\n";
                    print "Next 18 bytes : ".ordsplit($2)."\n";
                    print "Sequence + 18 bytes, removed!\n";
                    $found = 1;
                }
            }
            print $fout substr( $buf, 0, -$keep, "");
        }
        print $fout $buf;
    }
    if (!$found) {
        print "Did not match sequence: '\$sequence.{18}'\n";
    }
    close $fout;
    close $fin;

    ## End of program
    exit 0;

    sub ordsplit
    {
        my $string = shift;
        my $buf = '';
        for (map {ord $_} split //, $string) {
            $buf.= sprintf ("%02x ",$_);
        }
        return $buf;
    }

    __END__

    Found sequence: 00 00 02 02 01 00
    Next 18 bytes : 25 6f e4 7e 6e fb fe 1e 47 af e6 2e 50 3f 31 54 dd 51
    Sequence + 18 bytes, removed!
    , Aug 20, 2009
    #12