Convert raw binary file to ascii

r2

I have a memory dump from a machine I am trying to analyze. I can view
the file in a hex editor to see text strings in the binary code. I
don't see a way to save these ascii representations of the binary, so
I went digging into Python to see if there were any modules to help.

I found one I think might do what I want it to do - the binascii
module. Can anyone describe to me how to convert a raw binary file to
an ascii file using this module? Have I tried? Boy, have I tried.

Am I correct in assuming I can get the converted binary to ascii text
I see in a hex editor using this module? I'm new to this forensics
thing and it's quite possible I am mixing technical terms. I am not
new to Python, however. Thanks for your help.
 
Peter Otten

r2 said:
I have a memory dump from a machine I am trying to analyze. I can view
the file in a hex editor to see text strings in the binary code. I
don't see a way to save these ascii representations of the binary, so
I went digging into Python to see if there were any modules to help.

I found one I think might do what I want it to do - the binascii
module. Can anyone describe to me how to convert a raw binary file to
an ascii file using this module? Have I tried? Boy, have I tried.

That won't work, because a text editor doesn't need any help to convert
bytes into characters. If it expects ascii, it will just be puzzled by bytes
that are not valid ascii. It will also happily display byte sequences that
are valid ascii but that you as a user will see as gibberish, because the
program that wrote them meant them as binary data.

r2 said:
Am I correct in assuming I can get the converted binary to ascii text
I see in a hex editor using this module? I'm new to this forensics
thing and it's quite possible I am mixing technical terms. I am not
new to Python, however. Thanks for your help.

Unix has the "strings" command-line tool to extract text from a binary.
Get hold of a copy of the MinGW tools if you are on Windows.
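
On the binascii question, a quick sketch of what that module actually does may explain why it did not help: it converts between raw bytes and ASCII encodings of those bytes (hex, base64 and so on) rather than picking readable text out of binary data. The sample bytes below are made up for illustration.

```python
import binascii

# Made-up sample: some binary noise around a readable word.
raw = b"\x00\x01hello\xff"

# hexlify() encodes every byte as two hex digits; nothing is filtered out.
hex_text = binascii.hexlify(raw)

# unhexlify() reverses it, giving back exactly the original bytes.
round_trip = binascii.unhexlify(hex_text)

# b2a_base64() is another reversible ASCII encoding of the same bytes.
b64_text = binascii.b2a_base64(raw)

print(hex_text)
print(round_trip == raw)
```

Every byte survives the conversion, including the unreadable ones, so the result is no closer to the text column of a hex editor than the original file was.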

Peter
 
r2

Okay. Thanks for the guidance. I have a machine with Linux, so I
should be able to do what you describe above. Could Python extract the
strings from the binary as well? Just wondering.
 
Peter Otten

r2 said:
Okay. Thanks for the guidance. I have a machine with Linux, so I
should be able to do what you describe above. Could Python extract the
strings from the binary as well? Just wondering.

As a special service for you here is a naive implementation to build upon:

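#!/usr/bin/env python3
import sys

# Translation table: printable ASCII and tab map to themselves,
# every other byte maps to NUL so it can serve as a separator.
wanted_chars = [0] * 256
for i in range(32, 127):
    wanted_chars[i] = i
wanted_chars[ord("\t")] = ord("\t")
wanted_chars = bytes(wanted_chars)

THRESHOLD = 4  # minimum run length worth printing

# Read raw bytes from stdin, blank out unwanted bytes, split on NUL.
for s in sys.stdin.buffer.read().translate(wanted_chars).split(b"\0"):
    if len(s) >= THRESHOLD:
        print(s.decode("ascii"))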
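
The same idea can also be sketched with a regular expression over bytes, assuming Python 3. The function name `extract_strings` and the sample data are made up for illustration; treat it as a starting point rather than a replacement for the real "strings" tool.

```python
import re

def extract_strings(data, threshold=4):
    """Return runs of printable ASCII (plus tab) at least `threshold` bytes long."""
    # [\t\x20-\x7e] matches tab and the printable range 32..126.
    pattern = re.compile(rb"[\t\x20-\x7e]{%d,}" % threshold)
    return [m.decode("ascii") for m in pattern.findall(data)]

# Made-up sample: two readable runs and one that is too short ("ab").
sample = b"\x00\x01hello world\x00ab\x02\xffpath=/tmp\x00"
for s in extract_strings(sample):
    print(s)
```

The regex version avoids building a 256-entry table, at the cost of hiding the character set inside the pattern.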

Peter
 
r2

Perfect! Thanks.
 
Dave Angel

r2 said:
Okay. Thanks for the guidance. I have a machine with Linux, so I
should be able to do what you describe above. Could Python extract the
strings from the binary as well? Just wondering.

Yes, you could do the same thing in Python easily enough, with the added
advantage that you could define your own meaning for "characters."

The memory dump could be storing characters that are strictly ASCII. Or
it could have EBCDIC, or UTF-8. Or it could be Unicode in 16-bit or 32-bit
units, either big-endian or little-endian. Or the characters could be in
some other format specific to a particular program.

However, it's probably very useful to see what a "strings" program might
look like, because you can quickly code variations on it, to suit your
particular data.
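Something like the following (totally untested):

def isprintable(byte):
    # Printable 7-bit ASCII: space (0x20) through tilde (0x7e); 0x7f is DEL.
    return 0x20 <= byte <= 0x7e

def string(filename, cutoff=4):
    with open(filename, "rb") as f:
        data = f.read()
    line = ""
    for byte in data:  # iterating over bytes yields ints in Python 3
        if isprintable(byte):
            line += chr(byte)
        else:
            # cutoff: don't print strings smaller than this,
            # because they're probably just coincidence
            if len(line) > cutoff:
                print(line)
            line = ""
    if len(line) > cutoff:  # flush a run that reaches the end of the file
        print(line)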


Now you can change the definition of what's "printable", you can change
the min-length that you care about. And of course you can fine-tune
things like max-length lines and such.
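
As one example of such a variation, here is a rough sketch that looks for little-endian UTF-16 text, where each ASCII-range character is followed by a zero byte. The function name `utf16le_strings` and the sample bytes are assumptions for illustration, and the sketch deliberately ignores characters outside the ASCII range.

```python
import re

def utf16le_strings(data, min_chars=4):
    """Find runs of ASCII-range characters stored as little-endian UTF-16."""
    # Each character is one printable byte followed by a zero byte.
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_chars)
    return [m.group().decode("utf-16-le") for m in pattern.finditer(data)]

# Made-up sample: noise, a 16-bit string, more noise, and a run that is too short.
sample = (b"\x01\x02" + "config.sys".encode("utf-16-le")
          + b"\xff\xfe\x03" + "ab".encode("utf-16-le"))
print(utf16le_strings(sample))
```

Only the definition of a "character" changed here; the minimum-length cutoff works the same way as before.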

DaveA
 
Jan Kaliszewski

Hello Friends,

It's my first post to python-list, so first let me introduce myself...
* my name is Jan Kaliszewski,
* country -- Poland,
* occupation -- composer (studied at the F. Chopin Academy of Music in
  Warsaw) and programmer (currently at the Record System company, working
  on Anakonda -- an ERP system for big companies [developed in Python + WX
  + Postgres]).

Now, to the matter...

27-07-2009 Grant Edwards said:
$ strings memdump.binary >memdump.strings

$ hexdump -C memdump.binary >memdump.hex+as
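
For comparison, the hex-plus-ASCII view that the second command produces can be roughly approximated in Python. This is only a sketch: the function name `hexdump_lines` is made up, and the layout is similar to, not identical with, the real tool's output.

```python
def hexdump_lines(data, width=16):
    """Yield lines with offset, hex bytes, and a printable-ASCII column."""
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join("%02x" % b for b in chunk)
        # Non-printable bytes are shown as '.', as hex editors usually do.
        text_part = "".join(chr(b) if 0x20 <= b <= 0x7e else "." for b in chunk)
        # Pad the hex column so the text column lines up on short rows.
        yield "%08x  %-*s  |%s|" % (offset, width * 3 - 1, hex_part, text_part)

for line in hexdump_lines(b"\x00\x01hello\xff"):
    print(line)
```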

Do you (r2) want to get ASCII substrings (i.e. extract only those
pieces of the file that consist of ASCII codes, i.e. 7-bit values in the
range 0...127), or rather a "possibly readable ascii representation" of
the whole file, with printable ascii characters preserved 'as is' and
non-printable/non-ascii characters replaced with their codes
(e.g. in '\x...' notation)?

If the latter, you probably want something like this:

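# Python 3 version: the 'string_escape' codec is Python 2 only, so
# latin-1 + 'unicode_escape' is used here to get the same kind of output.
with open('memdump.binary', 'rb') as source:
    with open('memdump.txt', 'wb') as target:
        for quasiline in source:
            # latin-1 maps each byte to the code point of the same value
            target.write(quasiline.decode('latin-1').encode('unicode_escape'))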
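
To make the difference between the two options concrete, here is a small sketch of the escaped representation on a few made-up bytes, assuming Python 3. The `string_escape` codec is not available there, so `latin-1` plus `unicode_escape` is swapped in as a stand-in that produces the same style of `\x..` output.

```python
# Made-up sample with control bytes, a newline, a high byte and a backslash.
sample = b"\x00\x01hello\n\xffworld\\"

# latin-1 maps every byte to the code point with the same value, and
# unicode_escape writes anything outside printable ASCII as an escape.
escaped = sample.decode("latin-1").encode("unicode_escape")
print(escaped.decode("ascii"))

# The representation is reversible, unlike extracting substrings.
restored = escaped.decode("unicode_escape").encode("latin-1")
print(restored == sample)
```

Because nothing is thrown away, this form suits cases where the whole dump has to stay recoverable, while substring extraction suits skimming for readable text.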
 
