Simple converter of files into their hex components... but I can't arrange the UTF-8 parts!

blatt447477

Hi all,
I developed a script which, IMHO, is more useful than the well-known
"hexdump" command.
Unfortunately I can't easily handle the UTF-8 encoding, so in my output
the literal part and the hex part lose synchronization...
The script is not very long, but it is not very well written (no functions,
no classes...). I didn't succeed in formulating my doubts in
a more concise way... so here it is!

# -*- coding: utf-8 -*-
# px.py # python 2.6.6
nLenN=3 # n. of digits for lines

# hex conversion on 2 lines (except spaces)
# various run options: std      : python px.py file
#                      bash cat : cat file | python px.py (alias hex)
#                      bash echo: echo line | python px.py " "

# works on any n. of bytes for utf-8

import os, sys
import signal
signal.signal(signal.SIGPIPE, signal.SIG_DFL)

try:
    sFN=sys.argv[1]
    f=open(sFN)
    lF=f.readlines()
    f.close()
except:
    sHD=sys.stdin.read().replace('\n','~\n')
    lF=sHD.split('\n')
    for n in xrange(len(lF)):
        lF[n]=lF[n].replace('~','\n')

#################################################################

lP=[]
for n in xrange(len(lF)):

    lP.append(str(n+1).zfill(nLenN)+' '+lF[n])
    lNoSpaces=lF[n].replace(' ','~!').split('!')
    sHexH=sHexL=' ' * nLenN +' '
    for k in xrange(len(lNoSpaces)):
        sHex=lNoSpaces[k].encode('hex')
        sHexNT=sHex.replace('7e','')

        sH=''
        for c in xrange(0,len(sHexNT),2):
            sH += sHexNT[c]
        sHexH += sH+' '

        sL=''
        for c in xrange(1,len(sHexNT),2):
            sL += sHexNT[c]
        sHexL += sL+' '

    lP.append(sHexH+'\n')
    lP.append(sHexL+'\n\n') # to jump a line

# the insertion of one or more spaces after the unicode characters must be
# done manually on the output (lP)
print ''.join(lP)
#--------------------------------------------------------------

print '---------------------\n'
for n in xrange(0,len(lP),3):
    try:
        lP[n].encode('utf-8')
    except:
        print lP[n], # to be modified by hand in presence of utf-8 char
        print lP[n+1], # to synchronize ascii and hex
        print lP[n+2],

As you can see, it is a hex conversion on 2 lines (spaces excepted), with
various run options:
std      : python px.py file
bash cat : cat file | python px.py (alias hex)
bash echo: echo line | python px.py " "

Besides that, it can work (if I solve my problems) on UTF-8 sequences of
any number of bytes.
As an example of such problems, you can compare the output without and
with utf-8 chars...

004 # qwerty: not   unicode but   ascii
    2 7767773 667   7666666 677   676660
    3 175249a ef4   5e93f45 254   13399a

005 # qwerty: non è unicode bensì ascii
    2 7767773 666 ca 7666666 6667ca 676660
    3 175249a efe 38 5e93f45 25e33c 13399a

Thanks in advance for any help!
Blatt
 
Chris Angelico

Functions and classes are entirely optional in Python :) However,
there are a number of points about your code that are not Pythonic, so
I'll take the liberty of commenting on those. You're free to ignore my
comments, of course!

004 # qwerty: not   unicode but   ascii
    2 7767773 667   7666666 677   676660
    3 175249a ef4   5e93f45 254   13399a

005 # qwerty: non è unicode bensì ascii
    2 7767773 666 ca 7666666 6667ca 676660
    3 175249a efe 38 5e93f45 25e33c 13399a

I'm not 100% sure of what you're trying to accomplish here. You want
to produce a hex-dump output, but:
1) Your hex digits are directly underneath the character concerned (q
= 0x71 ergo line 2 has "7" and line 3 has "1");
2) Spaces are shown as spaces;
3) UTF-8 sequences get merged.

The one part I'm not sure about is what you intend to happen with
UTF-8 sequences. Currently, your line 005 gets offset by the c3 a8 and
then again by the c3 ac, which as you say is undesirable, but what
*do* you want? Are you trying to have the character on top get split
into its bytes? That would look like this:

005 # qwerty: non Ã¨ unicode bensÃ¬ ascii
    2 7767773 666 ca 7666666 6667ca 676660
    3 175249a efe 38 5e93f45 25e33c 13399a

You can do that by simply opening the file as Latin-1 (iso-8859-1)
instead of UTF-8. Each byte will be taken to represent its eight-bit
value, and then when you produce the output, it'll be encoded as
whatever your console requires.
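
A small sketch of that Latin-1 effect, assuming Python 3 (the `sample` word is borrowed from line 005 above; none of this is from the original script). Decoding the UTF-8 bytes as Latin-1 yields exactly one character per byte, so the text row and the hex rows end up with the same number of columns:

```python
# Python 3 sketch: decode UTF-8 bytes as Latin-1 to get one character per byte.
sample = "bensì"
raw = sample.encode("utf-8")        # 6 bytes: the accented letter becomes c3 ac
as_latin1 = raw.decode("latin-1")   # 6 characters, one per byte

print(len(sample), len(raw), len(as_latin1))
print(raw.hex())
```

The price is that the accented letter is shown as two unrelated Latin-1 characters rather than as itself.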

Alternatively, do you want to insert spaces, or some other placeholder?

005 # qwerty: non è  unicode bensì  ascii
    2 7767773 666 ca 7666666 6667ca 676660
    3 175249a efe 38 5e93f45 25e33c 13399a

Or perhaps it'd be better to string them out further vertically?

005 # qwerty: non è unicode bensì ascii
    2 7767773 666 c 7666666 6667c 676660
    3 175249a efe 3 5e93f45 25e33 13399a
                  a             a
                  8             c

Or maybe something else entirely? My understanding of your comment here:

# the insertion of one or more spaces after the unicode characters must be
# done manually on the output (lP)

is that you want to insert spaces, but that means changing the text
itself. I'm not so sure that's a good thing, but if that really is
what you want, then sure.

# -*- coding: utf-8 -*-
# px.py # python 2.6.6

I would advise, by-the-by, that you consider targeting Python 3. There
are heaps of extremely handy Unicode features in Python 3, most
notably that the default string type is Unicode characters, not bytes.
Also, Py3 has a future; Py2 will receive only bugfixes and security
patches.

try:
    sFN=sys.argv[1]
    f=open(sFN)
    lF=f.readlines()
    f.close()
except:
    sHD=sys.stdin.read().replace('\n','~\n')
    lF=sHD.split('\n')
    for n in xrange(len(lF)):
        lF[n]=lF[n].replace('~','\n')

This is reading the entire file in before producing any output. This
plays badly with other Unix tools (you can't, for instance, 'tail -f
some-file|your-script' to monitor a growing log), and causes extensive
memory usage. Since you then (as far as I can see) always work
line-by-line, you would probably do better to simply iterate over the
lines of input.

Also: I'm not sure what your replace calls are meant to do, but you're
turning all tildes into newlines. It should be possible to iterate
over the lines without this hassle.
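
As a rough illustration of the line-by-line approach, assuming Python 3: the helper name `numbered_lines` is made up for this sketch, and an `io.StringIO` object stands in for the real input so the example can run on its own.

```python
import io

def numbered_lines(stream, width=3):
    """Yield one 'NNN text' string per input line, without reading ahead."""
    for number, line in enumerate(stream, start=1):
        yield str(number).zfill(width) + " " + line.rstrip("\n")

# Any iterable of lines works: an open file, sys.stdin, or fileinput.input().
demo = io.StringIO("first line\nsecond ~ line\n")
result = list(numbered_lines(demo))
print(result)
```

Because nothing is substituted for the newlines, a tilde in the input passes through untouched, and each line can be printed as soon as it arrives.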

lP=[]

I have no idea what this name is supposed to represent; longer names
are usually preferred. Also, all-uppercase names tend to be for
constants, which will confuse people.

for k in xrange(len(lNoSpaces)):
    sHex=lNoSpaces[k].encode('hex')

You're iterating up to the length of something and then using the
index only to retrieve the current element. There's an easier way to
spell that:

for char in lNoSpaces:
    sHex=char.encode('hex')

sH=''
for c in xrange(0,len(sHexNT),2):
    sH += sHexNT[c]
sHexH += sH+' '

sL=''
for c in xrange(1,len(sHexNT),2):
    sL += sHexNT[c]
sHexL += sL+' '

Here's a really fancy trick you can do: Slicing with a step.
Demonstrating with the interactive interpreter:
>>> s = "7177657274793a206e6f74202020756e69636f6465206275742020206173636969"
>>> hi = s[::2]
>>> lo = s[1::2]
>>> hi
'776777326672227666666267722267666'
>>> lo
'175249a0ef40005e93f45025400013399'

I love Python!
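
The same slicing idea wrapped in a function, as a minimal Python 3 sketch. The name `nibble_rows` is hypothetical; `binascii.hexlify` is the standard-library call that replaces the Python 2 `.encode('hex')`.

```python
import binascii

def nibble_rows(data):
    """Return (high, low) hex-digit strings for a bytes object."""
    hexed = binascii.hexlify(data).decode("ascii")
    return hexed[::2], hexed[1::2]   # even positions, then odd positions

high, low = nibble_rows(b"qwerty:")
print(high)  # 7767773
print(low)   # 175249a
```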

for n in xrange(0,len(lP),3):
    try:
        lP[n].encode('utf-8')
    except:
        print lP[n], # to be modified by hand in presence of utf-8 char
        print lP[n+1], # to synchronize ascii and hex
        print lP[n+2],

Okay... this is something I'm not understanding. I *think* that lP[n]
here is your original text, as a byte string. In that case, what you
want here is to decode it as UTF-8, I think. But I'm not sure.

Recommendation: Don't use a bare 'except', unless you're logging an
exception and moving on. You're masking an error here; when you
attempt to *encode* a byte string as UTF-8, what it actually does is
first try to *decode* it as ASCII (which produces a Unicode string),
then encode the result. Again using the interactive interpreter:
>>> '\xc2\xa2'.encode('utf-8')
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    '\xc2\xa2'.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
>>> '\xc2\xa2'.decode('utf-8')
u'\xa2'

This is another area where Python 3 would help, because you would be
working with strings most of the way through.
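
A minimal Python 3 sketch of both points, with a hypothetical helper name `decode_or_none`: the `except` clause names the one error it expects, and bytes objects have no `encode` method at all, so the silent ASCII round trip cannot happen.

```python
def decode_or_none(raw):
    """Return the text for UTF-8 bytes, or None if they are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:   # catch only the error we expect
        return None

cent = decode_or_none(b"\xc2\xa2")   # a complete two-byte sequence
broken = decode_or_none(b"\xc2")     # a truncated sequence
print(repr(cent), broken)
print(hasattr(b"", "encode"))        # bytes cannot be "encoded" again
```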

Here's how I'd put together a Py3 version of that code - this is
pseudo-code, but Python's pretty good at executing pseudo-code...

import fileinput
import binascii
for line in fileinput.input(): # handles all the details of args-or-stdin
    output=['line','headers','here']
    for char in line:
        if char==' ':
            output[*]+=' ' # not Python syntax but so close
            continue
        utf8=char.encode('utf-8')
        output[0]+=char+' '*(len(utf8)-1)
        utf8=binascii.hexlify(utf8).decode()
        output[1]+=utf8[::2]; output[2]+=utf8[1::2]
    print(output[*])

That's actually very close to real code, feel free to flesh it out a
smidge and run it :) I seriously was going to start by writing just
pseudo-code, but it got closer and closer to actual working Python...
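
One possible way to flesh that out in Python 3, offered as a sketch rather than as the finished tool. The function name `hex_rows` is invented here, and unlike the original script it drops the trailing newline instead of showing its hex value.

```python
def hex_rows(line):
    """Return [text, high nibbles, low nibbles] with columns kept in step."""
    rows = ["", "", ""]
    for char in line.rstrip("\n"):
        if char == " ":
            rows = [row + " " for row in rows]   # spaces stay blank in all rows
            continue
        hexed = char.encode("utf-8").hex()
        width = len(hexed) // 2                  # bytes used by this character
        rows[0] += char + " " * (width - 1)      # pad the text to the byte count
        rows[1] += hexed[::2]
        rows[2] += hexed[1::2]
    return rows

for row in hex_rows("non è unicode"):
    print(row)
```

The padding assumes every character occupies one column on screen, so wide or combining characters would still drift in a terminal.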

Oh, and since we have a current thread about copyright and license:
This is copyright 2013 Chris Angelico, MIT license. So go ahead, use
it. :)

ChrisA
 
blatt447477

Hi Chris,
your criticisms are welcome! But perhaps most of them were
caused by font problems in my posting.
Google should make a "mono" font the default!
Or perhaps it was a mistake on my part not to configure
the output of my post correctly (I didn't even change my nickname...
so you read a selfish "me" instead of "Blatt"!).
I will try in this reply to have a more readable output, but I'm not
sure... (I need help from experts!...).
For now, the only advice is to copy/paste my post into an editor.

Coming to your criticisms...

01 Your remark about my not being 'pythonic' is correct...
but it is so difficult to change programming style at my age!...

02
004 # qwerty: not _ unicode but _ ascii
___ 2 7767773 667 _ 7666666 677 _ 676660
___ 3 175249a ef4 _ 5e93f45 254 _ 13399a

005 # qwerty: non è unicode bensì ascii
___ 2 7767773 666 ca 7666666 6667ca 676660
___ 3 175249a efe 38 5e93f45 25e33c 13399a

As you can see from the corrected output (I hope!), I am indeed
trying to produce a hex-dump output ('_' is a "place-holder"
for a space, which itself is not included in the hex output).
The 2s and 3s are not lines 2 and 3, but 23 read vertically (the hex of #)!
In the editor you can better see my problem with utf-8 chars...
They cause the loss of synchronization. For example, 'unicode' after
the utf-8 char is no longer synchronized with its corresponding hex.
To keep the synchronization, you should insert a space after the
utf-8 char... but I didn't succeed (at least programmatically...).

03 I will probably never use Python 3! I am perfectly happy with bugfixes
and security patches...
The reason is that I don't need all the "bells and whistles" of Py3,
especially in the field of OOP!

04 I already synchronized the output... but afterwards, in the editor...
I want to do this programmatically, so I can use it from the
console with bash and pipes.

05 I considered this solution, but if I succeed in synchronizing everything
programmatically... there is not much of a difference.

06 No problems of speed...

07
lP=[]
I have no idea what this name is supposed to represent
Simply a list initialization to empty...

08 Your solution is much better!

I can go further, but it's better that you run my script (if you want)
to get a better understanding. If you are really interested, you can also
try your version (probably better... I will try it).

Bye, Blatt.
 
