Simple converter of files into their hex components... but I can't arrange utf-8 parts!

Discussion in 'Python' started by blatt447477@gmail.com, Jun 9, 2013.

  1. Guest

    Hi all,
    I developed a script, which, IMHO, is more useful than the well
    known bash "hexdump".
    Unfortunately, I can't easily handle UTF-8 encoded text,
    so in my output there is a loss of synchronization between
    the literal and the hex part...
    The script is not very long, but it is not very well written (no functions,
    no classes...), and I didn't succeed in formulating my doubts in
    a more concise way... so here it is!

    # -*- coding: utf-8 -*-
    # px.py # python 2.6.6
    nLenN=3 # n. of digits for lines

    # hex conversion on 2 lines (except spaces)
    # various run options: std : python px.py file
    # bash cat : cat file | python px.py (alias hex)
    # bash echo: echo line | python px.py " "

    # works on any n. of bytes for utf-8

    import os, sys
    import signal
    signal.signal(signal.SIGPIPE, signal.SIG_DFL)

    try:
        sFN=sys.argv[1]
        f=open(sFN)
        lF=f.readlines()
        f.close()
    except:
        sHD=sys.stdin.read().replace('\n','~\n')
        lF=sHD.split('\n')
        for n in xrange(len(lF)):
            lF[n]=lF[n].replace('~','\n')

    #################################################################

    lP=[]
    for n in xrange(len(lF)):

        lP.append(str(n+1).zfill(nLenN)+' '+lF[n])
        lNoSpaces=lF[n].replace(' ','~!').split('!')
        sHexH=sHexL=' ' * nLenN +' '
        for k in xrange(len(lNoSpaces)):
            sHex=lNoSpaces[k].encode('hex')
            sHexNT=sHex.replace('7e','')

            sH=''
            for c in xrange(0,len(sHexNT),2):
                sH += sHexNT[c]
            sHexH += sH+' '

            sL=''
            for c in xrange(1,len(sHexNT),2):
                sL += sHexNT[c]
            sHexL += sL+' '

        lP.append(sHexH+'\n')
        lP.append(sHexL+'\n\n') # to jump a line

    # the insertion of one or more spaces after the unicode characters must be
    # done manually on the output (lP)
    print ''.join(lP)
    #--------------------------------------------------------------

    print '---------------------\n'
    for n in xrange(0,len(lP),3):
        try:
            lP[n].encode('utf-8')
        except:
            print lP[n], # to be modified by hand in presence of utf-8 char
            print lP[n+1], # to syncronize ascii and hex
            print lP[n+2],

    As you see, it is a hex conversion on 2 lines (except spaces), which
    has various run options: std : python px.py file
    bash cat : cat file | python px.py (alias hex)
    bash echo: echo line | python px.py " "

    Besides that, it can work (if I solve my problems) on any n. of bytes
    for utf-8.
    As an example of such problems, you can compare the output in presence of
    utf-8 chars...

    004 # qwerty: not   unicode but   ascii
        2 7767773 667   7666666 677   676660
        3 175249a ef4   5e93f45 254   13399a

    005 # qwerty: non è unicode bensì ascii
        2 7767773 666 ca 7666666 6667ca 676660
        3 175249a efe 38 5e93f45 25e33c 13399a

    Thanks in advance for any help!
    Blatt
    , Jun 9, 2013
    #1

  2. Re: Simple converter of files into their hex components... but I can't arrange utf-8 parts!

    On Mon, Jun 10, 2013 at 7:06 AM, <> wrote:
    > Hi all,
    > I developed a script, which, IMHO, is more useful than the well
    > known bash "hexdump".
    > Unfortunately, I can't easily handle UTF-8 encoded text,
    > so in my output there is a loss of synchronization between
    > the literal and the hex part...
    > The script is not very long, but it is not very well written (no functions,
    > no classes...), and I didn't succeed in formulating my doubts in
    > a more concise way... so here it is!


    Functions and classes are entirely optional in Python :) However,
    there are a number of points about your code that are not Pythonic, so
    I'll take the liberty of commenting on those. You're free to ignore my
    comments, of course!

    > 004 # qwerty: not   unicode but   ascii
    >     2 7767773 667   7666666 677   676660
    >     3 175249a ef4   5e93f45 254   13399a
    >
    > 005 # qwerty: non è unicode bensì ascii
    >     2 7767773 666 ca 7666666 6667ca 676660
    >     3 175249a efe 38 5e93f45 25e33c 13399a


    I'm not 100% sure of what you're trying to accomplish here. You want
    to produce a hex-dump output, but:
    1) Your hex digits are directly underneath the character concerned (q
    = 0x71 ergo line 2 has "7" and line 3 has "1");
    2) Spaces are shown as spaces;
    3) UTF-8 sequences get merged.

    The one part I'm not sure about is what you intend to happen with
    UTF-8 sequences. Currently, your line 005 gets offset by the c3 a8 and
    then again by the c3 ac, which as you say is undesirable, but what
    *do* you want? Are you trying to have the character on top get split
    into its bytes? That would look like this:

    005 # qwerty: non Ã¨ unicode bensÃ¬ ascii
        2 7767773 666 ca 7666666 6667ca 676660
        3 175249a efe 38 5e93f45 25e33c 13399a

    You can do that by simply opening the file as Latin-1 (iso-8859-1)
    instead of UTF-8. Each byte will be taken to represent its eight-bit
    value, and then when you produce the output, it'll be encoded as
    whatever your console requires.
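
    A minimal Python 3 sketch of that Latin-1 idea, assuming the input
    bytes are UTF-8 text (the variable names are only illustrative).
    Decoding as Latin-1 maps every byte to exactly one character, so the
    text row and the two hex rows come out the same length:

```python
import binascii

raw = "non \u00e8".encode("utf-8")     # "non è" as UTF-8: 6 bytes, è is c3 a8
as_latin1 = raw.decode("latin-1")      # one character per byte, never fails
hex_digits = binascii.hexlify(raw).decode("ascii")

print(as_latin1)           # text row, one column per byte
print(hex_digits[::2])     # high nibbles
print(hex_digits[1::2])    # low nibbles
```

    The price is that non-ASCII characters no longer look like themselves
    in the text row; whether that is acceptable depends on what the dump
    is for.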

    Alternatively, do you want to insert spaces, or some other placeholder?

    005 # qwerty: non è  unicode bensì  ascii
        2 7767773 666 ca 7666666 6667ca 676660
        3 175249a efe 38 5e93f45 25e33c 13399a

    Or perhaps it'd be better to string them out further vertically?

    005 # qwerty: non è unicode bensì ascii
        2 7767773 666 c 7666666 6667c 676660
        3 175249a efe 3 5e93f45 25e33 13399a
                      a             a
                      8             c
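
    A rough Python 3 sketch of that vertical layout, under the assumption
    that every character occupies one screen column; vertical_rows is a
    made-up helper name, not part of the original script:

```python
def vertical_rows(text):
    """Stack each character's UTF-8 hex digits in a single column under it."""
    # One column of hex digits per character; spaces stay blank.
    columns = [" " if ch == " " else ch.encode("utf-8").hex() for ch in text]
    depth = max((len(col) for col in columns), default=0)
    rows = [text]
    for level in range(depth):
        row = "".join(col[level] if level < len(col) else " " for col in columns)
        rows.append(row.rstrip())
    return rows

rows = vertical_rows("non \u00e8")     # "non è"
for row in rows:
    print(row)
```

    ASCII characters use two rows of digits, while a two-byte character
    such as è uses four, so the text itself never has to be padded.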

    Or maybe something else entirely? My understanding of your comment here
    > # the insertion of one or more spaces after the unicode characters must be
    > # done manually on the output (lP)

    is that you want to insert spaces, but that means changing the text
    itself. I'm not so sure that's a good thing, but if that really is
    what you want, then sure.

    > # -*- coding: utf-8 -*-
    > # px.py # python 2.6.6


    I would advise, by-the-by, that you consider targeting Python 3. There
    are heaps of extremely handy Unicode features in Python 3, most
    notably that the default string type is Unicode characters, not bytes.
    Also, Py3 has a future; Py2 will receive only bugfixes and security
    patches.

    > try:
    >     sFN=sys.argv[1]
    >     f=open(sFN)
    >     lF=f.readlines()
    >     f.close()
    > except:
    >     sHD=sys.stdin.read().replace('\n','~\n')
    >     lF=sHD.split('\n')
    >     for n in xrange(len(lF)):
    >         lF[n]=lF[n].replace('~','\n')


    This is reading the entire file in before producing any output. This
    plays badly with other Unix tools (you can't, for instance, 'tail -f
    some-file|your-script' to monitor a growing log), and causes extensive
    memory usage. Since you then (as far as I can see) always work
    line-by-line, you would probably do better to simply iterate over the
    lines of input.

    Also: I'm not sure what your replace calls are meant to do, but you're
    turning all tildes into newlines. It should be possible to iterate
    over the lines without this hassle.
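
    As a small Python 3 illustration of iterating lazily: a file object
    opened in binary mode yields one line at a time with its newline still
    attached, so no tilde marker is needed. The io.BytesIO object below is
    only a stand-in for an open file or sys.stdin.buffer:

```python
import io

# Hypothetical sample input standing in for a real file or stdin.
stream = io.BytesIO(b"tilde ~ stays\nsecond line\n")

lines = []
for line in stream:        # one line per iteration, nothing read ahead of need
    lines.append(line)     # each line keeps its trailing b"\n"

print(lines)
```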

    > lP=[]


    I have no idea what this name is supposed to represent; longer names
    are more usually preferred. Also, all-uppercase names tend to be for
    constants, which will confuse people.

    > for k in xrange(len(lNoSpaces)):
    >     sHex=lNoSpaces[k].encode('hex')


    You're iterating up to the length of something and then using the
    index only to retrieve the current element. There's an easier way to
    spell that:

    for char in lNoSpaces:
        sHex=char.encode('hex')

    > sH=''
    > for c in xrange(0,len(sHexNT),2):
    >     sH += sHexNT[c]
    > sHexH += sH+' '
    >
    > sL=''
    > for c in xrange(1,len(sHexNT),2):
    >     sL += sHexNT[c]
    > sHexL += sL+' '


    Here's a really fancy trick you can do: Slicing with a step.
    Demonstrating with the interactive interpreter:

    >>> s = "7177657274793a206e6f74202020756e69636f6465206275742020206173636969"
    >>> hi = s[::2]
    >>> lo = s[1::2]
    >>> hi
    '776777326672227666666267722267666'
    >>> lo
    '175249a0ef40005e93f45025400013399'

    I love Python!
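
    The same trick wrapped in a small Python 3 helper, as a sketch;
    nibble_rows is an illustrative name, and bytes.hex() stands in for the
    Python 2 encode('hex') call:

```python
def nibble_rows(data):
    """Return (high, low) hex-digit strings for a bytes object."""
    digits = data.hex()          # two hex digits per byte, e.g. b"q" -> "71"
    return digits[::2], digits[1::2]

hi, lo = nibble_rows(b"qwerty:")
print(hi)
print(lo)
```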

    > for n in xrange(0,len(lP),3):
    >     try:
    >         lP[n].encode('utf-8')
    >     except:
    >         print lP[n], # to be modified by hand in presence of utf-8 char
    >         print lP[n+1], # to syncronize ascii and hex
    >         print lP[n+2],


    Okay... this is something I'm not understanding. I *think* that lP[n]
    here is your original text, as a byte string. In that case, what you
    want here is to decode it as UTF-8, I think. But I'm not sure.
    Recommendation: Don't use a bare 'except', unless you're logging an
    exception and moving on. You're masking an error here; when you
    attempt to *encode* a byte string as UTF-8, what it actually does is
    first try to *decode* it as ASCII (which produces a Unicode string),
    then encode the result. Again using the interactive interpreter:

    >>> '\xc2\xa2'.encode('utf-8')
    Traceback (most recent call last):
      File "<pyshell#1>", line 1, in <module>
        '\xc2\xa2'.encode('utf-8')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
    >>> '\xc2\xa2'.decode('utf-8')
    u'\xa2'

    This is another area where Python 3 would help, because you would be
    working with strings most of the way through.
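
    For comparison, a short Python 3 sketch of the same round trip. Bytes
    and text are separate types there, so the accidental encode-a-byte-string
    step cannot happen, and the check can catch the specific exception
    instead of using a bare except. is_valid_utf8 is a hypothetical helper
    name:

```python
def is_valid_utf8(data):
    """Return True if the bytes object decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:   # catch only the error we expect
        return False
    return True

data = b"\xc2\xa2"                 # UTF-8 bytes of the cent sign
text = data.decode("utf-8")        # bytes -> str
round_trip = text.encode("utf-8")  # str -> bytes
bytes_can_encode = hasattr(data, "encode")   # bytes has no .encode() in Py3

print(repr(text), round_trip == data, bytes_can_encode)
print(is_valid_utf8(b"\xc3\xa8"), is_valid_utf8(b"\xc3"))
```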

    Here's how I'd put together a Py3 version of that code - this is
    pseudo-code, but Python's pretty good at executing pseudo-code...

    import fileinput
    import binascii
    for line in fileinput.input(): # handles all the details of args-or-stdin
        output=['line','headers','here']
        for char in line:
            if char==' ':
                output[*]+=' ' # not Python syntax but so close
                continue
            utf8=char.encode('utf-8')
            output[0]+=char+' '*(len(utf8)-1) # pad the text to the width of its hex
            utf8=binascii.hexlify(utf8).decode()
            output[1]+=utf8[::2]; output[2]+=utf8[1::2]
        print(output[*])

    That's actually very close to real code, feel free to flesh it out a
    smidge and run it :) I seriously was going to start by writing just
    pseudo-code, but it got closer and closer to actual working Python...
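
    One possible way to flesh that out into runnable Python 3, offered as
    a sketch rather than a finished tool. dump_line and its width parameter
    are invented for this example, the trailing newline is dropped instead
    of dumped, and the padding assumes each character takes one screen
    column (wide or combining characters would still drift):

```python
import binascii

def dump_line(number, line, width=3):
    """Return three rows: numbered text, high nibbles, low nibbles."""
    rows = [str(number).zfill(width) + " ", " " * (width + 1), " " * (width + 1)]
    for char in line.rstrip("\n"):
        if char == " ":
            for i in range(3):
                rows[i] += " "       # spaces are shown as spaces in every row
            continue
        utf8 = char.encode("utf-8")
        # Pad the text row so a multi-byte character is as wide as its hex.
        rows[0] += char + " " * (len(utf8) - 1)
        digits = binascii.hexlify(utf8).decode("ascii")
        rows[1] += digits[::2]
        rows[2] += digits[1::2]
    return rows

rows = dump_line(5, "# qwerty: non \u00e8 unicode bens\u00ec ascii\n")
for row in rows:
    print(row)
```

    To process real input, the same function could be called from a
    "for line in fileinput.input():" loop with a running line counter.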

    Oh, and since we have a current thread about copyright and license:
    This is copyright 2013 Chris Angelico, MIT license. So go ahead, use
    it. :)

    ChrisA
    Chris Angelico, Jun 10, 2013
    #2

  3. Guest

    Re: Simple converter of files into their hex components... but I can't arrange utf-8 parts!

    Hi Chris,
    your criticisms are welcome! But perhaps most of them were
    caused by font problems in my posting.
    Google should use a "mono" font by default!
    Or perhaps it was a mistake on my part not to configure
    the output of my post correctly (I didn't even change my nickname...
    so you read a selfish "me" instead of "Blatt"!).
    I will try in this reply to produce more readable output, but I'm not
    sure it will work... (I need help from experts!...).
    For now, my only advice is to copy/paste my post into an editor.

    Coming to your criticisms...

    01
    >> Functions and classes are entirely optional in Python :)

    That, together with my not being 'pythonic', is correct...
    but it is so difficult to change programming style at my age!...

    02
    > 004 # qwerty: not _ unicode but _ ascii
    > ___ 2 7767773 667 _ 7666666 677 _ 676660
    > ___ 3 175249a ef4 _ 5e93f45 254 _ 13399a
    >
    > 005 # qwerty: non è unicode bensì ascii
    > ___ 2 7767773 666 ca 7666666 6667ca 676660
    > ___ 3 175249a efe 38 5e93f45 25e33c 13399a


    >>I'm not 100% sure of what you're trying to accomplish here.


    As you can see from the corrected output (I hope!), I am indeed
    trying to produce a hex-dump output ('_' is a "place-holder"
    for a space, which itself is not included in the hex output).
    The 2s and 3s are not lines 2 and 3, but 23 read vertically (the hex of #)!
    In the editor you can better see my problem with utf-8 chars...
    They cause the loss of synchronization. For example, 'unicode' after
    the utf-8 char is no longer aligned with its corresponding hex.
    To keep the synchronization, you would have to insert a space after the
    utf-8 char... but I didn't succeed (at least programmatically...).

    03
    >> ...you consider targeting Python 3.

    I probably will never use Python 3! I am perfectly happy with bugfixes
    and security patches...
    The reason is that I don't need all the "bells and whistles" of Py3,
    especially in the field of OOP!

    04
    >> You can do (synchronization) by simply opening the file as
    >> Latin-1 (iso-8859-1) instead of UTF-8.

    I already synchronized the output... but afterwards, in the editor...
    I want to do this programmatically, so I can use it from the
    console with bash and pipes.

    05
    >> Or perhaps it'd be better to string them out further vertically?

    I considered this solution, but if I succeed in synchronizing everything
    programmatically... there is not much of a difference.

    06
    >> ... reading the entire file in before producing any output

    No problems of speed...

    07
    >> lP=[]
    >> I have no idea what this name is supposed to represent

    Simply a list initialization to empty...

    08
    >> Here's a really fancy trick you can do: Slicing with a step.

    Your solution is much better!

    I can go further, but it's better that you run my script (if you want)
    to get a better understanding. If you are really interested, you can also
    try your version (probably better... I will try it).

    Bye, Blatt.
    , Jun 10, 2013
    #3
