regex/lambda black magic

Discussion in 'Python' started by Andrew Robert, May 25, 2006.

  1. Hi everyone,

    I have two test scripts, an encoder and a decoder.

    The encoder, listed below, works perfectly.


    import re,sys
    output = open(r'e:\pycode\out_test.txt','wb')
    for line in open(r'e:\pycode\sigh.txt','rb') :
        output.write( re.sub(r'([^\w\s])', lambda s: '%%%2X' %
                      ord(s.group()), line))


    The decoder, well, I have hopes.


    import re,sys
    output = open(r'e:\pycode\new_test.txt','wb')
    for line in open(r'e:\pycode\out_test.txt','rb') :
        output.write( re.sub(r'([^\w\s])', lambda s: chr(int(s.group(), 16))
                      % ord(s.group()), line))


    The decoder generates the following traceback:

    Traceback (most recent call last):
      File "E:\pycode\sample_decode_file_specials_from_hex.py", line 9, in ?
        output.write( re.sub(r'([^\w\s])', lambda s: chr(int(s.group(), 16))
        % ord(s.group()), line))
      File "C:\Python24\lib\sre.py", line 142, in sub
        return _compile(pattern, 0).sub(repl, string, count)
      File "E:\pycode\sample_decode_file_specials_from_hex.py", line 9, in <lambda>
        output.write( re.sub(r'([^\w\s])', lambda s: chr(int(s.group(), 16))
        % ord(s.group()), line))
    ValueError: invalid literal for int(): %

    Does anyone see what I am doing wrong?
     
    Andrew Robert, May 25, 2006
    #1

  2. Andrew Robert

    Max Erickson Guest

    Andrew Robert <> wrote:

    > ValueError: invalid literal for int(): %
    >
    > Does anyone see what I am doing wrong?
    >


    Try getting rid of the lambda, it might make things clearer and it
    simplifies debugging. Something like (this is just a sketch):

    def callback(match):
        print match.group()
        return chr(int(match.group(),16)) % ord(match.group())

    output.write(re.sub(r'([^\w\s])', callback, line))

    It looks like your match.group is a '%' character:

    >>> int('%', 16)

    Traceback (most recent call last):
    File "<pyshell#108>", line 1, in ?
    int('%', 16)
    ValueError: invalid literal for int(): %
    >>>
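
As the session above suggests, the pattern `[^\w\s]` hands the callback the literal `%`, not the hex digits that follow it. A minimal sketch of a decoder that captures the two hex digits instead. It is written for Python 3 rather than the Python 2.4 used in this thread, and `decode_escapes` is a hypothetical name, not something from the original scripts:

```python
import re

# Hypothetical helper, not part of the original scripts.
# Matches a '%' followed by exactly two uppercase hex digits and
# captures only the digits, so int() never sees the '%' itself.
ESCAPE = re.compile(r'%([0-9A-F]{2})')

def decode_escapes(text):
    """Replace each %XX escape with the character it encodes."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

print(decode_escapes('Hello%2C world%21'))  # Hello, world!
```

A `%` that is not followed by two hex digits is left untouched, so text such as `100%` passes through unchanged.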



    max
     
    Max Erickson, May 25, 2006
    #2

  3. Max Erickson wrote:
    <snip>

    </snip>

    > Try getting rid of the lambda, it might make things clearer and it
    > simplifies debugging. Something like (this is just a sketch):
    >
    >
    > max
    >

    Yeah.. trying to keep everything on one line is becoming something of a
    problem.

    To make this easier, I followed something from another poster and came
    up with this.

    import re,base64

    # Evaluate captured character as hex
    def ret_hex(value):
        return base64.b16encode(value)

    def ret_ascii(value):
        return base64.b16decode(value)

    # Evaluate the value of whatever was matched
    def eval_match(match):
        return ret_ascii(match.group(0))

    # Evaluate the value of whatever was matched
    # def eval_match(match):
    #     return ret_hex(match.group(0))

    out=open(r'e:\pycode\sigh.new2','wb')

    # Read each line, pass any matches on line to function for
    # line in file.readlines():
    for line in open(r'e:\pycode\sigh.new','rb'):
        print (re.sub('[^\w\s]',eval_match, line))



    The char to hex pass works but omits the leading % at the start of each
    hex value.

    i.e. 22 instead of %22


    The hex to char pass does not appear to work at all.

    No error is generated. It just appears to be ignored.
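
One possible explanation for the silent no-op, offered as a guess rather than a diagnosis: `b16encode` emits only the characters 0-9 and A-F, which `\w` matches, so after the encoding pass `[^\w\s]` has nothing left to match and `re.sub` returns each line unchanged. A small Python 3 sketch (the sample string is made up):

```python
import re
from base64 import b16encode

# Made-up sample line; any text containing punctuation would do.
original = 'say "hi"'

# Encoding pass as in the post, adapted to Python 3 bytes/str handling.
encoded = re.sub(r'[^\w\s]',
                 lambda m: b16encode(m.group(0).encode('ascii')).decode('ascii'),
                 original)
print(encoded)  # say 22hi22

# Decoding pass with the same pattern: nothing matches, so nothing changes.
print(re.findall(r'[^\w\s]', encoded))  # []
```

Without a marker such as `%`, a decoder also cannot tell an encoded `22` from a literal `22` in the text.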
     
    Andrew Robert, May 25, 2006
    #3
  4. Andrew Robert

    Max Erickson Guest

    Andrew Robert <> wrote:
    > import re,base64
    >
    > # Evaluate captured character as hex
    > def ret_hex(value):
    >     return base64.b16encode(value)
    >
    > def ret_ascii(value):
    >     return base64.b16decode(value)
    >


    Note that you can just do this:

    from base64 import b16encode,b16decode

    and use them directly, or

    ret_hex=base64.b16encode

    ret_ascii=base64.b16decode

    if you want different names.


    As far as the rest of your problem goes, I only see one pass being
    made, is the code you posted the code you are running?

    Also, is there some reason that base64.b16encode should be returning a
    string that starts with a '%'?

    All I would expect is:

    base64.b16decode(base64.b16encode(input))==input

    other than that I have no idea about the expected behavior.
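
The round-trip property described above can be checked directly. A short sketch in Python 3, where `b16encode` and `b16decode` take and return `bytes`; the sample value is arbitrary:

```python
from base64 import b16encode, b16decode

data = b'"quoted" & <tagged>'   # arbitrary sample input

encoded = b16encode(data)
print(encoded[:4])              # b'2271'  (no '%' anywhere in the output)

# Decoding the encoded form gives back the original bytes.
assert b16decode(encoded) == data
```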

    max
     
    Max Erickson, May 25, 2006
    #4
  5. Hi Everyone,


    Thanks for all of your patience on this.

    I finally got it to work.


    Here is the completed test code showing what is going on.

    Not cleaned up yet but it works for proof-of-concept purposes.



    #!/usr/bin/python

    import re,base64

    # Evaluate captured character as hex
    def ret_hex(value):
        return '%'+base64.b16encode(value)

    # Evaluate the value of whatever was matched
    def enc_hex_match(match):
        return ret_hex(match.group(0))

    def ret_ascii(value):
        return base64.b16decode(value)

    # Evaluate the value of whatever was matched
    def enc_ascii_match(match):

        arg=match.group()

        #remove the artificially inserted % sign
        arg=arg[1:]

        # decode the result
        return ret_ascii(arg)

    def file_encoder():
        # Read each line, pass any matches on line to function for
        # line in file.readlines():
        output=open(r'e:\pycode\sigh.new','wb')
        for line in open(r'e:\pycode\sigh.txt','rb'):
            output.write( (re.sub('[^\w\s]',enc_hex_match, line)) )
        output.close()


    def file_decoder():
        # Read each line, pass any matches on line to function for
        # line in file.readlines():

        output=open(r'e:\pycode\sigh.new2','wb')
        for line in open(r'e:\pycode\sigh.new','rb'):
            output.write(re.sub('%[0-9A-F][0-9A-F]',enc_ascii_match, line))
        output.close()




    file_encoder()

    file_decoder()
     
    Andrew Robert, May 25, 2006
    #5
  6. Andrew Robert

    John Machin Guest

    On 26/05/2006 4:33 AM, Andrew Robert wrote:
    > Hi Everyone,
    >
    >
    > Thanks for all of your patience on this.
    >
    > I finally got it to work.
    >
    >
    > Here is the completed test code showing what is going on.


    Consider doing what you should have done at the start: state what you
    are trying to achieve. Not very many people have the patience that Max
    showed in ploughing through code that was both fugly and broken in order
    to determine what it should have been doing.

    What is the motivation for encoding characters like
    ,./<>;':"`~!@#$^&*()-+=[]\{}|

    >
    > Not cleaned up yet but it works for proof-of-concept purposes.
    >
    >
    >
    > #!/usr/bin/python
    >
    > import re,base64
    >
    > # Evaluate captured character as hex
    > def ret_hex(value):
    >     return '%'+base64.b16encode(value)


    This is IMHO rather pointless and obfuscatory, calling a function in a
    module when it can be done by a standard language feature. Why did you
    change it from the original "%%%2X" % value (which would have been
    better IMHO done as "%%%02X" % value)?
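
The difference between the two format strings only shows up for character codes below 0x10: `%2X` pads with a space, `%02X` with a zero. A quick illustration (the control character is just an example):

```python
code = ord('\x07')            # a control character with a one-digit hex value

print(repr('%%%2X' % code))   # '% 7'  space-padded, not a valid %XX escape
print(repr('%%%02X' % code))  # '%07'  zero-padded, always two hex digits
```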

    >
    > # Evaluate the value of whatever was matched
    > def enc_hex_match(match):
    >     return ret_hex(match.group(0))


    Why a second level of function call?

    >
    > def ret_ascii(value):
    >     return base64.b16decode(value)


    See above.


    >
    > # Evaluate the value of whatever was matched
    > def enc_ascii_match(match):
    >
    >     arg=match.group()
    >
    >     #remove the artificially inserted % sign


    Don't bother, just ignore it.
        return chr(int(match.group()[1:], 16))

    >     arg=arg[1:]
    >
    >     # decode the result
    >     return ret_ascii(arg)
    >
    > def file_encoder():
    >     # Read each line, pass any matches on line to function for
    >     # line in file.readlines():
    >     output=open(r'e:\pycode\sigh.new','wb')
    >     for line in open(r'e:\pycode\sigh.txt','rb'):
    >         output.write( (re.sub('[^\w\s]',enc_hex_match, line)) )
    >     output.close()


    Why are you opening the file with "rb" but then reading it a line at a time?
    For a binary file, the whole file may be one "line"; it would be safer
    to read() blocks of, say, 8 KB.
    For a text file, the only point of the binary mode might be to avoid any
    sort of problem caused by OS-dependent definitions of "newline", i.e.
    CRLF vs LF. I note that as \r and \n are whitespace, you are not
    encoding them as %0D and %0A; is this deliberate?
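
A sketch of the block-wise approach mentioned above, assuming Python 3 and an encoder that works one byte at a time, so a block boundary can never split a match. The function name, the block size constant, and the in-memory streams are illustrative choices, not part of the original script:

```python
import io
import re

BLOCK_SIZE = 8192  # 8 KB, as suggested above

# Bytes pattern: any single byte that is neither a word character nor whitespace.
SPECIAL = re.compile(rb'[^\w\s]')

def encode_stream(src, dst, block_size=BLOCK_SIZE):
    """Copy src to dst, replacing each special byte with a %XX escape."""
    while True:
        block = src.read(block_size)
        if not block:
            break
        dst.write(SPECIAL.sub(lambda m: b'%%%02X' % m.group(0)[0], block))

# Usage with in-memory streams standing in for the files.
src = io.BytesIO(b'a+b=c\r\n')
dst = io.BytesIO()
encode_stream(src, dst)
print(dst.getvalue())  # b'a%2Bb%3Dc\r\n'
```

Decoding in blocks needs more care than encoding, because a three-byte `%XX` escape can straddle a block boundary.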

    >
    > def file_decoder():
    >     # Read each line, pass any matches on line to function for
    >     # line in file.readlines():
    >
    >     output=open(r'e:\pycode\sigh.new2','wb')
    >     for line in open(r'e:\pycode\sigh.new','rb'):
    >         output.write(re.sub('%[0-9A-F][0-9A-F]',enc_ascii_match, line))
    >     output.close()
    >
    >
    >
    >
    > file_encoder()
    >
    > file_decoder()
     
    John Machin, May 25, 2006
    #6