regex/lambda black magic

Andrew Robert · May 25, 2006

Hi everyone,

I have two test scripts, an encoder and a decoder.

The encoder, listed below, works perfectly.

import re,sys
output = open(r'e:\pycode\out_test.txt','wb')
for line in open(r'e:\pycode\sigh.txt','rb') :
    output.write( re.sub(r'([^\w\s])', lambda s: '%%%2X' %
                         ord(s.group()), line))

The decoder, well, I have hopes.

import re,sys
output = open(r'e:\pycode\new_test.txt','wb')
for line in open(r'e:\pycode\out_test.txt','rb') :
    output.write( re.sub(r'([^\w\s])', lambda s: chr(int(s.group(), 16))
                         % ord(s.group()), line))

The decoder generates the following traceback:

Traceback (most recent call last):
  File "E:\pycode\sample_decode_file_specials_from_hex.py", line 9, in ?
    output.write( re.sub(r'([^\w\s])', lambda s: chr(int(s.group(), 16))
    % ord(s.group()), line))
  File "C:\Python24\lib\sre.py", line 142, in sub
    return _compile(pattern, 0).sub(repl, string, count)
  File "E:\pycode\sample_decode_file_specials_from_hex.py", line 9, in <lambda>
    output.write( re.sub(r'([^\w\s])', lambda s: chr(int(s.group(), 16))
    % ord(s.group()), line))
ValueError: invalid literal for int(): %

Does anyone see what I am doing wrong?

Max Erickson · May 25, 2006

Andrew Robert said:
ValueError: invalid literal for int(): %

Does anyone see what I am doing wrong?

Try getting rid of the lambda; it might make things clearer, and it
simplifies debugging. Something like this (just a sketch):

def callback(match):
    print match.group()
    return chr(int(match.group(),16)) % ord(match.group())

output.write(re.sub(r'([^\w\s])', callback, line))

It looks like your match.group is a '%' character:
Traceback (most recent call last):
  File "<pyshell#108>", line 1, in ?
    int('%', 16)
ValueError: invalid literal for int(): %

max
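
To illustrate the diagnosis above: once the text has been encoded, every special character has become `%XX`, so the encoder's pattern `[^\w\s]` matches only the `%` sign itself, and `int('%', 16)` fails. Below is a minimal sketch in Python 3 syntax (the thread's code is Python 2). The sample string and the helper name `decode_match` are hypothetical, not taken from the posts.

```python
import re

encoded = 'say %22hi%22'  # hypothetical sample: '"' encoded as %22

# In already-encoded text, the encoder's pattern finds only the '%' signs.
percent_hits = re.findall(r'[^\w\s]', encoded)
print(percent_hits)

# A decoder pattern has to capture the two hex digits that follow the '%'.
def decode_match(match):
    # group(1) holds the hex digits without the leading '%'
    return chr(int(match.group(1), 16))

decoded = re.sub(r'%([0-9A-Fa-f]{2})', decode_match, encoded)
print(decoded)
```

The key change is in the pattern, not in the lambda-versus-function choice: the callback only ever sees what the pattern matched.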

Andrew Robert · May 25, 2006

Max Erickson wrote:
<snip>

Try getting rid of the lambda; it might make things clearer, and it
simplifies debugging. Something like this (just a sketch):

max

Yeah.. trying to keep everything on one line is becoming something of a
problem.

To make this easier, I followed something from another poster and came
up with this.

import re,base64

# Evaluate captured character as hex
def ret_hex(value):
    return base64.b16encode(value)

def ret_ascii(value):
    return base64.b16decode(value)

# Evaluate the value of whatever was matched
def eval_match(match):
    return ret_ascii(match.group(0))

# Evaluate the value of whatever was matched
# def eval_match(match):
#     return ret_hex(match.group(0))

out=open(r'e:\pycode\sigh.new2','wb')

# Read each line, pass any matches on line to function for
# line in file.readlines():
for line in open(r'e:\pycode\sigh.new','rb'):
    print (re.sub('[^\w\s]',eval_match, line))

The char to hex pass works but omits the leading % at the start of each
hex value.

i.e. 22 instead of %22

The hex to char pass does not appear to work at all.

No error is generated. It just appears to be ignored.

Max Erickson · May 25, 2006

Andrew Robert said:
import re,base64

# Evaluate captured character as hex
def ret_hex(value):
    return base64.b16encode(value)

def ret_ascii(value):
    return base64.b16decode(value)

Note that you can just do this:

from base64 import b16encode,b16decode

and use them directly, or

ret_hex=base64.b16encode

ret_ascii=base64.b16decode

if you want different names.

As for the rest of your problem, I only see one pass being made. Is the
code you posted the code you are running?

Also, is there some reason that base64.b16encode should be returning a
string that starts with a '%'?

All I would expect is:

base64.b16decode(base64.b16encode(input))==input

other than that I have no idea about the expected behavior.

max
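
The round-trip expectation above can be checked directly. This is a small sketch in Python 3 syntax, where the `base64` functions take and return bytes; the sample input is hypothetical. It also shows that `b16encode` produces bare hex digits, so any `%` marker has to be added and stripped by the caller.

```python
from base64 import b16encode, b16decode

data = b'"'  # hypothetical sample input (bytes in Python 3)

hex_form = b16encode(data)
print(hex_form)  # uppercase hex digits only, no '%' prefix

# Decoding the encoded form gives back the original input.
assert b16decode(hex_form) == data

# A '%' marker is the caller's responsibility in both directions.
marked = b'%' + hex_form
assert b16decode(marked[1:]) == data
```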

Andrew Robert · May 25, 2006

Hi Everyone,

Thanks for all of your patience on this.

I finally got it to work.

Here is the completed test code showing what is going on.

Not cleaned up yet but it works for proof-of-concept purposes.

#!/usr/bin/python

import re,base64

# Evaluate captured character as hex
def ret_hex(value):
    return '%'+base64.b16encode(value)

# Evaluate the value of whatever was matched
def enc_hex_match(match):
    return ret_hex(match.group(0))

def ret_ascii(value):
    return base64.b16decode(value)

# Evaluate the value of whatever was matched
def enc_ascii_match(match):

    arg=match.group()

    # remove the artificially inserted % sign
    arg=arg[1:]

    # decode the result
    return ret_ascii(arg)

def file_encoder():
    # Read each line, pass any matches on line to function for
    # line in file.readlines():
    output=open(r'e:\pycode\sigh.new','wb')
    for line in open(r'e:\pycode\sigh.txt','rb'):
        output.write( (re.sub('[^\w\s]',enc_hex_match, line)) )
    output.close()

def file_decoder():
    # Read each line, pass any matches on line to function for
    # line in file.readlines():

    output=open(r'e:\pycode\sigh.new2','wb')
    for line in open(r'e:\pycode\sigh.new','rb'):
        output.write(re.sub('%[0-9A-F][0-9A-F]',enc_ascii_match, line))
    output.close()

file_encoder()

file_decoder()

John Machin · May 25, 2006

Hi Everyone,

Thanks for all of your patience on this.

I finally got it to work.

Here is the completed test code showing what is going on.

Consider doing what you should have done at the start: state what you
are trying to achieve. Not very many people have the patience that Max
showed in ploughing through code that was both fugly and broken in order
to determine what it should have been doing.

What is the motivation for encoding characters like

Not cleaned up yet but it works for proof-of-concept purposes.

#!/usr/bin/python

import re,base64

# Evaluate captured character as hex
def ret_hex(value):
    return '%'+base64.b16encode(value)

This is IMHO rather pointless and obfuscatory, calling a function in a
module when it can be done by a standard language feature. Why did you
change it from the original "%%%2X" % value (which would have been
better IMHO done as "%%%02X" % value)?
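
The difference between the two format strings only shows up for values below 0x10. This is a small sketch in Python 3 syntax, using a hypothetical control character as the sample; the variable names are illustrative.

```python
import re

value = ord('\x07')  # 7, a code point below 0x10

space_padded = '%%%2X' % value   # width 2, padded with a space
zero_padded = '%%%02X' % value   # width 2, padded with a zero
print(repr(space_padded), repr(zero_padded))

# A decoder pattern that expects two hex digits only recognizes the
# zero-padded form.
decoder_pattern = r'%[0-9A-F][0-9A-F]'
print(re.search(decoder_pattern, space_padded))
print(re.search(decoder_pattern, zero_padded))
```

With the space-padded form, such characters would pass through the decoder unchanged instead of being restored.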

# Evaluate the value of whatever was matched
def enc_hex_match(match):
    return ret_hex(match.group(0))

Why a second level of function call?

def ret_ascii(value):
    return base64.b16decode(value)

See above.

# Evaluate the value of whatever was matched
def enc_ascii_match(match):

    arg=match.group()

    # remove the artificially inserted % sign

Don't bother with a separate step; just skip over it when converting:
    return chr(int(match.group()[1:], 16))

    arg=arg[1:]

    # decode the result
    return ret_ascii(arg)

def file_encoder():
    # Read each line, pass any matches on line to function for
    # line in file.readlines():
    output=open(r'e:\pycode\sigh.new','wb')
    for line in open(r'e:\pycode\sigh.txt','rb'):
        output.write( (re.sub('[^\w\s]',enc_hex_match, line)) )
    output.close()

Why are you opening the file with "rb" but then reading it a line at a time?
For a binary file, the whole file may be one "line"; it would be safer
to read() blocks of, say, 8 KB.
For a text file, the only point of the binary mode might be to avoid any
sort of problem caused by OS-dependent definitions of "newline", i.e.
CRLF vs LF. I note that as \r and \n are whitespace, you are not
encoding them as %0D and %0A; is this deliberate?
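
Both points can be sketched together. The example below is in Python 3 syntax and works on bytes; the names `BLOCK_SIZE`, `encode_match`, and `encode_stream`, the in-memory sample data, and the alternative pattern are all hypothetical. It reads fixed-size blocks instead of lines, and the pattern argument controls whether CR and LF are left alone or encoded.

```python
import io
import re

BLOCK_SIZE = 8 * 1024  # 8 KB blocks, as suggested above

def encode_match(match):
    # match.group() is a single byte; format it as %XX
    return b'%%%02X' % ord(match.group())

def encode_stream(src, dst, pattern=rb'[^\w\s]'):
    # Read fixed-size blocks rather than relying on line boundaries.
    while True:
        block = src.read(BLOCK_SIZE)
        if not block:
            break
        dst.write(re.sub(pattern, encode_match, block))

sample = b'a "b"\r\n'  # hypothetical sample data

# Default pattern: whitespace, including CR and LF, is left as-is.
kept = io.BytesIO()
encode_stream(io.BytesIO(sample), kept)
print(kept.getvalue())

# Alternative pattern: only the space character is exempt, so CR and LF
# are encoded as well.
escaped = io.BytesIO()
encode_stream(io.BytesIO(sample), escaped, pattern=rb'[^\w ]')
print(escaped.getvalue())
```

Encoding block by block is safe because each byte is encoded independently. Decoding the same way would need extra care, since a `%XX` sequence could straddle a block boundary.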

def file_decoder():
    # Read each line, pass any matches on line to function for
    # line in file.readlines():

    output=open(r'e:\pycode\sigh.new2','wb')
    for line in open(r'e:\pycode\sigh.new','rb'):
        output.write(re.sub('%[0-9A-F][0-9A-F]',enc_ascii_match, line))
    output.close()

file_encoder()

file_decoder()
