regex/lambda black magic

Andrew Robert

Hi everyone,

I have two test scripts, an encoder and a decoder.

The encoder, listed below, works perfectly.


import re,sys
output = open(r'e:\pycode\out_test.txt','wb')
for line in open(r'e:\pycode\sigh.txt','rb') :
    output.write( re.sub(r'([^\w\s])', lambda s: '%%%2X' %
                  ord(s.group()), line))


The decoder, well, I have hopes.


import re,sys
output = open(r'e:\pycode\new_test.txt','wb')
for line in open(r'e:\pycode\out_test.txt','rb') :
    output.write( re.sub(r'([^\w\s])', lambda s: chr(int(s.group(), 16))
                  % ord(s.group()), line))


The decoder generates the following traceback:

Traceback (most recent call last):
  File "E:\pycode\sample_decode_file_specials_from_hex.py", line 9, in ?
    output.write( re.sub(r'([^\w\s])', lambda s: chr(int(s.group(), 16))
    % ord(s.group()), line))
  File "C:\Python24\lib\sre.py", line 142, in sub
    return _compile(pattern, 0).sub(repl, string, count)
  File "E:\pycode\sample_decode_file_specials_from_hex.py", line 9, in <lambda>
    output.write( re.sub(r'([^\w\s])', lambda s: chr(int(s.group(), 16))
    % ord(s.group()), line))
ValueError: invalid literal for int(): %

Does anyone see what I am doing wrong?
 
Max Erickson

Andrew Robert said:
ValueError: invalid literal for int(): %

Does anyone see what I am doing wrong?

Try getting rid of the lambda; it might make things clearer, and it
simplifies debugging. Something like this (just a sketch):

def callback(match):
    print match.group()
    return chr(int(match.group(),16)) % ord(match.group())

output.write(re.sub(r'([^\w\s])', callback, line))

It looks like your match.group() is a '%' character:

Traceback (most recent call last):
  File "<pyshell#108>", line 1, in ?
    int('%', 16)
ValueError: invalid literal for int(): %
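
To see why the callback receives a bare '%', it can help to list what the decoder's pattern matches in text that has already been encoded. The following is a minimal sketch in Python 3 syntax (the thread itself uses Python 2.4); the sample string is made up for illustration and is not taken from the poster's files.

```python
import re

# Hypothetical sample of encoder output: '"' became %22 and '!' became %21.
encoded = 'say %22hi%22%21'

# The decoder reuses the encoder's pattern, which matches single
# non-word, non-whitespace characters. Hex digits are word characters,
# so in encoded text the only thing left to match is '%'.
matches = re.findall(r'[^\w\s]', encoded)
print(matches)

# Each match is therefore '%', which int() cannot parse as hex.
try:
    int(matches[0], 16)
    failed = False
except ValueError:
    failed = True
print(failed)

# A pattern that matches the whole escape captures the hex digits as well.
escapes = re.findall(r'%([0-9A-F]{2})', encoded)
print(escapes)
```

This suggests the decoder needs its own pattern rather than the encoder's, since the two passes look for different things.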

max
 
Andrew Robert

Max Erickson wrote:
<snip>

Try getting rid of the lambda; it might make things clearer, and it
simplifies debugging. Something like this (just a sketch):


max

Yeah, trying to keep everything on one line is becoming something of a
problem.

To make this easier, I followed something from another poster and came
up with this.

import re,base64

# Evaluate captured character as hex
def ret_hex(value):
    return base64.b16encode(value)

def ret_ascii(value):
    return base64.b16decode(value)

# Evaluate the value of whatever was matched
def eval_match(match):
    return ret_ascii(match.group(0))

# Evaluate the value of whatever was matched
# def eval_match(match):
#     return ret_hex(match.group(0))

out=open(r'e:\pycode\sigh.new2','wb')

# Read each line, pass any matches on line to function for
# line in file.readlines():
for line in open(r'e:\pycode\sigh.new','rb'):
    print (re.sub('[^\w\s]',eval_match, line))



The char to hex pass works but omits the leading % at the start of each
hex value.

i.e., 22 instead of %22


The hex to char pass does not appear to work at all.

No error is generated. It just appears to be ignored.
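
One possible explanation for both symptoms, sketched in Python 3 syntax with a made-up sample line: base64.b16encode returns only the hex digits, so no '%' appears unless the callback adds one. The resulting text then contains nothing that '[^\w\s]' can match, so a second re.sub pass never calls its callback and returns the line unchanged without raising an error. The names to_hex and from_hex are illustrative, not from the post.

```python
import base64
import re

PATTERN = re.compile(rb'[^\w\s]')

def to_hex(match):
    # b16encode returns just the hex digits, with no '%' prefix.
    return base64.b16encode(match.group(0))

def from_hex(match):
    # Would decode a match, but is never called in the second pass below.
    return base64.b16decode(match.group(0))

line = b'say "hi"!\n'
encoded = PATTERN.sub(to_hex, line)
print(encoded)

# The encoded line holds only word characters and whitespace, so the
# same pattern finds nothing and sub() hands the input back untouched.
second_pass = PATTERN.sub(from_hex, encoded)
print(second_pass == encoded)
```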
 
Max Erickson

Andrew Robert said:
import re,base64

# Evaluate captured character as hex
def ret_hex(value):
    return base64.b16encode(value)

def ret_ascii(value):
    return base64.b16decode(value)

Note that you can just do this:

from base64 import b16encode,b16decode

and use them directly, or

ret_hex=base64.b16encode

ret_ascii=base64.b16decode

if you want different names.


As far as the rest of your problem goes, I only see one pass being
made. Is the code you posted the code you are running?

Also, is there some reason that base64.b16encode should be returning a
string that starts with a '%'?

All I would expect is:

base64.b16decode(base64.b16encode(input))==input

other than that I have no idea about the expected behavior.
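
The round-trip property can be checked directly. This is a small sketch in Python 3 syntax (so it uses bytes where the thread's Python 2 code uses str); the sample data is arbitrary. It also shows that b16decode rejects input containing a '%', which is why any prefix added by hand has to be removed again before decoding.

```python
import base64
import binascii

data = b'"%!'
encoded = base64.b16encode(data)
print(encoded)  # upper-case hex digits only, no '%' anywhere

# Decoding the encoder's own output gives back the input.
round_trip = base64.b16decode(encoded)
print(round_trip == data)

# Anything outside 0-9 and A-F is rejected with binascii.Error.
try:
    base64.b16decode(b'%22')
    rejected = False
except binascii.Error:
    rejected = True
print(rejected)
```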

max
 
Andrew Robert

Hi Everyone,


Thanks for all of your patience on this.

I finally got it to work.


Here is the completed test code showing what is going on.

Not cleaned up yet but it works for proof-of-concept purposes.



#!/usr/bin/python

import re,base64

# Evaluate captured character as hex
def ret_hex(value):
    return '%'+base64.b16encode(value)

# Evaluate the value of whatever was matched
def enc_hex_match(match):
    return ret_hex(match.group(0))

def ret_ascii(value):
    return base64.b16decode(value)

# Evaluate the value of whatever was matched
def enc_ascii_match(match):

    arg=match.group()

    #remove the artificially inserted % sign
    arg=arg[1:]

    # decode the result
    return ret_ascii(arg)

def file_encoder():
    # Read each line, pass any matches on line to function for
    # line in file.readlines():
    output=open(r'e:\pycode\sigh.new','wb')
    for line in open(r'e:\pycode\sigh.txt','rb'):
        output.write( (re.sub('[^\w\s]',enc_hex_match, line)) )
    output.close()


def file_decoder():
    # Read each line, pass any matches on line to function for
    # line in file.readlines():

    output=open(r'e:\pycode\sigh.new2','wb')
    for line in open(r'e:\pycode\sigh.new','rb'):
        output.write(re.sub('%[0-9A-F][0-9A-F]',enc_ascii_match, line))
    output.close()




file_encoder()

file_decoder()
 
John Machin

Andrew Robert wrote:
Hi Everyone,


Thanks for all of your patience on this.

I finally got it to work.


Here is the completed test code showing what is going on.

Consider doing what you should have done at the start: state what you
are trying to achieve. Not very many people have the patience that Max
showed in ploughing through code that was both fugly and broken in order
to determine what it should have been doing.

What is the motivation for encoding characters like

Not cleaned up yet but it works for proof-of-concept purposes.



#!/usr/bin/python

import re,base64

# Evaluate captured character as hex
def ret_hex(value):
    return '%'+base64.b16encode(value)

This is IMHO rather pointless and obfuscatory, calling a function in a
module when it can be done by a standard language feature. Why did you
change it from the original "%%%2X" % value (which would have been
better IMHO done as "%%%02X" % value)?
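
The difference between the two format strings only shows up for character codes below 0x10. A short sketch (the value 0x07 is chosen purely for illustration; it is a control character that '[^\w\s]' would match):

```python
# %2X pads to width 2 with a space; %02X pads with a zero.
value = 0x07

space_padded = '%%%2X' % value
zero_padded = '%%%02X' % value

print(repr(space_padded))
print(repr(zero_padded))

# For codes of 0x10 and above, both forms produce the same text.
same_for_quote = ('%%%2X' % 0x22) == ('%%%02X' % 0x22)
print(same_for_quote)
```

With the space-padded form, a decoder that expects exactly two hex digits after the '%' would not recognise the escape.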

# Evaluate the value of whatever was matched
def enc_hex_match(match):
    return ret_hex(match.group(0))

Why a second level of function call?

def ret_ascii(value):
    return base64.b16decode(value)

See above.

# Evaluate the value of whatever was matched
def enc_ascii_match(match):

    arg=match.group()

    #remove the artificially inserted % sign

Don't bother, just ignore it:

    return chr(int(match.group()[1:], 16))

    arg=arg[1:]

    # decode the result
    return ret_ascii(arg)

def file_encoder():
    # Read each line, pass any matches on line to function for
    # line in file.readlines():
    output=open(r'e:\pycode\sigh.new','wb')
    for line in open(r'e:\pycode\sigh.txt','rb'):
        output.write( (re.sub('[^\w\s]',enc_hex_match, line)) )
    output.close()

Why are you opening the file with "rb" but then reading it a line at a time?
For a binary file, the whole file may be one "line"; it would be safer
to read() blocks of, say, 8 KB.
For a text file, the only point of binary mode might be to avoid any
sort of problem caused by OS-dependent definitions of "newline", i.e.
CRLF vs LF. I note that as \r and \n are whitespace, you are not
encoding them as %0D and %0A; is this deliberate?
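
Block-wise reading for the encoding pass might look like the sketch below. It is written in Python 3 syntax and uses in-memory io.BytesIO objects as stand-ins for the files in the thread; encode_stream and its block_size parameter are hypothetical names introduced here. A deliberately tiny block size is used to show that block boundaries do not matter when each match is a single byte.

```python
import io
import re

SPECIAL = re.compile(rb'[^\w\s]')

def encode_stream(src, dst, block_size=8192):
    """Percent-encode src into dst, reading fixed-size blocks."""
    # iter() with a sentinel keeps calling read() until it returns b''.
    for block in iter(lambda: src.read(block_size), b''):
        # m.group()[0] is the matched byte as an integer.
        dst.write(SPECIAL.sub(lambda m: b'%%%02X' % m.group()[0], block))

# In-memory stand-ins for the input and output files.
src = io.BytesIO(b'a "b"\r\nc!\n')
dst = io.BytesIO()
encode_stream(src, dst, block_size=4)
print(dst.getvalue())
```

Note that \r and \n pass through untouched because \s matches them. Decoding block by block would need more care than this, since a three-byte escape such as %22 can straddle a block boundary.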

def file_decoder():
    # Read each line, pass any matches on line to function for
    # line in file.readlines():

    output=open(r'e:\pycode\sigh.new2','wb')
    for line in open(r'e:\pycode\sigh.new','rb'):
        output.write(re.sub('%[0-9A-F][0-9A-F]',enc_ascii_match, line))
    output.close()




file_encoder()

file_decoder()
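
Putting the suggestions from this reply together, an encoder/decoder pair without base64 and without the extra layer of helper functions could be sketched as follows. This is Python 3 syntax operating on in-memory strings rather than the files from the thread, and it assumes ASCII text; the function names encode and decode are illustrative.

```python
import re

def encode(text):
    # Replace each non-word, non-whitespace character with %XX.
    return re.sub(r'[^\w\s]', lambda m: '%%%02X' % ord(m.group()), text)

def decode(text):
    # Replace each %XX escape with the character it stands for.
    return re.sub(r'%([0-9A-F]{2})',
                  lambda m: chr(int(m.group(1), 16)), text)

sample = 'say "hi" & 100%!\n'
encoded = encode(sample)
print(encoded)
print(decode(encoded) == sample)
```

Because '%' is itself matched by the encoder's pattern, a literal percent sign is written as %25, so the decoder does not confuse it with the start of an escape.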
 
