struct: type registration?


Giovanni Bajo

Hello,

given the ongoing work on struct (which I thought was a dead module), I was
wondering if it would be possible to add an API to register custom parsing
codes for struct. Whenever I use it for non-trivial tasks, I always happen to
write small wrapper functions to adjust the values returned by struct.

An example API would be the following:

============================================
def mystring_len():
    return 20

def mystring_pack(s):
    if len(s) > 20:
        raise ValueError, "a mystring can be at max 20 chars"
    s = (s + "\0"*20)[:20]
    s = struct.pack("20s", s)
    return s

def mystring_unpack(s):
    assert len(s) == 20
    s = struct.unpack("20s", s)[0]
    idx = s.find("\0")
    if idx >= 0:
        s = s[:idx]
    return s

struct.register("S", mystring_pack, mystring_unpack, mystring_len)

# then later
foo = struct.unpack("iilS", data)
============================================

This is only an example; any similar API that fits struct's internals better
would do as well.

As shown, the custom packer/unpacker can call the original pack/unpack as a
basis for their work. I guess an issue with this could be the endianness
problem: it would make sense if, when called recursively, struct.pack/unpack
used by default the endianness specified by the outer format string.
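To make the proposal concrete, here is a rough pure-Python sketch of such a registry in modern Python. Everything here is hypothetical: register(), _custom, and this unpack() are invented names, and the real struct module has no such API.

```python
import struct

# Hypothetical registry: format code -> (packer, unpacker, length_fn).
_custom = {}

def register(code, packer, unpacker, length_fn):
    _custom[code] = (packer, unpacker, length_fn)

def unpack(fmt, data):
    """Like struct.unpack, but fmt may contain registered custom codes.
    Runs of standard codes are delegated to the real struct module.
    (A real version would also have to propagate a leading byte-order
    character into the delegated calls, as discussed above.)"""
    out, pos, run = [], 0, ""

    def flush(pos):
        # Hand any pending run of standard codes to struct itself.
        if run:
            size = struct.calcsize(run)
            out.extend(struct.unpack(run, data[pos:pos + size]))
            pos += size
        return pos

    for ch in fmt:
        if ch in _custom:
            pos = flush(pos)
            run = ""
            _packer, unpacker_fn, length_fn = _custom[ch]
            n = length_fn()
            out.append(unpacker_fn(data[pos:pos + n]))
            pos += n
        else:
            run += ch
    pos = flush(pos)
    return tuple(out)

# Giovanni's "S": a NUL-terminated string in a fixed 20-byte field.
def mystring_len():
    return 20

def mystring_unpack(raw):
    assert len(raw) == 20
    return raw.split(b"\0", 1)[0]

register("S", None, mystring_unpack, mystring_len)
```

With this, unpack("iiS", struct.pack("ii", 1, 2) + b"hello".ljust(20, b"\0")) yields (1, 2, b"hello").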
 

John Machin

Hello,

given the ongoing work on struct (which I thought was a dead module), I was
wondering if it would be possible to add an API to register custom parsing
codes for struct. Whenever I use it for non-trivial tasks, I always happen to
write small wrapper functions to adjust the values returned by struct.

An example API would be the following:

============================================
def mystring_len():
    return 20

def mystring_pack(s):
    if len(s) > 20:
        raise ValueError, "a mystring can be at max 20 chars"
    s = (s + "\0"*20)[:20]

Have you considered s.ljust(20, "\0") ?

    s = struct.pack("20s", s)
    return s

I am an idiot, so please be gentle with me: I don't understand why you
are using struct.pack at all:

>>> import struct
>>> x = ("abcde" + "\0" * 20)[:20]
>>> x
'abcde\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> len(x)
20
>>> y = struct.pack("20s", x)
>>> y == x
True

Looks like a big fat no-op to me; you've done all the heavy lifting
yourself.
def mystring_unpack(s):
    assert len(s) == 20
    s = struct.unpack("20s", s)[0]

Errrm, g'day, it's that pesky idiot again:

>>> z = struct.unpack("20s", y)[0]
>>> z
'abcde\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> z == y == x
True

    idx = s.find("\0")
    if idx >= 0:
        s = s[:idx]
    return s

Have you considered this:

>>> z.rstrip("\0")
'abcde'
>>> ("\0" * 20).rstrip("\0")
''
>>> ("x" * 20).rstrip("\0")
'xxxxxxxxxxxxxxxxxxxx'
 

Giovanni Bajo

John said:
given the ongoing work on struct (which I thought was a dead
module), I was wondering if it would be possible to add an API to
register custom parsing codes for struct. Whenever I use it for
non-trivial tasks, I always happen to write small wrapper functions
to adjust the values returned by struct.

An example API would be the following:

============================================
def mystring_len():
    return 20

def mystring_pack(s):
    if len(s) > 20:
        raise ValueError, "a mystring can be at max 20 chars"
    s = (s + "\0"*20)[:20]

Have you considered s.ljust(20, "\0") ?

Right. This happened to be an example...
I am an idiot, so please be gentle with me: I don't understand why you
are using struct.pack at all:

Because I want to be able to parse large chunks of binary data with custom
formatting. Did you miss the whole point of my message:

struct.unpack("3liiSiiShh", data)

You need struct.unpack() to parse this data, and you need custom
packers/unpackers to avoid post-processing the output of unpack() just because
it only knows of basic Python types. In binary structs, there happen to be
*types* which do not map 1:1 to Python types, nor are they just basic C types
(like the ones struct supports). Using custom formatters is a way to better
represent these types (instead of mapping them to the "most similar" type and
then post-processing it).

In my example, "S" is a basic type meaning "a 0-terminated 20-byte string",
and expressing it in the struct format with the single letter "S" is more
meaningful in my code than using "20s" and then post-processing the resulting
string each and every time this happens.

>>> import struct
>>> x = ("abcde" + "\0" * 20)[:20]
>>> x
'abcde\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> len(x)
20
>>> y = struct.pack("20s", x)
>>> y == x
True

Looks like a big fat no-op to me; you've done all the heavy lifting
yourself.

Looks like you totally misread my message. Your string "x" is what I find in
binary data, and I need to *unpack* into a regular Python string, which would
be "abcde".

    idx = s.find("\0")
    if idx >= 0:
        s = s[:idx]
    return s

Have you considered this:

>>> z.rstrip("\0")
'abcde'

This would not work because, in the actual binary data I have to parse, only
the first \0 is meaningful and terminates the string (like in C). There is
absolutely no guarantee that the rest of the padding is made of \0s as well.
 

Giovanni Bajo

Giovanni said:
You need struct.unpack() to parse this data, and you need custom
packers/unpackers to avoid post-processing the output of unpack() just
because it only knows of basic Python types. In binary structs, there
happen to be *types* which do not map 1:1 to Python types, nor are they
just basic C types (like the ones struct supports). Using custom
formatters is a way to better represent these types (instead of
mapping them to the "most similar" type and then post-processing it).

In my example, "S" is a basic type meaning "a 0-terminated 20-byte
string", and expressing it in the struct format with the single
letter "S" is more meaningful in my code than using "20s" and then
post-processing the resulting string each and every time this happens.


Another compelling example is the SSH protocol:
http://www.openssh.com/txt/draft-ietf-secsh-architecture-12.txt
Go to section 4, "Data Type Representations Used in the SSH Protocols", and it
describes the data types used by the SSH protocol. In a perfect world, I would
write some custom packers/unpackers for those types which struct does not
handle already (like the "mpint" format), so that I could use struct to parse
and compose SSH messages. What I ended up doing was writing a new module
sshstruct.py from scratch, which duplicates struct's work, just because I
couldn't extend struct. Some examples:

client.py:     cookie, server_algorithms, guess, reserverd =
                   sshstruct.unpack("16b10LBu", data[1:])
client.py:     prompts = sshstruct.unpack("sssu" + "sB"*num_prompts, pkt[1:])
connection.py: pkt = sshstruct.pack("busB", SSH_MSG_CHANNEL_REQUEST,
                   self.recipient_number, type, reply) + custom
kex.py:        self.P, self.G = sshstruct.unpack("mm", pkt[1:])

Notice for instance how "s" is a SSH string and unpacks directly to a Python
string, and "m" is a SSH mpint (infinite precision integer) but unpacks
directly into a Python long. Using struct.unpack() this would have been
impossible and would have required much post-processing.

Actually, another thing that struct should support to cover the SSH protocol
(and many other binary protocols) is the ability to parse strings whose size
is not known up front (variable-length data types). For instance, type
"string" in the SSH protocol is a string prepended with its size as a uint32,
so its actual size depends on each instance. For this reason, my sshstruct
did not have the equivalent of struct.calcsize(). I guess that any way of
extending struct should comprehend variable-size data types (and calcsize()
would return -1 or raise an exception).
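For illustration, the two SSH wire types just mentioned can be sketched in modern Python as follows. The function names are made up, but the encoding follows the draft: a "string" is a big-endian uint32 byte count followed by that many bytes, and an "mpint" is a string holding a big-endian two's-complement integer.

```python
import struct

def unpack_ssh_string(data, pos=0):
    # SSH "string": uint32 byte count, then that many bytes.  Its size is
    # only known after reading the prefix, so a calcsize()-style
    # precomputation is impossible.
    (n,) = struct.unpack_from(">I", data, pos)
    pos += 4
    return data[pos:pos + n], pos + n

def unpack_ssh_mpint(data, pos=0):
    # SSH "mpint": a "string" holding a big-endian two's-complement integer.
    raw, pos = unpack_ssh_string(data, pos)
    return int.from_bytes(raw, "big", signed=True), pos
```

Each function returns the decoded value plus the new offset, which is exactly the shape a variable-length extension of struct would need internally.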
 

John Machin

John said:
given the ongoing work on struct (which I thought was a dead
module), I was wondering if it would be possible to add an API to
register custom parsing codes for struct. Whenever I use it for
non-trivial tasks, I always happen to write small wrapper functions
to adjust the values returned by struct.

An example API would be the following:

============================================
def mystring_len():
    return 20

def mystring_pack(s):
    if len(s) > 20:
        raise ValueError, "a mystring can be at max 20 chars"
    s = (s + "\0"*20)[:20]

Have you considered s.ljust(20, "\0") ?

Right. This happened to be an example...
I am an idiot, so please be gentle with me: I don't understand why you
are using struct.pack at all:

Given a choice between whether I was referring to the particular
instance of using struct.pack two lines above, or whether I was doubting
the general utility of the struct module, you appear to have chosen the
latter, erroneously.
Because I want to be able to parse large chunks of binary data with custom
formatting. Did you miss the whole point of my message:
No.


struct.unpack("3liiSiiShh", data)

You need struct.unpack() to parse this data, and you need custom
packers/unpackers to avoid post-processing the output of unpack() just because
it only knows of basic Python types. In binary structs, there happen to be
*types* which do not map 1:1 to Python types, nor are they just basic C types
(like the ones struct supports). Using custom formatters is a way to better
represent these types (instead of mapping them to the "most similar" type and
then post-processing it).

In my example, "S" is a basic type meaning "a 0-terminated 20-byte string",
and expressing it in the struct format with the single letter "S" is more
meaningful in my code than using "20s" and then post-processing the resulting
string each and every time this happens.

>>> import struct
>>> x = ("abcde" + "\0" * 20)[:20]
>>> x
'abcde\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> len(x)
20
>>> y = struct.pack("20s", x)
>>> y == x
True
Looks like a big fat no-op to me; you've done all the heavy lifting
yourself.

Looks like you totally misread my message.

Not at all.

Your function:

def mystring_pack(s):
    if len(s) > 20:
        raise ValueError, "a mystring can be at max 20 chars"
    s = (s + "\0"*20)[:20]
    s = struct.pack("20s", s)
    return s

can be even better replaced (after reading the manual: "For packing,
the string is truncated or padded with null bytes as appropriate to make
it fit.") by:

def mystring_pack(s):
    if len(s) > 20:
        raise ValueError, "a mystring can be at max 20 chars"
    return s
    # s = (s + "\0"*20)[:20]  # not needed, according to the manual
    # s = struct.pack("20s", s)
    # As I said, this particular instance of using struct.pack is a big
    # fat no-op.
Your string "x" is what I find in
binary data, and I need to *unpack* into a regular Python string, which would
be "abcde".

And you unpack it with a custom function that also contains a fat no-op:

def mystring_unpack(s):
    assert len(s) == 20
    s = struct.unpack("20s", s)[0]  # does nothing
    idx = s.find("\0")
    if idx >= 0:
        s = s[:idx]
    return s

Have you considered this:

>>> z.rstrip("\0")
'abcde'


This would not work because, in the actual binary data I have to parse, only
the first \0 is meaningful and terminates the string (like in C). There is
absolutely no guarantee that the rest of the padding is made of \0s as well.

Point taken.

Cheers,
John
 

Giovanni Bajo

John said:
Looks like you totally misread my message.

Not at all.

Your function:

def mystring_pack(s):
    if len(s) > 20:
        raise ValueError, "a mystring can be at max 20 chars"
    s = (s + "\0"*20)[:20]
    s = struct.pack("20s", s)
    return s

can be even better replaced (after reading the manual: "For packing,
the string is truncated or padded with null bytes as appropriate to
make it fit.") by:

def mystring_pack(s):
    if len(s) > 20:
        raise ValueError, "a mystring can be at max 20 chars"
    return s
    # s = (s + "\0"*20)[:20]  # not needed, according to the manual
    # s = struct.pack("20s", s)
    # As I said, this particular instance of using struct.pack is a big
    # fat no-op.

John, the point of the example was to show that one could write custom
packer/unpacker which calls struct.pack/unpack and, after that,
post-processes the results to obtain some custom data type. Now, I apologize
if my example wasn't exactly the shortest, most compact, most pythonic piece
of code. It was not meant to be. It was meant to be very easy to read and
very clear in what it is being done. You are nitpicking that part of my code
is a no-op. Fine. Sorry if this confused you. I was just trying to show a
simple pattern:

custom packer: adjust data, call struct.pack(), return
custom unpacker: call struct.unpack(), adjust data, return

I should probably have chosen a more complex example, but I did not want to
confuse readers. It seems I have confused them by choosing too simple an
example.
 

Serge Orlov

Giovanni said:
Because I want to be able to parse large chunks of binary data with custom
formatting. Did you miss the whole point of my message:

struct.unpack("3liiSiiShh", data)

Did you want to write struct.unpack("Sheesh", data)? Seriously, the
main problem of struct is that it uses ad-hoc abbreviations for
relatively rarely[1] used function calls, and that makes it hard to
read.

If you want to parse binary data use pyconstruct
<http://pyconstruct.wikispaces.com/>

[1] Relative to regular expression and string formatting calls.
 

John Machin

John said:
Looks like you totally misread my message.
Not at all.

Your function:

def mystring_pack(s):
    if len(s) > 20:
        raise ValueError, "a mystring can be at max 20 chars"
    s = (s + "\0"*20)[:20]
    s = struct.pack("20s", s)
    return s

can be even better replaced (after reading the manual: "For packing,
the string is truncated or padded with null bytes as appropriate to
make it fit.") by:

def mystring_pack(s):
    if len(s) > 20:
        raise ValueError, "a mystring can be at max 20 chars"
    return s
    # s = (s + "\0"*20)[:20]  # not needed, according to the manual
    # s = struct.pack("20s", s)
    # As I said, this particular instance of using struct.pack is a big
    # fat no-op.

John, the point of the example was to show that one could write custom
packer/unpacker which calls struct.pack/unpack and, after that,
post-processes the results to obtain some custom data type.

What you appear to be doing is proposing an API for extending struct by
registering custom type-codes (ASCII alphabetic?) each requiring three
call-back functions (mypacker, myunpacker, mylength).

Example registration for an "S" string (fixed storage length, true
length determined on unpacking by first occurrence of '\0' (if any)).

struct.register("S", packerS, unpackerS, lengthS)

You give no prescription for what those functions should do. You provide
"examples" which require reverse engineering to deduce what they are
intended to be exemplars of.

Simple-minded folk like myself might expect that the functions would
work something like this:

Packing: when struct.pack reaches the custom code in the format, it does
this (pseudocode):

    obj = _get_next_arg()
    itemstrg = mypacker(obj)
    _append_to_output_string(itemstrg)

Unpacking: when struct.unpack reaches a custom code in the format, it
does this (pseudocode):

    n = mylength()
    # exception if < n bytes remain
    obj = myunpacker(remaining_bytes[:n])
    _append_to_output_tuple(obj)

Thus, in a simple case like the NUL-terminated string:

def lengthS():
    return 20

def packerS(s):
    assert len(s) <= 20
    return s.ljust(20, '\0')
    # alternatively, return struct.pack("20s", s)

def unpackerS(bytes):
    assert len(bytes) == 20
    i = bytes.find('\0')
    if i >= 0:
        return bytes[:i]
    return bytes

In more complicated cases, it may be useful for either/both the
packer/unpacker custom functions to call struct.pack/unpack to assist in
the assembly/disassembly exercise. This should be (1) possible without
perturbing the state of the outer struct.pack/unpack invocation (2)
sufficiently obvious to warrant little more than a passing mention.
Now, I apologize
if my example wasn't exactly the shortest, most compact, most pythonic piece
of code. It was not meant to be. It was meant to be very easy to read and
very clear in what it is being done. You are nitpicking that part of my code
is a no-op. Fine.

Scarcely a nitpick. It was very clear that parts of it were doing
absolutely nothing in a rather byzantine & baroque fashion. What was
unclear was whether this was by accident or design. You say (*after* the
examples) that "As shown, the custom packer/unpacker can call the
original pack/unpack as a basis for their work. ... when called
recursively ...". What basis for what work? As for recursion, I see no
"19s", "18s", etc here :)

Sorry if this confused you.

It didn't. As a self-confessed idiot, I am resolutely and irredeemably
unconfused.
I was just trying to show a
simple pattern:

custom packer: adjust data, call struct.pack(), return
custom unpacker: call struct.unpack(), adjust data, return

I should probably have chosen a more complex example, but I did not want to
confuse readers. It seems I have confused them by choosing too simple an
example.

The problem was that you chose an example that had minimal justification
(i.e. only the length check) for a custom packer at all (struct.pack
pads the "s" format with NUL bytes) and no use at all for a call to
struct.unpack inside the custom unpacker.

Cheers,
John
 

John Machin

Giovanni said:
Because I want to be able to parse large chunks of binary data with custom
formatting. Did you miss the whole point of my message:

struct.unpack("3liiSiiShh", data)

Did you want to write struct.unpack("Sheesh", data)? Seriously, the
main problem of struct is that it uses ad-hoc abbreviations for
relatively rarely[1] used function calls, and that makes it hard to
read.

Indeed. The first time I saw something like struct.pack('20H', ...) I
thought it was a FORTRAN format statement :)
If you want to parse binary data use pyconstruct
<http://pyconstruct.wikispaces.com/>

Looks promising on the legibility and functionality fronts. Can you make
any comment on the speed? Reason for asking is that Microsoft Excel
files have this weird "RK" format for expressing common float values in
32 bits (refer http://sc.openoffice.org, see under "Documentation"
heading). I wrote and support the xlrd module (see
http://cheeseshop.python.org/pypi/xlrd) for reading those files in
portable pure Python. Below is a function that would plug straight in as
an example of Giovanni's custom unpacker functions. Some of the files
can be very large, and reading them is rather slow.

Cheers,
John

from struct import unpack

def unpack_RK(rk_str): # arg is 4 bytes
    flags = ord(rk_str[0])
    if flags & 2:
        # There's a SIGNED 30-bit integer in there!
        i, = unpack('<i', rk_str)
        i >>= 2 # div by 4 to drop the 2 flag bits
        if flags & 1:
            return i / 100.0
        return float(i)
    else:
        # It's the most significant 30 bits
        # of an IEEE 754 64-bit FP number
        d, = unpack('<d', '\0\0\0\0' + chr(flags & 252) + rk_str[1:4])
        if flags & 1:
            return d / 100.0
        return d
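For readers on current Python, where bytes and strings are distinct types, the same RK decoder can be sketched as a straight port of the function above (bytes indexing replaces ord/chr; the flag logic and byte layout are unchanged):

```python
from struct import unpack

def unpack_RK_py3(rk_bytes):  # arg is 4 bytes
    flags = rk_bytes[0]  # indexing bytes already yields an int
    if flags & 2:
        # A SIGNED 30-bit integer lives in the top 30 bits.
        i, = unpack('<i', rk_bytes)
        i >>= 2  # div by 4 to drop the 2 flag bits
        return i / 100.0 if flags & 1 else float(i)
    # Otherwise: the most significant 30 bits of an IEEE 754
    # 64-bit float, with the low 2 bits used as flags.
    d, = unpack('<d', b'\0\0\0\0' + bytes([flags & 252]) + rk_bytes[1:4])
    return d / 100.0 if flags & 1 else d
```

For example, the integer 123 is stored as the uint32 (123 << 2) | 2, and unpack_RK_py3 on those four little-endian bytes returns 123.0.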
 

Serge Orlov

John said:
Looks promising on the legibility and functionality fronts. Can you make
any comment on the speed?

I don't really know. I used it for small data parsing, and its performance
was acceptable. As I understand it, it is implemented right now as pure
Python code using struct under the hood. The biggest concern is the
lack of comprehensive documentation; if that scares you, it's not for
you.
Reason for asking is that Microsoft Excel
files have this weird "RK" format for expressing common float values in
32 bits (refer http://sc.openoffice.org, see under "Documentation"
heading). I wrote and support the xlrd module (see
http://cheeseshop.python.org/pypi/xlrd) for reading those files in
portable pure Python. Below is a function that would plug straight in as
an example of Giovanni's custom unpacker functions. Some of the files
can be very large, and reading them is rather slow.

I *guess* that the *current* implementation of pyconstruct will make
parsing slightly slower. But you have to try to find out.
from struct import unpack

def unpack_RK(rk_str): # arg is 4 bytes
    flags = ord(rk_str[0])
    if flags & 2:
        # There's a SIGNED 30-bit integer in there!
        i, = unpack('<i', rk_str)
        i >>= 2 # div by 4 to drop the 2 flag bits
        if flags & 1:
            return i / 100.0
        return float(i)
    else:
        # It's the most significant 30 bits
        # of an IEEE 754 64-bit FP number
        d, = unpack('<d', '\0\0\0\0' + chr(flags & 252) + rk_str[1:4])
        if flags & 1:
            return d / 100.0
        return d

I had to look up what < means :) Since nobody except this function cares
about the internals of an RK number, you don't need to use pyconstruct to
parse at the bit level. The code will be almost like you wrote it, except
you replace unpack('<d', ...) with Construct.LittleFloat64("").parse(...)
and plug unpack_RK into the pyconstruct framework by deriving from the
Field class. Sure, nobody is going to raise your paycheck because of this
rewrite :) The biggest benefit comes from parsing the whole data file with
pyconstruct, not individual fields.
 
