Strange occasional marshal error

  • Thread starter Graham Stratton

Graham Stratton

Hi,

I'm using Python with ZeroMQ to distribute data around an HPC cluster.
The results have been good apart from one issue which I am completely
stuck with:

We are using marshal for serialising objects before distributing them
around the cluster, and extremely occasionally a corrupted marshal is
produced. The current workaround is to serialise everything twice and
check that the serialisations are the same. On the rare occasions that
they are not, I have dumped the files for comparison. It turns out
that there are a few positions within the serialisation where
corruption tends to occur (these positions seem to be independent of
the data or the size of the complete serialisation). These are:

4 bytes starting at 548867 (0x86003)
4 bytes starting at 4398083 (0x431c03)
4 bytes starting at 17595395 (0x10c7c03)
4 bytes starting at 19794819 (0x12e0b83)
4 bytes starting at 22269171 (0x153ccf3)
2 bytes starting at 25052819 (0x17e4693)
3 bytes starting at 28184419 (0x1ae0f63)
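
For readers who want to try the same workaround, the double-serialisation check and the comparison that produces byte positions like those above might look roughly like the sketch below. The function names `checked_dumps` and `diff_runs` are hypothetical, not taken from the original code, and the sketch only compares the common prefix of the two buffers.

```python
import marshal


def diff_runs(a, b):
    """Return (start, length) for each run of differing bytes.

    Only the common prefix of the two buffers is compared; a length
    mismatch is not reported here.
    """
    runs = []
    start = None
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            if start is None:
                start = i          # a new run of differences begins
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:          # run extends to the end of the buffer
        runs.append((start, n - start))
    return runs


def checked_dumps(obj):
    """Marshal obj twice and refuse to return data if the results differ."""
    first = marshal.dumps(obj)
    second = marshal.dumps(obj)
    if first != second:
        raise ValueError("marshal mismatch at %r" % (diff_runs(first, second),))
    return first
```

Calling `checked_dumps(obj)` in place of `marshal.dumps(obj)` doubles the serialisation cost, so it is a diagnostic aid rather than a fix.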

I note that the ratio between the later positions is almost exactly
1.125. Presumably this has something to do with memory allocation
somewhere?
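
The ratio claim is easy to check from the positions listed above. This is only arithmetic on the numbers in this post; nothing here comes from the Python source.

```python
# Corruption start positions as listed in the post.
positions = [548867, 4398083, 17595395, 19794819,
             22269171, 25052819, 28184419]

# Ratio of each position to the one before it (float() keeps this
# correct on Python 2, where / on ints truncates).
ratios = [float(b) / a for a, b in zip(positions, positions[1:])]

for a, b, r in zip(positions, positions[1:], ratios):
    print("%d -> %d: ratio %.7f" % (a, b, r))
```

The first two ratios are roughly 8 and 4; the last four all agree with 1.125 to within about one part in ten million.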

Some datapoints:

- The phenomenon has been observed in a single-threaded process
without ZeroMQ
- I think the phenomenon has been observed in pickled as well as
marshalled data
- The phenomenon has been observed on different hardware

Unfortunately after quite a lot of work I still haven't managed to
reproduce this error on a single machine. Hopefully the above is
enough information for someone to speculate as to where the problem
is.

Many thanks in advance for any help.

Regards,

Graham
 

Graham Stratton

We are using marshal for serialising objects before distributing them
around the cluster, and extremely occasionally a corrupted marshal is
produced. The current workaround is to serialise everything twice and
check that the serialisations are the same. On the rare occasions that
they are not, I have dumped the files for comparison. It turns out
that there are a few positions within the serialisation where
corruption tends to occur (these positions seem to be independent of
the data or the size of the complete serialisation). These are:

4 bytes starting at 548867 (0x86003)
4 bytes starting at 4398083 (0x431c03)
4 bytes starting at 17595395 (0x10c7c03)
4 bytes starting at 19794819 (0x12e0b83)
4 bytes starting at 22269171 (0x153ccf3)
2 bytes starting at 25052819 (0x17e4693)
3 bytes starting at 28184419 (0x1ae0f63)

I modified marshal.c to print when it extends the string used to write
the marshal to. This gave me these results:
Resizing string from 50 to 1124 bytes
Resizing string from 1124 to 3272 bytes
Resizing string from 3272 to 7568 bytes
Resizing string from 7568 to 16160 bytes
Resizing string from 16160 to 33344 bytes
Resizing string from 33344 to 67712 bytes
Resizing string from 67712 to 136448 bytes
Resizing string from 136448 to 273920 bytes
Resizing string from 273920 to 548864 bytes
Resizing string from 548864 to 1098752 bytes
Resizing string from 1098752 to 2198528 bytes
Resizing string from 2198528 to 4398080 bytes
Resizing string from 4398080 to 8797184 bytes
Resizing string from 8797184 to 17595392 bytes
Resizing string from 17595392 to 19794816 bytes
Resizing string from 19794816 to 22269168 bytes
Resizing string from 22269168 to 25052814 bytes
Resizing string from 25052814 to 28184415 bytes
Resizing string from 28184415 to 31707466 bytes
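
The logged sizes fit a simple growth rule: double the buffer and add 1024 bytes, and once that would exceed some limit, grow by one eighth instead (which is where the 1.125 ratio comes from). The sketch below is inferred purely from the log above, not checked against marshal.c. The 32 MiB limit is an assumption; any limit from 17595392 up to, but not including, 35191808 would reproduce the same log. The names `next_size` and `growth_sequence` are hypothetical.

```python
# Buffer sizes taken from the "Resizing string" log, in order.
LOGGED = [50, 1124, 3272, 7568, 16160, 33344, 67712, 136448, 273920,
          548864, 1098752, 2198528, 4398080, 8797184, 17595392,
          19794816, 22269168, 25052814, 28184415, 31707466]


def next_size(size, limit=32 * 1024 * 1024):
    """Guess at the resize rule; `limit` is an assumed threshold."""
    grown = 2 * size + 1024
    if grown > limit:
        grown = size + (size >> 3)   # grow by 12.5%, rounding down
    return grown


def growth_sequence(start, steps):
    """Return `start` followed by `steps` successive resizes."""
    sizes = [start]
    for _ in range(steps):
        sizes.append(next_size(sizes[-1]))
    return sizes


print(growth_sequence(50, len(LOGGED) - 1) == LOGGED)
```

If the rule is right, the corruption positions depend only on the buffer growth schedule, which would explain why they do not vary with the data being serialised.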

Every corruption point occurs exactly three bytes above an extension
point (rounded to the nearest word for the last two). This clearly
isn't a coincidence, but I can't see where there could be a problem.
I'd be grateful for any pointers.
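
The "three bytes above an extension point" observation can be checked against the two lists above. The sketch assumes a 4-byte word and rounds each resize point up to the next word boundary; both choices are assumptions made to match the numbers, and `round_up` is a hypothetical helper.

```python
# Resize points from the log that have a corruption position just above them.
resize_points = [548864, 4398080, 17595392, 19794816,
                 22269168, 25052814, 28184415]
# Corruption start positions, in the same order.
corruption_starts = [548867, 4398083, 17595395, 19794819,
                     22269171, 25052819, 28184419]


def round_up(n, word=4):
    """Round n up to the next multiple of `word` (assumed word size: 4)."""
    return (n + word - 1) // word * word


offsets = [c - round_up(r) for r, c in zip(resize_points, corruption_starts)]
print(offsets)
```

Every offset comes out as 3, including the last two, where the resize point itself is not word-aligned.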

Thanks,

Graham
 

Tom Zych

Every corruption point occurs exactly three bytes above an extension
point (rounded to the nearest word for the last two). This clearly
isn't a coincidence, but I can't see where there could be a problem.
I'd be grateful for any pointers.

The intermittency sounds like a race condition, doesn't it? It might
be worthwhile to look into the call that's extending the string and
see if it could affect other data. Maybe objects are getting shuffled
around? Don't put too much stock in this, I'm just speculating based
on a bug I had in a C program years ago. I have no idea how CPython
handles these things.
 

MRAB

I modified marshal.c to print when it extends the string used to write
the marshal to. This gave me these results:

Resizing string from 50 to 1124 bytes
Resizing string from 1124 to 3272 bytes
Resizing string from 3272 to 7568 bytes
Resizing string from 7568 to 16160 bytes
Resizing string from 16160 to 33344 bytes
Resizing string from 33344 to 67712 bytes
Resizing string from 67712 to 136448 bytes
Resizing string from 136448 to 273920 bytes
Resizing string from 273920 to 548864 bytes
Resizing string from 548864 to 1098752 bytes
Resizing string from 1098752 to 2198528 bytes
Resizing string from 2198528 to 4398080 bytes
Resizing string from 4398080 to 8797184 bytes
Resizing string from 8797184 to 17595392 bytes
Resizing string from 17595392 to 19794816 bytes
Resizing string from 19794816 to 22269168 bytes
Resizing string from 22269168 to 25052814 bytes
Resizing string from 25052814 to 28184415 bytes
Resizing string from 28184415 to 31707466 bytes

Every corruption point occurs exactly three bytes above an extension
point (rounded to the nearest word for the last two). This clearly
isn't a coincidence, but I can't see where there could be a problem.
I'd be grateful for any pointers.

I haven't found the cause, but I have found something else I'm
suspicious of in the source for Python 3.2.

In marshal.c there's a function "w_object", and within that function is
this:

   else if (PyAnySet_CheckExact(v)) {
       PyObject *value, *it;

       if (PyObject_TypeCheck(v, &PySet_Type))
           w_byte(TYPE_SET, p);
       else
           w_byte(TYPE_FROZENSET, p);

"w_byte" is a macro which includes an if-statement, not a function.
Doesn't it need some braces? (There are braces in the other places
they're needed.)
 

Guido van Rossum

This bug report doesn't mention the Python version nor the platform --
it could in theory be a bug in the platform compiler or memory
allocator. It would also be nice to provide the test program that
reproduces the issue. It would also be useful to start tracking it in
the issue tracker at bugs.python.org

Assuming it's 3.2, I would audit _PyBytes_Resize() and whatever it
uses -- if your hunch is right and there is a problem with resizing
that's where it's done.

I haven't found the cause, but I have found something else I'm
suspicious of in the source for Python 3.2.

In marshal.c there's a function "w_object", and within that function is
this:

   else if (PyAnySet_CheckExact(v)) {
       PyObject *value, *it;

       if (PyObject_TypeCheck(v, &PySet_Type))
           w_byte(TYPE_SET, p);
       else
           w_byte(TYPE_FROZENSET, p);

"w_byte" is a macro which includes an if-statement, not a function.
Doesn't it need some braces? (There are braces in the other places
they're needed.)

That macro looks fine to me; looking at the definition of w_byte() it
has matched if/else clauses:

#define w_byte(c, p) if (((p)->fp)) putc((c), (p)->fp); \
else if ((p)->ptr != (p)->end) *(p)->ptr++ = (c);\
else w_more(c, p)

Although traditionally, just to be sure, we've enclosed similar macros
inside do { ... } while (0). Also it would be nice to call out its
macro-status by renaming it to W_BYTE -- I suppose at one point in the
past it was a plain function...
 

Graham Stratton

This bug report doesn't mention the Python version nor the platform --
it could in theory be a bug in the platform compiler or memory
allocator.

I've seen the problem with 2.6 and 2.7, on RHEL 4 (possibly with a
custom kernel, I can't check at the moment).

It would also be nice to provide the test program that
reproduces the issue.

I'm working on trying to reproduce it without the proprietary code
that uses it, but so far haven't managed it. There are some custom C
extensions in the system where this is observed, but since the code is
single-threaded I don't think they can have any effect during
marshalling.

Thanks,

Graham
 
