numpy 00 character bug?

Nathaniel Rook

Hello, all!

I've recently encountered a bug in NumPy's string arrays, where the 00
ASCII character ('\x00') is not stored properly when put at the end of a
string.

For example:

Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> print numpy.version.version
1.3.0
>>> arr = numpy.empty(1, 'S2')
>>> arr[0] = 'ab'
>>> arr
array(['ab'],
      dtype='|S2')
>>> arr[0] = 'c\x00'
>>> arr
array(['c'],
      dtype='|S2')

It seems that the string array is using the 00 character to pad strings
smaller than the maximum size, and thus is treating any 00 characters at
the end of a string as padding. Obviously, as long as I don't use
smaller strings, there is no information lost here, but I don't want to
have to re-add my 00s each time I ask the array what it is holding.

Is this a well-known bug already? I couldn't find it on the NumPy bug
tracker, but I could have easily missed it, or it could be triaged,
deemed acceptable because there's no better way to deal with
arbitrary-length strings. Is there an easy way to avoid this problem?
Pretty much any performance-intensive part of my program is going to be
dealing with these arrays, so I don't want to just replace them with a
slower dictionary instead.

I can't imagine this issue hasn't come up before; I encountered it by
using NumPy arrays to store Python structs, something I can imagine is
done fairly often. As such, I apologize for bringing it up again!

Nathaniel
 
Aahz

> I've recently encountered a bug in NumPy's string arrays, where the 00
> ASCII character ('\x00') is not stored properly when put at the end of a
> string.

You should ask about this on the NumPy mailing lists and/or report it on
the NumPy tracker:

http://scipy.org/
--
Aahz ([email protected]) <*> http://www.pythoncraft.com/

"Given that C++ has pointers and typecasts, it's really hard to have a
serious conversation about type safety with a C++ programmer and keep a
straight face. It's kind of like having a guy who juggles chainsaws
wearing body armor arguing with a guy who juggles rubber chickens wearing
a T-shirt about who's in more danger." --Roy Smith, c.l.py, 2004.05.23
 
Carl Banks

> Hello, all!
>
> I've recently encountered a bug in NumPy's string arrays, where the 00
> ASCII character ('\x00') is not stored properly when put at the end of a
> string.
>
> For example:
>
> Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52)
> [GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy
> >>> print numpy.version.version
> 1.3.0
> >>> arr = numpy.empty(1, 'S2')
> >>> arr[0] = 'ab'
> >>> arr
> array(['ab'],
>       dtype='|S2')
> >>> arr[0] = 'c\x00'
> >>> arr
> array(['c'],
>       dtype='|S2')
>
> It seems that the string array is using the 00 character to pad strings
> smaller than the maximum size, and thus is treating any 00 characters at
> the end of a string as padding.  Obviously, as long as I don't use
> smaller strings, there is no information lost here, but I don't want to
> have to re-add my 00s each time I ask the array what it is holding.

I am going to guess that it is done this way for the sake of
interoperability with Fortran, and that it is deliberate behavior.
Also, if it were accidental behavior, then it would probably happen
for internal nul bytes, but it doesn't.
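
A quick way to check the internal-versus-trailing difference is sketched below. It assumes a current Python 3 / NumPy install rather than the 2.5.2 / 1.3.0 setup in the quoted session, so the strings are written as bytes literals. The variable names are only for illustration.

```python
import numpy as np

# Three fixed-width 3-byte string slots.
arr = np.zeros(3, dtype='S3')
arr[0] = b'c\x00\x00'   # NUL bytes only at the end
arr[1] = b'a\x00b'      # NUL byte in the middle
arr[2] = b'abc'         # no NUL bytes at all

trailing = arr[0]       # trailing NULs are treated as padding on read
internal = arr[1]       # the internal NUL comes back unchanged
raw = arr.tobytes()     # the underlying buffer still holds every byte

print(repr(trailing), repr(internal), repr(raw))
```

So the bytes are still in the buffer. It is only element access that drops the trailing NULs.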

The workaround I recommend is to add a superfluous character on the
end:
array(['a\x00x'],
dtype='|S3')

Then chop off the last character when you read the value back. (However,
it might turn out that re-padding as necessary performs better.)
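
Here is a minimal sketch of both options, again assuming Python 3 and a current NumPy. The helper names (`make_array`, `store`, `load`, `repad`) and the choice of `b'x'` as the extra character are made up for illustration and are not part of NumPy.

```python
import numpy as np

SENTINEL = b'x'  # hypothetical choice; any single non-NUL byte would do


def make_array(n, width):
    """Allocate n slots, each one byte wider than the payload."""
    return np.zeros(n, dtype='S%d' % (width + 1))


def store(arr, i, value):
    """Store value followed by the sentinel so its trailing NULs survive."""
    width = arr.dtype.itemsize - 1
    if len(value) > width:
        # NumPy would silently truncate and lose the sentinel.
        raise ValueError('value longer than %d bytes' % width)
    arr[i] = value + SENTINEL


def load(arr, i):
    """Read the slot back and chop off the sentinel."""
    return bytes(arr[i])[:-1]


def repad(value, width):
    """Alternative for fixed-width records: re-add the stripped NULs."""
    return bytes(value).ljust(width, b'\x00')


arr = make_array(2, 2)
store(arr, 0, b'c\x00')
store(arr, 1, b'ab')
print(repr(load(arr, 0)), repr(load(arr, 1)))

plain = np.zeros(1, dtype='S2')
plain[0] = b'c\x00'
print(repr(repad(plain[0], 2)))
```

The sentinel version costs one extra byte per element but also works when the values have different lengths. `repad` only makes sense when every record is known to be exactly `width` bytes long.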

> Is this a well-known bug already?  I couldn't find it on the NumPy bug
> tracker, but I could have easily missed it, or it could be triaged,
> deemed acceptable because there's no better way to deal with
> arbitrary-length strings.  Is there an easy way to avoid this problem?
> Pretty much any performance-intensive part of my program is going to be
> dealing with these arrays, so I don't want to just replace them with a
> slower dictionary instead.
>
> I can't imagine this issue hasn't come up before; I encountered it by
> using NumPy arrays to store Python structs, something I can imagine is
> done fairly often.  As such, I apologize for bringing it up again!

I doubt a very high percentage of people who use numpy do character
manipulation, so I could see it as something that hasn't come up
before.


Carl Banks