File.yaml?(fname)

Devin Mullins

Trans said:
what's the best way to determine if a file is yaml?
Naive answer:

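require 'yaml'

def File.yaml?(fname)
  YAML.load(IO.read(fname))
  true
# Syck raises ArgumentError on a parse failure; Psych raises Psych::SyntaxError
rescue ArgumentError, Psych::SyntaxError
  false
end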

Though, open up irb -ryaml and keep running this line:
YAML.load Array.new(60){rand 256}.pack('c*')

I'm not sure that's what you're after. :)
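To illustrate how permissive the load-and-rescue check is: a plain scalar is valid YAML, so ordinary prose loads without error. The sketch below adds a stricter variant that only accepts a mapping or sequence at the top level. The name yaml_collection? is hypothetical and the stricter rule is an assumption for illustration, not something proposed in this thread.

```ruby
require 'yaml'

# Hypothetical stricter check: the text must parse AND its top level
# must be a mapping (Hash) or a sequence (Array).
def yaml_collection?(text)
  doc = YAML.load(text)
  doc.is_a?(Hash) || doc.is_a?(Array)
rescue StandardError # parse errors from either Syck or Psych
  false
end

YAML.load("just some prose")        # loads fine, as a plain String
yaml_collection?("just some prose") # false: a bare scalar
yaml_collection?("a: 1\nb: 2\n")    # true: a mapping
```

This rejects legitimate single-scalar documents, so it only suits cases where the files of interest are always hashes or lists.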

And I'm guessing you didn't mean:
def File.yaml?(fname)
  # extname includes the leading dot, e.g. ".yaml" or ".yml"
  !!(extname(fname) =~ /\A\.ya?ml\z/)
end

Devin
 
Joel VanderWerf

Trans said:
what's the best way to determine if a file is yaml?

In light of the other responses, which show how hard it is to do this in
general, what about a pragmatic approach that might work in most of the
cases you are interested in?

Look at the first N lines.

If any line has _any_ non-printing characters, it's not correct YAML and
wasn't generated by YAML.dump.[1]

If any are longer than M chars or other binary file heuristics apply[2],
it's probably not a manually written YAML file.

If it passes at least _one_ of these two checks, then check to see if
80% of the (first N) lines match the following:

/^\s*(-|\?|[\w\s]*:)\s/

Maybe add some logic to skip blocks of text like this (so they don't
count against the 80%):

a: |
  skip
  me

Also, check for > in place of |.

And also skip blanks and comments /^\s*(#|$)/.

And then finally load it and rescue any ArgumentError.

There are probably a lot of corner cases that kill this approach if you
cannot tolerate false negatives (i.e., legit yaml that gets rejected by
the above).
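A rough sketch of the steps above, to make the flow concrete. The method name yamlish? and the defaults for N, M and the 80% ratio are assumptions picked for illustration. The block-skipping rule (skip lines indented deeper than the "key: |" line) is one possible reading of the suggestion, not a tested recipe.

```ruby
require 'yaml'

YAMLISH = /^\s*(-|\?|[\w\s]*:)\s/   # the line pattern given above
BLOCK   = /:\s+[|>][-+0-9]*\s*$/    # "key: |" or "key: >" opens a text block

# Hypothetical helper; n, m and ratio are illustrative defaults.
def yamlish?(text, n: 40, m: 200, ratio: 0.8)
  lines = text.scrub('?').lines.first(n)

  # The two cheap checks; at least one of them has to pass.
  printable = text.valid_encoding? &&
              lines.none? { |l| l =~ /[^[:print:]\s]/ }
  short     = lines.all? { |l| l.length <= m }
  return false unless printable || short

  candidates   = []
  block_indent = nil
  lines.each do |line|
    indent = line[/\A */].size
    if block_indent
      # Still inside a | or > block: do not count these lines.
      next if line.strip.empty? || indent > block_indent
      block_indent = nil
    end
    stripped = line.strip
    next if stripped.empty? || stripped.start_with?('#') # blanks, comments
    candidates << line
    block_indent = indent if line =~ BLOCK
  end
  return false if candidates.empty?

  hits = candidates.count { |l| l =~ YAMLISH }
  return false if hits < ratio * candidates.size

  YAML.load(text) # final step: actually parse it
  true
rescue StandardError # Syck's ArgumentError, Psych's SyntaxError, ...
  false
end
```

For example, yamlish?("a: |\n  skip\n  me\nb: 1\n") passes because the two indented lines are skipped, while three lines of ordinary prose fail the 80% test before the parser is ever called.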

---

[1] The YAML spec, http://yaml.org/spec/current.html, says nonprinting
chars are encoded (see 4.1.1. Character Set), and it seems to be true,
at least in the dump output:

irb(main):023:0> puts({"a"=>"\002"}.to_yaml)
---
a: !binary |
  Ag==

However, YAML can load unescaped binary data, as Devin showed:

irb(main):025:0> YAML.load "a: \002"
=> {"a"=>"\002"}

[2] For example,
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/52548
 
ara.t.howard

Trans said:
what's the best way to determine if a file is yaml?

thanks,
t.

in ruby queue i detect whether stdin input is a normal list or yaml in this
way:

if first_non_blank_line =~ %r/^\s*---\s*$/
  load_yaml_from_stdin
else
  process_line first_non_blank_line
  while((line = next_line[stdin]))
    process_line line
  end
end

not perfect, but it's worked well enough so far
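a self-contained sketch of the same idea, for anyone who wants to try it. the helper name read_list_or_yaml is made up for this example and is not taken from ruby queue; it assumes the whole input fits in memory.

```ruby
require 'yaml'
require 'stringio'

# Hypothetical helper: if the first non-blank line is a bare '---',
# parse the whole input as YAML; otherwise return the lines as a list.
def read_list_or_yaml(io)
  first = io.each_line.find { |l| !l.strip.empty? }
  return [] if first.nil?

  if first =~ /^\s*---\s*$/
    YAML.load(first + io.read)
  else
    ([first] + io.readlines).map(&:chomp)
  end
end

read_list_or_yaml(StringIO.new("\n---\n- a\n- b\n")) # parsed as YAML
read_list_or_yaml(StringIO.new("one\ntwo\n"))        # plain list of lines
```

passing $stdin instead of a StringIO gives the stdin behaviour described above.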

cheers.


also, from the command line i've taken to this approach

list_input_on_stdin = ARGV.delete '-'
yaml_input_on_stdin = ARGV.delete '---'

so, for example

cat.rb - # dump stdin

cat.rb --- # load the yaml doc on stdin and dump that

note that '--' is used to indicate the end of options so it is not a good
flag.

regards.

-a
 
Trans

Joel said:
[...]
And then finally load it and rescue any ArgumentError.

There are probably a lot of corner cases that kill this approach if you
cannot tolerate false negatives (i.e., legit yaml that gets rejected by
the above).

yikes! if that's what it takes then i must run away! :) i need
something snappy. actually it just occurred to me that as of YAML 1.1
the document declaration is mandatory. I had forgotten about that. So
checking for an initial line starting with %YAML would do the trick as
long as docs were 1.1 compliant, at least in this regard.
Unfortunately Syck itself isn't 1.1 compliant in this respect
whatsoever :-(

In the meantime I'm just going to go with ara's suggestion. the use of
an initial '---' is an acceptable requirement for my needs.
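For the record, a minimal sketch of that snappy check. It accepts either a bare '---' or a %YAML directive on the first non-blank line. The helper name yaml_marker? is invented for this example, and treating blank leading lines as skippable is an assumption.

```ruby
# Hypothetical helper: true when the first non-blank line of io is a
# bare '---' document marker or starts with a %YAML directive.
def yaml_marker?(io)
  io.each_line do |line|
    next if line.strip.empty?
    return line.start_with?('%YAML') || line.match?(/^\s*---\s*$/)
  end
  false # empty or all-blank input
end

# The file-based form asked about at the top of the thread.
def File.yaml?(fname)
  File.open(fname) { |f| yaml_marker?(f) }
end
```

It only reads up to the first non-blank line, so it stays cheap on large files, but it will reject valid YAML that omits the marker.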

t.
 
