perl 5 grammar

Mike Samuel · May 15, 2009

I maintain the syntax highlighter for code.google.com and perl support
is rather lacking.

I know perl has a complex grammar, but can someone point me at a
simple lexical grammar for perl 5 that will allow me to at least
identify comment, string, and regex boundaries?

cheers,
mike

Uri Guttman · May 15, 2009

MS> I maintain the syntax highlighter for code.google.com and perl support
MS> is rather lacking.

MS> I know perl has a complex grammar, but can someone point me at a
MS> simple lexical grammar for perl 5 that will allow me to at least
MS> identify comment, string, and regex boundaries?

forget about it. you can't properly parse perl5 since its syntax can
change based upon the modules and pragmas that you load. there are some
attempts at this with varying success. see cperl-mode.el for emacs
lisp. adam kennedy has a parser on cpan that is reasonable. there are
other attempts around.

uri

Jürgen Exner · May 15, 2009

Mike Samuel said:
I maintain the syntax highlighter for code.google.com and perl support
is rather lacking.

Surprise, surprise. Considering what Google has done to Usenet I wonder
why so many people couldn't care less.

I know perl has a complex grammar, but can someone point me at a
simple lexical grammar for perl 5 that will allow me to at least
identify comment, string, and regex boundaries?

The old saying goes "only perl can parse Perl".

Comments are easy: anything following a # sign in the same line or
anything enclosed as POD.

Strings are a different story, because there is no single set of
characters (like single or double quotes) identifying a string but there
are numerous operations and functions, which turn their argument into a
string, notably the quote and quote-like operators

    Customary  Generic  Meaning          Interpolates
    ''         q{}      Literal          no
    ""         qq{}     Literal          yes
    ``         qx{}     Command          yes (unless '' is delimiter)
               qw{}     Word list        no
    //         m{}      Pattern match    yes (unless '' is delimiter)
               qr{}     Pattern          yes (unless '' is delimiter)
               s{}{}    Substitution     yes (unless '' is delimiter)
               tr{}{}   Transliteration  no (but see below)

for which you can use almost any character as the delimiter, e.g. in
m -foo*bar- the text 'foo*bar' is a string (and an RE). This cannot be
parsed on the lexical level.

Same goes for regex boundaries. There isn't a given set of characters
like /.../; a regexp is identified by its position as an argument of a
specific operation. The first arguments of s/// and m// are regular
expressions no matter whether you use the slash or some other
delimiter, and the first argument of tr/// is not an RE although I used
the slash. Again, this cannot be determined on the lexical level.
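
To make the delimiter rules above concrete, here is a minimal Python sketch of how a highlighter might split a quote-like construct into its parts once it already knows it is looking at one. The names read_delimited and split_quote_like are hypothetical, and the sketch assumes the input starts at an operator position, which is exactly the part that cannot be decided lexically. It also ignores modifiers, here-docs, and the whitespace-before-# rule.

```python
import re

# Bracketing delimiters close with their mirror image; any other
# delimiter character closes with itself.
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}

# Operators from the table above (y is a synonym for tr).
QUOTE_OP = re.compile(r"(qq|qx|qw|qr|q|m|s|tr|y)\b\s*")
TWO_PART = {"s", "tr", "y"}


def read_delimited(src, pos):
    """Return (body, next_pos) for the delimited run opening at src[pos]."""
    if pos >= len(src):
        raise ValueError("missing delimiter")
    open_ch = src[pos]
    close_ch = PAIRS.get(open_ch, open_ch)
    depth = 1
    i = pos + 1
    while i < len(src):
        ch = src[i]
        if ch == "\\":
            i += 2  # a backslash protects the next character
            continue
        if ch == close_ch:
            depth -= 1
            if depth == 0:
                return src[pos + 1:i], i + 1
        elif ch == open_ch:
            depth += 1  # only reachable for bracketing delimiters
        i += 1
    raise ValueError("unterminated quote-like construct")


def split_quote_like(src):
    """Split src, which must START with a quote-like operator, into parts."""
    m = QUOTE_OP.match(src)
    if m is None or m.end() >= len(src):
        return None
    op = m.group(1)
    first, pos = read_delimited(src, m.end())
    parts = [first]
    if op in TWO_PART:
        if src[m.end()] in PAIRS:
            # s(foo)(bar): the second part has its own pair of brackets.
            while pos < len(src) and src[pos].isspace():
                pos += 1
            second, pos = read_delimited(src, pos)
        else:
            # s/foo/bar/: the middle delimiter also opens the second part.
            second, pos = read_delimited(src, pos - 1)
        parts.append(second)
    return op, parts
```

For example, split_quote_like("m -foo*bar-") yields ("m", ["foo*bar"]) and split_quote_like("s(foo)(bar)") yields ("s", ["foo", "bar"]). Deciding whether a given "m" or "s" in real code is an operator at all is left to the caller.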

jue

Uri Guttman · May 15, 2009

JE> Same goes for regex boundaries. There isn't a given set of characters
JE> like /.../., but a regexp is identified by its position as argument for
JE> a specific operation. The first arguments in s/// and m// are regular
JE> expressions, no matter if you are using the slash or some other
JE> delimiter and the first argument of tr/// is not an RE, although I used
JE> the slash. Again, this cannot be determined on the lexical level.

i wouldn't say s/// and m// have 'first' arguments but i get your
point. but the first arg to split is always a regex and it doesn't need
to be marked with // as any expression will do. and =~ will make its
right arg a regex as well even without m//.

uri

Jürgen Exner · May 15, 2009

Uri Guttman said:
JE> Same goes for regex boundaries. There isn't a given set of characters
JE> like /.../., but a regexp is identified by its position as argument for
JE> a specific operation. The first arguments in s/// and m// are regular
JE> expressions, no matter if you are using the slash or some other
JE> delimiter and the first argument of tr/// is not an RE, although I used
JE> the slash. Again, this cannot be determined on the lexical level.

i wouldn't say s/// and m// have 'first' arguments but i get your
point.

Well, maybe not technically, but what would you call them? Operands? I
have always been looking for a good term and couldn't find one I liked.

Besides
s(foo)(bar)
is valid and there you got a first argument, at least lexically

but the first arg to split is always a regex and it doesn't need
to be marked with // as any expression will do. and =~ will make its
right arg a regex as well even without m//.

Good point, those are two more which are not listed under quote and
quote-like ops.

jue

Uri Guttman · May 15, 2009

JE> Well, maybe not technically, but how would you call them? Operands? I
JE> have always been looking for a good term and couldn't find one I liked.

JE> Besides
JE> s(foo)(bar)
JE> is valid and there you got a first argument, at least lexically

true. i usually say regex and replacement parts (or left and right part
or first and second) of s///. it is a single operator with its own
syntax so i wouldn't use arguments but it is a minor nit. for m// i just
say the regex itself.

JE> Good point, those are two more which are not listed under quote and
JE> quote-like ops.

and a proper highlighter would mark those as regexes but not
strings. and =~ has an even worse case, a full expression! how would you
syntax highlight this:

$foo =~ bar() ;
or
$foo =~ join( '|', @list_of_parts ) ;

and is the replacement part of s///e highlighted as a string or an
expression? what about s///ee!!

i disable highlighting in emacs as i get blurry eyes from all the
colors. if i wanted psychedelic code, i would take the appropriate
meds.

uri

Tad J McClellan · May 15, 2009

Jürgen Exner said:
Comments are easy: anything following a # sign in the same line or
anything enclosed as POD.

But not as easy as you describe:

$str =~ m#no comment here#;

print "no#comment#here#either\n";

print q#not a comment#;

:-(
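
A small Python sketch of why these three lines hurt: a naive "everything after #" rule flags all of them, and a slightly smarter rule that skips '...' and "..." strings still trips over m#...# and q#...#. Both function names are hypothetical and neither is meant as a real highlighter rule.

```python
import re

NAIVE_COMMENT = re.compile(r"#.*")


def naive_comment(line):
    """Treat the first # on the line as the start of a comment."""
    m = NAIVE_COMMENT.search(line)
    return m.group(0) if m else None


def quote_aware_comment(line):
    """Skip '...' and "..." strings, then treat the first bare # as a comment."""
    quote = None
    i = 0
    while i < len(line):
        ch = line[i]
        if quote:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch == "#":
            return line[i:]
        i += 1
    return None


examples = [
    '$str =~ m#no comment here#;',
    'print "no#comment#here#either\\n";',
    'print q#not a comment#;',
]
for line in examples:
    print(repr(naive_comment(line)), repr(quote_aware_comment(line)))
```

The quote-aware version gets the double-quoted string right but still reports a comment in the first and third lines, because recognising m and q as operators needs more context than a character scan has.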

Jürgen Exner · May 16, 2009

Tad J McClellan said:
But not as easy as you describe:

$str =~ m#no comment here#;

print "no#comment#here#either\n";

print q#not a comment#;

Oh man, why do you always have to find those pesky special cases?
Thanks for pointing this out, you are absolutely right, of course.

jue

Nathan Keel · May 16, 2009

Jürgen Exner said:
Oh man, why do you always have to find those pesky special cases?
Thanks for pointing this out, you are absolutely right, of course.

jue

I don't think a # character in a string is that special of a case, but
you surely knew all of what Tad posted anyway.

Randal L. Schwartz · May 16, 2009

Mike> I maintain the syntax highlighter for code.google.com and perl support
Mike> is rather lacking.

Mike> I know perl has a complex grammar, but can someone point me at a
Mike> simple lexical grammar for perl 5 that will allow me to at least
Mike> identify comment, string, and regex boundaries?

Can't be done at a static level. Ever.

Proof is at http://www.perlmonks.org/index.pl?node_id=44722

The best you can do is say "this is likely 95% correct for 95% of the test
cases I've thought of".

And I'd be willing to bet you that if you show me a static lexer, I can find a
valid Perl program that will break it within a few minutes. And probably
even likely in someone's production code.

print "Just another Perl hacker,"; # the original

Mike Samuel · May 16, 2009

Mike> I maintain the syntax highlighter for code.google.com and perl support
Mike> is rather lacking.

Mike> I know perl has a complex grammar, but can someone point me at a
Mike> simple lexical grammar for perl 5 that will allow me to at least
Mike> identify comment, string, and regex boundaries?

Can't be done at a static level. Ever.

Proof is at http://www.perlmonks.org/index.pl?node_id=44722

The best you can do is say "this is likely 95% correct for 95% of the test
cases I've thought of".

And I'd be willing to bet you that if you show me a static lexer, I can find a
valid Perl program that will break it within a few minutes. And probably
even likely in someone's production code.

Thanks for the explanation all.

ECMAScript has the same property: a regular lexical grammar is
impossible because the meaning of '/' depends on the production.
But in ECMAScript there is a lexical grammar that is correct for all
programs a non-malicious coder is likely to write. It breaks down in a
few places, like (a++/b/i) vs (a = ++/b/i), but the latter is useless.
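
One common way to approximate this is to look only at the previous significant token when a '/' shows up. The Python sketch below illustrates that idea; it is not necessarily what any particular highlighter does, and the token set and the function name slash_starts_regex are hypothetical and deliberately incomplete.

```python
# Tokens after which an operand is expected, so a following "/" is taken
# to open a regex literal. Illustrative list only, not a complete one.
OPERAND_EXPECTED_AFTER = {
    "(", "[", "{", ",", ";", "=", "==", "!", "&&", "||", "?", ":",
    "+", "-", "*", "return", "typeof",
}


def slash_starts_regex(prev_token):
    """Guess whether "/" begins a regex, given the previous significant token.

    prev_token is None at the start of input.
    """
    if prev_token is None:
        return True
    if prev_token in ("++", "--"):
        # Assume postfix, so "/" is division. This gets (a++/b/i) right
        # and the useless (a = ++/b/i) wrong.
        return False
    # Identifiers, numbers, ")" and "]" end an operand, so "/" divides.
    return prev_token in OPERAND_EXPECTED_AFTER


print(slash_starts_regex("="))   # x = /re/   -> regex
print(slash_starts_regex("a"))   # a / b      -> division
print(slash_starts_regex("++"))  # a++ / b    -> division
```

A closing "}" is treated as ending an operand here, which is one of the places where a heuristic like this can guess wrong.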

But in perl, these syntactic irregularities show up frequently in
production code?

Is there a 95% solution that seems to work reasonably well?

Mart van de Wege · May 16, 2009

Ecmascript has the same property where a regular lexical grammar is
impossible since the meaning of '/' depends on the production.
But in Ecmascript, there is a lexical grammar that is correct for all
programs a non-malicious coder is likely to write. There are a few
places where this breaks down like (a++/b/i) vs (a = ++/b/i) but the
latter is useless.

But in perl, these syntactic irregularities show up frequently in
production code?

Is there a 95% solution that seems to work reasonably well?

I personally find that both Emacs CPerl mode and Eclipse
E.P.I.C. deal pretty well with Perl.

Mart

Eric Pozharski · May 16, 2009

I find vim's syntax highlighting perfectly adequate, and it's not too
complex (unlike using PPI or the code from cperl-mode). It's possible to
confuse it, but most 'ordinary' code is highlighted correctly.

That's true if F<syntax/perl.vim> comes from vim-scripts.sf.net. However,
for the F<perl.vim> distributed with Debian's Lenny (and I suppose that
ubuntu thing too), that 'ordinary code' is far more limited. OP seems to
be from the Mac world, so I can't comment on what would be 'ordinary
code' for his ordinary perl.vim.

Eric Pozharski · May 16, 2009

Oh man, why do you always have to find those pesky special cases?
Thanks for pointing this out, you are absolutely right, of course.

He just saved the C<$#x> thing for another turn. I think that if the
hash sign has leading whitespace, then it starts a comment:

{4484:3} [0:255]$ perl -wle 'print q|abc| =~ s # / '
Substitution pattern not terminated at -e line 1.

Nathan Keel · May 16, 2009

Eric said:
Oh man, why do you always have to find those pesky special cases?
Thanks for pointing this out, you are absolutely right, of course.

He just saved the C<$#x> thing for another turn. I think that if the
hash sign has leading whitespace, then it starts a comment:

{4484:3} [0:255]$ perl -wle 'print q|abc| =~ s # / '
Substitution pattern not terminated at -e line 1.

The hash sign isn't the cause of the problem there. Replace it with
anything else, and you'll still see the same error.

Martijn Lievaart · May 16, 2009

Jürgen Exner said:
Comments are easy: anything following a # sign in the same line ...

And even that is not true (and many existing syntax highlighters get this
wrong as well):

while (<>) {
    /^#/ and next;
    ...
}

M4

Uri Guttman · May 17, 2009

MS> But in perl, these syntactic irregularities show up frequently in
MS> production code?

MS> Is there a 95% solution that seems to work reasonably well?

as others have said the better ones can do the 95%/95%. but even then
they are easy to break. here docs can sometimes do it. i know i have
seen unmatched braces (in data or comments, etc) do it even when
escaped. but then i disable colorizing when i can as it is actually
distracting to me. my eyes have to parse different colors just to read
the code!

uri
