perl 5 grammar

Mike Samuel

I maintain the syntax highlighter for code.google.com and perl support
is rather lacking.

I know perl has a complex grammar, but can someone point me at a
simple lexical grammar for perl 5 that will allow me to at least
identify comment, string, and regex boundaries?

cheers,
mike
 
Uri Guttman

MS> I maintain the syntax highlighter for code.google.com and perl support
MS> is rather lacking.

MS> I know perl has a complex grammar, but can someone point me at a
MS> simple lexical grammar for perl 5 that will allow me to at least
MS> identify comment, string, and regex boundaries?

forget about it. you can't properly parse perl5 since its syntax can
change based upon the modules and pragmas that you load. there are some
attempts at this with varying success. see cperl-mode.el for emacs
lisp. adam kennedy has a parser on cpan that is reasonable. there are
other attempts around.

uri
 
Jürgen Exner

Mike Samuel said:
MS> I maintain the syntax highlighter for code.google.com and perl support
MS> is rather lacking.

Surprise, surprise. Considering what Google has done to Usenet I wonder
why so many people couldn't care less.

MS> I know perl has a complex grammar, but can someone point me at a
MS> simple lexical grammar for perl 5 that will allow me to at least
MS> identify comment, string, and regex boundaries?

The old saying goes "only perl can parse Perl".

Comments are easy: anything following a # sign in the same line or
anything enclosed as POD.

Strings are a different story, because there is no single set of
characters (like single or double quotes) identifying a string but there
are numerous operations and functions, which turn their argument into a
string, notably the quote and quote-like operators

Customary  Generic   Meaning          Interpolates
''         q{}       Literal          no
""         qq{}      Literal          yes
``         qx{}      Command          yes (unless '' is delimiter)
           qw{}      Word list        no
//         m{}       Pattern match    yes (unless '' is delimiter)
           qr{}      Pattern          yes (unless '' is delimiter)
           s{}{}     Substitution     yes (unless '' is delimiter)
           tr{}{}    Transliteration  no (but see below)

for which you can use any number of delimiters, e.g. in m -foo*bar- the
text 'foo*bar' is a string (and an RE). This cannot be parsed on the
lexical level.

Same goes for regex boundaries. There isn't a given set of characters
like /.../, but a regexp is identified by its position as argument for
a specific operation. The first arguments in s/// and m// are regular
expressions, no matter if you are using the slash or some other
delimiter and the first argument of tr/// is not an RE, although I used
the slash. Again, this cannot be determined on the lexical level.
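
To make the delimiter problem concrete, here is a minimal Python sketch of scanning one quote-like construct. It is not taken from this thread or from any real highlighter: `scan_body` and `scan_quote_like` are hypothetical names, and the sketch assumes the caller already knows that an operator starts at the given offset, which is exactly the part that cannot be decided lexically. It also assumes bracketing delimiters nest and any other delimiter closes at its next unescaped occurrence, and it ignores modifiers, interpolation, and here-docs.

```python
# Hypothetical helpers for illustration only.
CLOSERS = {"(": ")", "[": "]", "{": "}", "<": ">"}
# Longer operator names first so "qq" is not mistaken for "q".
QUOTE_OPS = ("qq", "qw", "qx", "qr", "tr", "q", "m", "s", "y")
TWO_PART = {"s", "tr", "y"}


def scan_body(src, pos):
    """Return the index just past the delimiter closing the body opened at src[pos]."""
    opener = src[pos]
    closer = CLOSERS.get(opener, opener)
    depth = 1
    i = pos + 1
    while i < len(src):
        ch = src[i]
        if ch == "\\":
            i += 2  # skip the escaped character
            continue
        if ch == closer:  # tested first, so same-character delimiters close here
            depth -= 1
            if depth == 0:
                return i + 1
        elif ch == opener:  # only reachable for bracketing delimiters
            depth += 1
        i += 1
    raise ValueError("unterminated quote-like construct")


def scan_quote_like(src, pos):
    """Return (operator, text) for the quote-like construct starting at src[pos]."""
    op = next(o for o in QUOTE_OPS if src.startswith(o, pos))
    i = pos + len(op)
    while src[i].isspace():  # whitespace may separate operator and delimiter
        i += 1
    opener = src[i]
    end = scan_body(src, i)
    if op in TWO_PART:
        if opener in CLOSERS:
            j = end
            while src[j].isspace():
                j += 1
            end = scan_body(src, j)  # second part has its own bracket pair
        else:
            end = scan_body(src, end - 1)  # middle delimiter is shared
    return op, src[pos:end]


print(scan_quote_like("m -foo*bar- ;", 0))
print(scan_quote_like("s(foo)(bar)", 0))
```

Even this much only works once something else has established that the `m` or `s` is an operator and not, say, part of an identifier or a hash key.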

jue
 
Uri Guttman

JE> Same goes for regex boundaries. There isn't a given set of characters
JE> like /.../., but a regexp is identified by its position as argument for
JE> a specific operation. The first arguments in s/// and m// are regular
JE> expressions, no matter if you are using the slash or some other
JE> delimiter and the first argument of tr/// is not an RE, although I used
JE> the slash. Again, this cannot be determined on the lexical level.

i wouldn't say s/// and m// have 'first' arguments but i get your
point. but the first arg to split is always a regex and it doesn't need
to be marked with // as any expression will do. and =~ will make its
right arg a regex as well even without m//.

uri
 
Jürgen Exner

Uri Guttman said:
JE> Same goes for regex boundaries. There isn't a given set of characters
JE> like /.../., but a regexp is identified by its position as argument for
JE> a specific operation. The first arguments in s/// and m// are regular
JE> expressions, no matter if you are using the slash or some other
JE> delimiter and the first argument of tr/// is not an RE, although I used
JE> the slash. Again, this cannot be determined on the lexical level.

UG> i wouldn't say s/// and m// have 'first' arguments but i get your
UG> point.

Well, maybe not technically, but how would you call them? Operands? I
have always been looking for a good term and couldn't find one I liked.

Besides
s(foo)(bar)
is valid and there you got a first argument, at least lexically :)

UG> but the first arg to split is always a regex and it doesn't need
UG> to be marked with // as any expression will do. and =~ will make its
UG> right arg a regex as well even without m//.

Good point, those are two more which are not listed under quote and
quote-like ops.

jue
 
Uri Guttman

JE> Well, maybe not technically, but how would you call them? Operands? I
JE> have always been looking for a good term and couldn't find one I liked.

JE> Besides
JE> s(foo)(bar)
JE> is valid and there you got a first argument, at least lexically :)

true. i usually say regex and replacement parts (or left and right part
or first and second) of s///. it is a single operator with its own
syntax so i wouldn't use arguments but it is a minor nit. for m// i just
say the regex itself.

JE> Good point, those are two more which are not listed under quote and
JE> quote-like ops.

and a proper highlighter would mark those as regexes but not
strings. and =~ has an even worse case, a full expression! how would you
syntax highlight this:

$foo =~ bar() ;
or
$foo =~ join( '|', @list_of_parts ) ;

:)

and is the replacement part of s///e highlighted as a string or an
expression? what about s///ee!! :)

i disable highlighting in emacs as i get blurry eyes from all the
colors. if i wanted psychedelic code, i would take the appropriate
meds. :)

uri
 
Tad J McClellan

JE> Comments are easy: anything following a # sign in the same line or
JE> anything enclosed as POD.


But not as easy as you describe:

$str =~ m#no comment here#;

print "no#comment#here#either\n";

print q#not a comment#;

:-(
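
A minimal Python sketch of a comment heuristic that survives the three lines above. This is an assumption-laden illustration, not a correct Perl lexer: `comment_start` is a hypothetical name, and the rule it applies (a `#` starts a comment only at the start of the line, after whitespace, or after `;`, and never inside a plain quoted string) is a guess that is easy to break, for example with `m{ # }` or a here-doc.

```python
def comment_start(line):
    """Return the index of the '#' guessed to start a comment, or None."""
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if ch in "\"'`":
            # Plain quoted string: skip to the matching unescaped quote.
            i += 1
            while i < n and line[i] != ch:
                i += 2 if line[i] == "\\" else 1
            i += 1
        elif ch == "#":
            prev = line[i - 1] if i else " "
            if prev.isspace() or prev == ";":
                return i
            # Otherwise assume a delimiter or sigil use: m#...#, q#...#, $#array
            i += 1
        else:
            i += 1
    return None


for src in ('$str =~ m#no comment here#;',
            'print "no#comment#here#either\\n";',
            'print q#not a comment#;',
            '$x = 1; # real comment'):
    print(comment_start(src), src)
```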
 
Jürgen Exner

Tad J McClellan said:
TM> But not as easy as you describe:
TM>
TM> $str =~ m#no comment here#;
TM>
TM> print "no#comment#here#either\n";
TM>
TM> print q#not a comment#;

Oh man, why do you always have to find those pesky special cases?
Thanks for pointing this out, you are absolutely right, of course.

jue
 
Nathan Keel

Jürgen Exner said:
JE> Oh man, why do you always have to find those pesky special cases?
JE> Thanks for pointing this out, you are absolutely right, of course.
JE>
JE> jue

I don't think a # character in a string is that special of a case, but
you surely knew all of what Tad posted anyway.
 
Randal L. Schwartz

Mike> I maintain the syntax highlighter for code.google.com and perl support
Mike> is rather lacking.

Mike> I know perl has a complex grammar, but can someone point me at a
Mike> simple lexical grammar for perl 5 that will allow me to at least
Mike> identify comment, string, and regex boundaries?

Can't be done at a static level. Ever.

Proof is at http://www.perlmonks.org/index.pl?node_id=44722

The best you can do is say "this is likely 95% correct for 95% of the test
cases I've thought of".

And I'd be willing to bet you that if you show me a static lexer, I can find a
valid Perl program that will break it within a few minutes. And probably
even likely in someone's production code. :)

print "Just another Perl hacker,"; # the original
 
Mike Samuel

Mike> I maintain the syntax highlighter for code.google.com and perl support
Mike> is rather lacking.

Mike> I know perl has a complex grammar, but can someone point me at a
Mike> simple lexical grammar for perl 5 that will allow me to at least
Mike> identify comment, string, and regex boundaries?

RS> Can't be done at a static level.  Ever.

RS> Proof is at http://www.perlmonks.org/index.pl?node_id=44722

RS> The best you can do is say "this is likely 95% correct for 95% of the test
RS> cases I've thought of".

RS> And I'd be willing to bet you that if you show me a static lexer, I can find a
RS> valid Perl program that will break it within a few minutes.  And probably
RS> even likely in someone's production code. :)

Thanks for the explanation all.

Ecmascript has the same property where a regular lexical grammar is
impossible since the meaning of '/' depends on the production.
But in Ecmascript, there is a lexical grammar that is correct for all
programs a non-malicious coder is likely to write. There are a few
places where this breaks down like (a++/b/i) vs (a = ++/b/i) but the
latter is useless.
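
For concreteness, a minimal Python sketch of the kind of "previous token" rule such a lexical grammar relies on. This is an illustration under stated assumptions, not the grammar of any particular highlighter: `slash_starts_regex` and `KEYWORDS_BEFORE_REGEX` are hypothetical names, the keyword list is incomplete, and `++`/`--` are always assumed to be postfix, which is why it gets (a = ++/b/i) wrong.

```python
import re

# Keywords after which a value is expected, so "/" opens a regex (incomplete list).
KEYWORDS_BEFORE_REGEX = {"return", "typeof", "case", "in", "of", "delete",
                         "void", "throw", "new", "do", "else"}


def slash_starts_regex(src, pos):
    """Guess whether the '/' at src[pos] opens a regex literal rather than dividing."""
    before = src[:pos].rstrip()
    if not before:
        return True  # start of input: nothing to divide
    if before.endswith(("++", "--")):
        return False  # assume postfix, so the slash is division
    word = re.search(r"[A-Za-z_$][\w$]*$", before)
    if word:
        # An identifier is a value (division); a keyword expects a value (regex).
        return word.group() in KEYWORDS_BEFORE_REGEX
    # After a number, ")" or "]" a value just ended; after anything else one is expected.
    return not (before[-1].isdigit() or before[-1] in ")]")


print(slash_starts_regex("(a++/b/i)", 4))     # treated as division
print(slash_starts_regex("(a = ++/b/i)", 7))  # also division: the known failure
print(slash_starts_regex("x = /re/", 4))      # treated as a regex
```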

But in perl, these syntactic irregularities show up frequently in
production code?

Is there a 95% solution that seems to work reasonably well?
 
Mart van de Wege

MS> Ecmascript has the same property where a regular lexical grammar is
MS> impossible since the meaning of '/' depends on the production.
MS> But in Ecmascript, there is a lexical grammar that is correct for all
MS> programs a non-malicious coder is likely to write. There are a few
MS> places where this breaks down like (a++/b/i) vs (a = ++/b/i) but the
MS> latter is useless.

MS> But in perl, these syntactic irregularities show up frequently in
MS> production code?

MS> Is there a 95% solution that seems to work reasonably well?

I personally find that both Emacs CPerl mode and Eclipse
E.P.I.C. deal pretty well with Perl.

Mart
 
Eric Pozharski

I find vim's syntax highlighting perfectly adequate, and it's not too
complex (unlike using PPI or the code from cperl-mode). It's possible to
confuse it, but most 'ordinary' code is highlighted correctly.

That's true if F<syntax/perl.vim> comes from vim-scripts.sf.net. However,
for the F<perl.vim> distributed with Debian's Lenny (and I suppose that
ubuntu thing too), that 'ordinary code' is far more limited. The OP seems
to be from the Mac world, so I can't comment on what would be 'ordinary
code' for his ordinary perl.vim.
 
Eric Pozharski

JE> Oh man, why do you always have to find those pesky special cases?
JE> Thanks for pointing this out, you are absolutely right, of course.

He just kept the C<$#x> thing for another turn. I think that if the
hash sign has leading space, then it will start a comment:

{4484:3} [0:255]$ perl -wle 'print q|abc| =~ s # / '
Substitution pattern not terminated at -e line 1.
 
Nathan Keel

Eric said:
EP> JE> Oh man, why do you always have to find those pesky special cases?
EP> JE> Thanks for pointing this out, you are absolutely right, of course.

EP> He just kept C<$#x> thing for another turn. I think, that if
EP> hash-sign has leading space, then it will comment

EP> {4484:3} [0:255]$ perl -wle 'print q|abc| =~ s # / '
EP> Substitution pattern not terminated at -e line 1.

The hash sign isn't the cause of the problem there. Replace it with
anything else, and you'll still see the same error.
 
Martijn Lievaart

JE> Comments are easy: anything following a # sign in the same line ...

And even that is not true (and many existing syntax highlighters get this
wrong as well):

while (<>) {
    /^#/ and next;
    ...
}


M4
 
Uri Guttman

MS> But in perl, these syntactic irregularities show up frequently in
MS> production code?

MS> Is there a 95% solution that seems to work reasonably well?

as others have said the better ones can do the 95%/95%. but even then
they are easy to break. here docs can sometimes do it. i know i have
seen unmatched braces (in data or comments, etc) do it even when
escaped. but then i disable colorizing when i can as it is actually
distracting to me. my eyes have to parse different colors just to read
the code!

uri
 
