Can't find a syntax error, hoping a second set of eyes will help

Discussion in 'Perl Misc' started by Jason C, Sep 24, 2012.

  1. Jason C

    Jason C Guest

    Can someone look at this and tell me what I'm messing up? I've been coding all night, and my eyes have gone fuzzy :)

    while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {
    if ($2 =~ /^http/i) {
    $text =~ s#<a[^>]*? href=(["'])*([^\1>]*)\1[^>]*?>(.*?)</a>#$2#gsi;
    }
    }

    The error is on the while() line (at least, I remove it and no more error). The error just says:

    syntax error at blah.cgi line 239, near "if"
    syntax error at blah.cgi line 246, near "}"

    The purpose of the function is to remove the <a href=...></a> code in submitted text, but only if the linked text begins with http.

    TIA,

    Jason
     
    Jason C, Sep 24, 2012
    #1

  2. Jason C

    Uri Guttman Guest

    >>>>> "JC" == Jason C <> writes:

    JC> Can someone look at this and tell me what I'm messing up? I've been coding all night, and my eyes have gone fuzzy :)
    JC> while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {

    why do you think the # marks the start of a regex? only if you use m//
    can you change the regex delim from /.
    and ^ will not invert a char class for \1 as \1 isn't a char class
    element. so even if you fix the regex delim, that will fail. finally,
    why are you parsing out urls with a regex when there are modules that do
    it correctly?

    uri
     
    Uri Guttman, Sep 24, 2012
    #2

  3. Jason C

    Jason C Guest

    On Monday, September 24, 2012 1:03:03 AM UTC-4, Ben Morrow wrote:
    >
    > > while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {

    > ^^ m
    >
    > (I would suggest finding a highlighting editor. It makes this sort of
    > syntactic mistake much easier to spot.)


    Thanks, Ben. I didn't realize the m//; was required; since you can change the delimiter with s/// ad hoc, I thought you could here, too.

    I'm using Notepad++, and while it helps me catch opening and ending brackets, it didn't do a lot in recognizing syntax errors (at least, not that I know of). What editor do you recommend?
     
    Jason C, Sep 24, 2012
    #3
  4. Jason C

    Jason C Guest

    On Monday, September 24, 2012 1:23:40 AM UTC-4, Uri Guttman wrote:

    > why do you think the # marks the start of a regex? only if you use m//
    > can you change the regex delim from /.


    Thanks to you, too, Uri. Like I replied to Ben a second ago, I thought that since you could replace the delimiter in s/// ad hoc, that you could in m//, too. Learn something new every day! :)


    > and ^ will not invert a char class for \1 as \1 isn't a char class
    > element. so even if you fix the regex delim, that will fail.


    Oh. Now THAT I did NOT know at all! It does explain a few other errors I've had, though, and couldn't figure out.


    > finally,
    > why are you parsing out urls with a regex when there are modules that do
    > it correctly?


    Two reasons:

    1. I've been working with regex for a year or two, and while it's by no means a strong point in my vocabulary (yet), I'm at least familiar enough with it to usually figure it out.

    2. I briefly looked for a module that would handle this correctly, but wasn't sure what to look for. And, I'm not sure that it warrants the including of a full module if it could potentially be done in a simple regex. If you can recommend a module that would be more stable and/or faster than what I'm doing, though, then I would definitely appreciate the reference!

    FWIW, this modification did work:

    while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {
    $pattern = $1$2$3;
    $repl = $2;

    if ($2 =~ /^http/i) {
    $text =~ s/$pattern/$repl/gsi;
    }
    }

    Admittedly, I'm not sure why $2 is stored long enough for the if() statement, but inside of the if() statement it's empty. Storing them to a different variable worked for this purpose, but if there's a better way, I'm very much open to it.
     
    Jason C, Sep 24, 2012
    #4
  5. Jason C

    Peter Makholm Guest

    Jason C <> writes:

    >> > while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {

    >> ^^ m

    >
    > Thanks, Ben. I didn't realize the m//; was required; since you can
    > change the delimiter with s/// ad hoc, I thought you could here, too.


    You can change the delimiter, but the m is only optional when you use
    the // delimiters.
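
    A minimal sketch of that rule, assuming a made-up $html string (the variable names are illustrative, not from the original script). All three forms test for the same closing tag; only the first may drop the m, and only the first has to escape the slash:

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the thread.
my $html = '<b>bold</b> and <a href="http://example.com/">a link</a>';

# With the default / delimiter the leading "m" is optional,
# but every literal "/" in the pattern must be escaped.
my $plain = $html =~ /<\/a>/i ? 1 : 0;

# With any other delimiter the "m" is mandatory; "/" needs no escaping.
my $hash   = $html =~ m#</a>#i ? 1 : 0;
my $braces = $html =~ m{</a>}i ? 1 : 0;

print "plain=$plain hash=$hash braces=$braces\n";
```

    Without the m, a bare # after =~ starts a comment, which is why the parser only complained once it reached the following "if".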

    //Makholm
     
    Peter Makholm, Sep 24, 2012
    #5
  6. Jason C

    Marc Girod Guest

    On Sep 24, 10:28 am, Jason C <> wrote:

    > What editor do you recommend?


    GNU emacs with cperl-mode

    Marc
     
    Marc Girod, Sep 24, 2012
    #6
  7. Jason C

    anotheranne Guest

    Jason C wrote:

    > Can someone look at this and tell me what I'm messing up? I've been coding all night, and my eyes have gone fuzzy :)
    >
    > while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {
    > if ($2 =~ /^http/i) {
    > $text =~ s#<a[^>]*? href=(["'])*([^\1>]*)\1[^>]*?>(.*?)</a>#$2#gsi;
    > }
    > }


    Whatever other errors your regex may have, I would suggest that
    you stick with the regular m// and s/// constructs. You should of
    course then escape the '/' in </a> . Changing this should make it run.

    Don't use # as an eye-easy replacement for / because a) it is the perl
    character for a comment, and b) in a regex (at least with the /x
    modifier) it is also a metacharacter. Trouble will come your way if
    you use this.

    If you do want to get away from // and /// then use balanced
    delimiters like m{} and s{}{} . See p319 in Friedl MASTERING REGULAR
    EXPRESSIONS. O'Reilly.

    When you use any alternative to m//, the m is mandatory. Only when
    using // can you omit the m. Thus // and m{} are both valid constructs.

    Also you can remove the ';' after the gsi

    hope this helps.

    anotheranne
    >
    > The error is on the while() line (at least, I remove it and no more error). The error just says:
    >
    > syntax error at blah.cgi line 239, near "if"
    > syntax error at blah.cgi line 246, near "}"
    >
    > The purpose of the function is to remove the <a href=...></a> code in submitted text, but only if the linked text begins with http.
    >
    > TIA,
    >
    > Jason
     
    anotheranne, Sep 24, 2012
    #7
  8. Jason C

    anotheranne Guest

    Jason C wrote:

    > On Monday, September 24, 2012 1:03:03 AM UTC-4, Ben Morrow wrote:
    >>
    >> > while ($text =~ #<a[^>]* href=(["'])*[^\1>]*\1[^>]*?>(.*?)</a>#gsi) {

    >> ^^ m
    >>
    >> (I would suggest finding a highlighting editor. It makes this sort of
    >> syntactic mistake much easier to spot.)

    >
    > Thanks, Ben. I didn't realize the m//; was required; since you can change the delimiter with s/// ad hoc, I thought you could here, too.
    >
    > I'm using Notepad++, and while it helps me catch opening and ending brackets, it didn't do a lot in recognizing syntax errors (at least, not that I know of). What editor do you recommend?


    Padre is a nice perl IDE.

    http://padre.perlide.org/

    anotheranne
     
    anotheranne, Sep 24, 2012
    #8
  9. Jason C

    Scott Bryce Guest

    On 9/24/2012 3:28 AM, Jason C wrote:
    > I'm using Notepad++,


    I assume that means you are on a Windows box.

    > What editor do you recommend?


    I like UltraEdit.
     
    Scott Bryce, Sep 24, 2012
    #9
  10. Jason C

    Uri Guttman Guest

    >>>>> "JC" == Jason C <> writes:

    JC> On Monday, September 24, 2012 1:23:40 AM UTC-4, Uri Guttman wrote:
    >> why do you think the # marks the start of a regex? only if you use m//
    >> can you change the regex delim from /.


    JC> Thanks to you, too, Uri. Like I replied to Ben a second ago, I
    JC> thought that since you could replace the delimiter in s/// ad hoc,
    JC> that you could in m//, too. Learn something new every day! :)

    but s/// has the s to mark the next char. =~ ## has no leading marker so it
    would just be a comment. also using # for the delimiter is just a bad
    idea as it confuses many readers.

    >> finally,
    >> why are you parsing out urls with a regex when there are modules that do
    >> it correctly?


    JC> Two reasons:

    JC> 1. I've been working with regex for a year or two, and while it's
    JC> by no means a strong point in my vocabulary (yet), I'm at least
    JC> familiar enough with it to usually figure it out.

    good that you are studying them but it still is the wrong tool for
    this. learning when regexes aren't a good solution is part of learning
    regexes.

    JC> 2. I briefly looked for a module that would handle this correctly,
    JC> but wasn't sure what to look for. And, I'm not sure that it
    JC> warrants the including of a full module if it could potentially be
    JC> done in a simple regex. If you can recommend a module that would
    JC> be more stable and/or faster than what I'm doing, though, then I
    JC> would definitely appreciate the reference!

    JC> FWIW, this modification did work:

    JC> while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {

    it will fail if the opening quote is " and the string has a ' inside
    it. perfectly legal html but you can't parse it that way.

    JC> Admittedly, I'm not sure why $2 is stored long enough for the if()
    JC> statement, but inside of the if() statement it's empty. Storing
    JC> them to a different variable worked for this purpose, but if
    JC> there's a better way, I'm very much open to it.

    you need to read more about regexes and the $1 stuff. they live until
    the next successful match (they are global, and dynamically scoped).
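
    A small sketch of that lifetime, assuming a made-up one-link $text. The inner match succeeds but has no capture groups, so inside the inner block $1 is no longer the link text; a copy taken beforehand survives:

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the thread.
my $text = '<a href="http://example.com/">http://example.com/</a>';

my ($saved, $inner_defined);

if ($text =~ m{<a\b[^>]*>(.*?)</a>}si) {
    my $label = $1;                 # copy the capture immediately

    if ($1 =~ /^http/i) {           # a second successful match, no capture groups
        # $1 now belongs to the inner match, which captured nothing.
        $inner_defined = defined($1) ? 1 : 0;
        $saved         = $label;
    }
}

print "inner \$1 defined? $inner_defined\n";
print "saved label: $saved\n";
```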

    uri
     
    Uri Guttman, Sep 24, 2012
    #10
  11. Jason C

    Jason C Guest

    On Monday, September 24, 2012 11:03:04 AM UTC-4, Ben Morrow wrote:
    > > FWIW, this modification did work:
    > >
    > > while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {
    > > $pattern = $1$2$3;

    > ^^ ^^
    > I think not...


    Blah, sorry; that's what I get for trying to type up dummy code at 5am. In practice, I put it in quotes:

    $pattern = "$1$2$3";


    > > if ($2 =~ /^http/i) {
    > > $text =~ s/$pattern/$repl/gsi;

    >
    > This almost certainly doesn't do what you think. If nothing else, you
    > want to \Q $pattern.


    Excellent point about \Q. What do you mean, though, that it doesn't do what I think?
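
    For reference, a minimal sketch of what \Q changes when a variable is interpolated into a pattern. The strings here are invented for illustration; the same applies to a $pattern holding a URL, where "." and "?" are metacharacters:

```perl
use strict;
use warnings;

# Hypothetical sample data, not taken from the thread.
my $text    = 'Is 1+1 equal to 2? Yes, 1+1 is 2.';
my $literal = '1+1';    # meant to be matched literally

# Without \Q the "+" is a quantifier: the pattern means
# "one or more 1s followed by a 1", which never occurs in $text.
(my $unquoted = $text) =~ s/$literal/two/g;

# With \Q...\E every metacharacter in $literal is escaped first.
(my $quoted = $text) =~ s/\Q$literal\E/two/g;

print "$unquoted\n";
print "$quoted\n";
```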


    > What are you trying to do here: strip tags?


    Yes and no. I'm using a contenteditable instead of a textarea, and I've discovered that when someone copy-and-pastes an URL from Chrome or FF, it's automatically making the URL a link. Eg:

    <a href="http://www.google.com">http://www.google.com</a>

    But of course, if you just type the address, then it doesn't. So on my end, I was using URI::Find to convert addresses to links, and ending up with a mess like:

    <a href="<a href="http://www.google.com">http://www.google.com</a>"><a href="http://www.google.com">http://www.google.com</a></a>

    So, my goal here is to remove the <a href> tag, but only if the linked text is an URL.


    > Why not
    > just do one s/// (or, you know, use a module)?


    I had originally tried doing it with a simple s///, but couldn't figure out how to make it conditional. Like this:

    $text =~ s#<a[^>]*? href=(["'])*([^\1>]*)\1[^>]*?>(.*?)</a>#$2#gsi
    if ($3 =~ /^http/i);

    This worked correctly if I removed the if() statement. In testing, I changed the replacement to:

    1 - $1, 2 - $2, 3 - $3

    just to make sure that $3 did begin with http, and it did, so I couldn't figure out why the if() wasn't catching it unless it was dropping the $3 value before reaching the if().
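
    One likely reason the if() never fired: in "STATEMENT if (CONDITION)" the condition is evaluated before the statement, so $3 is tested before the s### has run at all. A hedged sketch of one way to make a single substitution conditional is the /e modifier, which evaluates the replacement as code for each match. The simplified pattern and the sample text below are assumptions for illustration, not the original regex:

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the thread.
my $text = 'See <a href="http://www.google.com">http://www.google.com</a> '
         . 'or the <a href="/about">About page</a>.';

# /e runs the replacement as Perl code once per match. The captures are
# copied first, because the inner match would otherwise reset $1 and $2.
$text =~ s{(<a\b[^>]*>(.*?)</a>)}{
    my ($tag, $label) = ($1, $2);
    $label =~ /^http/i ? $label : $tag;
}gsie;

print "$text\n";
```

    Links whose text does not start with http are put back unchanged, so only the auto-generated ones are unwrapped.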


    > > Admittedly, I'm not sure why $2 is stored long enough for the if()
    > > statement, but inside of the if() statement it's empty. Storing them to
    > > a different variable worked for this purpose, but if there's a better
    > > way, I'm very much open to it.

    >
    > The $N variables last until the next successful pattern match. In this
    > case, the '$2 =~ /^http/i' in the condition of the if clears them all
    > (even though it doesn't capture anything).


    Ahh, that makes sense. I mistakenly thought that, since I wasn't assigning $N, then they would retain the previous value.


    > In general I prefer to assign captures to real variables right away:
    >
    > while (my ($tag, $url) = m#(<a...>(.*?)</a>)#gsi) {
    >
    > (notice also that captures can be nested, and DTRT).


    Great to know! Thanks.
     
    Jason C, Sep 25, 2012
    #11
  12. Jason C

    Jason C Guest

    On Monday, September 24, 2012 11:03:04 AM UTC-4, Ben Morrow wrote:

    > while (my ($tag, $url) = m#(<a...>(.*?)</a>)#gsi) {


    In this, how does it know that we're testing $text? Or, did you mean to type something like:

    while (my ($tag, $url) = $text =~ m#(<a...>(.*?)</a>)#gsi)
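
    A match with no =~ binding tests $_, so the string does have to be bound explicitly. A minimal sketch under a few assumptions: the sample $text and the simplified pattern are invented, and the loop uses a scalar-context match, because in list context a /g match returns the captures of every match at once rather than one link per iteration:

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the thread.
my $text = '<a href="http://a.example/">http://a.example/</a> '
         . '<a href="/b">B</a>';

my @seen;

# Scalar-context match with /g: each iteration finds the next link,
# starting from where the previous match ended (pos($text)).
while ($text =~ m{(<a\b[^>]*>(.*?)</a>)}gsi) {
    my ($tag, $label) = ($1, $2);   # real variables, safe from later matches
    push @seen, $label;
    print "label: $label\n";
}
```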
     
    Jason C, Sep 25, 2012
    #12
  13. Jason C

    Jason C Guest

    On Monday, September 24, 2012 3:44:44 PM UTC-4, Uri Guttman wrote:

    > JC> while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {
    >
    > it will fail if the opening quote is " and the string has a ' inside
    > it. perfectly legal html but you can't parse it that way.


    I'll probably discard this idea and pursue a module, like you guys suggested. But for the sake of learning...

    I recognized this issue, too, which is why I was originally using [^\1], like so:

    (["'])*([^\1>]*)\1

    I think it was you that pointed out that I can't negate a backreference like that, though.

    What would be the correct way to do this, if I can't negate a backreference as a character class?
     
    Jason C, Sep 25, 2012
    #13
  14. Jason C

    Jim Gibson Guest

    In article <>,
    Jason C <> wrote:

    > On Monday, September 24, 2012 3:44:44 PM UTC-4, Uri Guttman wrote:
    >
    > > JC> while ($text =~ m#(<a[^>]* href=["'].*?["'].*?>)(.*?)(</a>)#gsi) {
    > >
    > > it will fail if the opening quote is " and the string has a ' inside
    > > it. perfectly legal html but you can't parse it that way.

    >
    > I'll probably discard this idea and pursue a module, like you guys suggested.
    > But for the sake of learning...
    >
    > I recognized this issue, too, which is why I was originally using [^\1], like
    > so:
    >
    > (["'])*([^\1>]*)\1
    >
    > I think it was you that pointed out that I can't negate a backreference like
    > that, though.
    >
    > What would be the correct way to do this, if I can't negate a backreference as a character class?


    Capture the leading delimiter and use a backreference that is not in a
    character class:

    while ($text =~ m{(<a[^>]* href=(["']).*?\2.*?>)(.*?)(</a>)}gsi) {
                                             ^^
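
    Note that adding the quote group shifts the numbering in the pattern above: the link text is now $3 and the closing tag $4. A minimal sketch of the technique on its own, using an invented tag and a cut-down pattern that only extracts the href value. The second form is an alternative, not from the thread, that uses a negative lookahead to get the effect [^\1] was aiming for:

```perl
use strict;
use warnings;

# Hypothetical sample input: a double-quoted href containing a single quote.
my $tag = q{<a class="x" href="it's.html" title='t'>label</a>};

# Inside [...] "\1" is not a backreference (it is read as the character
# with octal code 1), so [^\1] cannot mean "anything but the opening quote".
# Outside a class, \1 matches exactly what group 1 captured.
my ($quote, $href) = $tag =~ m{\bhref=(["'])(.*?)\1}si;

# Alternative: a negative lookahead that refuses to step over the
# captured quote character.
my ($quote2, $href2) = $tag =~ m{\bhref=(["'])((?:(?!\1).)*)\1}si;

print "href: $href\n";
print "href (lookahead form): $href2\n";
```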

    --
    Jim Gibson
     
    Jim Gibson, Sep 25, 2012
    #14
  15. Jason C

    Kaz Kylheku Guest

    On 2012-09-26, Eli the Bearded <*@eli.users.panix.com> wrote:
    >:r! cat $PHTML/some.links.html


    UUOC infects the vi command line!

    :r!cat <file> -> :r <file>
     
    Kaz Kylheku, Sep 26, 2012
    #15