Regexp: Negation with backreference?

Discussion in 'Perl Misc' started by j.vimal, May 30, 2006.

  1. j.vimal

    j.vimal Guest

    Hi
    I would like to extract the anchors from a page. This is the simple
    pattern I wrote:
    /(<[aA]\\s[^>]*>[^<]*<\/a>)/

    Note that it is to be used with a programming language, say PHP, but
    the syntax is (almost) the same as Perl's, except for escape sequences.

    Now, after I have got all the anchors, I want to parse them, to get the
    href and title attributes.
    For the href, I wrote

    \\bhref\\s*=\\s*(["'])([^\\1])\\1

    I search for href at the start of a word boundary, then skip spaces,
    then the equals sign, then skip spaces, then I get the quote. This is
    reference 1. Now I want to continue until I encounter the same
    reference 1 again. The last character is then reference 1.

    So, is this syntax right? It doesn't seem to work for me ...

    And, of course, the quotes need not be there. I will change it :)

    Thanks!
     
    j.vimal, May 30, 2006
    #1

  2. Paul Lalli

    Paul Lalli Guest

    j.vimal wrote:
    > Hi
    > I would like to extract the anchors from a page. This is the simple
    > pattern I wrote:
    > /(<[aA]\\s[^>]*>[^<]*<\/a>)/


    Wrong approach. Use an HTML Parsing module to parse HTML.

    > Note that it is to be used with a programming language, say PHP, but
    > the syntax is (almost) the same as Perl's, except for escape sequences.


    Wow, coincidentally, this is almost a group that deals with languages
    other than Perl!

    comp.lang.php is over there---->

    Paul Lalli
     
    Paul Lalli, May 30, 2006
    #2

  3. j.vimal

    j.vimal Guest

    Ok ... But say I really want to do it this way, to learn Regexp, :)
    Then ?

    But why do you say that this is a wrong way? Are there performance
    issues?
     
    j.vimal, May 30, 2006
    #3
  4. Paul Lalli

    Paul Lalli Guest

    j.vimal wrote:
    > Ok ... But say I really want to do it this way, to learn Regexp, :)


    There is no such thing. Regexps are not a universal concept. You
    cannot take a regular expression written for Perl and just assume it
    will work the same way in any other language.

    > Then ?
    >
    > But why do you say that this is a wrong way? Are there performance
    > issues?


    No, there are ability issues. Regular expressions cannot (correctly)
    parse HTML.
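
    As a rough illustration of the parser-based approach recommended here,
    the sketch below collects the href and title attributes of every anchor.
    It uses Python's standard html.parser module rather than a Perl module,
    purely as an assumption for the example; the class name AnchorCollector
    and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, title) pairs from every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names and strips quotes,
        # so mixed case and unquoted values need no special handling.
        if tag == "a":
            attributes = dict(attrs)
            self.anchors.append((attributes.get("href"), attributes.get("title")))


# Hypothetical sample markup: mixed case, both quote styles, one unquoted value.
page = """<p><A HREF="/wiki/Perl" title='Perl'>Perl</A> and <a href=/wiki/PHP>PHP</a></p>"""

collector = AnchorCollector()
collector.feed(page)
print(collector.anchors)  # [('/wiki/Perl', 'Perl'), ('/wiki/PHP', None)]
```

    A missing attribute simply comes back as None, which is the kind of
    case a hand-written pattern has to handle explicitly.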

    Paul Lalli
     
    Paul Lalli, May 30, 2006
    #4
  5. j.vimal

    j.vimal Guest

    Ok. Then I think it suits my purpose.
    My purpose is just to visualize the various links in a given Wikipedia
    article. Since they follow a common method to address their links,
    regular expressions would serve my purpose without the overhead of an
    HTML parser :)

    Thanks
    Vimal
     
    j.vimal, May 30, 2006
    #5
  6. Gunnar Hjalmarsson, May 30, 2006
    #6
  7. Xicheng Jia

    Xicheng Jia Guest

    j.vimal wrote:
    > Hi
    > I would like to extract the anchors from a page. This is the simple
    > pattern I wrote:
    > /(<[aA]\\s[^>]*>[^<]*<\/a>)/
    >
    > Note that it is to be used with a programming language, say PHP, but
    > the syntax is (almost) the same as Perl's, except for escape sequences.
    >
    > Now, after I have got all the anchors, I want to parse them, to get the
    > href and title attributes.
    > For the href, I wrote
    >
    > \\bhref\\s*=\\s*(["'])([^\\1])\\1


    This pattern matches only one character between the two quotes (the
    character captured in $2). :)

    And I guess [^\\1] does not work the way you thought it would, i.e. as
    [^"] or [^']: inside a character class, \1 is not a backreference. You
    can try the non-greedy form .*? instead, which extends one character
    at a time and stops at the first following \1:

    \bhref\s*=\s*(["'])(.*?)\1
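
    A minimal sketch of the difference, written with Python's re module on
    the assumption that it treats these constructs the same way Perl does
    (inside a character class, \1 is read as the octal escape for character
    0x01, not as a backreference). The names ORIGINAL, FIXED and tag are
    hypothetical.

```python
import re

# Original pattern: [^\1] means "any single character except 0x01",
# so exactly one character is allowed between the quotes.
ORIGINAL = re.compile(r"""\bhref\s*=\s*(["'])([^\1])\1""")

# Suggested pattern: .*? grows one character at a time and stops at the
# first quote that equals whatever group 1 captured.
FIXED = re.compile(r"""\bhref\s*=\s*(["'])(.*?)\1""")

# Hypothetical sample anchor.
tag = '<a href="/wiki/Perl" title="Perl">Perl</a>'

print(ORIGINAL.search(tag))                      # None
print(ORIGINAL.search('<a href="x">').group(2))  # x
print(FIXED.search(tag).group(2))                # /wiki/Perl
```

    Because the closing quote must equal group 1, a value such as
    href='it"s' is captured whole by the second pattern.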

    Or you may use a conditional construct if the two balanced quotes are
    optional:

    \bhref\s*=\s*(["'])?(.*?)(?(1)\1|\s)
    (untested)
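
    Since the pattern above is marked untested, here is a small check of
    how it behaves, again sketched with Python's re module on the assumption
    that its (?(1)...|...) conditional works like Perl's. The helper
    href_value and the sample tags are hypothetical.

```python
import re

# If group 1 captured a quote, require the same quote to close the value;
# otherwise require a whitespace character after the value.
HREF = re.compile(r"""\bhref\s*=\s*(["'])?(.*?)(?(1)\1|\s)""")


def href_value(tag):
    """Return the href value found in tag, or None if the pattern fails."""
    match = HREF.search(tag)
    return match.group(2) if match else None


print(href_value('<a href="/wiki/Perl" title="Perl">'))  # /wiki/Perl
print(href_value("<a href=/wiki/Perl title=Perl>"))      # /wiki/Perl
print(href_value("<a href=/wiki/Perl>"))                 # None
```

    The last case shows a limitation: an unquoted value followed directly
    by > is not matched, because the "no quote" branch accepts only
    whitespace as the terminator.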

    BTW, why would you use double backslashes to escape those special
    characters?

    Xicheng

    > I search for href at the start of a word boundary, then skip spaces,
    > then the equals sign, then skip spaces, then I get the quote. This is
    > reference 1. Now I want to continue until I encounter the same
    > reference 1 again. The last character is then reference 1.
    >
    > So, is this syntax right? It doesn't seem to work for me ...
    >
    > And, of course, the quotes need not be there. I will change it :)
    >
    > Thanks!
     
    Xicheng Jia, May 30, 2006
    #7
