FAQ 6.12 Can I use Perl regular expressions to match balanced text?

Discussion in 'Perl Misc' started by PerlFAQ Server, Jan 9, 2011.

  1. This is an excerpt from the latest version perlfaq6.pod, which
    comes with the standard Perl distribution. These postings aim to
    reduce the number of repeated questions as well as allow the community
    to review and update the answers. The latest version of the complete
    perlfaq is at http://faq.perl.org .

    --------------------------------------------------------------------

    6.12: Can I use Perl regular expressions to match balanced text?

    (contributed by brian d foy)

    Your first try should probably be the "Text::Balanced" module, which is
    in the Perl standard library since Perl 5.8. It has a variety of
    functions to deal with tricky text. The "Regexp::Common" module can also
    help by providing canned patterns you can use.

    As of Perl 5.10, you can match balanced text with regular expressions
    using recursive patterns. Before Perl 5.10, you had to resort to various
    tricks such as using Perl code in "(??{})" sequences.

    Here's an example using a recursive regular expression. The goal is to
    capture all of the text within angle brackets, including the text in
    nested angle brackets. This sample text has two "major" groups: a group
    with one level of nesting and a group with two levels of nesting. There
    are five total groups in angle brackets:

    I have some <brackets in <nested brackets> > and
    <another group <nested once <nested twice> > >
    and that's it.

    The regular expression to match the balanced text uses two new (to Perl
    5.10) regular expression features. These are covered in perlre and this
    example is a modified version of one in that documentation.

    First, adding the new possessive "+" to any quantifier finds the longest
    match and does not backtrack. That's important since you want to handle
    any angle brackets through the recursion, not backtracking. The group
    "[^<>]++" finds one or more non-angle brackets without backtracking.

    Second, the new "(?PARNO)" refers to the sub-pattern in the particular
    capture group given by "PARNO". In the following regex, the first
    capture group finds (and remembers) the balanced text, and you need that
    same pattern within the first buffer to get past the nested text. That's
    the recursive part. The "(?1)" uses the pattern in the outer capture
    group as an independent part of the regex.

    Putting it all together, you have:

    #!/usr/local/bin/perl5.10.0

    my $string =<<"HERE";
    I have some <brackets in <nested brackets> > and
    <another group <nested once <nested twice> > >
    and that's it.
    HERE

    my @groups = $string =~ m/
    ( # start of capture group 1
    < # match an opening angle bracket
    (?:
    [^<>]++ # one or more non angle brackets, non backtracking
    |
    (?1) # found < or >, so recurse to capture group 1
    )*
    > # match a closing angle bracket

    ) # end of capture group 1
    /xg;

    $" = "\n\t";
    print "Found:\n\t@groups\n";

    The output shows that Perl found the two major groups:

    Found:
    <brackets in <nested brackets> >
    <another group <nested once <nested twice> > >

    With a little extra work, you can get the all of the groups in angle
    brackets even if they are in other angle brackets too. Each time you get
    a balanced match, remove its outer delimiter (that's the one you just
    matched so don't match it again) and add it to a queue of strings to
    process. Keep doing that until you get no matches:

    #!/usr/local/bin/perl5.10.0

    my @queue =<<"HERE";
    I have some <brackets in <nested brackets> > and
    <another group <nested once <nested twice> > >
    and that's it.
    HERE

    my $regex = qr/
    ( # start of bracket 1
    < # match an opening angle bracket
    (?:
    [^<>]++ # one or more non angle brackets, non backtracking
    |
    (?1) # recurse to bracket 1
    )*
    > # match a closing angle bracket

    ) # end of bracket 1
    /x;

    $" = "\n\t";

    while( @queue )
    {
    my $string = shift @queue;

    my @groups = $string =~ m/$regex/g;
    print "Found:\n\t@groups\n\n" if @groups;

    unshift @queue, map { s/^<//; s/>$//; $_ } @groups;
    }

    The output shows all of the groups. The outermost matches show up first
    and the nested matches so up later:

    Found:
    <brackets in <nested brackets> >
    <another group <nested once <nested twice> > >

    Found:
    <nested brackets>

    Found:
    <nested once <nested twice> >

    Found:
    <nested twice>



    --------------------------------------------------------------------

    The perlfaq-workers, a group of volunteers, maintain the perlfaq. They
    are not necessarily experts in every domain where Perl might show up,
    so please include as much information as possible and relevant in any
    corrections. The perlfaq-workers also don't have access to every
    operating system or platform, so please include relevant details for
    corrections to examples that do not work on particular platforms.
    Working code is greatly appreciated.

    If you'd like to help maintain the perlfaq, see the details in
    perlfaq.pod.
     
    PerlFAQ Server, Jan 9, 2011
    #1
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. PerlFAQ Server
    Replies:
    0
    Views:
    105
    PerlFAQ Server
    Jan 19, 2011
  2. PerlFAQ Server
    Replies:
    0
    Views:
    101
    PerlFAQ Server
    Feb 25, 2011
  3. PerlFAQ Server
    Replies:
    0
    Views:
    165
    PerlFAQ Server
    Apr 19, 2011
  4. PerlFAQ Server
    Replies:
    0
    Views:
    157
    PerlFAQ Server
    Apr 28, 2011
  5. Noman Shapiro
    Replies:
    0
    Views:
    235
    Noman Shapiro
    Jul 17, 2013
Loading...

Share This Page