RegEx - matching previous match

Discussion in 'Perl Misc' started by j ellings, Feb 27, 2008.

  1. j  ellings

    j ellings Guest

    Hello.

    I have an html file converted from PDF that includes the following
    sample lines:

    (html has been converted)

    <i><b>Z & A Newsstand</b></i><br>
    <i>Retail Food: Mobile Food Vendor</i><br>
    <i>2 N 10th St</i><br>
    <i>Philadelphia, PA 19107</i><br>
    <b>Inspection Date</b><br>
    <i>4/11/07</i><br>
    No Critical Violations<br>
    <i>4/11/07</i><br>
    No Critical Violations<br>
    <i>11/28/06</i><br>
    No Critical Violations<br>
    <i>4/24/06</i><br>
    No Critical Violations<br>
    <i><b>Newstand</b></i><br>
    <i>Retail Food: Mobile Food Vendor</i><br>
    <i>32 N 10th St</i><br>
    <i>Philadelphia, PA 19107</i><br>
    <b>Inspection Date</b><br>
    <i>7/2/07</i><br>
    No Critical Violations<br>
    <i><b>Pudgies Deli</b></i><br>
    <i>Retail Food: Restaurant, Eat-in</i><br>
    <i>46 N 10th St</i><br>
    <i>Philadelphia, PA 19107</i><br>
    <b>Inspection Date</b><br>
    <i>1/11/07</i><br>
    No Critical Violations<br>
    <i>9/25/06</i><br>
    No Critical Violations<br>
    <i>8/7/06</i><br>
    No Critical Violations<br>


    I am trying to capture the information between the <i><b>
    tags as these are the only unique delimiters between entries.

    My regex is as follows:

    while ($html =~ m{<i><b>(.*?)<i><b>}gs) {
    #do something
    }

    Unfortunately, the regex will match the first instance (Z & A
    Newsstand), but ignore the second (Newstand) and then match on the
    third (Pudgies Deli).

    I can see that the match is working according to what I wrote; I am
    trying to fine tune it so that I can grab every match. Is there a way
    to include the previous <i><b> in the next match such that
    it will not skip a potential match?

    Any suggestions or advice would be most appreciated.

    John

     
    j ellings, Feb 27, 2008
    #1

  2. j ellings wrote:
    >
    > (html has been converted)


    Yes, but why on earth did you post the data in that format?

    <non-html data snipped>

    > I am trying to capture the information between the <i><b>
    > tags as these are the only unique delimiters between entries.
    >
    > My regex is as follows:
    >
    > while ($html =~ m{<i><b>(.*?)<i><b>}gs) {
    > #do something
    > }
    >
    > Unfortunately, the regex will match the first instance (Z & A
    > Newsstand), but ignore the second (Newstand) and then match on the
    > third (Pudgies Deli).
    >
    > I can see that the match is working according to what I wrote; I am
    > trying to fine tune it so that I can grab every match. Is there a way
    > to include the previous <i><b> in the next match such that
    > it will not skip a potential match?


    A zero-width positive look-ahead assertion may be what you are after;
    see "perldoc perlre".

    while ($html =~ m{<i><b>(.*?)(?=<i><b>)}gs) {
    -----------------------------^^^------^
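
A minimal sketch of why the original pattern skips entries and what the look-ahead changes, run on abbreviated sample data. With m//g, each search resumes where the previous match ended; the original pattern consumes the next <i><b>, so the entry that starts there can no longer match. The look-ahead checks for the next <i><b> without consuming it. The helper name names_for and the third pattern (which also accepts end of string via \z, so the final entry is not lost) are illustrative assumptions, not something proposed in the thread.

```perl
use strict;
use warnings;

# Abbreviated sample data: three entries, each starting with <i><b>.
my $html = join "\n",
    '<i><b>Z & A Newsstand</b></i><br>',
    '<i>2 N 10th St</i><br>',
    '<i><b>Newstand</b></i><br>',
    '<i>32 N 10th St</i><br>',
    '<i><b>Pudgies Deli</b></i><br>',
    '<i>46 N 10th St</i><br>',
    '';

# Hypothetical helper: apply a record pattern with /g and return the
# business name found at the start of every record it captures.
sub names_for {
    my ($re) = @_;
    my @names;
    while ($html =~ /$re/g) {
        my $record = $1;                          # copy before the next match resets $1
        my ($name) = $record =~ m{\A(.*?)</b>}s;  # the name leads each record
        push @names, $name;
    }
    return @names;
}

my @patterns = (
    [ 'consuming',         qr{<i><b>(.*?)<i><b>}s ],         # original pattern
    [ 'look-ahead',        qr{<i><b>(.*?)(?=<i><b>)}s ],     # pattern suggested above
    [ 'look-ahead or end', qr{<i><b>(.*?)(?=<i><b>|\z)}s ],  # assumption: also stop at end of data
);

for my $p (@patterns) {
    my ($label, $re) = @$p;
    printf "%-18s %s\n", $label, join ' | ', names_for($re);
}
```

On this three-entry sample the consuming pattern finds only the first entry, the plain look-ahead finds the first two, and the variant with \z finds all three, because the last entry has no following <i><b> to anchor on.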

    Another approach that doesn't slurp the whole file into a scalar variable:

    local $/ = '<i><b>';
    while ( my $html = <> ) {
    #do something
    }
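
A rough sketch of how the record-separator approach might be fleshed out. It reads from an in-memory filehandle so it runs on its own; the sub name read_records and the abbreviated data are illustrative assumptions. With $/ set to '<i><b>', each read returns one chunk ending in the separator, chomp strips that separator, and the first chunk (whatever precedes the first <i><b>) is empty and gets skipped.

```perl
use strict;
use warnings;

# Abbreviated sample data; a real script would read the converted file instead.
my $data = join "\n",
    '<i><b>Z & A Newsstand</b></i><br>',
    '<i>2 N 10th St</i><br>',
    '<i><b>Newstand</b></i><br>',
    '<i>32 N 10th St</i><br>',
    '<i><b>Pudgies Deli</b></i><br>',
    '<i>46 N 10th St</i><br>',
    '';

# Hypothetical helper: split the text into records using $/ as the delimiter.
sub read_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/ = '<i><b>';                 # read up to and including each <i><b>
    my @records;
    while ( my $record = <$fh> ) {
        chomp $record;                   # chomp removes the current $/, i.e. '<i><b>'
        next unless length $record;      # nothing precedes the first separator
        push @records, $record;
    }
    close $fh;
    return @records;
}

my @records = read_records($data);
printf "%d records\n", scalar @records;
print "---\n$_" for @records;
```

Unlike the look-ahead regex, this reads the final entry without special handling, since end of file terminates the last record.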

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Feb 28, 2008
    #2

  3. j ellings wrote:
    > Hello.
    >
    > I have an html file converted from PDF that includes the following
    > sample lines:
    >
    > (html has been converted)



    Why has HTML been converted?

    This is a plain-text medium...


    > &lt;i&gt;&lt;b&gt;Z &amp; A Newsstand&lt;/b&gt;&lt;/i&gt;&lt;br&gt;

    ^^ ^^
    ^^ ^^


    > My regex is as follows:
    >
    > while ($html =~ m{<i><b>(.*?)<i><b>}gs) {



    End tags have slash characters in them that your pattern will not match.

    Your data closes the bold before the italic, but your regex looks
    for the italic close before the bold close.


    > I can see that the match is working according to what I wrote;



    You have a strange definition of "working" then...


    > trying to fine tune it so that I can grab every match. Is there a way
    > to include the previous <i><b> in the next match such that
    > it will not skip a potential match?



    You do not need a way to include the previous <i><b> in the next match.


    > Any suggestions or advice would be most appreciated.



    while ($html =~ m{<i><b>(.*?)</b></i>}gs) {
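
As a rough illustration of what this pattern captures, here is a sketch on abbreviated sample data (the sub name entry_names is an illustrative assumption). Because the pattern ends at the closing </b></i>, each match yields only the entry name; the address and inspection lines that follow are not part of the capture.

```perl
use strict;
use warnings;

# Abbreviated sample data: three entries, each starting with <i><b>.
my $html = join "\n",
    '<i><b>Z & A Newsstand</b></i><br>',
    '<i>2 N 10th St</i><br>',
    '<i><b>Newstand</b></i><br>',
    '<i>32 N 10th St</i><br>',
    '<i><b>Pudgies Deli</b></i><br>',
    '<i>46 N 10th St</i><br>',
    '';

# Hypothetical helper: in list context, m//g returns every capture at once.
sub entry_names {
    my ($text) = @_;
    my @names = $text =~ m{<i><b>(.*?)</b></i>}gs;
    return @names;
}

print "$_\n" for entry_names($html);
```

No entry is skipped here, since each match ends before the next opening <i><b>.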


    --
    Tad McClellan
    email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
     
    Tad J McClellan, Feb 28, 2008
    #3
  4. j  ellings

    j ellings Guest

    On Feb 27, 8:21 pm, Gunnar Hjalmarsson wrote:

    >
    > A zero-width positive look-ahead assertion may be what you are after;
    > see "perldoc perlre".
    >
    > while ($html =~ m{<i><b>(.*?)(?=<i><b>)}gs) {
    > -----------------------------^^^------^
    >
    > Another approach that doesn't slurp the whole file into a scalar variable:
    >
    > local $/ = '<i><b>';
    > while ( my $html = <> ) {
    > #do something
    > }
    >
    > --
    > Gunnar Hjalmarsson


    Thanks Gunnar, this worked perfectly; apologies for the formatting.
     
    j ellings, Feb 28, 2008
    #4
  5. j  ellings

    j ellings Guest

    On Feb 27, 8:21 pm, Tad J McClellan wrote:
    >
    > You do not need a way to include the previous <i><b> in the next match.
    >
    > > Any suggestions or advice would be most appreciated.

    >
    > while ($html =~ m{<i><b>(.*?)</b></i>}gs) {
    >
    > --
    > Tad McClellan
    > email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"


    Tad

    Thanks for the suggestion. Your regex captures only the text between
    each opening <i><b> and its closing </b></i>, i.e. the name; what I
    needed was everything from one opening <i><b> up to the next. My
    original regex did capture between two opening tags, but it skipped
    every other entry.
     
    j ellings, Feb 28, 2008
    #5
