capturing multiple patterns per line

Discussion in 'Perl Misc' started by ccc31807, Feb 5, 2010.

  1. ccc31807

    ccc31807 Guest

    This is a newbie question, I admit, but I don't know the answer.

    Suppose I am parsing a file line by line, and I want to push to an
    array all substrings on that line that match a pattern. For example,
    consider the listing below. @urls SHOULD contain this: @urls = (http://
    google.com, http://yahoo.com, http://amazon.com, http://ebay.com)
    Instead, it contains only the last value. Using the g modifier doesn't
    help.

    I know why @urls contains only the last value, but I don't know how to
    get all the values.

    Thanks, CC.

    -------listing---------------
    use strict;
    use warnings;

    my @urls;
    while (<DATA>)
    {
    if (/<a.*href="([^"]+)/) { push @urls, $1; }
    }

    print @urls;
    exit(0);

    __DATA__
    <html>\n
    <body>\n
    <h1>My Favorite Sites</h1>\n
    <p>\n
    My favorite sites are <a href="http://google.com">Google</a>, <a
    href="http://yahoo.com">Yahoo</a>, <a href="http://amazon.com">Amazon</
    a>, and <a href="http://ebay.com">Ebay</a>.\n
    </p>\n
    </body>\n
    </html>\n
    ccc31807, Feb 5, 2010
    #1

  2. ccc31807 <> wrote:
    >This is a newbie question, I admit, but I don't know the answer.
    >
    >Suppose I am parsing a file line by line, and I want to push to an
    >array all substrings on that line that match a pattern. For example,
    >consider the listing below. @urls SHOULD contain this: @urls = (http://
    >google.com, http://yahoo.com, http://amazon.com, http://ebay.com)
    >Instead, it contains only the last value. Using the g modifier doesn't
    >help.
    >
    >I know why @urls contains only the last value, but I don't know how to
    >get all the values.


    Cannot repro your problem. The code you posted adds all three URLs to
    the array and prints them in one contiguous line.

    C:\tmp>t.pl
    http://google.comhttp://amazon.comhttp://ebay.com

    jue
    Jürgen Exner, Feb 5, 2010
    #2

  3. ccc31807

    ccc31807 Guest

    On Feb 5, 11:30 am, Jürgen Exner <> wrote:
    > Cannot repro your problem. The code you posted adds all three URLs to
    > the array and prints them in one contiguous line.
    >
    > C:\tmp>t.pl
    > http://google.comhttp://amazon.comhttp://ebay.com


    This is a mystery. I've run the script on both a Windows and Linux
    machine with the same results. Besides, your output should also
    include Yahoo, which it doesn't.

    I was able to do what I wanted with the following hack. I'm not real
    happy about it, but it works. Still, I'd rather know how to do it with
    a RE.

    CC.

    ---------hack---------------
    while (<DATA>)
    {
    my @line = split /<a/;
    foreach my $url (@line)
    {
    if (/<a.*href="([^"]+)/) { push @urls, $1; }
    }
    }
    ccc31807, Feb 5, 2010
    #3
  4. Willem

    Willem Guest

    ccc31807 wrote:
    ) This is a newbie question, I admit, but I don't know the answer.
    )
    ) Suppose I am parsing a file line by line, and I want to push to an
    ) array all substrings on that line that match a pattern. For example,
    ) consider the listing below. @urls SHOULD contain this: @urls = (http://
    ) google.com, http://yahoo.com, http://amazon.com, http://ebay.com)
    ) Instead, it contains only the last value. Using the g modifier doesn't
    ) help.
    )
    ) I know why @urls contains only the last value, but I don't know how to
    ) get all the values.

    I think you don't actually know why it only contains the last value,
    because there are two separate issues with your code.

    ) Thanks, CC.
    )
    ) -------listing---------------
    ) use strict;
    ) use warnings;
    )
    ) my @urls;
    ) while (<DATA>)
    ) {
    ) if (/<a.*href="([^"]+)/) { push @urls, $1; }
    ) }

    First of all, the .* in there is greedy: it matches as much as it can, so
    in this case it runs from the first <a to the last href="..." on the line.

    Second, even with the /g modifier, the results will not all be put in $1;
    $1 only ever holds the capture from the most recent match.

    And third, obviously, this is a lot easier in perl if you realise that it
    can do a lot of set processing:

    while (<DATA>)
    {
    push @urls, /<a.*?href="(.*?)"/gi;
    }

    Or even:

    @urls = map { /<a.*?href="(.*?)"/gi } <DATA>;

    Although that is a lot more memory hungry.
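To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The sample line is made up for illustration (the four links joined onto one logical line); the two patterns are the ones quoted above.

```perl
use strict;
use warnings;

# One logical line holding four links (hypothetical sample data).
my $line = '<a href="http://google.com">Google</a>, '
         . '<a href="http://yahoo.com">Yahoo</a>, '
         . '<a href="http://amazon.com">Amazon</a>, '
         . '<a href="http://ebay.com">Ebay</a>';

# Greedy .* runs to the end of the line and backtracks to the LAST href,
# so even with /g there is only one match to return.
my @greedy = $line =~ /<a.*href="([^"]+)/g;

# Non-greedy .*? stops at the first href after each <a, so /g in list
# context returns every captured URL.
my @lazy = $line =~ /<a.*?href="(.*?)"/gi;

print "greedy: @greedy\n";
print "lazy:   @lazy\n";
```

The greedy version should yield only the ebay URL, while the non-greedy one yields all four.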


    SaSW, Willem
    --
    Disclaimer: I am in no way responsible for any of the statements
    made in the above text. For all I know I might be
    drugged or something..
    No I'm not paranoid. You all think I'm paranoid, don't you !
    #EOT
    Willem, Feb 5, 2010
    #4
  5. ccc31807

    ccc31807 Guest

    On Feb 5, 11:58 am, Willem <> wrote:
    >   while (<DATA>)
    >   {
    >     push @urls, /<a.*?href="(.*?)"/gi;
    >   }


    Yes, yes, yes, you are entirely right. I thought that the non-greedy
    modifier might do the trick, but (1) I didn't realize that the greedy
    version would skip all the way to the last one to the detriment of my
    search, and (2) I didn't carefully think through exactly where I
    should use the non-greedy modifier.

    Thanks, CC.
    ccc31807, Feb 5, 2010
    #5
  6. ccc31807 wrote:
    > On Feb 5, 11:30 am, Jürgen Exner <> wrote:
    >> Cannot repro your problem. The code you posted adds all three URLs to
    >> the array and prints them in one contiguous line.
    >>
    >> C:\tmp>t.pl
    >> http://google.comhttp://amazon.comhttp://ebay.com

    >
    > This is a mystery. I've run the script on both a Windows and Linux
    > machine with the same results. Besides, your output should also
    > include Yahoo, which it doesn't.
    >
    > I was able to do what I wanted with the following hack. I'm not real
    > happy about it, but it works. Still, I'd rather know how to do it with
    > a RE.
    >
    > ---------hack---------------
    > while (<DATA>)
    > {
    > my @line = split /<a/;
    > foreach my $url (@line)
    > {
    > if (/<a.*href="([^"]+)/) { push @urls, $1; }


    That is short for:

    if ($_ =~ /<a.*href="([^"]+)/)

    So you are not using the results from split() at all and the foreach
    loop is superfluous. But if you changed that to:

    if ($url =~ /<a.*href="([^"]+)/)

    Then it wouldn't work because "split /<a/" removes the string '<a' from
    all input and the regular expression requires a match with '<a'.

    > }
    > }
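A small sketch of what the split-based approach would need in order to work, assuming the same kind of input; the sample line and the adjusted pattern are illustrative, not taken from the original script. The two changes are matching against the loop variable rather than $_, and not requiring '<a' in the pattern, because split has already removed it.

```perl
use strict;
use warnings;

my $line = 'x <a href="g">G</a>, <a href="y">Y</a> x';   # hypothetical input

# split removes the separator, so no piece contains '<a' any more.
my @pieces = split /<a/, $line;

my @urls;
for my $piece (@pieces) {
    # Match against $piece explicitly (not $_), and do not require '<a'.
    push @urls, $1 if $piece =~ /^\s+href="([^"]+)/;
}

print join(',', @urls), "\n";
```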




    John
    --
    The programmer is fighting against the two most
    destructive forces in the universe: entropy and
    human stupidity. -- Damian Conway
    John W. Krahn, Feb 5, 2010
    #6
  7. ccc31807 <> wrote:
    >On Feb 5, 11:30 am, Jürgen Exner <> wrote:
    >> Cannot repro your problem. The code you posted adds all three URLs to
    >> the array and prints them in one contiguous line.
    >>
    >> C:\tmp>t.pl
    >> http://google.comhttp://amazon.comhttp://ebay.com

    >
    >This is a mystery. I've run the script on both a Windows and Linux
    >machine with the same results. Besides, your output should also
    >include Yahoo, which it doesn't.


    After reading the other responses I realize that I was looking at the
    wrong problem. You wrote "Instead, it contains only the last value. "
    Running your code I saw three distinct values. Three is more than "only
    the last", so obviously your claim was wrong.
    You never mentioned that you were talking about the RE not
    extracting/capturing all the elements from a _SINGLE(!!!)_ line.

    Thank you very much for throwing red herrings around.

    jue
    Jürgen Exner, Feb 5, 2010
    #7
  8. On 05/02/2010 16:56, ccc31807 wrote:
    > On Feb 5, 11:30 am, Jürgen Exner<> wrote:
    >> Cannot repro your problem. The code you posted adds all three URLs to
    >> the array and prints them in one contiguous line.
    >>
    >> C:\tmp>t.pl
    >> http://google.comhttp://amazon.comhttp://ebay.com

    >
    > This is a mystery. I've run the script on both a Windows and Linux
    > machine with the same results. Besides, your output should also
    > include Yahoo, which it doesn't.


    That's because your DATA lines have been reformatted and split onto
    several lines!

    >
    > I was able to do what I wanted with the following hack. I'm not real
    > happy about it, but it works. Still, I'd rather know how to do it with
    > a RE.


    Not every job should be done with an RE.

    >
    > ---------hack---------------
    > while (<DATA>)
    > {
    > my @line = split /<a/;
    > foreach my $url (@line)
    > {
    > if (/<a.*href="([^"]+)/) { push @urls, $1; }
    > }
    > }


    -------------8<-------------
    #!/usr/bin/perl
    use strict;
    use warnings;
    my @urls;
    while (<DATA>)
    {
    push @urls, /<a href="([^"]+)/g;
    }
    print join(',',@urls), "\n";
    __DATA__
    xxx
    x <a href="g">G</a><a href="y">Y</a> x
    x <a href="a">A</a><a href="e">E</a> x
    xxx
    -------------8<-------------
    g,y,a,e
    RedGrittyBrick, Feb 5, 2010
    #8
  9. ccc31807

    ccc31807 Guest

    On Feb 5, 1:50 pm, RedGrittyBrick <>
    wrote:
    > Thats because your DATA lines have been reformatted and split onto
    > several lines!


    Yeah, I saw that before I posted, which is why I use '\n' to mark the
    ends of the 'real' lines.


    > Not every job should be done with an RE


    No, but in accord with TIMTOWTDI, I wanted to see how it could be done
    with an RE.

    > while (<DATA>)
    > {
    >     push @urls, /<a href="([^"]+)/g;}
    >
    > print join(',',@urls), "\n";


    I'm having fun playing with the suggestions offered, and am actually
    learning in the process. ;-)

    Thanks, CC.
    ccc31807, Feb 5, 2010
    #9
  10. ccc31807

    Guest

    On Fri, 5 Feb 2010 08:17:05 -0800 (PST), ccc31807 <> wrote:

    >This is a newbie question, I admit, but I don't know the answer.
    >
    >Suppose I am parsing a file line by line, and I want to push to an
    >array all substrings on that line that match a pattern. For example,
    >consider the listing below. @urls SHOULD contain this: @urls = (http://
    >google.com, http://yahoo.com, http://amazon.com, http://ebay.com)
    >Instead, it contains only the last value. Using the g modifier doesn't
    >help.
    >
    >I know why @urls contains only the last value, but I don't know how to
    >get all the values.
    >
    >Thanks, CC.
    >
    >-------listing---------------
    >use strict;
    >use warnings;
    >
    >my @urls;
    >while (<DATA>)
    >{
    > if (/<a.*href="([^"]+)/) { push @urls, $1; }
    >}
    >
    >print @urls;
    >exit(0);
    >
    >__DATA__
    ><html>\n
    ><body>\n
    ><h1>My Favorite Sites</h1>\n
    ><p>\n
    >My favorite sites are <a href="http://google.com">Google</a>, <a
    >href="http://yahoo.com">Yahoo</a>, <a href="http://amazon.com">Amazon</
    >a>, and <a href="http://ebay.com">Ebay</a>.\n
    ></p>\n
    ></body>\n
    ></html>\n


    If you want to parse with a little bit more conformity,
    something like this (albeit deficient) might work better
    when you come across possible gotchas.

    -sln

    use strict;
    use warnings;

    my @urls;
    {
    local $/;
    @urls = <DATA> =~
    /<a\s+[^>]*?(?<=\s)href\s*=\s*["'](.+?)["'][^>]*?\s*\/?>/sg;

    # Or, if you want to be more precise and don't mind the quotes:
    #/<a\s+[^>]*?(?<=\s)href\s*=\s*(".+?"|'.+?')[^>]*?\s*\/?>/sg
    }

    print $_,"\n" for @urls;
    exit(0);

    __DATA__
    <html>\n
    <body>\n
    <h1>My Favorite Sites</h1>\n
    <p>\n
    My favorite sites are <a asdfhref=http://google.com" href='http://gg.com' >Google</a>, <a
    href="http://yahoo.com">Yahoo</a>, <a href="http://amazon.com">Amazon</
    a>, and <a href="http://ebay.com">Ebay</a>.\n
    </p>\n
    </body>\n
    </html>\n
    ---------
    http://gg.com
    http://yahoo.com
    http://amazon.com
    http://ebay.com
    , Feb 6, 2010
    #10
  11. On 2010-02-11 02:03, David Harmon <> wrote:
    > On Fri, 5 Feb 2010 09:12:54 -0800 (PST) in comp.lang.perl.misc, ccc31807
    ><> wrote,
    >>On Feb 5, 11:58 am, Willem <> wrote:
    >>>   while (<DATA>)
    >>>   {
    >>>     push @urls, /<a.*?href="(.*?)"/gi;
    >>>   }

    >>
    >>Yes, yes, yes, you are entirely right. I thought that the non-greedy
    >>modifier might do the trick, but

    >
    > Instead of .*? I think [^>]*? would be more accurate.


    Nope. ">" is allowed in a double-quoted parameter value.

    hp
    Peter J. Holzer, Feb 11, 2010
    #11
  12. ccc31807

    Guest

    On Thu, 11 Feb 2010 13:21:41 +0100, "Peter J. Holzer" <> wrote:

    >On 2010-02-11 02:03, David Harmon <> wrote:
    >> On Fri, 5 Feb 2010 09:12:54 -0800 (PST) in comp.lang.perl.misc, ccc31807
    >><> wrote,
    >>>On Feb 5, 11:58 am, Willem <> wrote:
    >>>>   while (<DATA>)
    >>>>   {
    >>>>     push @urls, /<a.*?href="(.*?)"/gi;
    >>>>   }
    >>>
    >>>Yes, yes, yes, you are entirely right. I thought that the non-greedy
    >>>modifier might do the trick, but

    >>
    >> Instead of .*? I think [^>]*? would be more accurate.

    >
    >Nope. ">" is allowed in a double-quoted parameter value.
    >
    > hp


    In single quotes as well.
    Yes, > is allowed in a double/single quote attval.
    It's also allowed in content surrounded by quotes.

    So, CC's regex will match: <a/>href=" > "
    Clearly, a guard must be in place to thwart this.
    [^>]*? is a good candidate but where do you put it?

    CC's regex will also match: <aBBB Zhref="some stuff"
    So, it's not really a good regex for this.

    However, you can use [^>]*? to flesh out the tag-att/val form.
    There are 5 or 6 sub-pattern forms in an expression.
    At least 1 complete form for tag-att/val's is needed.

    A complete sub-pattern (form), that will parse any tag-att/val
    markup is this:
    <(?:($Name)(\s+(?:(?:".*?")|(?:'.*?')|(?:[^>]*?))+)\s*(\/?))>

    Where, tag and ".*?" and '.*?' and [^>]*? consume all valid text between <>.
    Easier said than done. After this a further parsing is necessary on the
    capture groups to separate data and detect errors.

    The form above can be combined with the secondary parsing when there
    is specific information available. Like CC's <a href= .. data.
    Still, a complete form is needed.

    As a side note, xml is stricter than html when it comes to quoting
    values in att/val pairs. Html is not so strict and allows for unquoted
    vals and standalone unquoted attributes as well.
    The form above accommodates both, strictures can be enforced later
    and the bottom line is the *form* integrity is maintained in the stream
    and does not overflow into invalid territory.

    So, CC's regex could be made into combined modified form,
    though still inadequate because it is a standalone form where
    other forms are missing that could negate the results.

    Yes you were right about the ">", but without [^>]*? in a couple
    of places, it won't work:

    /<a\s+[^>]*?(?<=\s)href\s*=\s*["'](.+?)["'][^>]*?\s*\/?>/
    or
    /<a\s+[^>]*?(?<=\s)href\s*=\s*(".+?"|'.+?')[^>]*?\s*\/?>/ # quotes captured

    -sln
    , Feb 11, 2010
    #12
  13. ccc31807

    Guest

    On Feb 5, 9:17 am, ccc31807 <> wrote:
    > Suppose I am parsing a file line by line, and I want to push to an
    > array all substrings on that line that match a pattern. For example,
    > consider the listing below. @urls SHOULD contain this: @urls = (http://
    > google.com,http://yahoo.com,http://amazon.com,http://ebay.com)
    > Instead, it contains only the last value. Using the g modifier doesn't
    > help.


    (My apologies if someone has already answered to your
    satisfaction.)

    Try using the /g modifier, changing "if" to "while", and changing
    "<a.*href" to just "href" (since "a" and "href" are not guaranteed to
    occur together on the same line). So your script would look like:

    -------listing---------------
    use strict;
    use warnings;
    my @urls;
    while (<DATA>)
    {
    while (/href="([^"]+)/g) { push @urls, $1; }
    }

    print join "\n", @urls;
    __DATA__
    <html>\n
    <body>\n
    <h1>My Favorite Sites</h1>\n
    <p>\n
    My favorite sites are <a href="http://google.com">Google</a>, <a
    href="http://yahoo.com">Yahoo</a>, <a href="http://amazon.com">Amazon</
    a>, and <a href="http://ebay.com">Ebay</a>.\n
    </p>\n
    </body>\n
    </html>\n
    -------end of listing---------------

    (I added a call to join() in the print() statement to make the
    output a little easier to read.)

    Running this modified program, I get as output:

    http://google.com
    http://yahoo.com
    http://amazon.com
    http://ebay.com

    This is what you want, right?

    (And consider using the /i modifier, as HTML tags are not required
    to be lower-case.)
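A brief sketch of the same while loop with /i added, assuming upper-case markup can occur in the input. The two sample lines are invented for illustration and mimic a link whose tag is wrapped across lines.

```perl
use strict;
use warnings;

# Hypothetical wrapped lines: the second <a and its href are on different lines.
my @lines = (
    '<A HREF="http://google.com">Google</A>, <a',
    'href="http://yahoo.com">Yahoo</a>',
);

my @urls;
for my $line (@lines) {
    # In scalar context /g resumes after the previous match on each pass;
    # /i also accepts upper-case HREF.
    while ($line =~ /href="([^"]+)/gi) {
        push @urls, $1;
    }
}

print join("\n", @urls), "\n";
```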

    Hope this helps,

    -- Jean-Luc
    , Feb 11, 2010
    #13
  14. ccc31807

    Guest

    On Thu, 11 Feb 2010 09:07:44 -0800 (PST), "" <> wrote:

    >On Feb 5, 9:17 am, ccc31807 <> wrote:
    >> Suppose I am parsing a file line by line, and I want to push to an
    >> array all substrings on that line that match a pattern. For example,
    >> consider the listing below. @urls SHOULD contain this: @urls = (http://
    >> google.com,http://yahoo.com,http://amazon.com,http://ebay.com)
    >> Instead, it contains only the last value. Using the g modifier doesn't
    >> help.

    >
    > (My apologies if someone has already answered to your
    >satisfaction.)
    >
    > Try using the /g modifier, changing "if" to "while", and changing
    >"<a.*href" to just "href" (since "a" and "href" are not guaranteed to
    >occur together on the same line). So your script would look like:
    >


    Other non-guarantees:
    - "href" and "=" could have a span of lines between them
    - attribute value may be single quoted
    - attribute value may not be quoted at all
    - a quoted value may span several lines

    The list is too long to write.

    -sln
    , Feb 11, 2010
    #14
  15. On 2010-02-11 16:34, <> wrote:
    > On Thu, 11 Feb 2010 13:21:41 +0100, "Peter J. Holzer" <> wrote:
    >
    >>On 2010-02-11 02:03, David Harmon <> wrote:
    >>> On Fri, 5 Feb 2010 09:12:54 -0800 (PST) in comp.lang.perl.misc, ccc31807
    >>><> wrote,
    >>>>On Feb 5, 11:58 am, Willem <> wrote:
    >>>>>   while (<DATA>)
    >>>>>   {
    >>>>>     push @urls, /<a.*?href="(.*?)"/gi;
    >>>>>   }
    >>>>
    >>>>Yes, yes, yes, you are entirely right. I thought that the non-greedy
    >>>>modifier might do the trick, but
    >>>
    >>> Instead of .*? I think [^>]*? would be more accurate.

    >>
    >>Nope. ">" is allowed in a double-quoted parameter value.

    >
    > In single quotes as well.
    > Yes, > is allowed in a double/single quote attval.
    > Its also allowed in content surrounded by quotes.


    I'm not sure what you mean by "content surrounded by quotes". It is
    allowed in #PCDATA, quotes have nothing to do with it.


    > So, CC's regex will match: <a/>href=" > "
    > Clearly, a guard must be in place to thwart this.


    I'm not worried much about matching some invalid HTML. In some (but not
    all!) situations you just know that the input is valid.
    I'm much more worried that it doesn't match some valid HTML like

    * Single quotes:
    <a href='http://example.com'>
    * Multiple lines:
    <a
    href="http://example.com">
    * extra whitespace:
    <a href = "http://example.com">

    All of these occur in valid, real-life HTML code.

    But again, if you know the input (e.g., all the HTML files were produced
    by the same program or written by the same person) that may not be an
    issue.

    > [^>]*? is a good candidate but where do you put it?


    Don't worry about this. If you want a robust HTML parser, use one of the
    modules already available. Writing the 756th HTML parser may be fun but
    it isn't productive.
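As one example of the module route, here is a minimal sketch using HTML::LinkExtor from the HTML-Parser distribution on CPAN (it is not a core module, so it has to be installed). The input is hypothetical and simply covers the three cases listed above.

```perl
use strict;
use warnings;
use HTML::LinkExtor;   # from the HTML-Parser distribution on CPAN (not core)

# Hypothetical input: single quotes, a tag split over lines, extra whitespace.
my $html = <<'HTML';
<a href='http://example.com/single'>one</a>
<a
href="http://example.com/multiline">two</a>
<a href = "http://example.com/spaced">three</a>
HTML

my $parser = HTML::LinkExtor->new;   # no callback: links are collected internally
$parser->parse($html);
$parser->eof;

my @urls;
for my $link ($parser->links) {
    my ($tag, %attr) = @$link;       # each entry is [ tag, attr => url, ... ]
    push @urls, $attr{href} if $tag eq 'a' && defined $attr{href};
}

print "$_\n" for @urls;
```

The parser deals with quoting, whitespace and line breaks itself, which is the point of not hand-rolling the pattern.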

    hp
    Peter J. Holzer, Feb 12, 2010
    #15
  16. On 2010-02-13 21:25, David Harmon <> wrote:
    > On Thu, 11 Feb 2010 13:21:41 +0100 in comp.lang.perl.misc, "Peter J.
    > Holzer" <> wrote,
    >>On 2010-02-11 02:03, David Harmon <> wrote:
    >>> On Fri, 5 Feb 2010 09:12:54 -0800 (PST) in comp.lang.perl.misc, ccc31807
    >>><> wrote,
    >>>>On Feb 5, 11:58 am, Willem <> wrote:
    >>>>>   while (<DATA>)
    >>>>>   {
    >>>>>     push @urls, /<a.*?href="(.*?)"/gi;
    >>>>>   }
    >>>>
    >>>>Yes, yes, yes, you are entirely right. I thought that the non-greedy
    >>>>modifier might do the trick, but
    >>>
    >>> Instead of .*? I think [^>]*? would be more accurate.

    >>
    >>Nope. ">" is allowed in a double-quoted parameter value.

    >
    > OK, thanks for that. But then, "href=" is not in all "<a" tags, i.e.
    > the ones that specify "name=" instead. So the "href=" matched above
    > might not even be part of a tag.


    Oh, you meant the first .*?. I thought you meant the second one.
    Yes, the first one is a drastic oversimplification. But changing it to
    [^>]*? makes it only marginally better.
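A short sketch of why the change helps only marginally, using two made-up inputs: one where .*? matches an href= that is not part of the tag, and one where [^>]*? misses a valid tag because of a '>' inside a quoted value.

```perl
use strict;
use warnings;

# Hypothetical inputs: an anchor without href followed by plain text that
# mentions href=, and a valid tag with '>' inside a quoted attribute value.
my $named  = '<a name="top">Top</a> write href="x" here';
my $quoted = '<a title="a > b" href="http://example.com">link</a>';

my %result;
for my $input ($named, $quoted) {
    my ($dot)  = $input =~ /<a.*?href="(.*?)"/i;     # first .*? left as is
    my ($nogt) = $input =~ /<a[^>]*?href="(.*?)"/i;  # first .*? replaced
    $result{$input} = [ $dot, $nogt ];
    printf "dot: %-20s no-gt: %-20s\n",
        $dot // '(no match)', $nogt // '(no match)';
}
```

The .*? version picks up "x" from the plain text in the first input; the [^>]*? version avoids that but then finds nothing in the second input, which is valid HTML.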

    hp
    Peter J. Holzer, Feb 13, 2010
    #16
