matching multiple occurrences in the same line

michal.shmueli · Apr 27, 2005

Hi,
I have a problem with pattern matching:
I have one very long line, and I'm looking for all occurrences of this
string: <td class="year" rowspan="2">
in the line. After each occurrence of this string there is a
number which I need to parse and print. For example, I need to extract
345 from this:
<td class="year" rowspan="2">345

I wrote the following:

while(<FILE>){
chomp($_);
if (~ m/<td class="year" rowspan="2">(\d+).+</) {print OUT "\t$1";}
}

but it only gives me the first occurrence of the pattern.
What's wrong with this?

thanks a lot for your help

Michal

Gunnar Hjalmarsson · Apr 27, 2005

I have one very long line, and I'm looking for all occurrences of this
string: <td class="year" rowspan="2">
in the line. After each occurrence of this string there is a
number which I need to parse and print. For example, I need to extract
345 from this:
<td class="year" rowspan="2">345

I wrote the following:

while(<FILE>){
chomp($_);

Why do you chomp()?

if (~ m/<td class="year" rowspan="2">(\d+).+</) {print OUT "\t$1";}

----^
What's that?

Use while instead of if, and add the /g modifier. Furthermore, the

.+<

part is not only redundant: since quantifiers are greedy by default,
.+ swallows the rest of the line up to the last '<', which prevents
you from finding more than one occurrence.
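
A minimal sketch of that advice (while instead of if, the /g modifier appended, and no trailing .+<). The sample line and the variable names are made up for illustration and are not the OP's data:

```perl
use strict;
use warnings;

# Hypothetical sample line with two occurrences of the fixed string.
my $line = '<td class="year" rowspan="2">345</td><td>x</td>'
         . '<td class="year" rowspan="2">2004</td>';

my @found;
# In scalar context, /g resumes after the previous match on each pass,
# so the loop body runs once per occurrence.
while ($line =~ m/<td class="year" rowspan="2">(\d+)/g) {
    push @found, $1;
}

print join("\t", @found), "\n";
```

With this sample line the loop collects 345 and 2004. In the OP's script the match would run against $_ inside while(<FILE>), and the print would go to OUT.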

Gunnar Hjalmarsson · Apr 27, 2005

JayEs said:
Since that looks a lot like HTML, why not use HTML::TokeParser and save
yourself from the regex hassles?

The OP is looking for *all* occurrences of that fixed string. The fact
that it's HTML does not make the OP's problem an HTML parsing problem
that 'requires' a parsing module. It can easily be handled using a
regex, even if the string in question starts with '<' and ends with '>'. ;-)

michal.shmueli · Apr 27, 2005

JayEs said:
Since that looks a lot like HTML, why not use HTML::TokeParser and save
yourself from the regex hassles?

JS

i've tried the following code but it's not working...

use HTML::TokeParser;

$file="res.html"
$p = HTML::TokeParser->new($file);
if ($p->get_tag("td")) {
my $td = $p->get_trimmed_text;
print "Td: $td\n";
}

Am i missing something?

thanks again

michal.shmueli · Apr 27, 2005

yap.. sorry. i've changed a bit and it's working properly...

thanks

michal.shmueli · Apr 27, 2005

Actually, I don't want to use the HTML parser. It's OK, but I need to
parse more patterns which are not part of the table. So anyway, I tried
the following, as you suggested:
while(<FILE>){
while(~ gm/<td class="year" rowspan="2">(\d+)./) {print OUT "\t$1";}

Now I get some compilation errors.
Part of the original line is: <td class="year" rowspan="2">2004</td><td
class="veh" rowspan="2"><a

many thanks

JayEs · Apr 27, 2005

The OP is looking for *all* occurrences of that fixed string. The fact
that it's HTML does not make the OP's problem an HTML parsing problem

<SNIP>

Entirely correct! I simply offered another solution for the same problem.
Tim Toady? ;-)
The fact that the OP is looking for a value (ALL of them) that is prefixed
with the same HTML tag, makes TokeParser a good alternative IMHO. Later the
OP states that he can't use TokeParser because he needs to do more string
matching on non-HTML, but I didn't have that info at the time...

Anyway, both suggestions work on the original problem

JS

Tad McClellan · Apr 27, 2005

So anyway, I tried
the following, as you suggested:
while(<FILE>){
while(~ gm/<td class="year" rowspan="2">(\d+)./) {print OUT "\t$1";}

      ^
      ^

That ~ is the bitwise negation operator.

Why are you using that there?

What is the point of the final dot in your pattern?

Now I get some compilation errors.

Well yes, because that is not the change that was suggested.

It was suggested to add a "g" option to the pattern match operator.

See perlop.pod for how to add pattern match options.
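
To illustrate the distinction, here is a small sketch contrasting unary ~ (bitwise negation) with =~ (the binding operator), with the g option placed after the closing delimiter. The values and variable names are assumptions chosen for the example:

```perl
use strict;
use warnings;

# Unary ~ flips every bit of its operand. Masked to the low byte,
# ~5 (binary ...11111010) is 250.
my $low_byte = ~5 & 0xFF;

# =~ binds a string to the match operator. Options such as g go
# after the closing delimiter, not before the m.
my $line  = '<td class="year" rowspan="2">2004</td>';
my @years = $line =~ m/<td class="year" rowspan="2">(\d+)/g;

print "$low_byte @years\n";
```

When the string to match is $_, the binding can be left out entirely and a bare m/.../g is enough.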

Gunnar Hjalmarsson · Apr 27, 2005

Actually, I don't want to use the HTML parser. It's OK, but I need to
parse more patterns which are not part of the table.

Not sure I follow you. The more complex the task is, the more likely a
parsing module is suitable.

So anyway, I tried the following, as you suggested:
while(<FILE>){
while(~ gm/<td class="year" rowspan="2">(\d+)./) {print OUT "\t$1";}

------^-^------------------------------------^

That's not what I suggested.
- The '~' character is still there. (I suppose you don't know what
it's supposed to do.)
- Modifiers shall be appended, not prepended, to the regex.
- The dot is still redundant.
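
The trailing dot can be worse than redundant when a number is the last thing on the line. A short sketch, assuming a made-up line that ends right after the digits:

```perl
use strict;
use warnings;

# Hypothetical line whose last number sits at the very end.
my $line = '<td class="year" rowspan="2">2004</td>'
         . '<td class="year" rowspan="2">1999';

# In list context, /g returns the captures of every match at once.
my @with_dot    = $line =~ m/<td class="year" rowspan="2">(\d+)./g;
my @without_dot = $line =~ m/<td class="year" rowspan="2">(\d+)/g;

print "with dot:    @with_dot\n";
print "without dot: @without_dot\n";
```

With the dot, the second capture comes out as 199: there is nothing left for the dot to match after 1999, so \d+ backtracks and gives up its last digit. Without the dot both years are captured in full.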

For a regex to be a suitable alternative to a module (in certain cases),
you need to know how regexes work. It's obvious that you need to read up
on it:

perldoc perlrequick
perldoc perlretut
perldoc perlre

Good luck!

michal.shmueli · Apr 28, 2005

Thanks for all the help.
It seems to work fine. What I need is to search for 3 different patterns
which appear in this line many times; they always appear in this
order. Moreover, the second one (listing_id) may appear twice.

So this is the code I wrote:

while(<FILE>){
while((m/<td class="year" rowspan="2">(\d+)/g) ||
((m/listing_id=(\d+)/g) ||
((m/<td class="price">(\S+)/g) ||) {print OUT "\t$1";}

but it seems to be an infinite loop

any ideas?

Michal

Gunnar Hjalmarsson · Apr 28, 2005

What I need is to search for 3 different patterns
which appear in this line many times; they always appear in this
order. Moreover, the second one (listing_id) may appear twice.

So this is the code I wrote:

while(<FILE>){
while((m/<td class="year" rowspan="2">(\d+)/g) ||
((m/listing_id=(\d+)/g) ||
((m/<td class="price">(\S+)/g) ||) {print OUT "\t$1";}

but it seems to be an infinite loop

No it's not, since it doesn't even compile. Please copy and paste code
that you post; don't re-type it!

any ideas?

Your approach seems odd to me, and I prefer not to comment on it. This
is an alternative approach:

my $s1 = '<td\s+class="year"\s+rowspan="2">(\d+)';
my $s2 = 'listing_id=(\d+)';
my $s3 = '<td\s+class="price">(\d+(?:\.\d+)?)';

my $pattern = qr($s1|$s2|$s3);

my $data = do { local $/; <FILE> };

print "\t$+" while $data =~ /$pattern/g;

It has a few advantages compared to what you were trying to do, but
there are most certainly details, that only you know about, requiring
further tweaking. For instance, the pattern for price:

\d+(?:\.\d+)?

may or may not be correct in this case. Maybe you'd use Regexp::Common's
method for matching numbers instead. In any case, I doubt that just \S+
will give you what you want.

Please use the docs if there are details in the above suggestion that
you don't understand.
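
The suggestion above can be tried without a file. This is only a sketch: the $data string and the @values array are made up for illustration, while the three sub-patterns are the ones from the post:

```perl
use strict;
use warnings;

my $s1 = '<td\s+class="year"\s+rowspan="2">(\d+)';
my $s2 = 'listing_id=(\d+)';
my $s3 = '<td\s+class="price">(\d+(?:\.\d+)?)';

my $pattern = qr($s1|$s2|$s3);

# Hypothetical stand-in for the slurped file contents.
my $data = '<td class="year" rowspan="2">2004</td>'
         . '<a href="view?listing_id=77">x</a>'
         . '<td class="price">1250.50</td>';

my @values;
# Only one alternative matches per iteration, so only one capture
# group is set; $+ holds that group's text whichever one it was.
push @values, $+ while $data =~ /$pattern/g;

print join("\t", @values), "\n";
```

For this sample the collected values are 2004, 77 and 1250.50. With \S+ in place of the price pattern, the capture on data like this would run on through </td>, since there is no whitespace to stop it.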
