matching multiple occurrences in the same line

michal.shmueli

Hi,
I have a problem with pattern matching:
I have one very long line, and I'm looking for all occurrences of this
string: <td class="year" rowspan="2">
in the line. After each occurrence of this string there is a
number which I need to parse and print. For example, I need to extract
345 from this:
<td class="year" rowspan="2">345

I wrote the following:

while(<FILE>){
chomp($_);
if (~ m/<td class="year" rowspan="2">(\d+).+</) {print OUT "\t$1";}
}

but it just gives me the first occurrence of the pattern.
What's wrong with this?

thanks a lot for your help

Michal
 
Gunnar Hjalmarsson

I have one very long line, and I'm looking for all occurrences of this
string: <td class="year" rowspan="2">
in the line. After each occurrence of this string there is a
number which I need to parse and print. For example, I need to extract
345 from this:
<td class="year" rowspan="2">345

I wrote the following:

while(<FILE>){
chomp($_);

Why do you chomp()?
if (~ m/<td class="year" rowspan="2">(\d+).+</) {print OUT "\t$1";}
----^
What's that?

Use while instead of if, and add the /g modifier. Furthermore, the

.+<

part is not only redundant: since quantifiers are greedy by default, it
also swallows everything up to the last '<' on the line, which prevents
you from finding more than one occurrence.
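
A minimal sketch of that advice, assuming the line is held in a variable. The name $line and the sample markup are made up for illustration; they are not the OP's data. It runs the original pattern and the trimmed one through the same while/g loop so the effect of the greedy .+< is visible.

```perl
use strict;
use warnings;

# Hypothetical sample line; the real input comes from the OP's file.
my $line = '<td class="year" rowspan="2">2004</td><td class="veh">x</td>'
         . '<td class="year" rowspan="2">2005</td><td class="veh">y</td>';

# Original pattern: the greedy .+ runs to the end of the line and backtracks
# only as far as the last '<', so nothing is left for a second match.
my @greedy;
while ($line =~ /<td class="year" rowspan="2">(\d+).+</g) {
    push @greedy, $1;
}

# Suggested pattern: stop right after the digits and let /g find the rest.
my @years;
while ($line =~ /<td class="year" rowspan="2">(\d+)/g) {
    push @years, $1;
}

print "greedy: @greedy\n";
print "fixed:  @years\n";
```

With this sample, the first loop collects only one year and the second collects both.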
 
Gunnar Hjalmarsson

JayEs said:
Since that looks a lot like HTML, why not use HTML::TokeParser and save
yourself from the regex hassles?

The OP is looking for *all* occurrences of that fixed string. The fact
that it's HTML does not make the OP's problem an HTML parsing problem
that 'requires' a parsing module. It can easily be handled using a
regex, even if the string in question starts with '<' and ends with '>'. ;-)
 
michal.shmueli

JayEs said:
Since that looks a lot like HTML, why not use HTML::TokeParser and save
yourself from the regex hassles?

JS

I've tried the following code, but it's not working...

use HTML::TokeParser;

$file="res.html"
$p = HTML::TokeParser->new($file);
if ($p->get_tag("td")) {
my $td = $p->get_trimmed_text;
print "Td: $td\n";
}

Am I missing something?

Thanks again
 
michal.shmueli

Actually, I don't want to use the HTML parser. It's OK, but I need to
parse more patterns which are not part of the table. So anyway, I tried
the following, as you suggested:
while(<FILE>){
while(~ gm/<td class="year" rowspan="2">(\d+)./) {print OUT "\t$1";}

Now I get some compilation errors.
Part of the original line is: <td class="year" rowspan="2">2004</td><td
class="veh" rowspan="2"><a

many thanks
 
JayEs

The OP is looking for *all* occurrences of that fixed string. The fact
that it's HTML does not make the OP's problem an HTML parsing problem

<SNIP>

Entirely correct! I simply offered another solution for the same problem.
Tim Toady? ;-)
The fact that the OP is looking for a value (ALL of them) that is prefixed
with the same HTML tag, makes TokeParser a good alternative IMHO. Later the
OP states that he can't use TokeParser because he needs to do more string
matching on non-HTML, but I didn't have that info at the time...

Anyway, both suggestions work on the original problem :)

JS
 
Tad McClellan

So anyway, I tried
the following, as you suggested:
while(<FILE>){
while(~ gm/<td class="year" rowspan="2">(\d+)./) {print OUT "\t$1";}
      ^
      ^

That is the bitwise negation operator.

Why are you using that there?

What is the point of the final dot in your pattern?

Now I get some compilation errors.


Well yes, because that is not the change that was suggested.

It was suggested to add a "g" option to the pattern match operator.

See perlop.pod for how to add pattern match options.
 
Gunnar Hjalmarsson

Actually, I don't want to use the HTML parser. It's OK, but I need to
parse more patterns which are not part of the table.

Not sure I follow you. The more complex the task is, the more likely a
parsing module is suitable.
So anyway, I tried the following, as you suggested:
while(<FILE>){
while(~ gm/<td class="year" rowspan="2">(\d+)./) {print OUT "\t$1";}
------^-^------------------------------------^

That's not what I suggested.
- The '~' character is still there. (I suppose you don't know what
it's supposed to do.)
- Modifiers must be appended, not prepended, to the regex.
- The dot is still redundant.
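
To illustrate the three points above, here is a small sketch. The variable names and the sample markup are assumptions for illustration, not the OP's data. The match is bound with =~ (or left unbound so it applies to $_), the g modifier follows the closing delimiter, and the trailing dot is dropped.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the OP's long line.
my $line = '<td class="year" rowspan="2">2004</td><td class="veh" rowspan="2">'
         . '<td class="year" rowspan="2">1999</td>';

# Modifiers follow the closing delimiter: m/.../g, not gm/.../.
# In list context, /g returns every captured group in one go.
my @years = $line =~ m/<td class="year" rowspan="2">(\d+)/g;
print "\t$_" for @years;
print "\n";

# Without a binding operator, m// matches against $_,
# so inside while (<FILE>) { ... } a plain m/.../g is enough.
my @from_topic;
for ($line) {
    push @from_topic, $1 while m/<td class="year" rowspan="2">(\d+)/g;
}
print "\t$_" for @from_topic;
print "\n";
```

Both forms collect the same values; which one reads better depends on whether the data is already in $_.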

For a regex to be a suitable alternative to a module (in certain cases),
you need to know how regexes work. It's obvious that you need to read up
on it:

perldoc perlrequick
perldoc perlretut
perldoc perlre

Good luck!
 
michal.shmueli

Thanks for all the help.
It seems to work fine. What I need is to search for 3 different patterns
which appear in this line many times; they always appear in this
order. Moreover, the second one (listing_id) may appear twice.

So the code I wrote:

while(<FILE>){
while((m/<td class="year" rowspan="2">(\d+)/g) ||
((m/listing_id=(\d+)/g) ||
((m/<td class="price">(\S+)/g) ||) {print OUT "\t$1";}


but it seems to be an infinite loop.

Any ideas?

Michal
 
Gunnar Hjalmarsson

What I need is to search for 3 different patterns
which appear in this line many times; they always appear in this
order. Moreover, the second one (listing_id) may appear twice.

So the code I wrote:

while(<FILE>){
while((m/<td class="year" rowspan="2">(\d+)/g) ||
((m/listing_id=(\d+)/g) ||
((m/<td class="price">(\S+)/g) ||) {print OUT "\t$1";}


but it seems to be an infinite loop.

No, it's not, since it doesn't even compile. Please copy and paste code
that you post; don't re-type it!
Any ideas?

Your approach seems odd to me, and I prefer not to comment on it. This
is an alternative approach:

my $s1 = '<td\s+class="year"\s+rowspan="2">(\d+)';
my $s2 = 'listing_id=(\d+)';
my $s3 = '<td\s+class="price">(\d+(?:\.\d+)?)';

my $pattern = qr($s1|$s2|$s3);

my $data = do { local $/; <FILE> };

print "\t$+" while $data =~ /$pattern/g;

It has a few advantages compared to what you were trying to do, but
there are most certainly details that only you know about and that
require further tweaking. For instance, the pattern for price:

\d+(?:\.\d+)?

may or may not be correct in this case. Maybe you'd use Regexp::Common's
patterns for matching numbers instead. In any case, I doubt that just \S+
will give you what you want.
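
A quick way to see the difference between the two price patterns, as a sketch only: the cell below is invented, and the real price format (currency symbols, thousands separators) is known only to the OP.

```perl
use strict;
use warnings;

# Hypothetical cell; the actual price format is an assumption.
my $cell = '<td class="price">1250.50</td><td class="veh">';

# \S+ keeps going until the first whitespace character,
# so it runs past the number and into the following markup.
my ($loose)  = $cell =~ /<td class="price">(\S+)/;

# The numeric pattern stops at the end of the digits.
my ($strict) = $cell =~ /<td class="price">(\d+(?:\.\d+)?)/;

print "\\S+ captured:    $loose\n";
print "number captured: $strict\n";
```

If the prices turn out to contain anything other than plain digits and an optional decimal part, the numeric pattern would need adjusting accordingly.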

Please use the docs if there are details in the above suggestion that
you don't understand.
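
For anyone who wants to try the suggestion without the original file, here is a self-contained sketch of the same approach. The contents of $data are invented, and the results are collected in an array (@fields, a name chosen here) instead of being printed directly.

```perl
use strict;
use warnings;

# Hypothetical data; the real input is read from the OP's FILE handle.
my $data = '<td class="year" rowspan="2">2004</td>'
         . '<a href="view?listing_id=111">a</a><a href="view?listing_id=111">b</a>'
         . '<td class="price">9500</td>'
         . '<td class="year" rowspan="2">2001</td>'
         . '<a href="view?listing_id=222">c</a>'
         . '<td class="price">4200.75</td>';

my $s1 = '<td\s+class="year"\s+rowspan="2">(\d+)';
my $s2 = 'listing_id=(\d+)';
my $s3 = '<td\s+class="price">(\d+(?:\.\d+)?)';
my $pattern = qr/$s1|$s2|$s3/;

# Only one alternative matches per pass, so only one group is set;
# $+ holds the text of the highest-numbered group that matched.
my @fields;
push @fields, $+ while $data =~ /$pattern/g;

print join("\t", @fields), "\n";
```

A single /g loop over one combined pattern also avoids running several /g matches against the same string. Those share one pos(), and a failed match resets it unless the /c modifier is given.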
 
