Help with pattern matching

ExecMan

Hi,

I have a file containing URLs, and I am trying to scan a log and get
a count of the matching string. But I think, because the input
contains slashes, I am not getting a match. Any help please? I'm
pretty new to this:

#!/usr/bin/perl
open (FILE,"monday.csv") or die $!;
while(<FILE>) {
    chomp($_);
    ($tag, $url) = split(',', $_);
    $url_tags{$tag} = $url;
    $url_counts{$tag} = 0;
}
close(FILE);

open (FILE,"<","/home/httpdlogs/apache2/access_log") or die "Can't open apache log!";
foreach $tag (keys(%url_tags)) {
    $url = $url_tags{$tag};
    $count = grep { /$url/ } <FILE>;
    $url_counts{$tag} = $count;
}
close(FILE);

The $url contains slashes, how can I get around this??
 
Wolf Behrenhoff

Am 11.04.2012 04:15, schrieb ExecMan:
Hi,

I have a file containing URLs, and I am trying to scan a log and get
a count of the matching string. But I think, because the input
contains slashes, I am not getting a match. Any help please? I'm
pretty new to this:

#!/usr/bin/perl
open (FILE,"monday.csv") or die $!;

Prefer using the open with three arguments (as done for access_log).
Also, the FILE handle is global -> better use a "normal" variable here
(my $FILE).
while(<FILE>) {
chomp($_);
($tag, $url) = split(',', $_);
$url_tags{$tag} = $url;
$url_counts{$tag} = 0;
}
close(FILE);

open (FILE,"<","/home/httpdlogs/apache2/access_log") or die "Can't open apache log!";
foreach $tag (keys(%url_tags)) {
$url = $url_tags{$tag};
$count = grep { /$url/ } <FILE>;

Two problems in this line:
a) problem with the regular expression. You probably don't want to match
a regexp here but to test whether $url is contained in the line? Then use
/\Q$url/ - read what \Q does in perldoc perlre.

b) grep reads the whole file. After reading the whole file, any attempt
to read more reads nothing because you are already at the end of the
file. So the foreach loop can only return results the first time the
loop is started. Two solutions:
b1) read the whole file into an array @access_log before the foreach
loop. In the loop, use @access_log instead of <FILE>.
b2) read the file line by line and execute the foreach loop for every
line. Then of course you need to add the $count to %url_counts.
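A minimal sketch of option b2, with the tag table and a couple of log
lines invented inline so it runs on its own (real code would fill
%url_tags from monday.csv and read the actual access_log):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented stand-ins for monday.csv and the access log, so the sketch
# is self-contained; real code would open the actual files instead.
my %url_tags = ( home => '/index.html', faq => '/faq/list' );
my $log = <<'LOG';
1.2.3.4 - - "GET /index.html HTTP/1.1" 200
1.2.3.4 - - "GET /faq/list HTTP/1.1" 200
1.2.3.4 - - "GET /index.html HTTP/1.1" 200
LOG

my %url_counts = map { $_ => 0 } keys %url_tags;

open my $fh, '<', \$log or die $!;
while (my $line = <$fh>) {           # one pass over the log
    for my $tag (keys %url_tags) {
        # \Q makes slashes and other metacharacters match literally
        $url_counts{$tag}++ if $line =~ /\Q$url_tags{$tag}/;
    }
}
close $fh;

print "$_: $url_counts{$_}\n" for sort keys %url_counts;
```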

- Wolf
 
Justin C

[snip]
$count = grep { /$url/ } <FILE>;

The $url contains slashes, how can I get around this??

Don't use / as a regex delimiter. See perlretut and search for
'delimiters'.

Justin.
 
Wolf Behrenhoff

Am 11.04.2012 11:51, schrieb Justin C:
[snip]
$count = grep { /$url/ } <FILE>;

The $url contains slashes, how can I get around this??

Don't use / as a regex delimiter. See perlretut and search for
'delimiters'.

What is wrong with / as delimiter?

- Wolf
 
ExecMan

Quoth ExecMan <[email protected]>:

[snip]
The slashes are not the problem. Perl isn't like shell, which expands
variables before doing word splitting: perl finds the end of /$url/ at
compile time, before it knows whether $url will contain slashes or not.

You have two problems here. The first and most obvious is that <FILE>,
in list context, reads the file to the end, breaks it into lines, and
*leaves the file pointer at the end of the file*. That means that next
time round the loop, the file pointer is already at the end, and <FILE>
returns the empty list.

There are several ways to fix this. The simplest is to read the file
once into an array, and run the grep over the array instead:

    open (FILE, "<", "/...") or die ...;
    @log = <FILE>;
    close FILE;

    foreach $tag (keys(%url_tags)) {
        $url = $url_tags{$tag};
        $count = grep { /$url/ } @log;
        ...
    }

(I'm deliberately omitting several important corrections to that code
I'll mention below, so you can see which changes are relevant here.)

The second problem is that while perl looks for the closing slash before
interpolating $url, it looks for regex metacharacters afterwards. This
means that if one of your URLs contains, say, '+', perl will interpret
it as a 'match at least once' pattern character. To fix this you need
the \Q escape, which says 'quote everything from here until \E':

    $count = grep { /\Q$url/ } @log;

Some more general remarks:

You don't appear to be using 'warnings' or 'strict'. Until you know
enough to know better, start *every* Perl program with

    use warnings;
    use strict;

This will then start yelling at you about 'Global symbol ... requires
explicit package name': this means you need to go through and declare
all your variables with 'my'. The point of this is that it makes it much
less likely that you'll reuse a variable by mistake, or that you'll
misspell a variable name and get a completely new variable with no
warning.

You should also be keeping your filehandles in 'my' variables, for the
same reason. As your program gets longer, it becomes increasingly likely
you will use FILE for something else somewhere else, and you'll get a
mess. 'my' variables aren't visible outside the block they're declared
in, so that can't happen.

If that file 'monday.csv' is actually CSV, generated from some other
program, you can't safely parse it like that. CSV has (rather
ill-defined) quoting rules, to allow entries to contain ',', and a lot
of programs randomly quote CSV when they didn't really need to. Reading
it is a lot harder than it seems, and you should use a module from CPAN,
such as Text::CSV.
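A sketch of reading the tag/URL table with Text::CSV, using a lexical
filehandle and three-argument open as suggested above. The input is an
in-memory string here so the example is self-contained; real code would
open 'monday.csv' instead. The quoted second field is exactly the case
split(',') would get wrong:

```perl
use strict;
use warnings;
use Text::CSV;   # CPAN module

# In-memory input standing in for monday.csv; note the quoted field
# containing a comma, which a naive split(',') would break apart.
my $data = qq{home,/index.html\nfaq,"/faq,old/list"\n};
open my $fh, '<', \$data or die $!;

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

my (%url_tags, %url_counts);
while (my $row = $csv->getline($fh)) {
    my ($tag, $url) = @$row;
    $url_tags{$tag}   = $url;
    $url_counts{$tag} = 0;
}
close $fh;

print "$_ => $url_tags{$_}\n" for sort keys %url_tags;
```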

Ben


Hi,

I got around what I thought was a slash issue like this. Not sure if
it is the fastest thing:

foreach $tag (keys(%url_tags)) {
    open (FILE,"/home/httpdlogs/apache2/access_log") or die "Can't open apache log!";
    $url = $url_tags{$tag};
    $url =~ s/([\\\/\^\$\*\+\?\=\@\{\}\[\]\(\)\<\>])/\\$&/g;
    $count = grep { /$url/ } <FILE>;
    $url_counts{$tag} = $count;
    close(FILE);
}

Also, about reading the file into an array. Problem is the file could
be a couple of million lines long. Isn't that a lot to be reading
into an array? If not, and it would be faster, then maybe I'll change
the code.
 
Justin C

Am 11.04.2012 11:51, schrieb Justin C:
[snip]
$count = grep { /$url/ } <FILE>;

The $url contains slashes, how can I get around this??

Don't use / as a regex delimiter. See perlretut and search for
'delimiters'.

What is wrong with / as delimiter?

Apparently nothing, I was misunderstanding where his problem was.

Justin.
 
Wolf Behrenhoff

Am 11.04.2012 15:31, schrieb ExecMan:
Hi,

I got around what I thought was a slash issue like this. Not sure if
it is the fastest thing:

foreach $tag (keys(%url_tags)) {
open (FILE,"/home/httpdlogs/apache2/access_log") or die "Can't open apache log!";
$url = $url_tags{$tag};
$url =~ s/([\\\/\^\$\*\+\?\=\@\{\}\[\]\(\)\<\>])/\\$&/g;

What are you doing?!!! Way too many slashes to be able to read this.
I guess you are trying to achieve what just a \Q would do.

Did you even read Ben's and/or my answer?
$count = grep { /$url/ } <FILE>;
$url_counts{$tag} = $count;
close(FILE);
}

Also, about reading the file into an array. Problem is the file could
be a couple of million lines long. Isn't that a lot to be reading
into an array? If not, and it would be faster, then maybe I'll change
the code.

Now you are reading the file multiple times. Do you really think that is
better?

If the log file is really too large (probably it isn't) then read it
line by line as suggested in my previous posting in b2).

- Wolf
 
Rainer Weikusat

Wolf Behrenhoff said:
Am 11.04.2012 11:51, schrieb Justin C:
[snip]
$count = grep { /$url/ } <FILE>;

The $url contains slashes, how can I get around this??

Don't use / as a regex delimiter. See perlretut and search for
'delimiters'.

What is wrong with / as delimiter?

Nothing. There's 'something wrong'/an inherent limitation with the
concept of 'a delimiter character', namely, that occurrences of this
character inside a pattern need to be escaped. As an alternative,
Perl supports using arbitrary delimiter characters, so that a character
which doesn't appear inside the pattern can be used if such a
character exists. As hinted at by the second part of this sentence,
this doesn't really solve the 'problem' because it is conceivable that
no suitable character can be found. Further drawbacks are that it adds
significant 'optical noise' to the text and that it is no longer
compatible with the regular expression syntax used by other UNIX(*)
tools. Because of this, I have so far just continued to use the
/-separator + /-escaping syntax also supported by, say, sed.
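For what it's worth, the trade-off looks like this in practice; the
path and log line below are invented for illustration:

```perl
use strict;
use warnings;

# Invented path and log line, purely for illustration
my $path = '/cgi-bin/app';
my $line = 'GET /cgi-bin/app?x=1 HTTP/1.1';

# / as delimiter: literal slashes must be backslash-escaped
my $m1 = $line =~ /\/cgi-bin\/app/;

# Arbitrary delimiter via m{...}: no escaping needed
my $m2 = $line =~ m{/cgi-bin/app};

# Interpolated variable: \Q quotes any metacharacters it contains
my $m3 = $line =~ m{\Q$path};

print "all three match\n" if $m1 && $m2 && $m3;
```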
 
ExecMan

Am 11.04.2012 15:31, schrieb ExecMan:
I got around what I thought was a slash issue like this.  Not sure if
it is the fastest thing:
foreach $tag (keys(%url_tags)) {
  open (FILE,"/home/httpdlogs/apache2/access_log") or die "Can't open apache log!";
  $url = $url_tags{$tag};
  $url =~ s/([\\\/\^\$\*\+\?\=\@\{\}\[\]\(\)\<\>])/\\$&/g;

What are you doing?!!! Way too many slashes to be able to read this.
I guess you are trying to achieve what just a \Q would do.

Did you even read Ben's and/or my answer?
  $count = grep { /$url/ } <FILE>;
  $url_counts{$tag} = $count;
  close(FILE);
}
Also, about reading the file into an array.  Problem is the file could
be a couple of million lines long.  Isn't that a lot to be reading
into an array?  If not, and it would be faster, then maybe I'll change
the code.

Now you are reading the file multiple times. Do you really think that is
better?

If the log file is really too large (probably it isn't) then read it
line by line as suggested in my previous posting in b2).

- Wolf


Ok, your solution seems to work. Nice:


open (FILE,"<","/home/httpdlogs/apache2/access_log") or die "Can't open log!";
@log = <FILE>;
close (FILE);

foreach $tag (keys(%url_tags)) {
    $url = $url_tags{$tag};
    $count = grep { /\Q$url/ } @log;
    $url_counts{$tag} = $count;
}

I'm just worried about a 4 million line file going into an array. As
long as it does not take up too many resources. If the file is say,
300MB, that is a lot to put into an array......
 
Rainer Weikusat

[...]
Also, about reading the file into an array. Problem is the file could
be a couple of million lines long. Isn't that a lot to be reading
into an array? If not, and it would be faster, then maybe I'll change
the code.

Whether it is 'a lot' depends on how much memory you want to dedicate
to this task and on whether you're reasonably sure that the size of your
input will never exceed that. The latter is especially problematic
because an out-of-memory failure caused by 'a large input' is a very
bad point at which to have to rewrite the code that is supposed to
process that input. There's also the question of whether this task is
so much more important than all the other tasks running on the same
computer that you're willing to maximize its resource usage in order to
minimize the wallclock time it needs to complete.

Except in cases where it is known that the input file will always be
'rather small', eg, if it is a configuration file, the safe,
conservative choice is to process it line-by-line and assume that the
buffering layer between the Perl code and the system I/O facilities
will employ a 'sensible' buffering strategy.
 
Martijn Lievaart

Except in cases where it is known that the input file will always be
'rather small', eg, if it is a configuration file, the safe,
conservative choice is to process it line-by-line and assume that the
buffering layer between the Perl code and the system I/O facilities will
employ a 'sensible' buffering strategy.

This is such important advice. Absolutely spot on. Well worded too.
Should go in a FAQ somewhere as this subject actually comes up fairly
often.

M4
 
Rainer Weikusat

Ben Morrow said:
Quoth Rainer Weikusat <[email protected]>:
[...]

Further drawbacks are that it adds significant 'optical noise' to the
text and that it is not compatible with the regular expression syntax
used by other UNIX(*) tools anymore.

Perl's regexes are not compatible with those of other Unix tools in any
case.

Contrived counter-example:

[rw@sapphire]~ $echo 'a/b' | sed 's/\//\/\//'
a//b
[rw@sapphire]~ $echo 'a/b' | perl -pe 's/\//\/\//'
a//b

This is a feature: egrep-style regexen rapidly become unreadable,
especially when used from a language which makes you quote them
again.

That's your opinion, not mine. My opinion is that code which uses
non-uniform ad hoc syntax is much harder to read than code which
consistently uses one syntax.
 
Tim McDaniel

Ben Morrow said:
Quoth Rainer Weikusat <[email protected]>:
[...]

Further drawbacks are that it adds significant 'optical noise' to
the text and that it is not compatible with the regular expression
syntax used by other UNIX(*) tools anymore.

I believe that there has never been "the" regular expression syntax in
UNIX tools: I believe that most tools chose their own implementations.
Heck, even egrep wasn't compatible with grep.
Perl's regexes are not compatible with those of other Unix tools in any
case.

Contrived counter-example:

[rw@sapphire]~ $echo 'a/b' | sed 's/\//\/\//'
a//b
[rw@sapphire]~ $echo 'a/b' | perl -pe 's/\//\/\//'
a//b

An experienced Perl person should know what he meant: *in general*,
Perl's regexps are not identical to those of other tools (except
those that use Perl or libraries designed to be Perl-compatible).

$ echo 'a(b)' | sed -e 's/(b)/{B}/'
a{B}
$ echo 'a(b)' | perl -pe 's/(b)/{B}/'
a({B})

Or any other case where Perl's metacharacters differ from sed, ed,
grep, egrep, or whatnot.
 
Rainer Weikusat

[...]

Perl's regexes are not compatible with those of other Unix tools in any
case.

Contrived counter-example:

[rw@sapphire]~ $echo 'a/b' | sed 's/\//\/\//'
a//b
[rw@sapphire]~ $echo 'a/b' | perl -pe 's/\//\/\//'
a//b

An experienced Perl person should know what he meant: *in general*,

Yes. And as 'experienced Perl persons' both of you should know that I
know that the Perl regular expression (sub-)language is not identical
to the regular expression language used by sed (or anything else), and
that this is beside the point, since the topic of conversation was
'delimiter characters for regular expressions': somebody who uses
anything except Perl has to deal with the // convention anyway;
consequently, keeping it for Perl doesn't make things worse than they
already are.
 
Mladen Gogala

Prefer using the open with three arguments (as done for access_log).
Also, the FILE handle is global -> better use a "normal" variable here

That really messes up working with formats. Formats are a very useful
piece of Perl and I use them a lot.
 
Mladen Gogala

Consider switching to Perl6::Form instead. It's a great deal saner.

(Don't be put off by the 'Perl6' prefix. It's a perfectly ordinary Perl
5 module, that just happens to have been written as a demonstration of
the Perl 6 way of doing formats.)

Thanks a lot! Looks much simpler than associating formats through handles
using fdopen(fileno(...). As a matter of fact, it looks very, very
interesting. I will try it over the weekend. That will probably remove
some of the complaints by perlcritic. That is a really useful nagging
piece of software.
BTW, I am a DBA, not a programmer. I find Perl ideal for writing quick
and pretty reports that have to be run from crontab. Don't judge me too
harshly.
 
J. Gleixner

On 04/11/12 09:38, ExecMan wrote:
[...]
I'm just worried about a 4 million line file going into an array. As
long as it does not take up too many resources. If the file is say,
300MB, that is a lot to put into an array......

There are many ways to do what you are asking, however think about
what is it -exactly- that you're trying to count? Is the 'url' to match
against the referrer? Is it hits to certain pages? Something else?

If what you're after doesn't occur in most of the lines, you might
be able to greatly reduce the number of lines you want to look at
by narrowing down your universe to only lines that might contain what
you're after.

e.g.
open( FILE, "/bin/egrep 'abc123|thispage|zzzyyyxxx' /home/httpdlogs/apache2/access_log |" );

or..

/bin/egrep 'abc123|thispage|zzzyyyxxx' /home/httpdlogs/apache2/access_log | myprogram.pl

where myprogram.pl reads from STDIN.

That approach can be used to not include lines too. e.g. those
with '.gif' or '.jpg', or any other string, using '-v'.

If you're after the referrer, or certain strings in the URL part of the
line, then possibly parse every line and store the specific field
you're after into a hash, with the count as the value. Once
they are all gathered, go through that hash looking for those that
contain your 'urls'. That way you parse the file once, gathering only
the relevant data, then go through that as many times as you need.
You'll avoid having to run multiple regular expressions on every
single line in the log.
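A sketch of that parse-once approach, with invented log lines inlined
so it runs as-is; the request path is taken as the seventh
whitespace-separated field, which assumes Common Log Format:

```perl
use strict;
use warnings;

# Invented sample of Common Log Format lines, inlined so this runs as-is
my $log = <<'LOG';
1.2.3.4 - - [11/Apr/2012:04:15:00 +0000] "GET /index.html HTTP/1.1" 200 512
5.6.7.8 - - [11/Apr/2012:04:15:01 +0000] "GET /faq HTTP/1.1" 200 128
1.2.3.4 - - [11/Apr/2012:04:15:02 +0000] "GET /index.html HTTP/1.1" 200 512
LOG

# One pass: count every requested path
my %path_counts;
open my $fh, '<', \$log or die $!;
while (my $line = <$fh>) {
    my $path = (split ' ', $line)[6];    # request path in CLF
    $path_counts{$path}++ if defined $path;
}
close $fh;

# Then one cheap hash lookup per URL, instead of a regex pass per URL
for my $url ('/index.html', '/faq') {
    printf "%s: %d\n", $url, $path_counts{$url} || 0;
}
```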

Possibly you could use Apache::ParseLog to do all of
the work and then simply grep/filter the output as needed.

See also:
perldoc -q 'How do I efficiently match many regular expressions at once'
 
ExecMan

On 04/11/12 09:38, ExecMan wrote:
[...]


I'm just worried about a 4 million line file going into an array.  As
long as it does not take up too many resources.  If the file is say,
300MB, that is a lot to put into an array......

There are many ways to do what you are asking, however think about
what is it -exactly- that you're trying to count?  Is the 'url' to match
against the referrer?  Is it hits to certain pages?  Something else?

If what you're after doesn't occur in most of the lines, you might
be able to greatly reduce the number of lines you want to look at
by narrowing down your universe to only lines that might contain what
you're after.

e.g.
open( FILE, "/bin/egrep 'abc123|thispage|zzzyyyxxx' /home/httpdlogs/apache2/access_log |" );

or..

/bin/egrep 'abc123|thispage|zzzyyyxxx' /home/httpdlogs/apache2/access_log | myprogram.pl

where myprogram.pl reads from STDIN.

That approach can be used to not include lines too. e.g.  those
with '.gif' or '.jpg', or any other string, using '-v'.

If you're after referrer, or certain strings in the URL part of the
line, then possibly parse every line and store the specific field
you're after, into a hash, with the count as the value. Once
they are all gathered, go through that hash looking for those that
contain your 'urls'. That way you parse the file once, gathering only
the relevant data, then go through that as many times as you need.
You'll avoid having to run multiple regular expressions on every
single line in the log.

Possibly you could use Apache::ParseLog to do all of
the work and then simply grep/filter the output as needed.

See also:
perldoc -q 'How do I efficiently match many regular expressions at once'

I love this. From a simple programming question I am accumulating all
this wisdom in Perl.

Another thing I was wondering about is why to use the 'strict'
method? Advantages? Disadvantages?
 
Tim McDaniel

Another thing I was wondering about is why to use the 'strict'
method? Advantages? Disadvantages?

You mean

use strict;

? The technical term is "Perl pragma". The purpose is described at
the top of the man page as "strict - Perl pragma to restrict unsafe
constructs".

Since it restricts unsafe constructs, it is deemed a very very good
idea indeed. If it flags an error, it is much more likely (though not
guaranteed) that you are doing something wrong than that you are doing
something reasonable and need to turn off "use strict".

The usual idiom is

use strict;
use warnings;

(or vice versa). At my workplace the coding style means they often add
no warnings 'uninitialized';
but I prefer to just not code that way, and I suspect that that's the
most common warning to turn off (though I could easily be wrong).
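As a tiny, contrived illustration: under strict, an undeclared or
misspelled variable is a compile-time error instead of a silently
created new one:

```perl
use strict;
use warnings;

my $count = 0;
$count += $_ for 1 .. 3;

# Without strict, a typo such as $cuont += $_ would quietly create a
# brand-new variable and $count would stay 0; with strict the program
# dies at compile time with:
#   Global symbol "$cuont" requires explicit package name
print "count=$count\n";
```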
 
