Using split to count matches, but exclude certain patterns

surfitupdotcom

I have a script that recursively greps for a term and counts its
occurrences in each file. It works fine, but now I want to exclude
matches where the term has an underscore immediately before or after
it. I have tried to keep using split on (not underscore)
$search_term(not underscore), as in the examples below, but my results
are not right yet. The input is a string in $grep_out, and I want to
count any number of occurrences. I cannot break the string up into
words, since a correct match may not have spaces or any particular
character around it. Let me know if I have not provided enough info or
should post the whole script. Thanks in advance for any assistance, John

Attempts so far:
# @surewords = split(/(?<!\_)${search_term}(?!\_)/im, $grep_out);
# @surewords = split(/\_{0}${search_term}\_{0}/im, $grep_out);
@surewords = split(/[^\_]${search_term}[^\_]/im, $grep_out);
# @surewords = split(/(^|[^\_])${search_term}($|[^\_])/im, $grep_out);
 
Paul Lalli

Attempts so far:
# @surewords = split(/(?<!\_)${search_term}(?!\_)/im, $grep_out);

_ is not special. No need to backslash it. This code says to split
on any $search_term that is not *immediately* preceded by or
*immediately* followed by an underscore. Is that what you meant?
# @surewords = split(/\_{0}${search_term}\_{0}/im, $grep_out);

A quantifier of {0} is a no-op. Frankly, I think that should be a
syntax error, or at least a warning.
@surewords = split(/[^\_]${search_term}[^\_]/im, $grep_out);

This says to include the not-underscore character in the split
delimiter.
# @surewords = split(/(^|[^\_])${search_term}($|[^\_])/im, $grep_out);

That's a modification of the above, allowing $search_term to come at
the beginning or end of the string as well.
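
A minimal sketch of why the character-class attempts miss hits while the lookaround attempt does not. It counts with a list-context match rather than split, and the sample string is an assumption chosen for illustration, not data from the original script.

```perl
use strict;
use warnings;

my $term = 'super';
my $text = 'super super super';   # hypothetical sample input

# Each [^_] must match one real character and consumes it, so the term at
# the very start or end of the string cannot match, and a consumed
# neighbour is no longer available to the next hit.
my $with_class = () = $text =~ /[^_]\Q$term\E[^_]/gi;

# Lookarounds only assert "no underscore here" and consume nothing.
my $with_lookaround = () = $text =~ /(?<!_)\Q$term\E(?!_)/gi;

print "character class: $with_class\n";        # prints 1
print "lookaround:      $with_lookaround\n";   # prints 3
```

The character class finds only the middle hit, because the first and last occurrences have no neighbouring character on one side.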


Please provide some sample input and sample output, so people have a
chance to know what it is you're trying to achieve. This and other
good advice can be found in the Posting Guidelines, which are posted
here twice a week.

Paul Lalli
 
Paul Lalli

I have script that recursively greps for a term and counts the
occurrences of it in each file. It works fine but now I want to
exclude matches where the term has an underscore in front or after
it. I have tried to continue using split

As a side note to my other response, split() is a very bad way to
attempt to count occurrences of a string:

$ perl -le'
print scalar(@foo = split /foo/, "barfoobazfoobiff");
print scalar(@foo = split /foo/, "barfoobazbifffoo");
print scalar(@foo = split /foo/, "barbazbifffoofoo");
'
3
2
1


I rather strongly suggest you read:
$ perldoc -q count
Found in /opt2/Perl5_8_4/lib/perl5/5.8.4/pod/perlfaq4.pod
How can I count the number of occurrences of a substring
within a string?

Paul Lalli
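
The split counts above differ because split drops trailing empty fields by default. As a rough sketch of the list-context match idiom that the FAQ entry describes, compared with split given a limit of -1 (count_matches is a helper name introduced here for illustration):

```perl
use strict;
use warnings;

# Count non-overlapping matches of $pattern in $text. Assigning the /g
# match to an empty list forces list context, and the scalar assignment
# on the left then receives the number of matches.
sub count_matches {
    my ($text, $pattern) = @_;
    my $count = () = $text =~ /$pattern/g;
    return $count;
}

for my $text (qw(barfoobazfoobiff barfoobazbifffoo barbazbifffoofoo)) {
    # A limit of -1 stops split from dropping trailing empty fields.
    my @fields = split /foo/, $text, -1;
    printf "%s: match count %d, split fields minus one %d\n",
        $text, count_matches($text, qr/foo/), @fields - 1;
}
```

All three strings report 2 by either method, but the match form says what it means and does not depend on remembering the limit argument.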
 
surfitupdotcom

You read me correctly, idea was to split on any occurrence of my
search term that does not have an underscore before or after it.
Counting matches using split worked fine until I tried to exclude
certain patterns. I will look at the perldoc you suggested but here
is more info for the thread. Thanks, John

Sample input: super _super_ _super super SUPER SUPER_ blahsuper
Desired output: super super SUPER super

Current output using split(/(?<!_)${search_term}(?!_)/i, $grep_out);
Array contents- _super_ _super SUPER_ blah
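
One way to get the desired output directly, sketched with the lookaround pattern from the first attempt. split returns the text between the matches, which is why the array above holds the leftovers. A /g match in list context returns the matched text itself. The variable names follow the thread, and @hits is a name introduced for this example.

```perl
use strict;
use warnings;

my $search_term = 'super';
my $grep_out    = 'super _super_ _super super SUPER SUPER_ blahsuper';

# With no capture groups, a /g match in list context returns every
# matched substring. \Q...\E quotes any regex metacharacters in the term.
my @hits = $grep_out =~ /(?<!_)\Q$search_term\E(?!_)/gi;

print "@hits\n";              # prints: super super SUPER super
print scalar(@hits), "\n";    # prints: 4
```

The last hit comes from "blahsuper": the term there has no underscore on either side, so it counts.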
 
Tad McClellan

I have script that recursively greps

Attempts so far:
# @surewords = split(/(?<!\_)${search_term}(?!\_)/im, $grep_out);
# @surewords = split(/\_{0}${search_term}\_{0}/im, $grep_out);
@surewords = split(/[^\_]${search_term}[^\_]/im, $grep_out);
# @surewords = split(/(^|[^\_])${search_term}($|[^\_])/im, $grep_out);


There is no recursion anywhere in that code.

Perhaps you meant "repeatedly" instead?
 
surfitupdotcom

The recursion is elsewhere in the script. By the time it gets to this
split each line of $grep_out has one or more hits of the search term.
 
surfitupdotcom

On Aug 1, 1:02 pm, (e-mail address removed) wrote:

(snipped)


You read me correctly, idea was to split on any occurrence of my
search term that does not have an underscore before or after it.
Counting matches using split worked fine until I tried to exclude
certain patterns. I will look at the perldoc you suggested but here
is more info for the thread. Thanks, John
Sample input: super _super_ _super super SUPER SUPER_ blahsuper
Desired output: super super SUPER super

How did you plan on getting rid of the 'blah' substring by
doing a split?


Current output using split(/(?<!_)${search_term}(?!_)/i, $grep_out);
Array contents- _super_ _super SUPER_ blah

Your description said 'an underscore before ... OR
an underscore after'; so you also need an "OR" in your
regular expression. This is known as "alternation"
(see perldoc perlre).

use Data::Dumper;

my $term = 'super';

my $string = 'super _super_ _super super SUPER SUPER_ blahsuper';

my @fragments = split(
    /_\Q$term\E_?      # exclude term with underscore in front
                       # (optional trailing _)
    |                  # OR
    _?\Q$term\E_/xi    # exclude term with underscore afterward
                       # (optional leading _)
    , $string);

print Dumper \@fragments;

__END__

I get:

$VAR1 = [
'super ',
' ',
' super SUPER ',
' blahsuper'
];

Is that what you wanted? As Paul said, there's
probably a better way to "count" things than
using split.


Thanks all for the assist. After further experimentation I did switch
to using an option other than split for this task. I sharpened my
regexp along the way, so everything worked out. Take care, John
 
