Finding if something is in a list

ccc31807

Actually the data is more like IN2 and what I want is everyone who
read book 4.

Replace the IN2 loop with something like this:

my %people;
while (<IN2>)
{
    chomp;
    my ($name, $date, $books) = split /,/;
    my @books = split(/;/, $books);
    foreach my $ele (@books) { push @{$people{$ele}}, $_; }
}

####print "People who read book 4:\n@{$people{4}}\n";

foreach my $k (keys %people)
{
    print "$k => @{$people{$k}}\n";
}

This prints, for each book, the data on everyone who read it. To get
the data for one specific book, uncomment the #### line.

What this does is create a hash indexed to your book number, with the
value of the hash an anonymous array to which you push the line (as a
scalar) containing the data from your IN2 file. You get the data out
by using the hash key of the book you want, dereferencing the value as
an anonymous array. It may look hairy, but the third time you do it
you won't even think about it.
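
For anyone who wants to try the idea without the input file, here is a minimal self-contained sketch. The three records and their "name,date,books" layout are made up for illustration; only the hash-of-arrays technique comes from the code above.

```perl
use strict;
use warnings;

# Hypothetical records in an assumed "name,date,books" layout,
# with the book numbers separated by semicolons.
my @lines = (
    'alice,2010-11-01,1;4;7',
    'bob,2010-11-02,2;3',
    'carol,2010-11-03,4',
);

my %people;    # book number => reference to an array of matching lines
for my $line (@lines) {
    my ($name, $date, $books) = split /,/, $line;
    push @{ $people{$_} }, $line for split /;/, $books;
}

# exists avoids creating an empty entry for a book nobody read
my @readers = exists $people{4} ? @{ $people{4} } : ();
print "$_\n" for @readers;
```

The exists test is optional, but it keeps a lookup for an unread book from adding a key to the hash.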

CC.
 
sln

I (almost) wish that the syntax for grouping without backrefs
were at least as terse as the syntax for grouping with backrefs.
Having to add extra punctuation to indicate *not* doing something
just seems counterintuitive.

The (?:) syntax was a later addition to the language, when () was
already well established, so that wasn't really an option. (?:) is
also slightly harder to read for people familiar with regexp syntaxes
other than Perl's (and for those of us who first learned Perl before
it had (?:)). I'm not saying that's an excuse for creating backrefs
unnecessarily, but there is some pressure to use () because it works.

I think that (?: ) was a logical step in the process, given that
( ) doesn't make sense when combined with quantifiers.
In that case it really doesn't work, and is basically useless for capture,
in an expression such as (\s([\w]+\s*)*\s)+.
And it's very hard to read.
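
A small sketch of the quantifier point, using a simpler pattern than the one above (the sample string is an assumption): a capture group inside a repeated group keeps only its last iteration, so the capture is of little use there.

```perl
use strict;
use warnings;

my $text = 'red green blue';    # sample input, chosen for illustration

# The capture group repeats three times, but only the last word survives.
my ($last) = $text =~ /^(?:(\w+)\s*)+$/;

# A global match in list context collects every word instead.
my @words = $text =~ /(\w+)/g;

print "last: $last\n";
print "all:  @words\n";
```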

In that sense, (?: ) is moderately easier to discern than a capture
grouping (as a bonus you get extras: (?imsx-imsx: ) ), but imho all groupings,
especially nested ones, are hard to read.
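
To illustrate the extras: the modifiers inside (?i: ) apply only to that group. A minimal sketch, with made-up test strings:

```perl
use strict;
use warnings;

# Only "perl" is matched case-insensitively; " Rules" must match exactly.
my $re = qr/(?i:perl) Rules/;

my $hit  = 'PERL Rules' =~ $re ? 1 : 0;
my $miss = 'PERL rules' =~ $re ? 1 : 0;

print "hit=$hit miss=$miss\n";
```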

When modifying or reading a regex's groupings, it's sometimes more important
to me to separate out the capture ones, as changing them shifts the output.
Most unique syntax is taken already.
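
Here is a sketch of what I mean by the output shifting (the date string is just a sample): adding a capturing group renumbers everything after it, while a non-capturing one leaves the numbering alone.

```perl
use strict;
use warnings;

my $date = '2010-11-23';    # sample input

# Two capture groups: the day is the second one.
my ($y1, $d1) = $date =~ /^(\d{4})-\d\d-(\d\d)$/;

# Wrapping the month in ( ) shifts the day to the third group.
my @shifted = $date =~ /^(\d{4})-(\d\d)-(\d\d)$/;

# Wrapping it in (?: ) leaves the day as the second group.
my ($y2, $d2) = $date =~ /^(\d{4})-(?:\d\d)-(\d\d)$/;

print "day: $d1, second group after the change: $shifted[1], day again: $d2\n";
```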

In need of a tool, I tried to pick out the starts of the capture groups
separately from the non-capture ones. I didn't even attempt the closing
parentheses, although if the start can be determined, I'd imagine the ends
can too; I'm not sure.

-sln

-------------------
use strict;
use warnings;
require 5.010_000;

##
my $rxgroup = qr/
([[:cntrl:]] | $) # Formatting control character
| # or, the rest ..
(?:
(?<!\\) # Not an escape behind us
(?:\\.)* # 0 or more "escape + any char"
(?:
# Exclude character class'
\[
\]?
(?: \\.| \[:[a-z]*:\] | [^\]\n] )*
(\n?)
(?: \\.| \[:[a-z]*:\] | [^\]] )*
\]
|
(?# Exclude extended comments )
\(\?(\#) [^)]* \)
|
# Exclude free comments
(\#) (?:[^\n])*
|
# Start of a capture group
\( # (
(?:
(?!\?) # unnamed: not a ? in front of us
| # or (Perl 5.10 and above)
# named: a ?<name> or ?'name' is ok
(?= \?[<'][^\W\d][\w]*['>] )
)
)
)
/x;

my $testrx = qr/
\(\$th(\\(?:.) [(]
(?# Extended lines
of comment
)
\\\\(.\)\\\(.)(i(s))\t(i(s)) ] )
/x;

##
# Sample object
print FindRXCaptureGroups(
qr/ \(\$th(\\(?:.) [(] \\\\(.\)\\\(.)(i(s))\t(i(s)) ] )/x ), "\n";

# Sample reference
print FindRXCaptureGroups( \$testrx ), "\n";

# Show groups for that which finds the groups
print FindRXCaptureGroups( \$rxgroup ),"\n";

exit(0);

##
sub FindRXCaptureGroups
{
@_ > 0 || die "Expected a parameter";
my $sample;

if ( ref( $_[0]) eq 'SCALAR' ) { $sample = $_[0] }
elsif (ref(\$_[0]) eq 'SCALAR' ) { $sample = \$_[0] }
elsif (ref( $_[0]) eq 'Regexp' ) { $sample = \$_[0] }
elsif (ref( $_[0]) eq 'REF' &&
ref(${$_[0]}) eq 'Regexp') { $sample = $_[0] }
else {
die "Not a string, Regexp object, or reference to one";
}
my ($All,
$grpstring,
$group,
$lastpos ) = ('', '', 1, 0);

while ($$sample =~ /$rxgroup/g )
{
if (defined $1) {
my $cntrlen = length $1;
my $cntrlcode = $cntrlen ? $1 : "\n";

$All .= substr( $$sample, $lastpos, ($+[0]-$lastpos-$cntrlen) ) . $cntrlcode;
$grpstring .= '-' x ($+[0]-$lastpos-$cntrlen) . $cntrlcode;
$lastpos = $+[0];
if ($cntrlcode eq "\n") {
$All .= $grpstring if ($grpstring =~ /\d/);
$grpstring = '';
}
next;
}
if (defined $2) {
my ($cntrlcode, $match0, $match2) = ($2, $+[0], $+[2]);

if (length( $2 ) && $grpstring =~ /\d/) {
$All .= substr( $$sample, $lastpos, ($match2-$lastpos) );
$grpstring .= '-' x ($match2-$lastpos-1) . $cntrlcode;
$lastpos = $match2;
$All .= $grpstring;
$grpstring = '';
}
$All .= substr( $$sample, $lastpos, ($match0-$lastpos) );
$grpstring .= '-' x ($match0-$lastpos);
$lastpos = $match0;
next;
}
if (defined $3 || defined $4) {
$All .= substr( $$sample, $lastpos, ($+[0]-$lastpos) );
$grpstring .= '-' x ($+[0]-$lastpos);
$lastpos = $+[0];
next;
}

$All .= substr( $$sample, $lastpos, ($+[0]-$lastpos) );
$grpstring .= '-' x ($+[0]-$lastpos-1) . $group++ % 10;
$lastpos = $+[0];
}
return $All;
}
__END__


(?x-ism: \(\$th(\\(?:.) [(] \\\\(.\)\\\(.)(i(s))\t(i(s)) ] ))
---------------1----------------2---------3-4-----5-6--------

(?x-ism:
\(\$th(\\(?:.) [(]
----------1-----------
(?# Extended lines
of comment
)
\\\\(.\)\\\(.)(i(s))\t(i(s)) ] )
--------2---------3-4-----5-6-------
)

(?x-ism:
([[:cntrl:]] | $) # Formatting control character
-----1------------------------------------------------
| # or, the rest ..
(?:
(?<!\\) # Not an escape behind us
(?:\\.)* # 0 or more "escape + any char"
(?:
# Exclude character class'
\[
\]?
(?: \\.| \[:[a-z]*:\] | [^\]\n] )*
(\n?)
-----------------2----
(?: \\.| \[:[a-z]*:\] | [^\]] )*
\]
|
(?# Exclude extended comments )
\(\?(\#) [^)]* \)
-------------------3------------
|
# Exclude free comments
(\#) (?:[^\n])*
--------------4--------------
|
# Start of a capture group
\( # (
(?:
(?!\?) # unnamed: not a ? in front of us
| # or (Perl 5.10 and above)
# named: a ?<name> or ?'name' is ok
(?= \?[<'][^\W\d][\w]*['>] )
)
)
)
)
 
sln

On Tue, 23 Nov 2010 13:52:05 -0800, (e-mail address removed) wrote:
[snip preamble]
|
# Exclude free comments
(\#) (?:[^\n])*
^^
In the interest of readability, this grouping can be removed.
(\#) [^\n]*

[snip lines and lines of stuff]

-sln
 
Peter Scott

I (almost) wish that the syntax for grouping without backrefs were at
least as terse as the syntax for grouping with backrefs. Having to add
extra punctuation to indicate *not* doing something just seems
counterintuitive.

Understand that the basic syntax for regular expressions was created
first by mathematicians before computers existed, then transliterated for
computers before Perl existed, and so the Perl developers were starting
from a syntax that had already been developed according to certain
assumptions, and this constrained their choices as they extended the
syntax.

If you want to see what pattern matching syntax can look like when all
the legacy is thrown out or up for debate, see rules in Perl 6.
 
John W. Krahn

Uri said:
DS> Can't change the string - it is coming from another application and I
DS> can't change the data format.

that makes no sense. you CAN always change it for internal use like
searching. are you looking into this string many times? if so, splitting
the values out to a hash and searching that will be much faster and
simpler. no need for much other than split and a hash lookup:

my %is_book_num = map { $_ => 1 } split /;/, $string ;

that will create a leading empty field which shouldn't matter in your
lookups. if you are worried about it, then either grep that out or use a
different way to grab them (\d+ comes to mind) in a regex:

my %is_book_num = map { $_ => 1 } $string =~ /(\d+)/ ;

That only captures the first \d+ value, which is OK if that is all that
is required. To capture all \d+ values:

my %is_book_num = map { $_ => 1 } $string =~ /\d+/g;
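
A quick way to see the difference, assuming a string with a leading semicolon (the exact format is a guess based on the remark about the leading empty field):

```perl
use strict;
use warnings;

my $string = ';1;4;12';    # assumed format, for illustration only

my @one   = $string =~ /(\d+)/;     # no /g: only the first number
my @all   = $string =~ /\d+/g;      # /g in list context: every number
my @split = split /;/, $string;     # includes the leading empty field

my %is_book_num = map { $_ => 1 } @all;

print "first only: @one\n";
print "all:        @all\n";
print "split gave ", scalar(@split), " fields\n";
print "book 4 is in the list\n" if $is_book_num{4};
```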



John
 
Dave Saville

What this does is create a hash indexed to your book number, with the
value of the hash an anonymous array to which you push the line (as a
scalar) containing the data from your IN2 file. You get the data out
by using the hash key of the book you want, dereferencing the value as
an anonymous array. It may look hairy, but the third time you do it
you won't even think about it.

Well what I think about is all the work the box is doing, making a
hash and then processing it, keeping the whole file in memory -
assuming it fits, and thus possibly causing a paging problem, when all
it needs is a loop reading one record at a time and either passing it
over or processing it based on a test on what in reality is a pretty
short list of possible numbers. I appreciate the clever idea - and can
use it for a similar, but more complicated, requirement.

I know what you are going to say - It doesn't matter - But I am afraid
I come from *way* back in programming when one had to worry about such
things and wrote tight code. Usually in assembler. :)
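
Roughly what I have in mind, as a sketch only: the sample records and the "name,date,books" layout are assumptions, and an in-memory filehandle stands in for the real file so that it runs on its own.

```perl
use strict;
use warnings;

# Hypothetical sample data; the in-memory filehandle stands in for IN2.
my $data = "alice,2010-11-01,1;4;7\nbob,2010-11-02,2;3\ncarol,2010-11-03,4\n";
open my $in, '<', \$data or die "Cannot open in-memory file: $!";

my $wanted = 4;    # the book we are interested in
my @matches;       # names kept only so the result can be checked

while (my $line = <$in>) {
    chomp $line;
    my ($name, $date, $books) = split /,/, $line;

    # Test this record's short book list and pass it over if there is no match.
    next unless grep { $_ == $wanted } split /;/, $books;

    push @matches, $name;
    print "$line\n";
}
close $in;
```

Only one record is examined at a time; nothing is kept beyond the names collected for the check.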
 
ccc31807

Well what I think about is all the work the box is doing, making a
hash and then processing it, keeping the whole file in memory -
assuming it fits, and thus possibly causing a paging problem, when all
it needs is a loop reading one record at a time and either passing it
over or processing it based on a test on what in reality is a pretty
short list of possible numbers.

I agree that you use the simplest, easiest tool that gets the job
done. If your job requires producing a report from the analysis of two
(or more) input files, you will arrange your work in (at least) three
steps: (1) reading in the data to specific data structures, (2)
processing the data contained in the data structures, and (3) reading
out the processed data to an output file.

One of the great strengths of Perl is that it gives you sophisticated
structures, arrays of hashes, hashes of hashes, hashes of arrays,
etc., nested to arbitrary depths, and doesn't require you to travel
around your elbow to get to your nose.

CC.
 
