One Off Script

JP Ogden

I am currently working on a Perl script that will let me mine
records out of a very large data set (1 million+ records) where a
specific field is one off (possibly two off) from the same field in
another record. In other words, I would like my script to find
records where "Field A" is 123 in one record and 123B in another. I am
not sure where to start or how to get this done. Any thoughts would be
appreciated.
 
Eric Schwartz

As a short aside, consider that when you looked for an answer on
Google Groups, or whatever your favourite USENET search engine is (you
*did* look, didn't you?), if you had seen a topic called "One Off
Script", you probably wouldn't have thought it relevant, right? If
you pick Subject: lines that have to do with what you're asking about,
it helps others find the answers later on, as well.

Anyway, as per your actual question: until you show us some actual
data (and preferably, any code you've tried writing already), we can't
be of much help. Just to pick one aspect out of the air, we don't
know if your data is in a flat file, or a database, or available from
an HTTP server somehow. We don't know what format it's in. We don't
know what "one off" means in this context-- is it the same value as in
another row, only with one more character appended? Does the appended
character have to be a 'B', or just alphabetical, or could it be
anything? For that matter, do you want a list of all rows that have
another row where "Field A" is one off (whatever that means), or do
you have a 'key row' which denotes the value for rows to be one off
from?

I'm not trying to put you or your question down; I just want to point
out that from all that you've given us so far, the answer could be
anything from 5 to 50 lines of code (or more). We simply don't know
enough to help.

-=Eric
 
Jeff 'japhy' Pinyan

[posted & mailed]

123      5675.68   5/24/03  Misc     Misc2
123A     8756.67   7/3/03   Code     Code2

and

0234     10456.45  6/4/02   Man      Man2
234      456.34    10/5/02  Talk     Talk2

because the records in "Field 1" are what I am calling "one-off."

Do you mean that the values in Field 1 are different in that one of them
merely has an additional character appended or prepended to it? Or are
"abc1def" and "abc2def" also one-off? Or are "abcdef" and "abc1def" also
one-off?

If it's only pre- and appending that matters, you can get it done rather
easily, I'd imagine.
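
For that pre- and appending case only, a pairwise check might look like the sketch below. The name differs_at_one_end is made up for illustration, and the sketch assumes the field values are plain byte strings.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the longer value is the shorter one
# with exactly one character added at the front or at the back, else 0.
sub differs_at_one_end {
    my ($x, $y) = @_;
    ($x, $y) = ($y, $x) if length($x) > length($y);   # make $x the shorter one
    return 0 unless length($y) - length($x) == 1;
    return ( substr($y, 1) eq $x || substr($y, 0, -1) eq $x ) ? 1 : 0;
}

print differs_at_one_end("123", "123A"), "\n";   # appended character
print differs_at_one_end("0234", "234"), "\n";   # prepended character
print differs_at_one_end("123", "12X3"), "\n";   # middle insertion is not covered
```

It treats "123"/"123A" and "0234"/"234" as one-off, but not "123"/"12X3" or "abc1def"/"abc2def".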
 
JP Ogden

The characters in "Field 1" could be the same number -

Like 1234 and 1235

Or 234- and 334-

The characters in "Field 1" could be as Jeff put them -

abc1def and abc2def

Or 045D/123 and 055D/123

The characters in "Field 1" could also be as Tad put them -

123 and 12X3

Or add-34 and add34

The list goes on...


Here is a sample of my data:
Field 1  Field 2   Field 3  Field 4  Field 5
123      5675.68   5/24/03  Misc     Misc2
E4678    345.76    6/23/02  Test     Test2
123A     8756.67   7/3/03   Code     Code2
0234     10456.45  6/4/02   Man      Man2
234      456.34    10/5/02  Talk     Talk2
675-02   1045.45   3/5/03   Level    Level1
etc...

I would like to isolate only the records where the records in "Field
1" are "one-off." The results from the above sample would look like
this:
Field 1  Field 2   Field 3  Field 4  Field 5
123      5675.68   5/24/03  Misc     Misc2
123A     8756.67   7/3/03   Code     Code2

and

0234     10456.45  6/4/02   Man      Man2
234      456.34    10/5/02  Talk     Talk2

because the records in "Field 1" are what I am calling "one-off."


You mean one character shorter, by either taking the first or
the last character off?

Or should 123 and 12X3 be "one off" too?

I'll assume the former.

The characters in "Field 1" can be anything.


I will assume "anything except whitespace" below.

Any help would be greatly appreciated.


--------------------------------------------------------
#!/usr/bin/perl
use strict;
use warnings;

# Index every record by its first whitespace-delimited field.
my %seen;
while ( <DATA> ) {
    $seen{$1} = $_ if /^(\S+)/;
}

# For each key, drop its last or its first character and report the
# pair if the shortened value is also a key.
my %reported;
foreach my $f1 ( sort keys %seen ) {
    foreach my $shorter ( substr($f1, 0, -1), substr($f1, 1) ) {
        if ( $seen{$shorter} and not $reported{ "$seen{$shorter}:$f1" }) {
            $reported{ "$seen{$shorter}:$f1" } = 1;
            print $seen{$shorter}, $seen{$f1}, "\n";
        }
    }
}

__DATA__
123 5675.68 5/24/03 Misc Misc2
E4678 345.76 6/23/02 Test Test2
123A 8756.67 7/3/03 Code Code2
0234 10456.45 6/4/02 Man Man2
234 456.34 10/5/02 Talk Talk2
675-02 1045.45 3/5/03 Level Level1
--------------------------------------------------------




[ snip TOFU.
Please learn the proper way of formatting followups.
]
 
Steven Kuo

Tad McClellan wrote:
JP Ogden said:
[ snipped ]

I would like to isolate only the records where the records in "Field
1" are "one-off." The results from the above sample would look like


[ snipped ]


[ snipped and rearranged ]

The characters in "Field 1" could be the same number -

Like 1234 and 1235

Or 234- and 334-

The characters in "Field 1" could be as Jeff put them -

abc1def and abc2def

Or 045D/123 and 055D/123

The characters in "Field 1" could also be as Tad put them -

123 and 12X3

Or add-34 and add34

The list goes on...



Here's a subroutine to get you started -- I presume you don't
use wide-characters in fields:


if (off_by_one("add-34", "add34")) {
    print "Off by one\n";
}


sub off_by_one {
    # returns undef, 0, or 1
    # the latter indicates that the two arguments are "off-by-1"
    my ($length1,$length2) = map length, my ($field1, $field2) = @_;
    return if (abs($length1 - $length2) > 1);
    if ($length1 == $length2) {
        # Same length: XOR the strings bytewise and count the nonzero bytes.
        # The quotes force a string XOR even if the caller passed numbers.
        my $difference = grep $_ != 0 => unpack 'C*', "$field1" ^ "$field2";
        return ($difference == 1)? 1 : 0;
    } else {
        # Lengths differ by one: find the first position that differs,
        # delete that character from the longer string, then compare.
        my @chars1 = unpack 'C*' => $field1;
        my @chars2 = unpack 'C*' => $field2;
        my $i = 0;
        while (defined $chars1[$i]
               and defined $chars2[$i]
               and $chars1[$i] == $chars2[$i]) { # compare unsigned int
            ++$i;
        }
        substr(($length1 > $length2)? $field1 : $field2, $i, 1, '');
        return ($field1 eq $field2)? 1 : 0;
    }
}
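
Calling off_by_one on every pair of a million records is quadratic, so here is one possible way around that. It is a different technique from the sub above, and only a sketch: index each key under the strings you get by deleting one character. The name one_off_pairs is hypothetical, and I am assuming the "Field 1" values are byte strings without NUL bytes and that the index (roughly one entry per character of every key) fits in memory.

```perl
use strict;
use warnings;

# Return every unordered pair of distinct keys that differ by exactly one
# inserted, deleted, or substituted character.
sub one_off_pairs {
    my %is_key = map { $_ => 1 } @_;
    my (%bucket, %pairs);

    for my $key (keys %is_key) {
        for my $i (0 .. length($key) - 1) {
            my $shorter = $key;
            substr($shorter, $i, 1, '');   # drop the character at position $i

            # Insertion/deletion: the shortened key is itself a key.
            $pairs{"$shorter\0$key"} = [$shorter, $key] if $is_key{$shorter};

            # Substitution: keys that agree everywhere except position $i
            # end up in the same bucket.
            push @{ $bucket{"$i\0$shorter"} }, $key;
        }
    }

    # Keys are unique, so any two members of one bucket differ at
    # exactly one position.
    for my $members (values %bucket) {
        my @m = sort @$members;
        for my $first (0 .. $#m - 1) {
            for my $second ($first + 1 .. $#m) {
                $pairs{"$m[$first]\0$m[$second]"} = [ $m[$first], $m[$second] ];
            }
        }
    }
    return sort { $a->[0] cmp $b->[0] or $a->[1] cmp $b->[1] } values %pairs;
}

for my $pair (one_off_pairs(qw(123 E4678 123A 0234 234 675-02))) {
    print "@$pair\n";
}
```

On the six sample keys this prints "123 123A" and "234 0234". Call it in list context, and look the records up by key afterwards.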

P.S. Please do not top-post.
 
