One Off Script

JP Ogden · Jul 21, 2003

I am currently working on a Perl script that will let me mine
records out of a very large data set (1 million+ records) where a
specific field is one off (possibly two off) from the same field in
another record. In other words, I would like my script to find pairs
of records where "Field A" is, for example, 123 in one and 123B in the
other. I'm not sure where to start or how to get this done. Any
thoughts would be appreciated.

Eric Schwartz · Jul 21, 2003

As a short aside, consider that when you looked for an answer on
Google Groups, or whatever your favourite USENET search engine is (you
*did* look, didn't you?), if you had seen a topic called "One Off
Script", you probably wouldn't have thought it relevant, right? If
you pick Subject: lines that have to do with what you're asking about,
it helps others find the answers later on, as well.

Anyway, as per your actual question: until you show us some actual
data (and preferably, any code you've tried writing already), we can't
be of much help. Just to pick one aspect out of the air, we don't
know if your data is in a flat file, or a database, or available from
an HTTP server somehow. We don't know what format it's in. We don't
know what "one off" means in this context-- is it the same value as in
another row, only with one more character appended? Does the appended
character have to be a 'B', or just alphabetical, or could it be
anything? For that matter, do you want a list of all rows that have
another row where "Field A" is one off (whatever that means), or do
you have a 'key row' which denotes the value for rows to be one off
from?

I'm not trying to put you or your question down; I just want to point
out that from all that you've given us so far, the answer could be
anything from 5 to 50 lines of code (or more). We simply don't know
enough to help.

-=Eric

Jeff 'japhy' Pinyan · Jul 21, 2003

[posted & mailed]

123 5675.68 5/24/03 Misc Misc2
123A 8756.67 7/3/03 Code Code2

and

0234 10456.45 6/4/02 Man Man2
234 456.34 10/5/02 Talk Talk2

because the records in "Field 1" are what I am calling "one-off."

Do you mean that the values in Field 1 are different in that one of them
merely has an additional character appended or prepended to it? Or are
"abc1def" and "abc2def" also one-off? Or are "abcdef" and "abc1def" also
one-off?

If it's only pre- and appending that matters, you can get it done rather
easily, I'd imagine.
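
For the prepend/append-only case, a minimal sketch of the pairwise test could look like the following. The helper name affix_one_off is made up for illustration, and it assumes both fields are plain strings with no surrounding whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): true when one string equals
# the other with exactly one character removed from its front or its back.
sub affix_one_off {
    my ($x, $y) = @_;

    # Arrange the arguments so that $long is the longer of the two.
    my ($short, $long) = length($x) <= length($y) ? ($x, $y) : ($y, $x);
    return 0 unless length($long) == length($short) + 1;

    my $match = substr($long, 1) eq $short        # extra leading character
             || substr($long, 0, -1) eq $short;   # extra trailing character
    return $match ? 1 : 0;
}

print affix_one_off('123',  '123A') ? "one-off\n" : "not one-off\n";
print affix_one_off('0234', '234')  ? "one-off\n" : "not one-off\n";
print affix_one_off('123',  '12X3') ? "one-off\n" : "not one-off\n";
```

A character inserted in the middle (123 versus 12X3) is deliberately not counted here; that needs a more general comparison.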

JP Ogden · Jul 22, 2003

The characters in "Field 1" could be the same number -

Like 1234 and 1235

Or 234- and 334-

The characters in "Field 1" could be as Jeff put them -

abc1def and abc2def

Or 045D/123 and 055D/123

The characters in "Field 1" could also be as Tad put them -

123 and 12X3

Or add-34 and add34

The list goes on...

Here is a sample of my data:
Field 1 Field 2 Field 3 Field 4 Field 5
123 5675.68 5/24/03 Misc Misc2
E4678 345.76 6/23/02 Test Test2
123A 8756.67 7/3/03 Code Code2
0234 10456.45 6/4/02 Man Man2
234 456.34 10/5/02 Talk Talk2
675-02 1045.45 3/5/03 Level Level1
etc...

I would like to isolate only the records where the records in "Field
1" are "one-off." The results from the above sample would look like
this:
Field 1 Field 2 Field 3 Field 4 Field 5
123 5675.68 5/24/03 Misc Misc2
123A 8756.67 7/3/03 Code Code2

and

0234 10456.45 6/4/02 Man Man2
234 456.34 10/5/02 Talk Talk2

because the records in "Field 1" are what I am calling "one-off."

You mean one character shorter, by either taking the first or
the last character off?

Or should 123 and 12X3 be "one off" too?

I'll assume the former.

The characters in "Field 1" can be anything.

I will assume "anything except whitespace" below.

Any help would be greatly appreciated.

--------------------------------------------------------
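#!/usr/bin/perl
use strict;
use warnings;

# Index every record by its first whitespace-delimited field.
my %seen;
while ( <DATA> ) {
    $seen{$1} = $_ if /^(\S+)/;
}

# For each field value, look up the two values that are one character
# shorter: last character removed, and first character removed.
my %reported;
foreach my $f1 ( sort keys %seen ) {
    foreach my $shorter ( substr($f1, 0, -1), substr($f1, 1) ) {
        if ( $seen{$shorter} and not $reported{ "$seen{$shorter}:$f1" }) {
            $reported{ "$seen{$shorter}:$f1" } = 1;
            print $seen{$shorter}, $seen{$f1}, "\n";
        }
    }
}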

__DATA__
123 5675.68 5/24/03 Misc Misc2
E4678 345.76 6/23/02 Test Test2
123A 8756.67 7/3/03 Code Code2
0234 10456.45 6/4/02 Man Man2
234 456.34 10/5/02 Talk Talk2
675-02 1045.45 3/5/03 Level Level1
--------------------------------------------------------

[ snip TOFU.
Please learn the proper way of formatting followups.
]

Steven Kuo · Jul 22, 2003

JP Ogden said:
(e-mail address removed) (Tad McClellan) wrote in message news:<[email protected]>...

JP Ogden said:

[ snipped ]

I would like to isolate only the records where the records in "Field
1" are "one-off." The results from the above sample would look like

[ snipped ]

[ snipped and rearranged ]

The characters in "Field 1" could be the same number -

Like 1234 and 1235

Or 234- and 334-

The characters in "Field 1" could be as Jeff put them -

abc1def and abc2def

Or 045D/123 and 055D/123

The characters in "Field 1" could also be as Tad put them -

123 and 12X3

Or add-34 and add34

The list goes on...

Here's a subroutine to get you started -- I presume you don't
use wide-characters in fields:

if (off_by_one("add-34", "add34")) {
    print "Off by one\n";
}

sub off_by_one {
    # returns undef, 0, or 1
    # the latter indicates that the two arguments are "off-by-1"
    my ($length1,$length2) = map length, my ($field1, $field2) = @_;
    return if (abs($length1 - $length2) > 1);
    if ($length1 == $length2) {
        my $difference = grep $_ != 0 => unpack 'C*', $field1 ^ $field2;
        return ($difference == 1)? 1 : 0;
    } else {
        my @chars1 = unpack 'C*' => $field1;
        my @chars2 = unpack 'C*' => $field2;
        my $i = 0;
        while (defined $chars1[$i]
               and defined $chars2[$i]
               and $chars1[$i] == $chars2[$i]) { # compare unsigned int
            ++$i;
        }
        substr(($length1 > $length2)? $field1 : $field2, $i, 1, '');
        return ($field1 eq $field2)? 1 : 0;
    }
}
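
A pairwise test like this gets expensive if every record has to be compared against every other one in a 1 million+ record set. One possible way around that, sketched here under assumptions and not taken from the thread, is to bucket each field value under itself plus every string made by deleting one of its characters, and then run a pairwise check only on values that share a bucket. The names within_one and one_off_pairs are hypothetical, fields are assumed to contain no whitespace, and only "one off" is handled, not "two off".

```perl
use strict;
use warnings;

# Hypothetical pairwise check: true when $x and $y differ by exactly one
# substituted, inserted, or deleted character.
sub within_one {
    my ($x, $y) = @_;
    return 0 if $x eq $y;
    ($x, $y) = ($y, $x) if length($x) > length($y);    # $x is never longer
    my ($lx, $ly) = (length $x, length $y);
    return 0 if $ly - $lx > 1;

    # Find the first position where the two strings disagree.
    my $i = 0;
    $i++ while $i < $lx && substr($x, $i, 1) eq substr($y, $i, 1);

    # Skip one character in the longer string (or in both when the lengths
    # are equal) and require the remainders to be identical.
    my $rest_x = substr($x, $lx == $ly ? $i + 1 : $i);
    my $rest_y = substr($y, $i + 1);
    return $rest_x eq $rest_y ? 1 : 0;
}

# Hypothetical driver: returns sorted "shorter-or-equal other" pair strings.
sub one_off_pairs {
    my @keys = @_;

    # Bucket every key under itself and under each one-character deletion.
    my %bucket;
    for my $k (@keys) {
        my %variants = ($k => 1);
        $variants{ substr($k, 0, $_) . substr($k, $_ + 1) } = 1
            for 0 .. length($k) - 1;
        push @{ $bucket{$_} }, $k for keys %variants;
    }

    # Keys sharing a bucket are only candidates, so verify each pair.
    my %pairs;
    for my $group (values %bucket) {
        next if @$group < 2;
        for my $i (0 .. $#$group - 1) {
            for my $j ($i + 1 .. $#$group) {
                my ($x, $y) = sort { $a cmp $b } $group->[$i], $group->[$j];
                $pairs{"$x $y"} = 1 if within_one($x, $y);
            }
        }
    }
    return sort keys %pairs;
}

# Field 1 values from the sample data in this thread.
print "$_\n" for one_off_pairs(qw(123 E4678 123A 0234 234 675-02));
```

On the sample values this should print the pairs "0234 234" and "123 123A". The verification step matters because a shared bucket alone is not proof: "ab" and "ba" both reduce to "b" but are two changes apart. Memory use grows with the number of records times the field length, so whether this fits a real data set is something to measure, not assume.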

P.S. Please do not top-post.
