Working with Duplicates in Perl to generate Unique ID

esimbo

Hi

I have been tasked with producing a new input file which requires some
manipulation of a file to generate a unique ID. I have been advised
that Perl will be the simplest course of action here but in all
honesty, I'm not sure where to start.

My input file contains the following snippets of data.

Date, Amount, Refno
2005/01/07, 00000.096532030000,#0000015511
2005/06/07, 00006.963788280000,#0000015511
2005/06/13, 00002.243425000000,#0000030502
2006/06/16, 00002.243425000000,#0000030502
2006/06/16, 00047.230000000000,#0000030502
2005/02/18, 00002.243425000000,#0000040505
2005/02/13, 00001.738765000000,#0000030627

Based on this file, I need to generate a new file containing the same
fields but with an added column for the Unique id.

The premise is simple. Compare the Refno in each row against the Refno
in the next row. If they match, append both "I" and the Date to the
Refno to generate the ID, and repeat for each following row until the
last occurrence of that Refno is reached. For the last occurrence of a
Refno, i.e. the row just before a new Refno sequence starts, append a
"P" instead.

Therefore, using the sample above, the result I would expect is as
follows

ID,Date,Amount, Refno
0000015511_I_2005/01/07, 2005/01/07, 00000.096532030000,#0000015511
0000015511_P_2005/06/07, 2005/06/07, 00006.963788280000,#0000015511
0000030502_I_2005/06/13, 2005/06/13, 00002.243425000000,#0000030502
0000030502_I_2006/06/16, 2006/06/16, 00002.243425000000,#0000030502
0000030502_P_2006/06/16, 2006/06/16, 00047.230000000000,#0000030502
0000040505_P_2005/02/18, 2005/02/18, 00002.243425000000,#0000040505
0000030627_P_2005/02/13, 2005/02/13, 00001.738765000000,#0000030627

If anyone can provide any assistance here, I'd really be grateful.

Regards.
 
A. Sinan Unur

(e-mail address removed) wrote in @o13g2000cwo.googlegroups.com:
I have been tasked with producing a new input file which requires some
manipulation of a file to generate a unique ID. I have been advised
that Perl will be the simplest course of action here but in all
honesty, I'm not sure where to start.

My input file contains the following snippets of data.

Date, Amount, Refno
2005/01/07, 00000.096532030000,#0000015511
2005/06/07, 00006.963788280000,#0000015511
2005/06/13, 00002.243425000000,#0000030502
2006/06/16, 00002.243425000000,#0000030502
2006/06/16, 00047.230000000000,#0000030502
2005/02/18, 00002.243425000000,#0000040505
2005/02/13, 00001.738765000000,#0000030627
....

ID,Date,Amount, Refno
0000015511_I_2005/01/07, 2005/01/07, 00000.096532030000,#0000015511
0000015511_P_2005/06/07, 2005/06/07, 00006.963788280000,#0000015511
0000030502_I_2005/06/13, 2005/06/13, 00002.243425000000,#0000030502
0000030502_I_2006/06/16, 2006/06/16, 00002.243425000000,#0000030502
0000030502_P_2006/06/16, 2006/06/16, 00047.230000000000,#0000030502
0000040505_P_2005/02/18, 2005/02/18, 00002.243425000000,#0000040505
0000030627_P_2005/02/13, 2005/02/13, 00001.738765000000,#0000030627

I would use a hash where each Refno is a key, and the values are references
to arrays of hash references, assuming that the file is a reasonable size.
You will probably need

perldoc -f split

Given this information, you can write some code now. Then, if you have
problems with your code, please post again.

In the meantime, you might benefit from reading

perldoc perlreftut

as well as the posting guidelines for this group.

Sinan
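
A minimal sketch of the data structure described above, for readers who
want to see it spelled out. It is not code from the thread: the helper
name group_by_refno and the hash keys date and amount are assumptions made
for illustration, and the header line is assumed to have been removed
before the rows are passed in.

```perl
use strict;
use warnings;

# Hypothetical helper: group data rows by Refno.
# Returns a hash ref of the form
#   Refno => [ { date => ..., amount => ... }, ... ]
sub group_by_refno {
    my @lines = @_;
    my %by_refno;
    for my $line (@lines) {
        chomp $line;
        # Split on a comma followed by optional whitespace and an optional '#'
        my ($date, $amount, $refno) = split /,\s*#?/, $line;
        next unless defined $refno;    # skip blank or malformed lines
        push @{ $by_refno{$refno} }, { date => $date, amount => $amount };
    }
    return \%by_refno;
}

my $groups = group_by_refno(
    "2005/01/07, 00000.096532030000,#0000015511\n",
    "2005/06/07, 00006.963788280000,#0000015511\n",
    "2005/06/13, 00002.243425000000,#0000030502\n",
);

for my $refno (sort keys %$groups) {
    my @rows = @{ $groups->{$refno} };
    for my $i (0 .. $#rows) {
        # The last row stored for a Refno gets "P"; earlier ones get "I"
        my $flag = $i == $#rows ? 'P' : 'I';
        print join('_', $refno, $flag, $rows[$i]{date}),
              ", $rows[$i]{date}, $rows[$i]{amount},#$refno\n";
    }
}
```

Note that a hash does not remember the order in which keys were added, so
this sketch prints the groups sorted by Refno rather than in input order.
It also treats every row with the same Refno as one group, even if the
rows are not adjacent in the file.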
 
kingpin2502

Sinan

Thanks for your response. I've got a start, which is what I needed. I
must admit I wasn't aware of the rules prior to posting but I'll read
them before I post again.

Thanks.

Emmon
 
John W. Krahn

I have been tasked with producing a new input file which requires some
manipulation of a file to generate a unique ID. I have been advised
that Perl will be the simplest course of action here but in all
honesty, I'm not sure where to start.

My input file contains the following snippets of data.

Date, Amount, Refno
2005/01/07, 00000.096532030000,#0000015511
2005/06/07, 00006.963788280000,#0000015511
2005/06/13, 00002.243425000000,#0000030502
2006/06/16, 00002.243425000000,#0000030502
2006/06/16, 00047.230000000000,#0000030502
2005/02/18, 00002.243425000000,#0000040505
2005/02/13, 00001.738765000000,#0000030627

Based on this file, I need to generate a new file containing the same
fields but with an added column for the Unique id.

The premise is simple. Compare the Refno in each row against the Refno
in the next row. If they match, append both "I" and the Date to the
Refno to generate the ID, and repeat for each following row until the
last occurrence of that Refno is reached. For the last occurrence of a
Refno, i.e. the row just before a new Refno sequence starts, append a
"P" instead.

Therefore, using the sample above, the result I would expect is as
follows

ID,Date,Amount, Refno
0000015511_I_2005/01/07, 2005/01/07, 00000.096532030000,#0000015511
0000015511_P_2005/06/07, 2005/06/07, 00006.963788280000,#0000015511
0000030502_I_2005/06/13, 2005/06/13, 00002.243425000000,#0000030502
0000030502_I_2006/06/16, 2006/06/16, 00002.243425000000,#0000030502
0000030502_P_2006/06/16, 2006/06/16, 00047.230000000000,#0000030502
0000040505_P_2005/02/18, 2005/02/18, 00002.243425000000,#0000040505
0000030627_P_2005/02/13, 2005/02/13, 00001.738765000000,#0000030627

If anyone can provide any assistance here, I'd really be grateful.

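use warnings;
use strict;

my %seen;

# Process the lines last-to-first, so that the first time a Refno is seen
# is its last occurrence in the file; that row gets 'P', the rest get 'I'.
print
    reverse
    map  $_->[2]
             ? "$_->[2]_" . ( $seen{ $_->[2] }++ ? 'I' : 'P' ) . "_$_->[1], $_->[0]"
             : $_->[0],                        # no match (header, blank): unchanged
    map  [ $_, m!^([\d/]+)[^#]+#(\d+)$! ],     # [ line, date, refno ]
    reverse
    <DATA>;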


__DATA__
Date, Amount, Refno
2005/01/07, 00000.096532030000,#0000015511
2005/06/07, 00006.963788280000,#0000015511
2005/06/13, 00002.243425000000,#0000030502
2006/06/16, 00002.243425000000,#0000030502
2006/06/16, 00047.230000000000,#0000030502
2005/02/18, 00002.243425000000,#0000040505
2005/02/13, 00001.738765000000,#0000030627



John
 
Ilmari Karonen

My input file contains the following snippets of data.

Date, Amount, Refno
2005/01/07, 00000.096532030000,#0000015511
2005/06/07, 00006.963788280000,#0000015511
2005/06/13, 00002.243425000000,#0000030502
2006/06/16, 00002.243425000000,#0000030502
2006/06/16, 00047.230000000000,#0000030502
2005/02/18, 00002.243425000000,#0000040505
2005/02/13, 00001.738765000000,#0000030627

The premise is simple. Compare the Refno in each row against the Refno
in the next row. If they match, append both "I" and the Date to the
Refno to generate the ID, and repeat for each following row until the
last occurrence of that Refno is reached. For the last occurrence of a
Refno, i.e. the row just before a new Refno sequence starts, append a
"P" instead.

Okay, since you need to look ahead to the next line, it would probably
be easiest to first slurp all the data and then iterate over it. We
can split each line into an array, which will make manipulating the
fields easier, and then reassemble the lines afterwards. So:

#!/usr/bin/perl
use warnings;
use strict;

my @lines = <>; # slurp all lines from input
chomp @lines; # remove newlines
shift @lines; # remove first line (column names)

# split the lines on commas followed by a space or a number sign (#):
my @data = map [split /,[# ]/], @lines;

print "ID, Date, Amount, Refno\n";  # print new header line

foreach my $i (0 .. $#data) {
    my ($date, $amount, $refno) = @{ $data[$i] };  # columns of this row
    my $next = $data[$i+1][-1] || "";              # last col of next row
    my $char = ($refno eq $next ? "I" : "P");      # I if equal, else P
    my $id = join "_", $refno, $char, $date;       # construct id
    print "$id, $date, $amount,#$refno\n";         # print rebuilt line
}

There, that should do it. Hopefully the comments are clear enough
that you can see how it works. In fact, this turned out to be quite a
nice little example of several common Perl idioms.

One idiom that may not be immediately obvious is $data[$i+1][-1] || "".
The array indexing works just as the comment says, but the "logical
or" with an empty string may be puzzling. In fact, all it does is
eliminate an unnecessary warning. When we reach the last line, and
try to access the last column of the line after that, we get an
undefined value. The "logical or" replaces it with an empty string.
It won't affect the values on other lines, because those are all
considered by perl to be logically true.
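
As a small illustration of the point above, the sketch below compares the
"logical or" fallback with the defined-or operator (//), which is
available from Perl 5.10 onwards and only replaces undefined values. The
helper names or_default and defined_default are made up for this example
and are not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helpers comparing the two fallbacks.
sub or_default      { my ($v) = @_; return $v || ""; }   # replaces any false value
sub defined_default { my ($v) = @_; return $v // ""; }   # replaces undef only (Perl 5.10+)

for my $value (undef, "0000040505", "0") {
    my $shown = defined $value ? "'$value'" : 'undef';
    printf "%-14s ||: '%s'   //: '%s'\n",
        $shown, or_default($value), defined_default($value);
}
```

For undef both forms give the empty string, and a Refno such as
"0000040505" passes through both unchanged. The two only differ for a
false-but-defined value such as the string "0", which || replaces with the
empty string and // keeps. Zero-padded Refnos like the ones in the sample
data are true strings, so either form should behave the same there.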
 
kingpin2502

Hi Ilmari

That was very clear thank you. I appreciate that very much.

Thanks
Emmon
 
kingpin2502

John

Thanks for your help with this. I really appreciated the help

Thanks
Emmon
 
Sherm Pendley

kingpin2502 said:
> That was very clear thank you. I appreciate that very much.

*What* was very clear? Please quote enough of the message you're replying
to to provide sufficient context.

sherm--
 
Sherm Pendley

kingpin2502 said:
> I was replying to Ilmari's comments, he wanted to know whether his
> comments were clear.

What are you talking about? Ilmari's comments may have been clear, but
yours aren't. Please quote the relevant parts of the message you're
replying to, so that your own comments make sense.

sherm--
 
kingpin2502

Sherm

I was replying to Ilmari's comments, he wanted to know whether his
comments were clear. The other responses were all individual thank yous
to the responses I got. I wasn't aware at the time, that it didn't
quote the original text in the reply
 
Sherm Pendley

kingpin2502 said:
> I'm really not sure where you're going with this. Can you state the
> relevance here?

Where I'm going with what? The relevance of what?

Please quote the relevant parts of the messages you're replying to - the rest
of us aren't mind-readers.

> I don't see the need to copy
> and paste the whole mail I was responding to when all I want to do is
> say Thank You.

If you're responding to an email, why would you post the response here in a
usenet group?

> You can quite clearly see it in the thread

No, I can't. I'm not using Google Groups, I'm using a news reader. I'm not
looking at a thread, I'm looking at a message. A message that makes no sense
to me because you're making invalid assumptions about what I can see along
with your message.

sherm--
 
kingpin2502

Sherm

I'm really not sure where you're going with this. Can you state the
relevance here? As I have already stated, not quite sure how much
clearer you'll like me to, I was simply saying thank you to the people
who took time to respond to my query. If you look at the thread, you'll
find they are all replies to the authors. I don't see the need to copy
and paste the whole mail I was responding to when all I want to do is
say Thank You. You can quite clearly see it in the thread who I have
replied to.
 
A. Sinan Unur

> I'm really not sure where you're going with this. Can you state the
> relevance here?

Who knows?

Please quote an appropriate amount of context when replying.

Sinan
 
Jürgen Exner

kingpin2502 said:
> Sherm
>
> I'm really not sure where you're going with this.

What is "this"? Please quote some context so that people have a chance to
know what you are talking about.

> Can you state the
> relevance here? As I have already stated, not quite sure how much
> clearer you'll like me to, I was simply saying thank you to the people
> who took time to respond to my query.

That is very commendable; most people forget that step.

> If you look at the thread,
> you'll find they are all replies to the authors.

You don't seem to know much about Usenet. Because of its asynchronous,
distributed implementation there is no guarantee that articles
- arrive on a server in a specific order
- arrive on a server at all
- are available on a server at any specific moment in time
- are visible to a user now
- have been visible to a user in the past
- will ever be visible to a user
To make a long story short: you can never assume that Joe Reader can see or
has seen the same set of articles as you.

Therefore, and to make reading more efficient (no need to scroll back to a
previous article and, most importantly, knowing exactly which part of a
preceding article someone is commenting on), it has been a proven Usenet
custom for the last two decades to quote just enough context from the
preceding article that your posting is understandable without someone
reading the preceding article. He may not have had a chance to read it.

Now, for a general thank you it is quite customary to follow up to your own
posting and just say "Thanks to all who replied, I will try your
suggestions" or something to that effect.

> I don't see the need
> to copy and paste the whole mail I was responding to when all I want
> to do is say Thank You.

That would be quite stupid and frowned upon indeed. You should quote enough
context that your reply is understandable on its own without someone
reading the preceding posting.

BTW: this is Usenet and there are no mails in Usenet.

> You can quite clearly see it in the thread
> who I have replied to.

Probably not. _You_ can probably see it, but other people will not because
their view of the thread is different.

jue
 
Tad McClellan

kingpin2502 said:
> If you look at the thread, you'll
> find

How do you know what articles have reached _my_ newserver?

How do you know how articles are displayed to me?

> You can quite clearly see it in the thread who I have
> replied to.

That is just the point. We *cannot* see that quite clearly.
 
David Combs

> How do you know what articles have reached _my_ newserver?
>
> How do you know how articles are displayed to me?
>
> That is just the point. We *cannot* see that quite clearly.



I don't know what you guys are using for newsreaders,
but I'm using trn aka trn4, which has the wonderful
feature of drawing a wee tree (root at left, grows to
the right) of the surrounding part of the current thread, eg for
*this* thread:



| Comp.lang.perl.misc #553640 (45 + 1952 more) --(1)--(1)
| From: Tad McClellan <[email protected]> --(1)--(1)--(1)
| [1] Re: Working with Duplicates in Perl to generate Unique ID --(1)--(1)--(1)--(1)--(1)+-(1)
| Reply-To: (e-mail address removed) |-(1)
| Date: Tue Jun 21 12:24:29 EDT 2005 |-(1)
| Lines: 22 \-(1)



(any post not yet read is shown in square-brackets;
the digit within is for the sub-thread, eg where
someone changes the subject but continues on
with the same thread.)

Also shows where you currently are in the thread.

And you can use the arrow-keys to traverse the thing.

So, having this tree-thing, it's pretty obvious what
a post is replying to.

And here's the entire tree:

| [1] Working with Duplicates in Perl to generate Unique ID
|
| (1)+-(1)--(1)
| |-(1)--(1)--(1)
| |-(1)--(1)--(1)--(1)
| \-(1)--(1)--(1)--(1)--(1)--(1)+-(1)
| |-(1)
| |-(1)
| \-(1)
|
| End of article 553640 (of 555115) -- what next? [npq]
|

(they all show round-parens because I'm replying to the
final post in the thread.)


So, maybe you're giving that guy a needlessly-hard time,
when all he's doing is saying "thanks" (for the prior
post's solution).

Suggestion: maybe switch to trn4 -- or if not that,
then look at its source and lift the code it
uses to draw the tree.

Man, without the tree, I'd be totally lost, reading
newsgroups!


David
 
