SQL insert/update duplicate entries

Cheez

Hi there, I am using a pretty basic Perl script to read data from a
file into a mysql table. I have no provisions in my SQL statement to
deal with duplicate entries. In this situation, I want to increment a
frequency counter.

#My main SQL syntax is this:

INSERT INTO mytable (wherefound, sequence)
VALUES (?, ?)

# I would like to do this (but it doesn't work):

INSERT INTO mytable (wherefound, sequence) VALUES (?, ?)
ON DUPLICATE KEY UPDATE frequency = frequency + 1

The 'sequence' field is the one that I want to check for duplicates
and then increment the 'frequency' field by one. If someone could
offer up any pointers I would be very much obliged.

Thanks,
Cheez

#code below:

print "\nINSERTER_SEQUENCE: insert processed sequence data into
sequence5 table\n";

use DBI;

my $dbh = DBI->connect("dbi:mysql:mydatabase", "", "");

my $sth = $dbh->prepare(<<SQL);

INSERT INTO mytable (wherefound, sequence)

VALUES (?, ?)

SQL

open(FILE, "seqparsed.txt");

while (my $data = <FILE>) {

my ($wherefound,$sequence) = split /\t/, $data;

$sth->execute($wherefound,$sequence);

}

close (FILE);
 
Gregory Toomey

Cheez said:
Hi there, I am using a pretty basic Perl script to read data from a
file into a mysql table. I have no provisions in my SQL statement to
deal with duplicate entries. In this situation, I want to increment a
frequency counter.

Do a SELECT first and, depending on whether the record is there, insert or
increment.

gtoomey
 
Tore Aursand

Hi there, I am using a pretty basic Perl script to read data from a
file into a mysql table. I have no provisions in my SQL statement to
deal with duplicate entries. In this situation, I want to increment a
frequency counter.

#My main SQL syntax is this:

INSERT INTO mytable (wherefound, sequence)
VALUES (?, ?)

# I would like to do this (but it doesn't work):

INSERT INTO mytable (wherefound, sequence) VALUES (?, ?)
ON DUPLICATE KEY UPDATE frequency = frequency + 1

The 'sequence' field is the one that I want to check for duplicates
and then increment the 'frequency' field by one. If someone could
offer up any pointers I would be very much obliged.

Hmm. Are you sure that MySQL supports the latter SQL statement? I don't
think so, but I'm not sure.

If it doesn't support it, you need to SELECT first, to find out if the
record you're after exists, and then do an UPDATE on the 'frequency' field.
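
Something along these lines, with the three statements prepared once and the
check done for each line you read (untested sketch; it assumes the table
already has the 'frequency' column you mentioned):

my $select = $dbh->prepare('SELECT frequency FROM mytable WHERE sequence = ?');
my $insert = $dbh->prepare('INSERT INTO mytable (wherefound, sequence, frequency)
                            VALUES (?, ?, 1)');
my $update = $dbh->prepare('UPDATE mytable SET frequency = frequency + 1
                            WHERE sequence = ?');

# for each ($wherefound, $sequence) pair read from the file:
$select->execute($sequence);
my ($freq) = $select->fetchrow_array();
if ( defined $freq ) {
    $update->execute($sequence);                # seen before: bump the counter
} else {
    $insert->execute($wherefound, $sequence);   # first time: insert with frequency 1
}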
 
John J. Trammell

Hi there, I am using a pretty basic Perl script to read data from a
file into a mysql table. I have no provisions in my SQL statement to
deal with duplicate entries. In this situation, I want to increment a
frequency counter.

You might want to use the "REPLACE INTO" statement. Hard to say
from what you posted.
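
For reference, that would look something like the sketch below. Note, though,
that REPLACE deletes any existing row and inserts a fresh one, so on its own
it would reset a frequency counter rather than increment it.

# hypothetical: plain REPLACE against the same table
my $sth = $dbh->prepare('REPLACE INTO mytable (wherefound, sequence) VALUES (?, ?)');
$sth->execute($wherefound, $sequence);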
 
Cheez

John J. Trammell said:
You might want to use the "REPLACE INTO" statement. Hard to say
from what you posted.

Howdy,

I use a fairly basic Perl script to parse 16 flatfiles. For each
parse I create a tab delimited dataset made up of a 7 digit
identifier, followed by <tab>, followed by a 10-character sequence.
For instance:

6001456 GHIOGLKJNJ
6001234 SDRQIOPMNB
6310098 CCVNEEQPLL
....(100,000 lines long)

I then INSERT this data into a mySQL table using the script from my
original post. My 4 fields in the table are id, sequence, frequency,
identifier. Upon duplicate sequence entries, I wanted to use an SQL
command to increment the frequency counter by one and also concatenate
the existing identifier string with the identifier. I could have a
table that looks like:

id  sequence    frequency  identifier
1   GHIOGLKJNJ  1          6001456
2   CCVNEEQPLL  3          6310098, 7890123, 4567812

As it stands right now, when I do a simple INSERT, I do not check for
duplicates and they simply get inserted with a unique id.

A previous poster had suggested performing a SELECT query for each
line of my dataset to check for duplicates. 'IF duplicates exist
UPDATE frequency++, ELSE INSERT new data' will be my approach.

How long will that take? Right now, to INSERT this data without
checking for duplicates takes about 30 sec/100,000 records. Performing
a SELECT for every line, I imagine, is going to be a time-consuming
process?

As it stands right now, duplicates and all, my table has 3.5 million
records. I bet 20% are duplicates.

I find Perl to be such a liberating language (compared to Java for
instance). I have both the skill and enthusiasm of a newbie. Thanks
for bearing with me if this whole post comes across a bit basic.

Cheers,
Cheez
 
Tore Aursand

I use a fairly basic Perl script to parse 16 flatfiles. For each
parse I create a tab delimited dataset made up of a 7 digit
identifier, followed by <tab>, followed by a 10-character sequence.
For instance:

6001456 GHIOGLKJNJ
6001234 SDRQIOPMNB
6310098 CCVNEEQPLL
...(100,000 lines long)

I then INSERT this data into a mySQL table using the script from my
original post. My 4 fields in the table are id, sequence, frequency,
identifier. Upon duplicate sequence entries, I wanted to use an SQL
command to increment the frequency counter by one and also concatenate
the existing identifier string with the identifier. I could have a
table that looks like:

id  sequence    frequency  identifier
1   GHIOGLKJNJ  1          6001456
2   CCVNEEQPLL  3          6310098, 7890123, 4567812

Do you really save the data in the 'identifier' field as a comma-separated
string? That means you have to parse it _again_ when you SELECT from the
table, you know. Not the best solution, I would say.

Personally, I would have used more than one table for storing the data,
and I'm not really sure that I would have had the 'frequency' field stored
in any of them either:

CREATE TABLE sequences (
    id        int unsigned auto_increment primary key,
    sequence  varchar(10) not null,

    INDEX (sequence)
);

CREATE TABLE sequence_identifier (
    id          int unsigned not null,
    identifier  int unsigned not null,

    INDEX (identifier)
);

Inserting the data into the database would proceed something like this:

my $stInsSequence = $dbh->prepare('INSERT INTO sequences (sequence) VALUES (?)');
my $stInsSeqIdent = $dbh->prepare('INSERT INTO sequence_identifier (id, identifier) VALUES (?, ?)');

my %lookup = ();    # caches sequence => id so each sequence is inserted only once
while ( <DATA> ) {  # read lines; use your real input file handle here
    chomp;
    my ( $identifier, $sequence ) = split( /\t+/, $_, 2 );
    my $id = $lookup{$sequence} || 0;
    unless ( $id > 0 ) {
        $stInsSequence->execute( $sequence );
        $id = $stInsSequence->{'mysql_insertid'};   # auto_increment id of the new row
        $lookup{$sequence} = $id;
    }
    $stInsSeqIdent->execute( $id, $identifier );
}

$stInsSeqIdent->finish();
$stInsSequence->finish();

If I read your specification correctly, this should be all you need. A
note on the %lookup hash, though: It's intended for caching the id for
each sequence, thus it can grow quite large if you have a lot of data.

To get a list of all the identifiers for a specified sequence:

my $stSeqIdent = $dbh->prepare('SELECT si.identifier
                                FROM sequences s, sequence_identifier si
                                WHERE s.sequence = ? AND s.id = si.id');
$stSeqIdent->execute( $sequence );
while ( my ( $identifier ) = $stSeqIdent->fetchrow_array() ) {
    print $identifier . "\n";
}
$stSeqIdent->finish();

To get the frequency for each sequence, you just count the occurrences of
it in the 'sequence_identifier' table, or you could list them all and
group them by the sequence.
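
For example (untested sketch, using the two tables above):

my $stFreq = $dbh->prepare('SELECT s.sequence, COUNT(*) AS freq
                            FROM sequences s, sequence_identifier si
                            WHERE s.id = si.id
                            GROUP BY s.sequence');
$stFreq->execute();
while ( my ( $sequence, $freq ) = $stFreq->fetchrow_array() ) {
    print "$sequence\t$freq\n";    # one line per distinct sequence with its count
}
$stFreq->finish();
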
A previous poster had suggested performing a SELECT query for each
line of my dataset to check for duplicates. 'IF duplicates exist
UPDATE frequency++, ELSE INSERT new data' will be my approach.

With the approach described above (totally untested, by the way), you
don't need to check for that as long as the %lookup hash fits into
memory. If the %lookup hash grows too big, you will need to fall back
to a database lookup instead.

How long will that take?

Try it. :)
 
Cheez

To get the frequency for each sequence, you just count the occurrences of
it in the 'sequence_identifier' table, or you could list them all and
group them by the sequence.


In general, wow, thanks for the code. I did end up putting together a
script, posted here:

http://tinyurl.com/ywphd

Your approach doesn't include the incredibly inefficient step of doing
a SELECT preceding an INSERT or UPDATE. However, the frequency value
is the one that I am using to identify sequences of interest. I don't
know 'a priori' which sequences are of interest to me, therefore I want
to sort by frequency number, not by sequence. I did experiment with
your strategy: knowing a particular sequence, I can SELECT * FROM table
WHERE sequence LIKE '%XXXXXXXX%'. This will return all matching records
and, of course, the number of those records will tell me the
frequency. Again, I do not know which of these 3 million sequences is
the interesting one for me. That's why the added step of doing an
initial SELECT.

I suppose I could make a Perl script to analyze the table, with
duplicates, after I populate it. Undoubtedly I would have to do this
by creating a huge hash or array. Not sure if I can do this with 3
million records.

Thanks again for your comments and the rest of the replies on this
post.

Cheers,
Cheez
 
ctcgag

In general, wow, thanks for the code. I did end up putting together a
script, posted here:

http://tinyurl.com/ywphd

Your approach doesn't include the incredibly inefficient step of doing
a SELECT preceding an INSERT or UPDATE.

If that is incredibly inefficient, you apparently aren't using indices.
If you aren't using indices, what's the point of using mySQL for this
in the first place?

However, the frequency value
is the one that I am using to identify sequences of interest. I don't
know 'a priori' which sequences are of interest to me, therefore I want
to sort by frequency number, not by sequence.

I'd consider forgetting the identifiers for the moment. Just count
the frequencies, and once you know which sequences are interesting, you
can go back and re-parse the files to get the identifiers of just those
sequences.
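
A rough sketch of that second pass; names like @interesting and @files are
just placeholders for whatever list of sequences and input files you end up
with:

my %interesting = map { $_ => 1 } @interesting;   # placeholder: sequences worth keeping
my %ids_for;                                      # sequence => list of identifiers

for my $file (@files) {                           # placeholder: your 16 flatfiles
    open my $fh, '<', $file or die "Can't open $file: $!";
    while (<$fh>) {
        chomp;
        my ($identifier, $sequence) = split /\t/;
        push @{ $ids_for{$sequence} }, $identifier if $interesting{$sequence};
    }
    close $fh;
}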

I did experiment with
your strategy: knowing a particular sequence, I can SELECT * FROM table
WHERE sequence LIKE '%XXXXXXXX%'. This will return all matching records
and, of course, the number of those records will tell me the
frequency. Again, I do not know which of these 3 million sequences is
the interesting one for me. That's why the added step of doing an
initial SELECT.

select sequence, count(*) from table group by sequence.
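
In DBI, ordered so the most frequent sequences come out first (a sketch
against your existing single-table layout):

my $sth = $dbh->prepare('SELECT sequence, COUNT(*) AS freq
                         FROM mytable
                         GROUP BY sequence
                         ORDER BY freq DESC');
$sth->execute();
while ( my ($sequence, $freq) = $sth->fetchrow_array() ) {
    print "$freq\t$sequence\n";
}
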
I suppose I could make a Perl script to analyze the table, with
duplicates, after I populate it. Undoubtedly I would have to do this
by creating a huge hash or array.

If I were going to do it this way, I wouldn't bother to put into the
database table in the first place. Just analyze it in Perl directly
from the text files.
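
A minimal sketch of that, counting into a hash keyed by sequence (@files is
again a placeholder for your flatfiles):

my %freq;
for my $file (@files) {
    open my $fh, '<', $file or die "Can't open $file: $!";
    while (<$fh>) {
        chomp;
        my (undef, $sequence) = split /\t/;   # column 2 is the sequence
        $freq{$sequence}++;
    }
    close $fh;
}

# most frequent sequences first
for my $sequence ( sort { $freq{$b} <=> $freq{$a} } keys %freq ) {
    print "$freq{$sequence}\t$sequence\n";
}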

Or I'd just do it from the linux command line:
"cat all_my_files|cut -f2|sort|uniq -c|sort -n > out"
Not sure if I can do this with 3
million records.

I can fit a 3 million member hash (10 char key, 8 char value) in ~300M
memory, or in ~220M if I squeeze (value is integer count only)

Xho
 
Cheez

If that is incredibly inefficient, you apparently aren't using indices.
If you aren't using indices, what's the point of using mySQL for this
in the first place?

My 'sequence' field is a primary index in this particular table. I
could very well be misinterpreting your question.

I'd consider forgetting the identifiers for the moment. Just count
the frequencies, and once you know which sequences are interesting, you
can go back and re-parse the files to get the identifiers of just those
sequences.

Excellent idea...thx.

Or I'd just do it from the linux command line:
"cat all_my_files|cut -f2|sort|uniq -c|sort -n > out"

I am literally using 100s of lines of code for this task!

I can fit a 3 million member hash (10 char key, 8 char value) in ~300M
memory, or in ~220M if I squeeze (value is integer count only)

I'll give it a shot. Thanks for the comments.

Cheers,
Cheez
 
Bruce Horrocks

Cheez said:
# I would like to do this (but it doesn't work):

INSERT INTO mytable (wherefound, sequence) VALUES (?, ?)
ON DUPLICATE KEY UPDATE frequency = frequency + 1

Have you tried:

INSERT INTO mytable (wherefound, sequence, frequency)
VALUES (?, ?, 1)
ON DUPLICATE KEY UPDATE frequency = frequency + 1;

The difference being that you've explicitly told it about the frequency
column.

You also need a unique index on 'sequence' in mytable for the ON
DUPLICATE KEY clause to work.
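
For a one-off, something like this should do it (a sketch; adjust to your
real table and column names):

# make 'sequence' a unique key so ON DUPLICATE KEY UPDATE fires on it
# (this will fail if the table already contains duplicate sequences)
$dbh->do('ALTER TABLE mytable ADD UNIQUE INDEX (sequence)');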

Regards,
 
ctcgag

Cheez said:


My 'sequence' field is a primary index in this particular table. I
could very well be misinterpreting your question.

Are you absolutely sure that it is a primary key? You reported about
20,000 rows processed per hour; I processed 3 million rows in 33 minutes.
(The first 20,000 took 7 seconds. The last 20,000 took 27 seconds.)
Unless you are running a 33.3MHz machine, something seems very wrong.

Xho


use strict;
use DBI;
$|++;

my $dbh = DBI->connect( "DBI:mysql:x:localhost", "me", "meme",
                        { RaiseError => 1 } );

$dbh->do('drop table if exists xx_test');
$dbh->do('create table xx_test (sequence varchar(20) primary key,
          freq int, identifiers text)');

# one handle to check for an existing sequence, one to insert a new row,
# and one to bump the counter and append the identifier to a duplicate
my $check  = $dbh->prepare('select identifiers from xx_test where sequence=?');
my $insert = $dbh->prepare('insert into xx_test (sequence, freq, identifiers)
                            values (?,1,?)');
my $update = $dbh->prepare('update xx_test set freq=freq+1,
                            identifiers=concat(identifiers,",",?)
                            where sequence=?');

timer();    # timer() is a timing helper for the benchmark; definition not shown

my @x;      # pool of already-seen sequences, used to force duplicates
foreach (1..3_000_000) {
    my $seq = join "", map { chr(65+rand(20)) } 1..10;
    if (rand() < 0.01) {    # force some duplicates to occur
        push @x, $seq unless @x > 5000;
        $seq = $x[rand(@x)] if @x > 500;
    }
    my $id = rand(1000000);    # random stand-in for the identifier

    $check->execute($seq);
    if (my $ident = $check->fetchrow_array()) {
        $update->execute($id, $seq);     # duplicate: freq++ and append identifier
    } else {
        $insert->execute($seq, $id);     # new sequence: insert with freq = 1
    }
    print "$_\t", timer(), "\n" unless $_ % 10000;
}
 
