Trying to compare two files and output it into a third file.

chutsu · Jul 29, 2009

Ok. So basically I have two files in the form of:

File1:
asdfkjsdlfkjsdf 1232
afasdfklsdjfksf 12312
sdflsadsdffdsfs 32323

File2:
asdfkjsdlfkjsdf 1232
afasdfklsdjfksf 12312
sdflsadsdffdsfs 32323

Now these two files are similar and they are not ordered, my quest is
to get the first column from the first file (ie "asdfkjsdlfkjsdf") and
read the second file to find the same exact phrase.

Once you get a match obtain both second columns (ie The numbers) and
output as follows:

File 3:
asdfdsfdsfdssa 1232 133
asdfdsfdsfdssa 1232 133
asdfdsfdsfdssa 1232 133

Can someone help me, I have no idea how to approach this.
Thanks
Chris

Moi · Jul 29, 2009

Sort/merge "nested table scan"
Some hashing might help.

NB I don't know where the 133 in the result set comes from.
And I don't know why there are *three* tuples in the result set.

HTH,
AvK

Default User · Jul 29, 2009

chutsu said:
Can someone help me, I have no idea how to approach this.

You have NO idea? Well, how would you do it by hand? What exactly is
giving you trouble? Do you know how to open files? Read from them?
Compare strings? Do you know what loops are?

If you seriously have no idea how to approach this problem, then you
need to fall back and learn C and programming from the start.
Otherwise, you need to show us what you've tried so we can help direct
you along the correct approach.

Brian

Gene · Jul 29, 2009

Perhaps it's OT, but sometimes IMO the best C is no C. I.e. this is
the kind of problem that perl, awk, and similar languages were meant
to solve.

In perl you'd need only something like this _untested_ code.

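use strict;
use warnings;

# Maps each key to the list of numbers seen for it across all files.
our %pairs;

# Read "key number" lines from the named file into %pairs.
sub scan {
    my $fn = shift;
    open(F, "<", $fn) || die "can't open $fn: $!";
    while (<F>) {
        my ($key, $val) = /^(\S+)\s+(\d+)\s*$/;
        die "bad data" unless defined $key;
        push @{ $pairs{$key} }, $val;
    }
    close F;
}

# Write every key that was seen more than once, with its numbers.
sub report {
    my $fn = shift;
    open(F, ">", $fn) || die "can't open $fn: $!";
    foreach my $key (keys %pairs) {
        next unless scalar(@{ $pairs{$key} }) > 1;
        print F "$key\t" . join("\t", @{ $pairs{$key} }) . "\n";
    }
    close F;
}

scan("file1");
scan("file2");
report("file3");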

chutsu · Jul 30, 2009

To understand my code you need to know more about what these files are.
So I'm trying to sort out some DNA data I got, the first stage is to
compare which sequences appear to be common, and how many repeats or
"reads" occur.
The first field is the sequence (or tag in my code), the second is the
number of reads.
The data file 1 and 2 will therefore look like:
CAGCTCACTGCA 123
ACGTGCCCCCTT 847
etc... etc...

I've been writing this code and I have no idea why it doesn't work:

The usual includes and file-opening code are omitted.
This is the bit I can't get to work:

// Read file1
while(!feof(file)){

    // Get the tag sequence and the read number
    fscanf(file, "%s", tag_1);

    // Validate the tag is a sequence and not the reads
    if(tag_1[0]=='A'||tag_1[0]=='C'||tag_1[0]=='G'||tag_1[0]=='T'){

        // Read file2
        fscanf(file2, "%s", tag_2);

        // Validate the tag2 is a sequence and not the reads
        if(tag_2[0]=='A'||tag_2[0]=='C'||tag_2[0]=='G'||tag_2[0]=='T')
        {

            // Now compare tag1 with tag2 to see if they match
            if(strcmp(tag_1, tag_2)==0){
                printf("match!: %s", tag_1);
            }
        }
    }
}

Note this is by no means finished. I'm working in stages, but this is as
far as I got.

Morris Keesan · Jul 30, 2009

Clarification: It looks like you only want to find a match between
the two files if the matching base sequence is on the same line
number in both files? That appears to be the intent of your code.

And a few questions: How large are these files? Is there
any particular reason to avoid sorting them? And do you have
a guarantee that the two files have the same number of lines?

// Read file1
while(!feof(file)){

This is a very common error in C code: feof only returns true after
you've attempted to read past the end of the file, NOT when you've
read the last byte of the file.

// Get the tag sequence and the read number
fscanf(file, "%s", tag_1);

// Validate the tag is a sequence and not the reads
if(tag_1[0]=='A'||tag_1[0]=='C'||tag_1[0]=='G'||tag_1[0]=='T'){

// Read file2
fscanf(file2, "%s", tag_2);

// Validate the tag2 is a sequence and not the reads
if(tag_2[0]=='A'||tag_2[0]=='C'||tag_2[0]=='G'||tag_2[0]=='T')

Note that you've repeated the code for comparing the first character to
ACGT -- very poor programming practice. If you were going to do this,
it would be worth extracting this test into a subroutine.
But reading the file this way, scanning a single token at a time and
testing the content to figure out which column you've read, is clunky
and error-prone, and I think it's a major source of the confusion in
your code. I suggest code something like this:

/* Assume tag_1 and tag_2 are arrays, or pointers to arrays,
 * nreads_1 nreads_2 are integers. Also assume that you're
 * super-confident of your data format, and that the sequences
 * can't possibly be large enough to overflow tag_1 and tag_2
 */

while(fscanf(file, "%s %d", tag_1, &nreads_1) != EOF)
{
    fscanf(file2, "%s %d", tag_2, &nreads_2);
    if (strcmp(tag_1, tag_2) == 0)
        printf("match!: %s: (%d, %d)\n", tag_1, nreads_1, nreads_2);
}

This needs some additional error-checking, but there's a basic
framework for you.

chutsu · Jul 30, 2009

Wow, that is so much simpler. Anyway, I tried your code, but
it doesn't return anything.
I did some debugging and noticed that if I add a
"printf" after the second "fscanf",
the value of tag_1 is no longer there.

while(fscanf(file, "%s %d", tag_1, &reads_1) != EOF){
    fscanf(file2, "%s %d", tag_2, &reads_2);
    printf("%s\n", tag_1);
    if (strcmp(tag_1, tag_2) == 0)
        printf("match!: %s: (%d, %d)\n", tag_1, reads_1, reads_2);
}

The program prints a bunch of blank lines, but if I move the
printf statement before the second "fscanf",
it displays the content.
I'm so confused.

Thomas Matthews · Jul 30, 2009

You have a key column, sounds like a map data structure
would be very helpful.
struct
{
char * key;
char * file1_data;
char * file2_data;
};

Read all the data from the first file into an array of structs like the
one above. Sort the array by the key field.
Read the key field from the second file. Search for the key
in memory. If the key is found, set the second file's data in that
structure. If the key is new, append a new struct and
re-sort.

In some languages, you can split the key and values into two
pieces:
struct Value
{
char * first_value;
char * second_value;
};

You would then use a map (dictionary, associative array):
map[key] = value;

--
Thomas Matthews


Morris Keesan · Jul 30, 2009

I note that you haven't answered these questions, leaving the rest
of us to guess what it is that you're really trying to do.

First: Scroll up a couple of screens, look at the questions I asked before,
and please answer them.

Second: This is all wild speculation without seeing your actual code, but
notice the comment above my code fragment, stating the assumptions that would
need to be made in order for this to work. Note especially the assumptions
about tag_1 and tag_2 pointing to memory which is large enough to hold the
strings. Without seeing your actual code, I can only guess, but I wouldn't be
at all surprised if you have declarations like

char *tag_1;
char *tag_2;

and no code which allocates any space for them to point at.
Please post the whole function which is doing this, or at least
the declarations and the code which opens the files.

chutsu · Jul 30, 2009

Yes, I'm trying to match the base sequence; however, a match does not
necessarily mean they are both on the same line number. So my code was meant to:
- read the base sequence from the first file
- store that in some variable (ie tag_1)
- read the second file to see if a match is found
- if found, printf "match found"
- loop until there are no more base sequences in file 1

Note: I actually want to do more than just printf, but one step at a time.

These files are very large, about 120,000 lines long, so I tried creating
multi-dimensional arrays, but it's just too big. The two files don't have
the same number of lines but do have the same format.

My full code at the moment is:

#include <stdio.h>
#include <string.h>

int main(int argc, char * argv[])
{

    char *file_path="../../data/clustered_tags/clustered_tags_DB2.txt";
    char *file_path2="../../data/clustered_tags/clustered_tags_SC3.txt";
    char tag_1[21];
    char tag_2[21];
    int reads_1;
    int reads_2;
    int i=0;
    FILE *file;
    FILE *file2;

    // Opening file
    file = fopen( file_path, "r" );
    file2 = fopen( file_path2, "r" );

    if(file==NULL || file2==NULL) {
        printf("Error: can't open file.\n");
        return 1;
    }
    else {
        printf("File opened!\n");
    }

    while(fscanf(file, "%s %d", tag_1, &reads_1) != EOF){
        fscanf(file2, "%s %d", tag_2, &reads_2);
        printf("%s\n", tag_1);
        if (strcmp(tag_1, tag_2) == 0)
            printf("match!: %s: (%d, %d)\n", tag_1, reads_1, reads_2);
    }

    fclose(file);
    fclose(file2);
    return 0;
}

Morris Keesan · Jul 30, 2009

Honestly, I don't think C is the correct tool for this problem.
At the very least, you should sit down and think about this
algorithmically, independent of any programming language.

If the files are unsorted, then for each line of file1, you'll
be reading the entire contents of file2 if there's no match,
and on average half of file2 if there is a match. This means
your algorithm is O(n squared): if half of the lines in file1
have a match in file2, then you're reading
(60,000 * 60,000) + (60,000 * 120,000) lines from file2
( approximately 11 BILLION lines )

If you sort both files, then you can keep the files synchronized
while you're reading them, advancing file2 to keep up with file1.
Also, consider extracting just the base sequences from each file,
then using sort and comm (Unix programs) to find the base sequences
that are in common. Then you can go back and find those matching
sequences in the original files and extract the counts from them.

jameskuyper · Jul 30, 2009

Morris said:
If you sort both files, then you can keep the files synchronized
while you're reading them, advancing file2 to keep up with file1.
Also, consider extracting just the base sequences from each file,
then using sort and comm (Unix programs) to find the base sequences
that are in common. Then you can go back and find those matching
sequences in the original files and extract the counts from them.

If he's able to use Unix tools and willing to sort the input file,
then I think that the 'join' command does pretty much exactly what he
wants done.
