Trying to compare two files and output it into a third file.

Discussion in 'C Programming' started by chutsu, Jul 29, 2009.

  1. chutsu

    chutsu Guest

    Ok. So basically I have two files in the form of:

    File1:
    asdfkjsdlfkjsdf 1232
    afasdfklsdjfksf 12312
    sdflsadsdffdsfs 32323

    File2:
    asdfkjsdlfkjsdf 1232
    afasdfklsdjfksf 12312
    sdflsadsdffdsfs 32323

    Now these two files are similar and they are not ordered, my quest is
    to get the first column from the first file (ie "asdfkjsdlfkjsdf") and
    read the second file to find the same exact phrase.

    Once you get a match obtain both second columns (ie The numbers) and
    output as follows:

    File 3:
    asdfdsfdsfdssa 1232 133
    asdfdsfdsfdssa 1232 133
    asdfdsfdsfdssa 1232 133


    Can someone help me, I have no idea how to approach this.
    Thanks
    Chris
    chutsu, Jul 29, 2009
    #1

  2. chutsu

    Moi Guest

    On Wed, 29 Jul 2009 13:25:00 -0700, chutsu wrote:

    > Ok. So basically I have two files in the form of:
    >
    > File1:
    > asdfkjsdlfkjsdf 1232
    > afasdfklsdjfksf 12312
    > sdflsadsdffdsfs 32323
    >
    > File2:
    > asdfkjsdlfkjsdf 1232
    > afasdfklsdjfksf 12312
    > sdflsadsdffdsfs 32323
    >
    > Now these two files are similar and they are not ordered, my quest is to
    > get the first column from the first file (ie "asdfkjsdlfkjsdf") and read
    > the second file to find the same exact phrase.
    >
    > Once you get a match obtain both second columns (ie The numbers) and
    > output as follows:
    >
    > File 3:
    > asdfdsfdsfdssa 1232 133
    > asdfdsfdsfdssa 1232 133
    > asdfdsfdsfdssa 1232 133
    >
    >
    > Can someone help me, I have no idea how to approach this. Thanks



    Sort/merge "nested table scan"
    Some hashing might help.

    NB I don't know where the 133 in the result set comes from.
    And I don't know why there are *three* tuples in the result set.


    HTH,
    AvK
    Moi, Jul 29, 2009
    #2

  3. chutsu

    Default User Guest

    chutsu wrote:

    > Ok. So basically I have two files in the form of:
    >
    > File1:
    > asdfkjsdlfkjsdf 1232
    > afasdfklsdjfksf 12312
    > sdflsadsdffdsfs 32323
    >
    > File2:
    > asdfkjsdlfkjsdf 1232
    > afasdfklsdjfksf 12312
    > sdflsadsdffdsfs 32323
    >
    > Now these two files are similar and they are not ordered, my quest is
    > to get the first column from the first file (ie "asdfkjsdlfkjsdf") and
    > read the second file to find the same exact phrase.
    >
    > Once you get a match obtain both second columns (ie The numbers) and
    > output as follows:
    >
    > File 3:
    > asdfdsfdsfdssa 1232 133
    > asdfdsfdsfdssa 1232 133
    > asdfdsfdsfdssa 1232 133
    >
    >
    > Can someone help me, I have no idea how to approach this.


    You have NO idea? Well, how would you do it by hand? What exactly is
    giving you trouble? Do you know how to open files? Read from them?
    Compare strings? Do you know what loops are?

    If you seriously have no idea how to approach this problem, then you
    need to fall back and learn C and programming from the start.
    Otherwise, you need to show us what you've tried so we can help direct
    you along the correct approach.



    Brian

    --
    Day 177 of the "no grouchy usenet posts" project
    Default User, Jul 29, 2009
    #3
  4. chutsu

    Gene Guest

    On Jul 29, 4:25 pm, chutsu <> wrote:
    > Ok. So basically I have two files in the form of:
    >
    > File1:
    > asdfkjsdlfkjsdf    1232
    > afasdfklsdjfksf    12312
    > sdflsadsdffdsfs   32323
    >
    > File2:
    > asdfkjsdlfkjsdf    1232
    > afasdfklsdjfksf    12312
    > sdflsadsdffdsfs   32323
    >
    > Now these two files are similar and they are not ordered, my quest is
    > to get the first column from the first file (ie "asdfkjsdlfkjsdf") and
    > read the second file to find the same exact phrase.
    >
    > Once you get a match obtain both second columns (ie The numbers) and
    > output as follows:
    >
    > File 3:
    > asdfdsfdsfdssa    1232    133
    > asdfdsfdsfdssa    1232    133
    > asdfdsfdsfdssa    1232    133
    >


    Perhaps it's OT, but sometimes IMO the best C is no C. I.e. this is
    the kind of problem that perl, awk, and similar languages were meant
    to solve.

    In perl you'd need only something like this _untested_ code.

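    use strict;
    use warnings;

    our %pairs;

    # Read "key count" lines and collect the counts seen for each key.
    sub scan {
        my $fn = shift;
        open(my $fh, '<', $fn) or die "can't open $fn: $!";
        while (<$fh>) {
            my ($key, $val) = /^(\S+)\s+(\d+)$/;
            die "bad data" unless $key;
            push @{ $pairs{$key} }, $val;
        }
        close $fh;
    }

    # Write every key that collected more than one count.
    sub report {
        my $fn = shift;
        open(my $fh, '>', $fn) or die "can't open $fn: $!";
        foreach my $key (keys %pairs) {
            next unless scalar(@{ $pairs{$key} }) > 1;
            print $fh "$key\t" . join("\t", @{ $pairs{$key} }) . "\n";
        }
        close $fh;
    }

    scan("file1");
    scan("file2");
    report("file3");    # output file name is an assumption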
    Gene, Jul 29, 2009
    #4
  5. chutsu

    chutsu Guest

    On Jul 29, 9:51 pm, "Default User" <> wrote:
    > chutsu wrote:
    > > Ok. So basically I have two files in the form of:

    >
    > > File1:
    > > asdfkjsdlfkjsdf    1232
    > > afasdfklsdjfksf    12312
    > > sdflsadsdffdsfs   32323

    >
    > > File2:
    > > asdfkjsdlfkjsdf    1232
    > > afasdfklsdjfksf    12312
    > > sdflsadsdffdsfs   32323

    >
    > > Now these two files are similar and they are not ordered, my quest is
    > > to get the first column from the first file (ie "asdfkjsdlfkjsdf") and
    > > read the second file to find the same exact phrase.

    >
    > > Once you get a match obtain both second columns (ie The numbers) and
    > > output as follows:

    >
    > > File 3:
    > > asdfdsfdsfdssa    1232    133
    > > asdfdsfdsfdssa    1232    133
    > > asdfdsfdsfdssa    1232    133

    >
    > > Can someone help me, I have no idea how to approach this.

    >
    > You have NO idea? Well, how would you do it by hand? What exactly is
    > giving you trouble? Do you know how to open files? Read from them?
    > Compare strings? Do you know what loops are?
    >
    > If you seriously have no idea how to approach this problem, then you
    > need to fall back and learn C and programming from the start.
    > Otherwise, you need to show us what you've tried so we can help direct
    > you along the correct approach.
    >
    > Brian
    >
    > --
    > Day 177 of the "no grouchy usenet posts" project


    To understand my code you need to know more about what these files are.
    I'm trying to sort out some DNA data I got. The first stage is to
    find which sequences the two files have in common, and how many repeats
    or "reads" occur for each.
    The first field is the sequence (or tag in my code), the second is the
    number of reads.
    The data file 1 and 2 will therefore look like:
    CAGCTCACTGCA 123
    ACGTGCCCCCTT 847
    etc... etc...

    I've been writing this code and I have no idea why it doesn't work.

    The usual includes and file opening come first.
    This is the bit I can't get to work:

    // Read file1
    while(!feof(file)){

        // Get the tag sequence and the read number
        fscanf(file, "%s", tag_1);

        // Validate the tag is a sequence and not the reads
        if(tag_1[0]=='A'||tag_1[0]=='C'||tag_1[0]=='G'||tag_1[0]=='T'){

            // Read file2
            fscanf(file2, "%s", tag_2);

            // Validate the tag2 is a sequence and not the reads
            if(tag_2[0]=='A'||tag_2[0]=='C'||tag_2[0]=='G'||tag_2[0]=='T')
            {

                // Now compare tag1 with tag2 to see if they match
                if(strcmp(tag_1, tag_2)==0){
                    printf("match!: %s", tag_1);
                }
            }
        }
    }

    Note this is by no means finished. I'm working in stages, but this is as
    far as I got.
    chutsu, Jul 30, 2009
    #5
  6. On Wed, 29 Jul 2009 19:14:35 -0400, chutsu <> wrote:

    > to understand my code you need know more about what these files are.
    > So I'm trying to sort out some DNA data I got, the first stage is to
    > compare which sequences appear to be common, and how many repeats or
    > "reads" occur.
    > The first field is the sequence (or tag in my code), the second is the
    > number of reads.
    > The data file 1 and 2 will therefore look like:
    > CAGCTCACTGCA 123
    > ACGTGCCCCCTT 847
    > etc... etc...


    Clarification: It looks like you only want to find a match between
    the two files if the matching base sequence is on the same line
    number in both files? That appears to be the intent of your code.

    And a few questions: How large are these files? Is there
    any particular reason to avoid sorting them? And do you have
    a guarantee that the two files have the same number of lines?

    >
    > I've been writing this code and I have no idea why it doesn't work:
    >
    > the usual inclues and opening file...
    > This is the bit I can't get it to work
    >
    > // Read file1
    > while(!feof(file)){


    This is a very common error in C code: feof only returns true after
    you've attempted to read past the end of the file, NOT when you've
    read the last byte of the file.
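
    A minimal sketch of the usual fix, assuming whitespace-separated
    "tag count" lines: let the return value of fscanf drive the loop
    instead of feof. The helper name count_records and the 20-character
    tag limit are assumptions for illustration, not part of the posted code.

```c
#include <stdio.h>

/* Count well-formed "tag count" records in a stream.
 * The loop condition is the read itself: fscanf returns the number of
 * items converted, so anything other than 2 (including EOF) ends the loop.
 * Hypothetical helper; assumes tags of at most 20 characters. */
int count_records(FILE *fp)
{
    char tag[21];
    int reads;
    int n = 0;

    while (fscanf(fp, "%20s %d", tag, &reads) == 2)
        n++;
    return n;
}
```

    Written with while(!feof(fp)) and the read inside the body, the same
    loop would typically run one extra time after the last record.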

    >
    > // Get the tag sequence and the read number
    > fscanf(file, "%s", tag_1);
    >
    > // Validate the tag is a sequence and not the reads
    > if(tag_1[0]=='A'||tag_1[0]=='C'||tag_1[0]=='G'||tag_1[0]=='T'){
    >
    > // Read file2
    > fscanf(file2, "%s", tag_2);
    >
    > // Validate the tag2 is a sequence and not the reads
    > if(tag_2[0]=='A'||tag_2[0]=='C'||tag_2[0]=='G'||tag_2[0]=='T')


    Note that you've repeated the code for comparing the first character to
    ACGT -- very poor programming practice. If you were going to do this,
    it would be worth extracting this test into a subroutine.
    But reading the file this way, scanning a single token at a time and
    testing the content to figure out which column you've read, is clunky
    and error-prone, and I think it's a major source of the confusion in
    your code. I suggest code something like this:

    /* Assume tag_1 and tag_2 are arrays, or pointers to arrays,
     * nreads_1 nreads_2 are integers. Also assume that you're
     * super-confident of your data format, and that the sequences
     * can't possibly be large enough to overflow tag_1 and tag_2
     */

    while(fscanf(file, "%s %d", tag_1, &nreads_1) != EOF)
    {
        fscanf(file2, "%s %d", tag_2, &nreads_2);
        if (strcmp(tag_1, tag_2) == 0)
            printf("match!: %s: (%d, %d)\n", tag_1, nreads_1, nreads_2);
    }

    This needs some additional error-checking, but there's a basic
    framework for you.
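
    One possible shape for that error-checking, as a sketch only: bound
    the %s conversion with a field width and tell "end of input" apart
    from "malformed line". The name read_record and the TAG_MAX limit of
    20 are assumptions, not something from the original code.

```c
#include <stdio.h>

#define TAG_MAX 20   /* assumed maximum tag length; must match the %20s below */

/* Read one "tag count" record.
 * Returns 1 on success, 0 at end of input, -1 on malformed data. */
int read_record(FILE *fp, char tag[TAG_MAX + 1], int *reads)
{
    int got = fscanf(fp, "%20s %d", tag, reads);

    if (got == 2)
        return 1;
    if (got == EOF)
        return 0;
    return -1;   /* a tag was read but no valid count followed it */
}
```

    A tag longer than the limit is cut at 20 characters, the count
    conversion then fails on the leftover letters, and the line is
    reported as malformed instead of overflowing the buffer.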
    Morris Keesan, Jul 30, 2009
    #6
  7. chutsu

    chutsu Guest

    On Jul 30, 12:41 am, "Morris Keesan" <> wrote:
    > On Wed, 29 Jul 2009 19:14:35 -0400, chutsu <> wrote:
    > > to understand my code you need know more about what these files are.
    > > So I'm trying to sort out some DNA data I got, the first stage is to
    > > compare which sequences appear to be common, and how many repeats or
    > > "reads" occur.
    > > The first field is the sequence (or tag in my code), the second is the
    > > number of reads.
    > > The data file 1 and 2 will therefore look like:
    > > CAGCTCACTGCA    123
    > > ACGTGCCCCCTT    847
    > > etc... etc...

    >
    > Clarification: It looks like you only want to find a match between
    > the two files if the matching base sequence is on the same line
    > number in both files?  That appears to be the intent of your code.
    >
    > And a few questions: How large are these files?  Is there
    > any particular reason to avoid sorting them?  And do you have
    > a guarantee that the two files have the same number of lines?
    >
    >
    >
    > > I've been writing this code and I have no idea why it doesn't work:

    >
    > > the usual inclues and opening file...
    > > This is the bit I can't get it to work

    >
    > > // Read file1
    > > while(!feof(file)){

    >
    > This is a very common error in C code: feof only returns true after
    > you've attempted to read past the end of the file, NOT when you've
    > read the last byte of the file.
    >
    >
    >
    > >     // Get the tag sequence and the read number
    > >     fscanf(file, "%s", tag_1);

    >
    > >     // Validate the tag is a sequence and not the reads
    > >     if(tag_1[0]=='A'||tag_1[0]=='C'||tag_1[0]=='G'||tag_1[0]=='T'){

    >
    > >         // Read file2
    > >         fscanf(file2, "%s", tag_2);

    >
    > >         // Validate the tag2 is a sequence and not the reads
    > >         if(tag_2[0]=='A'||tag_2[0]=='C'||tag_2[0]=='G'||tag_2[0]=='T')

    >
    > Note that you've repeated the code for comparing the first character to
    > ACGT -- very poor programming practice.  If you were going to do this,
    > it would be worth extracting this test into a subroutine.
    > But reading the file this way, scanning a single token at a time and
    > testing the content to figure out which column you've read, is clunky
    > and error-prone, and I think it's a major source of the confusion in
    > your code.  I suggest code something like this:
    >
    > /* Assume tag_1 and tag_2 are arrays, or pointers to arrays,
    >   * nreads_1 nreads_2 are integers.  Also assume that you're
    >   * super-confident of your data format, and that the sequences
    >   * can't possibly be large enough to overflow tag_1 and tag_2
    >   */
    >
    > while(fscanf(file, "%s %d", tag_1, &nreads_1) != EOF)
    > {
    >      fscanf(file2, "%s %d", tag_2, &nreads_2);
    >      if (strcmp(tag_1, tag_2) == 0)
    >          printf("match!: %s: (%d, %d)\n", tag_1, nreads_1, nreads_2);
    >
    > }
    >
    > This needs some additional error-checking, but there's a basic
    > framework for you.



    Wow, that is so much simpler. Anyway, I tried your code, but
    it doesn't print anything.
    I did some debugging and noticed that if I add a
    "printf" after the second "fscanf",
    the value of tag_1 no longer shows up.

    while(fscanf(file, "%s %d", tag_1, &reads_1) != EOF){
        fscanf(file2, "%s %d", tag_2, &reads_2);
        printf("%s\n", tag_1);
        if (strcmp(tag_1, tag_2) == 0)
            printf("match!: %s: (%d, %d)\n", tag_1, reads_1, reads_2);
    }

    The program prints a bunch of blank lines, but if I move the
    printf statement before the second "fscanf",
    it displays the content.
    I'm so confused.
    chutsu, Jul 30, 2009
    #7
  8. chutsu wrote:
    > Ok. So basically I have two files in the form of:
    >
    > File1:
    > asdfkjsdlfkjsdf 1232
    > afasdfklsdjfksf 12312
    > sdflsadsdffdsfs 32323
    >
    > File2:
    > asdfkjsdlfkjsdf 1232
    > afasdfklsdjfksf 12312
    > sdflsadsdffdsfs 32323
    >
    > Now these two files are similar and they are not ordered, my quest is
    > to get the first column from the first file (ie "asdfkjsdlfkjsdf") and
    > read the second file to find the same exact phrase.
    >
    > Once you get a match obtain both second columns (ie The numbers) and
    > output as follows:
    >
    > File 3:
    > asdfdsfdsfdssa 1232 133
    > asdfdsfdsfdssa 1232 133
    > asdfdsfdsfdssa 1232 133
    >
    >
    > Can someone help me, I have no idea how to approach this.
    > Thanks
    > Chris


    You have a key column, sounds like a map data structure
    would be very helpful.
    struct
    {
    char * key;
    char * file1_data;
    char * file2_data;
    };

    Read all the data from the first file into a struct like above.
    Sort by the key field.
    Read the key field from the second file. Search for the key
    in the memory. If key field is the same, set the data in the
    structure. If field is unique, append a new struct and
    resort.
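
    A rough sketch of the sort-then-search part of those steps, using
    qsort and bsearch from the standard library. The struct layout,
    the integer counts, the 20-character key limit and the function
    names are all assumptions; appending keys that appear only in the
    second file is left out.

```c
#include <stdlib.h>
#include <string.h>

/* One row per sequence: key plus the count from each file.
 * Field names and the 20-character key limit are assumptions. */
struct record {
    char key[21];
    int file1_reads;
    int file2_reads;   /* -1 until a match from file 2 is recorded */
};

/* Order records by key so qsort and bsearch agree on the ordering. */
static int cmp_record(const void *a, const void *b)
{
    const struct record *ra = a;
    const struct record *rb = b;
    return strcmp(ra->key, rb->key);
}

/* Sort the table built from file 1 so it can be binary-searched. */
void sort_records(struct record *tab, size_t n)
{
    qsort(tab, n, sizeof tab[0], cmp_record);
}

/* Look up a key read from file 2; returns NULL when it is not in the table. */
struct record *find_record(struct record *tab, size_t n, const char *key)
{
    struct record probe;

    strncpy(probe.key, key, sizeof probe.key - 1);
    probe.key[sizeof probe.key - 1] = '\0';
    return bsearch(&probe, tab, n, sizeof tab[0], cmp_record);
}
```

    With the table sorted once, each lookup costs a binary search
    instead of a scan over every record.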

    In some languages, you can split the key and values into two
    pieces:
    struct Value
    {
    char * first_value;
    char * second_value;
    };

    You would then use a map (dictionary, associative array):
    map[key] = value;
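
    C has no built-in map, so something like the following chained hash
    table would have to stand in for it. This is only a sketch: the
    bucket count, the integer value fields, the -1 "not seen" marker and
    the names map_get and map_put are assumptions made for illustration.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4096   /* assumed size; tune for the real data */

struct entry {
    char *key;
    int first_value;
    int second_value;
    struct entry *next;   /* chain of entries sharing a bucket */
};

static struct entry *buckets[NBUCKETS];

/* djb2-style string hash, reduced to a bucket index. */
static unsigned long hash(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Return the entry for key, or NULL if it is absent. */
struct entry *map_get(const char *key)
{
    struct entry *e;
    for (e = buckets[hash(key)]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e;
    return NULL;
}

/* Return the entry for key, creating it when missing.
 * Returns NULL only if memory allocation fails. */
struct entry *map_put(const char *key)
{
    struct entry *e = map_get(key);
    unsigned long h;

    if (e != NULL)
        return e;
    e = malloc(sizeof *e);
    if (e == NULL)
        return NULL;
    e->key = malloc(strlen(key) + 1);
    if (e->key == NULL) {
        free(e);
        return NULL;
    }
    strcpy(e->key, key);
    e->first_value = e->second_value = -1;   /* -1 marks "not seen" */
    h = hash(key);
    e->next = buckets[h];
    buckets[h] = e;
    return e;
}
```

    Usage would be along the lines of map_put(tag)->first_value = reads
    while reading the first file (after checking for NULL), then map_get
    on each tag from the second file.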


    --
    Thomas Matthews

    C++ newsgroup welcome message:
    http://www.slack.net/~shiva/welcome.txt
    C++ Faq: http://www.parashift.com/c++-faq-lite
    C Faq: http://www.eskimo.com/~scs/c-faq/top.html
    alt.comp.lang.learn.c-c++ faq:
    http://www.comeaucomputing.com/learn/faq/
    Other sites:
    http://www.josuttis.com -- C++ STL Library book
    http://www.sgi.com/tech/stl -- Standard Template Library
    Thomas Matthews, Jul 30, 2009
    #8
  9. On Wed, 29 Jul 2009 20:01:55 -0400, chutsu <> wrote:

    > On Jul 30, 12:41 am, "Morris Keesan" <> wrote:

    <snip>
    >> Clarification: It looks like you only want to find a match between
    >> the two files if the matching base sequence is on the same line
    >> number in both files?  That appears to be the intent of your code.
    >>
    >> And a few questions: How large are these files?  Is there
    >> any particular reason to avoid sorting them?  And do you have
    >> a guarantee that the two files have the same number of lines?





    I note that you haven't answered these questions, leaving the rest
    of us to guess what it is that you're really trying to do.

    <snip>


    >>  I suggest code something like this:
    >>
    >> /* Assume tag_1 and tag_2 are arrays, or pointers to arrays,
    >>   * nreads_1 nreads_2 are integers.  Also assume that you're
    >>   * super-confident of your data format, and that the sequences
    >>   * can't possibly be large enough to overflow tag_1 and tag_2
    >>   */
    >>
    >> while(fscanf(file, "%s %d", tag_1, &nreads_1) != EOF)
    >> {
    >>      fscanf(file2, "%s %d", tag_2, &nreads_2);
    >>      if (strcmp(tag_1, tag_2) == 0)
    >>          printf("match!: %s: (%d, %d)\n", tag_1, nreads_1, nreads_2);
    >>
    >> }
    >>
    >> This needs some additional error-checking, but there's a basic
    >> framework for you.

    >
    >
    > Wow, that is so much more simplified. Anyways I tried your code, but
    > the it doesn't return anything.
    > I have done some error analysis and noticed that if you added a
    > "printf" after the second "fscanf"
    > the value of tag_1 does not register anymore.
    >
    > while(fscanf(file, "%s %d", tag_1, &reads_1) != EOF){
    > fscanf(file2, "%s %d", tag_2, &reads_2);
    > printf("%s\n", tag_1);
    > if (strcmp(tag_1, tag_2) == 0)
    > printf("match!: %s: (%d, %d)\n", tag_1, reads_1, reads_2);
    > }
    >
    > the program does prints a bunch of blank lines, but if I moved the
    > printf statement before the second "fscanf"
    > displays the content.
    > I'm so confused


    First: Scroll up a couple of screens, look at the questions I asked
    before, and please answer them.

    Second: This is all wild speculation without seeing your actual code,
    but notice the comment above my code fragment, stating the assumptions
    that would need to be made in order for this to work. Note especially
    the assumptions about tag_1 and tag_2 pointing to memory which is large
    enough to hold the strings. Without seeing your actual code, I can only
    guess, but I wouldn't be at all surprised if you have declarations like

    char *tag_1;
    char *tag_2;

    and no code which allocates any space for them to point at.
    Please post the whole function which is doing this, or at least
    the declarations and the code which opens the files.
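
    For illustration, a small sketch of what "space for them to point
    at" could look like when a pointer is used: allocate the buffer
    first, and cap the %s conversion so it cannot write past it. The
    name parse_tag and the 20-character limit are assumptions.

```c
#include <stdio.h>
#include <stdlib.h>

#define TAG_MAX 20   /* assumed longest tag; must match the %20s below */

/* Parse "tag count" from a line into freshly allocated storage.
 * Returns the tag (caller frees it) or NULL on failure. */
char *parse_tag(const char *line, int *reads)
{
    char *tag = malloc(TAG_MAX + 1);   /* the pointer now has somewhere to point */

    if (tag == NULL)
        return NULL;
    if (sscanf(line, "%20s %d", tag, reads) != 2) {
        free(tag);
        return NULL;
    }
    return tag;
}
```

    A plain array such as char tag[TAG_MAX + 1] works just as well and
    avoids the free; either way the width in the format string has to
    stay in step with the buffer size.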
    Morris Keesan, Jul 30, 2009
    #9
  10. chutsu

    chutsu Guest


    > >> Clarification: It looks like you only want to find a match between
    > >> the two files if the matching base sequence is on the same line
    > >> number in both files?  That appears to be the intent of your code.


    Yes, I'm trying to match the base sequence; however, a match does not
    necessarily mean they are both on the same line number. So my code was
    meant to:
    - read the base sequence from the first file
    - store that in some variable (ie tag_1)
    - read the second file to see if a match is found
    - if found printf match found
    - and loops until there are no more base sequence in file 1
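
    The steps above can be sketched roughly as follows, assuming tags of
    at most 20 characters and a hypothetical helper name match_by_rescan.
    The key difference from reading both files in lockstep is the rewind
    of the second file for every tag taken from the first.

```c
#include <stdio.h>
#include <string.h>

/* For every record in f1, rescan f2 from the start looking for the same tag.
 * Writes "tag reads1 reads2" to out for each match and returns the match count.
 * Hypothetical helper; assumes tags of at most 20 characters.
 * The cost grows with lines(f1) * lines(f2). */
int match_by_rescan(FILE *f1, FILE *f2, FILE *out)
{
    char tag_1[21], tag_2[21];
    int reads_1, reads_2;
    int matches = 0;

    while (fscanf(f1, "%20s %d", tag_1, &reads_1) == 2) {
        rewind(f2);   /* start the search of file 2 over for each tag */
        while (fscanf(f2, "%20s %d", tag_2, &reads_2) == 2) {
            if (strcmp(tag_1, tag_2) == 0) {
                fprintf(out, "%s %d %d\n", tag_1, reads_1, reads_2);
                matches++;
                break;   /* stop at the first match */
            }
        }
    }
    return matches;
}
```

    This is the simplest version to get right, but it rereads the second
    file once per line of the first, so it is likely to be slow on files
    of this size.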

    Note: I actually want to do more than just printf, but one step at a time.

    > >> And a few questions: How large are these files?  Is there
    > >> any particular reason to avoid sorting them?  And do you have
    > >> a guarantee that the two files have the same number of lines?


    These files are very large, about 120,000 lines long, so I tried
    creating multi-dimensional arrays, but it's just too big. The two
    files don't have the same number of lines but do have the same format.



    >
    >
    > >>  I suggest code something like this:

    >
    > >> /* Assume tag_1 and tag_2 are arrays, or pointers to arrays,
    > >>   * nreads_1 nreads_2 are integers.  Also assume that you're
    > >>   * super-confident of your data format, and that the sequences
    > >>   * can't possibly be large enough to overflow tag_1 and tag_2
    > >>   */

    >
    > >> while(fscanf(file, "%s %d", tag_1, &nreads_1) != EOF)
    > >> {
    > >>      fscanf(file2, "%s %d", tag_2, &nreads_2);
    > >>      if (strcmp(tag_1, tag_2) == 0)
    > >>          printf("match!: %s: (%d, %d)\n", tag_1, nreads_1, nreads_2);

    >
    > >> }

    >
    > >> This needs some additional error-checking, but there's a basic
    > >> framework for you.

    >
    > > Wow, that is so much more simplified. Anyways I tried your code, but
    > > the it doesn't return anything.
    > > I have done some error analysis and noticed that if you added a
    > > "printf" after the second "fscanf"
    > > the value of tag_1 does not register anymore.

    >
    > >    while(fscanf(file, "%s %d", tag_1, &reads_1) != EOF){
    > >            fscanf(file2, "%s %d", tag_2, &reads_2);
    > >            printf("%s\n", tag_1);
    > >            if (strcmp(tag_1, tag_2) == 0)
    > >            printf("match!: %s: (%d, %d)\n", tag_1, reads_1, reads_2);
    > >    }

    >
    > > the program does prints a bunch of blank lines, but if I moved the
    > > printf statement before the second "fscanf"
    > > displays the content.
    > > I'm so confused

    >
    > First:  Scroll up a couple of screens, look at the questions I asked  
    > before,
    > and please answer them.
    >
    > Second: This is all wild speculation without seeing your actual code, but  
    > notice
    > the comment above my code fragment, stating the assumptions that would  
    > need to
    > be made in order for this to work.  Note especially the assumptions about  
    > tag_1
    > and tag_2 pointing to memory which is large enough to hold the strings.  
    > Without
    > seeing your actual code, I can only guess, but I wouldn't be at all  
    > surprised if
    > you have declarations like
    >
    >      char *tag_1;
    >      char *tag_2;
    >
    > and no code which allocates any space for them to point at.
    > Please post the whole function which is doing this, or at least
    > the declarations and the code which opens the files.


    My full code at the moment is:

    #include <stdio.h>
    #include <string.h>

    int main(int argc, char * argv[])
    {

    char *file_path="../../data/clustered_tags/clustered_tags_DB2.txt";
    char *file_path2="../../data/clustered_tags/clustered_tags_SC3.txt";
    char tag_1[21];
    char tag_2[21];
    int reads_1;
    int reads_2;
    int i=0;
    FILE *file;
    FILE *file2;


    // Opening file
    file = fopen( file_path, "r" );
    file2 = fopen( file_path2, "r" );

    if(file==NULL || file2==NULL) {
    printf("Error: can't open file.\n");
    return 1;
    }
    else {
    printf("File opened!\n");
    }

    while(fscanf(file, "%s %d", tag_1, &reads_1) != EOF){
    fscanf(file2, "%s %d", tag_2, &reads_2);
    printf("%s\n", tag_1);
    if (strcmp(tag_1, tag_2) == 0)
    printf("match!: %s: (%d, %d)\n", tag_1, reads_1, reads_2);
    }

    fclose(file);
    fclose(file2);
    return 0;
    }
    chutsu, Jul 30, 2009
    #10
  11. On Thu, 30 Jul 2009 07:27:58 -0400, chutsu <> wrote:

    >
    >> >> Clarification: It looks like you only want to find a match between
    >> >> the two files if the matching base sequence is on the same line
    >> >> number in both files?  That appears to be the intent of your code.

    >
    > Yes I'm trying to match the base sequence, however the match does not
    > necessary mean
    > they are both on the same line number. So my code was to:
    > - read the base sequence from the first file
    > - store that in some variable (ie tag_1)
    > - read the second file to see if a match is found
    > - if found printf match found
    > - and loops until there are no more base sequence in file 1

    ....
    > These files are very large, about 120,000 lines long,


    Honestly, I don't think C is the correct tool for this problem.
    At the very least, you should sit down and think about this
    algorithmically, independent of any programming language.

    If the files are unsorted, then for each line of file1, you'll
    be reading the entire contents of file2 if there's no match,
    and on average half of file2 if there is a match. This means
    your algorithm is O(n squared): if half of the lines in file1
    have a match in file2, then you're reading
    (60,000 * 60,000) + (60,000 * 120,000) lines from file2
    ( approximately 11 BILLION lines )

    If you sort both files, then you can keep the files synchronized
    while you're reading them, advancing file2 to keep up with file1.
    Also, consider extracting just the base sequences from each file,
    then using sort and comm (Unix programs) to find the base sequences
    that are in common. Then you can go back and find those matching
    sequences in the original files and extract the counts from them.
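
    A minimal sketch of the "keep them synchronized" idea, shown on
    in-memory arrays rather than files to keep it self-contained. It
    assumes both inputs are already sorted by tag in strcmp order and
    that a tag appears at most once per input; struct rec and merge_join
    are hypothetical names.

```c
#include <stdio.h>
#include <string.h>

struct rec {
    const char *tag;
    int reads;
};

/* Walk two arrays that are already sorted by tag (strcmp order), advancing
 * whichever side is behind. Writes "tag reads1 reads2" for each common tag
 * and returns the number of matches. Assumes tags are unique within each array. */
int merge_join(const struct rec *a, size_t na,
               const struct rec *b, size_t nb, FILE *out)
{
    size_t i = 0, j = 0;
    int matches = 0;

    while (i < na && j < nb) {
        int c = strcmp(a[i].tag, b[j].tag);
        if (c < 0) {
            i++;            /* a is behind: advance it */
        } else if (c > 0) {
            j++;            /* b is behind: advance it */
        } else {
            fprintf(out, "%s %d %d\n", a[i].tag, a[i].reads, b[j].reads);
            matches++;
            i++;
            j++;
        }
    }
    return matches;
}
```

    Each record is visited once, so after sorting the matching pass is
    linear in the combined length of the two inputs.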
    Morris Keesan, Jul 30, 2009
    #11
  12. chutsu

    jameskuyper Guest

    Morris Keesan wrote:
    > On Thu, 30 Jul 2009 07:27:58 -0400, chutsu <> wrote:
    >
    > >
    > >> >> Clarification: It looks like you only want to find a match between
    > >> >> the two files if the matching base sequence is on the same line
    > >> >> number in both files?  That appears to be the intent of your code.

    > >
    > > Yes I'm trying to match the base sequence, however the match does not
    > > necessary mean
    > > they are both on the same line number. So my code was to:
    > > - read the base sequence from the first file
    > > - store that in some variable (ie tag_1)
    > > - read the second file to see if a match is found
    > > - if found printf match found
    > > - and loops until there are no more base sequence in file 1

    > ...
    > > These files are very large, about 120,000 lines long,

    >
    > Honestly, I don't think C is the correct tool for this problem.
    > At the very least, you should sit down and think about this
    > algorithmically, independent of any programming language.
    >
    > If the files are unsorted, then for each line of file1, you'll
    > be reading the entire contents of file2 if there's no match,
    > and on average half of file2 if there is a match. This means
    > your algorithm is O(n squared): if half of the lines in file1
    > have a match in file2, then you're reading
    > (60,000 * 60,000) + (60,000 * 120,000) lines from file2
    > ( approximately 11 BILLION lines )
    >
    > If you sort both files, then you can keep the files synchronized
    > while you're reading them, advancing file2 to keep up with file1.
    > Also, consider extracting just the base sequences from each file,
    > then using sort and comm (Unix programs) to find the base sequences
    > that are in common. Then you can go back and find those matching
    > sequences in the original files and extract the counts from them.


    If he's able to use Unix tools and willing to sort the input files,
    then I think that the 'join' command does pretty much exactly what he
    wants done.
    jameskuyper, Jul 30, 2009
    #12