simple indexing in Perl?

Discussion in 'Perl Misc' started by ela, Aug 10, 2010.

  1. ela

    ela Guest

    I'm new to database programming and only recently learnt to use loops to
    look up and enrich information using the code below. However, when the
    tables are large, I find this process is very slow. Then somebody told me I
    can build a database for one of the files in real time, so there is no need
    to read the file from beginning to end again and again. However, Perl DBI
    has a lot of sophisticated functions, and in fact my tables are merely
    large, nothing special, linked by an ID. Is there any simple way to achieve
    the same purpose? I just wish the ID could be indexed so that every time I
    access the record it goes through memory and not through I/O...


    #!/usr/bin/perl

    my ($listfile, $format, $accfile, $infofile) = @ARGV;
    print '($listfile, $accfile, $infofile)'; <STDIN>;

    print "Working on $listfile...\n";
    $outname = $listfile . "_" . $infofile . ".xls";

    open (OFP, ">$outname");

    open(FP, $listfile);
    $line = <FP>;
    chomp $line;

    if ($format ne "") {
        @fields = split(/\t/, $line);
        for ($i=0; $i<@fields; $i++) {
            ############## check fields ###############################
            if ( $fields[$i] =~ /accession/) {
                $acci = $i;
            }
        }
    }

    print OFP "$line\tgene info\n";

    $nl = '\n';

    while (<FP>) {
        $line = $_;
        if ($line eq "\n") {
            print OFP $line;
            next;
        }
        chomp $line;

        if ($format eq "") {
            @cells = split (/:/, $line);
            $tag = $cells[0];
        } else {
            @cells = split (/\t/, $line);
            $tag = $cells[$acci];
        }

        open(AFP, $accfile);

        while (<AFP>) {
            @cells = split (/\t/, $_);
            if ($cells[5] =~ /$tag/) {
                $des = $cells[1];
                last;
            }
        }
        close AFP;

        if ($found == 0) {
            print OFP "$line\tNo gene info available\n";
        }
    }
    close FP;
    ela, Aug 10, 2010
    #1

  2. ela <> wrote:
    > I'm new to database programming and only recently learnt to use loops to
    > look up and enrich information using the code below. However, when the
    > tables are large,


    Which tables? Do you mean 'files'?

    > I find this process is very slow. Then somebody told me I can build a
    > database for one of the files in real time, so there is no need to read
    > the file from beginning to end again and again. However, Perl DBI has a
    > lot of sophisticated functions, and in fact my tables are merely large,
    > nothing special, linked by an ID. Is there any simple way to achieve the
    > same purpose? I just wish the ID could be indexed so that every time I
    > access the record it goes through memory and not through I/O...


    > #!/usr/bin/perl


    Please, please use

    use strict;
    use warnings;

    It will tell you about a lot of potential problems.

    > my ($listfile, $format, $accfile, $infofile) = @ARGV;
    > print '($listfile, $accfile, $infofile)'; <STDIN>;


    What's that at the end of the line good for?

    > print "Working on $listfile...\n";
    > $outname = $listfile . "_" . $infofile . ".xls";


    > open (OFP, ">$outname");


    Better use the three-argument form of open and use normal
    variables for file handles, this isn't Perl 4 anymore...

    open my $ofp, '>', $outname
        or die "Can't open $outname for writing\n";

    Also checking that opening a file succeeded shouldn't be left
    out without very good reasons...

    > open(FP, $listfile);
    > $line = <FP>;
    > chomp $line;


    > if ($format ne "") {
    > @fields = split(/\t/, $line);
    > for ($i=0; $i<@fields; $i++) {
    > ############## check fields ###############################
    > if ( $fields[$i] =~ /accession/) {


    Are you aware that this will also match e.g. 'disaccession_123'?

    > $acci = $i;
    > }
    > }
    > }


    > print OFP "$line\tgene info\n";


    > $nl = '\n';


    > while (<FP>) {
    > $line = $_;


    Why don't you read directly into '$line' instead of making an
    additional copy?

    > if ($line eq "\n") {
    > print OFP $line;
    > next;
    > }
    > chomp $line;


    > if ($format eq "") {
    > @cells = split (/:/, $line);
    > $tag = $cells[0];
    > } else {
    > @cells = split (/\t/, $line);
    > $tag = $cells[$acci];
    > }


    > open(AFP, $accfile);


    > while (<AFP>) {
    > @cells = split (/\t/, $_);
    > if ($cells[5] =~ /$tag/) {
    > $des = $cells[1];
    > last;
    > }
    > }
    > close AFP;


    > if ($found == 0) {
    > print OFP "$line\tNo gene info available\n";
    > }


    Huh? '$found' is nowhere else used in your program. With
    'use warnings' you would have gotten a warning that you
    use the value of an uninitialized variable...

    > }
    > close FP;


    Probably the most time-consuming part of your program is that for
    each line of the file with the name '$listfile' you read in at
    least a certain portion of '$accfile', again and again. To get
    around that you don't need a database, you just have to read it
    in only once and store the relevant information e.g. in a hash.
    If you do something like

    open my $afp, '<', $accfile
        or die "Can't open $accfile for reading\n";

    my %ahash;
    while ( my $line = <$afp> ) {
        my @cells = split /\t/, $line;
        $ahash{ $cells[ 5 ] } = $cells[ 1 ];
    }
    close $afp;

    somewhere at the beginning, then you would have all the
    information you use from the '$accfile' file in the %ahash hash
    and there would be no need to read the file again and again:

    while ( my $line = <$fp> ) {
        if ( $line eq "\n" ) {
            print $ofp "\n";
            next;
        }
        chomp $line;

        if ( $format eq "" ) {
            @cells = split /:/, $line;
            $tag = $cells[ 0 ];
        } else {
            @cells = split /\t/, $line;
            $tag = $cells[ $acci ];
        }

        $des = $ahash{ $tag } if exists $ahash{ $tag };
    }

    close $fp;

    Putting things in a database won't do too much good here
    since, unless you have an in-memory database, the database
    will also put the information on the disk and have to
    retrieve it from there (but for sure a lot faster than
    re-reading a file for a bit of information lots of times;-)
    The only case I can think of where using a database may be
    beneficial here is when the '$accfile' is extremely large
    and the '%ahash' would use up all the memory you have. In
    that case putting things in a database (on disk then, of
    course) for relatively fast lookup of the value for a key
    (i.e. what you have in the '$tag' variable) might be a
    reasonable alternative.
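
    In such a case a lightweight middle ground is a DBM file tied to
    a hash: the index lives on disk but you still get hash-style
    lookups. The following is only a sketch, assuming the DB_File
    module is available; the file names and the sample key are
    made up:

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # Tie a hash to an on-disk DBM file ('acc.db' is a made-up name);
    # lookups then use the file's own index instead of a linear scan.
    tie my %ahash, 'DB_File', 'acc.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot tie acc.db: $!\n";

    # Fill it once from the accession file (made-up name, same columns
    # as before: key in field 5, description in field 1).
    open my $afp, '<', 'accfile.txt' or die "Can't open accfile.txt\n";
    while ( my $line = <$afp> ) {
        chomp $line;
        my @cells = split /\t/, $line;
        next unless defined $cells[5];         # guard against short lines
        $ahash{ $cells[ 5 ] } = $cells[ 1 ];   # written through to disk
    }
    close $afp;

    print "$ahash{ 'some_id' }\n" if exists $ahash{ 'some_id' };
    untie %ahash;
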
    Regards, Jens
    --
    \ Jens Thoms Toerring ___
    \__________________________ http://toerring.de
    Jens Thoms Toerring, Aug 10, 2010
    #2

  3. wolf

    wolf Guest

    ela schrieb:
    > I'm new to database programming and only recently learnt to use loops to
    > look up and enrich information using the code below. However, when the
    > tables are large, I find this process is very slow. Then somebody told me I
    > can build a database for one of the files in real time, so there is no need
    > to read the file from beginning to end again and again. However, Perl DBI
    > has a lot of sophisticated functions, and in fact my tables are merely
    > large, nothing special, linked by an ID. Is there any simple way to achieve
    > the same purpose? I just wish the ID could be indexed so that every time I
    > access the record it goes through memory and not through I/O...
    >
    >
    > #!/usr/bin/perl
    >
    > my ($listfile, $format, $accfile, $infofile) = @ARGV;
    > print '($listfile, $accfile, $infofile)'; <STDIN>;
    >
    > print "Working on $listfile...\n";
    > $outname = $listfile . "_" . $infofile . ".xls";
    >
    > open (OFP, ">$outname");
    >
    > open(FP, $listfile);
    > $line = <FP>;
    > chomp $line;
    >
    > if ($format ne "") {
    > @fields = split(/\t/, $line);
    > for ($i=0; $i<@fields; $i++) {
    > ############## check fields ###############################
    > if ( $fields[$i] =~ /accession/) {
    > $acci = $i;
    > }
    > }
    > }
    >
    > print OFP "$line\tgene info\n";
    >
    > $nl = '\n';
    >
    > while (<FP>) {
    > $line = $_;
    > if ($line eq "\n") {
    > print OFP $line;
    > next;
    > }
    > chomp $line;
    >
    > if ($format eq "") {
    > @cells = split (/:/, $line);
    > $tag = $cells[0];
    > } else {
    > @cells = split (/\t/, $line);
    > $tag = $cells[$acci];
    > }
    >
    > open(AFP, $accfile);
    >
    > while (<AFP>) {
    > @cells = split (/\t/, $_);
    > if ($cells[5] =~ /$tag/) {
    > $des = $cells[1];
    > last;
    > }
    > }
    > close AFP;
    >
    > if ($found == 0) {
    > print OFP "$line\tNo gene info available\n";
    > }
    > }
    > close FP;
    >
    >


    Hi ela,

    without going too deeply into your code, let's just say that you should
    always start your perl scripts with

    #!/usr/bin/perl
    use warnings;
    use strict;

    and if you can't make it run with these restrictions there is something
    seriously flaky about the approach you are pursuing.

    Apart from the perl aspect, there are some serious information issues
    you need to address.

    From what I can gather of your description, you are reading in a file
    that contains some kind of gene information, and you want to index that
    information so that retrieval is much faster than iterating
    SEQUENTIALLY over the whole file (or series of files) every time you
    need an answer.

    Is my assumption thus far right ?


    But to assess that, some real-life info on what you are
    actually trying to do is needed :p
    How big is/are the files - that is .. how big will that index be?

    What is the actual index gonna be .. etc.

    Only after that part becomes clear is a solution possible. And you need
    to communicate that.


    cheers, wolf
    wolf, Aug 10, 2010
    #3
  4. "ela" <> wrote:
    >
    >
    >I'm new to database programming and only recently learnt to use loops to
    >look up and enrich information using the code below. However, when the
    >tables are large, I find this process is very slow. Then somebody told me I
    >can build a database for one of the files in real time, so there is no need
    >to read the file from beginning to end again and again.


    What I gathered from your code without going into details is that for
    each line in FP you are opening, reading through, and closing AFP.

    I/O operations are by far the slowest operations and there is a trivial
    solution that will probably speed up your program dramatically: instead
    of reading AFP again and again and again just read it into an array once
    at the beginning of your program and then loop over that array instead
    of over the file.

    Only if AFP is too large for that (several GB) may you need to
    look for a better algorithmic solution. That requires knowledge and
    experience, and a database may or may not help, depending upon what
    you are actually trying to achieve.
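
    To illustrate, here is only a minimal sketch of that read-once
    approach; the file names are made up, and the column positions
    simply follow the code quoted above:

    use strict;
    use warnings;

    # Read the lookup file into memory ONCE...
    open my $afp, '<', 'accfile.txt' or die "Can't open accfile.txt: $!\n";
    my @acc_lines = <$afp>;    # one disk pass instead of one per input line
    close $afp;

    # ...then scan the in-memory copy for every line of the main file.
    open my $fp, '<', 'listfile.txt' or die "Can't open listfile.txt: $!\n";
    while ( my $line = <$fp> ) {
        chomp $line;
        my $tag = ( split /:/, $line )[0];
        for my $acc (@acc_lines) {          # loop over the array, not the file
            my @cells = split /\t/, $acc;
            next unless defined $cells[5];  # guard against short lines
            if ( $cells[5] =~ /\Q$tag/ ) {
                print "$line\t$cells[1]\n";
                last;
            }
        }
    }
    close $fp;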

    jue
    Jürgen Exner, Aug 10, 2010
    #4
  5. ccc31807

    ccc31807 Guest

    On Aug 10, 3:39 am, "ela" <> wrote:
    > I'm new to database programming and only recently learnt to use loops to
    > look up and enrich information using the code below. However, when the
    > tables are large, I find this process is very slow. Then somebody told me I
    > can build a database for one of the files in real time, so there is no need
    > to read the file from beginning to end again and again. However, Perl DBI
    > has a lot of sophisticated functions, and in fact my tables are merely
    > large, nothing special, linked by an ID. Is there any simple way to achieve
    > the same purpose? I just wish the ID could be indexed so that every time I
    > access the record it goes through memory and not through I/O...


    You have input, which you want to process and turn into output.

    Your input consists of data contained in some kind of file. This is
    exactly the kind of task that Perl excels at.

    You have two choices: (1) you can use a database to store and query
    your data, or (2) you can use your computer's memory to store and
    query your data.

    If you have a large amount of permanent data that you need to add to,
    delete from, and change, your best strategy is to use a database. Read
    your data file into your database. Most databases have external
    commands (i.e., not SQL) for doing that, so it should be
    straightforward and easy -- note that you do not use Perl for this,
    and probably shouldn't.

    If you have a small to moderate amount of data, whether permanent or
    temporary, that you don't need to add to, delete from, or modify, your
    best strategy is to use your computer's memory to store and query your
    data. Simply open the file, read each line, destructure each line into
    a key and value, and stuff it into a hash.

    For example, suppose your data looks like this:
    12345,George,Washington,First
    23456,John,Adams,Second
    34567,Thomas,Jefferson,Third
    45678,James,Madison,Fourth

    You can do this:
    my %pres;
    open PRES, '<', 'data.csv' or die "$!";
    while (<PRES>)
    {
        chomp;
        my ($id, $first, $last, $place) = split /,/;
        $pres{$place} = "$id, $first, $last";
    }
    close PRES;

    If you need a multilevel data structure, see the documentation,
    starting maybe with lists of lists; a short sketch follows.
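
    For instance, a hash of array references keeps all the fields per
    key; this is just a sketch reusing the sample data above:

    # Sketch: store all fields per entry as an array reference.
    my %pres;
    open my $in, '<', 'data.csv' or die "$!";
    while (my $row = <$in>) {
        chomp $row;
        my ($id, $first, $last, $place) = split /,/, $row;
        $pres{$place} = [ $id, $first, $last ];  # reference, not a joined string
    }
    close $in;

    # Dereference later to get the individual fields back.
    my ($id, $first, $last) = @{ $pres{'First'} };
    print "$first $last\n";    # George Washington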

    CC.
    ccc31807, Aug 10, 2010
    #5
  6. Tad McClellan <> wrote:
    > Jens Thoms Toerring <> wrote:
    > > ela <> wrote:


    > >> print '($listfile, $accfile, $infofile)'; <STDIN>;

    > >
    > > What's that at the end of the line good for?


    > Pausing the program until something is typed on STDIN.


    Oh, I see. I was a bit confused about why one would wait for input
    in that situation when one is complaining that the program
    is taking so long;-)
    Regards, Jens
    --
    \ Jens Thoms Toerring ___
    \__________________________ http://toerring.de
    Jens Thoms Toerring, Aug 10, 2010
    #6
  7. ela wrote:
    > I'm new to database programming and only recently learnt to use loops to
    > look up and enrich information using the code below. However, when the
    > tables are large,


    How large?

    > I find this process is very slow. Then somebody told me I can build a
    > database for one of the files in real time, so there is no need to read
    > the file from beginning to end again and again.


    Not sure what you mean by "real time" here.

    > However, Perl DBI has a lot of sophisticated functions, and in fact my
    > tables are merely large, nothing special, linked by an ID.


    Data is data. It doesn't need to be "something special" in order to be
    put into a database. Databases themselves are nothing special, just
    specialized tools for doing a specialized job.

    > Is there any simple way to achieve the same purpose? I just wish the ID
    > could be indexed so that every time I access the record it goes through
    > memory and not through I/O...


    You can read the data into a hash, depending on just how large it is,
    and exactly how it needs to be matched.

    > open (OFP, ">$outname");
    >
    > open(FP, $listfile);


    You should check that your open commands succeed.

    >
    > print OFP "$line\tgene info\n";
    >
    > $nl = '\n';


    This is never used, and I don't see what one would use it for.

    >
    > while (<FP>) {

    ....

    >
    > open(AFP, $accfile);


    Again, you should check that the open succeeds.

    >
    > while (<AFP>) {
    > @cells = split (/\t/, $_);
    > if ($cells[5] =~ /$tag/) {
    > $des = $cells[1];
    > last;
    > }
    > }
    > close AFP;


    This would actually be quite hard to optimize if the match really needs
    to be as written, $cells[5] =~ /$tag/. Are you sure it wouldn't still
    be correct (or even be more correct) to test $cells[5] eq $tag, or at
    least $cells[5] =~ /^\Q$tag/ ?
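
    To make the difference concrete, here is a standalone sketch with
    made-up accession strings:

    # Made-up values: see how the three tests differ.
    my $tag = 'NM_1234';
    for my $c ('NM_1234', 'NM_12345', 'XNM_1234') {
        my $substr = $c =~ /$tag/    ? 'yes' : 'no';   # substring anywhere
        my $prefix = $c =~ /^\Q$tag/ ? 'yes' : 'no';   # literal prefix
        my $exact  = $c eq $tag      ? 'yes' : 'no';   # exact equality
        print "$c: substring=$substr prefix=$prefix exact=$exact\n";
    }
    # NM_1234:  substring=yes prefix=yes exact=yes
    # NM_12345: substring=yes prefix=yes exact=no
    # XNM_1234: substring=yes prefix=no  exact=no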



    >
    > if ($found == 0) {
    > print OFP "$line\tNo gene info available\n";
    > }
    > }


    In your code, $found never gets set to anything, or changed.

    Xho
    Xho Jingleheimerschmidt, Aug 11, 2010
    #7
  8. ela

    ela Guest

    After testing the different approaches, Jens Thoms Toerring's works best,
    so I modified the code accordingly. Now I just don't know why the array
    content cannot be retrieved: only a number "1" is returned. Can anyone
    tell me the reason? In fact I could simply store $line instead of @cells,
    but what I finally want is to print out only several of the cells instead
    of all of them.


    my %ahash;
    while ( my $line = <$afp> ) {
        my @cells = split /\t/, $line;
        $ahash{ $cells[ 5 ] } = $cells[ 1 ];
    }
    close $afp;

    open my $ifp, '<', $infofile or die "Can't open $infofile for reading\n";

    my %ihash;
    while ( my $line = <$ifp> ) {
        my @cells = split /\t/, $line;
        $ihash{ $cells[ 1 ] } = @cells;
    }
    close $ifp;

    while ( my $line = <$fp> ) {
        if ( $line eq "\n" ) {
            print $ofp "\n";
            next;
        }
        chomp $line;

        if ( $format eq "" ) {
            @cells = split /:/, $line;
            $tag = $cells[ 0 ];
        } else {
            @cells = split /\t/, $line;
            $tag = $cells[ $acci ];
        }

        $gid = $ahash{ $tag } if exists $ahash{ $tag };
        @gene_info = $ihash{$gid};
        print $ofp "$line\t@gene_info";
    }

    close $fp;
    ela, Aug 11, 2010
    #8
  9. sln

    Guest

    On Wed, 11 Aug 2010 10:51:40 +0800, "ela" <> wrote:

    >After testing the different approaches, Jens Thoms Toerring's works best,
    >so I modified the code accordingly. Now I just don't know why the array
    >content cannot be retrieved: only a number "1" is returned. Can anyone
    >tell me the reason? In fact I could simply store $line instead of @cells,
    >but what I finally want is to print out only several of the cells instead
    >of all of them.
    >
    >
    >my %ahash;
    >while ( my $line = <$afp> ) {
    > my @cells = split /\t/, $line;
    > $ahash{ $cells[ 5 ] } = $cells[ 1 ];
    >}
    >close $afp;
    >
    >open my $ifp, '<', $infofile or die "Can't open $infofile for reading\n";
    >
    >my %ihash;
    >while ( my $line = <$ifp> ) {
    > my @cells = split /\t/, $line;
    > $ihash{ $cells[ 1 ] } = @cells;
    >}
    >close $ifp;
    >
    >while ( my $line = <$fp> ) {
    > if ( $line eq "\n" ) {
    > print $ofp "\n";
    > next;
    > }
    > chomp $line;
    >
    > if ( $format eq "" ) {
    > @cells = split /:/, $line;
    > $tag = $cells[ 0 ];
    > } else {
    > @cells = split /\t/, $line;
    > $tag = $cells[ $acci ];
    > }
    >
    > $gid = $ahash{ $tag } if exists $ahash{ $tag };
    > @gene_info = $ihash{$gid};
    > print $ofp "$line\t@gene_info";
    >}
    >
    >close $fp;
    >


    I'm puzzled why you should tackle this in Perl when
    I'm guessing this would be a hard SQL statement for you
    to do.

    Realizing it's a simple SQL join of 3 tables on a key field,
    then trying to do it in Perl, etc ..

    You're looking for speed, but you haven't normalized the task.
    You make the big mistake of gathering everything into memory,
    thereby hogging memory with useless information, then
    compounding that error with one-time use. Although I'm not
    sure about the one-time use, unless it's interactive, but
    I didn't look too hard for that in the code.

    It doesn't appear you have multiple lines per key of gene
    data; however, that data could be massive. There is no need
    to keep all the data in memory. You could, in effect, keep a
    key => file position hash via tell(), then retrieve the data
    later with a seek (see the sketch below).
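
    A minimal sketch of that tell()/seek() index; the file name,
    column position, and lookup key are made up:

    use strict;
    use warnings;

    # Index the big info file by byte offset; only offsets live in memory.
    open my $ifp, '<', 'infofile.txt' or die "Can't open infofile.txt: $!\n";

    my %offset;
    while (1) {
        my $pos  = tell $ifp;            # offset of the line about to be read
        my $line = <$ifp>;
        last unless defined $line;
        my @cells = split /\t/, $line;
        next unless defined $cells[1];   # guard against short lines
        $offset{ $cells[1] } = $pos;     # key => where its record starts
    }

    # Later, fetch a single record directly from disk:
    my $gid = 'some_gene_id';            # made-up key
    if ( exists $offset{$gid} ) {
        seek $ifp, $offset{$gid}, 0;     # jump straight to the record
        my $record = <$ifp>;
        print $record;
    }
    close $ifp;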

    Applying a pseudo-analysis to your content-less code, it is
    storing data beyond its use. It's like formal symbolic
    logic: write the equation, then solve it;
    it's called reverse-engineering.

    This is the bottom line equation of your work:

    ------------------
    @Gene-Info Array = @{ I-Hash{ A-Hash{ fp0 } } } if A-Hash{ fp0 } exists
    ------------------

    From inner to outer: when constructing the A-Hash, there is no
    need to add a key to the I-Hash if it does not exist in the A-Hash.
    If you had written the SQL for this you would have picked this up.
    And since the I-Hash contains all the mega gene data, you just
    ruptured your memory's brain. (A sketch of that storage follows.)
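
    A sketch of the reference-based, filtered storage that equation
    implies; %ahash, the file name, and the lookup key are stand-ins
    matching ela's code above:

    use strict;
    use warnings;

    my %ahash = ( 'ACC1' => 'G1' );    # stand-in for the real tag => gid hash
    open my $ifp, '<', 'infofile.txt' or die $!;   # made-up file name

    my %wanted = reverse %ahash;       # gene IDs that can actually be looked up

    my %ihash;
    while ( my $line = <$ifp> ) {
        chomp $line;
        my @cells = split /\t/, $line;
        next unless defined $cells[1] and exists $wanted{ $cells[1] };
        # Store an array REFERENCE; assigning @cells itself to a hash slot
        # stores only its element count (hence ela's mysterious number).
        $ihash{ $cells[1] } = [ @cells ];
    }
    close $ifp;

    my $gid = 'G1';                    # stand-in lookup key
    my @gene_info;
    @gene_info = @{ $ihash{$gid} } if exists $ihash{$gid};
    print "@gene_info\n";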

    Start over, write pseudo-code, re-check your work via logic analysis
    from the inner to outer context. This will save you countless hours
    of headache.

    -sln
    sln, Aug 11, 2010
    #9
