Finding the number of lines in a file from its gzip'ed format

Discussion in 'Perl Misc' started by sopan.shewale@gmail.com, Feb 29, 2008.

  1. Guest

    Hi,

    I am not sure if this is the right group to ask this question - I am
    sorry if this is not the right place.

    Problem: Let us say we have a file called "myfile.txt". The file is
    huge. It is gzipped - the gzipped filename is "myfile.txt.gz". I would
    like to find the number of lines in myfile.txt from myfile.txt.gz
    without gunzipping it.

    I know that if gunzipping is allowed, "gunzip -c myfile.txt.gz | wc -l"
    gives the number of lines.

    My problem is that gunzipping takes a long time because the file is
    very large.

    Is there any way to count the number of lines using a Perl script or
    any other method - just to figure out the number of "\n" chars hidden
    inside the file, using something from the gzip algorithm?

    I appreciate your time and effort in reading this problem, and thank
    you so much for it.

    Please help me with a solution or pointers to read (I am already
    reading http://www.gzip.org/algorithm.txt).


    --sopan
     
    , Feb 29, 2008
    #1

  2. Guest

    "" <> wrote:
    > Hi,
    >
    > I am not sure if this is the right group to ask this question - i am
    > sorry if this is not the right place.
    >
    > Problem: Let us say we have file called "myfile.txt". The size of the
    > file is huge. The file is gziped - the gziped filename is
    > "myfile.txt.gz". I am interested to find the number of lines of
    > myfile.txt from myfile.txt.gz without gunziping it.
    >
    > I know if it is allowed to gunzip then just use "gunzip -c
    > myfile.txt.gz | wc -l" this can give the number of lines.
    >
    > My problem is time taken to gunzip is huge file is very large.


    That is about as good as it is going to get.

    >
    > Is there any way to count the number of lines using Perl script/Any
    > other method - just to figure out number of "\n" chars hidden inside
    > the file-use something from the algorithm of gzip?


    You *might* be able to come up with a shortcut that is integrated in
    with the very guts of the Lempel-Ziv 77 algorithm that would allow you
    to count the "\n" without actually doing the unzip, but I'm skeptical
    that you could make it meaningfully faster than just gunzipping (or gzcat).
    And if you try to do so in Perl rather than C, then I'm rather confident
    that it would be a lot slower.

    Perhaps you should pre-compute and then cache the number of lines
    someplace.
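
    The pre-compute-and-cache idea can be sketched roughly as follows. This
    is only an illustration in Python (a Perl version would follow the same
    shape); the ".lines" sidecar file name and both function names are
    assumptions made up for this example, not anything the tools provide.
    The count itself still decompresses the whole stream once, it just
    avoids doing so again on later requests.

```python
import gzip
import os


def count_lines_gz(path, chunk_size=1 << 20):
    """Count newline bytes by streaming the decompressed data in chunks."""
    total = 0
    with gzip.open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


def cached_line_count(path):
    """Return the line count, reusing a sidecar cache file when it is fresh."""
    cache = path + ".lines"  # hypothetical sidecar name, chosen for this sketch
    # Trust the cache only if it is at least as new as the compressed file.
    if os.path.exists(cache) and os.path.getmtime(cache) >= os.path.getmtime(path):
        with open(cache) as fh:
            return int(fh.read().strip())
    n = count_lines_gz(path)
    with open(cache, "w") as fh:
        fh.write(str(n))
    return n
```

    Calling cached_line_count("myfile.txt.gz") pays the decompression cost
    the first time and reads the small sidecar file afterwards.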

    Xho

     
    , Feb 29, 2008
    #2

  3. Ted Zlatanov Guest

    On Fri, 29 Feb 2008 11:58:05 -0800 (PST) "" <> wrote:

    ssc> Hi,
    ssc> I am not sure if this is the right group to ask this question - i am
    ssc> sorry if this is not the right place.

    ssc> Problem: Let us say we have file called "myfile.txt". The size of the
    ssc> file is huge. The file is gziped - the gziped filename is
    ssc> "myfile.txt.gz". I am interested to find the number of lines of
    ssc> myfile.txt from myfile.txt.gz without gunziping it.

    ssc> I know if it is allowed to gunzip then just use "gunzip -c
    ssc> myfile.txt.gz | wc -l" this can give the number of lines.

    ssc> My problem is time taken to gunzip is huge file is very large.

    ssc> Is there any way to count the number of lines using Perl script/Any
    ssc> other method - just to figure out number of "\n" chars hidden inside
    ssc> the file-use something from the algorithm of gzip?

    ssc> Appreciate your time efforts to read the problem and thank you so much
    ssc> for investing time to read this problem.

    ssc> Please help me with solution or pointers to read (already reading
    ssc> http://www.gzip.org/algorithm.txt).

    If the only metadata you'll need is the number of lines, just rename to
    myfile.N.txt.gz where N is the number of lines. So, if there is no N
    you have to count (you can't avoid that cost, because newlines are just
    content), but if N is already calculated you're done. Obviously if you
    modify the file you recalculate N, but a compressed file is unlikely to
    be modified in place.

    It's not a long-term solution and it will only work for this one piece
    of data, but it's easy to implement.
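
    A minimal sketch of that naming convention, written in Python for
    illustration. The helper names and the exact pattern are assumptions
    for this example; note that a name such as "report.2008.txt.gz" would
    be misread as carrying a count, so the scheme only works where such
    names cannot occur.

```python
import gzip
import os
import re

# Matches names like myfile.12345.txt.gz and captures the embedded count.
NAME_RE = re.compile(r"^(?P<stem>.+)\.(?P<n>\d+)\.txt\.gz$")


def line_count_from_name(path):
    """Return N if the name looks like stem.N.txt.gz, else None."""
    m = NAME_RE.match(os.path.basename(path))
    return int(m.group("n")) if m else None


def count_and_rename(path):
    """Count lines once, then rename stem.txt.gz to stem.N.txt.gz."""
    known = line_count_from_name(path)
    if known is not None:
        return path, known  # already counted, nothing to decompress
    directory, base = os.path.split(path)
    if not base.endswith(".txt.gz"):
        raise ValueError("expected a name ending in .txt.gz")
    n = 0
    with gzip.open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            n += chunk.count(b"\n")
    stem = base[: -len(".txt.gz")]
    new_path = os.path.join(directory, f"{stem}.{n}.txt.gz")
    os.rename(path, new_path)
    return new_path, n
```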

    The question is, why do you need to count newlines? If you specifically
    need to show exact statistics about how many lines are in the file,
    you're stuck. But you can at least approximate from the file size and
    average bytes per line over the first 5000 lines.
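
    One way to read that approximation, sketched in Python. It assumes the
    "file size" meant here is the uncompressed size, and takes it from the
    ISIZE field in the last four bytes of the gzip file. That field holds
    the uncompressed length modulo 2**32 and describes only the last member,
    so this sketch is only meaningful for single-member files under 4 GiB;
    for larger files the size would have to come from somewhere else. The
    function names are made up for the example.

```python
import gzip
import os
import struct


def uncompressed_size_hint(path):
    """Read ISIZE: the last 4 bytes of a gzip member (length mod 2**32)."""
    with open(path, "rb") as fh:
        fh.seek(-4, os.SEEK_END)
        return struct.unpack("<I", fh.read(4))[0]


def estimate_line_count(path, sample_lines=5000):
    """Estimate the line count from the average length of the first lines."""
    sampled = 0
    sampled_bytes = 0
    with gzip.open(path, "rb") as fh:
        for line in fh:
            sampled += 1
            sampled_bytes += len(line)
            if sampled >= sample_lines:
                break
    if sampled < sample_lines:
        # The loop reached end of file, so this is an exact count.
        return sampled
    average = sampled_bytes / sampled
    return round(uncompressed_size_hint(path) / average)
```

    The estimate is only as good as the assumption that the first 5000
    lines are representative of the rest of the file.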

    Users really appreciate interactive applications. If instead of doing
    the wc -l and THEN displaying it, you maintain a running counter of the
    number of lines and update the screen with the new value periodically, I
    guarantee you that users won't mind it much.
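
    A running counter of this kind might look like the sketch below
    (Python, for illustration only). The function names, the chunk size,
    and the choice of stderr for the display are all assumptions of this
    example.

```python
import gzip
import sys


def running_line_count(path, chunk_size=1 << 20):
    """Yield the cumulative newline count after each decompressed chunk."""
    total = 0
    with gzip.open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            total += chunk.count(b"\n")
            yield total


def count_with_display(path, out=sys.stderr, chunk_size=1 << 20):
    """Count lines while overwriting a one-line status message."""
    total = 0
    for total in running_line_count(path, chunk_size):
        out.write(f"\r{total} lines so far")
        out.flush()
    out.write(f"\r{total} lines total\n")
    return total
```

    The total work is the same as "gunzip -c | wc -l"; the only difference
    is that the user sees the number grow while waiting.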

    You could even show a progress bar using the average bytes per line so
    far, or the much easier Zeno's paradox progress bar (every update adds
    50% of the remainder, so you do 50%, 75%, 87.5%, etc.).
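
    The Zeno's paradox bar needs no knowledge of the file at all, which is
    why it is the easier option. A small Python sketch, with function names
    and bar width invented for the example:

```python
import itertools


def zeno_progress():
    """Yield fractions where each step covers half of what remains."""
    done = 0.0
    while True:
        done += (1.0 - done) / 2
        yield done


def render_bar(fraction, width=40):
    """Draw a fixed-width text bar for a fraction between 0 and 1."""
    filled = int(fraction * width)
    return "[" + "#" * filled + "." * (width - filled) + "]"


def first_steps(count):
    """Return the first `count` fractions, e.g. for a quick look."""
    return list(itertools.islice(zeno_progress(), count))
```

    Each display update would take the next value from zeno_progress() and
    pass it to render_bar(); the bar never quite reaches 100% until the
    real work finishes and the caller draws the final state itself.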

    Ted
     
    Ted Zlatanov, Mar 3, 2008
    #3
