Search for an expression in a text file

Discussion in 'C Programming' started by ritesh, Apr 24, 2007.

  1. ritesh

    ritesh Guest

    Hi,

    I'm working on a piece of code that -
    1. has a list of text files
    2. and needs to search for a particular expression in these files

    (this is being done on Linux using gcc 3.4.2)

    Currently the search is done using the 'grep' utility on linux. This
    takes too much time, and kills the responsiveness of the application.

    Is there any function in C or maybe any other way to get this job done
    faster?

    Please note that I've provided the operating system and compiler info
    just for the sake of it. Some people on this group don't like this and
    ask to re-post on an appropriate group for Linux. I've still done this
    because I never receive a good answer on any other group.

    Thanks,
    Ritesh
     
    ritesh, Apr 24, 2007
    #1

  2. [You should read this with a neutral tone. There's
    more to becoming a good programmer than
    writing code!]

    ritesh <> wrote:
    > Hi,
    >
    > I'm working on a piece of code that -
    > 1. has a list of text files
    > 2. and needs to search for a particular expression in these
    > files


    Do you have a question about the C language?

    > (this is being done on Linux using gcc 3.4.2)
    >
    > Currently the search is done using the 'grep' utility on
    > linux. This takes too much time, and kills the
    > responsiveness of the application.
    >
    > Is there any function in C or maybe any other way to get
    > this job done faster?


    Most grep implementations are based on publicly available
    regex libraries. There is no standard function beyond strstr(),
    which is not likely to be as optimised as tailored libraries.
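
    For a plain (non-regex) needle there isn't much beyond fgets() and
    strstr() anyway. A minimal sketch (the function name and the 4K line
    buffer are arbitrary, and a match spanning a buffer split would be
    missed):

        #include <stdio.h>
        #include <string.h>

        /* Return 1 if 'needle' occurs anywhere in the file, 0 if not,
           -1 if the file could not be opened. */
        int file_contains(const char *path, const char *needle)
        {
            char line[4096];
            FILE *fp = fopen(path, "r");

            if (fp == NULL)
                return -1;
            while (fgets(line, sizeof line, fp) != NULL) {
                if (strstr(line, needle) != NULL) {
                    fclose(fp);
                    return 1;
                }
            }
            fclose(fp);
            return 0;
        }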

    You may even try tools like [f]lex to generate lexical analysers.

    > Please note that I've provided the operating system and
    > compiler info, just for the sake of it.


    Hint: If these are at all relevant, there's a very good chance
    your post is off-topic in clc.

    > Some people on this
    > group don't like this and ask to re-post on an appropriate
    > group for linux.


    Because other groups can answer platform-specific questions
    much better, and clc isn't particularly interested in becoming
    yet another high noise-to-signal dumping ground for anything
    and everything where main is a valid identifier.

    > I've still done this
    > because I never receive a good answer on any other group.


    That doesn't mean that clc has to rectify your problem.

    Have you tried simple web/code searches? Looking for C
    source on grep or find/replace should give you ample
    sources.

    --
    Peter
     
    Peter Nilsson, Apr 24, 2007
    #2

  3. ritesh

    ritesh Guest

    Two points I missed out -

    1. The text files - are of random form - they don't contain records or
    any ordered sequence of characters.

    2. The list of text files may go up to 10K files. So I'm assuming that
    opening each file using C file I/O is not a good way to handle
    this.
     
    ritesh, Apr 24, 2007
    #3
  4. ritesh

    Jonas Guest

    "ritesh" <> wrote in message
    news:...
    > Two points I missed out -
    >
    > 1. The text files - are of random form - they don't contain records or
    > any ordered sequence of characters.
    >
    > 2. The list of text files may go up to 10K files. So I'm assuming that
    > opening each file using C file I/O is not a good way to handle
    > this.
    >


    <OT>
    The efficient way to do this would be to index your text files. Search for
    "inverted index" on the web for more info. More generally, your question is
    in the field of "information retrieval" for which there is also a (not very
    active) newsgroup: comp.theory.info-retrieval.
    </OT>

    --
    Jonas
     
    Jonas, Apr 24, 2007
    #4
  5. ritesh said:

    > Hi,
    >
    > I'm working on a piece of code that -
    > 1. has a list of text files
    > 2. and needs to search for a particular expression in these files
    >
    > (this is being done on Linux using gcc 3.4.2)
    >
    > Currently the search is done using the 'grep' utility on linux. This
    > takes too much time, and kills the responsiveness of the application.
    >
    > Is there any function in C or maybe any other way to get this job done
    > faster?


    No specific function in C, no, but there is certainly a way. Several
    ways, in fact.

    One is to slurp grep into your program, to avoid the overhead of
    creating a fresh process, complete with environment etc. It should not
    be difficult to find the source for grep, so you can turn grep into
    grep().

    Another possibility is to code the search yourself, using (say)
    Knuth-Morris-Pratt or Boyer-Moore, and taking advantage of any data
    knowledge you have which might be able to speed up the search.
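
    For a fixed string, a Boyer-Moore-Horspool searcher fits in a few
    lines. A rough sketch (not tuned, and it assumes the pattern is at
    least one character long):

        #include <limits.h>
        #include <stddef.h>

        /* Boyer-Moore-Horspool: return a pointer to the first match of
           pat[0..m-1] in text[0..n-1], or NULL if there is none. */
        const char *bmh_search(const char *text, size_t n,
                               const char *pat, size_t m)
        {
            size_t skip[UCHAR_MAX + 1];
            size_t i, k;

            for (i = 0; i <= UCHAR_MAX; i++)
                skip[i] = m;
            for (i = 0; i < m - 1; i++)
                skip[(unsigned char)pat[i]] = m - 1 - i;

            for (k = m - 1; k < n; ) {
                size_t j = m - 1, t = k;
                while (pat[j] == text[t]) {
                    if (j == 0)
                        return text + t;
                    j--;
                    t--;
                }
                k += skip[(unsigned char)text[k]];
            }
            return NULL;
        }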

    And of course you can pre-calculate, as someone else suggested - build
    an index that gives you key starting-points.

    Other possibilities may well exist, and perhaps the others here will
    chip in a few ideas when the grumpiness wears off.

    --
    Richard Heathfield
    "Usenet is a strange place" - dmr 29/7/1999
    http://www.cpax.org.uk
    email: rjh at the above domain, - www.
     
    Richard Heathfield, Apr 24, 2007
    #5
  6. ritesh <> writes:
    > I'm working on a piece of code that -
    > 1. has a list of text files
    > 2. and needs to search for a particular expression in these files
    >
    > (this is being done on Linux using gcc 3.4.2)
    >
    > Currently the search is done using the 'grep' utility on linux. This
    > takes too much time, and kills the responsiveness of the application.
    >
    > Is there any function in C or maybe any other way to get this job done
    > faster?

    [...]

    The grep command happens to be implemented in C. What makes you think
    that another program written in C to do the same job will be any
    faster?

    And you haven't defined what you mean by an "expression".

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
     
    Keith Thompson, Apr 24, 2007
    #6
  7. Keith Thompson said:

    > ritesh <> writes:
    >>
    >> Currently the search is done using the 'grep' utility on linux. This
    >> takes too much time, and kills the responsiveness of the application.
    >>
    >> Is there any function in C or maybe any other way to get this job
    >> done faster?

    > [...]
    >
    > The grep command happens to be implemented in C. What makes you think
    > that another program written in C to do the same job will be any
    > faster?


    The two obvious reasons are: (1) losing the process-creation overhead,
    and (2) possible hacks based on special data knowledge.

    > And you haven't defined what you mean by an "expression".


    He doesn't have to. C defines it for him. :)

    --
    Richard Heathfield
    "Usenet is a strange place" - dmr 29/7/1999
    http://www.cpax.org.uk
    email: rjh at the above domain, - www.
     
    Richard Heathfield, Apr 24, 2007
    #7
  8. Richard Heathfield <> writes:
    > Keith Thompson said:
    >
    >> ritesh <> writes:
    >>>
    >>> Currently the search is done using the 'grep' utility on linux. This
    >>> takes too much time, and kills the responsiveness of the application.
    >>>
    >>> Is there any function in C or maybe any other way to get this job
    >>> done faster?

    >> [...]
    >>
    >> The grep command happens to be implemented in C. What makes you think
    >> that another program written in C to do the same job will be any
    >> faster?

    >
    > The two obvious reasons are: (1) losing the process-creation overhead,
    > and (2) possible hacks based on special data knowledge.


    The first can be largely eliminated. (He could have found out about
    xargs if he'd posted to a Unix newsgroup.)

    As for special data knowledge, I suppose that's possible, but we
    can't help if he doesn't share that knowledge.

    [...]

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
     
    Keith Thompson, Apr 24, 2007
    #8
  9. ritesh

    ritesh Guest

    On Apr 24, 3:40 pm, Keith Thompson <> wrote:
    > ritesh <> writes:
    > > I'm working on a piece of code that -
    > > 1. has a list of text files
    > > 2. and needs to search for a particular expression in these files

    >
    > > (this is being done on Linux using gcc 3.4.2)

    >
    > > Currently the search is done using the 'grep' utility on linux. This
    > > takes too much time, and kills the responsiveness of the application.

    >
    > > Is there any function in C or maybe any other way to get this job done
    > > faster?

    >
    > [...]
    >
    > The grep command happens to be implemented in C. What makes you think
    > that another program written in C to do the same job will be any
    > faster?
    >
    > And you haven't defined what you mean by an "expression".
    >


    By 'expression' I mean a regular expression used by the grep command.
     
    ritesh, Apr 24, 2007
    #9
  10. ritesh

    ritesh Guest

    On Apr 24, 4:02 pm, Keith Thompson <> wrote:
    > Richard Heathfield <> writes:
    > > Keith Thompson said:

    >
    > >> ritesh <> writes:

    >
    > >>> Currently the search is done using the 'grep' utility on linux. This
    > >>> takes too much time, and kills the responsiveness of the application.

    >
    > >>> Is there any function in C or maybe any other way to get this job
    > >>> done faster?
    > >> [...]

    >
    > >> The grep command happens to be implemented in C. What makes you think
    > >> that another program written in C to do the same job will be any
    > >> faster?

    >
    > > The two obvious reasons are: (1) losing the process-creation overhead,
    > > and (2) possible hacks based on special data knowledge.

    >
    > The first can be largely eliminated. (He could have found out about
    > xargs if he'd posted to a Unix newsgroup.)
    >
    > As for special data knowledge, I suppose that's possible, but we
    > can't help if he doesn't share that knowledge.
    >
    > [...]
    >


    The text files are similar to files containing C code. I just didn't
    mention this. How does having this knowledge about the data in the
    files help?

    I also looked at some of the code available for 'grep' on the net;
    that's a whole lot of code - I don't think I can understand it all and
    make it work faster in the way I want to :(
     
    ritesh, Apr 24, 2007
    #10
  11. ritesh

    B. Augestad Guest

    ritesh wrote:
    > On Apr 24, 4:02 pm, Keith Thompson <> wrote:
    >
    >>Richard Heathfield <> writes:
    >>
    >>>Keith Thompson said:

    >>
    >>>>ritesh <> writes:

    >>
    >>>>>Currently the search is done using the 'grep' utility on linux. This
    >>>>>takes too much time, and kills the responsiveness of the application.

    >>
    >>>>>Is there any function in C or maybe any other way to get this job
    >>>>>done faster?
    >>>>
    >>>>[...]

    >>
    >>>>The grep command happens to be implemented in C. What makes you think
    >>>>that another program written in C to do the same job will be any
    >>>>faster?

    >>
    >>>The two obvious reasons are: (1) losing the process-creation overhead,
    >>>and (2) possible hacks based on special data knowledge.

    >>
    >>The first can be largely eliminated. (He could have found out about
    >>xargs if he'd posted to a Unix newsgroup.)
    >>
    >>As for special data knowledge, I suppose that's possible, but we
    >>can't help if he doesn't share that knowledge.
    >>
    >>[...]
    >>

    >
    >
    > The text files are similar to files containing C code. I just didn't
    > mention this. How does having this knowledge about the data in the
    > files help?
    >
    > I also looked at some of the code available for 'grep' on the net;
    > that's a whole lot of code - I don't think I can understand it all and
    > make it work faster in the way I want to :(
    >



    <OT>
    Ask in comp.unix.programmer as the solution to your problem will be way
    off topic in comp.lang.c. Some keywords/functions you should look at,
    are regcomp() and pthread_create(). Your system probably has man pages
    for these functions.

    </OT>
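
    <OT>
    A minimal sketch of the regcomp()/regexec() part (the pattern and the
    filtering in main() are invented for illustration). The win over
    spawning grep per file is that the expression is compiled once and
    then tested against every line in-process:

        #include <regex.h>
        #include <stdio.h>

        /* Returns 1 if the line matches the compiled expression, else 0. */
        int line_matches(const regex_t *re, const char *line)
        {
            return regexec(re, line, 0, NULL, 0) == 0;
        }

        int main(void)
        {
            regex_t re;
            char line[4096];

            if (regcomp(&re, "for|while", REG_EXTENDED | REG_NOSUB) != 0)
                return 1;

            while (fgets(line, sizeof line, stdin) != NULL)
                if (line_matches(&re, line))
                    fputs(line, stdout);

            regfree(&re);
            return 0;
        }
    </OT>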

    HTH
    Bjørn
    --
    Looking for an embeddable web server?
    http://www.metasystems.no/products/highlander/index.html
     
    B. Augestad, Apr 24, 2007
    #11
  12. ritesh

    ritesh Guest

    On Apr 24, 12:37 pm, "Jonas" <> wrote:
    > "ritesh" <> wrote in message
    >
    > news:...
    >
    > > Two Point I missed out -

    >
    > > 1. The text files - are of random form - they don't contain records or
    > > any ordered sequence of characters.

    >
    > > 2. The list of text files may go upto 10K files. So I'm assuming that
    > > opening each file using the C File I/O is not a good way to handle
    > > this.

    >
    > <OT>
    > The efficient way to do this would be to index your text files. Search for
    > "inverted index" on the web for more info. More generally, your question is
    > in the field of "information retrieval" for which there is also a (not very
    > active) newsgroup: comp.theory.info-retrieval.
    > </OT>
    >
    > --
    > Jonas



    I looked up inverted index - looking at this small example -

    "i love you," "god is love," "love is blind," and "blind justice"

    the text-index generated for these 4 lines is -

    blind (3,8);(4,0)
    god (2,0)
    i (1,0)
    is (2,4);(3,5)
    justice (4,6)
    love (1,2);(2,7);(3,0)
    you (1,7)


    It might help if I could do the same with my text files - but how are
    these indices generated? Do you have a pointer to a good web page that
    shows this?

    thanks
     
    ritesh, Apr 24, 2007
    #12
  13. ritesh said:

    > On Apr 24, 4:02 pm, Keith Thompson <> wrote:

    <snip>
    >>
    >> As for special data knowledge, I suppose that's possible, but we
    >> can't help if he doesn't share that knowledge.

    >
    > The text files are similar to files containing C code. I just didn't
    > mention this. How does having this knowledge about the data in the
    > files help?


    Here's a very simple example. Let's pretend it really is C code that
    you're searching. (Yes, I know, you suggested it's only "similar", but
    I need a concrete example here, so I've picked C source as that
    example.)

    And let's say you know you're looking for a function name in "live"
    code. So if you find a logical line whose first non-whitespace
    character is a #, you know you can ignore the rest of the line. If you
    find a /* pair, you can look specifically for a */ pair without
    worrying what comes between. If you're really up to speed, you might
    even be able to prune out "dead" code between #if/#endif and the like.
    If you find a quote character that isn't a character constant, you can
    zoom through to the next unescaped quote character.

    And so on and so forth.
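
    To make that concrete, here is one deliberately naive sketch (it
    ignores // comments, string literals, character constants and line
    splices) of a search that throws away preprocessor lines and comment
    bodies before looking for the needle:

        #include <ctype.h>
        #include <stdio.h>
        #include <string.h>

        /* Search 'needle' in C-like text, skipping preprocessor lines
           and the insides of comments. Returns 1 if anything matched. */
        int search_pruned(FILE *fp, const char *needle)
        {
            char line[4096];
            int in_comment = 0;
            long lineno = 0;
            int found = 0;

            while (fgets(line, sizeof line, fp) != NULL) {
                char live[4096];
                size_t out = 0;
                char *p = line;

                lineno++;

                if (!in_comment) {
                    while (isspace((unsigned char)*p))
                        p++;
                    if (*p == '#')
                        continue;          /* preprocessor line: ignore */
                }

                for (; *p != '\0'; p++) {
                    if (in_comment) {
                        if (p[0] == '*' && p[1] == '/') {
                            in_comment = 0;
                            p++;
                        }
                    } else if (p[0] == '/' && p[1] == '*') {
                        in_comment = 1;
                        p++;
                    } else {
                        live[out++] = *p;  /* keep only "live" characters */
                    }
                }
                live[out] = '\0';

                if (strstr(live, needle) != NULL) {
                    printf("match at line %ld\n", lineno);
                    found = 1;
                }
            }
            return found;
        }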

    Let's take another simple example that isn't C code. Say you have a file
    consisting of various kinds of records, with the record type being
    noted at the start of the line. If the data you're looking for is
    guaranteed only to appear within a particular kind of record, you can
    simply check the label and only bother to search the record itself if
    it's the right record type.

    Many of us tend to focus on writing "generic" software that can handle
    any kind of data, and that's all well and good ("write once, use many",
    as it were) but, when speed is of the essence, specific knowledge of
    the data format can lead to significant speed improvements over generic
    algorithms.

    --
    Richard Heathfield
    "Usenet is a strange place" - dmr 29/7/1999
    http://www.cpax.org.uk
    email: rjh at the above domain, - www.
     
    Richard Heathfield, Apr 24, 2007
    #13
  14. ritesh

    Al Balmer Guest

    On 24 Apr 2007 05:27:58 -0700, ritesh <> wrote:

    >> As for special data knowledge, I suppose that's possible, but we
    >> can't help if he doesn't share that knowledge.
    >>
    >> [...]
    >>

    >
    >The text files are similar to files containing C code. I just didn't
    >mention this. How does having this knowledge about the data in the
    >files help?


    In addition to what Richard wrote, you may be able to take advantage
    of knowledge about the data you're searching *for*, as well. For
    example, in some cases a simple state machine can provide very fast
    searches for particular strings.
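
    For instance, Knuth-Morris-Pratt can be read as exactly such a state
    machine: the prefix table is the machine's fallback transition. A
    sketch (the 256-byte cap on the pattern is only there to keep it
    simple):

        #include <stddef.h>
        #include <string.h>

        /* KMP search: return the index of the first match of pat in
           text, or -1 if there is none or the pattern is unusable. */
        long kmp_search(const char *text, const char *pat)
        {
            size_t fail[256];              /* fallback states */
            size_t n = strlen(text), m = strlen(pat);
            size_t i, k;

            if (m == 0 || m > 256)
                return -1;

            fail[0] = 0;
            for (i = 1, k = 0; i < m; i++) {
                while (k > 0 && pat[i] != pat[k])
                    k = fail[k - 1];
                if (pat[i] == pat[k])
                    k++;
                fail[i] = k;
            }

            for (i = 0, k = 0; i < n; i++) {
                while (k > 0 && text[i] != pat[k])
                    k = fail[k - 1];
                if (text[i] == pat[k])
                    k++;
                if (k == m)
                    return (long)(i - m + 1);
            }
            return -1;
        }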

    --
    Al Balmer
    Sun City, AZ
     
    Al Balmer, Apr 24, 2007
    #14
  15. ritesh

    Jonas Guest

    "ritesh" <> wrote in message
    news:...
    > On Apr 24, 12:37 pm, "Jonas" <> wrote:
    >> "ritesh" <> wrote in message
    >>
    >> news:...
    >>
    >> > Two Point I missed out -

    >>
    >> > 1. The text files - are of random form - they don't contain records or
    >> > any ordered sequence of characters.

    >>
    >> > 2. The list of text files may go upto 10K files. So I'm assuming that
    >> > opening each file using the C File I/O is not a good way to handle
    >> > this.

    >>
    >> <OT>
    >> The efficient way to do this would be to index your text files. Search
    >> for
    >> "inverted index" on the web for more info. More generally, your question
    >> is
    >> in the field of "information retrieval" for which there is also a (not
    >> very
    >> active) newsgroup: comp.theory.info-retrieval.
    >> </OT>
    >>
    >> --
    >> Jonas

    >
    >
    > I looked up inverted index - looking at this small example -
    >
    > "i love you," "god is love," "love is blind," and "blind justice"
    >
    > the text-index generated for these 4 lines is -
    >
    > blind (3,8);(4,0)
    > god (2,0)
    > i (1,0)
    > is (2,4);(3,5)
    > justice (4,6)
    > love (1,2);(2,7);(3,0)
    > you (1,7)
    >
    >
    > It might help if I could do the same with my text files - but how are
    > these indices generated? Do you have a pointer to a good web page that
    > shows this?


    Well, for full-blown search engine libraries there are e.g. CLucene
    (http://clucene.sourceforge.net/) or Estraier
    (http://hyperestraier.sourceforge.net/). I haven't used any of them, though.

    If your demands are smaller, you might consider implementing the indexer &
    search engine yourself. If the amount of text is small, the index could be
    created in RAM, which simplifies things. And if the text files change
    infrequently, you would not need any 'delete' functionality -- simply
    rebuild the index every once in a while.
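
    By way of illustration only, a toy in-RAM indexer might look like the
    sketch below - all the limits and names are invented, and the linear
    lookup() would be a hash table in anything real. A query for a word
    then just walks that word's posting list instead of re-reading 10K
    files:

        #include <ctype.h>
        #include <stdio.h>
        #include <string.h>

        #define MAX_WORDS    10000
        #define MAX_POSTINGS 64
        #define MAX_WORD_LEN 63

        struct posting { int file_id; long offset; };

        struct entry {
            char word[MAX_WORD_LEN + 1];
            struct posting post[MAX_POSTINGS];
            int nposts;
        };

        static struct entry index_tab[MAX_WORDS];
        static int nwords;

        /* Find or create the index entry for a word (linear lookup). */
        static struct entry *lookup(const char *w)
        {
            int i;
            for (i = 0; i < nwords; i++)
                if (strcmp(index_tab[i].word, w) == 0)
                    return &index_tab[i];
            if (nwords == MAX_WORDS)
                return NULL;
            strcpy(index_tab[nwords].word, w);
            return &index_tab[nwords++];
        }

        /* Add every word of one file to the index with its byte offset. */
        static void index_file(const char *path, int file_id)
        {
            FILE *fp = fopen(path, "r");
            char word[MAX_WORD_LEN + 1];
            int c, len = 0;
            long start = 0;

            if (fp == NULL)
                return;
            while ((c = fgetc(fp)) != EOF) {
                if (isalnum(c) || c == '_') {
                    if (len == 0)
                        start = ftell(fp) - 1;
                    if (len < MAX_WORD_LEN)
                        word[len++] = (char)c;
                } else if (len > 0) {
                    struct entry *e;
                    word[len] = '\0';
                    len = 0;
                    e = lookup(word);
                    if (e != NULL && e->nposts < MAX_POSTINGS) {
                        e->post[e->nposts].file_id = file_id;
                        e->post[e->nposts].offset = start;
                        e->nposts++;
                    }
                }
            }
            fclose(fp);
        }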

    Try a web search for "create inverted index" or similar, it seems to give a
    few useful hits.

    --
    Jonas
     
    Jonas, Apr 24, 2007
    #15
  16. ritesh

    CBFalconer Guest

    ritesh wrote:
    >

    .... snip ...
    >
    > I looked up inverted index - looking at this small example -
    >
    > "i love you," "god is love," "love is blind," and "blind justice"
    >
    > the text-index generated for these 4 lines is -
    >
    > blind (3,8);(4,0)
    > god (2,0)
    > i (1,0)
    > is (2,4);(3,5)
    > justice (4,6)
    > love (1,2);(2,7);(3,0)
    > you (1,7)
    >
    > It might help if I could do the same with my text files - but how
    > are these indices generated? Do you have a pointer to a good web
    > page that shows this?


    Well, you obviously need at least an array of shorts to define the
    files involved, together with an array of longs to identify the
    position in the file. This will obviously expand the storage for
    short words, such as 'I', 'is', etc. So you probably need a list
    of words to ignore. How to handle mistypings is another
    consideration.

    I would suggest a hash table, based on the word itself, leading to
    lists of arrays. Only the tables associated with one file will be
    in play at any one moment.

    See hashlib for one possible source for the hash table.

    <http://cbfalconer.home.att.net/download/>
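
    The hashing itself is tiny - e.g. a classic multiply-and-add over the
    word (this is just the idea; hashlib has its own interface):

        #include <stddef.h>

        /* djb2-style string hash; 'nbuckets' would be the size of the
           bucket array holding the per-word posting lists. */
        size_t word_hash(const char *s, size_t nbuckets)
        {
            size_t h = 5381;
            while (*s != '\0')
                h = h * 33 + (unsigned char)*s++;
            return h % nbuckets;
        }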

    --
    <http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
    <http://www.securityfocus.com/columnists/423>
    <http://www.aaxnet.com/editor/edit043.html>
    cbfalconer at maineline.net



    --
    Posted via a free Usenet account from http://www.teranews.com
     
    CBFalconer, Apr 24, 2007
    #16
