Search for an expression in a text file


ritesh

Hi,

I'm working on a piece of code that -
1. has a list of text files
2. and needs to search for a particular expression in these files

(this is being done on Linux using gcc 3.4.2)

Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.

Is there any function in C or maybe any other way to get this job done
faster?

Please note that I've provided the operating system and compiler info,
just for the sake of it. Some people on this group don't like this and
ask to re-post on an appropriate group for Linux. I've still done this
because I never receive a good answer on any other group.

Thanks,
Ritesh
 

Peter Nilsson

[You should read this with a neutral tone. There's
more to becoming a good programmer than
writing code!]

ritesh said:
Hi,

I'm working on a piece of code that -
1. has a list of text files
2. and needs to search for a particular expression in these
files

Do you have a question about the C language?
(this is being done on Linux using gcc 3.4.2)

Currently the search is done using the 'grep' utility on
linux. This takes too much time, and kills the
responsiveness of the application.

Is there any function in C or maybe any other way to get
this job done faster?

Most grep implementations are based on publicly available
regex libraries. There is no standard function beyond strstr(),
which is unlikely to be as optimised as tailored libraries.

You may even try tools like [f]lex to generate lexical analysers.
Please note that i've provided the operating system and
compiler info, just for the sake of it.

Hint: If these are at all relevant, there's a very good chance
your post is off-topic in clc.
Some people on this
group don't like this and ask to re-post on an appropriate
group for linux.

Because other groups can answer platform-specific questions
much better, and clc isn't particularly interested in becoming
yet another high-noise-to-signal dumping ground for anything
and everything where main is a valid identifier.
I've still done this
because I never receive a good answer on any other group.

That doesn't mean that clc has to rectify your problem.

Have you tried simple web/code searches? Looking for C
source on grep or find/replace should give you ample
sources.
 

ritesh

Two points I missed out -

1. The text files are of random form - they don't contain records or
any ordered sequence of characters.

2. The list of text files may go up to 10K files. So I'm assuming that
opening each file using C file I/O is not a good way to handle
this.
 

Jonas

ritesh said:
Two points I missed out -

1. The text files are of random form - they don't contain records or
any ordered sequence of characters.

2. The list of text files may go up to 10K files. So I'm assuming that
opening each file using C file I/O is not a good way to handle
this.

<OT>
The efficient way to do this would be to index your text files. Search for
"inverted index" on the web for more info. More generally, your question is
in the field of "information retrieval" for which there is also a (not very
active) newsgroup: comp.theory.info-retrieval.
</OT>
 

Richard Heathfield

ritesh said:
Hi,

I'm working on a piece of code that -
1. has a list of text files
2. and needs to search for a particular expression in these files

(this is being done on Linux using gcc 3.4.2)

Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.

Is there any function in C or maybe any other way to get this job done
faster?

No specific function in C, no, but there is certainly a way. Several
ways, in fact.

One is to slurp grep into your program, to avoid the overhead of
creating a fresh process, complete with environment etc. It should not
be difficult to find the source for grep, so you can turn grep into
grep().

Another possibility is to code the search yourself, using (say)
Knuth-Morris-Pratt or Boyer-Moore, and taking advantage of any data
knowledge you have which might be able to speed up the search.

And of course you can pre-calculate, as someone else suggested - build
an index that gives you key starting-points.

Other possibilities may well exist, and perhaps the others here will
chip in a few ideas when the grumpiness wears off.
 

Keith Thompson

ritesh said:
I'm working on a piece of code that -
1. has a list of text files
2. and needs to search for a particular expression in these files

(this is being done on Linux using gcc 3.4.2)

Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.

Is there any function in C or maybe any other way to get this job done
faster?
[...]

The grep command happens to be implemented in C. What makes you think
that another program written in C to do the same job will be any
faster?

And you haven't defined what you mean by an "expression".
 

Richard Heathfield

Keith Thompson said:
ritesh said:
Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.

Is there any function in C or maybe any other way to get this job
done faster?
[...]

The grep command happens to be implemented in C. What makes you think
that another program written in C to do the same job will be any
faster?

The two obvious reasons are: (1) losing the process-creation overhead,
and (2) possible hacks based on special data knowledge.
And you haven't defined what you mean by an "expression".

He doesn't have to. C defines it for him. :)
 

Keith Thompson

Richard Heathfield said:
Keith Thompson said:
ritesh said:
Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.

Is there any function in C or maybe any other way to get this job
done faster?
[...]

The grep command happens to be implemented in C. What makes you think
that another program written in C to do the same job will be any
faster?

The two obvious reasons are: (1) losing the process-creation overhead,
and (2) possible hacks based on special data knowledge.

The first can be largely eliminated. (He could have found out about
xargs if he'd posted to a Unix newsgroup.)

As for special data knowledge, I suppose that's possible, but we
can't help if he doesn't share that knowledge.

[...]
 

ritesh

ritesh said:
I'm working on a piece of code that -
1. has a list of text files
2. and needs to search for a particular expression in these files
(this is being done on Linux using gcc 3.4.2)
Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.
Is there any function in C or maybe any other way to get this job done
faster?

[...]

The grep command happens to be implemented in C. What makes you think
that another program written in C to do the same job will be any
faster?

And you haven't defined what you mean by an "expression".

By 'expression' I mean a regular expression used by the grep command.
 

ritesh

Richard Heathfield said:
Keith Thompson said:
Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.
Is there any function in C or maybe any other way to get this job
done faster?
[...]
The grep command happens to be implemented in C. What makes you think
that another program written in C to do the same job will be any
faster?
The two obvious reasons are: (1) losing the process-creation overhead,
and (2) possible hacks based on special data knowledge.

The first can be largely eliminated. (He could have found out about
xargs if he'd posted to a Unix newsgroup.)

As for special data knowledge, I suppose that's possible, but we
can't help if he doesn't share that knowledge.

[...]

The text files are similar to files containing C code. I just didn't
mention this. How does having this knowledge about the data in the
files help?

I also looked at some of the code available for 'grep' on the net;
that's a whole lot of code - I don't think I can understand it all and
make it work faster in the way I want to :(
 

B. Augestad

ritesh said:
Richard Heathfield said:
Keith Thompson said:
Currently the search is done using the 'grep' utility on linux. This
takes too much time, and kills the responsiveness of the application.
Is there any function in C or maybe any other way to get this job
done faster?

[...]
The grep command happens to be implemented in C. What makes you think
that another program written in C to do the same job will be any
faster?
The two obvious reasons are: (1) losing the process-creation overhead,
and (2) possible hacks based on special data knowledge.

The first can be largely eliminated. (He could have found out about
xargs if he'd posted to a Unix newsgroup.)

As for special data knowledge, I suppose that's possible, but we
can't help if he doesn't share that knowledge.

[...]


The text files are similar to files containing C code. I just didn't
mention this. How does having this knowledge about the data in the
files help?

I also looked at some of the code available for 'grep' on the net;
that's a whole lot of code - I don't think I can understand it all and
make it work faster in the way I want to :(


<OT>
Ask in comp.unix.programmer as the solution to your problem will be way
off topic in comp.lang.c. Some keywords/functions you should look at,
are regcomp() and pthread_create(). Your system probably has man pages
for these functions.

</OT>

HTH
Bjørn
 

ritesh

<OT>
The efficient way to do this would be to index your text files. Search for
"inverted index" on the web for more info. More generally, your question is
in the field of "information retrieval" for which there is also a (not very
active) newsgroup: comp.theory.info-retrieval.
</OT>


I looked up inverted index - looking at this small example -

"i love you," "god is love," "love is blind," and "blind justice"

the text-index generated for these 4 lines is -

blind (3,8);(4,0)
god (2,0)
i (1,0)
is (2,4);(3,5)
justice (4,6)
love (1,2);(2,7);(3,0)
you (1,7)


It might help if I could do the same with my text files - but how are
these indices generated? Do you have a pointer to a good web page that
shows this?

thanks
 

Richard Heathfield

ritesh said:
On Apr 24, 4:02 pm, Keith Thompson <[email protected]> wrote:

The text files are similar to files containing C code. I just didn't
mention this. How does having this knowledge about the data in the
files help?

Here's a very simple example. Let's pretend it really is C code that
you're searching. (Yes, I know, you suggested it's only "similar", but
I need a concrete example here, so I've picked C source as that
example.)

And let's say you know you're looking for a function name in "live"
code. So if you find a logical line whose first non-whitespace
character is a #, you know you can ignore the rest of the line. If you
find a /* pair, you can look specifically for a */ pair without
worrying what comes between. If you're really up to speed, you might
even be able to prune out "dead" code between #if/#endif and the like.
If you find a quote character that isn't part of a character constant, you can
zoom through to the next unescaped quote character.

And so on and so forth.

Let's take another simple example that isn't C code. Say you have a file
consisting of various kinds of records, with the record type being
noted at the start of the line. If the data you're looking for is
guaranteed only to appear within a particular kind of record, you can
simply check the label and only bother to search the record itself if
it's the right record type.

Many of us tend to focus on writing "generic" software that can handle
any kind of data, and that's all well and good ("write once, use many",
as it were) but, when speed is of the essence, specific knowledge of
the data format can lead to significant speed improvements over generic
algorithms.
 

Al Balmer

As for special data knowledge, I suppose that's possible, but we
can't help if he doesn't share that knowledge.

[...]

The text files are similar to files containing C code. I just din't
mentioned this. How does having this knowledge about the data in the
files help?

In addition to what Richard wrote, you may be able to take advantage
of knowledge about the data you're searching *for*, as well. For
example, in some cases a simple state machine can provide very fast
searches for particular strings.
 

Jonas

ritesh said:
I looked up inverted index - looking at this small example -

"i love you," "god is love," "love is blind," and "blind justice"

the text-index generated for these 4 lines is -

blind (3,8);(4,0)
god (2,0)
i (1,0)
is (2,4);(3,5)
justice (4,6)
love (1,2);(2,7);(3,0)
you (1,7)


It might help if I could do the same with my text files - but how are
these indices generated? Do you have a pointer to a good web page that
shows this?

Well, for full-blown search engine libraries there are e.g. CLucene
(http://clucene.sourceforge.net/) or Estraier
(http://hyperestraier.sourceforge.net/). I haven't used any of them, though.

If your demands are smaller, you might consider implementing the indexer &
search engine yourself. If the amount of text is small, the index could be
created in RAM, which simplifies things. And if the text files change
infrequently, you would not need any 'delete' functionality -- simply
rebuild the index every once in a while.

Try a web search for "create inverted index" or similar; it seems to give a
few useful hits.
 

CBFalconer

ritesh said:
.... snip ...

I looked up inverted index - looking at this small example -

"i love you," "god is love," "love is blind," and "blind justice"

the text-index generated for these 4 lines is -

blind (3,8);(4,0)
god (2,0)
i (1,0)
is (2,4);(3,5)
justice (4,6)
love (1,2);(2,7);(3,0)
you (1,7)

It might help if I could do the same with my text files - but how
are these indices generated? Do you have a pointer to a good web
page that shows this?

Well, you obviously need at least an array of shorts to define the
files involved, together with an array of longs to identify the
position in the file. This will obviously expand the storage for
short words, such as 'I', 'is', etc. So you probably need a list
of words to ignore. How to handle mistypings is another
consideration.

I would suggest a hash table, based on the word itself, leading to
lists of arrays. Only the tables associated with one file will be
in play at any one moment.

See hashlib for one possible source for the hash table.

<http://cbfalconer.home.att.net/download/>
 
