Verify that line exists in a file

J

Jonny

Hi,

Please could you tell me what the quickest way would be to verify that a
line of text exists in a file.

I would like to be able to do something like:

grep "^line of text$" filename.txt

and then check the exit code, or:

grep -c "^line of text$" filename.txt

and then check the count.

But I need to do it in a C program.

Many Thanks,
Jonny
 
W

Walter Roberson

:please could you tell me what the quickest way would be to verify that a
:line of text exists in a file.

open the file, loop reading lines, for each line strcmp against the
target; if you get a match then set an appropriate status and break
early; close the file.

Watch our for the factor of whether the line read includes or excludes
the record seperator; note that the answer might not be the same for the
last line in the file.


I should be slightly more honest: the above is not necessarily
the -quickest- way, if by 'quick' you mean that execution speed is
crucial. If execution speed is crucial, there are faster algorithms.

For example, you could read a bufferful of the file at a time into
memory and proceed through it. Instead of starting from the start
of the buffer and comparing forward, you could instead go immediately
a number of characters further on, where the offset is the size of
the string you are trying to match against. If the character there does
not match the -last- character in the string, then you know already
that that particular section you are looking at is not a copy of the
line; if it does match, then you proceed backwards comparing a character
at at time until you reach the beginning of the string or find a mismatch.

If that first test did not match, you could scan forward until you
found the end-of-line and then start the backwards comparison from there.
There are optimizations that can be made to even that process, though:
the character that was in fact at the place you looked can give you
hints about how much further on to look. For example if the character
you found was an 'x', and you knew that 'x' was the 7th last character in
the string, then you could immediately advance by 7 characters and look
for the newline there, instead of testing each of the characters
inbetween for newline.
 
M

Michael Mair

Jonny said:
Hi,

Please could you tell me what the quickest way would be to verify that a
line of text exists in a file.

I would like to be able to do something like:

grep "^line of text$" filename.txt

and then check the exit code, or:

grep -c "^line of text$" filename.txt

and then check the count.

But I need to do it in a C program.

You want to have a function

int OccurrenceOfLine(FILE *file, const char *line)

which reads lines with fgets() into a buffer of size
>=strlen(line)+1, strips the '\n' from the end of the line
(if there is none, the line was too long, i.e. it does not
match and you can read characters until you have reached the
'\n' of this line) and uses strcmp() on the buffer and
line.
If you have a match, you can return 1. If you encounter an
error, you can return -1. Otherwise, you return 0.
If you want to _count_ the occurences, then I suggest

int OccurrencesOfLine(FILE *file, const char *line, size_t *count)

where you return 0 on success and an error code !=0 on failure.
The number of matching lines is stored in *count.

Note: You need of course <stdio.h> and <string.h> for both
ways I have suggested.


Cheers
Michael
 
G

Gregory Toomey

Jonny said:
Hi,

Please could you tell me what the quickest way would be to verify that a
line of text exists in a file.

I would like to be able to do something like:

grep "^line of text$" filename.txt

and then check the exit code, or:

grep -c "^line of text$" filename.txt

and then check the count.

But I need to do it in a C program.

Many Thanks,
Jonny

Knuth-pratt-morris or boyer-moore
http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/

gtoomey
 
J

jcoffin

Gregory Toomey wrote:

[ ... ]

With the proviso that you _rarely_ want to use Boyer-Moore as
originally defined -- nearly all practical use is of simplified
variants such as Boyer-Moore-Horspool (or Sunday's variant thereof). A
full-blown Boyer-Moore search is theoretically better in some sense,
but the savings rarely justify the initialization effort.

The development effort is considerable as well -- to the point that
incorrect implementations abound, even those published by the most
highly respected computer scientists (e.g. Knuth). In fact, for the
first 15 years after the algorithm was described, EVERY published
implementation seems to have contained at least one defect.

By contrasty, Sunday's variant of B-M-H is easy to set up, much easier
implement correctly (I'm among the few to have really screwed it up),
and will usually run faster to boot!
 
J

Jonny

Thanks to everyone for replying.

I do need a function which is efficient, so I'll probably need to try an
implementation of one of the more advanced algorithms suggested.

I found the section of code below, which claims to be an implementation
of Boyer-Moore-Horspool.

I would be grateful if someone could briefly tell me if and how I could
use it.


char *search( pat, text, n )
char *pat, *text;
int n;

{ int i, j, k, m, skip[MAXCHAR];
m = strlen(pat);
if( m==0 ) return( text );
for( k=0; k<MAXCHAR; k++ ) skip[k] = m;
for( k=0; k<m-1; k++ ) skip[pat[k]] = m-k-1;

for( k=m-1; k < n; k += skip[text[k] & (MAXCHAR-1)] ) {
for( j=m-1, i=k; j>=0 && text == pat[j]; j-- ) i--;
if( j == (-1) ) return( text+i+1 );
}
return( NULL );
}


Regards,
Jonny
 
C

CBFalconer

Gregory said:

For complete lines combining fgets and strcmp is hard to beat.
Normally strcmp will fail on the first char. or two, and fgets need
not read into a buffer significantly larger than the comparee. The
only complication will be stripping the '\n' off the end of file
lines, if present, and flushing the input line thru '\n' if not
present.

There is no need for more complex searches such as kmp or bm.
 
M

Michael Mair

Jonny said:
Thanks to everyone for replying.

I do need a function which is efficient, so I'll probably need to try an
implementation of one of the more advanced algorithms suggested.

Which is _complete_ overkill.
Both search for a comparatively small pattern in a comparatively
large _unstructured_ text.
You, OTOH, have a pattern which is exactly as large as the text
without the '\n'. That means that setting up the skip vector
about triples the cost.

The only disadvantage of fgets() is that it does not return or
make available the number of read characters so that you may
consider using getc() and comparing yourself.
Implement the "read line -- compare it" variant and one advanced
algorithm and just use a profiler or clock() or whatever to
find out which is better.

A "better" algorithm is not necessarily the best algorithm for
a given task -- whenever you have additional information about
the problem at hand, using a method tailored to your degree of
information is best.

I found the section of code below, which claims to be an implementation
of Boyer-Moore-Horspool.

I would be grateful if someone could briefly tell me if and how I could
use it.

You could, with some restrictions.
Why did you not just paste it into your code _and_ try it?
We are not here to counsel you about algorithms or code off some
dubious site in the web but about _your_ C code -- which you still
fail to show us.

char *search( pat, text, n )
char *pat, *text;
int n;

K&R style -- this is outdated by a good fifteen years.

char *search (const char *pat, const char *text, int n)

where you probably would change int n into size_t n and
would change the order of text and pat to fit the order
in standard library functions such as strstr().
{ int i, j, k, m, skip[MAXCHAR];
m = strlen(pat);
if( m==0 ) return( text );
for( k=0; k<MAXCHAR; k++ ) skip[k] = m;
for( k=0; k<m-1; k++ ) skip[pat[k]] = m-k-1;

for( k=m-1; k < n; k += skip[text[k] & (MAXCHAR-1)] ) {

Here is a catch: The author assumes MAXCHAR to be 1<<s.
To repair this, make it skip[text[k] % MAXCHAR]
and leave the micro-optimisations to the compiler.
for( j=m-1, i=k; j>=0 && text == pat[j]; j-- ) i--;
if( j == (-1) ) return( text+i+1 );
}
return( NULL );
}


I have not closely looked at or tested the rest of the
code -- the latter is your job.
However, I think that you are trying to be lazy on the
thinking and programming part. This will not lead to the
most efficient solution.


Cheers
Michael
 
S

SM Ryan

# Hi,
#
# Please could you tell me what the quickest way would be to verify that a
# line of text exists in a file.

Set up an FSM and scan the file character by character. Then you don't have to worry
about line bufferring.

If the line you're interested in has N characters, the state machine would have
N+4 states:
s=0 initial state
0<=s<N s characters of a line have been read and match the given string.
if the next input character is the sth string character
then move to state s+1
else move to state N+2
s=N all N characters match
if the next input character is a new line or EOF
then move to state N+2
else move to state N+1
s=N+1 this line does not match the string
if the next input character is a new line
then move to state 0
else if the next input is an EOF
then move to state N+3
else
stay in state N+1
s=N+2 accepting final state: the line is in the file
s=N+3 rejecting final state: the line is not in the file

If you get the line out of argv, the only variables you need are the next input,
N, s, and argc and argv.

If the file length is M and the sought line is randomly distributed in the file,
the estimated running time is a little over O(M/2), and it is independent of N.
The storage cost is O(4+N).
 
M

Mark McIntyre

Hi,

Please could you tell me what the quickest way would be to verify that a
line of text exists in a file.

I would like to be able to do something like:

grep "^line of text$" filename.txt

1) d/l the source code for grep. Its in the gnu txtutils pack I think.

2) read the file line by line till you find a line that matches your
pattern.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,767
Messages
2,569,572
Members
45,045
Latest member
DRCM

Latest Threads

Top