filters

L

lee

Hi,

it´s probably a FAQ, though I haven´t found any good info on it: How do
you go about writing a filter, i. e. a program that reads data from
stdin and processes it?

Since indefinte amounts of data could be read from stdin, they can´t
just be put into some (ever increasing) buffer. Using a buffer of
limited size can make it difficult to process the data because the
buffer could be too small.

What´s the solution for this?
 
S

Stefan Ram

lee said:
Since indefinte amounts of data could be read from stdin, they can't
just be put into some (ever increasing) buffer. Using a buffer of
limited size can make it difficult to process the data because the
buffer could be too small.
What's the solution for this?

(There has not been a programming problem specified yet.)

Sometimes, you do not need to store all the data
when processing it. This depends on the specific problem.

A C program that cannot get enough buffer memory
to process data using realloc might terminate
setting an appropriate exit code, possibly writing
an explanation to stderr or a log file.

You also could write the data read to a file, sometimes.
This might be slower, but might allow to process larger
input data.

When the life time of a computer is assumed to be 150 years
and one assumes that 1 Gibioctet can be read from stdin each
second, there will not be more than 5093144018288640000
octets read. So, one could simply buy that much memory
upfront before starting the program.
 
B

Ben Bacarisse

lee said:
it´s probably a FAQ, though I haven´t found any good info on it: How do
you go about writing a filter, i. e. a program that reads data from
stdin and processes it?

This is not really a C question. I suggest to ask it in
comp.programming which deals with this sort of question. I've set
follow ups there. Maybe you had a C-specific question in mind. If so,
please ignore the followup-to header and reply with the more specific
question.

It's possible you haven't found anything because the question is very
general. Pretty much the only general thing to say about a filter is
what you already know -- that it reads stdin and writes to stdout.
Since indefinte amounts of data could be read from stdin, they can´t
just be put into some (ever increasing) buffer.

You may have no choice. A sort filter can't produce any output until it
has seen all the input while other filters can finish before having seen
all the data (for example, Unix's head command). Some can compute
function that depend on arbitrarily long inputs without requiring
unbounded storage (for example Unix's wc command).
Using a buffer of
limited size can make it difficult to process the data because the
buffer could be too small.

What´s the solution for this?

You make the buffer bigger. If that is not possible you have to find
some other way to store the temporary data. Of course, it is important
to know that you really do need to store the data. As an example, you
can write a filter that prints the variance of a set of numbers without
having to store them all -- the choice of algorithm is central.
 
F

Francois Grieu

How do you go about writing a filter, i. e. a program that reads
data from stdin and processes it?

#include <stdio.h>
int main(void)
{
int c;
while(EOF!=(c = getchar()))
{
if (c>='0'&&c<='9')
c = '9'-c+'0';
putchar(c);
}
}


As an aside, I wonder if
c = '9'-c+'0'
is safe, and think
c = '9'+'0'-c
is not.

Francois Grieu
 
J

James Kuyper

Hi,

it´s probably a FAQ, though I haven´t found any good info on it: How do
you go about writing a filter, i. e. a program that reads data from
stdin and processes it?

Since indefinte amounts of data could be read from stdin, they can´t
just be put into some (ever increasing) buffer. Using a buffer of
limited size can make it difficult to process the data because the
buffer could be too small.

What´s the solution for this?

The term "filter" is generally reserved for the kind of application that
only needs to keep a small portion of the input in memory at any given
time. A typical Unix filter is "cut". It parses each line of input up
into fields, which can be either fixed width or delimited by a
user-specifiable character which defaults to '\t'. It writes out a
specified subset of the fields, with the delimiter optionally replaced
with an arbitrary string. I've never attempted implementing it, but it
seems to me that it should be implementable in a way that never keeps
more than one character of input in memory at any given time.

However, if for some reason your program does need to store the entire
input, then you need expandable storage, and the C standard library
provides some. Start by allocating a buffer with malloc(). Whenever the
buffer gets full, call realloc() to expand it; I recommend increasing
the size by a fixed factor; 2 would be a good value. Note, there are a
couple of tricky points in connection with calling realloc():
* if realloc() fails, it returns a null pointer, and pointers into the
old buffer are still valid. Therefore, if you make the mistake of
storing the value returned by realloc() directly into the same pointer
you were using to keep track of your buffer, you'll lose your ability to
access that buffer if realloc() fails.
* if realloc() succeeds, it may have moved your data to a new location
in memory, invalidating any pointers you may have been keeping that
pointed into your old buffer. You can't safely do anything with any of
the old pointer values, not even comparing them for equality with new
ones to determine whether or not the buffer was moved. For each such
pointer, determine its offset from the beginning of the buffer before
calling realloc(). If realloc() succeeds, calculate the new value for
the corresponding pointer by adding that offset to the start of the new
buffer.

If realloc() does fail, you'll have to switch to a different approach.
One option is to create a temporary file using tmpfile(). Read from
standard input, then write to the temporary file. Once the entire file
is read in, you can move around in the temporary file using fseek(),
something you cannot do with stdin.
 
N

Nobody

it´s probably a FAQ, though I haven´t found any good info on it: How do
you go about writing a filter, i. e. a program that reads data from
stdin and processes it?

Since indefinte amounts of data could be read from stdin, they can´t
just be put into some (ever increasing) buffer. Using a buffer of
limited size can make it difficult to process the data because the
buffer could be too small.

What´s the solution for this?

Read line, process line, write line. Replace "line" with whatever unit of
data is appropriate, but for a typical Unix filter, data is processed line
by line.

Many of the original Unix tools used fixed-size buffers; if an input line
was too large, the program would just terminate with an error. This wasn't
such a problem when creating a file with lines longer than 80 characters
was a feat in itself. Nowadays, such programs are more likely to realloc()
the buffer as required.
 
B

Bill Cunningham

I'd like to see if I can digest this code.
#include <stdio.h>
int main(void)
{
int c;
while(EOF!=(c = getchar()))

If an input from keyboard isn't EOF
{
if (c>='0'&&c<='9')
c = '9'-c+'0';

The above I can;t read. But this seems to be well put together.
 
A

Angel

The above I can't read. But this seems to be well put together.

Since c holds the result of a call to getc() and we already tested for
EOF, we know c is holding a character code, in whatever character set
the implementation is using.

This snippet of code tests if c holds the character code for a decimal
digit, and if so it "reverses" it, so that 0 becomes 9, 1 becomes 8 and
so on.

It's written in a way that will work regardless of which character set
the implementation uses, as long as the character codes for the decimal
digits form an unbroken sequence. This is true in at least ASCII (and
by extention, UTF8) and EBDIC, the two you are most likely to encounter.
 
L

Lew Pitcher

I'd like to see if I can digest this code.


If an input from keyboard isn't EOF

WHILE stdin returns valid input (as opposed to an END-OF-FILE condition)
The above I can;t read. But this seems to be well put together.

Take it in bits.

Because of the if() statement, we know that c holds a value between '0'
and '9' inclusive. In other words, c contains a numeric digit.

The expression
'9' - c
finds out how many digits there are between c and 9
If c == '0', then '9' - c results in a value of 9
If c == '1', then '9' - c results in a value of 8
If c == '2', then '9' - c results in a value of 7
and so on, to
If c == '9', then '9' - c results in a value of 0

Now, this value is then added to '0', resulting in a numeric digit
character.

If '9' - c results in a value of 9,
then '9' - c + '0' results in a value of '9'
If '9' - c results in a value of 8,
then '9' - c + '0' results in a value of '8'
and so on

The net result is that, for each digit input, the expression computes it's
decimal complement
Input '0' and get '9' out
Input '1' and get '8' out
Input '2' and get '7' out
and so on, to
Input '9' and get '0' out

Certainly,
c = '9' - c + '0';
would be safer than
c = '9' + '0' - c;
even though both are algebraically identical.

Given the conditions of it's execution, the sub-expression
'9' + '0'
/might/ overflow an int, in some unknown characterset, while
'9' - c
would not.

OTOH, the compiler might just optimize away the contentious subexpressions
entirely; certainly '9' + '0' is a known value which might fit within an
integer. And, the compiler needs only order operations such that the result
would be the same as if it had not reordered operations, the compiler is
free to reorder
'9' - c + '0'
/or/
'9' + '0' - c
such that the constants are coalesced into a single value prior to object
code emission.

But,
'9' - c + '0'
is probably the safer expression when dealing with an unknown compiler.
 
K

Keith Thompson

Angel said:
It's written in a way that will work regardless of which character set
the implementation uses, as long as the character codes for the decimal
digits form an unbroken sequence. This is true in at least ASCII (and
by extention, UTF8) and EBDIC, the two you are most likely to encounter.

And it's guaranteed by the C standard. (Letters, on the other hand, are
not guaranteed to be contiguous, and in fact they aren't in EBCDIC --
or, typically, in character sets larger than ASCII.)
 
A

Angel

And it's guaranteed by the C standard. (Letters, on the other hand, are
not guaranteed to be contiguous, and in fact they aren't in EBCDIC --
or, typically, in character sets larger than ASCII.)

Mm, didn't know digits were guaranteed. Thanks. ^^
 
K

Keith Thompson

Angel said:
Mm, didn't know digits were guaranteed. Thanks. ^^

C99 5.2.1p3.

I suspect the standard would have made a similar guarantee about
upper case Latin letters and lower case Latin letters if they were
contiguous in both ASCII and EBCDIC, and in common variants.
 
B

Ben Bacarisse

In my generally excellent opinion, it is a C question rather than a
general programming question. The concept of a filter (or function)
is general; the implementation issues are language dependent. See
James's excellent little article on the C specific issues.

I invited the OP to ignore the follow ups header and post something more
specific because I thought it would help everyone to know more about the
C issues the OP though he or she had. Without more information one
would just have to assume some C question which is what James did.
That's fine, and his excellent answer may well be useful even if the
assumption behind it is wrong, but answering a general question with
some C specifics does not make the question a C one, it simply makes the
answer topical! I did not feel like guessing today.
IMGEO :)-)) people are often too quick to say something is a
programming question rather than a C question, perhaps because they
feel that the are about the technicalities of the C standards.

I am ludicrously prone to making typos so I can usually unravel them in
other people's posts, but something has gone wrong here that defeats me.
Present company excepted, of course.

That's everyone reading this, right? :)
 
B

Ben Bacarisse

But I disagree with you about what is topical. OP is asking (or so I
am reading him) "What sort of C design pattern should I use for
writing a filter program?" That sort of question, and I am going to
be dogmatic about it, is topical.

We don't disagree about what it topical (at least I can't see a
disagreement here). We disagree about what the OP asked. I think you
take the very fact that the message was posted here as meaning that C
advice was sought. I didn't see it that way and asked the OP to confirm
that they wanted C advice by re-posting with a little more detail.
Maybe I should not have conflated these two (the request for more detail
and a confirmation that they wanted C advice).

Why would I not take the fact that the question was posted here as
meaning that C advice was being asked for? The main reason is that it
makes no sense for someone to be asking so general a question about
writing a filter whist at the same time that person has already decided
what language they'll be using. I accept that there are reasons that
this happens (the person knows only C or they are taking a C course and
the language is therefore pre-decided) but in the absence of a hint
about that it seemed impolite to assume that the OP wanted only C
advice.
If OP were to write a filter
program in Fortran, a quite different answer would be needed, and if
in Ocaml yet another kind of answer. In short, the question is about
writing a filter program as a general question, it is specifically
about writing one in C. That's why it is topical.

You may be right. Maybe we'll find out one day! If it really was a C
question, maybe we'll find out why the OP knows the language they are
going to use but didn't say anything about the kind of filter they'll be
writing.

<snip>
 
T

Tim Rentsch

James Kuyper said:
Hi,

it'`s probably a FAQ, though I haven'`t found any good info on it: How do
you go about writing a filter, i. e. a program that reads data from
stdin and processes it?

Since indefinte amounts of data could be read from stdin, they can'`t
just be put into some (ever increasing) buffer. Using a buffer of
limited size can make it difficult to process the data because the
buffer could be too small.

What'`s the solution for this?

The term "filter" is generally reserved for the kind of application that
only needs to keep a small portion of the input in memory at any given
time. [snip]

No, it isn't. Gordon Burditt gives a much better survey of
different kinds of filters and their storage requirements.
 
J

Jorgen Grahn

James Kuyper said:
Hi,

it'`s probably a FAQ, though I haven'`t found any good info on it: How do
you go about writing a filter, i. e. a program that reads data from
stdin and processes it?

Since indefinte amounts of data could be read from stdin, they can'`t
just be put into some (ever increasing) buffer. Using a buffer of
limited size can make it difficult to process the data because the
buffer could be too small.

What'`s the solution for this?

The term "filter" is generally reserved for the kind of application that
only needs to keep a small portion of the input in memory at any given
time. [snip]

No, it isn't.

I'm not sure there is a definition everyone agrees on. There certainly
isn't an /official/ one.
Gordon Burditt gives a much better survey of
different kinds of filters and their storage requirements.

I found both Kuyper's and Burditt's postings useful.

/Jorgen
 
L

lee

Ben Bacarisse said:
Why would I not take the fact that the question was posted here as
meaning that C advice was being asked for? The main reason is that it
makes no sense for someone to be asking so general a question about
writing a filter whist at the same time that person has already decided
what language they'll be using. I accept that there are reasons that
this happens (the person knows only C or they are taking a C course and
the language is therefore pre-decided) but in the absence of a hint
about that it seemed impolite to assume that the OP wanted only C
advice.

Well, no, I´m asking for C advice because I´m thinking of writing a
filter in C. That´s why I posted the question in this group, impolitely
assuming that people here would assume it´s a question about writing a
filter in C :)

Though I´m aware that there are languages better suited for the purpose,
C is still the only one I know well enough to use for this, hence the
pre-decision.

I´ve kept the question rather general because I don´t have a particular
problem with writing a filter for a particular purpose yet, as I haven´t
started writing one yet. Currently, I´m thinking about writing one
eventually for the filtering of email.

Since I´m aware of the problem that when reading data from stdin, the
amount of data to be read isn´t limited, I´m wondering how filters deal
with this problem. That makes for a rather general question, of
course. It´s only something I´m trying to learn more about before even
starting to write a filter.


From all the kind replys, I´ve gathered so far that:


* There´s no other choice than to buffer the data read from stdin either
in memory or in temporary files.

* It depends on what the filter does how large a portion of the data
read from stdin needs to be.

* Eventually, the filter can terminate with an error.


Now a more specific question: Is there a nice C library useful for
writing a filter for filtering email? There isn´t much point in
re-inventing the wheel, like seperating the headers from the body,
picking out particular headers, searching through the contents of
headers, adding headers and such.
 
B

Ben Bacarisse

lee said:
Well, no, I´m asking for C advice because I´m thinking of writing a
filter in C. That´s why I posted the question in this group, impolitely
assuming that people here would assume it´s a question about writing a
filter in C :)

Good to hear back from you. You have no idea how many people just "post
and go". And I was not suggesting your were being impolite -- I was
suggesting it would be impolite for me to assume you'd decided the
language but nothing else.
Though I´m aware that there are languages better suited for the purpose,
C is still the only one I know well enough to use for this, hence the
pre-decision.

I urge you to reconsider! Not because I know that language X (X != C)
is the better choice (how can I?) but because deciding to stick with
what I know has almost always been a bad choice for me. Learning new
things is stimulating and it's rare to regret learning to use
appropriate tools.

You may end up using C, but please don't decide that now.
I´ve kept the question rather general because I don´t have a particular
problem with writing a filter for a particular purpose yet, as I haven´t
started writing one yet. Currently, I´m thinking about writing one
eventually for the filtering of email.

You can't get specific advice without a specific problem. I have
several mail filters that I use every day and none use C. Only one has
any need for more than a tiny amount of storage. How do you even know
that storage will be a problem?
Since I´m aware of the problem that when reading data from stdin, the
amount of data to be read isn´t limited, I´m wondering how filters deal
with this problem. That makes for a rather general question, of
course. It´s only something I´m trying to learn more about before even
starting to write a filter.

This suggests that you may be a bit confused about the problem. In my
previous rely I cited examples of filters that don't have any storage
problem despite having potentially unbounded input. That's why, to
start with, this is a programming question. The need for growing
storage will come from what the filter does, not from the fact the input
is not limited in size. In short, there is no inevitable problem caused
by unlimited input.
From all the kind replys, I´ve gathered so far that:

* There´s no other choice than to buffer the data read from stdin either
in memory or in temporary files.

"Buffer" is maybe too specific a term. It's often related to IO but the
best solution may have nothing to do with buffering. For example, if
you need to find how many distinct words there are in the input, you
will be better off growing a data structure like a tree or a hash table.
That's not really covered by "buffering the data".
* It depends on what the filter does how large a portion of the data
read from stdin needs to be.

Whilst that's obviously true, it seems to be missing the point. It is
much more important to understand how much data (and exactly what data)
needs to be stored. How much needs to be read is often a detail. In my
previous example, the whole message needs to be read, but what needs to
be stored is some data structure that can be used to look up previously
seen words.
* Eventually, the filter can terminate with an error.

Now a more specific question: Is there a nice C library useful for
writing a filter for filtering email? There isn´t much point in
re-inventing the wheel, like seperating the headers from the body,
picking out particular headers, searching through the contents of
headers, adding headers and such.

Yes, though I can't say how nice any of them are because I've not used
any from C. If I was forced to use C, I'd probably start by looking at
the GNU mailutils library. From choice I'd use Perl's MailTools
package.

Is it possible that you are looking for a project to extend your C
skills? If so, saying so will stop people like me trying to get you to
use better tools for the job. There's nothing wrong with writing C for
the hell of it (or just to get better at it) but right now I can't see
why you want general C answers to problems that are not obviously C ones
(yet).
 
T

Tim Rentsch

Jorgen Grahn said:
James Kuyper said:
On 06/18/2011 06:15 AM, lee wrote:
Hi,

it'`s probably a FAQ, though I haven'`t found any good info on it: How do
you go about writing a filter, i. e. a program that reads data from
stdin and processes it?

Since indefinte amounts of data could be read from stdin, they can'`t
just be put into some (ever increasing) buffer. Using a buffer of
limited size can make it difficult to process the data because the
buffer could be too small.

What'`s the solution for this?

The term "filter" is generally reserved for the kind of application that
only needs to keep a small portion of the input in memory at any given
time. [snip]

No, it isn't.

I'm not sure there is a definition everyone agrees on. There certainly
isn't an /official/ one.

Which justifies the assertion that the term isn't "generally
reserved" for the one usage.
I found both Kuyper's and Burditt's postings useful.

I didn't mean to imply anything about the usefulness of JK's
other comments - only about the accuracy of the one sentence I
responded to.
 
L

lee

Ben Bacarisse said:
Good to hear back from you. You have no idea how many people just "post
and go".

Yeah, I don't like it when people do that.
And I was not suggesting your were being impolite -- I was
suggesting it would be impolite for me to assume you'd decided the
language but nothing else.

I know, hence the smiley.
I urge you to reconsider! Not because I know that language X (X != C)
is the better choice (how can I?) but because deciding to stick with
what I know has almost always been a bad choice for me. Learning new
things is stimulating and it's rare to regret learning to use
appropriate tools.

You're right about that. There's so much to learn that it's really
overwhelming. I'm trying to learn elisp ATM, besides a lot of other
things.
You can't get specific advice without a specific problem. I have
several mail filters that I use every day and none use C. Only one has
any need for more than a tiny amount of storage. How do you even know
that storage will be a problem?

It turned out that storage can be a problem when I was thinking of
writing a filter for something. IIRC, the filter was supposed to work on
strings, and I soon found out that when allocating a buffer to work on,
a part of what the program needed to look at would inevitably be outside
of the buffer. I ended up finding a different solution and kept
wondering how the problem is solved.
The need for growing storage will come from what the filter does, not
from the fact the input is not limited in size. In short, there is no
inevitable problem caused by unlimited input.

That's true.
Whilst that's obviously true, it seems to be missing the point. It is
much more important to understand how much data (and exactly what data)
needs to be stored. How much needs to be read is often a detail.

Seeing it that way makes it obvious that asking a question like I did
doesn't make much sense. The question *is* too general, and you're
right, it's missing the point.
Yes, though I can't say how nice any of them are because I've not used
any from C. If I was forced to use C, I'd probably start by looking at
the GNU mailutils library. From choice I'd use Perl's MailTools
package.

Well, I always wanted to learn perl and never got to it. I think I've
been reading about the MailTools package, and it seemed to be very
powerful.
Is it possible that you are looking for a project to extend your C
skills?

Nah, it's only like I said, C is an obvious choice for me because it's
what I know best. I could write a filter without having to learn another
programming language first, which is something that can take a lot of
time.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,537
Members
45,020
Latest member
GenesisGai

Latest Threads

Top