substring finding problem!

Nick Keighley · Feb 27, 2010

Hm! All of my solutions have involved recursion in some form, though
perhaps not as elegantly as some of the more-cryptic-to-me solutions
posted.

Click to expand...

I think it was Wilhelm's [s?] that impressed me

"A better class of primitive"? for code, you mean?

Click to expand...

code, or specification. I thought concat was neat but it was a choice
of primitive. I'm sure a similar specification could be written that
used different primitives.

And then perhaps we could replace the "does not use string.h"
constraint with something more meaningful [*], though I'm not
sure what.
[*] My objection to this constraint is that any minimally competent
programmer should be able to write functions that implement the
same API, so just avoiding use of the library functions doesn't
seem to me to make the problem more interesting.

Click to expand...

Click to expand...

there were some proposals for better APIs, ones that avoided
repeatedly rescanning the string.

Click to expand...

?

Click to expand...

I think some people said the problem with some string.h functions was
they didn't return a pointer to the end of string- hence causing a
naive implementaion to call strlen repeatedly. And strlen is o(n)

Click to expand...

Click to expand...

Chris M. Thomasson · Feb 27, 2010

Seebs said:
Idle curiousity: How's mine do? I haven't checked to see what the
official
interface is, but I'm pretty sure this is adequately obvious. It
presumably
suffers from double-scanning, but I don't know how much that matters. It
doesn't do a lot of mallocs.

char *
rep(const char *in, const char *out, const char *target) {
char *result = 0;
const char *s;
char *t, *u;
size_t inlen, outlen, targetlen, resultlen;
int count;

if (!in || !out || !target || !*in)
return 0;
inlen = strlen(in);
outlen = strlen(out);
targetlen = strlen(target);

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Seebs, you don't actually need to call `strlen()' on the head of the
`target' in order to get it's length. Instead, you can calculate this after
the "first pass" of your `strstr()' loop by keeping a pointer to the
previous match. Think of something along the lines of:
_________________________________________________________
#include <string.h>
#include <assert.h>

size_t
blah(char const* src,
char const* cmp)
{
char const* tail = src;
size_t cmp_size = strlen(cmp);

if (cmp_size)
{
char const* head;

for (head = strstr(src, cmp);
head;
tail = head,
head = strstr(head + cmp_size, cmp));
}

return (tail - src) + strlen(tail);
}

int
main(void)
{
size_t x = blah("xcmpxcmpx", "x");

assert(x == sizeof("xcmpxcmpx") - 1);

return 0;
}
_________________________________________________________

See what I am getting at here? Anyway, this should definitely improve the
performance of your algorithm!

:^)

Ben Bacarisse · Feb 27, 2010

Nick Keighley said:
I think some people said the problem with some string.h functions was
they didn't return a pointer to the end of string- hence causing a
naive implementaion to call strlen repeatedly. And strlen is o(n)

I don't think the problem is repeated calls but one unnecessary one.
The last strstr will always fail, but it will have scanned to the end
of the string[1]. To find the length of this tail we have to scan
again. Either a different API for substring matching or a manual
search for the substring will give us this extra useful information.

I notice that my best effort so far fails to use this information in
one of the places where it might -- I can replace the final strcpy
with a memcpy. That buys me a few cycles per call. Thanks for making
me look again.

<snip>

Seebs · Feb 27, 2010

What exactly does autism have to do with any of this?

I'm high-functioning autistic, and spinoza1111 is obsessed with my various
flaws. I mentioned that once just to see if he'd try to run with it, and the
answer was "yes".

-s

Seebs · Feb 27, 2010

Seebs, you don't actually need to call `strlen()' on the head of the
`target' in order to get it's length. Instead, you can calculate this after
the "first pass" of your `strstr()' loop by keeping a pointer to the
previous match.

Good point!

I was writing for robustness, rather than performance -- no significant
optimization was intended. I'd guess that it could be improved quite a bit;
I just wanted something that was, as they say, so simple that it obviously
contained no bugs (except for the boundary condition someone pointed out
within five minutes).

It's much easier to improve the performance of a clear design than to debug
an unclear design.

-s

Julienne Walker · Feb 27, 2010

I believe I express mostly disapproval about speed of coding as a way
of dodging issues, not speed of coding per se. For example, I didn't
like it at all when Brian Kernighan, in the recent O'Reilly collection
Beautiful Code, praised Rob Pike for only taking an hour to write a
"regular expression processor" because:

* Pike's code wasn't a full or true regular expression processor
* The fact that it took an hour doesn't change the above

At the risk of getting off on a tangent, I'm wondering if you read
that chapter carefully, or the book from which the example was taken.
The requirements for the "regular expression package" were, and I
quote:

"I suggested to Rob that we find the smallest regular expression
package that would illustrate the basic ideas while still recognizing
a useful and nontrivial class of patterns. Ideally, the code would fit
on a single page."

Then later, confirmation that the requirements didn't include a "full
or true regular expression parser", by which I suspect you mean
something like Perl or .NET regex matching:

"This is quite a useful class; in my own experience of using regular
expressions on a day-to-day basis, it easily accounts for 95 percent
of all instances. In many situations, solving the right problem is a
big step toward creating a beautiful program. Rob deserves great
credit for choosing a very small yet important, well-defined, and
extensible set of features from among a wide set of options."

While I might question how much he actually uses regular expressions
if the set of [<literal>, .(period), ^, $, *] makes up 95% of his
usage, the context is important, and it's obvious that a "full and
true regular expression parser" was never claimed in The Practice of
Programming, as evidenced by this quote from that section:

"For full regular expressions, with parentheses and alternatives, the
implementation must be more sophisticated, but can use some of the
techniques we'll talk about later in this chapter."

I'm inclined to side with Kernighan on this one. When viewed in
context, the code is indeed a beautiful piece of work for what it
does. You seem to be critiquing it in terms of production code rather
than a rudimentary implementation of regular expressions in a chapter
section describing how regular expressions are a simplifying notation.

spinoza1111 · Feb 28, 2010

At the risk of getting off on a tangent, I'm wondering if you read
that chapter carefully, or the book from which the example was taken.
The requirements for the "regular expression package" were, and I
quote:

"I suggested to Rob that we find the smallest regular expression
package that would illustrate the basic ideas while still recognizing
a useful and nontrivial class of patterns. Ideally, the code would fit
on a single page."

I read that passage. However, I'd encountered "regular expressions" in
1971 when I read "Formal Languages and Their Relation to Automata" by
Hopcroft and Ullman before I had access to a computer. I regard myself
as privileged to start with math before wide availability of
computers, since I had to figure more out.

In this book, a "regular expression" was not a regular expression
unless it was a complete fullfunction regular expression, because it
was REQUIRED to be able to parse any regular string. What happened IMO
was that Kernighan Pike et al. seized upon the concept, stole it from
their Princeton/Bell Labs coworkers, and used it to designate a hack:
the unix grep horror.

I analyze the problems that result in a blog post, "Regular
Expressions and 'The Horror'": http://spinoza1111.wordpress.com/2009/05/17/regular-expressions-and-the-horror/.
Basically, what mathematicians mean by regular expressions and the
horror they became (with ms dos file id patterns being just one
example) was created by the way justification and discourse works in
our society.

People esp. in the USA believe that things such as programs can be
justified. Kernighan tried to justify Pike's code by saying it "was"
"sort of" a regex processor ("regular expression package that
would ... recognize [some] useful ... patterns") and wasn't it great,
he went on to say, that Rob did it in an hour.

The problem is that "regular expression" already has a meaning, and
Rob's "regular expression package that would ... recognize [some]
useful ... patterns" was NOT a regular expression recognizer. In a
university class, where the assignment was "develop a regular
expression package that parses and applies regular expressions", Rob
would have received an F.

Dijkstra would not have approved. But even in my own graduate
coursework, in a department that was focused primarily in making
people employable in practical data processing in Chicago, the
department chair gave us the complete rules, not a partial hack of the
rules. Programming the COMPLETE rules gives us satisfaction and
utility, for life is short but art is long. Programming a part of them
in ten minutes is jerking off. The code is porn.

That is: people here, and elsewhere in programming, do violence to
problems and this is blessed as "hacking". But mathematicians don't
change the meaning of pre-existing notions to get their work done fast
and impress some thug of a boss. The professional ethics that have
resulted are on exhibit here. Any nonsense (a "linked list" of copies
as the only solution, failure to find %s) are presented as solutions,
and any objections are first met with "but I did it in ten minutes",
followed up with Fascistic personal abuse that is the stock in trade
of the corrupt.

Robert Oppenheimer said that physicists learned evil after the atomic
bomb. Computer scientists learned evil the day Algol died. This was
because their work should have been subordinated to university
mathematics.

Why?

NOT because university mathematicians or guys like Dijkstra are
Platonic-perfect beings. It is because thanks to unix, computer
science discourse became a trivial subset of corporate-speak.

What's wrong with corporate-speak?

It's "hegemonic" in the sense of Gramsci, a guy who in Mussolini's
prisons tried to explain why Italians were deluded by mass media into
thinking the (originally socialist) Mussolini was their friend.

It has no outside. What was that movie where Jim Carrey can't get
outside of a TV show?

Any math-based objection to silly code became, in the 1970s and 1980s,
in my experience, a non-starter without a "business case". For
example, I found it helpful to fix bugs in Bell Northern Research's
compiler that I found on my own whilst making changes needed by the
engineers. But I had to do so on my own time, since no "business case"
had been made. The compiler support team in Ottawa had been brutally
disbanded in 1980 because in the view of management, it was "wasting
time" by incrementally improving the compiler by removing bugs and
extending its capabilities. The result being my job, since they'd
thrown the baby out with the bathwater.

My engineering case was "it is intrinsically dangerous that the
compiler references uninitalized variables when the last 'end' is
missing, I know how to fix this, so I should fix it: this is because
the compiler, a large and complex critical system that the field
engineers need every day to solve problems for our customers, is
operating in this case outside of its design specifications. It's not
supposed to do that, and I predict trouble down the line".

But this wasn't a "business case", since (1) I could not predict the
nature of the "trouble" and even more, (2) I couldn't show how this
impacted profits.

Remarkably, this situation was repeated in the Columbia disaster of
2003, when the Space Shuttle exploded on re-entry. NASA engineers said
that the Columbia was operating outside its design window when it shed
foam on takeoff, for the same reason your car is operating outside its
design window when you drive with a busted taillight to the Kennedy
Space Center. They were told that this was (to use Rumsfeld's term) a
known unknown and not to worry, since in NASA's "businesslike"
"management by objectives" and cost-centered regime instituted under
Reagan, everything had to have a "business case".

I then read Habermas, a German philosopher. He, unlike an American,
lacked the American trust that it will all come out right on launch
day or re-entry as long as we have faith in Jesus and obey superiors.
Instead, Habermas made a detailed study of German debates in coffee-
houses under the somewhat enlightened reign of Frederick the Great to
discover that people could engage in a common search for truth, and
from these common searches for truth, utililty resulted in some cases,
but utility was NOT the goal.

Habermas said that there are two forms of discourse: civil discourse,
a common and disinterested search for truth that stops at nothing
(hey, there's a bug, it references uninitialized storage when there is
not balanced end keyword, let's fix it, wow cool) and commercial
discourse, the clerk's discourse, where his remit is to solve only a
specific problem and wait for further instructions.

I realized, to get back to Beautiful Code, that Kernighan had fallen
victim to the fact that programming no longer feels it can look
outside to a disinterested discourse, that of mathematics. But once
you form a closed world of commercial discourse, the main axiom of
self-interest automatically implies corruption: changing the meaning
of words to get it done and please the boss, and here, acting like a
complete asshole when your work is questioned...staying inside the
worst type of commercial discourse, the attack on competence, in
preference to seeking guidance in mathematical "civil discourse".

Then later, confirmation that the requirements didn't include a "full
or true regular expression parser", by which I suspect you mean
something like Perl or .NET regex matching:

No, I mean regular expressions as independently discovered in the
1940s by McCulloch, von Neumann et al.

"This is quite a useful class; in my own experience of using regular
expressions on a day-to-day basis, it easily accounts for 95 percent
of all instances. In many situations, solving the right problem is a
big step toward creating a beautiful program. Rob deserves great
credit for choosing a very small yet important, well-defined, and
extensible set of features from among a wide set of options."

While I might question how much he actually uses regular expressions
if the set of [<literal>, .(period), ^, $, *] makes up 95% of his
usage, the context is important, and it's obvious that a "full and
true regular expression parser" was never claimed in The Practice of
Programming, as evidenced by this quote from that section:

"For full regular expressions, with parentheses and alternatives, the
implementation must be more sophisticated, but can use some of the
techniques we'll talk about later in this chapter."

I'm inclined to side with Kernighan on this one. When viewed in

Not me.

context, the code is indeed a beautiful piece of work for what it
does. You seem to be critiquing it in terms of production code rather
than a rudimentary implementation of regular expressions in a chapter
section describing how regular expressions are a simplifying notation.

Nope. Actually, I could have written a complete parser using the truly
beautiful and simple syntax of ALL regular expressions presented in my
automata class by the head of the department.

Look, I don't want to offend Kernighan. I met him at Princeton, he
remembers me with neutrality if not fondness (probably not the latter)
and he's responded to my emails recently. I shall ask him to weigh in
here, but don't bet on it.

If you call Pike's crude solution "beautiful", you're corrupting
people and producing "programmers" who GET LAID OFF AT THIRTY, and who
cannot find another job or learn new languages on their own, and who
wind up saying "Welcome to Costco, I love you, welcome to Costco, I
love you" for the rest of their lives.

We need discourse other than the loud mouth of money: the health
insurance crisis is being created by the greed of insurance companies.

spinoza1111 · Feb 28, 2010

On Feb 26 said:
On Feb 26 said:

[...] The API locks us into bad thoughts.
I could [swear] you told me there were no bad books. And yet there are
bad thoughts? double plus ungood.

Click to expand...

Click to expand...

Sure. Freedom of speech is completely consistent with criticism of
published thought, but NOT by poseurs. Seebach, in my opinion, is a
poseur. Therefore, my freedom of speech enables me to call him a
poseur.

Click to expand...

Your own thought seems to proceed in true regular guy mode:

Click to expand...

have I been insulted? Is it good or bad to be a "regular guy"?

Click to expand...

Bad. We need stand up guys here.

Chomsky 3 substitution of strings,

[...] Strings terminated ON THE RIGHT with a Nul is a
bad thought for two reasons:
* It prevents Nul from occuring in a string
a count preceeded string is bad because it limits the maximum size of
the string. I suppose a chain of blocks doesn't have this limitation.

Click to expand...

Click to expand...

I suppose not. However, the length may be inexpressible if it exceeds
what in C is called "long long" precision. OO systems handle this
cleanly: it gives C the willies.

Click to expand...

OO isn't magic. If there is a way to do this then C will be able to do
it as well. Yeah, I know I'm drifting into "all useful programming
languages are turing complete and therefore in a sense equivalent".

Click to expand...

You sure are.

But I'll argue that this particular problem doesn't require the C
programmer to implement an interpreter.

Click to expand...

"Interpreter" has become to mean "does extra stuff". But that's not
what an interpreter is. One of the least celebrated discoveries in
computer science was made by Bell Northern Research developers in the
1970s: threaded code and just in time, which solves the problem that
interpretation has to interpret everything all the time.

The turing tarpit, where everything is possible but nothing is
tractable

* It mandates Eurocentric left to right processing
nope. You are confusing representation and presentation. The nul isn't
at the right hand end it's at the largest address. If the display
device chooses to print r-to-l instead of l-to-r it makes not [the]
blindest bit of difference.

Click to expand...

Click to expand...

What on EARTH does the word "print" mean?

Click to expand...

to make marks on a piece of paper.

Click to expand...

You still do that? Your only output device is a printer? What, are you
logged on to the Web using a Western Union teletypewriter? Wow.

[...]
I'm aware that left to right can be reversed by layering software,
something at which C programmers suck because C sucks at it.

Click to expand...

Your solution forces the wogs to get wog devices that print backwards.

Click to expand...

or a device that isn't biased. Like a laser printer. They basically
don't care what direction they go in. I suspect it wouldn't take much
to make a dot-matrix or ink-jet do similar stuff. I can print pictures
on my dot-matrix and pictures aren't euro-centric. Those of us that
stayed in the industry are aware that the chain printer is no more.
(not that those were direction biased either!)

Click to expand...

You say that, but use the word "print".

we're repeating ourselves. But I think my point about representation
and presentation (or display) is worth thinking about.

Click to expand...

No, it's not, because can is not will. I certainly know, living in
China, that the invention of the bit mapped display and laser printer
was great for Asia.

However, Nul C strings are like Jewish settlers in Palestine, or the
ongoing isolation of Fiji by Australia. They still bias the whole
system towards the West and its presumptions.

I have to confess I smiled the first time I saw one of your little
dialogues. But the joke doesn't really last.

Click to expand...

Too bad.

spinoza1111 · Feb 28, 2010

[...]

However, nobody but me seems to notice that part of the general
culture of these newsgroup is amnesia about traditional ethics owing
to its replacement by an ironic hacker ethos. In that ethos, it seems
to be OK to destroy individual standing, because in what Jared Lanier
calls "digital Maoism" all that matters is group "consensus".

Click to expand...

Components of the hacker pseudo-ethos:

Click to expand...

* Running code (and the "rough" consensus of the Lynch law) as
opposed to correct software
* Treatment of artifacts such as computers and abstractions as more
important than human beings
* Autism

Click to expand...

^^^^^^^^^^^^^^^^^

What exactly does autism have to do with any of this? I know some higher
functioning autistic people, and quite frankly, they are some of the most
brilliant individuals I have ever had the pleasure to be around. They can
tear a problem apart by visualizing every aspect of it in there mind. They
seem to be able to think in highly detailed pictures. It's extremely neat
and I wish I had a fraction of their abilities.

I don't understand why people say "my autism is a good thing", and
it's analogical in a troubling way to allowing software to be deviant
because it was developed fast. Autoids are people without working
parts, like the Columbia space shuttle shedding foam panels on
liftoff.

But: I noticed, as did several other programmers (including an early
British programmer writing about real programming in a Penguin
collection of essays about ordinary Work), that programming attracted,
and retained, many disturbing individuals.

Perhaps society over-relied on software if its "fast" development
required focus in excess of normal, and perhaps we really don't need
people so hypertrophied in one dimension.

Nick Keighley · Feb 28, 2010

On Feb 26, 5:15 pm, Nick Keighley <[email protected]>

Bad. We need stand up guys here.

Click to expand...

is a "stand up guy" different from a "regular guy"? I'm afraid I don't
speak american.

[...] the length may be inexpressible if it exceeds
what in C is called "long long" precision. OO systems handle this
cleanly: it gives C the willies.

Click to expand...

Click to expand...

OO isn't magic. If there is a way to do this then C will be able to do
it as well. Yeah, I know I'm drifting into "all useful programming
languages are turing complete and therefore in a sense equivalent". [...]
But I'll argue that this particular problem doesn't require the C
programmer to implement an interpreter.

Click to expand...

"Interpreter" has become to mean "does extra stuff".

Click to expand...

I considered using "translator", I wasn't insisting on a particular
implementation technology but on the concept that one language can be
used to implement another.

But that's not
what an interpreter is. One of the least celebrated discoveries in
computer science was made by Bell Northern Research developers in the
1970s: threaded code and just in time, which solves the problem that
interpretation has to interpret everything all the time.

[a nul terminated string] mandates Eurocentric left to right processing
nope. You are confusing representation and presentation. The nul isn't
at the right hand end it's at the largest address. If the display
device chooses to print r-to-l instead of l-to-r it makes not [the]
blindest bit of difference.
What on EARTH does the word "print" mean?

Click to expand...

Click to expand...

to make marks on a piece of paper.

Click to expand...

You still do that? Your only output device is a printer? What, are you
logged on to the Web using a Western Union teletypewriter? Wow.

Click to expand...

fine. So get yourself a better web client.

You say that, but use the word "print".

Click to expand...

so what did you mean by "mandates left to right processing"

No, it's not, because can is not will. I certainly know, living in
China, that the invention of the bit mapped display and laser printer
was great for Asia.

Click to expand...

good

<snip>

Julienne Walker · Feb 28, 2010

At the risk of getting off on a tangent, I'm wondering if you read
that chapter carefully, or the book from which the example was taken.
The requirements for the "regular expression package" were, and I
quote:

Click to expand...

"I suggested to Rob that we find the smallest regular expression
package that would illustrate the basic ideas while still recognizing
a useful and nontrivial class of patterns. Ideally, the code would fit
on a single page."

Click to expand...

I read that passage. However, I'd encountered "regular expressions" in
1971 when I read "Formal Languages and Their Relation to Automata" by
Hopcroft and Ullman before I had access to a computer. I regard myself
as privileged to start with math before wide availability of
computers, since I had to figure more out.

In this book, a "regular expression" was not a regular expression
unless it was a complete fullfunction regular expression, because it
was REQUIRED to be able to parse any regular string. What happened IMO
was that Kernighan Pike et al. seized upon the concept, stole it from
their Princeton/Bell Labs coworkers, and used it to designate a hack:
the unix grep horror.

<snip>

The problem is that "regular expression" already has a meaning, and
Rob's "regular expression package that would ... recognize [some]
useful ... patterns" was NOT a regular expression recognizer. In a
university class, where the assignment was "develop a regular
expression package that parses and applies regular expressions", Rob
would have received an F.

So my suspicion was partially correct. You're judging the code based
on a different specification than the one intended. Kernighan and Pike
clearly meant "grep-style" regular expressions rather than
"mathematical" regular expressions. Seeing as how the book was written
in 1998 and the term "regular expression" has solidified as meaning
the former, it's not unreasonable for them to neglect defining it.

To avoid being nonconstructive in my criticism, I'll offer you another
exercise in your relearning of C: write a full and true mathematical
regular expression parser that meets Kernighan's stated requirements
of being small enough for inclusion in that particular section
("Ideally, the code would fit on a single page."), and illustrates the
basic ideas of regular expressions as a simplifying notation in
problem solving. I won't ask that it be done in an hour or two,
because how long it takes is not an indicator of quality. ;-)

Richard Bos · Mar 1, 2010

Seebs said:
People who have dealt with outsourcing are very likely to have encountered
a disproportionately bad sample of programmers from some other country.
I have not known anyone who has had particularly good experiences with
"cheap offshore coders", and I have oodles of horror stories.

Possibly, and I should really have said "programming education culture"
or "official programming culture". I, too, have encountered more than
one very capable Indian programmer, and even more capable Indian
sysadmins. However, from talking to them, I usually get the impression
that they are so in spite of, not thanks to, programming education in
India. Not a few of them got their tertiary education in the UK or the
USA, not in India.
As for the demonstration of the lack of quality otherwise in their IT
colleges, well... that should be obvious from the many posts in this
newsgroup, coming from that country, asking for answers to completely
wrong questions.

Richard

Chris M. Thomasson · Mar 1, 2010

[...]

However, nobody but me seems to notice that part of the general
culture of these newsgroup is amnesia about traditional ethics owing
to its replacement by an ironic hacker ethos. In that ethos, it seems
to be OK to destroy individual standing, because in what Jared Lanier
calls "digital Maoism" all that matters is group "consensus".

Click to expand...

Components of the hacker pseudo-ethos:

Click to expand...

* Running code (and the "rough" consensus of the Lynch law) as
opposed to correct software
* Treatment of artifacts such as computers and abstractions as more
important than human beings
* Autism

Click to expand...

^^^^^^^^^^^^^^^^^

What exactly does autism have to do with any of this? I know some higher
functioning autistic people, and quite frankly, they are some of the
most
brilliant individuals I have ever had the pleasure to be around. They
can
tear a problem apart by visualizing every aspect of it in there mind.
They
seem to be able to think in highly detailed pictures. It's extremely
neat
and I wish I had a fraction of their abilities.

Click to expand...

I don't understand why people say "my autism is a good thing",

Like I said... I wish I had some of the abilities that autistic people have.
I would love to have the ability to think in pictures like some autistic
individuals can do. Are you familiar with the story of Temple Grandin? She
has a PH.D, and she did not even talk until she was around 3 or 4 years old.
I suggest you read about here before you continue on making a complete ass
out of yourself. I would LOVE to have some of here abilities!

[...]

Ike Naar · Mar 1, 2010

People who have dealt with outsourcing are very likely to have encountered
a disproportionately bad sample of programmers from some other country.
I have not known anyone who has had particularly good experiences with
"cheap offshore coders", and I have oodles of horror stories.

^^^^^^^^^^^^^^^^^^^^^
You mean "sea programmers" ?

spinoza1111 · Mar 5, 2010

I read that passage. However, I'd encountered "regular expressions" in
1971 when I read "Formal Languages and Their Relation to Automata" by
Hopcroft and Ullman before I had access to a computer. I regard myself
as privileged to start with math before wide availability of
computers, since I had to figure more out.

Click to expand...

In this book, a "regular expression" was not a regular expression
unless it was a complete fullfunction regular expression, because it
was REQUIRED to be able to parse any regular string. What happened IMO
was that Kernighan Pike et al. seized upon the concept, stole it from
their Princeton/Bell Labs coworkers, and used it to designate a hack:
the unix grep horror.

<snip>

Click to expand...

The problem is that "regular expression" already has a meaning, and
Rob's "regular expression package that would ... recognize [some]
useful ... patterns" was NOT a regular expression recognizer. In a
university class, where the assignment was "develop a regular
expression package that parses and applies regular expressions", Rob
would have received an F.

Click to expand...

So my suspicion was partially correct. You're judging the code based
on a different specification than the one intended. Kernighan and Pike
clearly meant "grep-style" regular expressions rather than
"mathematical" regular expressions. Seeing as how the book was written
in 1998 and the term "regular expression" has solidified as meaning
the former, it's not unreasonable for them to neglect defining it.

To avoid being nonconstructive in my criticism, I'll offer you another
exercise in your relearning of C: write a full and true mathematical
regular expression parser that meets Kernighan's stated requirements
of being small enough for inclusion in that particular section
("Ideally, the code would fit on a single page."), and illustrates the
basic ideas of regular expressions as a simplifying notation in
problem solving. I won't ask that it be done in an hour or two,
because how long it takes is not an indicator of quality. ;-)

Very constructive.

I will start on it this weekend and log my time. Thanks for an
interesting challenge.

Oh, and I have found nobody calling Richard Heathfield or Peter on
their behavior. Therefore I shall contact their publishers and
Seebach's employer.

blmblm · Mar 6, 2010

On Feb 26, 3:39 am, Seebs <[email protected]> wrote:

[ snip ]

Something is Wrong, all right. It's that you shit on people and pay
your way onto standards boards without any qualifications that are
independently verifiable of whatever makes your current employer a
quick buck.

As I understand things, based on descriptions I've read here of how
people come to be on the committee responsible for the C standard(s),
the same critique could be applied to any member of said committee --
anyone who pays the appropriate dues can attend meetings. (Perhaps
someone can confirm this, or correct me.)

[ snip ]

blmblm · Mar 6, 2010

No, though it is a detail. I thought Seebs was making a point ("if
you code like this you'll make mistakes like this one I remember") but
I could be wrong about that.

You could add it to your examples for your students to have to debug
with and without -Wall: + binds more tightly than << and >> in C.

Oh my! I sure didn't spot that one. (So, what *was* that line
supposed to accomplish?)

(But then, I've never tried very hard to keep in my head the
relative precedence of operators -- I'm inclined to use parentheses

if there's *any* doubt in my mind said:
Aside: this is a hard one to remember unless you know C++:

cout << 3 + i + i;

That rather reminds me of one of my favorite examples of a C++
beginner mistake that requires a more-than-beginner understanding
to explain -- trying to read values into two variables a and b thus:

cin >> a, b;

Beginners are surprised that only one value is read ....

[ snip ]

blmblm · Mar 6, 2010

Ben Bacarisse wrote:
)>< snip >
)> willem (O2) 2.77 seconds
)> willem (O3) 4.16 seconds [ can this be right?! ]
)
) Looks wacky to me! Is it repeatable?

Bwahahaha! I love it!

) Here are my times (also gcc 4.4.1 and libc 2.10.1). I seem to have a
) faster machine. The first number are your times (for reference) and
) the second are mine (in seconds). The third column is the ratio of
) the two. You can see that there is more going on than just the speed
) of the machine.
)
) <snip>
) willem (O2) 2.77 0.813 3.41
) willem (O3) 4.16 0.885 4.70
)
) If we are now measuring the same things, it seems that some code is
) favoured by my system (yours for example) and some does not do so
) well. I suspect interactions with the various caches but that is a
) huge guess.

This is quite interesting! I would really like to see the generated
assembly for -O2 and -O3 for my code. I guess I can retrieve my code
from the usenet archive and compile it, but I don't know which of the
two solutions I posted was tested here. (The iterative or the recursive
one ?)

(I hope you're still interested ....)

The recursive one, I think -- anyway the one from message ID

<[email protected]>

I'll put the code and the output of "gcc -S -O2" and "gcc -S -O3"
below. I'll let you try to make sense of the differences. Mostly
what I notice is that the -O3 version is quite a bit longer.

PS: For testing you would also need different match patterns, including
some that contain repeated strings or stuff like that, especially
if you're comparing 'smart' against 'dumb' algorithms.

Yes .... I think Ben's tests are probably better-thought-out than
mine. Ah well.

==== repstr.c ====

#include <stdlib.h>

char *repstr(char *s, char *m, char *r, size_t tl)
{
char *t;
size_t sl = 0, ml = 0, rl;
while (s[sl]) {
if (m[ml] == 0) {
for (rl = 0; r[rl]; rl++) ;
if (t = repstr(s+sl+ml, m, r, tl+sl+rl)) {
while (rl--) { t[tl+sl+rl] = r[rl]; }
while (sl--) { t[tl+sl] = s[sl]; }
}
return t;
}
if (s[sl+ml] != m[ml]) {
sl++;
ml = 0;
} else {
ml++;
}
}
if (t = malloc(tl + sl + 1)) {
do { t[tl+sl] = s[sl]; } while (sl--);
}
return t;
}

==== repstr.s created with gcc -S -O2 ====

.file "repstr.c"
.text
.p2align 4,,15
..globl repstr
.type repstr, @function
repstr:
pushl %ebp
xorl %eax, %eax
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
xorl %ebx, %ebx
subl $44, %esp
movl 8(%ebp), %edi
movzbl (%edi), %esi
..L20:
movl %esi, %edx
testb %dl, %dl
je .L22
..L12:
movl 12(%ebp), %ecx
movzbl (%ecx,%eax), %edx
testb %dl, %dl
je .L23
leal (%edi,%eax), %ecx
cmpb (%ecx,%ebx), %dl
je .L11
addl $1, %ebx
xorl %eax, %eax
movzbl (%edi,%ebx), %esi
movl %esi, %edx
testb %dl, %dl
jne .L12
..L22:
movl 20(%ebp), %ecx
leal 1(%ebx,%ecx), %eax
movl %eax, (%esp)
call malloc
testl %eax, %eax
movl %eax, -28(%ebp)
je .L7
movl 20(%ebp), %eax
addl %ebx, %edi
leal (%ebx,%eax), %eax
addl -28(%ebp), %eax
.p2align 4,,7
.p2align 3
..L13:
movzbl (%edi), %edx
subl $1, %ebx
subl $1, %edi
movb %dl, (%eax)
subl $1, %eax
cmpl $-1, %ebx
jne .L13
..L7:
movl -28(%ebp), %eax
addl $44, %esp
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
.p2align 4,,7
.p2align 3
..L11:
addl $1, %eax
jmp .L20
.p2align 4,,7
.p2align 3
..L23:
movl 16(%ebp), %edx
xorl %esi, %esi
cmpb $0, (%edx)
je .L5
.p2align 4,,7
.p2align 3
..L15:
addl $1, %esi
cmpb $0, (%edx,%esi)
jne .L15
..L5:
movl 20(%ebp), %edx
addl %ebx, %eax
leal (%edi,%eax), %eax
leal (%ebx,%edx), %edx
leal (%esi,%edx), %ecx
movl %ecx, 12(%esp)
movl 16(%ebp), %ecx
movl %ecx, 8(%esp)
movl 12(%ebp), %ecx
movl %eax, (%esp)
movl %ecx, 4(%esp)
movl %edx, -36(%ebp)
call repstr
movl -36(%ebp), %edx
testl %eax, %eax
movl %eax, -28(%ebp)
je .L7
testl %esi, %esi
je .L8
leal -1(%esi,%edx), %edx
addl %eax, %edx
xorl %eax, %eax
movl %edx, -32(%ebp)
movl 16(%ebp), %edx
leal (%edx,%esi), %ecx
movl -32(%ebp), %edx
movl %ebx, -32(%ebp)
.p2align 4,,7
.p2align 3
..L9:
movzbl -1(%ecx,%eax), %ebx
subl $1, %eax
movb %bl, (%edx)
leal (%eax,%esi), %ebx
subl $1, %edx
testl %ebx, %ebx
jne .L9
movl -32(%ebp), %ebx
..L8:
testl %ebx, %ebx
je .L7
movl 20(%ebp), %ecx
xorl %eax, %eax
leal (%edi,%ebx), %esi
leal -1(%ebx,%ecx), %edx
addl -28(%ebp), %edx
.p2align 4,,7
.p2align 3
..L10:
movzbl -1(%esi,%eax), %ecx
subl $1, %eax
movb %cl, (%edx)
leal (%eax,%ebx), %ecx
subl $1, %edx
testl %ecx, %ecx
jne .L10
movl -28(%ebp), %eax
addl $44, %esp
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
.size repstr, .-repstr
.ident "GCC: (GNU) 4.4.1 20090725 (Red Hat 4.4.1-2)"
.section .note.GNU-stack,"",@progbits

==== repstr.s created with gcc -S -O3 ====

.file "repstr.c"
.text
.p2align 4,,15
..globl repstr
.type repstr, @function
repstr:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
xorl %ebx, %ebx
subl $172, %esp
movl 8(%ebp), %edx
movl 12(%ebp), %eax
movzbl (%edx), %edi
xorl %edx, %edx
..L105:
movl %edi, %ecx
testb %cl, %cl
je .L112
..L67:
movzbl (%eax,%edx), %ecx
testb %cl, %cl
je .L113
movl 8(%ebp), %esi
addl %edx, %esi
cmpb (%esi,%ebx), %cl
je .L66
movl 8(%ebp), %esi
addl $1, %ebx
xorl %edx, %edx
movzbl (%esi,%ebx), %edi
movl %edi, %ecx
testb %cl, %cl
jne .L67
..L112:
movl 20(%ebp), %esi
leal 1(%ebx,%esi), %eax
movl %eax, (%esp)
call malloc
testl %eax, %eax
movl %eax, -28(%ebp)
je .L32
movl 8(%ebp), %edx
leal (%ebx,%esi), %eax
addl -28(%ebp), %eax
addl %ebx, %edx
.p2align 4,,7
.p2align 3
..L68:
movzbl (%edx), %ecx
subl $1, %ebx
subl $1, %edx
movb %cl, (%eax)
subl $1, %eax
cmpl $-1, %ebx
jne .L68
..L32:
movl -28(%ebp), %eax
addl $172, %esp
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
.p2align 4,,7
.p2align 3
..L66:
addl $1, %edx
jmp .L105
.p2align 4,,7
.p2align 3
..L113:
movl 16(%ebp), %ecx
xorl %edi, %edi
movzbl (%ecx), %ecx
testb %cl, %cl
movb %cl, -56(%ebp)
je .L5
movl 16(%ebp), %ecx
.p2align 4,,7
.p2align 3
..L80:
addl $1, %edi
cmpb $0, (%ecx,%edi)
jne .L80
..L5:
movl 20(%ebp), %esi
addl %ebx, %edx
addl 8(%ebp), %edx
addl %ebx, %esi
movl %esi, -72(%ebp)
addl %edi, %esi
movl %esi, -68(%ebp)
xorl %esi, %esi
movl %edx, -32(%ebp)
movzbl (%edx), %edx
movl %edi, -40(%ebp)
movb %dl, -28(%ebp)
xorl %edx, %edx
..L106:
cmpb $0, -28(%ebp)
je .L114
..L62:
movzbl (%eax,%edx), %ecx
testb %cl, %cl
je .L115
movl -32(%ebp), %edi
addl %edx, %edi
cmpb (%edi,%esi), %cl
je .L61
movl -32(%ebp), %edx
addl $1, %esi
movzbl (%edx,%esi), %edx
movb %dl, -28(%ebp)
xorl %edx, %edx
cmpb $0, -28(%ebp)
jne .L62
..L114:
movl -68(%ebp), %ecx
movl -40(%ebp), %edi
leal 1(%esi,%ecx), %eax
movl %eax, (%esp)
call malloc
testl %eax, %eax
movl %eax, -28(%ebp)
je .L32
movl -68(%ebp), %eax
movl -32(%ebp), %edx
leal (%esi,%eax), %eax
addl -28(%ebp), %eax
addl %esi, %edx
.p2align 4,,7
.p2align 3
..L63:
movzbl (%edx), %ecx
subl $1, %esi
subl $1, %edx
movb %cl, (%eax)
subl $1, %eax
cmpl $-1, %esi
jne .L63
..L59:
testl %edi, %edi
je .L73
movl -72(%ebp), %ecx
xorl %eax, %eax
leal -1(%edi,%ecx), %edx
movl 16(%ebp), %ecx
addl -28(%ebp), %edx
leal (%ecx,%edi), %esi
.p2align 4,,7
.p2align 3
..L64:
movzbl -1(%esi,%eax), %ecx
subl $1, %eax
movb %cl, (%edx)
leal (%eax,%edi), %ecx
subl $1, %edx
testl %ecx, %ecx
jne .L64
..L73:
testl %ebx, %ebx
je .L32
movl 20(%ebp), %esi
xorl %eax, %eax
movl 8(%ebp), %ecx
leal -1(%ebx,%esi), %edx
addl -28(%ebp), %edx
leal (%ecx,%ebx), %esi
.p2align 4,,7
.p2align 3
..L65:
movzbl -1(%esi,%eax), %ecx
subl $1, %eax
movb %cl, (%edx)
leal (%eax,%ebx), %ecx
subl $1, %edx
testl %ecx, %ecx
jne .L65
movl -28(%ebp), %eax
addl $172, %esp
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
.p2align 4,,7
.p2align 3
..L61:
addl $1, %edx
jmp .L106
.p2align 4,,7
.p2align 3
..L115:
cmpb $0, -56(%ebp)
movl -40(%ebp), %edi
movl $0, -60(%ebp)
je .L10
movl 16(%ebp), %ecx
movl %edx, -28(%ebp)
xorl %edx, %edx
.p2align 4,,7
.p2align 3
..L79:
addl $1, %edx
cmpb $0, (%ecx,%edx)
jne .L79
movl %edx, -60(%ebp)
movl -28(%ebp), %edx
..L10:
movl -68(%ebp), %ecx
leal (%esi,%edx), %edx
addl -32(%ebp), %edx
addl %esi, %ecx
movl %ecx, -84(%ebp)
movl -60(%ebp), %ecx
addl -84(%ebp), %ecx
movl %edx, -40(%ebp)
movl %ecx, -88(%ebp)
movzbl (%edx), %edx
movl $0, -28(%ebp)
movl %esi, -48(%ebp)
movb %dl, -36(%ebp)
xorl %edx, %edx
..L107:
cmpb $0, -36(%ebp)
je .L116
..L56:
movl -28(%ebp), %esi
movzbl (%eax,%esi), %ecx
testb %cl, %cl
je .L117
movl -40(%ebp), %esi
addl -28(%ebp), %esi
cmpb (%esi,%edx), %cl
je .L55
movl -40(%ebp), %ecx
addl $1, %edx
movzbl (%ecx,%edx), %ecx
movl $0, -28(%ebp)
movb %cl, -36(%ebp)
cmpb $0, -36(%ebp)
jne .L56
..L116:
movl -88(%ebp), %ecx
movl -48(%ebp), %esi
leal 1(%edx,%ecx), %eax
movl %eax, (%esp)
movl %edx, -136(%ebp)
call malloc
movl -136(%ebp), %edx
testl %eax, %eax
movl %eax, -28(%ebp)
je .L32
movl -88(%ebp), %ecx
movl -40(%ebp), %eax
movl %esi, -40(%ebp)
movl %ebx, %esi
leal (%edx,%ecx), %ecx
addl -28(%ebp), %ecx
addl %edx, %eax
.p2align 4,,7
.p2align 3
..L57:
movzbl (%eax), %ebx
subl $1, %edx
subl $1, %eax
movb %bl, (%ecx)
subl $1, %ecx
cmpl $-1, %edx
jne .L57
movl %esi, %ebx
movl -40(%ebp), %esi
..L53:
movl -60(%ebp), %eax
testl %eax, %eax
je .L72
movl -60(%ebp), %eax
movl -84(%ebp), %ecx
movl %esi, -40(%ebp)
movl %ebx, -44(%ebp)
leal -1(%eax,%ecx), %edx
addl -28(%ebp), %edx
movl %edx, -36(%ebp)
movl 16(%ebp), %edx
leal (%edx,%eax), %ecx
movl -36(%ebp), %edx
xorl %eax, %eax
movl %edi, -36(%ebp)
movl -60(%ebp), %edi
.p2align 4,,7
.p2align 3
..L58:
movzbl -1(%ecx,%eax), %esi
subl $1, %eax
movl %esi, %ebx
leal (%eax,%edi), %esi
movb %bl, (%edx)
subl $1, %edx
testl %esi, %esi
jne .L58
movl -40(%ebp), %esi
movl -36(%ebp), %edi
movl -44(%ebp), %ebx
..L72:
testl %esi, %esi
je .L59
movl -68(%ebp), %eax
leal -1(%esi,%eax), %edx
xorl %eax, %eax
addl -28(%ebp), %edx
movl %edx, -36(%ebp)
movl -32(%ebp), %edx
movl %edi, -32(%ebp)
leal (%edx,%esi), %ecx
movl -36(%ebp), %edx
movl %ebx, -36(%ebp)
.p2align 4,,7
.p2align 3
..L60:
movzbl -1(%ecx,%eax), %edi
subl $1, %eax
movl %edi, %ebx
leal (%eax,%esi), %edi
movb %bl, (%edx)
subl $1, %edx
testl %edi, %edi
jne .L60
movl -32(%ebp), %edi
movl -36(%ebp), %ebx
jmp .L59
.p2align 4,,7
.p2align 3
..L55:
addl $1, -28(%ebp)
jmp .L107
.p2align 4,,7
.p2align 3
..L117:
cmpb $0, -56(%ebp)
movl -48(%ebp), %esi
movl $0, -76(%ebp)
je .L15
movl 16(%ebp), %ecx
movl %edx, -36(%ebp)
xorl %edx, %edx
.p2align 4,,7
.p2align 3
..L78:
addl $1, %edx
cmpb $0, (%ecx,%edx)
jne .L78
movl %edx, -76(%ebp)
movl -36(%ebp), %edx
..L15:
movl -88(%ebp), %ecx
addl %edx, %ecx
movl %ecx, -104(%ebp)
movl -76(%ebp), %ecx
addl -104(%ebp), %ecx
movl %ecx, -100(%ebp)
movl -28(%ebp), %ecx
leal (%edx,%ecx), %ecx
addl -40(%ebp), %ecx
movl %ecx, -48(%ebp)
movzbl (%ecx), %ecx
movl $0, -28(%ebp)
movl $0, -36(%ebp)
movl %edx, -64(%ebp)
movb %cl, -44(%ebp)
movl %ebx, -52(%ebp)
..L108:
cmpb $0, -44(%ebp)
je .L118
..L50:
movl -28(%ebp), %ebx
movzbl (%eax,%ebx), %edx
testb %dl, %dl
je .L119
movl -48(%ebp), %ecx
movl -36(%ebp), %ebx
addl -28(%ebp), %ecx
cmpb (%ecx,%ebx), %dl
je .L49
movl -48(%ebp), %edx
addl $1, %ebx
movl %ebx, -36(%ebp)
movzbl (%edx,%ebx), %edx
movl $0, -28(%ebp)
movb %dl, -44(%ebp)
cmpb $0, -44(%ebp)
jne .L50
..L118:
movl -36(%ebp), %eax
movl -100(%ebp), %ecx
movl -64(%ebp), %edx
movl -52(%ebp), %ebx
leal 1(%eax,%ecx), %ecx
movl %ecx, (%esp)
movl %edx, -136(%ebp)
call malloc
movl -136(%ebp), %edx
testl %eax, %eax
movl %eax, -28(%ebp)
je .L32
movl -36(%ebp), %ecx
addl -100(%ebp), %ecx
addl -28(%ebp), %ecx
movl -48(%ebp), %eax
addl -36(%ebp), %eax
movl %edx, -48(%ebp)
movl -36(%ebp), %edx
movl %esi, -36(%ebp)
movl %ebx, %esi
.p2align 4,,7
.p2align 3
..L51:
movzbl (%eax), %ebx
subl $1, %edx
subl $1, %eax
movb %bl, (%ecx)
subl $1, %ecx
cmpl $-1, %edx
jne .L51
movl %esi, %ebx
movl -48(%ebp), %edx
movl -36(%ebp), %esi
..L47:
movl -76(%ebp), %ecx
testl %ecx, %ecx
je .L71
movl -76(%ebp), %ecx
movl -104(%ebp), %eax
movl %edi, -56(%ebp)
movl %edx, -48(%ebp)
leal -1(%ecx,%eax), %eax
addl -28(%ebp), %eax
movl %eax, -44(%ebp)
movl 16(%ebp), %eax
addl %ecx, %eax
movl -44(%ebp), %ecx
movl %eax, -36(%ebp)
movl -36(%ebp), %edi
xorl %eax, %eax
movl %esi, -44(%ebp)
movl -76(%ebp), %esi
.p2align 4,,7
.p2align 3
..L52:
movzbl -1(%edi,%eax), %edx
subl $1, %eax
movb %dl, (%ecx)
leal (%eax,%esi), %edx
subl $1, %ecx
testl %edx, %edx
jne .L52
movl -48(%ebp), %edx
movl -44(%ebp), %esi
movl -56(%ebp), %edi
..L71:
testl %edx, %edx
je .L53
movl -88(%ebp), %eax
movl %edi, -48(%ebp)
movl %ebx, -44(%ebp)
leal -1(%edx,%eax), %ecx
xorl %eax, %eax
addl -28(%ebp), %ecx
movl %ecx, -36(%ebp)
movl -40(%ebp), %ecx
addl %edx, %ecx
movl %ecx, -40(%ebp)
movl -40(%ebp), %edi
movl -36(%ebp), %ecx
movl %esi, -36(%ebp)
.p2align 4,,7
.p2align 3
..L54:
movzbl -1(%edi,%eax), %esi
subl $1, %eax
movl %esi, %ebx
leal (%eax,%edx), %esi
movb %bl, (%ecx)
subl $1, %ecx
testl %esi, %esi
jne .L54
movl -36(%ebp), %esi
movl -48(%ebp), %edi
movl -44(%ebp), %ebx
jmp .L53
.p2align 4,,7
.p2align 3
..L49:
addl $1, -28(%ebp)
jmp .L108
..L119:
cmpb $0, -56(%ebp)
movl -64(%ebp), %edx
movl -52(%ebp), %ebx
movl $0, -96(%ebp)
je .L20
movl 16(%ebp), %ecx
movl %edx, -44(%ebp)
xorl %edx, %edx
.p2align 4,,7
.p2align 3
..L77:
addl $1, %edx
cmpb $0, (%ecx,%edx)
jne .L77
movl %edx, -96(%ebp)
movl -44(%ebp), %edx
..L20:
movl -36(%ebp), %ecx
addl -100(%ebp), %ecx
movl %ecx, -112(%ebp)
movl -96(%ebp), %ecx
addl -112(%ebp), %ecx
movl %ecx, -116(%ebp)
movl -36(%ebp), %ecx
addl -28(%ebp), %ecx
addl -48(%ebp), %ecx
movl %ecx, -64(%ebp)
movzbl (%ecx), %ecx
movl $0, -28(%ebp)
movl $0, -44(%ebp)
movl %edx, -80(%ebp)
movb %cl, -52(%ebp)
movl %ebx, -92(%ebp)
..L109:
cmpb $0, -52(%ebp)
je .L120
..L44:
movl -28(%ebp), %ebx
movzbl (%eax,%ebx), %edx
testb %dl, %dl
je .L121
movl -64(%ebp), %ecx
addl -44(%ebp), %ecx
cmpb (%ecx,%ebx), %dl
je .L43
addl $1, -44(%ebp)
movl -64(%ebp), %ecx
movl -44(%ebp), %edx
movzbl (%ecx,%edx), %edx
movl $0, -28(%ebp)
movb %dl, -52(%ebp)
cmpb $0, -52(%ebp)
jne .L44
..L120:
movl -44(%ebp), %ecx
movl -116(%ebp), %eax
movl -80(%ebp), %edx
movl -92(%ebp), %ebx
leal 1(%ecx,%eax), %eax
movl %eax, (%esp)
movl %edx, -136(%ebp)
call malloc
movl -136(%ebp), %edx
testl %eax, %eax
movl %eax, -28(%ebp)
je .L32
movl -44(%ebp), %ecx
addl -116(%ebp), %ecx
addl -28(%ebp), %ecx
movl -64(%ebp), %eax
addl -44(%ebp), %eax
movl %edx, -56(%ebp)
movl -44(%ebp), %edx
movl %esi, -44(%ebp)
movl %ebx, %esi
.p2align 4,,7
.p2align 3
..L45:
movzbl (%eax), %ebx
subl $1, %edx
subl $1, %eax
movb %bl, (%ecx)
subl $1, %ecx
cmpl $-1, %edx
jne .L45
movl %esi, %ebx
movl -56(%ebp), %edx
movl -44(%ebp), %esi
..L41:
movl -96(%ebp), %eax
testl %eax, %eax
je .L70
movl -96(%ebp), %eax
movl -112(%ebp), %ecx
movl %esi, -64(%ebp)
movl -96(%ebp), %esi
movl %edx, -56(%ebp)
leal -1(%eax,%ecx), %ecx
movl -28(%ebp), %eax
addl %ecx, %eax
movl -96(%ebp), %ecx
movl %eax, -52(%ebp)
movl 16(%ebp), %eax
leal (%eax,%ecx), %ecx
xorl %eax, %eax
movl %ecx, -44(%ebp)
movl -52(%ebp), %ecx
movl %edi, -52(%ebp)
movl -44(%ebp), %edi
.p2align 4,,7
.p2align 3
..L46:
movzbl -1(%edi,%eax), %edx
subl $1, %eax
movb %dl, (%ecx)
leal (%eax,%esi), %edx
subl $1, %ecx
testl %edx, %edx
jne .L46
movl -56(%ebp), %edx
movl -64(%ebp), %esi
movl -52(%ebp), %edi
..L70:
movl -36(%ebp), %eax
testl %eax, %eax
je .L47
movl -36(%ebp), %eax
movl -100(%ebp), %ecx
movl %esi, -56(%ebp)
movl -36(%ebp), %esi
movl $0, -52(%ebp)
leal -1(%eax,%ecx), %ecx
movl -28(%ebp), %eax
addl %ecx, %eax
movl -36(%ebp), %ecx
movl %eax, -44(%ebp)
movl -48(%ebp), %eax
movl %edi, -36(%ebp)
leal (%eax,%ecx), %ecx
xorl %eax, %eax
movl %ecx, -48(%ebp)
movl -48(%ebp), %edi
movl -44(%ebp), %ecx
movl %edx, -44(%ebp)
.p2align 4,,7
.p2align 3
..L48:
movzbl -1(%edi,%eax), %edx
subl $1, %eax
movb %dl, (%ecx)
leal (%eax,%esi), %edx
subl $1, %ecx
testl %edx, %edx
jne .L48
movl -44(%ebp), %edx
movl -56(%ebp), %esi
movl -36(%ebp), %edi
jmp .L47
.p2align 4,,7
.p2align 3
..L43:
addl $1, -28(%ebp)
jmp .L109
..L121:
cmpb $0, -56(%ebp)
movl -80(%ebp), %edx
movl -92(%ebp), %ebx
movl $0, -108(%ebp)
je .L25
movl 16(%ebp), %ecx
movl %edx, -52(%ebp)
xorl %edx, %edx
.p2align 4,,7
.p2align 3
..L76:
addl $1, %edx
cmpb $0, (%ecx,%edx)
jne .L76
movl %edx, -108(%ebp)
movl -52(%ebp), %edx
..L25:
movl -44(%ebp), %ecx
addl -116(%ebp), %ecx
movl %ecx, -124(%ebp)
movl -108(%ebp), %ecx
addl -124(%ebp), %ecx
movl %ecx, -120(%ebp)
movl -44(%ebp), %ecx
addl -28(%ebp), %ecx
addl -64(%ebp), %ecx
movl %ecx, -80(%ebp)
movzbl (%ecx), %ecx
movl $0, -28(%ebp)
movl $0, -52(%ebp)
movl %edx, -128(%ebp)
movb %cl, -92(%ebp)
movl %ebx, -132(%ebp)
..L110:
cmpb $0, -92(%ebp)
je .L122
..L38:
movl -28(%ebp), %ebx
movzbl (%eax,%ebx), %edx
testb %dl, %dl
je .L123
movl -80(%ebp), %ecx
movl -52(%ebp), %ebx
addl -28(%ebp), %ecx
cmpb (%ecx,%ebx), %dl
je .L37
movl -80(%ebp), %edx
addl $1, %ebx
movl %ebx, -52(%ebp)
movzbl (%edx,%ebx), %edx
movl $0, -28(%ebp)
movb %dl, -92(%ebp)
cmpb $0, -92(%ebp)
jne .L38
..L122:
movl -52(%ebp), %eax
movl -120(%ebp), %ecx
movl -128(%ebp), %edx
movl -132(%ebp), %ebx
leal 1(%eax,%ecx), %ecx
movl %ecx, (%esp)
movl %edx, -136(%ebp)
call malloc
movl -136(%ebp), %edx
testl %eax, %eax
movl %eax, -28(%ebp)
je .L32
movl -52(%ebp), %ecx
addl -120(%ebp), %ecx
addl -28(%ebp), %ecx
movl -80(%ebp), %eax
addl -52(%ebp), %eax
movl %edx, -56(%ebp)
movl -52(%ebp), %edx
movl %esi, -52(%ebp)
movl %ebx, %esi
..L39:
movzbl (%eax), %ebx
subl $1, %edx
subl $1, %eax
movb %bl, (%ecx)
subl $1, %ecx
cmpl $-1, %edx
jne .L39
movl %esi, %ebx
movl -56(%ebp), %edx
movl -52(%ebp), %esi
..L35:
movl -108(%ebp), %ecx
testl %ecx, %ecx
je .L69
movl -108(%ebp), %ecx
movl -124(%ebp), %eax
movl %esi, -80(%ebp)
movl -108(%ebp), %esi
movl %edi, -92(%ebp)
leal -1(%ecx,%eax), %eax
movl -28(%ebp), %ecx
addl %eax, %ecx
movl -108(%ebp), %eax
movl %ecx, -52(%ebp)
movl 16(%ebp), %ecx
leal (%ecx,%eax), %eax
movl -52(%ebp), %ecx
movl %eax, -56(%ebp)
movl -56(%ebp), %edi
xorl %eax, %eax
movl %edx, -52(%ebp)
.p2align 4,,7
.p2align 3
..L40:
movzbl -1(%edi,%eax), %edx
subl $1, %eax
movb %dl, (%ecx)
leal (%eax,%esi), %edx
subl $1, %ecx
testl %edx, %edx
jne .L40
movl -52(%ebp), %edx
movl -80(%ebp), %esi
movl -92(%ebp), %edi
..L69:
movl -44(%ebp), %ecx
testl %ecx, %ecx
je .L41
movl -44(%ebp), %ecx
movl -116(%ebp), %eax
movl $0, -80(%ebp)
leal -1(%ecx,%eax), %eax
movl -28(%ebp), %ecx
addl %eax, %ecx
movl -44(%ebp), %eax
movl %ecx, -52(%ebp)
movl -64(%ebp), %ecx
movl %edx, -64(%ebp)
leal (%ecx,%eax), %eax
movl -52(%ebp), %ecx
movl %eax, -56(%ebp)
xorl %eax, %eax
movl %esi, -52(%ebp)
movl -44(%ebp), %esi
movl %edi, -44(%ebp)
movl -56(%ebp), %edi
.p2align 4,,7
.p2align 3
..L42:
movzbl -1(%edi,%eax), %edx
subl $1, %eax
movb %dl, (%ecx)
leal (%eax,%esi), %edx
subl $1, %ecx
testl %edx, %edx
jne .L42
movl -64(%ebp), %edx
movl -52(%ebp), %esi
movl -44(%ebp), %edi
jmp .L41
..L37:
addl $1, -28(%ebp)
jmp .L110
..L123:
cmpb $0, -56(%ebp)
movl -128(%ebp), %edx
movl -132(%ebp), %ebx
movl $0, -128(%ebp)
je .L30
movl %edx, -56(%ebp)
movl 16(%ebp), %edx
xorl %ecx, %ecx
.p2align 4,,7
.p2align 3
..L75:
addl $1, %ecx
cmpb $0, (%edx,%ecx)
jne .L75
movl -56(%ebp), %edx
movl %ecx, -128(%ebp)
..L30:
movl -52(%ebp), %ecx
addl -120(%ebp), %ecx
movl %ecx, -56(%ebp)
addl -128(%ebp), %ecx
movl %ecx, 12(%esp)
movl 16(%ebp), %ecx
movl %eax, 4(%esp)
movl %ecx, 8(%esp)
movl -52(%ebp), %eax
addl -28(%ebp), %eax
addl -80(%ebp), %eax
movl %eax, (%esp)
movl %edx, -136(%ebp)
call repstr
movl -136(%ebp), %edx
testl %eax, %eax
movl %eax, -28(%ebp)
je .L32
movl -128(%ebp), %ecx
testl %ecx, %ecx
je .L33
movl -128(%ebp), %ecx
movl -56(%ebp), %eax
movl %edi, -132(%ebp)
movl %edx, -128(%ebp)
movl %ebx, -156(%ebp)
leal -1(%ecx,%eax), %eax
addl -28(%ebp), %eax
movl %eax, -56(%ebp)
movl 16(%ebp), %eax
movl -56(%ebp), %edx
movl %esi, -56(%ebp)
addl %ecx, %eax
movl %eax, -92(%ebp)
movl -92(%ebp), %edi
xorl %eax, %eax
.p2align 4,,7
.p2align 3
..L34:
movzbl -1(%edi,%eax), %esi
subl $1, %eax
movl %esi, %ebx
leal (%eax,%ecx), %esi
movb %bl, (%edx)
subl $1, %edx
testl %esi, %esi
jne .L34
movl -128(%ebp), %edx
movl -56(%ebp), %esi
movl -132(%ebp), %edi
movl -156(%ebp), %ebx
..L33:
movl -52(%ebp), %eax
testl %eax, %eax
je .L35
movl -52(%ebp), %ecx
movl -120(%ebp), %eax
movl $0, -120(%ebp)
leal -1(%ecx,%eax), %eax
movl -28(%ebp), %ecx
addl %eax, %ecx
movl -52(%ebp), %eax
movl %ecx, -92(%ebp)
movl -80(%ebp), %ecx
movl %edx, -80(%ebp)
leal (%ecx,%eax), %eax
movl -92(%ebp), %ecx
movl %eax, -56(%ebp)
xorl %eax, %eax
movl %esi, -92(%ebp)
movl -52(%ebp), %esi
movl %edi, -52(%ebp)
movl -56(%ebp), %edi
.p2align 4,,7
.p2align 3
..L36:
movzbl -1(%edi,%eax), %edx
subl $1, %eax
movb %dl, (%ecx)
leal (%eax,%esi), %edx
subl $1, %ecx
testl %edx, %edx
jne .L36
movl -80(%ebp), %edx
movl -92(%ebp), %esi
movl -52(%ebp), %edi
jmp .L35
.size repstr, .-repstr
.ident "GCC: (GNU) 4.4.1 20090725 (Red Hat 4.4.1-2)"
.section .note.GNU-stack,"",@progbits

blmblm · Mar 6, 2010

That, I must admit, totally surprises me. I would have thought that the
cost of strstr() would be trivial compared to the cost of malloc(). I guess
not for some input data!

I guess not .... And it does vary a bit. I'll put full results of
my tests below, in case you're curious. One thing that surprises
me a bit is that apparently exactly which input produces the
most-different results is different for the two systems I tried
my benchmarks on.

Which suggests that, if this would be run often enough, by enough people, who
would be waiting on the output, it is conceivable that it could be worth
spending the extra 8-10 hours of programming effort, plus the lifetime
maintenance effort, for the more complicated code.

Or, alternatively, that it would at least make sense to consider one of the
"don't rescan" options.

Yeah. I think an improvement of, what, 10%? is unlikely to make much
difference in most applications. But I think possibly the point was
not so much to come up with a faster algorithm/implementation as to
come up with one that was somehow more interesting. <shrug>

======== benchmark results on older/slower system ========

==== nilges ====

performing timing tests (4 repeats)

timing (length 4004, changing 2 occurrences of 2 chars to 2 chars) 20000 times
times (seconds): 0.36 0.36 0.36 0.36

timing (length 4020, changing 10 occurrences of 2 chars to 2 chars) 20000 times
times (seconds): 0.39 0.39 0.39 0.40

timing (length 4100, changing 50 occurrences of 2 chars to 2 chars) 20000 times
times (seconds): 0.54 0.54 0.55 0.54

timing (length 4500, changing 50 occurrences of 10 chars to 10 chars) 20000 times
times (seconds): 0.62 0.63 0.63 0.62

16 tests (counting repeats)
0 errors (counting repeats)
total time = 7.69 seconds

==== seebach ====

performing timing tests (4 repeats)

timing (length 4004, changing 2 occurrences of 2 chars to 2 chars) 20000 times
times (seconds): 0.46 0.47 0.46 0.46

timing (length 4020, changing 10 occurrences of 2 chars to 2 chars) 20000 times
times (seconds): 0.49 0.49 0.49 0.49

timing (length 4100, changing 50 occurrences of 2 chars to 2 chars) 20000 times
times (seconds): 0.59 0.59 0.59 0.59

timing (length 4500, changing 50 occurrences of 10 chars to 10 chars) 20000 times
times (seconds): 0.65 0.65 0.65 0.65

16 tests (counting repeats)
0 errors (counting repeats)
total time = 8.76 seconds

======== benchmark results on newer/faster system ========

==== nilges ====

performing timing tests (4 repeats)

timing (length 4004, changing 2 occurrences of 2 chars to 2 chars) 20000 times
times (seconds): 0.10 0.10 0.11 0.11

timing (length 4020, changing 10 occurrences of 2 chars to 2 chars) 20000 times
times (seconds): 0.12 0.11 0.12 0.11

timing (length 4100, changing 50 occurrences of 2 chars to 2 chars) 20000 times
times (seconds): 0.18 0.18 0.18 0.18

timing (length 4500, changing 50 occurrences of 10 chars to 10 chars) 20000 times
times (seconds): 0.19 0.19 0.19 0.19

16 tests (counting repeats)
0 errors (counting repeats)
total time = 2.35 seconds

==== seebach ====

performing timing tests (4 repeats)

timing (length 4004, changing 2 occurrences of 2 chars to 2 chars) 20000 times
times (seconds): 0.10 0.10 0.10 0.10

timing (length 4020, changing 10 occurrences of 2 chars to 2 chars) 20000 times
times (seconds): 0.12 0.12 0.12 0.12

timing (length 4100, changing 50 occurrences of 2 chars to 2 chars) 20000 times
times (seconds): 0.22 0.22 0.22 0.22

timing (length 4500, changing 50 occurrences of 10 chars to 10 chars) 20000 times
times (seconds): 0.34 0.34 0.34 0.34

16 tests (counting repeats)
0 errors (counting repeats)
total time = 3.14 seconds

blmblm · Mar 6, 2010

If I may intrude,

Why not? this is kind of interesting subthread, and while not
entirely topical is perhaps less off-topic than some ....

there are more possibilities.

(4) Replace every occurence of the substring by the replacement
string. There are two occurrences of ana in banana; replace each
with xno yielding bxnoxno.

But how is that a replacement? you replace two occurrences of a
three-letter string with a three-letter string, which *I* would think
should yield something with the same length as the original, but
length(bxnoxno) > length(banana). ?

(5) There can only be overlap if the original substring has the
pattern PQP. The overlap sequence will have the pattern P(QP)*.
In your example, the pattern is PQPQP. Replace the entire
overlap pattern by the replacement, yielding bxno.

Good observation (about the pattern), but again, you've changed the
length -- ?

(6) If the replacement string has the pattern R(SR)* and the
repetition count is the same replace the initial P with R and
each PQ with SR. This does not apply to your example; rather it
is a special case that does apply to replacement ono.

This is the only case in which attempting to do something with
overlapping strings makes sense to me. YMMV, maybe!

Fibonacci	0	May 13, 2023
Blue J Ciphertext Program	2	Nov 22, 2023
Adding adressing of IPv6 to program	1	Feb 16, 2023
Need help finding Segmentation fault C++	0	Apr 16, 2022
Python code problem	2	Apr 23, 2023
Scanf is being prioritized over printf ?	1	Nov 5, 2023
My Status, Ciphertext	2	Nov 28, 2023
Finding a string inside string	3	Nov 23, 2010

substring finding problem!

Nick Keighley

Chris M. Thomasson

Ben Bacarisse

Seebs

Seebs

Julienne Walker

spinoza1111

spinoza1111

spinoza1111

Nick Keighley

Julienne Walker

Richard Bos

Chris M. Thomasson

Ike Naar

spinoza1111

blmblm

blmblm

blmblm

blmblm

blmblm

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads