Iterating through a string and removing leading characters

Randy Kramer · Mar 18, 2005

This is going to seem a little strange (for a number of reasons I might
mention below), but I would like to iterate through a string from beginning
to end, examining each character, then discarding (or moving) them
(elsewhere).

This is so that I can, upon finding certain character(s), run one of several
REs on the next part of the string. I'm looking to find a very efficient way
to do this.

Here are the approaches I'm considering so far. Any other suggestions?

* ?? (Is there any method which chops the first character from the string?
How efficient is that (especially compared to the method described in the
next bullet).
* Reverse the string, then use something like s[length (-1)] to examine
then chop to discard the last character. The problem with this approach is
that I then have to reverse the string again to have the REs work. (I
(briefly) thought about just setting up the REs in reverse, but I foresee a
lot of difficulty there--scanning through the string in reverse (from last
character to first, negates the advantage I was hoping to gain, that of
starting the RE match only from positions where the RE could possibly match.)

Barring anything better, the approach I may take is to iterate through the
string and, when I find a potential RE match, use s[n,len] to return a
partial string to be checked against the RE.

Aside: I'm trying to make a very efficient parser in Ruby for a wiki-like
thing I want to build, and am trying to avoid the approach of simply making
multiple scans through the document for each of a fairly large number of REs.

I might be guilty of premature optimization, but I prefer to think of it as
doing some proof of concept testing before committing to a design.

I have done some tests that show a 1 to 10% savings in time by taking a
similar approach for REs that could only match at the beginning of a string.
(At some point I'll "publish" those results on WikiLearn (or the next
incarnation thereof).) The next REs are considerably more complex as they
can match anywhere in the string--if the savings from the same approach for
them is only 1 to 10%, the complexity will not be worth it. If by some
chance it exceeds say 50%, I will seriously consider that complexity.

Randy Kramer

ES · Mar 18, 2005

In data 3/18/2005 said:
This is going to seem a little strange (for a number of reasons I might
mention below), but I would like to iterate through a string from beginning
to end, examining each character, then discarding (or moving) them
(elsewhere).

This is so that I can, upon finding certain character(s), run one of several
REs on the next part of the string. I'm looking to find a very efficient way
to do this.

Here are the approaches I'm considering so far. Any other suggestions?

* ?? (Is there any method which chops the first character from the string?
How efficient is that (especially compared to the method described in the
next bullet).
* Reverse the string, then use something like s[length (-1)] to examine
then chop to discard the last character. The problem with this approach is
that I then have to reverse the string again to have the REs work. (I
(briefly) thought about just setting up the REs in reverse, but I foresee a
lot of difficulty there--scanning through the string in reverse (from last
character to first, negates the advantage I was hoping to gain, that of
starting the RE match only from positions where the RE could possibly match.)

Barring anything better, the approach I may take is to iterate through the
string and, when I find a potential RE match, use s[n,len] to return a
partial string to be checked against the RE.

Aside: I'm trying to make a very efficient parser in Ruby for a wiki-like
thing I want to build, and am trying to avoid the approach of simply making
multiple scans through the document for each of a fairly large number of REs.

I might be guilty of premature optimization, but I prefer to think of it as
doing some proof of concept testing before committing to a design.

I have done some tests that show a 1 to 10% savings in time by taking a
similar approach for REs that could only match at the beginning of a string.
(At some point I'll "publish" those results on WikiLearn (or the next
incarnation thereof).) The next REs are considerably more complex as they
can match anywhere in the string--if the savings from the same approach for
them is only 1 to 10%, the complexity will not be worth it. If by some
chance it exceeds say 50%, I will seriously consider that complexity.

I implemented a semi-stateful wiki parser in a somewhat similar manner
using StringScanner. It's fairly fast.

Randy Kramer

E

Florian Gross · Mar 18, 2005

Randy said:
* ?? (Is there any method which chops the first character from the string?
How efficient is that (especially compared to the method described in the
next bullet).

str.slice!(0)

I think this is actually fairly efficient, but I suppose there's a
memmove involved at the C level.

Perhaps you could set the characters you don't need to '\0' and remove
all of them after the iteration has ended with a Regexp?

Robert Klemme · Mar 19, 2005

Christian Neukirchen said:
Better use String#delete! for that, no?

For removing \0 chars? Yepp, that seems like an efficient way to do it.
For direct removing slice! and String#[]= are most efficient:
?> end
user system total real
=> true?> r.report("slice!") do
?> 10000.times { "aaaaaaaaaaaaaa".slice!(0,5) }

end
r.report("[]=") do ?> 10000.times { "aaaaaaaaaaaaaa"[0,5]="" }
end
end

Click to expand...

user system total real
slice! 0.140000 0.000000 0.140000 ( 0.130000)
[]= 0.063000 0.000000 0.063000 ( 0.065000)
=> true
Kind regards

robert

Robert Klemme · Mar 19, 2005

Randy Kramer said:
This is going to seem a little strange (for a number of reasons I might
mention below), but I would like to iterate through a string from
beginning
to end, examining each character, then discarding (or moving) them
(elsewhere).

This is so that I can, upon finding certain character(s), run one of
several
REs on the next part of the string. I'm looking to find a very efficient
way
to do this.

Here are the approaches I'm considering so far. Any other suggestions?

* ?? (Is there any method which chops the first character from the
string?
How efficient is that (especially compared to the method described in the
next bullet).
* Reverse the string, then use something like s[length (-1)] to examine
then chop to discard the last character. The problem with this approach
is
that I then have to reverse the string again to have the REs work. (I
(briefly) thought about just setting up the REs in reverse, but I foresee
a
lot of difficulty there--scanning through the string in reverse (from last
character to first, negates the advantage I was hoping to gain, that of
starting the RE match only from positions where the RE could possibly
match.)

Barring anything better, the approach I may take is to iterate through the
string and, when I find a potential RE match, use s[n,len] to return a
partial string to be checked against the RE.

I'd bet that this approach is slower than a pure regexp based approach. If
you cannot stuck all exact regexps into one (see below) then maybe some form
of stripped regexps might help. For example:

rx1 = /ab+/
rx2 = /cd+/

rx_all = /(ab+)|(cd+)/

rx_stripped = /[ab](\w+)/
# then, use these on the second part
rx_stripped_1 = /^b+/
rx_stripped_2 = /^d+/

This is just a simple example for demonstration. For these simple regexps
rx_all is the most efficient one I'm sure.

Aside: I'm trying to make a very efficient parser in Ruby for a wiki-like
thing I want to build, and am trying to avoid the approach of simply
making
multiple scans through the document for each of a fairly large number of
REs.

What does "fairly large" mean? I would try to start with stucking *all*
these regexps into one - if the rx engine does not choke on that regexp I'd
assume that this is the most efficient way to do it, as then you have the
best ratio of machine code to ruby interpretation. Maybe you just show us
all these regexps so we can better understand the problem.

I might be guilty of premature optimization, but I prefer to think of it
as
doing some proof of concept testing before committing to a design.

I have done some tests that show a 1 to 10% savings in time by taking a
similar approach for REs that could only match at the beginning of a
string.
(At some point I'll "publish" those results on WikiLearn (or the next
incarnation thereof).) The next REs are considerably more complex as they
can match anywhere in the string--if the savings from the same approach
for
them is only 1 to 10%, the complexity will not be worth it. If by some
chance it exceeds say 50%, I will seriously consider that complexity.

Now I'm getting really curios. Care to post some more details?

Kind regards

robert

Randy Kramer · Mar 19, 2005

Thanks to all who replied so far. I also want to look into the StringScanner
approach (I'll reply separately with some questions about that), and I can't
believe I couldn't find the ways to delete the first character of a string.
Guess I am a newbie!

I'd bet that this approach is slower than a pure regexp based approach.

So far, you're very right--my approach took about 30 times as long as the pure
regexp approach, although my Ruby code might not be very efficient. (In case
nobody noticed, I'm very much a newbie to Ruby.)

If
you cannot stuck all exact regexps into one (see below) then maybe some
form of stripped regexps might help. For example:

This sounds like its worth a try, but:
1) I haven't created all the necessary REs yet
2) Question below (for clarification)

rx1 = /ab+/
rx2 = /cd+/

rx_all = /(ab+)|(cd+)/

rx_stripped = /[ab](\w+)/

Question: IIUC, the [ab] above should be [ac]?

# then, use these on the second part
rx_stripped_1 = /^b+/
rx_stripped_2 = /^d+/

This is just a simple example for demonstration. For these simple regexps
rx_all is the most efficient one I'm sure.

What does "fairly large" mean? I would try to start with stucking *all*
these regexps into one - if the rx engine does not choke on that regexp I'd
assume that this is the most efficient way to do it, as then you have the
best ratio of machine code to ruby interpretation. Maybe you just show us
all these regexps so we can better understand the problem.

It's hard even to guess, I intended to combine several REs into one anyway
when they had a lot of commonality. For example, the TWiki markup for
headings (which I'm planning to use) is like this:

---* Level 1
---** Level 2
---*** Level 3
---**** Level 4
---***** Level 5
---****** Level 6

I've planned to use one RE for all the above, then determine the level from
the length of the match (like level = len - 3).

Likewise, "inline" markup is *for bold*, _for italic,_ __for bold italic__,
and so forth. I'd try to have one RE looking for words preceded by _, *, or
__, and another with words ending with the same. (And might combine words
marked with % for %TWikiVariables% as well.

With "optimizations" like this, I'd guess on the order of 15 or so regexps.

Now I'm getting really curios. Care to post some more details?

I presume you mean on the 1 to 10% savings? I planned to do that, I'll try to
put something on WikiLearn this weekend then post something here.

Randy Kramer

Randy Kramer · Mar 19, 2005

I implemented a semi-stateful wiki parser in a somewhat similar manner
using StringScanner. It's fairly fast.

StringScanner looks interesting, and I'm starting to dig into it. Care to
provide any more information? Do you have a working wiki?

Randy Kramer

ES · Mar 19, 2005

In data 3/19/2005 said:
StringScanner looks interesting, and I'm starting to dig into it. Care to
provide any more information? Do you have a working wiki?

I do. I was waiting to complete the actual application it is
to feature in before releasing it but if you want to see the
parsing code, I'll do a round or two of refactoring and post
it on my website, work permitting.

Randy Kramer

E

Randy Kramer · Mar 19, 2005

str.slice!(0)

Thanks! (Can't believe I didn't recognize that approach.)

I think this is actually fairly efficient, but I suppose there's a
memmove involved at the C level.

Perhaps you could set the characters you don't need to '\0' and remove
all of them after the iteration has ended with a Regexp?

Sounds like a possibility, although my first set of tests show the REs to be
much faster.

regards,
Randy Kramer

Randy Kramer · Mar 19, 2005

For direct removing slice! and String#[]= are most efficient:

What is the significance to the # in String#delete! and String#[]= above? (I
can't find anything with the # in it like that in "Ruby In a Nutshell"--is it
something new, or just some shorthand/jargon? (Or did I look in the wrong
places ;-)

Ahh, that's useful, I was using Time::now to time my functions.

Randy Kramer

Randy Kramer · Mar 19, 2005

if you want to see the
parsing code, I'll do a round or two of refactoring and post
it on my website, work permitting.

It would be interesting to see, if it's not too much trouble.

regards,
Randy Kramer

Timothy Hunter · Mar 19, 2005

Randy said:
What is the significance to the # in String#delete! and String#[]= above? (I
can't find anything with the # in it like that in "Ruby In a Nutshell"--is it
something new, or just some shorthand/jargon? (Or did I look in the wrong
places ;-)

It is simply a notational convention for identifying instance methods.
Since it is not real Ruby syntax it can be misleading, but its usage is
so widespread it's probably here to stay.

Randy Kramer · Mar 20, 2005

Randy said:
Randy said:

What is the significance to the # in String#delete! and String#[]= above?
(I can't find anything with the # in it like that in "Ruby In a
Nutshell"--is it something new, or just some shorthand/jargon? (Or did I
look in the wrong places ;-)

Click to expand...

It is simply a notational convention for identifying instance methods.
Since it is not real Ruby syntax it can be misleading, but its usage is
so widespread it's probably here to stay.

Thanks!

Robert Klemme · Mar 20, 2005

Randy Kramer said:
Thanks to all who replied so far. I also want to look into the
StringScanner
approach (I'll reply separately with some questions about that), and I
can't
believe I couldn't find the ways to delete the first character of a
string.
Guess I am a newbie!

I'd bet that this approach is slower than a pure regexp based approach.

Click to expand...

So far, you're very right--my approach took about 30 times as long as the
pure
regexp approach, although my Ruby code might not be very efficient. (In
case
nobody noticed, I'm very much a newbie to Ruby.)

If
you cannot stuck all exact regexps into one (see below) then maybe some
form of stripped regexps might help. For example:

Click to expand...

This sounds like its worth a try, but:
1) I haven't created all the necessary REs yet
2) Question below (for clarification)

rx1 = /ab+/
rx2 = /cd+/

rx_all = /(ab+)|(cd+)/

rx_stripped = /[ab](\w+)/

Click to expand...

Question: IIUC, the [ab] above should be [ac]?

Exactly. My mistake.

It's hard even to guess, I intended to combine several REs into one anyway
when they had a lot of commonality. For example, the TWiki markup for
headings (which I'm planning to use) is like this:

---* Level 1
---** Level 2
---*** Level 3
---**** Level 4
---***** Level 5
---****** Level 6

I've planned to use one RE for all the above, then determine the level
from
the length of the match (like level = len - 3).

Sounds very reasonable.

Likewise, "inline" markup is *for bold*, _for italic,_ __for bold
italic__,
and so forth. I'd try to have one RE looking for words preceded by _, *,
or
__, and another with words ending with the same. (And might combine words
marked with % for %TWikiVariables% as well.

Seeing this lets me think of another approach: split the string into tokens
with a simple regexp and do the rest of the calculation on the tokens.
Drawback is of course if the input is large this will not perform very well.
Hm...

With "optimizations" like this, I'd guess on the order of 15 or so
regexps.

I presume you mean on the 1 to 10% savings?

Yeah, and on the rx's as well (you did show some of theme alread).

I planned to do that, I'll try to
put something on WikiLearn this weekend then post something here.

Some experiments:
=> "** caption \nfoo _italic_ asdasd __bold__ var=%var%"

str.scan /(^\s*\*+)|(([*_])\3*)(.*?)\2|(%\w+%)/m do |m| p m end

Click to expand...

["**", nil, nil, nil, nil]
[nil, "_", "_", "italic", nil]
[nil, "__", "_", "bold", nil]
[nil, nil, nil, nil, "%var%"]
=> "** caption \nfoo _italic_ asdasd __bold__ var=%var%"

str.scan /(^\s*\*+)|(([*_])\3*)(.*?)\2|%(\w+)%/m do |m| p m end

Click to expand...

["**", nil, nil, nil, nil]
[nil, "_", "_", "italic", nil]
[nil, "__", "_", "bold", nil]
[nil, nil, nil, nil, "var"]
=> "** caption \nfoo _italic_ asdasd __bold__ var=%var%"

Kind regards

robert

Randy Kramer · Mar 20, 2005

See http://twiki.org/cgi-bin/view/Wikilearn/RWP_RE_Tests -- it's rather ugly
but the information is there.

Randy Kramer

Robert Klemme · Mar 21, 2005

Randy Kramer said:
See http://twiki.org/cgi-bin/view/Wikilearn/RWP_RE_Tests -- it's rather ugly
but the information is there.

Randy Kramer

Some remarks:

- The comparison between 5 and 6 does not seem fair, as you iterate in 6
but not in 5.

- You don't use String#scan or #split which you are likely to need in
practice, because you want to sift through complete documents and want to
treat all occurrences.

- The differences between the check-first-char approach and the pure RE
approach are so insignificant that I'd not bother using the more complex
code. I'd stick with pure RE based approaches and only try to optimize if
performance is bad. (You mentioned premature optimization already...

)

Kind regards

robert

Randy Kramer · Mar 21, 2005

Some remarks:

- The comparison between 5 and 6 does not seem fair, as you iterate in 6
but not in 5.

Oops, for a minute I thought I had really screwed up (like by not doing 5 the
10000 times). AFAIK, I don't need to iterate (through the string) in 5 as
the RE is not anchored to the beginning of the string--it still checks the
entire string (line) for that pattern, which is what I try to achieve in 6 by
the iteration through the string and check a character then invoke RE
approach. I am sure that my Ruby code to do that is not the best, and I may
learn something by making it better, but I agree with your conclusion /
recommendation (below) at least for the time being (although I do plan to
play with str::scan and StringScanner at least a little bit (I presume they
do similar things, but perhaps the 2nd is optimized somehow, particularly if
I "require C" or whatever)).

- You don't use String#scan or #split which you are likely to need in
practice, because you want to sift through complete documents and want to
treat all occurrences.

I need to let that sink in a bit. In general I do want to treat all
occurrences, but I plan to scan a line (actually a paragraph) at a time, and
some things can only occur at the beginning of a line, so those would only be
checked at the beginning of a line.

- The differences between the check-first-char approach and the pure RE
approach are so insignificant that I'd not bother using the more complex
code. I'd stick with pure RE based approaches and only try to optimize if
performance is bad. (You mentioned premature optimization already... )

Thanks! I pretty much agree at this time, although I want to play a little
bit with StringScanner.

regards,
Randy Kramer

Robert Klemme · Mar 21, 2005

Randy Kramer said:
Oops, for a minute I thought I had really screwed up (like by not doing 5 the
10000 times). AFAIK, I don't need to iterate (through the string) in 5 as
the RE is not anchored to the beginning of the string--it still checks the
entire string (line) for that pattern, which is what I try to achieve in 6 by
the iteration through the string and check a character then invoke RE
approach. I am sure that my Ruby code to do that is not the best, and I may
learn something by making it better, but I agree with your conclusion /
recommendation (below) at least for the time being (although I do plan to
play with str::scan and StringScanner at least a little bit (I presume they
do similar things, but perhaps the 2nd is optimized somehow, particularly if
I "require C" or whatever)).

A particular performance show stopper in test 6 is String#[] i.e. you
create a new String object for each test; object creation is comparatively
expensive even though Strings share their internal buffer. But the GC has
to be informed etc. and this is quite some overhead. If you want fast
code, create as few instances as possible. The same holds for Java in 99%
of all cases.

Another general remark: it should be faster to use a range, Fixnum#upto or
Fixnum#times for iterating because then you have iteration in C and you
don't need to recalculate the limit on each iteration:

# old
i = 0
until i==s1.length-6 do
if s1 == 91
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
i += 1
end

# with range
(0...(s1.length-6)).each do |i|
if s1 == ?[
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
end

# with upto
0.upto(s1.length-7) do |i|
if s1 == ?[
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
end

# with times
(s1.length-6).times do |i|
if s1 == ?[
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
end

And you can use "?[" instead of "91" which is far less readable.

I need to let that sink in a bit. In general I do want to treat all
occurrences, but I plan to scan a line (actually a paragraph) at a time, and
some things can only occur at the beginning of a line, so those would only be
checked at the beginning of a line.

Click to expand...

Hm, if you know that the size of files is limited (i.e. something like
just a few KB) then it's usually worth slurping in the whole file with
something like this

contents = File.open(f){|io| io.read}

and then iterate through the whole thing with #scan. You can still use ^
to anchor at line beginnings.

# get the initial sequen until the first non whitespace
# just an example
contents.scan /^\s+\S/ do |m|
p m[0]
end

)

Thanks! I pretty much agree at this time, although I want to play a little
bit with StringScanner.

Click to expand...

Of course, new toys have to be played with!

Kind regards

robert

Randy Kramer · Mar 22, 2005

Robert,

I want to thank you for all your help, it's like having a personal tutor!

Some feedback / observations below that don't really require any response.

After some more testing, your remark seems more on target than I originally
thought--I can account for almost all the 30x increase in required time (for
6) by the additional iterations (repeated invocations of the RE engine).
It's like the invocation is the expensive part, and whether it looks for a
pattern at one point or scans the remainder of the(se short) strings is
negligible. (See the results of tests 6d and 6e below.)

A particular performance show stopper in test 6 is String#[] i.e. you
create a new String object for each test; object creation is comparatively
expensive even though Strings share their internal buffer. But the GC has
to be informed etc. and this is quite some overhead. If you want fast
code, create as few instances as possible. The same holds for Java in 99%
of all cases.

I'm surprised that Ruby creates a new String object for each test--I would
have hoped/thought that it was simply letting me "peek" at a portion of the
existing string (especially since it's only a test). I presume that the
StringScanner behaves more sanely in that respect, but I guess I'll find
out. ;-)

Thanks for all of the following! I did substitute them in test 6 to see what
they would do.

# old (6): ~18 seconds
# with range (6a): didn't work, see below
# with upto (6b): ~14 seconds
# with times (6c): ~12 seconds
#6d: ~0.75 seconds (This is the test that convinced me the iterations are the
problem, I revised the (with times) program to only call the RE once,
although it still scans only from the start of the string--I guess I should
try test 6e with the RE not anchored.)
#6e: ~0.6 seconds (Same as 6d, except I removed the \A anchor--and now I'm
puzzled, how is this faster than the anchored version?? Anyway, at this time
I don't care, I'll just "file it away" as a little anomaly to perhaps
understand some day (and, as I haven't run the test multiple times or similar
in an attempt to discount garbage collection, maybe that is the problem.)

I did create new test programs (6a, 6b, 6c) but I haven't uploaded them to the
TWiki--if you are really interested I can do that, but, as I say below, I'm
not going to lose sleep over the problem with range.

For some reason that I haven't figured out (yet?), the "with range" option
didn't work. I'm not going to lose sleep over it--I did try some
troubleshooting, but it may be a rather subtle bug (or I have a very dense
head).

When I run it as part of a program (re_test_6a.rb), I get the following error
messages:

bash-2.05b$ re_test6a.rb
/re_test6a.rb:40: Invalid char `\240' in expression
/re_test6a.rb:41: Invalid char `\240' in expression
...
/re_test6a.rb:66: Invalid char `\240' in expression
bash-2.05b$

When I simply copy the "individual loop" part of the code (i.e., the portion
you show below under # with range) into IRB and running it (after defining
the appropriate strings), I get (and get kicked out of IRB) BTW, this is the
result of attempting to paste the five lines into IRB as a group:

irb(main):021:0> (0...(s1.length-6)).each do |i|
irb(main):022:1* if s1 == ?[
SyntaxError: compile error
(irb):21: syntax error
from (irb):21
from (null):0
bash-2.05b$

As I try to troubleshoot (by removing pieces from the loop), everything seems
to work OK (and I'm learning what some of those pieces do ;-)

Anyway, since I went this far, I have uploaded programs 6a thru 6e to the
TWiki, but I am not requesting / suggesting that anyone try to spend time
debugging 6a.

http://twiki.org/cgi-bin/view/Wikilearn/RWP_RE_Tests?

# old
i = 0
until i==s1.length-6 do
if s1 == 91
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
i += 1
end

# with range
(0...(s1.length-6)).each do |i|
if s1 == ?[
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
end

# with upto
0.upto(s1.length-7) do |i|
if s1 == ?[
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
end

# with times
(s1.length-6).times do |i|
if s1 == ?[
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
end

Click to expand...

For anyone following along ;-) my next efforts are going to be focused on
StringScanner and then making the necessary substitutions. In parallel I
will probably try to refine the REs.

The remainder of this looks useful as well!

regards,
Randy Kramer

Hm, if you know that the size of files is limited (i.e. something like
just a few KB) then it's usually worth slurping in the whole file with
something like this

contents = File.open(f){|io| io.read}

and then iterate through the whole thing with #scan. You can still use ^
to anchor at line beginnings.

# get the initial sequen until the first non whitespace
# just an example
contents.scan /^\s+\S/ do |m|
p m[0]
end

Click to expand...

Robert Klemme · Mar 23, 2005

Randy Kramer said:
Robert,

I want to thank you for all your help, it's like having a personal

tutor!

You're welcome! I'm glad I could help by sharing my experiences.

Some feedback / observations below that don't really require any

response.

Well, some comments below nevertheless...

After some more testing, your remark seems more on target than I originally
thought--I can account for almost all the 30x increase in required time (for
6) by the additional iterations (repeated invocations of the RE engine).
It's like the invocation is the expensive part, and whether it looks for a
pattern at one point or scans the remainder of the(se short) strings is
negligible. (See the results of tests 6d and 6e below.)

Well, that clearly shows that simple scanning with a RE is superior to
iterating and then scanning.

A particular performance show stopper in test 6 is String#[] i.e. you
create a new String object for each test; object creation is comparatively
expensive even though Strings share their internal buffer. But the GC has
to be informed etc. and this is quite some overhead. If you want fast
code, create as few instances as possible. The same holds for Java in 99%
of all cases.

Click to expand...

I'm surprised that Ruby creates a new String object for each test--I would
have hoped/thought that it was simply letting me "peek" at a portion of the
existing string (especially since it's only a test).

The internal buffer (the characters) is shared but there is a new Ruby
instance each time you invoke String#[]:

10.times { puts s1[2,4].id }

Click to expand...

134979736
134979676
134979652
134979592
134979496
134979472
134979436
134979364
134979268
134979196
=> 10

I presume that the
StringScanner behaves more sanely in that respect, but I guess I'll find
out. ;-)

Never used that myself but it's sure worth a try.

Thanks for all of the following! I did substitute them in test 6 to see what
they would do.

# old (6): ~18 seconds
# with range (6a): didn't work, see below
# with upto (6b): ~14 seconds
# with times (6c): ~12 seconds
#6d: ~0.75 seconds (This is the test that convinced me the iterations are the
problem, I revised the (with times) program to only call the RE once,
although it still scans only from the start of the string--I guess I should
try test 6e with the RE not anchored.)
#6e: ~0.6 seconds (Same as 6d, except I removed the \A anchor--and now I'm
puzzled, how is this faster than the anchored version?? Anyway, at this time
I don't care, I'll just "file it away" as a little anomaly to perhaps
understand some day (and, as I haven't run the test multiple times or similar
in an attempt to discount garbage collection, maybe that is the problem.)

I did create new test programs (6a, 6b, 6c) but I haven't uploaded them to the
TWiki--if you are really interested I can do that, but, as I say below, I'm
not going to lose sleep over the problem with range.

For some reason that I haven't figured out (yet?), the "with range" option
didn't work. I'm not going to lose sleep over it--I did try some
troubleshooting, but it may be a rather subtle bug (or I have a very dense
head).

When I run it as part of a program (re_test_6a.rb), I get the following error
messages:

bash-2.05b$ re_test6a.rb
/re_test6a.rb:40: Invalid char `\240' in expression
/re_test6a.rb:41: Invalid char `\240' in expression
..
/re_test6a.rb:66: Invalid char `\240' in expression
bash-2.05b$

See comment below.

When I simply copy the "individual loop" part of the code (i.e., the portion
you show below under # with range) into IRB and running it (after defining
the appropriate strings), I get (and get kicked out of IRB) BTW, this is the
result of attempting to paste the five lines into IRB as a group:

irb(main):021:0> (0...(s1.length-6)).each do |i|
irb(main):022:1* if s1 == ?[
SyntaxError: compile error
(irb):21: syntax error
from (irb):21
from (null):0
bash-2.05b$

s1="a"*10 => "aaaaaaaaaa"
(0...(s1.length-6)).each do |i| ?> if s1 == ?[
puts "yes"
end
end

Click to expand...

=> 0...4

I guess this and the other syntax error above are caused by copying and
pasting some characters outside the ASCII range. I have experienced
similar errors in the past. Sometimes they look like whitespace
characters so you don't recognize them on first sight.

As I try to troubleshoot (by removing pieces from the loop), everything seems
to work OK (and I'm learning what some of those pieces do ;-)

Anyway, since I went this far, I have uploaded programs 6a thru 6e to the
TWiki, but I am not requesting / suggesting that anyone try to spend time
debugging 6a.

http://twiki.org/cgi-bin/view/Wikilearn/RWP_RE_Tests?

# old
i = 0
until i==s1.length-6 do
if s1 == 91
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
i += 1
end

# with range
(0...(s1.length-6)).each do |i|
if s1 == ?[
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
end

# with upto
0.upto(s1.length-7) do |i|
if s1 == ?[
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
end

# with times
(s1.length-6).times do |i|
if s1 == ?[
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
end

Click to expand...

For anyone following along ;-) my next efforts are going to be focused on
StringScanner and then making the necessary substitutions. In parallel I
will probably try to refine the REs.

Click to expand...

Please let me/us know how that works out.

The remainder of this looks useful as well!

regards,
Randy Kramer

Hm, if you know that the size of files is limited (i.e. something like
just a few KB) then it's usually worth slurping in the whole file with
something like this

contents = File.open(f){|io| io.read}

and then iterate through the whole thing with #scan. You can still use ^
to anchor at line beginnings.

# get the initial sequen until the first non whitespace
# just an example
contents.scan /^\s+\S/ do |m|
p m[0]
end

Click to expand...

Click to expand...

Kind regards

robert

Iterating through a hash	12	Aug 20, 2008
Iterating through a string	9	Sep 21, 2008
Iterating through a database result	3	Oct 23, 2008
C code String And Comparison	2	Dec 27, 2022
Removing lines containing same first string boundaries?	31	Mar 17, 2014
Converting an Array to a String in JavaScript	7	Sep 22, 2023
iterating through the registry	3	Jan 17, 2008
Iterating through class names using a block	16	Nov 24, 2007

Iterating through a string and removing leading characters

Randy Kramer

ES

Florian Gross

Robert Klemme

Robert Klemme

Randy Kramer

Randy Kramer

ES

Randy Kramer

Randy Kramer

Randy Kramer

Timothy Hunter

Randy Kramer

Robert Klemme

Randy Kramer

Robert Klemme

Randy Kramer

Robert Klemme

Randy Kramer

Robert Klemme

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads