Iterating through a string and removing leading characters

R

Randy Kramer

This is going to seem a little strange (for a number of reasons I might
mention below), but I would like to iterate through a string from beginning
to end, examining each character, then discarding (or moving) them
(elsewhere).

This is so that I can, upon finding certain character(s), run one of several
REs on the next part of the string. I'm looking to find a very efficient way
to do this.

Here are the approaches I'm considering so far. Any other suggestions?

* ?? (Is there any method which chops the first character from the string?
How efficient is that (especially compared to the method described in the
next bullet).
* Reverse the string, then use something like s[length (-1)] to examine
then chop to discard the last character. The problem with this approach is
that I then have to reverse the string again to have the REs work. (I
(briefly) thought about just setting up the REs in reverse, but I foresee a
lot of difficulty there--scanning through the string in reverse (from last
character to first, negates the advantage I was hoping to gain, that of
starting the RE match only from positions where the RE could possibly match.)

Barring anything better, the approach I may take is to iterate through the
string and, when I find a potential RE match, use s[n,len] to return a
partial string to be checked against the RE.

Aside: I'm trying to make a very efficient parser in Ruby for a wiki-like
thing I want to build, and am trying to avoid the approach of simply making
multiple scans through the document for each of a fairly large number of REs.

I might be guilty of premature optimization, but I prefer to think of it as
doing some proof of concept testing before committing to a design.

I have done some tests that show a 1 to 10% savings in time by taking a
similar approach for REs that could only match at the beginning of a string.
(At some point I'll "publish" those results on WikiLearn (or the next
incarnation thereof).) The next REs are considerably more complex as they
can match anywhere in the string--if the savings from the same approach for
them is only 1 to 10%, the complexity will not be worth it. If by some
chance it exceeds say 50%, I will seriously consider that complexity.

Randy Kramer
 
E

ES

In data 3/18/2005 said:
This is going to seem a little strange (for a number of reasons I might
mention below), but I would like to iterate through a string from beginning
to end, examining each character, then discarding (or moving) them
(elsewhere).

This is so that I can, upon finding certain character(s), run one of several
REs on the next part of the string. I'm looking to find a very efficient way
to do this.

Here are the approaches I'm considering so far. Any other suggestions?

* ?? (Is there any method which chops the first character from the string?
How efficient is that (especially compared to the method described in the
next bullet).
* Reverse the string, then use something like s[length (-1)] to examine
then chop to discard the last character. The problem with this approach is
that I then have to reverse the string again to have the REs work. (I
(briefly) thought about just setting up the REs in reverse, but I foresee a
lot of difficulty there--scanning through the string in reverse (from last
character to first, negates the advantage I was hoping to gain, that of
starting the RE match only from positions where the RE could possibly match.)

Barring anything better, the approach I may take is to iterate through the
string and, when I find a potential RE match, use s[n,len] to return a
partial string to be checked against the RE.

Aside: I'm trying to make a very efficient parser in Ruby for a wiki-like
thing I want to build, and am trying to avoid the approach of simply making
multiple scans through the document for each of a fairly large number of REs.

I might be guilty of premature optimization, but I prefer to think of it as
doing some proof of concept testing before committing to a design.

I have done some tests that show a 1 to 10% savings in time by taking a
similar approach for REs that could only match at the beginning of a string.
(At some point I'll "publish" those results on WikiLearn (or the next
incarnation thereof).) The next REs are considerably more complex as they
can match anywhere in the string--if the savings from the same approach for
them is only 1 to 10%, the complexity will not be worth it. If by some
chance it exceeds say 50%, I will seriously consider that complexity.

I implemented a semi-stateful wiki parser in a somewhat similar manner
using StringScanner. It's fairly fast.
Randy Kramer

E
 
F

Florian Gross

Randy said:
* ?? (Is there any method which chops the first character from the string?
How efficient is that (especially compared to the method described in the
next bullet).

str.slice!(0)

I think this is actually fairly efficient, but I suppose there's a
memmove involved at the C level.

Perhaps you could set the characters you don't need to '\0' and remove
all of them after the iteration has ended with a Regexp?
 
R

Robert Klemme

Christian Neukirchen said:
Better use String#delete! for that, no?

For removing \0 chars? Yepp, that seems like an efficient way to do it.
For direct removing slice! and String#[]= are most efficient:
?> end
user system total real
=> true?> r.report("slice!") do
?> 10000.times { "aaaaaaaaaaaaaa".slice!(0,5) }
end
r.report("[]=") do ?> 10000.times { "aaaaaaaaaaaaaa"[0,5]="" }
end
end
user system total real
slice! 0.140000 0.000000 0.140000 ( 0.130000)
[]= 0.063000 0.000000 0.063000 ( 0.065000)
=> true
Kind regards

robert
 
R

Robert Klemme

Randy Kramer said:
This is going to seem a little strange (for a number of reasons I might
mention below), but I would like to iterate through a string from
beginning
to end, examining each character, then discarding (or moving) them
(elsewhere).

This is so that I can, upon finding certain character(s), run one of
several
REs on the next part of the string. I'm looking to find a very efficient
way
to do this.

Here are the approaches I'm considering so far. Any other suggestions?

* ?? (Is there any method which chops the first character from the
string?
How efficient is that (especially compared to the method described in the
next bullet).
* Reverse the string, then use something like s[length (-1)] to examine
then chop to discard the last character. The problem with this approach
is
that I then have to reverse the string again to have the REs work. (I
(briefly) thought about just setting up the REs in reverse, but I foresee
a
lot of difficulty there--scanning through the string in reverse (from last
character to first, negates the advantage I was hoping to gain, that of
starting the RE match only from positions where the RE could possibly
match.)

Barring anything better, the approach I may take is to iterate through the
string and, when I find a potential RE match, use s[n,len] to return a
partial string to be checked against the RE.

I'd bet that this approach is slower than a pure regexp based approach. If
you cannot stuck all exact regexps into one (see below) then maybe some form
of stripped regexps might help. For example:

rx1 = /ab+/
rx2 = /cd+/

rx_all = /(ab+)|(cd+)/

rx_stripped = /[ab](\w+)/
# then, use these on the second part
rx_stripped_1 = /^b+/
rx_stripped_2 = /^d+/

This is just a simple example for demonstration. For these simple regexps
rx_all is the most efficient one I'm sure.
Aside: I'm trying to make a very efficient parser in Ruby for a wiki-like
thing I want to build, and am trying to avoid the approach of simply
making
multiple scans through the document for each of a fairly large number of
REs.

What does "fairly large" mean? I would try to start with stucking *all*
these regexps into one - if the rx engine does not choke on that regexp I'd
assume that this is the most efficient way to do it, as then you have the
best ratio of machine code to ruby interpretation. Maybe you just show us
all these regexps so we can better understand the problem.
I might be guilty of premature optimization, but I prefer to think of it
as
doing some proof of concept testing before committing to a design.

I have done some tests that show a 1 to 10% savings in time by taking a
similar approach for REs that could only match at the beginning of a
string.
(At some point I'll "publish" those results on WikiLearn (or the next
incarnation thereof).) The next REs are considerably more complex as they
can match anywhere in the string--if the savings from the same approach
for
them is only 1 to 10%, the complexity will not be worth it. If by some
chance it exceeds say 50%, I will seriously consider that complexity.

Now I'm getting really curios. Care to post some more details?

Kind regards

robert
 
R

Randy Kramer

Thanks to all who replied so far. I also want to look into the StringScanner
approach (I'll reply separately with some questions about that), and I can't
believe I couldn't find the ways to delete the first character of a string.
Guess I am a newbie!

I'd bet that this approach is slower than a pure regexp based approach.

So far, you're very right--my approach took about 30 times as long as the pure
regexp approach, although my Ruby code might not be very efficient. (In case
nobody noticed, I'm very much a newbie to Ruby.)
If
you cannot stuck all exact regexps into one (see below) then maybe some
form of stripped regexps might help. For example:

This sounds like its worth a try, but:
1) I haven't created all the necessary REs yet
2) Question below (for clarification)
rx1 = /ab+/
rx2 = /cd+/

rx_all = /(ab+)|(cd+)/

rx_stripped = /[ab](\w+)/

Question: IIUC, the [ab] above should be [ac]?
# then, use these on the second part
rx_stripped_1 = /^b+/
rx_stripped_2 = /^d+/

This is just a simple example for demonstration. For these simple regexps
rx_all is the most efficient one I'm sure.
What does "fairly large" mean? I would try to start with stucking *all*
these regexps into one - if the rx engine does not choke on that regexp I'd
assume that this is the most efficient way to do it, as then you have the
best ratio of machine code to ruby interpretation. Maybe you just show us
all these regexps so we can better understand the problem.

It's hard even to guess, I intended to combine several REs into one anyway
when they had a lot of commonality. For example, the TWiki markup for
headings (which I'm planning to use) is like this:

---* Level 1
---** Level 2
---*** Level 3
---**** Level 4
---***** Level 5
---****** Level 6

I've planned to use one RE for all the above, then determine the level from
the length of the match (like level = len - 3).

Likewise, "inline" markup is *for bold*, _for italic,_ __for bold italic__,
and so forth. I'd try to have one RE looking for words preceded by _, *, or
__, and another with words ending with the same. (And might combine words
marked with % for %TWikiVariables% as well.

With "optimizations" like this, I'd guess on the order of 15 or so regexps.
Now I'm getting really curios. Care to post some more details?

I presume you mean on the 1 to 10% savings? I planned to do that, I'll try to
put something on WikiLearn this weekend then post something here.

Randy Kramer
 
R

Randy Kramer

I implemented a semi-stateful wiki parser in a somewhat similar manner
using StringScanner. It's fairly fast.

StringScanner looks interesting, and I'm starting to dig into it. Care to
provide any more information? Do you have a working wiki?

Randy Kramer
 
E

ES

In data 3/19/2005 said:
StringScanner looks interesting, and I'm starting to dig into it. Care to
provide any more information? Do you have a working wiki?

I do. I was waiting to complete the actual application it is
to feature in before releasing it but if you want to see the
parsing code, I'll do a round or two of refactoring and post
it on my website, work permitting.
Randy Kramer

E
 
R

Randy Kramer

str.slice!(0)

Thanks! (Can't believe I didn't recognize that approach.)
I think this is actually fairly efficient, but I suppose there's a
memmove involved at the C level.

Perhaps you could set the characters you don't need to '\0' and remove
all of them after the iteration has ended with a Regexp?

Sounds like a possibility, although my first set of tests show the REs to be
much faster.

regards,
Randy Kramer
 
R

Randy Kramer

For direct removing slice! and String#[]= are most efficient:

What is the significance to the # in String#delete! and String#[]= above? (I
can't find anything with the # in it like that in "Ruby In a Nutshell"--is it
something new, or just some shorthand/jargon? (Or did I look in the wrong
places ;-)

Ahh, that's useful, I was using Time::now to time my functions.

Randy Kramer
 
R

Randy Kramer

if you want to see the
parsing code, I'll do a round or two of refactoring and post
it on my website, work permitting.

It would be interesting to see, if it's not too much trouble.

regards,
Randy Kramer
 
T

Timothy Hunter

Randy said:
What is the significance to the # in String#delete! and String#[]= above? (I
can't find anything with the # in it like that in "Ruby In a Nutshell"--is it
something new, or just some shorthand/jargon? (Or did I look in the wrong
places ;-)

It is simply a notational convention for identifying instance methods.
Since it is not real Ruby syntax it can be misleading, but its usage is
so widespread it's probably here to stay.
 
R

Randy Kramer

Randy said:
What is the significance to the # in String#delete! and String#[]= above?
(I can't find anything with the # in it like that in "Ruby In a
Nutshell"--is it something new, or just some shorthand/jargon? (Or did I
look in the wrong places ;-)

It is simply a notational convention for identifying instance methods.
Since it is not real Ruby syntax it can be misleading, but its usage is
so widespread it's probably here to stay.

Thanks!
 
R

Robert Klemme

Randy Kramer said:
Thanks to all who replied so far. I also want to look into the
StringScanner
approach (I'll reply separately with some questions about that), and I
can't
believe I couldn't find the ways to delete the first character of a
string.
Guess I am a newbie!

I'd bet that this approach is slower than a pure regexp based approach.

So far, you're very right--my approach took about 30 times as long as the
pure
regexp approach, although my Ruby code might not be very efficient. (In
case
nobody noticed, I'm very much a newbie to Ruby.)
If
you cannot stuck all exact regexps into one (see below) then maybe some
form of stripped regexps might help. For example:

This sounds like its worth a try, but:
1) I haven't created all the necessary REs yet
2) Question below (for clarification)
rx1 = /ab+/
rx2 = /cd+/

rx_all = /(ab+)|(cd+)/

rx_stripped = /[ab](\w+)/

Question: IIUC, the [ab] above should be [ac]?

Exactly. My mistake.
It's hard even to guess, I intended to combine several REs into one anyway
when they had a lot of commonality. For example, the TWiki markup for
headings (which I'm planning to use) is like this:

---* Level 1
---** Level 2
---*** Level 3
---**** Level 4
---***** Level 5
---****** Level 6

I've planned to use one RE for all the above, then determine the level
from
the length of the match (like level = len - 3).

Sounds very reasonable.
Likewise, "inline" markup is *for bold*, _for italic,_ __for bold
italic__,
and so forth. I'd try to have one RE looking for words preceded by _, *,
or
__, and another with words ending with the same. (And might combine words
marked with % for %TWikiVariables% as well.

Seeing this lets me think of another approach: split the string into tokens
with a simple regexp and do the rest of the calculation on the tokens.
Drawback is of course if the input is large this will not perform very well.
Hm...
With "optimizations" like this, I'd guess on the order of 15 or so
regexps.


I presume you mean on the 1 to 10% savings?

Yeah, and on the rx's as well (you did show some of theme alread).
I planned to do that, I'll try to
put something on WikiLearn this weekend then post something here.

Some experiments:
=> "** caption \nfoo _italic_ asdasd __bold__ var=%var%"
str.scan /(^\s*\*+)|(([*_])\3*)(.*?)\2|(%\w+%)/m do |m| p m end
["**", nil, nil, nil, nil]
[nil, "_", "_", "italic", nil]
[nil, "__", "_", "bold", nil]
[nil, nil, nil, nil, "%var%"]
=> "** caption \nfoo _italic_ asdasd __bold__ var=%var%"
str.scan /(^\s*\*+)|(([*_])\3*)(.*?)\2|%(\w+)%/m do |m| p m end
["**", nil, nil, nil, nil]
[nil, "_", "_", "italic", nil]
[nil, "__", "_", "bold", nil]
[nil, nil, nil, nil, "var"]
=> "** caption \nfoo _italic_ asdasd __bold__ var=%var%"

Kind regards

robert
 
R

Robert Klemme

Randy Kramer said:
See http://twiki.org/cgi-bin/view/Wikilearn/RWP_RE_Tests -- it's rather ugly
but the information is there.

Randy Kramer

Some remarks:

- The comparison between 5 and 6 does not seem fair, as you iterate in 6
but not in 5.

- You don't use String#scan or #split which you are likely to need in
practice, because you want to sift through complete documents and want to
treat all occurrences.

- The differences between the check-first-char approach and the pure RE
approach are so insignificant that I'd not bother using the more complex
code. I'd stick with pure RE based approaches and only try to optimize if
performance is bad. (You mentioned premature optimization already... :))

Kind regards

robert
 
R

Randy Kramer

Some remarks:

- The comparison between 5 and 6 does not seem fair, as you iterate in 6
but not in 5.

Oops, for a minute I thought I had really screwed up (like by not doing 5 the
10000 times). AFAIK, I don't need to iterate (through the string) in 5 as
the RE is not anchored to the beginning of the string--it still checks the
entire string (line) for that pattern, which is what I try to achieve in 6 by
the iteration through the string and check a character then invoke RE
approach. I am sure that my Ruby code to do that is not the best, and I may
learn something by making it better, but I agree with your conclusion /
recommendation (below) at least for the time being (although I do plan to
play with str::scan and StringScanner at least a little bit (I presume they
do similar things, but perhaps the 2nd is optimized somehow, particularly if
I "require C" or whatever)).
- You don't use String#scan or #split which you are likely to need in
practice, because you want to sift through complete documents and want to
treat all occurrences.

I need to let that sink in a bit. In general I do want to treat all
occurrences, but I plan to scan a line (actually a paragraph) at a time, and
some things can only occur at the beginning of a line, so those would only be
checked at the beginning of a line.
- The differences between the check-first-char approach and the pure RE
approach are so insignificant that I'd not bother using the more complex
code. I'd stick with pure RE based approaches and only try to optimize if
performance is bad. (You mentioned premature optimization already... :))

Thanks! I pretty much agree at this time, although I want to play a little
bit with StringScanner.

regards,
Randy Kramer
 
R

Robert Klemme

Randy Kramer said:
Oops, for a minute I thought I had really screwed up (like by not doing 5 the
10000 times). AFAIK, I don't need to iterate (through the string) in 5 as
the RE is not anchored to the beginning of the string--it still checks the
entire string (line) for that pattern, which is what I try to achieve in 6 by
the iteration through the string and check a character then invoke RE
approach. I am sure that my Ruby code to do that is not the best, and I may
learn something by making it better, but I agree with your conclusion /
recommendation (below) at least for the time being (although I do plan to
play with str::scan and StringScanner at least a little bit (I presume they
do similar things, but perhaps the 2nd is optimized somehow, particularly if
I "require C" or whatever)).

A particular performance show stopper in test 6 is String#[] i.e. you
create a new String object for each test; object creation is comparatively
expensive even though Strings share their internal buffer. But the GC has
to be informed etc. and this is quite some overhead. If you want fast
code, create as few instances as possible. The same holds for Java in 99%
of all cases.

Another general remark: it should be faster to use a range, Fixnum#upto or
Fixnum#times for iterating because then you have iteration in C and you
don't need to recalculate the limit on each iteration:

# old
i = 0
until i==s1.length-6 do
if s1 == 91
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
i += 1
end

# with range
(0...(s1.length-6)).each do |i|
if s1 == ?[
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
end

# with upto
0.upto(s1.length-7) do |i|
if s1 == ?[
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
end

# with times
(s1.length-6).times do |i|
if s1 == ?[
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
end

And you can use "?[" instead of "91" which is far less readable.
I need to let that sink in a bit. In general I do want to treat all
occurrences, but I plan to scan a line (actually a paragraph) at a time, and
some things can only occur at the beginning of a line, so those would only be
checked at the beginning of a line.

Hm, if you know that the size of files is limited (i.e. something like
just a few KB) then it's usually worth slurping in the whole file with
something like this

contents = File.open(f){|io| io.read}

and then iterate through the whole thing with #scan. You can still use ^
to anchor at line beginnings.

# get the initial sequen until the first non whitespace
# just an example
contents.scan /^\s+\S/ do |m|
p m[0]
end
:))

Thanks! I pretty much agree at this time, although I want to play a little
bit with StringScanner.

Of course, new toys have to be played with! :)

Kind regards

robert
 
R

Randy Kramer

Robert,

I want to thank you for all your help, it's like having a personal tutor!

Some feedback / observations below that don't really require any response.


After some more testing, your remark seems more on target than I originally
thought--I can account for almost all the 30x increase in required time (for
6) by the additional iterations (repeated invocations of the RE engine).
It's like the invocation is the expensive part, and whether it looks for a
pattern at one point or scans the remainder of the(se short) strings is
negligible. (See the results of tests 6d and 6e below.)
A particular performance show stopper in test 6 is String#[] i.e. you
create a new String object for each test; object creation is comparatively
expensive even though Strings share their internal buffer. But the GC has
to be informed etc. and this is quite some overhead. If you want fast
code, create as few instances as possible. The same holds for Java in 99%
of all cases.

I'm surprised that Ruby creates a new String object for each test--I would
have hoped/thought that it was simply letting me "peek" at a portion of the
existing string (especially since it's only a test). I presume that the
StringScanner behaves more sanely in that respect, but I guess I'll find
out. ;-)

Thanks for all of the following! I did substitute them in test 6 to see what
they would do.

# old (6): ~18 seconds
# with range (6a): didn't work, see below
# with upto (6b): ~14 seconds
# with times (6c): ~12 seconds
#6d: ~0.75 seconds (This is the test that convinced me the iterations are the
problem, I revised the (with times) program to only call the RE once,
although it still scans only from the start of the string--I guess I should
try test 6e with the RE not anchored.)
#6e: ~0.6 seconds (Same as 6d, except I removed the \A anchor--and now I'm
puzzled, how is this faster than the anchored version?? Anyway, at this time
I don't care, I'll just "file it away" as a little anomaly to perhaps
understand some day (and, as I haven't run the test multiple times or similar
in an attempt to discount garbage collection, maybe that is the problem.)

I did create new test programs (6a, 6b, 6c) but I haven't uploaded them to the
TWiki--if you are really interested I can do that, but, as I say below, I'm
not going to lose sleep over the problem with range.

For some reason that I haven't figured out (yet?), the "with range" option
didn't work. I'm not going to lose sleep over it--I did try some
troubleshooting, but it may be a rather subtle bug (or I have a very dense
head).

When I run it as part of a program (re_test_6a.rb), I get the following error
messages:

bash-2.05b$ re_test6a.rb
/re_test6a.rb:40: Invalid char `\240' in expression
/re_test6a.rb:41: Invalid char `\240' in expression
...
/re_test6a.rb:66: Invalid char `\240' in expression
bash-2.05b$

When I simply copy the "individual loop" part of the code (i.e., the portion
you show below under # with range) into IRB and running it (after defining
the appropriate strings), I get (and get kicked out of IRB) BTW, this is the
result of attempting to paste the five lines into IRB as a group:

irb(main):021:0> (0...(s1.length-6)).each do |i|
irb(main):022:1*   if s1 == ?[
SyntaxError: compile error
(irb):21: syntax error
from (irb):21
from (null):0
bash-2.05b$

As I try to troubleshoot (by removing pieces from the loop), everything seems
to work OK (and I'm learning what some of those pieces do ;-)

Anyway, since I went this far, I have uploaded programs 6a thru 6e to the
TWiki, but I am not requesting / suggesting that anyone try to spend time
debugging 6a.

http://twiki.org/cgi-bin/view/Wikilearn/RWP_RE_Tests?
# old
i = 0
until i==s1.length-6 do
if s1 == 91
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
i += 1
end

# with range
(0...(s1.length-6)).each do |i|
if s1 == ?[
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
end

# with upto
0.upto(s1.length-7) do |i|
if s1 == ?[
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
end

# with times
(s1.length-6).times do |i|
if s1 == ?[
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
end


For anyone following along ;-) my next efforts are going to be focused on
StringScanner and then making the necessary substitutions. In parallel I
will probably try to refine the REs.

The remainder of this looks useful as well!

regards,
Randy Kramer
Hm, if you know that the size of files is limited (i.e. something like
just a few KB) then it's usually worth slurping in the whole file with
something like this

contents = File.open(f){|io| io.read}

and then iterate through the whole thing with #scan. You can still use ^
to anchor at line beginnings.

# get the initial sequen until the first non whitespace
# just an example
contents.scan /^\s+\S/ do |m|
p m[0]
end
 
R

Robert Klemme

Randy Kramer said:
Robert,

I want to thank you for all your help, it's like having a personal
tutor!

You're welcome! I'm glad I could help by sharing my experiences.
Some feedback / observations below that don't really require any
response.

Well, some comments below nevertheless... :)
After some more testing, your remark seems more on target than I originally
thought--I can account for almost all the 30x increase in required time (for
6) by the additional iterations (repeated invocations of the RE engine).
It's like the invocation is the expensive part, and whether it looks for a
pattern at one point or scans the remainder of the(se short) strings is
negligible. (See the results of tests 6d and 6e below.)

Well, that clearly shows that simple scanning with a RE is superior to
iterating and then scanning.
A particular performance show stopper in test 6 is String#[] i.e. you
create a new String object for each test; object creation is comparatively
expensive even though Strings share their internal buffer. But the GC has
to be informed etc. and this is quite some overhead. If you want fast
code, create as few instances as possible. The same holds for Java in 99%
of all cases.

I'm surprised that Ruby creates a new String object for each test--I would
have hoped/thought that it was simply letting me "peek" at a portion of the
existing string (especially since it's only a test).

The internal buffer (the characters) is shared but there is a new Ruby
instance each time you invoke String#[]:
10.times { puts s1[2,4].id }
134979736
134979676
134979652
134979592
134979496
134979472
134979436
134979364
134979268
134979196
=> 10
I presume that the
StringScanner behaves more sanely in that respect, but I guess I'll find
out. ;-)

Never used that myself but it's sure worth a try.
Thanks for all of the following! I did substitute them in test 6 to see what
they would do.

# old (6): ~18 seconds
# with range (6a): didn't work, see below
# with upto (6b): ~14 seconds
# with times (6c): ~12 seconds
#6d: ~0.75 seconds (This is the test that convinced me the iterations are the
problem, I revised the (with times) program to only call the RE once,
although it still scans only from the start of the string--I guess I should
try test 6e with the RE not anchored.)
#6e: ~0.6 seconds (Same as 6d, except I removed the \A anchor--and now I'm
puzzled, how is this faster than the anchored version?? Anyway, at this time
I don't care, I'll just "file it away" as a little anomaly to perhaps
understand some day (and, as I haven't run the test multiple times or similar
in an attempt to discount garbage collection, maybe that is the problem.)

I did create new test programs (6a, 6b, 6c) but I haven't uploaded them to the
TWiki--if you are really interested I can do that, but, as I say below, I'm
not going to lose sleep over the problem with range.

For some reason that I haven't figured out (yet?), the "with range" option
didn't work. I'm not going to lose sleep over it--I did try some
troubleshooting, but it may be a rather subtle bug (or I have a very dense
head).

When I run it as part of a program (re_test_6a.rb), I get the following error
messages:

bash-2.05b$ re_test6a.rb
/re_test6a.rb:40: Invalid char `\240' in expression
/re_test6a.rb:41: Invalid char `\240' in expression
..
/re_test6a.rb:66: Invalid char `\240' in expression
bash-2.05b$

See comment below.
When I simply copy the "individual loop" part of the code (i.e., the portion
you show below under # with range) into IRB and running it (after defining
the appropriate strings), I get (and get kicked out of IRB) BTW, this is the
result of attempting to paste the five lines into IRB as a group:

irb(main):021:0> (0...(s1.length-6)).each do |i|
irb(main):022:1* if s1 == ?[
SyntaxError: compile error
(irb):21: syntax error
from (irb):21
from (null):0
bash-2.05b$
s1="a"*10 => "aaaaaaaaaa"
(0...(s1.length-6)).each do |i| ?> if s1 == ?[
puts "yes"
end
end

=> 0...4

I guess this and the other syntax error above are caused by copying and
pasting some characters outside the ASCII range. I have experienced
similar errors in the past. Sometimes they look like whitespace
characters so you don't recognize them on first sight.
As I try to troubleshoot (by removing pieces from the loop), everything seems
to work OK (and I'm learning what some of those pieces do ;-)

Anyway, since I went this far, I have uploaded programs 6a thru 6e to the
TWiki, but I am not requesting / suggesting that anyone try to spend time
debugging 6a.

http://twiki.org/cgi-bin/view/Wikilearn/RWP_RE_Tests?
# old
i = 0
until i==s1.length-6 do
if s1 == 91
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
i += 1
end

# with range
(0...(s1.length-6)).each do |i|
if s1 == ?[
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
end

# with upto
0.upto(s1.length-7) do |i|
if s1 == ?[
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
end

# with times
(s1.length-6).times do |i|
if s1 == ?[
s1[i,s1.length] =~
/\A\[(([A-Z]\w*)\.)?(.*)(#([A-Z]\w*))?\](\[(.*)\])?\]/
end
end


For anyone following along ;-) my next efforts are going to be focused on
StringScanner and then making the necessary substitutions. In parallel I
will probably try to refine the REs.


Please let me/us know how that works out.
The remainder of this looks useful as well!

regards,
Randy Kramer
Hm, if you know that the size of files is limited (i.e. something like
just a few KB) then it's usually worth slurping in the whole file with
something like this

contents = File.open(f){|io| io.read}

and then iterate through the whole thing with #scan. You can still use ^
to anchor at line beginnings.

# get the initial sequen until the first non whitespace
# just an example
contents.scan /^\s+\S/ do |m|
p m[0]
end

Kind regards

robert
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top