False positives in editing data

Raul Parolari · Nov 22, 2007

RichardOnRails said:
Hi Raul,

I like your "battle plan"..
I especially appreciate your showing me how a regex can be written to
handle an arbitrary number of dot-separated numbers (rather than hard-
code distinct sub-expressions).

if line =~ /^ (.*?) [a-zA-Z] /x

I thought I could simply remove the question-mark.
So, your question mark is clearly working, but HOW?

Richard

I saw that Gavin has given you (in another thread) a general tutorial on
this. I'll add a simpler explanation, just in the context of the problem
we treated:

.* means 'as many characters as possible'

Now, point 1 of the 'battle plan' was (I quote):
"1) we first collect everything until the first letter (not included);
we will consider this the Prefix."

So we want to tell the Regexp Engine: "as few characters as possible
until you see a letter (a-zA-Z), then stop right there!".

Let's examine the 2 expressions, with and without the question mark:

(.*?)                               [a-zA-Z]
minimal nr of chars needed until .. 1st letter

(.*)                                                         [a-zA-Z]
as many chars as you can possibly get away with, and then .. a letter

An example:

s="2.1Topic 2.1"

md = s.match( /^ (.*?) [a-zA-Z] /x )
md[1] # => "2.1"

md = s.match( /^ (.*) [a-zA-Z] /x )
md[1] # => "2.1Topi"

Have you seen? Both expressions were satisfied, but in different ways:
a) the first (with .*?) tried to find the minimal number of characters
until the first letter, and so it stopped when it found the 'T' of
Topic.

b) the second expression tried to find as many characters as possible,
only bounded by having to then find a letter, so it stopped at the 'c'
of Topic.

With a sense of humour, somebody observed that ".*? values contentment
over greed"; and since then the ".*?" forms have been called
"non-greedy", while the ".*" forms have been called "greedy".

[I stop here as Gavin described to you '.+' & co].
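
For readers who did not see the other thread, here is a minimal sketch of the '.+' & co. family, assuming a reasonably modern Ruby. The sample strings are made up for illustration; the point is only that each greedy quantifier has a lazy twin formed by appending "?".

```ruby
# Each greedy quantifier has a lazy twin formed by appending "?".
tag_text = "<a><b>"   # made-up sample input

p tag_text[/<.+>/]     # => "<a><b>"  greedy: runs to the last ">"
p tag_text[/<.+?>/]    # => "<a>"     lazy: stops at the first ">"

p "ab"[/ab?/]          # => "ab"      greedy "?" takes the optional "b"
p "ab"[/ab??/]         # => "a"       lazy "??" skips it

p "aaa"[/a{1,3}/]      # => "aaa"     greedy: as many as allowed
p "aaa"[/a{1,3}?/]     # => "a"       lazy: as few as allowed
```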

One piece of advice: the key to learning regular expressions is to read
a good book (just trying them drives one insane) while experimenting
(just reading drives one insane too). The time spent pays you back very
quickly at the first serious exercise (as you can develop a
'battle-plan' rather than a 'guerrilla war' with the regexps).

I am glad that you found the script useful, and I hope that this helped
too.

Raul

RichardOnRails · Nov 23, 2007

if line =~ /^ (.*?) [a-zA-Z] /x

I have one question about this regex. I have a book that I bought in
2002 but never read until today: "Mastering Regular Expressions", 2nd
Ed., from O'Reilly. I haven't been able to find any reference in
there to a question-mark following ".*".

I thought I could simply remove the question-mark. That caused the
match to fail and yield the programmed error msg. I tried omitting
the question-mark and add a closing "$" in the regex. That made the
parsing fail. So, your question mark is clearly working, but HOW?

I can help with this one. Check "Mastering Regular Expressions" for
non-greedy operators. Normally, ".*" will match everything it possibly
can. Adding the "?" causes it to do a minimal match -- it will match
as little as necessary to still fill the requirements. In the above
case, it matches everything until the first letter. Without the "?" it
matches everything until the last letter.

Jeremy

Hi Jeremy,

non-greedy operators.

Thanks for that detailed explanation. I tried Googling for "(.*?)
regular expression", but Google thinks it's too weird to actually
include "(.*?)" in a search ... and I don't blame them. I checked
Amazon to see if there is a later edition of "Mastering ...", to no
avail.

Again, thanks for taking the time to respond.

Regards,
Richard

RichardOnRails · Nov 24, 2007

Hi Raul,

Thank you very much for your expanded analysis.

I saw that Gavin has given you (in another thread) ...

I started a new thread on the "(.*?)" because this thread was getting
too long. And Gavin turning me on to "greedy" was a big boost. That
let me find some relevant stuff in "Mastering Regular Expressions, 2nd
ed."

An example: ...

Your example is great. I went back to Hal Fulton's "The Ruby Way, 2nd
ed." and http://www.ruby-doc.org/core/classes/Regexp.html for
additional Regexp#match documentation.

Notwithstanding your exposition and the documentation cited, my
reptilian brain refuses acceptance on this issue. But by running the
examples given and some of my own construction, I should get over
this hump. (I wrote my own NFSA in C for a client's application
roughly 30 years ago, so I should be equal to the task.)

I'm not going to expose my ignorance with any further questions on this
matter. I'll do my homework.

With thanks and best wishes,
Richard

RichardOnRails · Nov 26, 2007

RichardOnRails wrote:
sName =~ /^([\d]+)?\.?([\d]+)?\.?([\d]+)?\.?/
sName
Did you draw attention to this because of the Hungarian notation? If
so, do you think I'm unwise to adopt the style once advocated by
Charles Simonyi, super-programmer and co-founder of a giant software
company?
Yes. Adamantly, and definitely yes.
It might have been a bad idea when he had it, even though to his
credit he was trying to make the best of a bad situation, where
MS had bought the worst C compiler on the planet because the good
ones weren't for sale - they could make more money *not* selling
to MS. Because of a spate of bugs and bad code churned out by the
MS software factory, many caused by type mismatches on function
parameters that weren't detected either at compile time or at
runtime, Hungarian notation *might* have been a good idea once.
It's definitely *not* a good idea with modern C, and even less of
a good idea with Ruby.
Clifford Heath.

Hi Clifford,

I don't know anything about Microsoft's choice of compilers. But I
used several C compilers in the '80s, and in all cases found Hungarian
notation helpful. I don't think Microsoft's initial choice of
compilers is relevant to my and others' successful employment of that
convention.

Why would it be a bad idea with modern C compilers or with Ruby? You
offer no reason. All it does is add one or two letters before names!
That doesn't bother any human or compiler or interpreter.

It depends on the type of Hungarian that you're using, and it's not
clear from your code sample which it is. If you're using an
abbreviation prefix to denote a semantic difference within a type, then
that's (potentially) useful, in both Ruby and C:

us_username = read_unsafe_input()
s_username = sanitise(us_username)

with us_ meaning unsafe and s_ meaning safe, for example. Not something
I'd use myself, but I can see the utility. If it's denoting a class,
then it's not something I can see as useful, in either C or Ruby. In C,
you're duplicating the compiler's type-checking, and in Ruby,
duck-typing means that you shouldn't need to care; it becomes
readability-damaging line noise.
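
The semantic-prefix idea above can be sketched as a small runnable example. The two helper methods and the sample input are hypothetical, written only for this illustration; a real sanitiser would depend on where the value ends up. Note that both variables hold a String, so the prefix records meaning, not class.

```ruby
# Hypothetical helpers, named only for this illustration.
def read_unsafe_input
  "<b>richard</b>"   # stands in for raw, untrusted user input
end

def sanitise(us_text)
  # Escape the characters that matter in an HTML context.
  us_text.gsub(/[<>&]/, "<" => "&lt;", ">" => "&gt;", "&" => "&amp;")
end

us_username = read_unsafe_input          # us_ = unsafe
s_username  = sanitise(us_username)      # s_  = safe

puts s_username   # => &lt;b&gt;richard&lt;/b&gt;
```

A line such as `render(us_username)` now looks wrong at a glance, which is the benefit this style is after.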

Hi Alex,

abbreviation prefix to denote a semantic difference within a
type, then that's (potentially) useful, in both Ruby and C:
us_username = read_unsafe_input()
s_username = sanitise(us_username)

I like that.

If it's denoting a class,
then it's not something I can see as useful

I like to know merely by inspection whether a referent denotes an
integer, a string, a hash or an array of such things. I'd like to
avoid "Syntax error" simply because I failed to include a to_s, to_i,
[] or whatever. I really can't see why a prefixed lower-case letter
or two before a camel-case object name can create so much discussion
irrelevant to the question at hand.

Maybe I'll have to "sanitize" all the code I post to exemplify a
coding issue.

Thank you for your response, notwithstanding my lack of total
agreement.

Best wishes,
Richard

RichardOnRails · Nov 26, 2007

sName =~ /^([\d]+)?\.?([\d]+)?\.?([\d]+)?\.?/
sName

Did you draw attention to this because of the Hungarian notation? If
so, do you think I'm unwise to adopt the style once advocated by
Charles Simonyi, super-programmer and co-founder of a giant software
company?

Actually this form of Hungarian notation, which was called Systems
Hungarian at Microsoft, is NOT what Simonyi originally suggested (his
original form is what was used in the Application Division).

http://talklikeaduck.denhaven2.com/articles/2007/04/09/hungarian-ducks

Hi Rick,

I loved your blog. Thanks for posting it and informing me about it.

I think my usage of "Hungarian" is consistent with Simonyi's intent, at
least as I understand it. In any case, I find my usage helpful, as
I mentioned to Alex Young on this thread, though I may have to
sanitize future posts to avoid people who don't respond to what I
mean but instead waste time on how I express my question.

Best wishes,
Richard

RichardOnRails · Nov 26, 2007

Harry is quoting from the movie Airplane (1980)

Todd

Thanks, Todd. That was over my head. "Airplane" was not a movie I'd
run to see.

Regards,
Richard

Alex Young · Nov 26, 2007

RichardOnRails wrote:

Hi Alex,

abbreviation prefix to denote a semantic difference within a
type, then that's (potentially) useful, in both Ruby and C:
us_username = read_unsafe_input()
s_username = sanitise(us_username)

I like that.

If it's denoting a class,
then it's not something I can see as useful

I like to know merely by inspection whether a referent denotes an
integer, a string, a hash or an array of such things. I'd like to
avoid "Syntax error" simply because I failed to include a to_s, to_i,
[] or whatever.

You'll tend to find that types are much less relevant in Ruby than in C.
The actual class of an object is much less important than the methods
it responds to, and you won't get a syntax error unless the syntax is
actually wrong; this won't be a problem with variables because of the
lack of compile-time type checking. I've tried the method you're
espousing myself, and it didn't actually help me at all. I found it
just wasn't worth the effort. However, I'm more than willing to accept
that it's a difference between your coding style and mine rather than
any fundamental problem with the concept that made the difference.
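
To make the distinction concrete, here is a minimal sketch, assuming a reasonably modern Ruby. The helper names `error_for` and `total` are made up for this illustration. A forgotten `to_i` or `to_s` parses fine and only fails when the line runs, with a TypeError or NoMethodError and not a syntax error; and a method that relies only on what its argument responds to works across classes.

```ruby
# Runs the block and reports which exception class, if any, it raised.
def error_for
  yield
  nil
rescue StandardError => e
  e.class
end

p error_for { "5" + 1 }        # => TypeError (forgot to_i or to_s)
p error_for { 5.upcase }       # => NoMethodError
p error_for { "5".to_i + 1 }   # => nil (no error)

# Duck typing: total only needs its argument to respond to inject.
def total(items)
  items.inject(0) { |sum, item| sum + item }
end

p total([1, 2, 3])   # => 6
p total(1..3)        # => 6
```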

In terms of posting code here, the most important thing is to make it
readable. Most people won't know what your Hungarian prefixes mean, so
they're just line noise to them.

Raul Parolari · Nov 26, 2007

RichardOnRails said:
I like to know merely by inspection whether a referent denotes an
integer, a string, a hash or an array of such things. I'd like to
avoid "Syntax error" simply because I failed to include a to_s, to_i,
[] or whatever. I really can't see why a prefixed lower-case letter
or two before a camel-case object name can create so much discussion
irrelevant to the question at hand.

Richard,

I have always found the issue of 'how we name things' fascinating, and
not only for philosophical reasons; personally I think that the
horrendous amount of time spent today in what is called the "Testing
Dept" is due in part to problems like that.

A bad and careless Naming methodology (when there is one!) leads
(especially in a project where people share code) to subtle errors,
flawed assumptions, and ultimately to errors (unfortunately, when there
is somebody put in charge of the 'naming standards', he is often very
politically correct, but not the brightest guy around, and the result is
even worse than 'no standards').

One person who wrote something intelligent about this subject is Damian
Conway in his book "Perl Best Practices" (ok, I will get the usual
parochial boos for naming that language, but ok, life continues),
specifically chapter 3 "Naming Conventions".

Even if the examples are in Perl syntax, the substance goes beyond. He
suggests something different from your approach; the name should
indicate not so much the class/type, but the MEANING of the data
structure; for example (of course I will not use Perl syntax, and will
avoid the examples that make sense for Perl only):

# scalars
running_total games_count # (rather than 'total','count')

# booleans
is_valid has_end_tag loading_finished

# arrays
events handlers unknowns
# the iteration var
event handler unknown

# hashes
title_of count_for sales_from isbn_from

He even discusses the role of 'nouns' and 'adjectives' in names.. (a
delight to read!).

The emphasis is in the MEANING of the data, not on the 'class/type': do
you see? but the objective is similar to yours: grant somebody looking
at code of somebody else (translated: ourselves 6 months later!) at
least a hope to vaguely understand what is going on!
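
A small Ruby sketch of names chosen for meaning, loosely following the categories listed above. All data here is made up for illustration, and the names are my own picks in that style, not taken from the book.

```ruby
# All data below is made up for illustration.
events    = ["open", "read", "close"]   # array: plural noun
count_for = Hash.new(0)                 # hash: reads as count_for[event]

events.each do |event|                  # iteration variable: singular
  count_for[event] += 1
end

has_close_event = count_for["close"] > 0                          # boolean
running_total   = count_for.values.inject(0) { |sum, n| sum + n } # scalar

p count_for["read"]   # => 1
p has_close_event     # => true
p running_total       # => 3
```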

You may want to glance at it, if you find that approach of interest.

Raul

Names are but noise and smoke,
obscuring heavenly light

Johann Wolfgang von Goethe, "Faust: Part I"

RichardOnRails · Nov 26, 2007

Hi Raul,

read a good book (just trying them drives one insane)
while experimenting

I certainly agree with regard to REs. And I've got "Mastering Regular
Expressions, 2nd ed." by Friedl.

(just reading drives one insane too).

Well, they haven't carted me off yet.

As I've said, your approach works great. But I do want to
experiment, too. So I tried the following, and I'm hopeful that you
can tolerate another question:

Program
-------
input = <<DATA
05Topic 5
2.002.1Topic 2.2.1
DATA

input.each_line do |line|
  line =~ /^ (\d+[\.]?)+ [^\.\d] /x
  puts line
  puts $1, $2
  puts
end

Output
------
05Topic 5
05
nil

2.002.1Topic 2.2.1
1
nil

Question
--------
I'm puzzled by "1" in the second output, because the "^" in the RE
specifies that a match must occur at the first character. I expected
to get $1=2, at least, and hopefully $2=002 and $3=1, though I was
willing to work on the last two items.

Am I wrong about the caret?

RichardOnRails · Nov 26, 2007

Hi Raul,

I forgot to tell you that I finally understand your second example.

md = s.match( /^ (.*) [a-zA-Z] /x )
md[1] # => "2.1Topi"

Without the question mark, in principle, the ".*" initially consumes
all the characters, but then it sees the match fails, because there's
no match for the "[a-zA-Z]". So the ".*" sort of "backs off" and
satisfies itself with "2.1Topi", leaving the "c" to satisfy
"[a-zA-Z]".

Cool. Actually, I read that in "Mastering Regular Expressions, 2nd
ed.", but it really didn't settle into my Weltanschauung. But I think I
got it now!

Furthermore, the "non-greedy question mark" says "consume only as much
as you need in order to satisfy the total RE". So "(.*?)" needs to
consume all the characters up to something satisfying the "[a-zA-Z]",
which is the "T".
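
The "backing off" described above can be imitated by hand. This is only a sketch of the idea, not how the engine is implemented: it tries candidate lengths for the captured group, longest first for the greedy form and shortest first for the lazy form, and keeps the first length where the next character is a letter.

```ruby
s = "2.1Topic 2.1"

# Greedy (.*): start with everything, give back one character at a time
# until the character that follows is a letter.
greedy_length = s.length.downto(0).find { |n| s[n, 1] =~ /[a-zA-Z]/ }

# Lazy (.*?): start with nothing, take one more character at a time
# until the character that follows is a letter.
lazy_length = 0.upto(s.length).find { |n| s[n, 1] =~ /[a-zA-Z]/ }

p s[0, greedy_length]   # => "2.1Topi"
p s[0, lazy_length]     # => "2.1"

# The hand simulation agrees with the real engine:
p s[0, greedy_length] == s.match(/^ (.*) [a-zA-Z] /x)[1]    # => true
p s[0, lazy_length]   == s.match(/^ (.*?) [a-zA-Z] /x)[1]   # => true
```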

The one I settled on is:

s="2.1Topic 2.1"
md = s.match( /^ ([\.\d]*) [^\.\d] /x )
#md[0]=2.1T
#md[1]=2.1

Raul Parolari · Nov 27, 2007

RichardOnRails said:
I forgot to tell you that I finally understand your second example.

md = s.match( /^ (.*) [a-zA-Z] /x )
md[1] # => "2.1Topi"

Without the question mark, in principle, the ".*" initially consumes
all the characters, but then it sees the match fails, because there's
no match for the "[a-zA-Z]". So the ".*" sort of "backs off" and
satisfies itself with "2.1Topi", leaving the "c" to satisfy
"[a-zA-Z]".

Very good, Richard!
It is a question of when one is 'content'; if you need a metaphor to
remember it, think of a Wall Street banker (.*, .+) vs a Franciscan monk
(.*?, .+?)

The one I settled on is:

s="2.1Topic 2.1"
md = s.match( /^ ([\.\d]*) [^\.\d] /x )
#md[0]=2.1T
#md[1]=2.1

I see that you have solved the problem in your previous post (that I
could not reply to), when you wrote (removing all other code):

s = "2.002.1Topic 2.2.1"

s =~ /^ (\d+[.]?)+ [^\.\d] /x

I must confess: I was stunned myself that it did not work; foolish of
us, in fact it was working, but you failed to collect the bounty! You
needed parentheses to include the '+'!

s =~ /^ ((\d+[.]?)+) [^\.\d] /x

p $1, $2 # => "2.002.1", "1"

However it is better to avoid collecting also the inner results as they
overwrite each other in $2 and then confuse us (that's the reason that
you saw the last digit captured above..); so let's use the '?:' trick,
to avoid writing in $2, where instead we will capture the 'non
digits/dots' that come after:

s =~ /^ ((?:\d+[.]?)*) ([^\.\d]+) /x

p $1, $2 # => "2.002.1", "Topic "
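
Here is a short sketch of why the inner group ends up holding only "1". The `iterations` variable is added purely for illustration: it lists what the inner group matched on each pass, and the capture keeps only the last of them.

```ruby
s = "2.002.1Topic 2.2.1"

# What the repeated inner group matches on each pass:
iterations = s[/^[\d.]+/].scan(/\d+[.]?/)
p iterations        # => ["2.", "002.", "1"]

# A capturing inner group is overwritten on every pass; the last one wins.
capturing = s.match(/^ ((\d+[.]?)+) [^.\d] /x)
p capturing[1]      # => "2.002.1"
p capturing[2]      # => "1"

# With (?: ... ) the inner group captures nothing, so group 2 is free.
non_capturing = s.match(/^ ((?:\d+[.]?)+) ([^.\d]+) /x)
p non_capturing[1]  # => "2.002.1"
p non_capturing[2]  # => "Topic "
```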

Do you see it? I think you do. Now, to finish, let's examine how you
solved the problem in this post:

s="2.1Topic 2.1"
md = s.match( /^ ([.\d]*) [^\.\d] /x )

Ah, you resorted to 'pragmatism'.. you said: "the bloody '(\d+[.]?)+'
does not work, so I will change it". This was ok, but do you see the
difference between:

((?:\d+[.]?)*) # I changed + -> * to compare

([\d.]*)

aside from the fact that the second one is easier to read? (You may
want to stop reading and think about this, as this is your test to
graduate from "intermediate level regexp".)

Ok: if they could speak, they would say respectively:
1) I want 0 or more sequences of (digits followed optionally by a dot)
2) I want 0 or more combinations of digits and dots as they come

Do you see?
both would match: "2.002.1" but the 2nd would also match "...1..37"!
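
A quick way to see the difference is to anchor both patterns to the whole string and try a few inputs. The `\A` and `\z` anchors, the pattern names and the third sample string are my additions for this check; without the anchors the first pattern would still "match" by matching nothing at all.

```ruby
structured = /\A (?:\d+[.]?)* \z/x   # sequences of (digits, optional dot)
loose      = /\A [\d.]* \z/x         # any mix of digits and dots

["2.002.1", "...1..37", "1..2"].each do |text|
  is_structured = !!(text =~ structured)
  is_loose      = !!(text =~ loose)
  puts "#{text.inspect}: structured=#{is_structured} loose=#{is_loose}"
end
# "2.002.1":  structured=true  loose=true
# "...1..37": structured=false loose=true
# "1..2":     structured=false loose=true
```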

The last question you had was: how do I pick up the digits once I
collected the "2.002.1"? Study scan in Pickaxe and then do:

str = "2.002.1"

str.scan(/ (\d+) /x) # => [["2"], ["002"], ["1"]]

All right, let's call it a Regexp day,

Raul

MonkeeSage · Nov 27, 2007

Hi Richard,

Here's a cheat-sheet for ruby regular expression syntax:

http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt

Regards,
Jordan

RichardOnRails · Nov 27, 2007

Hi Jordan,

http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt

Thanks. I'm running ruby 1.8.2 (2004-12-25) [i386-mswin32]. How can
I tell if it uses Oniguruma RE ver.5.6.0?

"gem list oniguruma -b" gave me:
*** LOCAL GEMS ***
[nothing]
*** REMOTE GEMS ***
oniguruma (1.1.0, 1.0.1, 1.0.0, 0.9.1, 0.9.0)

Judging by this result, I'd say the "5.6.0" is the version of the
Cheat Sheet itself; and that I don't have Oniguruma installed.

I've got The Ruby Way, ver. 2, that covers Oniguruma. But I'm fairly
new to Ruby, so I wonder whether stepping up to Oniguruma is
prudent.

Any ideas?

Regards,
Richard

MonkeeSage · Nov 27, 2007

On Nov 26, 9:13 pm, RichardOnRails wrote:

Thanks. I'm running ruby 1.8.2 (2004-12-25) [i386-mswin32]. How can
I tell if it uses Oniguruma RE ver.5.6.0?

You don't actually need oniguruma, it's the same syntax as class
Regexp in ruby 1.8 (well, a few things don't work, but 99% does).
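
One way to find out which constructs a given interpreter accepts is simply to try compiling them. This is a rough sketch, and `regexp_supported?` is a name made up for it. Look-behind is used as the probe on the assumption that it is among the Oniguruma features the 1.8 engine lacks.

```ruby
# Returns true when the running Ruby's regexp engine accepts the pattern.
def regexp_supported?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

puts "Ruby #{RUBY_VERSION}"
puts "lazy quantifier: #{regexp_supported?('a.*?b')}"
puts "look-behind:     #{regexp_supported?('(?<=a)b')}"
```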

Regards,
Jordan

RichardOnRails · Nov 28, 2007

Hi Jordan,

You don't actually need oniguruma, it's the same syntax as class
Regexp in ruby 1.8 (well, a few things don't work, but 99% does).

Great! Thank you very much for the Cheat Sheet.

Best wishes,
Richard

RichardOnRails · Nov 28, 2007

Hi Raul,

Thank you for your further support of my obstinacy. Your help has
guided me to the solution I wanted. Your original one is succinct,
perhaps even elegant in that it decomposes the problem into two sub-
problems which admit of essentially one-line solutions. While I truly
appreciate that approach, I wanted to find a "natural" solution,
which is the one included below. It has one caveat: it's aimed at
processing files of only a few megabytes. That said, I'd be pleased
to hear of any downsides you may foresee.

I forgot to tell you that I finally understand your second example.

[snip]
Very good, Richard!
It is a question on when one is 'content'; if you need a metaphor ...

Thanks. I've got that stuff wired into my brain now.

I must confess: I was stunned myself that it did not work; foolish of
us, in fact it was working, but you failed to collect the bounty! you
needed parenthesis to include the '+'!

That approach is old news, now that I've conceived of my "natural"
approach.

However it is better to avoid collecting also the inner results as they
overwrite each other in $2 and then confuse us

Understood! As you'll see, I avoided that pitfall below.

[snip]

Do you see it? I think you do.

Quite so.

[snip]
This was ok, but do you see the difference between:

((?:\d+[.]?)*) # I changed + -> * to compare

([\d.]*) [snip]
Do you see?

For sure!

All right, let's call it a Regexp day,

I'll drink to that!

With Thanks and Best Wishes, I remain
Yours truly,
Richard

# "Natural" Solution
input = <<DATA
05Topic 05
1.0Topic 1.0
2.002.1Topic 2.2.1
3.15.26.37Topic 3.15.26.37
DATA

MaxDepth = 5

# Build one capture group per possible number, then a group for the rest
sRE = "^"
(1..MaxDepth).each { |i|
  sRE << ' (\d*)(?:\.?)'
}
sRE += ' ([^\.\d].*)'
re = Regexp.new(sRE, Regexp::EXTENDED)

input.each_line { |line|
  puts '='*10
  puts line
  puts '='*10

  # puts re.to_s # Debug
  md = line.match( re )
  (0..MaxDepth+1).each { |i|
    puts "md[#{i}] = " + md[i] if md
  }
  puts
}

Raul Parolari · Nov 28, 2007

RichardOnRails said:
Thank you for your further support of my obstinacy. Your help has
guided me to the solution I wanted. Your original one is succinct,
perhaps even elegant in that it decomposes the problem into two sub-
problems which admit of essentially one-line solutions. While I truly
appreciate that approach, I wanted to find a "natural" solution,
which is the one included below.

Hi, Richard

you mention your 'obstinacy'... and indeed, you have found a way to
implement with Regexps your original design (you did not fool me!);
great ingenuity.

However, as much as I am stunned by your progress (your original program
at the top of this post and this one seem like Dante going from Inferno
to Paradiso), I must be frank.

I do not like building arrays to meet some 'maximum threshold', leaving
portions of them empty; it just does not make me 'happy' (in the
Matz/Ruby sense, do you understand?). Of course, it is just an array of
5 positions, but it does not matter; it is just ecologically wrong for
me.
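
To make the objection concrete, here is a small sketch of what a fixed-depth pattern returns for a short prefix. It rebuilds the same kind of regexp as the "natural" solution; the names max_depth, pattern, captures and used are mine, chosen for illustration.

```ruby
max_depth = 5   # plays the role of MaxDepth in the "natural" solution

# Five optional "digits plus dot" slots, then the title starting at a non-digit, non-dot.
pattern = "^" + ' (\d*)(?:\.?)' * max_depth + ' ([^\.\d].*)'
re = Regexp.new(pattern, Regexp::EXTENDED)

captures = "1.0Topic 1.0".match(re).captures
p captures   # => ["1", "0", "", "", "", "Topic 1.0"]

# The unused slots come back as empty strings and have to be filtered out afterwards.
used = captures[0...max_depth].reject { |c| c.empty? }
p used       # => ["1", "0"]
```

Three of the five numeric slots are empty here, which is the "portions of them empty" I mean.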

I however understand your feeling towards my solution (and the contrast
you create with your 'natural one'); got it! but then I invite you to
explore scan with \G; look at this, that may be doing something more
'natural':

# \G 'anchors' start of next search to end of previous one
re_prefix = %r/\G (\d+ [.]?) /x

input.each_line { |line| p line.scan(re_prefix).flatten }

Output:
["05"]
["1", "0"]
["2", "002", "1"]
["3", "15", "26", "37"]

Small (1 line!), fast, precise: a beauty.
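
To see what \G contributes, compare the same scan with and without it on one of the sample lines. This is a sketch; the pattern is written with the group around the digits only, which is what gives the dot-free output listed above, and the variable names are just for illustration.

```ruby
line = "2.002.1Topic 2.2.1"

# With \G each match must begin exactly where the previous one ended,
# so scanning stops at the 'T' of Topic.
anchored = line.scan(/\G (\d+) [.]? /x).flatten

# Without \G the engine skips ahead and also picks up the numbers in the title.
unanchored = line.scan(/ (\d+) [.]? /x).flatten

p anchored     # => ["2", "002", "1"]
p unanchored   # => ["2", "002", "1", "2", "2", "1"]
```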

It is not the complete solution as '\G scan' in Ruby does not allow you
to change the regexp without interrupting the job (there are sad
workarounds for it), so the job is not complete. The library
StringScanner ('strscan') is of interest, as it solves this problem
nicely (and is in C, so is fast).
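
A minimal sketch of the StringScanner idea, assuming the goal is to collect the numeric prefix and then take the rest of the line as the title. The helper name split_line is hypothetical; only StringScanner.new, #scan and #[] from the standard 'strscan' library are used.

```ruby
require 'strscan'

# split_line is a hypothetical helper, not something from this thread.
# It returns [prefix_parts, title] for a line such as "2.002.1Topic 2.2.1".
def split_line(line)
  ss = StringScanner.new(line.chomp)
  parts = []

  # First regexp: one run of digits plus an optional dot.
  # StringScanner#scan only matches at the current scan pointer, like \G.
  while ss.scan(/(\d+)[.]?/)
    parts << ss[1]   # capture group 1 of the last match
  end

  # Second regexp: switch patterns mid-string and take what is left as the title.
  title = ss.scan(/.*/)
  [parts, title]
end

p split_line("3.15.26.37Topic 3.15.26.37")
# => [["3", "15", "26", "37"], "Topic 3.15.26.37"]
```

The point is that the scanner keeps its position between calls, so the regexp can change from one call to the next without restarting the job.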

Perhaps, examine \G and/or StringScanner, and see if you can find a
solution that meets what you are looking for, without (as seen from me)
imperfections.

Congrats for your progress (in 1 week!)

Raul

Some people, when confronted with a problem, think:
"I know, I'll use regular expressions".
Now they have two problems.

Jamie Zawinski

Raul Parolari · Nov 28, 2007

Just to correct the position of a parenthesis above; the line

re_prefix = %r/\G (\d+ [.]?) /x

should read:

re_prefix = %r/\G (\d+) [.]? /x
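
The effect of the parenthesis position, sketched on one of the sample lines (the variable names are only for illustration):

```ruby
line = "3.15.26.37Topic 3.15.26.37"

# Group closes after the optional dot: the dot becomes part of each capture.
dots_inside = line.scan(/\G (\d+ [.]?) /x).flatten

# Group closes after the digits: the dot is matched but not captured.
dots_outside = line.scan(/\G (\d+) [.]? /x).flatten

p dots_inside    # => ["3.", "15.", "26.", "37"]
p dots_outside   # => ["3", "15", "26", "37"]
```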

Raul

RichardOnRails · Nov 29, 2007

Hi Raul,

re_prefix = %r/\G (\d+ [.]?) /x


I have learned something: I saw immediately that the mistyped version
was not what you intended because the dots would have been captured.
I applied your correction and things worked as advertised.

... you have found a way to implement with Regexps your
original design (you did not fool me!);
:BG

great ingenuity [snip] stunned by your progress

Thanks for the compliments, but they're not merited in this respect:

I wrote my first program circa 1955 (on paper only; no execution)
after receiving a letter from a former high-school classmate
announcing that he had encountered new-fangled machines at Princeton
called "computers". He furthermore recounted their instruction set.
I was hooked. After grinding out a degree from night-college and
earning an NSF graduate fellowship in math, I finally got a job
programming a real computer, which I continued until I retired a few
years ago.

I must be frank.

Absolutely. I've invited that, and am also impressed with your
gracious approach.

I do not like building arrays to meet some 'maximum threshold', leaving
portions of them empty; it just does not make me 'happy' (in the
Matz/Ruby sense, do you understand?). Of course, it is just an array of
5 positions, but it does not matter; it is just ecologically wrong for
me.

I acknowledge and share the aesthetic validity of your displeasure.

I however understand your feeling towards my solution (and the contrast
you create with your 'natural one'); got it! but then I invite you to
explore scan with \G; look at this, that may be doing something more
'natural':

# \G 'anchors' start of next search to end of previous one

re_prefix = %r/\G (\d+ [.]?) /x
input.each_line { |line| p line.scan(re_prefix).flatten }

Small (1 line!), fast, precise: a beauty.

I agree fully!

Of course I couldn't leave "well enough" alone, so here's my mod:

input.each_line { |line| line.scan(re_prefix).flatten.each { |e|
printf("%s ",e)}; puts }

Perhaps, examine \G and/or StringScanner, ...

I took a fast look at ruby-doc.org/core/ ... looks good. Thanks

see if you can find a solution that meets what you are looking for,
without (as seen from me) imperfections.

Your latest approach suits my requirement (and taste) perfectly. I'm
off now to continue work on the project I'm developing, which you
might find interesting. Since it's off-topic, send me an email if you
want details. (My email-address looks artificial in order to deter
spammers, but I do have a mail box for it which I only check
sporadically unless I anticipate legitimate email.)

Some people, when confronted with a problem, think:

"I know, I'll use regular expressions".
Now they have two problems.

:BG

Again, thank you and
Best Wishes,
Richard
