Method improvement request

Charles Hixson

I'm sure there must be a more idiomatic+efficient way to do this, but I
can't figure it out. Any suggestions?
Also, I'm not sure all of the tests are necessary. Many of them were
added to avoid "Nil class does not implement..." messages. Is there a
better approach?

# parse1 separates a chunk into the non-word stuff before it, the word
# stuff, and the non-word stuff after it
# word stuff is letters, digits, hyphens, and periods
def parse1(chunk)
  pb = /^([^-A-Za-z0-9]*)/
  pe = /([^-A-Za-z0-9]*)$/
  mtch = pb.match(chunk)
  a = mtch[0]
  mtch = pe.match(mtch.post_match)
  b = mtch.pre_match
  c = mtch[0]
  #print " parse1:a #{a.inspect} " if a and a.length > 0
  yield a if a and a.length > 0
  #print " parse1:b #{b.inspect} " if b and b.length > 0
  yield b if b and b.length > 0
  #print " parse1:c #{c.inspect} " if c and c.length > 0
  yield c if c and c.length > 0
end

# parse2 takes a hunk of word stuff, and possibly separates it at a
# double-hyphen
def parse2(chunk)
  unless chunk.include?("--") then
    yield chunk
  else
    val = chunk.split(/--/)
    v2 = []
    val.each do |v|
      v2 << v
      v2 << "--"
    end
    v2.delete_at(v2.length)
    v2.each { |v| yield v }
  end
end

# parse3 takes a hunk of word stuff, and possibly separates it at an
# ellipsis (triple '.')
def parse3(chunk)
  unless chunk.include?("...") then
    yield chunk
  else
    val = chunk.split(/\.\.\./)
    v2 = []
    val.each do |v|
      v2 << v
      v2 << "..."
    end
    v2.delete_at(v2.length)
    v2.each { |v| yield v }
  end
end

def wrds(chunk)
  return "" unless chunk.respond_to?("[]")
  parse1(chunk) do |p1|
    #print " wrds:p1 = #{p1.inspect} "
    parse2(p1) do |p2|
      #print " wrds:p2 = #{p2.inspect} "
      parse3(p2) do |p3|
        #print " wrds:p3 = #{p3.inspect} "
        yield p3
      end
    end
  end
end
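
One detail in parse2 and parse3 as posted may be worth checking: v2.delete_at(v2.length) indexes one past the last element, so it returns nil and removes nothing, and the trailing separator is still yielded. A minimal sketch of the difference, using a made-up sample string:

```ruby
# Build the same interleaved array that parse2 builds for "well--known".
v2 = []
"well--known".split(/--/).each do |v|
  v2 << v
  v2 << "--"
end

removed = v2.delete_at(v2.length)  # index is one past the end
p removed                          # nil, nothing was deleted
p v2                               # ["well", "--", "known", "--"]

v2.pop                             # removes the last element
p v2                               # ["well", "--", "known"]
```

Note that pop would also drop a separator that really was at the end of the chunk (as in "well--"), because split discards trailing empty fields, so this is only a partial fix.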
 
James Edward Gray II

I'm sure there must be a more idiomatic+efficient way to do this, but
I can't figure it out. Any suggestions?
Also, I'm not sure all of the tests are necessary. Many of them were
added to avoid "Nil class does not implement..." messages. Is there a
better approach?

I'll probably get in trouble for this around here, but I came from
Perl...
# parse1 separates a chunk into the non-word stuff before it, the word
stuff, and the non-word stuff after it
# word stuff is letters, digits, hyphens, and periods
def parse1(chunk)
pb = /^([^-A-Za-z0-9]*)/
pe = /([^-A-Za-z0-9]*)$/
mtch = pb.match(chunk)
a = mtch[0]
mtch = pe.match(mtch.post_match)
b = mtch.pre_match
c = mtch[0]
#print " parse1:a #{a.inspect} " if a and
a.length > 0
yield a if a and a.length > 0
#print " parse1:b #{b.inspect} " if b and
b.length > 0
yield b if b and b.length > 0
#print " parse1:c #{c.inspect} " if c and
c.length > 0
yield c if c and c .length > 0
end

What about something like:

def parse1(chunk)
if chunk =~ / ^([^-A-Za-z0-9]*) # pre-match
(.*) # middle
([^-A-Za-z0-9]*)$ # post-match
/x
a, b, c = $1, $2, $3
end
end

Does that give you some ideas?

James Edward Gray II
 
James Edward Gray II

What about something like:

def parse1(chunk)
if chunk =~ / ^([^-A-Za-z0-9]*) # pre-match
(.*) # middle

Sorry, that probably needs to be:

(.*?)

My bad.

James Edward Gray II
([^-A-Za-z0-9]*)$ # post-match
/x
a, b, c = $1, $2, $3
end
end

Does that give you some ideas?

James Edward Gray II
 
David A. Black

Hi --

I'm sure there must be a more idiomatic+efficient way to do this, but I
can't figure it out. Any suggestions?
Also, I'm not sure all of the tests are necessary. Many of them were
added to avoid "Nil class does not implement..." messages. Is there a
better approach?

# parse1 separates a chunk into the non-word stuff before it, the word
stuff, and the non-word stuff after it
# word stuff is letters, digits, hyphens, and periods
def parse1(chunk)
pb = /^([^-A-Za-z0-9]*)/
pe = /([^-A-Za-z0-9]*)$/
mtch = pb.match(chunk)
a = mtch[0]
mtch = pe.match(mtch.post_match)
b = mtch.pre_match
c = mtch[0]
#print " parse1:a #{a.inspect} " if a and
a.length > 0
yield a if a and a.length > 0
#print " parse1:b #{b.inspect} " if b and
b.length > 0
yield b if b and b.length > 0
#print " parse1:c #{c.inspect} " if c and
c.length > 0
yield c if c and c .length > 0
end

The spacing got screwed up there, as you can see, but anyway --

I believe that pre_match and post_match will always be empty strings,
if there's no match, not nil. So the "if a" test is not necessary (if
I'm right). However, calling #[] on the results of a match will raise
an exception (trying to call #[] on nil) if there was no match, so you
have to be careful with the "a = mtch[0]" line.
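
A quick way to check this, as a sketch using the first pattern from the original post: because that pattern uses "*", it can match the empty string, so match returns a MatchData here. A pattern that fails to match returns nil instead, and that is where the exception comes from.

```ruby
pb = /^([^-A-Za-z0-9]*)/

m = pb.match("word")      # nothing non-word at the start, but "*" still matches
p m[0]                    # ""
p m.pre_match             # ""
p m.post_match            # "word"

m2 = /\d+/.match("word")  # no digits, so there is no match at all
p m2                      # nil

begin
  m2[0]                   # calling [] on nil
rescue NoMethodError => e
  puts "NoMethodError: #{e.message}"
end
```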

I wonder also whether it's useful to yield only non-empty strings.
The caller then has to test the strings to see which of the positions
they're from. It might be better to yield three things every time, so
the caller knows what's being yielded.

All of which leads me to this probably over-simplified code:

def parse1(chunk)
chunk.scan(/^(\W*)(\w+)(\W*)$/).flatten.each {|s| yield s}
end

(I've used \W and \w where you'd need to use something more
custom-made -- though I don't think your character classes do what you
want, because they don't include periods.)
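
As a sketch of what a more custom-made version could look like, assuming word stuff really is letters, digits, hyphens, and periods as the comment in the original post says. The method name parse1_parts and the constants are made up for illustration. It yields all three parts every time, empty or not, and yields nothing when the chunk doesn't fit the pre/word/post shape:

```ruby
WORD     = "[-A-Za-z0-9.]"
NON_WORD = "[^-A-Za-z0-9.]"
PARTS    = /\A(#{NON_WORD}*)(#{WORD}*)(#{NON_WORD}*)\z/

def parse1_parts(chunk)
  m = PARTS.match(chunk)
  return unless m            # e.g. a non-word character inside the word part
  yield m[1], m[2], m[3]
end

parse1_parts('"Hello,"') { |pre, word, post| p [pre, word, post] }
# ["\"", "Hello", ",\""]
```

A chunk such as "don't" does not match this pattern at all, because the apostrophe sits inside the word part.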


David
 
Charles Hixson

David said:
Hi --

I'm sure there must be a more idiomatic+efficient way to do this, but I
can't figure it out. Any suggestions?
Also, I'm not sure all of the tests are necessary. Many of them were
added to avoid "Nil class does not implement..." messages. Is there a
better approach?

# parse1 separates a chunk into the non-word stuff before it, the word
stuff, and the non-word stuff after it
# word stuff is letters, digits, hyphens, and periods
def parse1(chunk)
pb = /^([^-A-Za-z0-9]*)/
pe = /([^-A-Za-z0-9]*)$/
mtch = pb.match(chunk)
a = mtch[0]
mtch = pe.match(mtch.post_match)
b = mtch.pre_match
c = mtch[0]
#print " parse1:a #{a.inspect} " if a and
a.length > 0
yield a if a and a.length > 0
#print " parse1:b #{b.inspect} " if b and
b.length > 0
yield b if b and b.length > 0
#print " parse1:c #{c.inspect} " if c and
c.length > 0
yield c if c and c .length > 0
end

The spacing got screwed up there, as you can see, but anyway --

I believe that pre_match and post_match will always be empty strings,
if there's no match, not nil. So the "if a" test is not necessary (if
I'm right). However, calling #[] on the results of a match will raise
an exception (trying to call #[] on nil) if there was no match, so you
have to be careful with the "a = mtch[0]" line.

I wonder also whether it's useful to yield only non-empty strings.
The caller then has to test the strings to see which of the positions
they're from. It might be better to yield three things every time, so
the caller knows what's being yielded.

All of which leads me to this probably over-simplified code:

def parse1(chunk)
chunk.scan(/^(\W*)(\w+)(\W*)$/).flatten.each {|s| yield s}
end

(I've used \W and \w where you'd need to use something more
custom-made -- though I don't think your character classes do what you
want, because they don't include periods.)
David
That's a very interesting rewrite of the parse1 match patterns.
It's a bit more complicated than that, e.g., at the match boundary
apostrophes aren't a part of the middle, but in the middle (of the
middle) they are. Think about 'don't'. And periods aren't a legitimate
part of most middle-chunks. But sometimes they are. This can't be
resolved by simple matching. So I'm going to need to pre-process to
identify known good values and replace them by something that will
pass...and then back convert them afterwards.

If I could make the start and the end greedy, and the middle
non-greedy....(I need to check this!) I could drastically simplify
parse1. But my main question is really about the nested routines that
yield values. This looks messy, but it does provide reasonably easy
extension (e.g., note the late addition of a routine to handle internal
ellipsis. Again, I really should preprocess to convert an ellipsis into a
single special character...and then return it to normal form later.)

OTOH, if I could write the correct pattern, I could go back one step
earlier, to where I originally break the chunks off the string with:
chunks = lin.chomp.split
and replace it with something like
chunks = line.scan(/(\W*)(\w*)(\W*)\s+/).flatten

As you indicate, I'll need to use a much fancier pattern than the default
\w and \W, if for no other reason than because \W includes spaces. (Figuring
out the proper pattern for the middle section will be QUITE an
interesting endeavor!)
P.S.: Does scan keep recycling its pattern, or would I need to replace
it with (an elaboration of)
chunks = line.scan(/^((\W*)(\w*)(\W*)\s+)$/).flatten
 
Bill Guindon

I'm sure there must be a more idiomatic+efficient way to do this, but I
can't figure it out. Any suggestions?

Might be easier if you could show a sample of what you're starting
with, and what you'd like it to become. If the sample is large, you
can always paste it into codepaste (http://www.codepaste.org), and
just put the link to it in the email.
Also, I'm not sure all of the tests are necessary. Many of them were
added to avoid "Nil class does not implement..." messages. Is there a
better approach?

yes, those still drive me nuts too, but I'm learning :)
yield a if a and a.length > 0

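this can be written as...
yield a if a && !a.empty?

&& short-circuits, so it won't get to 'empty?' if 'a' is nil, and it
dodges the missing method error. (Chaining the modifiers as
"yield a if a unless a.empty?" doesn't help, because the trailing
'unless' is evaluated first.)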
return "" unless chunk.respond_to?("[]")

you can avoid this by forcing the issue...
chunk = [] << chunk
chunk.flatten!

now chunk is an Array no matter what. could be an empty one, or could
contain useless objects, but it's an Array ;)

Also... I think parse2 and parse3 can be combined into something like this:

p2 = p1.split(/(\.\.\.|--)/)
p2.delete('--')
p2.delete('...')
p2.each {|p3| yield p3}

if you can change your editor from tabs to 2 spaces, your code will
hold up better in the emails.
 
Charles Hixson

James said:
What about something like:

def parse1(chunk)
if chunk =~ / ^([^-A-Za-z0-9]*) # pre-match
(.*) # middle


Sorry, that probably needs to be:

(.*?)

My bad.

James Edward Gray II

That would only work if the postmatch pattern were included in the same
pattern as the prematch, thus:
if chunk =~ /^([^-A-Za-z0-9.]*)(.*?)([^-A-Za-z0-9.]*)$/

But that's a good idea. I hadn't realized it was so easy to specify
that a piece of a pattern was non-greedy.
Thanks.
([^-A-Za-z0-9]*)$ # post-match
/x
a, b, c = $1, $2, $3
end
end

Does that give you some ideas?

James Edward Gray II
 
David A. Black

Hi --

this can be written as...
yield a if a && !a.empty?

Or:

yield a unless a.to_s.empty? # nil.to_s is ""

(However, see my notes in my other post.)
&& short-circuits, so it won't get to 'empty?' if 'a' is nil, and it
dodges the missing method error.
return "" unless chunk.respond_to?("[]")

you can avoid this by forcing the issue...
chunk = [] << chunk

Or:

chunk = [chunk]

(I personally dislike reusing variable names when things are changed
like this, but that's a side [non-]issue :)


David
 
David A. Black

Hi --

That's a very interesting rewrite of the parse1 match patterns.
It's a bit more complicated than that, e.g., at the match boundary
apostrophes aren't a part of the middle, but in the middle (of the
middle) they are. Think about 'don't'.

Think about "'tis" :)
P.S.: Does scan keep recycling its pattern, or would I need to replace
it with (an elaboration of)
chunks = line.scan(/^((\W*)(\w*)(\W*)\s+)$/).flatten

scan keeps recycling its pattern, but note that ^ and $ will be
included each time. Note also that ^ and $ apply to lines, not the
entire string. Thus:

$ ruby -e 'p "a\nb\ncde\n".scan(/./)'
["a", "b", "c", "d", "e"]
$ ruby -e 'p "a\nb\ncde\n".scan(/^.$/)'
["a", "b"]

You can use \A and \z to indicate start and end of string (or \Z to
match end-except-for-possible-newline).
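
A small check of how these anchors behave on a string with a trailing newline (a sketch, with made-up sample strings):

```ruby
line = "cde\n"

p line =~ /e\z/   # nil: \z matches only at the very end, after the "\n"
p line =~ /e\Z/   # 2:   \Z also matches just before a final "\n"
p line =~ /\Ac/   # 0:   \A matches only at the start of the string

text = "a\nb\ncde\n"
p text.scan(/^.$/)     # ["a", "b"]: ^ and $ match at every line
p text.scan(/\A.$/)    # ["a"]:      \A pins the match to the first line
```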


David
 
James Edward Gray II

James said:
What about something like:

def parse1(chunk)
if chunk =~ / ^([^-A-Za-z0-9]*) # pre-match
(.*) # middle


Sorry, that probably needs to be:

(.*?)

My bad.

James Edward Gray II

That would only work if the postmatch pattern were included in the
same pattern as the prematch, thus:
if chunk =~ /^([^-A-Za-z0-9.]*)(.*?)([^-A-Za-z0-9.]*)$/

I did include all three in the same pattern, I just used the whitespace
and comment modifier to pretty it up a bit.

/^([^-A-Za-z0-9.]*)(.*?)([^-A-Za-z0-9.]*)$/

Is the same as:

/
^([^-A-Za-z0-9.]*) # pre-match
(.*?) # middle
([^-A-Za-z0-9.]*)$ # post-match
/x

Note the trailing /x.

James Edward Gray II
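
To make the comparison concrete, here is a sketch that runs both spellings on one made-up chunk, plus a greedy variant to show why the middle needs the "?". The variable names are only for illustration.

```ruby
compact = /^([^-A-Za-z0-9.]*)(.*?)([^-A-Za-z0-9.]*)$/
spaced  = /
  ^([^-A-Za-z0-9.]*)   # pre-match
  (.*?)                # middle, non-greedy
  ([^-A-Za-z0-9.]*)$   # post-match
/x
greedy  = /^([^-A-Za-z0-9.]*)(.*)([^-A-Za-z0-9.]*)$/

chunk = '("e.g.",'
p compact.match(chunk).captures   # ["(\"", "e.g.", "\","]
p spaced.match(chunk).captures    # the same three pieces
p greedy.match(chunk).captures    # ["(\"", "e.g.\",", ""]
```

With the greedy middle, (.*) swallows the trailing punctuation and the last group is left empty.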
 
YANAGAWA Kazuhisa

Charles Hixson said:
# parse1 separates a chunk into the non-word stuff before it, the word
stuff, and the non-word stuff after it

String#split includes a captured portion in a result array, so you can do:

irb(main):008:0> "!@#456&*(".split(/([\w\-.]+)/, 3)
=> ["!@#", "456", "&*("]
irb(main):009:0> "456&*(".split(/([\w\-.]+)/, 3)
=> ["", "456", "&*("]
irb(main):010:0> "!@#456".split(/([\w\-.]+)/, 3)
=> ["!@#", "456", ""]

If a string has extra word-nonword pairs, you should consider that.
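
A sketch of what that looks like in practice. As far as I know, the limit counts the split fields and not the captured separators, so a limit of 3 can return more than three elements when extra pairs are present, while a limit of 2 leaves everything after the first word stuff in the tail. The sample strings are made up.

```ruby
re = /([\w\-.]+)/

p "!@#456&*(".split(re, 3)       # ["!@#", "456", "&*("]
p "!@#456&*(789".split(re, 3)    # more than three pieces
p "!@#456&*(789".split(re, 2)    # ["!@#", "456", "&*(789"]
```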

# parse2 takes a hunk of word stuff, and possibly separates it at a (snip)
# parse3 takes a hunk of word stuff, and possibly separates it at an

So you just do:

def parse_chunk(chunk, regexp)
chunk.split(regexp, 3).each {|v| yield(v)}
end

def parse2(chunk, &block)
parse_chunk(chunk, /(\.\.\.)/, &block)
end

def parse3(chunk, &block)
parse_chunk(chunk, /(--)/, &block)
end
 
Charles Hixson

James said:
James said:
On Sep 18, 2004, at 5:04 PM, James Edward Gray II wrote:

What about something like:

def parse1(chunk)
if chunk =~ / ^([^-A-Za-z0-9]*) # pre-match
(.*) # middle



Sorry, that probably needs to be:

(.*?)

My bad.

James Edward Gray II


That would only work if the postmatch pattern were included in the
same pattern as the prematch, thus:
if chunk =~ /^([^-A-Za-z0-9.]*)(.*?)([^-A-Za-z0-9.]*)$/


I did include all three in the same pattern, I just used the
whitespace and comment modifier to pretty it up a bit.

/^([^-A-Za-z0-9.]*)(.*?)([^-A-Za-z0-9.]*)$/

Is the same as:

/
^([^-A-Za-z0-9.]*) # pre-match
(.*?) # middle
([^-A-Za-z0-9.]*)$ # post-match
/x

Note the trailing /x.

James Edward Gray II

Sorry. It took me a bit of digging to find the /.../x documentation
even after you explicitly pointed it out to me. (This won't work for
me, because my actual pre- and post- patterns also exclude spaces, but
it can certainly clarify the example, if one understands it!)
 
Charles Hixson

David said:
Hi --


Think about "'tis" :)
OUCH! An excellent point. I don't know how to handle that except by
another special case pre-processor. Sigh! I was *SO* hoping to
minimize special cases. (Of course, I knew that would be impossible,
but still...)
P.S.: Does scan keep recycling ...it with (an elaboration of)
chunks = line.scan(/^((\W*)(\w*)(\W*)\s+)$/).flatten
scan keeps recycling its pattern, but note that ^ and $ will be
included each time. Note also that ^ and $ apply to lines, not the
entire string. Thus:

$ ruby -e 'p "a\nb\ncde\n".scan(/./)'
["a", "b", "c", "d", "e"]
$ ruby -e 'p "a\nb\ncde\n".scan(/^.$/)'
["a", "b"]

You can use \A and \z to indicate start and end of string (or \Z to
match end-except-for-possible-newline).
David
Mmmph... This may take a bit of experimentation to get right. OTOH, \Z
could save a chomp invocation on each line. (I do think that handling a
line at a time is the appropriate choice. The other reasonable choice
would be to accumulate text until I encounter a blank line, and then
parse it all in one go. But that's a bit more sensitive to the
formatting of the text, and could result in excessively large buffers.)
(You can certainly tell I'm no expert at regexps.)
 
James Edward Gray II

Sorry. It took me a bit of digging to find the /.../x documentation
even after you explicitly pointed it out to me. (This won't work for
me, because my actual pre- and post- patterns also exclude spaces, but
it can certainly clarify the example, if one understands it!)

You can match space characters in an /.../x regex. The easiest way is
to use the whitespace character class escape \s.

Hope that helps.

James Edward Gray II
 
Charles Hixson

YANAGAWA said:


# parse1 separates a chunk into the non-word stuff before it, the word
stuff, and the non-word stuff after it
String#split includes a captured portion in a result array, so you can do:
irb(main):008:0> "!@#456&*(".split(/([\w\-.]+)/, 3)
=> ["!@#", "456", "&*("]
irb(main):009:0> "456&*(".split(/([\w\-.]+)/, 3)
=> ["", "456", "&*("]
irb(main):010:0> "!@#456".split(/([\w\-.]+)/, 3)
=> ["!@#", "456", ""]
If a string has extra word-nonword pairs, you should consider that.
Unfortunately, when I take this approach, I get a lot of nils (and the
string isn't actually only split into threes, though I suppose you are
implying I could do it iteratively on the tail...which could work).
So you just do:

def parse_chunk(chunk, regexp)
chunk.split(regexp, 3).each {|v| yield(v)}
end

def parse2(chunk, &block)
parse_chunk(chunk, /(\.\.\.)/, &block)
end

def parse3(chunk, &block)
parse_chunk(chunk, /(--)/, &block)
end
I'm still in the early stages of picking up Ruby (again...the first time
the libraries were too immature for me to use them), so I'm having
difficulty envisioning how one would chain those parse methods
together. &block should clearly be a block of code, but I don't see
*what* the code should be.
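
One possible reading, as a sketch only: the block is whatever the outermost caller wants done with each final piece, and each layer simply hands it down. The three parse methods below are the ones quoted above; the wrds wrapper and the sample string are my own assumption about how they might be chained.

```ruby
def parse_chunk(chunk, regexp)
  chunk.split(regexp, 3).each { |v| yield(v) }
end

def parse2(chunk, &block)
  parse_chunk(chunk, /(\.\.\.)/, &block)
end

def parse3(chunk, &block)
  parse_chunk(chunk, /(--)/, &block)
end

# Hypothetical wrapper: every piece from parse2 goes through parse3,
# and parse3 passes its pieces to the caller's block.
def wrds(chunk, &block)
  parse2(chunk) { |p2| parse3(p2, &block) }
end

wrds("wait...what--now") { |piece| p piece }
# prints "wait", "...", "what", "--", "now", one per line
```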

There are many Ruby idioms that I'm less than totally fluent with, and
that's one of them. (I just recently figured out that if a routine has
yields, then you probably shouldn't have returns. And I'm still not
certain that an error check at the start shouldn't just return an
invalid value...but for safety I've been rewriting things to do a logic
branch around the normal code and yield an invalid value. It works, but
I'm not really sure it's the best way.)

Sorry, I've done a lot of programming, but this is my first try at a
substantial piece in Ruby. So I can generally make things work, but
frequently go the long way around to do it.
 
Charles Hixson

James said:
You can match space characters in an /.../x regex. The easiest way is
to use the whitespace character class escape \s.

Hope that helps.

James Edward Gray II

Does that work inside character class definitions( [] delimited groups)?
 
Charles Hixson

Charles said:
James said:
You can match space characters in an /.../x regex. The easiest way
is to use the whitespace character class escape \s.
Hope that helps.
James Edward Gray II

Does that work inside character class definitions( [] delimited groups)?

Silly of me, of course not. \s *IS* a character class definition.

But this does mean that I won't be able to use /.../x in the code.
Still, it's great for clarifying the examples, now that I understand it.
 
James Edward Gray II

Does that work inside character class definitions( [] delimited
groups)?

Contrary to what you expect (judging by your later message), it sure
does.

[\saeiou]

Will match a whitespace or vowel character.
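
A short sketch to try it out. The last pattern is a made-up example of a negated class that also excludes whitespace, written with /x to show that \s keeps working there.

```ruby
p "b c" =~ /[\saeiou]/        # 1: the space matches
p "xyz" =~ /[\saeiou]/        # nil: no whitespace or vowel

# In /x mode, spaces used to lay out the pattern are ignored,
# and \s still matches whitespace, inside brackets too.
non_word = /[^-A-Za-z0-9.\s]+/x
p "!! word"[non_word]         # "!!"
```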

James Edward Gray II
 
Charles Hixson

Bill said:
I'm sure there must be a more idiomatic+efficient way to do this, but I
can't figure it out. Any suggestions?

Might be easier if you could show a sample of what you're starting
with, and what you'd like it to become. If the sample is large, you
can always paste it into codepaste (http://www.codepaste.org), and
just put the link to it in the email


Also, I'm not sure all of the tests are necessary. Many of them were
added to avoid "Nil class does not implement..." messages. Is there a
better approach?

yes, those still drive me nuts too, but I'm learning :)


yield a if a and a.length > 0

this can be written as...
yield a if a && !a.empty?

&& short-circuits, so it won't get to 'empty?' if 'a' is nil, and it
dodges the missing method error.


return "" unless chunk.respond_to?("[]")

you can avoid this by forcing the issue...
chunk = [] << chunk
chunk.flatten!

now chunk is an Array no matter what. could be an empty one, or could
contain useless objects, but it's an Array ;)

Also... I think parse2 and parse3 can be combined into something like this:

p2 = p1.split(/(\.\.\.|--)/)
p2.delete('--')
p2.delete('...')
p2.each {|p3| yield p3}

if you can change your editor from tabs to 2 spaces, your code will
hold up better in the emails.
OK, I've copied it over to codepaste here:
http://www.codepaste.org/paste/comment/218
http://www.codepaste.org/view/paste/163?show_comments=1
etc.
(How long does code stay up here? I never knew the site existed.)

But I think that I put the relevant pieces in the first e-mail.
However, I don't intend that ellipses and double-dashes be deleted.
They merely need to be parsed separately from the words that they appear
with. They do contain significant meaning, so merely deleting them
would be anti-productive.
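
For what it's worth, a split on a pattern with a capturing group already keeps the separators in the result, so the two delete calls can simply be left out. A minimal sketch, assuming empty pieces are not wanted (the method name split_marks is made up):

```ruby
def split_marks(chunk)
  # The parentheses make split return the separators as their own pieces.
  chunk.split(/(\.\.\.|--)/).reject { |piece| piece.empty? }
end

p split_marks("so--wait...what")   # ["so", "--", "wait", "...", "what"]
p split_marks("--oh")              # ["--", "oh"]
p split_marks("plain")             # ["plain"]
```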

Also: Is

chunk = [] << chunk
chunk.flatten!
return chunk
better in some way than
return "" unless chunk.respond_to?("[]")

I could see, perhaps,
return "" if not chunk or chunk.empty?
but I'd been reading that it was more Ruby-esque to use duck typing and the respond_to? test. That's one reason I didn't do
return "" if chunk.nil?

And I'm still not certain what I should really be doing in such a case. What I really want to do is avoid returning nil. I want to skip over the processing of this case without aborting the calling process (likely an each). If this were a loop, then the command would be next rather than break or retry.
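
One way to get that "next"-like behavior, sketched under the assumption that callers consume the pieces through the block and ignore the method's return value: a bare return before any yield means the block is never called for that chunk, and the caller's each just moves on. The method name each_word and the sample data are hypothetical.

```ruby
def each_word(chunk)
  return unless chunk.respond_to?(:split)  # skip: yields nothing, raises nothing
  chunk.split.each { |w| yield w }
end

seen = []
["one two", nil, 42, "three"].each do |chunk|
  each_word(chunk) { |w| seen << w }
end
p seen   # ["one", "two", "three"]
```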
 
