regexp help sought

R

rpardee

Hey All,

I'm trying to parse lines from my text editor's config file, which look
like this (pls watch for line wrap--there is one line per language,
starting with /L<<digit>>):

/L1"SAS" Line Comment = * Block Comment On = /* Block Comment Off = */
Block Comment On Alt = * Block Comment Off Alt = ; Nocase File
Extensions = SAS
/L2"Visual Basic" Line Comment = ' File Extensions = BAS FRM CLS VBS
CTL WSF
/L4"HTML" Nocase Noquote HTML_LANG Block Comment On = <!-- Block
Comment Off = --> Block Comment On Alt = <% Block Comment Off Alt = %>
String Chars = "' File Extensions = HTM HTML ASP SHTML HTT HTX JSP
/L11"Ruby" Line Comment Num = 2# Block Comment On = =begin Block
Comment Off = =end String Chars='" Escape Char = \ File Extensions = RB
RBW

I'm trying to write a method for extracting the comment markers & their
types (line/block & on/off). Regexps seemed the obvious tool, and I
eventually came up with this one:

c = Regexp.new("(Line|Block) Comment (On |Off |On Alt |Off Alt)*=
([^\s\t\r\n\f]+) ")

This is working well so far, except that it only grabs out the first
type of comment in each line. I'd hoped that I could make it get all
the comment types by putting an additional set of parens and a +
quantifier around the whole expression:

c = Regexp.new("((Line|Block) Comment (On |Off |On Alt |Off Alt)*=
([^\s\t\r\n\f]+))+ ")

But that just seems to break it--that version doesn't capture anything.

Anybody got a clue for me? I'm using v1.8 on windows. My code is
below. (And again, pls watch for line wrapping).

Thanks!

-Roy

def parse_comment_markers(line)
=begin

There are line comments & (2 different kinds of) block comments.

Line comments only have a start marker--EOL is the terminator.

Comment types are:
Line Comment = <<mark>>
Block Comment On = <<mark>>
Block Comment Off = <<mark>>
Block Comment On Alt = <<mark>>
Block Comment Off Alt = <<mark>>

Where <<mark>> can be any contiguous set of non-whitespace chars.

For Line comment marks, preceding digits specify the # of spaces
minus 1
required after the nondigit portion of the marker. So for ruby, the
line
comment mark is 2#, signifying that # is a comment only if it is
followed by
a space. Ignore this for now.

So--funky regexp time. We want to grab sequences centered around
the string " Comment ".
We want the single word prior to "Comment" and all words between
"Comment" and " = ", and
then of course the contiguous nonwhitespace following " = ".

=end
puts line
# Why doesn't the \S char class work?
# c = Regexp.new("(Line|Block) Comment (On |Off |On Alt |Off Alt)*=
(\S+)")
c = Regexp.new("(Line|Block) Comment (On |Off |On Alt |Off Alt)*=
([^\s\t\r\n\f]+) ")
cm = c.match(line)
if cm.nil?
puts "No match!"
else
puts cm.captures.join(" || ")
puts "Comment type is \"" + cm.captures[0] + "\", and comment
marker is \"" + cm.captures[2] + "\""
end
end

parse_comment_markers("/L2 \"Ruby\" Line Comment = # Block Comment On =
' File Extensions = RB RBW")
parse_comment_markers("/L2 \"Ruby\" Block Comment On = =begin Block
Comment Off = =end File Extensions = RB RBW")
 
R

Robert Klemme

Hey All,

I'm trying to parse lines from my text editor's config file, which look
like this (pls watch for line wrap--there is one line per language,
starting with /L<<digit>>):

/L1"SAS" Line Comment = * Block Comment On = /* Block Comment Off = */
Block Comment On Alt = * Block Comment Off Alt = ; Nocase File
Extensions = SAS
/L2"Visual Basic" Line Comment = ' File Extensions = BAS FRM CLS VBS
CTL WSF
/L4"HTML" Nocase Noquote HTML_LANG Block Comment On = <!-- Block
Comment Off = --> Block Comment On Alt = <% Block Comment Off Alt = %>
String Chars = "' File Extensions = HTM HTML ASP SHTML HTT HTX JSP
/L11"Ruby" Line Comment Num = 2# Block Comment On = =begin Block
Comment Off = =end String Chars='" Escape Char = \ File Extensions = RB
RBW

I'm trying to write a method for extracting the comment markers & their
types (line/block & on/off). Regexps seemed the obvious tool, and I
eventually came up with this one:

c = Regexp.new("(Line|Block) Comment (On |Off |On Alt |Off Alt)*=
([^\s\t\r\n\f]+) ")

This is working well so far, except that it only grabs out the first
type of comment in each line. I'd hoped that I could make it get all
the comment types by putting an additional set of parens and a +
quantifier around the whole expression:

c = Regexp.new("((Line|Block) Comment (On |Off |On Alt |Off Alt)*=
([^\s\t\r\n\f]+))+ ")

But that just seems to break it--that version doesn't capture anything.

Anybody got a clue for me? I'm using v1.8 on windows. My code is
below. (And again, pls watch for line wrapping).

You want String#scan

matches = line.scan(re)

or

line.scan(re) do |match|
....
end

Kind regards

robert
 
R

rpardee

Awesome--that's exactly what I needed. And much more readable than
elaborating the regexp.

Thanks!

-Roy
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,059
Latest member
cryptoseoagencies

Latest Threads

Top