Regex and non-greedy matching?

Marc Heiler

I have a slight problem. I have strings with some tags such as

'<b><lightblue>name:</></b>'

I need to match "name:" and "lightblue"
In other words:
- What is between <> </>
and
- What is inside the first <> right next to "name:"

The following regex does not work:

'<b><lightblue>name:</></b>' =~ /<([a-zA-Z]+)>(.+?)<\/>/

$1 # => "b"
$2 # => "<lightblue>name:"

$2 should only be name:
and $1 should only be lightblue
 
Nobuyoshi Nakada

Hi,

At Mon, 7 Apr 2008 08:25:28 +0900,
Marc Heiler wrote in [ruby-talk:297262]:
$2 should only be name:
and $1 should only be lightblue

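Non-greedy matching doesn't mean that the shortest possible result is matched.
The regex still matches at the leftmost possible position; the non-greedy
quantifier only picks the shortest extension from that starting point.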
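
A small sketch of this point, using the string and regex from the question (the variable names are chosen only for illustration). The match starts at the first '<' from which any match is possible, and the lazy quantifier only limits how far the match extends from there:

```ruby
str  = '<b><lightblue>name:</></b>'
lazy = /<([a-zA-Z]+)>(.+?)<\/>/

md = lazy.match(str)

start_index  = md.begin(0)  # offset where the overall match begins
first_group  = md[1]        # tag name captured at that leftmost position
second_group = md[2]        # shortest text that still reaches '</>'

puts start_index   # 0
puts first_group   # b
puts second_group  # <lightblue>name:
```

A shorter match exists further to the right, but the engine never looks for it because a match already succeeds at offset 0.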
 
s.ross

Hi--

You might want to look into hpricot
(http://code.whytheluckystiff.net/hpricot/). It will give you pretty
reliable parsing of XML markup. What you have here is not valid XML,
because the closing tag for <lightblue> is not </lightblue>, but on the
chance that it's a typo, I really recommend giving hpricot a try.
 
7stud --

Marc said:
'<b><lightblue>name:</></b>' =~ /<([a-zA-Z]+)>(.+?)<\/>/

$1 # => "b"

This is your string:

'<b><lightblue>name:</></b>'

and the first part of your regex says to look for a '<', followed by one
or more letters, followed by a '>'. That certainly describes the
substring '<b>'.

$2 # => "<lightblue>name:"

This is your string again:

'<b> <--already matched this
<lightblue>name:</></b>'

The second part of your regex says to look for a '>', followed by any
character one or more times, followed by '</>'. After the '>' that closes
'<b>', that certainly describes the string '<lightblue>name:</>'.

Note that since the characters '</>' only appear once in your string,
the non-greedy quantifier has no effect. By default, quantifiers are greedy,
so if your string looked like this:

'<b><lightblue>name:</></b>xxxxxxxxxxxxxxx</>'

then the greedy version of your regex:

/>(.+)<\/>/ <----(no '?')

would match:
<lightblue>name:</></b>xxxxxxxxxxxxxxx</>

That's because the portion:

<lightblue>name:</></b>xxxxxxxxxxxxxxx

is interpreted as "any character(.) one or more times(+)".

On the other hand, your non-greedy regex (i.e. with the '?') would match:

<lightblue>name:</>
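
The difference can be checked with a short sketch. It assumes the longer example string above and the two pattern fragments as written in this post; the variable names are only illustrative:

```ruby
str = '<b><lightblue>name:</></b>xxxxxxxxxxxxxxx</>'

greedy = />(.+)<\/>/   # no '?': takes as much as possible
lazy   = />(.+?)<\/>/  # with '?': stops at the first '</>'

greedy_capture = greedy.match(str)[1]
lazy_capture   = lazy.match(str)[1]

puts greedy_capture  # runs up to the last '</>' in the string
puts lazy_capture    # runs up to the first '</>' in the string
```

Both matches begin at the same '>', the one that closes '<b>'; only the end of the match differs.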


If you examine your string again:

'<b><lightblue>name:</></b>'

the 'lightblue' substring is preceded by the characters '><', and that
is different from what precedes 'b'. You can use that fact to get
'lightblue' instead of 'b'. This regex will get 'lightblue':

/><([^>]+)/

That says to look for '><' followed by one or more characters that are
not a '>'. That will match:

'><lightblue'

To get 'name:', you can do something similar. This is the rest of the
string after 'lightblue':

'>name:</></b>'

Here is a regex to get 'name:':

/>([^<]+)/

That says to look for a '>', followed by one or more characters that are
not a '<'. Here it is all together:


pattern = /><([^>]+)>([^<]+)/
str = "<b><lightblue>name:</></b>"

match_obj = pattern.match(str)
puts match_obj[1]
puts match_obj[2]

--output:--

lightblue
name:
 
Robert Klemme

Constructing a more specific regexp often helps:

irb(main):001:0> s='<b><lightblue>name:</></b>'
=> "<b><lightblue>name:</></b>"

irb(main):002:0> md = %r{<b>\s*<([^>]*)>([^<]*)</>}.match s
=> #<MatchData:0x7ff973f4>
irb(main):003:0> md.to_a
=> ["<b><lightblue>name:</>", "lightblue", "name:"]

irb(main):004:0> md = %r{<b>\s*<([^>]*)>\s*([^<]*)</>}.match s
=> #<MatchData:0x7ff85b54>
irb(main):005:0> md.to_a
=> ["<b><lightblue>name:</>", "lightblue", "name:"]
irb(main):006:0>

See how this works without a reluctant (non-greedy) quantifier?
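
The same idea can be sketched against the regex from the original question, without relying on the leading '<b>'. This is only one possible variant, and it assumes the tagged text never contains a '<': replacing `.+?` with the negated class `[^<]+` means the attempt starting at '<b>' fails, because the character after '<b>' is a '<', so the match moves on to '<lightblue>'.

```ruby
str = '<b><lightblue>name:</></b>'

# [^<]+ cannot cross into a nested tag, so '<b>' is rejected as a start
specific = /<([a-zA-Z]+)>([^<]+)<\/>/

md   = specific.match(str)
tag  = md[1]
text = md[2]

puts tag   # lightblue
puts text  # name:
```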

Cheers

robert
 
