Regexp *-operator and multiple elements

Martijn Houtman

Hello,

I have an issue parsing a string with a regular expression. Here's a small
example:

@foobar = ("foobarbarbarfoo" =~ m/(foo)(bar)*(foo)/g);

this makes the array foobar contain:
{"foo", "bar", "foo"}
while I want it to be
{"foo", "bar", "bar", "bar", "foo"}

The *-operator seems to 'forget' the first few elements and just returns the
last element, which is stored in the $2 variable. Is there a way to make it
return the full list of elements?

It has been suggested that I split the string into three pieces first and
then parse them separately, but I'd still like to do it with a single
regular expression.

Thanks in advance!
Regards,
 
Gunnar Hjalmarsson

Martijn said:
@foobar = ("foobarbarbarfoo" =~ m/(foo)(bar)*(foo)/g);

this makes the array foobar contain:
{"foo", "bar", "foo"}
while I want it to be
{"foo", "bar", "bar", "bar", "foo"}

The *-operator seems to 'forget' the first few elements and just
returns the last element, which is stored in the $2 variable. Is
there a way to make it return the full list of elements?

This may be something in the right direction:

@foobar = 'foobarbarbarfoo' =~ /(foo)((?:bar)*)(foo)/;

It distinguishes between clustering, (?:...), and capturing, (...), so the
'*' is captured as well. The result is:

('foo', 'barbarbar', 'foo')

Of course, to get an array with five elements you can do:

@foobar = 'foobarbarbarfoo' =~ /(foo|bar)/g;

But that matches much more. ;-)
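
As a rough illustration of the two suggestions above (the names @run and @loose and the second input string are made up for this sketch), the first pattern keeps the whole run of bars as a single element, while the global alternation also accepts input that the original pattern would reject:

```perl
use strict;
use warnings;

my $s = 'foobarbarbarfoo';

# Capture wrapped around the repeated cluster: the whole run is one element.
my @run = $s =~ /(foo)((?:bar)*)(foo)/;
print join(',', @run), "\n";        # foo,barbarbar,foo

# Global alternation: one element per foo or bar found in the string.
my @foobar = $s =~ /(foo|bar)/g;
print join(',', @foobar), "\n";     # foo,bar,bar,bar,foo

# "Matches much more": order and surrounding text are not checked at all.
my @loose = 'bar-xyz-foo' =~ /(foo|bar)/g;
print join(',', @loose), "\n";      # bar,foo
```

So the second form is only safe when the input is already known to have the foo, bar..., foo shape.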
 
Ragnar Hafstað

Martijn Houtman said:
Hello,

I have an issue parsing a string with a regular expression. Here's a small
example:

@foobar = ("foobarbarbarfoo" =~ m/(foo)(bar)*(foo)/g);

this makes the array foobar contain:
{"foo", "bar", "foo"}
while I want it to be
{"foo", "bar", "bar", "bar", "foo"}

The *-operator seems to 'forget' the first few elements and just returns the
last element, which is stored in the $2 variable.

This is because the * is outside the capturing parentheses.

Martijn Houtman said:
Is there a way to make it return the full list of elements?

Yes. This can be done with a combination of look-ahead/look-behind
assertions, along with the \G assertion.

Martijn Houtman said:
It has been suggested that I split the string into three pieces first and
then parse them separately, but I'd still like to do it with a single
regular expression.

Why not:
@foobar = ("foo", ("foobarbarbarfoo" =~ /foo((?:bar)*)foo/ &&
$1 =~ /(bar)/g), "foo");

if you still want to do it with the assertions:
@foobar = ("foobarbarbarfoo" =~
/(foo(?=(?:bar)*foo)|\Gbar(?=(?:bar)*foo)|(?<=foo(?:bar))\Gfoo)/g;

gnari
 
Ragnar Hafstað

Ragnar Hafstað said:
@foobar = ("foobarbarbarfoo" =~
/(foo(?=(?:bar)*foo)|\Gbar(?=(?:bar)*foo)|(?<=foo(?:bar))\Gfoo)/g;

ooops, the cut-and-paste failed to include the closing parens
@foobar = ("foobarbarbarfoo" =~
/(foo(?=(?:bar)*foo)|\Gbar(?=(?:bar)*foo)|(?<=foo(?:bar))\Gfoo)/g);

gnari
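
For readers trying to follow the pattern above, here is a sketch of the same idea spread out with /x and commented. One caveat, offered tentatively: as posted, the look-behind in the third branch is the fixed string "foobar", so it appears to accept the closing foo only when exactly one bar precedes it. The variant below swaps in (?<=foo|bar), which is my own assumption about the intent and not something from the post.

```perl
use strict;
use warnings;

my $s = 'foobarbarbarfoo';

# The pattern as posted, for comparison.
my @as_posted = $s =~ /(foo(?=(?:bar)*foo)|\Gbar(?=(?:bar)*foo)|(?<=foo(?:bar))\Gfoo)/g;
print join(',', @as_posted), "\n";   # foo,bar,bar,bar

my $re = qr/
    (
        foo (?= (?:bar)* foo )       # opening foo: only if bar...bar foo follows
      |
        \G bar (?= (?:bar)* foo )    # a bar starting exactly where the last match ended
      |
        (?<= foo | bar ) \G foo      # closing foo, right after the opener or a bar
    )
/x;

my @foobar = $s =~ /$re/g;
print join(',', @foobar), "\n";      # foo,bar,bar,bar,foo
```

The \G assertion is what chains the pieces together: each bar and the closing foo must begin exactly where the previous match stopped, so stray text in between breaks the chain.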
 
Martijn Houtman

Ragnar said:
ooops, the cut-and-paste failed to include the closing parens
@foobar = ("foobarbarbarfoo" =~
/(foo(?=(?:bar)*foo)|\Gbar(?=(?:bar)*foo)|(?<=foo(?:bar))\Gfoo)/g);

Thanks, Gnari and Gunnar, for your suggestions. I fail to see what exactly
happens in the above example, though. I had hoped the answer would be a
bit simpler.

The problem is, the above might work for the above example, but my "real
life" situation is a bit more complex. Take a look at this url, if you're
interested: http://tinus.ath.cx/temp/form.txt. It's the code I currently
have.

It's meant to be a .java-file parser. The idea of this uni assignment is to
have the script count a few certain keywords, like 'private', 'class',
'new' etc. in the .java-file. Now, @imports is supposed to catch the bits
surrounded by '( )' in the regexps. It does, but where the
multiplier-operator * is used, it just counts the last, as explained in my
previous, smaller example.

I know there might be a better way to count the keywords, but I would still
like to finish the parser as it is. Suggestions are very welcome.

Thanks again. Kind regards,
 
Ragnar Hafstað

Martijn Houtman said:
The problem is, the above might work for the above example, but my "real
life" situation is a bit more complex. Take a look at this url, if you're
interested: http://tinus.ath.cx/temp/form.txt. It's the code I currently
have.

It's meant to be a .java-file parser. The idea of this uni assignment is to
have the script count a few certain keywords, like 'private', 'class',
'new' etc. in the .java-file. Now, @imports is supposed to catch the bits
surrounded by '( )' in the regexps. It does, but where the
multiplier-operator * is used, it just counts the last, as explained in my
previous, smaller example.

I know there might be a better way to count the keywords, but I would still
like to finish the parser as it is. Suggestions are very welcome.

you might want to look at constructs like

$string=~s/somepattern_with_capture/func($1)/ge;

where func() is a sub that does your counting and optionally more
operations.
for example:
sub func {
    my ($item) = @_;
    $counters{$item}++ if $countable{$item};   # count the items of interest
    return '' if $deletable{$item};            # optionally strip the item
    return $item;                              # otherwise leave it unchanged
}

hashes like %countable and %deletable would be preset to
control what action to take.

gnari
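
A minimal runnable sketch of the s///ge idea above. The keyword lists, the \b(\w+)\b pattern and the sample line of Java are all assumptions made for illustration, not taken from the actual assignment:

```perl
use strict;
use warnings;

# Hypothetical setup: which words to count and which to strip from the text.
my %countable = map { $_ => 1 } qw(private class new);
my %deletable = map { $_ => 1 } qw(private);
my %counters;

sub func {
    my ($item) = @_;
    $counters{$item}++ if $countable{$item};
    return '' if $deletable{$item};
    return $item;
}

# A made-up line of Java source, used only as sample input.
my $string = 'private class Foo { private Bar b = new Bar(); }';

# Each word is captured; the e modifier calls func on it, g repeats the search.
$string =~ s/\b(\w+)\b/func($1)/ge;

print "$_: $counters{$_}\n" for sort keys %counters;
# class: 1
# new: 1
# private: 2
```

Because the substitution visits every match in turn, repeated items are counted one by one instead of collapsing into the last capture.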
 
Charles DeRykus

Martijn Houtman said:
Hello,

I have an issue parsing a string with a regular expression. Here's a small
example:

@foobar = ("foobarbarbarfoo" =~ m/(foo)(bar)*(foo)/g);

this makes the array foobar contain:
{"foo", "bar", "foo"}
while I want it to be
{"foo", "bar", "bar", "bar", "foo"}

The *-operator seems to 'forget' the first few elements and just returns the
last element, which is stored in the $2 variable. Is there a way to make it
return the full list of elements?

It has been suggested that I split the string into three pieces first and
then parse them separately, but I'd still like to do it with a single
regular expression.

Another possibility:

if ( "foobarbarbarfoo" =~ /^foo(.*?)foo$/ and
     (my $match = $1) =~ /^(?:bar)+$/ )
{
    # $match holds the run of bars; split it into individual elements
    @foobar = ('foo', $match =~ /(bar)/g, 'foo');
    print join "\n", @foobar;
}


hth,
 
ctcgag

Martijn Houtman said:
Hello,

I have an issue parsing a string with a regular expression. Here's a small
example:

@foobar = ("foobarbarbarfoo" =~ m/(foo)(bar)*(foo)/g);

this makes the array foobar contain:
{"foo", "bar", "foo"}
while I want it to be
{"foo", "bar", "bar", "bar", "foo"}

The *-operator seems to 'forget' the first few elements and just returns
the last element, which is stored in the $2 variable. Is there a way to
make it return the full list of elements?

Others have given some solutions, but I don't think you understand why it
does what it does. The mapping between capturing parentheses and capture
variables is lexical, not dynamic. $2 holds the match of the capturing
parentheses which are lexically the second to be opened in the regex.
If that set matches multiple times, the $2 variable holds the last one of
these matches.

Xho
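
A small sketch of that point (the input strings and the names @caps and @none are invented for the example): the number of capture slots follows the number of opening parentheses in the pattern, not the number of repetitions at run time.

```perl
use strict;
use warnings;

# Three opening parentheses in the pattern, so exactly three capture slots,
# no matter how many times the middle group repeats.
my @caps = 'foobarbazfoo' =~ /(foo)(ba[rz])*(foo)/;
print scalar(@caps), "\n";          # 3
print "$caps[1]\n";                 # baz (the last repetition of group 2)

# With zero repetitions, group 2 never matched and its slot is undef.
my @none = 'foofoo' =~ /(foo)(bar)*(foo)/;
print defined $none[1] ? "defined\n" : "undef\n";   # undef
```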
 
