Regexp *-operator and multiple elements

Martijn Houtman · Dec 29, 2003

Hello,

I have an issue parsing a string with a regular expression. Here's a small
example:

@foobar = ("foobarbarbarfoo" =~ m/(foo)(bar)*(foo)/g);

this makes the array foobar contain:
{"foo", "bar", "foo"}
while I want it to be
{"foo", "bar", "bar", "bar", "foo"}

The *-operator seems to 'forget' the first few elements and just returns the
last element, which is stored in the $2 variable. Is there a way to make it
return the full list of elements?

It has been suggested that I split the string into three pieces first, and then
parse them separately, but I'd still like to do it with a single regular
expression.

Thanks in advance!
Regards,

Gunnar Hjalmarsson · Dec 29, 2003

Martijn said:
@foobar = ("foobarbarbarfoo" =~ m/(foo)(bar)*(foo)/g);

this makes the array foobar contain:
{"foo", "bar", "foo"}
while I want it to be
{"foo", "bar", "bar", "bar", "foo"}

The *-operator seems to 'forget' the first few elements and just
returns the last element, which is stored in the $2 variable. Is
there a way to make it return the full list of elements?

This may be something in the right direction:

@foobar = 'foobarbarbarfoo' =~ /(foo)((?:bar)*)(foo)/;

It distinguishes between clustering, (?:...), and capturing, (...), so the
'*' ends up inside the capturing group. The result is:

('foo', 'barbarbar', 'foo')

Of course, to get an array with five elements you can do:

@foobar = 'foobarbarbarfoo' =~ /(foo|bar)/g;

But that matches much more. ;-)
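
To make the difference between the three patterns concrete, here is a small
side-by-side sketch. The variable names and the extra 'barfoobar' input are
only illustrative assumptions; the patterns are the ones quoted above.

```perl
use strict;
use warnings;

my $str = 'foobarbarbarfoo';

# Original pattern: group 2 is overwritten on every repetition of (bar)*,
# so only the last "bar" survives.
my @last_only = $str =~ /(foo)(bar)*(foo)/;

# Quantifier moved inside the capture: one element holding the whole run.
my @whole_run = $str =~ /(foo)((?:bar)*)(foo)/;

# Alternation with /g: one element per token, but no check on their order.
my @tokens = $str =~ /(foo|bar)/g;

# The /g pattern also accepts input that is not of the form foo bar* foo.
my @loose = 'barfoobar' =~ /(foo|bar)/g;

print join(',', @$_), "\n" for \@last_only, \@whole_run, \@tokens, \@loose;
```

The last line is what "matches much more" means in practice: the alternation
happily returns tokens from a string that the original pattern would reject.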

Ragnar Hafstað · Dec 29, 2003

Martijn Houtman said:
Hello,

I have an issue parsing a string with a regular expression. Here's a small
example:

@foobar = ("foobarbarbarfoo" =~ m/(foo)(bar)*(foo)/g);

this makes the array foobar contain:
{"foo", "bar", "foo"}
while I want it to be
{"foo", "bar", "bar", "bar", "foo"}

The *-operator seems to 'forget' the first few elements and just returns the
last element, which is stored in the $2 variable.

this is because the * is outside the capture brackets

Is there a way to make it
return the full list of elements?

yes
this can be done with a combination of look-ahead/behind assertions, along
with the \G assertion

It has been suggested that I split the string into three pieces first, and then
parse them separately, but I'd still like to do it with a single regular
expression.

why not:
@foobar = ("foo",("foobarbarbarfoo" =~ /foo((?:bar)*)foo/ &&
$1=~/(bar)/g),"foo");

if you still want to do it with the assertions:
@foobar = ("foobarbarbarfoo" =~
/(foo(?=(?:bar)*foo)|\Gbar(?=(?:bar)*foo)|(?<=foo(?:bar))\Gfoo)/g;

gnari

Ragnar Hafstað · Dec 29, 2003

Ragnar Hafstað said:
@foobar = ("foobarbarbarfoo" =~
/(foo(?=(?:bar)*foo)|\Gbar(?=(?:bar)*foo)|(?<=foo(?:bar))\Gfoo)/g;

ooops, the cut-and-paste failed to include the closing parens
@foobar = ("foobarbarbarfoo" =~
/(foo(?=(?:bar)*foo)|\Gbar(?=(?:bar)*foo)|(?<=foo(?:bar))\Gfoo)/g);

gnari
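
A note for anyone trying the pattern above: tracing it by hand on the sample
string suggests the last alternative may not pick up the closing "foo",
because the lookbehind (?<=foo(?:bar)) wants exactly "foobar" in front of it
and here three bars come first. The sketch below is one possible variant, not
the poster's pattern: shortening the lookbehind to (?<=bar) and moving \G to
the front are both my assumptions, and this version requires at least one
"bar" between the two foos.

```perl
use strict;
use warnings;

my $str = 'foobarbarbarfoo';

# Pattern as posted in the thread, kept for comparison.
my @posted = $str =~
    /(foo(?=(?:bar)*foo)|\Gbar(?=(?:bar)*foo)|(?<=foo(?:bar))\Gfoo)/g;

# Hypothetical variant: \G anchors every alternative at the end of the
# previous match, and the closing foo only has to be preceded by one "bar".
my @variant = $str =~
    /\G(foo(?=(?:bar)*foo)|bar(?=(?:bar)*foo)|(?<=bar)foo)/g;

print scalar(@posted), " element(s) from the posted pattern\n";
print join(',', @variant), "\n";
```

With \G at the front, the match stops at the first token that does not fit,
so a string like 'fooXbarfoo' yields only the leading "foo" rather than
skipping over the X.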

Martijn Houtman · Dec 29, 2003

Ragnar said:
ooops, the cut-and-paste failed to include the closing parens
@foobar = ("foobarbarbarfoo" =~
/(foo(?=(?:bar)*foo)|\Gbar(?=(?:bar)*foo)|(?<=foo(?:bar))\Gfoo)/g);

Thanks, Gnari and Gunnar, for your suggestions. I fail to see what exactly
happens in the above example, though. I wish the answer had been a
bit simpler.

The problem is, the above might work for the above example, but my "real
life" situation is a bit more complex. Take a look at this url, if you're
interested: http://tinus.ath.cx/temp/form.txt. It's the code I currently
have.

It's meant to be a .java-file parser. The idea of this uni assignment is to
have the script count a few certain keywords, like 'private', 'class',
'new' etc. in the .java-file. Now, @imports is supposed to catch the bits
surrounded by '( )' in the regexps. It does, but where the
multiplier-operator * is used, it just counts the last, as explained in my
previous, smaller example.

I know there might be a better way to count the keywords, but I would still
like to finish the parser as it is. Suggestions are very welcome.

Thanks again. Kind regards,

Ragnar Hafstað · Dec 30, 2003

The problem is, the above might work for the above example, but my "real
life" situation is a bit more complex. Take a look at this url, if you're
interested: http://tinus.ath.cx/temp/form.txt. It's the code I currently
have.

It's meant to be a .java-file parser. The idea of this uni assignment is to
have the script count a few certain keywords, like 'private', 'class',
'new' etc. in the .java-file. Now, @imports is supposed to catch the bits
surrounded by '( )' in the regexps. It does, but where the
multiplier-operator * is used, it just counts the last, as explained in my
previous, smaller example.

I know there might be a better way to count the keywords, but I would still
like to finish the parser as it is. Suggestions are very welcome.

you might want to look at constructs like

$string=~s/somepattern_with_capture/func($1)/ge;

where func() is a sub that does your counting and optionally more
operations.
for example:
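sub func {
    my ($item) = @_;
    $counters{$item}++ if $countable{$item};   # count only the keywords of interest
    return '' if $deletable{$item};            # drop the item from the string
    return $item;                              # otherwise put it back unchanged
}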

hashes like %countable and %deletable would be preset to
control what action to take.

gnari
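
For reference, a self-contained sketch of the s///ge idea might look like
this. The keyword list, the \b(\w+)\b pattern and the one-line Java sample
are all made up for illustration; %deletable is left empty so the text comes
back unchanged.

```perl
use strict;
use warnings;

my %countable = map { $_ => 1 } qw(private class new);
my %deletable;   # empty in this sketch: nothing is removed
my %counters;

sub func {
    my ($item) = @_;
    $counters{$item}++ if $countable{$item};
    return '' if $deletable{$item};
    return $item;
}

# Hypothetical input standing in for the contents of a .java file.
my $source = 'class Foo { private Bar b = new Bar(); private int n; }';

# /e evaluates func($1) as code for each word; /g repeats over the string.
(my $copy = $source) =~ s/\b(\w+)\b/func($1)/ge;

print "$_: $counters{$_}\n" for sort keys %counters;
```

Because every word goes through func() exactly once, each occurrence is
counted, which sidesteps the problem of a repeated group keeping only its
last match.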

Charles DeRykus · Dec 30, 2003

Hello,

I have an issue parsing a string with a regular expression. Here's a small
example:

@foobar = ("foobarbarbarfoo" =~ m/(foo)(bar)*(foo)/g);

this makes the array foobar contain:
{"foo", "bar", "foo"}
while I want it to be
{"foo", "bar", "bar", "bar", "foo"}

The *-operator seems to 'forget' the first few elements and just returns the
last element, which is stored in the $2 variable. Is there a way to make it
return the full list of elements?

It has been suggested that I split the string into three pieces first, and then
parse them separately, but I'd still like to do it with a single regular
expression.

Another possibility:

if ( "foobarbarbarfoo" =~ /^foo(.*?)foo$/ and
     (my $match = $1) =~ /^(?:bar)+$/ )
{
    # $match holds the whole run of bars; split it into single elements
    @foobar = ('foo', $match =~ /(bar)/g, 'foo');
    print join "\n", @foobar;
}

hth,

ctcgag · Jan 2, 2004

Martijn Houtman said:
Hello,

I have an issue parsing a string with a regular expression. Here's a small
example:

@foobar = ("foobarbarbarfoo" =~ m/(foo)(bar)*(foo)/g);

this makes the array foobar contain:
{"foo", "bar", "foo"}
while I want it to be
{"foo", "bar", "bar", "bar", "foo"}

The *-operator seems to 'forget' the first few elements and just returns
the last element, which is stored in the $2 variable. Is there a way to
make it return the full list of elements?

Others have given some solutions, but I don't think you understand why it
does what it does. The mapping between capturing parentheses and capture
variables is lexical, not dynamic. $2 holds the match of the capturing
parentheses which are lexically the second to be opened in the regex.
If that set matches multiple times, the $2 variable holds the last one of
these matches.

Xho
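
A short way to see this for yourself is to look at the match offsets Perl
records in @- and @+. This is only a sketch with made-up variable names; it
shows that group 2 ends up pointing at the third "bar".

```perl
use strict;
use warnings;

my $str = 'foobarbarbarfoo';

# Three opening parentheses in the pattern, so three capture slots,
# no matter how often (bar) repeats.
my @groups = $str =~ /(foo)(bar)*(foo)/ or die "no match";

# @- and @+ hold the start and end offsets of each group in the last match.
my ($start, $end) = ($-[2], $+[2]);

print "number of groups: ", scalar(@groups), "\n";
print "group 2 is '$groups[1]' at offsets $start to $end\n";
```

The offsets 9 to 12 correspond to the last "bar" in the string; the two
earlier repetitions were matched but their slot was reused.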
