Simplify Variable Number of Regex Groups

W. Citoan · Jun 15, 2007

I need to reformat portions of text. The patterns I am matching and
their replacements are as follows:

word1 [note:word1] --> [word1]
word1 word2 [note:word1_word2] --> [word1 word2]
word1 word2 word3 [note:word1_word2_word3] --> [word1 word2 word3]

This pattern continues (word4, word5, ... wordN) with an indeterminate
maximum number of words. I know how to replace each individual case.

For example:

my @data = (
"word1 [note:word1]",
"word1 word2 [note:word1_word2]",
"word1 word2 word3 [note:word1_word2_word3]",
);

for (@data) {
print "$_\n";
s/(\w+)\s\[note

\1)\]/[$1]/g;
s/(\w+)\s(\w+)\s\[note

\1)_(\2)\]/[$1 $2]/g;
s/(\w+)\s(\w+)\s(\w+)\s\[note

\1)_(\2)_(\3)\]/[$1 $2 $3]/g;
print "$_\n\n";
}

produces:

word1 [note:word1]
[word1]

word1 word2 [note:word1_word2]
[word1 word2]

word1 word2 word3 [note:word1_word2_word3]
[word1 word2 word3]

I can obviously keep adding additional substitution lines until I have
up to the wordN case. I would have to guess at N and keep adding if I
find larger cases.

However, I was wondering if there was anyway to simplify this so that I
can use a single (or smaller set of) substitution. If it wasn't for the
underscores, I could use a larger grouping (example:

s/((\w+\s)*\w+)\s\[note

\1)\]/[$1]/g;

), but I don't see how to do it with the underscores. I cannot simply
strip the underscores out prior to doing the substitution as any cases
without the repeating words need to be left untouched.

Any ideas? Am I missing something obvious?

Thanks,

- W. Citoan

Ian Wilson · Jun 15, 2007

W. Citoan said:
I need to reformat portions of text. The patterns I am matching and
their replacements are as follows:

word1 [note:word1] --> [word1]
word1 word2 [note:word1_word2] --> [word1 word2]
word1 word2 word3 [note:word1_word2_word3] --> [word1 word2 word3]
....
I was wondering if there was anyway to ....
use a single (or smaller set of) substitution.

C:\temp>type citoan.txt
word1 [note:word1]
word1 word2 [note:word1_word2]
word1 word2 word3 [note:word1_word2_word3]

C:\temp>perl -p -e "s/^([^[]*) \[.*$/[$1]/" citoan.txt
[word1]
[word1 word2]
[word1 word2 word3]

You don't say why the above, obvious, solution would not apply. I
suspect you may not have fully stated the problem you are trying to solve.

For example, does the data contain lines like
word1 word2 [note

ther_thing]
word1 word2 [notanote:word1_word2]
which should be left unchanged?

W. Citoan · Jun 15, 2007

Purl said:
W. Citoan said:

I need to reformat portions of text. The patterns I am matching and
their replacements are as follows:

Click to expand...

word1 [note:word1] --> [word1]
word1 word2 [note:word1_word2] --> [word1 word2]
word1 word2 word3 [note:word1_word2_word3] --> [word1 word2 word3]

Click to expand...

print "[", substr ($_, 0, index ($_, "[") - 1), "]\n\n";

I appreciate the help, but this doesn't work.

1) These patterns occur in larger blocks of text and I need everything
else to remain unchanged. I apologize for not making that clear. In
the case of "keep word1 [note:word1] keep", I need "keep [word1] keep"
and not the "[keep]" your suggestion provides.

2) Your suggestion doesn't actually look for a pattern and produces
false matches. Example: "word1 word2 [note:word1_NOTMATCH]" becomes
"[word1 word2]".

Thanks,

- W. Citoan

Tad McClellan · Jun 15, 2007

W. Citoan said:
I need to reformat portions of text. The patterns I am matching and
their replacements are as follows:

word1 [note:word1] --> [word1]
word1 word2 [note:word1_word2] --> [word1 word2]
word1 word2 word3 [note:word1_word2_word3] --> [word1 word2 word3]

However, I was wondering if there was anyway to simplify this so that I
can use a single (or smaller set of) substitution. If it wasn't for the
underscores, I could use a larger grouping

So you need a bit of code to insert the underscores before matching:

s#(.*) \[note

??{ (local $_=$1)=~tr/ /_/; $_})]#[$1]#;

or if you hope for people to be able to read it more easily:

s#(.*)[ ]\[note:
(??{
(local $_ = $1) =~ tr/ /_/;
$_
})
]
#[$1]#x;

John W. Krahn · Jun 15, 2007

W. Citoan said:
I need to reformat portions of text. The patterns I am matching and
their replacements are as follows:

word1 [note:word1] --> [word1]
word1 word2 [note:word1_word2] --> [word1 word2]
word1 word2 word3 [note:word1_word2_word3] --> [word1 word2 word3]

This pattern continues (word4, word5, ... wordN) with an indeterminate
maximum number of words. I know how to replace each individual case.

For example:

my @data = (
"word1 [note:word1]",
"word1 word2 [note:word1_word2]",
"word1 word2 word3 [note:word1_word2_word3]",
);

for (@data) {
print "$_\n";
s/(\w+)\s\[note\1)\]/[$1]/g;
s/(\w+)\s(\w+)\s\[note\1)_(\2)\]/[$1 $2]/g;
s/(\w+)\s(\w+)\s(\w+)\s\[note\1)_(\2)_(\3)\]/[$1 $2 $3]/g;
print "$_\n\n";
}

produces:

word1 [note:word1]
[word1]

word1 word2 [note:word1_word2]
[word1 word2]

word1 word2 word3 [note:word1_word2_word3]
[word1 word2 word3]

$ perl -le'
my @data = (
"word1 [note:word1]",
"word1 word2 [note:word1_word2]",
"word1 word2 word3 [note:word1_word2_word3]",
);

for ( @data ) {
print;

s/\A\W+//, s/\W+\z//, s/\s+/ /g for my ( $x, $y ) = split /note:/, $_, 2;
$y =~ tr/_/ /;

print "[$x]" if $x eq $y;
}
'
word1 [note:word1]
[word1]
word1 word2 [note:word1_word2]
[word1 word2]
word1 word2 word3 [note:word1_word2_word3]
[word1 word2 word3]

John

Mirco Wahab · Jun 15, 2007

W. Citoan said:
I need to reformat portions of text. The patterns I am matching and
their replacements are as follows:

word1 [note:word1] --> [word1]
word1 word2 [note:word1_word2] --> [word1 word2]
word1 word2 word3 [note:word1_word2_word3] --> [word1 word2 word3]

Any ideas? Am I missing something obvious?

Thats not *that* obvious. After Johns and Tads
working solutions, I'll put another one:

==>

use strict;
use warnings;

my @data = (
'keep word1 [note:word1] as is', # --> [word1]
'word1 word2 [note:word1_word2]', # --> [word1 word2]
'word1 word2 word3 [note:word1_word2_word3]' # --> [word1 word2 word3]
);

print
join "\n",
map {
my ($v) = /(?<=\[note

.+?(?=\])/g; # extract bracketed text
(my $w=$v) =~ y/_/ /; # prepare match in front of [..]
s/(?<=$w\s\[)note:$v(?=\])/$w/; # substitute the brackets if match
} @data;

<==

Regards

Mirco

W. Citoan · Jun 15, 2007

Tad said:
So you need a bit of code to insert the underscores before matching:

s#(.*) \[note??{ (local $_=$1)=~tr/ /_/; $_})]#[$1]#;

That seems ideal for my needs. I knew code could go on the RHS, but I
never realized it could go on the LHS.

I appreciate the suggestions from the others as well.

Thanks,

- W. Citoan

Xicheng Jia · Jun 15, 2007

W. Citoan said:
W. Citoan said:

I need to reformat portions of text. The patterns I am matching and
their replacements are as follows:

Click to expand...

word1 [note:word1] --> [word1]
word1 word2 [note:word1_word2] --> [word1 word2]
word1 word2 word3 [note:word1_word2_word3] --> [word1 word2 word3]

Click to expand...

Any ideas? Am I missing something obvious?

Click to expand...

Thats not *that* obvious. After Johns and Tads
working solutions, I'll put another one:

==>

use strict;
use warnings;

my @data = (
'keep word1 [note:word1] as is', # --> [word1]
'word1 word2 [note:word1_word2]', # --> [word1 word2]
'word1 word2 word3 [note:word1_word2_word3]' # --> [word1 word2 word3]
);

print
join "\n",
map {
my ($v) = /(?<=\[note.+?(?=\])/g; # extract bracketed text
(my $w=$v) =~ y/_/ /; # prepare match in front of [..]
s/(?<=$w\s\[)note:$v(?=\])/$w/; # substitute the brackets if match
} @data;

<==

the following might be easier to read:

echo 'word1 [note:word1]
word1 word2 [note:word1_word2] site site site
site site word1 word2 word3 [note:word1_word2_word3] site
word1 word2 [note:not_word] site
' | perl -wpe '
if (/(\s\[note

.*?)\])/) {
my $x = $1; (my $y = $2) =~ tr/_/ /;
s/\Q$y$x/[note:$y]/;
}
'
[note:word1]
[note:word1 word2] site site site
site site [note:word1 word2 word3] site
word1 word2 [note:not_word] site

Regards,
Xicheng

Crossword	2	May 11, 2020
Crossword	14	May 13, 2020
Regex: deleting non-matching words	3	Aug 22, 2010
Searching for the first komma in a sentence	4	Aug 15, 2009
reg expression example...	7	Aug 29, 2010
looking for a better regexp	6	Nov 26, 2004
How to unencode a string	5	Aug 27, 2009
How to separate words by using tab in Java ?	5	Oct 21, 2008

Simplify Variable Number of Regex Groups

W. Citoan

Ian Wilson

W. Citoan

Tad McClellan

John W. Krahn

Mirco Wahab

W. Citoan

Xicheng Jia

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads