Simplify Variable Number of Regex Groups

W

W. Citoan

I need to reformat portions of text. The patterns I am matching and
their replacements are as follows:

word1 [note:word1] --> [word1]
word1 word2 [note:word1_word2] --> [word1 word2]
word1 word2 word3 [note:word1_word2_word3] --> [word1 word2 word3]

This pattern continues (word4, word5, ... wordN) with an indeterminate
maximum number of words. I know how to replace each individual case.

For example:

my @data = (
"word1 [note:word1]",
"word1 word2 [note:word1_word2]",
"word1 word2 word3 [note:word1_word2_word3]",
);

for (@data) {
print "$_\n";
s/(\w+)\s\[note:(\1)\]/[$1]/g;
s/(\w+)\s(\w+)\s\[note:(\1)_(\2)\]/[$1 $2]/g;
s/(\w+)\s(\w+)\s(\w+)\s\[note:(\1)_(\2)_(\3)\]/[$1 $2 $3]/g;
print "$_\n\n";
}

produces:

word1 [note:word1]
[word1]

word1 word2 [note:word1_word2]
[word1 word2]

word1 word2 word3 [note:word1_word2_word3]
[word1 word2 word3]

I can obviously keep adding additional substitution lines until I have
up to the wordN case. I would have to guess at N and keep adding if I
find larger cases.

However, I was wondering if there was anyway to simplify this so that I
can use a single (or smaller set of) substitution. If it wasn't for the
underscores, I could use a larger grouping (example:

s/((\w+\s)*\w+)\s\[note:(\1)\]/[$1]/g;

), but I don't see how to do it with the underscores. I cannot simply
strip the underscores out prior to doing the substitution as any cases
without the repeating words need to be left untouched.

Any ideas? Am I missing something obvious?

Thanks,

- W. Citoan
 
I

Ian Wilson

W. Citoan said:
I need to reformat portions of text. The patterns I am matching and
their replacements are as follows:

word1 [note:word1] --> [word1]
word1 word2 [note:word1_word2] --> [word1 word2]
word1 word2 word3 [note:word1_word2_word3] --> [word1 word2 word3]
....
I was wondering if there was anyway to ....
use a single (or smaller set of) substitution.

C:\temp>type citoan.txt
word1 [note:word1]
word1 word2 [note:word1_word2]
word1 word2 word3 [note:word1_word2_word3]

C:\temp>perl -p -e "s/^([^[]*) \[.*$/[$1]/" citoan.txt
[word1]
[word1 word2]
[word1 word2 word3]


You don't say why the above, obvious, solution would not apply. I
suspect you may not have fully stated the problem you are trying to solve.

For example, does the data contain lines like
word1 word2 [note:eek:ther_thing]
word1 word2 [notanote:word1_word2]
which should be left unchanged?
 
W

W. Citoan

Purl said:
W. Citoan said:
I need to reformat portions of text. The patterns I am matching and
their replacements are as follows:
word1 [note:word1] --> [word1]
word1 word2 [note:word1_word2] --> [word1 word2]
word1 word2 word3 [note:word1_word2_word3] --> [word1 word2 word3]
print "[", substr ($_, 0, index ($_, "[") - 1), "]\n\n";

I appreciate the help, but this doesn't work.

1) These patterns occur in larger blocks of text and I need everything
else to remain unchanged. I apologize for not making that clear. In
the case of "keep word1 [note:word1] keep", I need "keep [word1] keep"
and not the "[keep]" your suggestion provides.

2) Your suggestion doesn't actually look for a pattern and produces
false matches. Example: "word1 word2 [note:word1_NOTMATCH]" becomes
"[word1 word2]".

Thanks,

- W. Citoan
 
T

Tad McClellan

W. Citoan said:
I need to reformat portions of text. The patterns I am matching and
their replacements are as follows:

word1 [note:word1] --> [word1]
word1 word2 [note:word1_word2] --> [word1 word2]
word1 word2 word3 [note:word1_word2_word3] --> [word1 word2 word3]
However, I was wondering if there was anyway to simplify this so that I
can use a single (or smaller set of) substitution. If it wasn't for the
underscores, I could use a larger grouping


So you need a bit of code to insert the underscores before matching:

s#(.*) \[note:(??{ (local $_=$1)=~tr/ /_/; $_})]#[$1]#;

or if you hope for people to be able to read it more easily:

s#(.*)[ ]\[note:
(??{
(local $_ = $1) =~ tr/ /_/;
$_
})
]
#[$1]#x;
 
J

John W. Krahn

W. Citoan said:
I need to reformat portions of text. The patterns I am matching and
their replacements are as follows:

word1 [note:word1] --> [word1]
word1 word2 [note:word1_word2] --> [word1 word2]
word1 word2 word3 [note:word1_word2_word3] --> [word1 word2 word3]

This pattern continues (word4, word5, ... wordN) with an indeterminate
maximum number of words. I know how to replace each individual case.

For example:

my @data = (
"word1 [note:word1]",
"word1 word2 [note:word1_word2]",
"word1 word2 word3 [note:word1_word2_word3]",
);

for (@data) {
print "$_\n";
s/(\w+)\s\[note:(\1)\]/[$1]/g;
s/(\w+)\s(\w+)\s\[note:(\1)_(\2)\]/[$1 $2]/g;
s/(\w+)\s(\w+)\s(\w+)\s\[note:(\1)_(\2)_(\3)\]/[$1 $2 $3]/g;
print "$_\n\n";
}

produces:

word1 [note:word1]
[word1]

word1 word2 [note:word1_word2]
[word1 word2]

word1 word2 word3 [note:word1_word2_word3]
[word1 word2 word3]

$ perl -le'
my @data = (
"word1 [note:word1]",
"word1 word2 [note:word1_word2]",
"word1 word2 word3 [note:word1_word2_word3]",
);

for ( @data ) {
print;

s/\A\W+//, s/\W+\z//, s/\s+/ /g for my ( $x, $y ) = split /note:/, $_, 2;
$y =~ tr/_/ /;

print "[$x]" if $x eq $y;
}
'
word1 [note:word1]
[word1]
word1 word2 [note:word1_word2]
[word1 word2]
word1 word2 word3 [note:word1_word2_word3]
[word1 word2 word3]




John
 
M

Mirco Wahab

W. Citoan said:
I need to reformat portions of text. The patterns I am matching and
their replacements are as follows:

word1 [note:word1] --> [word1]
word1 word2 [note:word1_word2] --> [word1 word2]
word1 word2 word3 [note:word1_word2_word3] --> [word1 word2 word3]

Any ideas? Am I missing something obvious?

Thats not *that* obvious. After Johns and Tads
working solutions, I'll put another one:

==>

use strict;
use warnings;

my @data = (
'keep word1 [note:word1] as is', # --> [word1]
'word1 word2 [note:word1_word2]', # --> [word1 word2]
'word1 word2 word3 [note:word1_word2_word3]' # --> [word1 word2 word3]
);

print
join "\n",
map {
my ($v) = /(?<=\[note:).+?(?=\])/g; # extract bracketed text
(my $w=$v) =~ y/_/ /; # prepare match in front of [..]
s/(?<=$w\s\[)note:$v(?=\])/$w/; # substitute the brackets if match
} @data;


<==

Regards

Mirco
 
W

W. Citoan

Tad said:
So you need a bit of code to insert the underscores before matching:

s#(.*) \[note:(??{ (local $_=$1)=~tr/ /_/; $_})]#[$1]#;

That seems ideal for my needs. I knew code could go on the RHS, but I
never realized it could go on the LHS.

I appreciate the suggestions from the others as well.

Thanks,

- W. Citoan
 
X

Xicheng Jia

W. Citoan said:
I need to reformat portions of text. The patterns I am matching and
their replacements are as follows:
word1 [note:word1] --> [word1]
word1 word2 [note:word1_word2] --> [word1 word2]
word1 word2 word3 [note:word1_word2_word3] --> [word1 word2 word3]
Any ideas? Am I missing something obvious?

Thats not *that* obvious. After Johns and Tads
working solutions, I'll put another one:

==>

use strict;
use warnings;

my @data = (
'keep word1 [note:word1] as is', # --> [word1]
'word1 word2 [note:word1_word2]', # --> [word1 word2]
'word1 word2 word3 [note:word1_word2_word3]' # --> [word1 word2 word3]
);

print
join "\n",
map {
my ($v) = /(?<=\[note:).+?(?=\])/g; # extract bracketed text
(my $w=$v) =~ y/_/ /; # prepare match in front of [..]
s/(?<=$w\s\[)note:$v(?=\])/$w/; # substitute the brackets if match
} @data;

<==

the following might be easier to read:

echo 'word1 [note:word1]
word1 word2 [note:word1_word2] site site site
site site word1 word2 word3 [note:word1_word2_word3] site
word1 word2 [note:not_word] site
' | perl -wpe '
if (/(\s\[note:(.*?)\])/) {
my $x = $1; (my $y = $2) =~ tr/_/ /;
s/\Q$y$x/[note:$y]/;
}
'
[note:word1]
[note:word1 word2] site site site
site site [note:word1 word2 word3] site
word1 word2 [note:not_word] site


Regards,
Xicheng
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads


Members online

No members online now.

Forum statistics

Threads
473,770
Messages
2,569,583
Members
45,073
Latest member
DarinCeden

Latest Threads

Top