look-ahead search for overlapping

Huub

Hi,

I'm trying to realize this with a reg.exp.:

this is a test for fun -> this is a is a test a test for test for fun for fun

I've tried reg.exp. like this:

s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g

but then it looks for letters and I get this:

(thi)(his)is is a (tes)(est)st (for)or (fun)un

I also tried \w\s, \w+\b, \w+?\b, \w\t etc. Where do I go wrong?

Thanks

Huub
 
Huub

In the description of what it is you want to achieve.
Ok, 1 single thing 1st: I want to search for a word of unknown length.
Using \w\b, it looks for a word-character and word-boundary. A
word-boundary is not the same as 'white space', right? Since \s is white
space. Then what's a word-boundary?
 
A. Sinan Unur

Ok, 1 single thing 1st: I want to search for a word of unknown length.
Using \w\b, it looks for a word-character and word-boundary. A
word-boundary is not the same as 'white space', right? Since \s is
white space. Then what's a word-boundary?

perldoc perlre

A word boundary ("\b") is a spot between two characters that has a "\w"
on one side of it and a "\W" on the other side of it (in either order),
counting the imaginary characters off the beginning and end of the
string as matching a "\W".

Do read the documentation. Do not consider this group a "read the
documentation for me" service.

Sinan
 
David K. Wall

Huub said:
Ok, 1 single thing 1st: I want to search for a word of unknown
length. Using \w\b, it looks for a word-character and
word-boundary. A word-boundary is not the same as 'white space',
right? Since \s is white space. Then what's a word-boundary?

When in doubt, consult the documentation.

perldoc perlre

A word boundary ("\b") is a spot between two characters
that has a "\w" on one side of it and a "\W" on the
other side of it (in either order), counting the
imaginary characters off the beginning and end of the
string as matching a "\W". (Within character classes
"\b" represents backspace rather than a word boundary,
just as it normally does in any double-quoted string.)
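
To make the two meanings of \b in that passage concrete, here is a minimal sketch. The helper name boundary_positions and the sample string are assumptions made for illustration; they do not come from the thread. It lists the offsets where \b matches, then shows that inside a character class \b only matches a backspace character.

```perl
use strict;
use warnings;

# Hypothetical helper: return every offset at which the zero-width
# assertion \b matches in the given string.
sub boundary_positions {
    my ($text) = @_;
    my @offsets;
    while ($text =~ /\b/g) {
        push @offsets, pos($text);
    }
    return @offsets;
}

my $sample = "is a";
print join(",", boundary_positions($sample)), "\n";

# Inside a character class, \b stands for the backspace character (\x08),
# so a space does not match but a literal backspace does.
print "space matches [\\b]: ",     (" "    =~ /[\b]/ ? "yes" : "no"), "\n";
print "backspace matches [\\b]: ", ("\x08" =~ /[\b]/ ? "yes" : "no"), "\n";
```

For "is a" this prints 0,2,3,4: the start of the string, both sides of the space, and the end of the string. That is why [\w\b] cannot be used to mean "word character or word boundary".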
 
Paul Lalli

Huub said:
Ok, 1 single thing 1st: I want to search for a word of unknown length.

\w+

I have no idea how this desire relates to the code you posted above.

Have you read the Posting Guidelines for this group? Please post some
sample input along with the output you want to achieve.

Paul Lalli
 
Huub

Paul said:
\w+

I have no idea how this desire relates to the code you posted above.

Have you read the Posting Guidelines for this group? Please post some
sample input along with the output you want to achieve.

Paul Lalli

Please read my o.p. because I did.
Codesample: s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g
Input: this is a test for fun
Desired output: this is a is a test a test for test for fun for fun
 
Paul Lalli

Please read my o.p. because I did.
Codesample: s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g
Input: this is a test for fun
Desired output: this is a is a test a test for test for fun for fun

Apologies. I did not realize that random string of words represented
both your input and output.

I still, however, don't understand what you're trying to do. In
precisely what manner does the output relate to the input? It looks
like your output has random pieces of the input interspersed into the
input itself. You need to define how that output is generated.

Paul Lalli
 
Huub

Codesample: s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g
Apologies. I did not realize that random string of words represented
both your input and output.

I still, however, don't understand what you're trying to do. In
precisely what manner does the output relate to the input? It looks
like your output has random pieces of the input interspersed into the
input itself. You need to define how that output is generated.

Paul Lalli

What I'm trying to do is read 3 words, print the 3 words, lose the 1st
word, read the 4th word, print the 3 words, lose the new 1st word, read
the new 4th word, print the new 3 words, etc. What the script does is
basically the same, but for letters. So far I can't figure out how to do
it with words.
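
The procedure described here is a sliding window over words, and it can be sketched without a regex at all. The helper name word_windows and the choice to split on whitespace are assumptions for illustration. This version stops at the last full window of three words, so it does not emit the trailing "for fun" that appears in the desired output.

```perl
use strict;
use warnings;

# Hypothetical helper: slide a window of $size words over the text,
# one word at a time, and return each full window as a string.
sub word_windows {
    my ($text, $size) = @_;
    my @words = split ' ', $text;
    my @windows;
    for my $start (0 .. $#words - $size + 1) {
        push @windows, join ' ', @words[$start .. $start + $size - 1];
    }
    return @windows;
}

print join(' ', word_windows('this is a test for fun', 3)), "\n";
```

For the sample input this prints "this is a is a test a test for test for fun".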
 
Babacio

Huub said:
Please read my o.p. because I did.
Codesample: s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g

This is not correct. There is an extra \ before your first ].
Obviously, removing it does not make this regexp do what you want.
As general advice, you should copy/paste code instead of retyping it.
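
As a quick check of that point, here is a sketch of the substitution with the stray backslash removed. Writing $1 instead of \1 in the replacement is an adjustment made here so the code runs cleanly under warnings; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = 'this is a test for fun';

# Inside [...] the \b is a backspace, so [\w\b]{3} effectively means
# "three word characters in a row", which is why letters get grouped.
(my $result = $text) =~ s/(?=([\w\b]{3}))[\w\b]{1}/($1)/g;

print "$result\n";
```

This prints "(thi)(his)is is a (tes)(est)st (for)or (fun)un", the same letter-based result reported in the original post.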
 
Paul Lalli

Huub said:
Codesample: s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g
Input: this is a test for fun
Desired output: this is a is a test a test for test for fun for fun
I still, however, don't understand what you're trying to do. In
precisely what manner does the output relate to the input? It looks
like your output has random pieces of the input interspersed into the
input itself. You need to define how that output is generated.
What I'm trying to do is read 3 words, print the 3 words, lose the 1st
word, read the 4th word, print the 3 words, lose the new 1st word, read
the new 4th word, print the new 3 words, etc. What the script does is
basically the same, but for letters. So far I can't figure out how to do
it with words.

Ahh, okay, now we're getting somewhere.
$ perl -le'$_ = q{this is a test for fun};
s/(\w+\W+)(?=((?:\w+(?:\W+|$)){2}))/$1$2/g; print;'
this is a is a test a test for test for funfor fun
$

"Search for (a word, and non-word characters) that are followed by two
instances of (a word, and (non-word characters or the end-of-string)).
Replace whatever we matched (ie, the first word and non-word
characters) with both the word-and-nonword we matched, and the
word-and-nonword's we peeked ahead into."

The lack of a space after the second-to-last 'fun' is due to the lack
of a space after the word 'fun' in the original string, and is
consistent with your description. (Your sample output is not).

Paul Lalli
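
For readers who find the one-liner dense, the same substitution can be laid out with the /x modifier. This is only a reformatting sketch of the regex posted above, with illustrative variable names. The comments deliberately avoid $ and brace characters, since those would still be interpreted inside the pattern.

```perl
use strict;
use warnings;

my $text = 'this is a test for fun';

(my $result = $text) =~ s{
    ( \w+ \W+ )          # capture 1: a word plus the separator after it
    (?=                  # look ahead, consuming nothing
        (                # capture 2: the next two words
            (?: \w+ (?: \W+ | $) ){2}
        )
    )
}{$1$2}gx;

print "$result\n";
```

Because the lookahead consumes nothing, the next match starts at the very next word, which is what produces the overlapping groups. The output is "this is a is a test a test for test for funfor fun", identical to the one-liner.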
 
Huub

Ok, thank you. Maybe you can tell me what a non-"word" character is?
Characters like !,@,#,$,% ?
 
Paul Lalli

Huub said:
Ok, thank you. Maybe you can tell me what a non-"word" character is?
Characters like !,@,#,$,% ?

from perldoc perlre (which I believe you said you were reading):
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-"word" character

So if \w matches anything that's alphabetic, numeric, or _ (ie,
[a-zA-Z_]), then \W matches anything that's NOT alphabetic, numeric, or
_ (ie, [^a-zA-Z_]).

Paul Lalli
 
Paul Lalli

Paul said:
Huub said:
Ok, thank you. Maybe you can tell me what a non-"word" character is?
Characters like !,@,#,$,% ?

from perldoc perlre (which I believe you said you were reading):
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-"word" character

So if \w matches anything that's alphabetic, numeric, or _ (ie,
[a-zA-Z_]), then \W matches anything that's NOT alphabetic, numeric, or
_ (ie, [^a-zA-Z_]).

Arg. Those should, of course, be:
[a-zA-Z0-9_] and [^a-zA-Z0-9_], respectively.

Paul Lalli
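
A small sketch that answers the question about !, @, #, $ and % directly. The helper name word_class and the list of sample characters are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: label a single character as \w or \W.
sub word_class {
    my ($char) = @_;
    return $char =~ /\A\w\z/ ? '\w' : '\W';
}

for my $char ('a', 'Z', '7', '_', '!', '@', '#', '$', '%', ' ') {
    printf "%-4s %s\n", "'$char'", word_class($char);
}
```

The letters, the digit and the underscore come out as \w; the punctuation characters and the space come out as \W.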
 
Dr.Ruud

Paul Lalli:
So if \w matches anything that's alphabetic, numeric, or _ (ie,
[a-zA-Z_]), then \W matches anything that's NOT alphabetic, numeric,
or _ (ie, [^a-zA-Z_]).

Not all alphabets are limited to [A-Za-z].
 
Paul Lalli

Dr.Ruud said:
Paul Lalli:
So if \w matches anything that's alphabetic, numeric, or _ (ie,
[a-zA-Z_]), then \W matches anything that's NOT alphabetic, numeric,
or _ (ie, [^a-zA-Z_]).

Not all alphabets are limited to [A-Za-z].

True. I should have specified "assuming 'use locale;' is not in
effect"

Paul Lalli
 
Dr.Ruud

Paul Lalli:
Dr.Ruud:
Paul Lalli:
So if \w matches anything that's alphabetic, numeric, or _ (ie,
[a-zA-Z_]), then \W matches anything that's NOT alphabetic, numeric,
or _ (ie, [^a-zA-Z_]).

Not all alphabets are limited to [A-Za-z].

True. I should have specified "assuming 'use locale;' is not in
effect"

Xor 'use utf8;' ("Use of locales with Unicode is discouraged.").

Or an I/O layer. (encoding pragma)
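
To illustrate that \w is not always limited to [a-zA-Z0-9_], here is a minimal sketch using a character above code point 255, where Perl applies Unicode rules regardless of locale. The /a modifier, which restricts \w to ASCII, is an addition for contrast and assumes Perl 5.14 or later; it is not mentioned in the thread.

```perl
use 5.014;
use strict;
use warnings;

# Greek small letter alpha, written as an escape so the example does not
# depend on the encoding of the source file.
my $alpha = "\x{3b1}";

my $unicode_rules = $alpha =~ /\A\w\z/  ? 1 : 0;   # Unicode rules apply
my $ascii_rules   = $alpha =~ /\A\w\z/a ? 1 : 0;   # /a limits \w to ASCII

print "matches \\w: $unicode_rules\n";
print "matches \\w under /a: $ascii_rules\n";
```

The first line reports 1 and the second reports 0, so a pattern built on \w+ and \W+ will treat such letters as part of a word unless /a is in effect.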
 
John W. Krahn

Huub said:
Codesample: s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g
Input: this is a test for fun
Desired output: this is a is a test a test for test for fun for fun


Apologies. I did not realize that random string of words represented
both your input and output.

I still, however, don't understand what you're trying to do. In
precisely what manner does the output relate to the input? It looks
like your output has random pieces of the input interspersed into the
input itself. You need to define how that output is generated.

Paul Lalli

What I'm trying to do is read 3 words, print the 3 words, lose the 1st
word, read the 4th word, print the 3 words, lose the new 1st word, read
the new 4th word, print the new 3 words, etc. What the script does is
basically the same, but for letters. So far I can't figure out how to do
it with words.

$ perl -le'
$_ = q/this is a test for fun/;
print;
s/(\w+)(?=(\W+\w+\W+\w+))/$1$2/g;
print;
'
this is a test for fun
this is a is a test a test for test for fun for fun



John
 
Dr.Ruud

John W. Krahn:
Huub:

s/(\w+)(?=(\W+\w+\W+\w+))/$1$2/g;

Nice translation!


An extra dying echo:

$ perl -le'
$_ = q/this is a test for fun/;
print;
s/(\w+)(?=((?:\W+\w+){1,2}))/$1$2/g;
print;
'
this is a test for fun
this is a is a test a test for test for fun for fun fun
 
