look-ahead search for overlapping

Huub · Oct 3, 2005

Hi,

I'm trying to realize this with a reg.exp.:

this is a test for fun -> this is a is a test a test for test for fun for fun

I've tried reg.exp. like this:

s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)\g

but then it looks for letters and I get this:

(thi)(his)is is a (tes)(est)st (for)or (fun)un

I also tried \w\s, \w+\b, \w+?\b, \w\t etc. Where do I go wrong?

Thanks

Huub

Gunnar Hjalmarsson · Oct 3, 2005

Huub said:
I'm trying to realize this with a reg.exp.:

Where do I go wrong?

In the description of what it is you want to achieve.

Huub · Oct 3, 2005

Gunnar Hjalmarsson said:
In the description of what it is you want to achieve.

Ok, 1 single thing 1st: I want to search for a word of unknown length.
Using \w\b, it looks for a word-character and word-boundary. A
word-boundary is not the same as 'white space', right? Since \s is white
space. Then what's a word-boundary?

A. Sinan Unur · Oct 3, 2005

Huub said:
Ok, 1 single thing 1st: I want to search for a word of unknown length.
Using \w\b, it looks for a word-character and word-boundary. A
word-boundary is not the same as 'white space', right? Since \s is
white space. Then what's a word-boundary?

perldoc perlre

A word boundary ("\b") is a spot between two characters that has a "\w"
on one side of it and a "\W" on the other side of it (in either order),
counting the imaginary characters off the beginning and end of the
string as matching a "\W".

Do read the documentation. Do not consider this group a "read the
documentation for me" service.

Sinan

David K. Wall · Oct 3, 2005

Huub said:
Ok, 1 single thing 1st: I want to search for a word of unknown
length. Using \w\b, it looks for a word-character and
word-boundary. A word-boundary is not the same as 'white space',
right? Since \s is white space. Then what's a word-boundary?

When in doubt, consult the documentation.

perldoc perlre

A word boundary ("\b") is a spot between two characters
that has a "\w" on one side of it and a "\W" on the
other side of it (in either order), counting the
imaginary characters off the beginning and end of the
string as matching a "\W". (Within character classes
"\b" represents backspace rather than a word boundary,
just as it normally does in any double-quoted string.)
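
The quoted definition can be checked with a short script. This is only a sketch: the sample strings and variable names are made up for illustration and do not come from the thread. It lists the offsets where \b matches, shows that punctuation creates a boundary just as white space does, and shows that [\b] inside a character class means backspace.

```perl
use strict;
use warnings;

my $text = 'this is a test';

# \b is zero-width: record each offset where it matches.
my @boundaries;
while ($text =~ /\b/g) {
    push @boundaries, pos($text);
}
print "@boundaries\n";    # 0 4 5 7 8 9 10 14

# A boundary needs \w on one side and \W (or a string edge) on the other,
# so punctuation works as well as white space.
my $at_punctuation = 'fun!' =~ /fun\b/ ? 1 : 0;

# Inside a character class, \b is the backspace character instead.
my $class_is_backspace = "\x08" =~ /^[\b]$/ ? 1 : 0;
my $class_is_boundary  = 'a b'  =~ /[\b]/   ? 1 : 0;

print "$at_punctuation $class_is_backspace $class_is_boundary\n";    # 1 1 0
```

So \b never consumes a character, which is why \w\b cannot stand in for "a word followed by a space".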

Paul Lalli · Oct 3, 2005

Huub said:
Ok, 1 single thing 1st: I want to search for a word of unknown length.

\w+

I have no idea how this desire relates to the code you posted above.

Have you read the Posting Guidelines for this group? Please post some
sample input along with the output you want to achieve.

Paul Lalli

Huub · Oct 3, 2005

A. Sinan Unur said:
Do read the documentation. Do not consider this group a "read the
documentation for me" service.

I have been reading the docs on
http://search.cpan.org/dist/perl/pod/perlre.pod. I just can't figure it
out, so I thought someone might give a hint.

Huub · Oct 3, 2005

Paul said:
\w+

I have no idea how this desire relates to the code you posted above.

Have you read the Posting Guidelines for this group? Please post some
sample input along with the output you want to achieve.

Paul Lalli

Please read my o.p. because I did.
Codesample: s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g
Input: this is a test for fun
Desired output: this is a is a test a test for test for fun for fun

Paul Lalli · Oct 3, 2005

Huub said:
Please read my o.p. because I did.
Codesample: s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g
Input: this is a test for fun
Desired output: this is a is a test a test for test for fun for fun

Apologies. I did not realize that random string of words represented
both your input and output.

I still, however, don't understand what you're trying to do. In
precisely what manner does the output relate to the input? It looks
like your output has random pieces of the input interspersed into the
input itself. You need to define how that output is generated.

Paul Lalli

Huub · Oct 3, 2005

Paul Lalli said:
Huub said:
Codesample: s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g

Apologies. I did not realize that random string of words represented
both your input and output.

I still, however, don't understand what you're trying to do. In
precisely what manner does the output relate to the input? It looks
like your output has random pieces of the input interspersed into the
input itself. You need to define how that output is generated.

Paul Lalli

What I'm trying to do is read 3 words, print the 3 words, lose the 1st
word, read the 4th word, print the 3 words, lose the new 1st word, read
the new 4th word, print the new 3 words, etc. What the script does is
basically the same, but for letters. So far I can't figure out how to do
it with words.

Babacio · Oct 3, 2005

Huub said:
Please read my o.p. because I did.
Codesample: s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g

This is not correct. There is an extra \ before your first ].
Obviously, removing it does not make this regexp do what you want.
As general advice, you should copy/paste code instead of retyping it.
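
For reference, here is a sketch of what the pattern does once the stray backslash is removed. It assumes that the corrected form below is what was actually run, which the thread does not confirm, and it uses $1 rather than \1 in the replacement. It reproduces the letter-by-letter output from the original post.

```perl
use strict;
use warnings;

my $text = 'this is a test for fun';

# Assumed intent of the posted pattern, with the extra backslash removed.
# Inside [...] \b means backspace, so [\w\b] is "word character or backspace".
(my $by_letters = $text) =~ s/(?=([\w\b]{3}))[\w\b]{1}/($1)/g;

print "$by_letters\n";    # (thi)(his)is is a (tes)(est)st (for)or (fun)un
```

The look-ahead captures three characters at a time, so the window slides over letters rather than words.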

Paul Lalli · Oct 3, 2005

Huub said:
Codesample: s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g
Input: this is a test for fun
Desired output: this is a is a test a test for test for fun for fun

I still, however, don't understand what you're trying to do. In
precisely what manner does the output relate to the input? It looks
like your output has random pieces of the input interspersed into the
input itself. You need to define how that output is generated.

What I'm trying to do is read 3 words, print the 3 words, lose the 1st
word, read the 4th word, print the 3 words, lose the new 1st word, read
the new 4th word, print the new 3 words, etc. What the script does is
basically the same, but for letters. So far I can't figure out how to do
it with words.

Ahh, okay, now we're getting somewhere.
$ perl -le'$_ = q{this is a test for fun};
s/(\w+\W+)(?=((?:\w+(?:\W+|$)){2}))/$1$2/g; print;'
this is a is a test a test for test for funfor fun
$

"Search for (a word, and non-word characters) that are followed by two
instances of (a word, and (non-word characters or the end-of-string)).
Replace whatever we matched (ie, the first word and non-word
characters) with both the word-and-nonword we matched, and the
word-and-nonword's we peeked ahead into."

The lack of a space after the second-to-last 'fun' is due to the lack
of a space after the word 'fun' in the original string, and is
consistent with your description. (Your sample output is not).

Paul Lalli
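
The same substitution can be spelled out with the /x modifier so that each part of the description lines up with a piece of the pattern. This is a sketch of the one-liner above in script form; the variable names are illustrative.

```perl
use strict;
use warnings;

my $text = 'this is a test for fun';

(my $out = $text) =~ s{
    ( \w+ \W+ )              # group 1: one word plus the separator after it
    (?=                      # look ahead without consuming anything
        (                    # group 2: the next two words
            (?: \w+ (?: \W+ | $ ) ){2}
        )
    )
}{$1$2}gx;

print "$out\n";    # this is a is a test a test for test for funfor fun
```

Because the look-ahead consumes nothing, the next match starts at the following word, which is what makes the windows overlap.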

Huub · Oct 3, 2005

Paul said:
Huub said:

Codesample: s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g
Input: this is a test for fun
Desired output: this is a is a test a test for test for fun for fun

I still, however, don't understand what you're trying to do. In
precisely what manner does the output relate to the input? It looks
like your output has random pieces of the input interspersed into the
input itself. You need to define how that output is generated.

What I'm trying to do is read 3 words, print the 3 words, lose the 1st
word, read the 4th word, print the 3 words, lose the new 1st word, read
the new 4th word, print the new 3 words, etc. What the script does is
basically the same, but for letters. So far I can't figure out how to do
it with words.

Ahh, okay, now we're getting somewhere.
$ perl -le'$_ = q{this is a test for fun};
s/(\w+\W+)(?=((?:\w+(?:\W+|$)){2}))/$1$2/g; print;'
this is a is a test a test for test for funfor fun
$

"Search for (a word, and non-word characters) that are followed by two
instances of (a word, and (non-word characters or the end-of-string)).
Replace whatever we matched (ie, the first word and non-word
characters) with both the word-and-nonword we matched, and the
word-and-nonword's we peeked ahead into."

The lack of a space after the second-to-last 'fun' is due to the lack
of a space after the word 'fun' in the original string, and is
consistent with your description. (Your sample output is not).

Paul Lalli

Ok, thank you. Maybe you can tell me what a non-"word" character is?
Characters like !,@,#,$,% ?

Paul Lalli · Oct 3, 2005

Huub said:
Ok, thank you. Maybe you can tell me what a non-"word" character is?
Characters like !,@,#,$,% ?

from perldoc perlre (which I believe you said you were reading):
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-"word" character

So if \w matches anything that's alphabetic, numeric, or _ (ie,
[a-zA-Z_]), then \W matches anything that's NOT alphabetic, numeric, or
_ (ie, [^a-zA-Z_]).

Paul Lalli

Paul Lalli · Oct 3, 2005

Paul said:
Huub said:

Ok, thank you. Maybe you can tell me what a non-"word" character is?
Characters like !,@,#,$,% ?

from perldoc perlre (which I believe you said you were reading):
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-"word" character

So if \w matches anything that's alphabetic, numeric, or _ (ie,
[a-zA-Z_]), then \W matches anything that's NOT alphabetic, numeric, or
_ (ie, [^a-zA-Z_]).

Arg. Those should, of course, be:
[a-zA-Z0-9_] and [^a-zA-Z0-9_], respectively.

Paul Lalli

Dr.Ruud · Oct 3, 2005

Paul Lalli:

So if \w matches anything that's alphabetic, numeric, or _ (ie,
[a-zA-Z_]), then \W matches anything that's NOT alphabetic, numeric,
or _ (ie, [^a-zA-Z_]).

Not all alphabets are limited to [A-Za-z].

Paul Lalli · Oct 3, 2005

Dr.Ruud said:
Paul Lalli:

So if \w matches anything that's alphabetic, numeric, or _ (ie,
[a-zA-Z_]), then \W matches anything that's NOT alphabetic, numeric,
or _ (ie, [^a-zA-Z_]).

Not all alphabets are limited to [A-Za-z].

True. I should have specified "assuming 'use locale;' is not in
effect"

Paul Lalli

Dr.Ruud · Oct 3, 2005

Paul Lalli:

Dr.Ruud:

Paul Lalli:

So if \w matches anything that's alphabetic, numeric, or _ (ie,
[a-zA-Z_]), then \W matches anything that's NOT alphabetic, numeric,
or _ (ie, [^a-zA-Z_]).

Not all alphabets are limited to [A-Za-z].

True. I should have specified "assuming 'use locale;' is not in
effect"

Xor 'use utf8;' ("Use of locales with Unicode is discouraged.").

Or an I/O layer. (encoding pragma)
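
A small sketch of the points made in this sub-thread: digits count as word characters, and \w is not limited to ASCII when the string holds Unicode characters. The test character (Greek alpha, written as an escape so the source encoding does not matter) is an arbitrary choice, and the /a modifier used in the last check assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

# Digits are word characters, which the first character class left out.
my $digit_word  = '7' =~ /\w/         ? 1 : 0;
my $digit_short = '7' =~ /[a-zA-Z_]/  ? 1 : 0;

# GREEK SMALL LETTER ALPHA: a letter outside [A-Za-z].
my $alpha = "\x{3b1}";

my $unicode_word = $alpha =~ /\w/            ? 1 : 0;   # Unicode rules apply
my $ascii_class  = $alpha =~ /[a-zA-Z0-9_]/  ? 1 : 0;   # explicit ASCII class
my $ascii_flag   = $alpha =~ /\w/a           ? 1 : 0;   # /a restricts \w to ASCII

print "$digit_word $digit_short\n";                     # 1 0
print "$unicode_word $ascii_class $ascii_flag\n";       # 1 0 0
```

So [a-zA-Z0-9_] is only equivalent to \w for ASCII-only data.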

John W. Krahn · Oct 3, 2005

Huub said:
Codesample: s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g
Input: this is a test for fun
Desired output: this is a is a test a test for test for fun for fun

Apologies. I did not realize that random string of words represented
both your input and output.

I still, however, don't understand what you're trying to do. In
precisely what manner does the output relate to the input? It looks
like your output has random pieces of the input interspersed into the
input itself. You need to define how that output is generated.

Paul Lalli

What I'm trying to do is read 3 words, print the 3 words, lose the 1st
word, read the 4th word, print the 3 words, lose the new 1st word, read
the new 4th word, print the new 3 words, etc. What the script does is
basically the same, but for letters. So far I can't figure out how to do
it with words.

$ perl -le'
$_ = q/this is a test for fun/;
print;
s/(\w+)(?=(\W+\w+\W+\w+))/$1$2/g;
print;
'
this is a test for fun
this is a is a test a test for test for fun for fun

John
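
If a regex is not a requirement, the "read 3 words, drop the 1st, read the next" description can also be followed literally with split and an array slice. This is an alternative sketch, not something proposed in the thread; @words, $size and @windows are illustrative names.

```perl
use strict;
use warnings;

my @words = split ' ', 'this is a test for fun';
my $size  = 3;    # number of words per window

# Slide a window of $size words over the list, one word at a time.
my @windows;
for my $start (0 .. @words - $size) {
    push @windows, join ' ', @words[$start .. $start + $size - 1];
}

print "$_\n" for @windows;
# this is a
# is a test
# a test for
# test for fun
```

Unlike the substitution, this yields only complete three-word windows, so there is no trailing "for fun".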

Dr.Ruud · Oct 3, 2005

John W. Krahn:

Huub:

s/(\w+)(?=(\W+\w+\W+\w+))/$1$2/g;

Nice translation!

An extra dying echo:

$ perl -le'
$_ = q/this is a test for fun/;
print;
s/(\w+)(?=((?:\W+\w+){1,2}))/$1$2/g;
print;
'
this is a test for fun
this is a is a test a test for test for fun for fun fun
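
The fixed-size version generalizes to any window size by interpolating the repeat count into the look-ahead. A minimal sketch: overlapping_words() is a hypothetical helper name, not something from the thread, and it assumes a window size of at least 1.

```perl
use strict;
use warnings;

# Hypothetical helper: repeat each word together with the ($size - 1)
# words that follow it, using the same look-ahead idea as above.
sub overlapping_words {
    my ($text, $size) = @_;
    my $ahead = $size - 1;    # words to peek at after the current one
    (my $out = $text) =~ s/(\w+)(?=((?:\W+\w+){$ahead}))/$1$2/g;
    return $out;
}

print overlapping_words('this is a test for fun', 3), "\n";
# this is a is a test a test for test for fun for fun
print overlapping_words('this is a test for fun', 2), "\n";
# this is is a a test test for for fun fun
```

Words near the end that do not have enough followers are left as they are, which accounts for the trailing "for fun" and "fun".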
