look-ahead search for overlapping

Discussion in 'Perl Misc' started by Huub, Oct 3, 2005.

  1. Huub

    Huub Guest

    Hi,

    I'm trying to realize this with a reg.exp.:

    this is a test for fun -> this is a is a test a test for test for fun
    for fun

    I've tried reg.exp. like this:

    s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g

    but then it looks for letters and I get this:

    (thi)(his)is is a (tes)(est)st (for)or (fun)un

    I also tried \w\s, \w+\b, \w+?\b, \w\t etc. Where do I go wrong?

    Thanks

    Huub
    Huub, Oct 3, 2005
    #1

  2. Huub wrote:
    > I'm trying to realize this with a reg.exp.:


    <various random things snipped>

    > Where do I go wrong?


    In the description of what it is you want to achieve.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Oct 3, 2005
    #2

  3. Huub

    Huub Guest

    > In the description of what it is you want to achieve.
    >


    Ok, 1 single thing 1st: I want to search for a word of unknown length.
    Using \w\b, it looks for a word-character and word-boundary. A
    word-boundary is not the same as 'white space', right? Since \s is white
    space. Then what's a word-boundary?
    Huub, Oct 3, 2005
    #3
  4. Huub <"h.v.niekerk at hccnet.nl"> wrote in
    news:43413188$0$771$:

    >> In the description of what it is you want to achieve.
    >>

    >
    > Ok, 1 single thing 1st: I want to search for a word of unknown length.
    > Using \w\b, it looks for a word-character and word-boundary. A
    > word-boundary is not the same as 'white space', right? Since \s is
    > white space. Then what's a word-boundary?


    perldoc perlre

    A word boundary ("\b") is a spot between two characters that has a "\w"
    on one side of it and a "\W" on the other side of it (in either order),
    counting the imaginary characters off the beginning and end of the
    string as matching a "\W".

    Do read the documentation. Do not consider this group a "read the
    documentation for me" service.

    Sinan

    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
    A. Sinan Unur, Oct 3, 2005
    #4
  5. Huub <"h.v.niekerk at hccnet.nl"> wrote:

    >> In the description of what it is you want to achieve.
    >>

    >
    > Ok, 1 single thing 1st: I want to search for a word of unknown
    > length. Using \w\b, it looks for a word-character and
    > word-boundary. A word-boundary is not the same as 'white space',
    > right? Since \s is white space. Then what's a word-boundary?


    When in doubt, consult the documentation.

    perldoc perlre

    A word boundary ("\b") is a spot between two characters
    that has a "\w" on one side of it and a "\W" on the
    other side of it (in either order), counting the
    imaginary characters off the beginning and end of the
    string as matching a "\W". (Within character classes
    "\b" represents backspace rather than a word boundary,
    just as it normally does in any double-quoted string.)
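
    A minimal sketch of the two meanings of "\b" described above,
    assuming plain ASCII test strings. The variable names are
    illustrative only and do not come from perlre:

```perl
use strict;
use warnings;

my $text = "this is a test";

# Outside a character class, \b is a zero-width word boundary.
my @words = $text =~ /\b(\w+)\b/g;
print scalar(@words), " words: @words\n";

# Inside a character class, \b means the backspace character (0x08),
# so [\w\b] matches a word character or a literal backspace.
my $with_backspace = "ab\x08cd";
my $count = () = $with_backspace =~ /[\b]/g;
print "backspaces found: $count\n";

# A space is neither \w nor a backspace, so [\w\b]{3} cannot span two words.
my $spans_space = ("a b" =~ /[\w\b]{3}/) ? 1 : 0;
print "spans the space: $spans_space\n";
```

    This is why a class such as [\w\b] only ever walks through the
    letters of a single word.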
    David K. Wall, Oct 3, 2005
    #5
  6. Paul Lalli

    Paul Lalli Guest

    Huub wrote:
    > > In the description of what it is you want to achieve.
    > >

    >
    > Ok, 1 single thing 1st: I want to search for a word of unknown length.


    \w+

    I have no idea how this desire relates to the code you posted above.

    Have you read the Posting Guidelines for this group? Please post some
    sample input along with the output you want to achieve.

    Paul Lalli
    Paul Lalli, Oct 3, 2005
    #6
  7. Huub

    Huub Guest


    > Do read the documentation. Do not consider this group a "read the
    > documentation for me" service.


    I have been reading the docs on
    http://search.cpan.org/dist/perl/pod/perlre.pod. I just can't figure it
    out, so I thought someone might give a hint.
    Huub, Oct 3, 2005
    #7
  8. Huub

    Huub Guest

    Paul Lalli wrote:
    > Huub wrote:
    >
    >>>In the description of what it is you want to achieve.
    >>>

    >>
    >>Ok, 1 single thing 1st: I want to search for a word of unknown length.

    >
    >
    > \w+
    >
    > I have no idea how this desire relates to the code you posted above.
    >
    > Have you read the Posting Guidelines for this group? Please post some
    > sample input along with the output you want to achieve.
    >
    > Paul Lalli
    >


    Please read my o.p. because I did.
    Codesample: s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g
    Input: this is a test for fun
    Desired output: this is a is a test a test for test for fun for fun
    Huub, Oct 3, 2005
    #8
  9. Paul Lalli

    Paul Lalli Guest

    Huub wrote:
    > Paul Lalli wrote:
    > > Have you read the Posting Guidelines for this group? Please post some
    > > sample input along with the output you want to achieve.


    > Please read my o.p. because I did.
    > Codesample: s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g
    > Input: this is a test for fun
    > Desired output: this is a is a test a test for test for fun for fun


    Apologies. I did not realize that random string of words represented
    both your input and output.

    I still, however, don't understand what you're trying to do. In
    precisely what manner does the output relate to the input? It looks
    like your output has random pieces of the input interspersed into the
    input itself. You need to define how that output is generated.

    Paul Lalli
    Paul Lalli, Oct 3, 2005
    #9
  10. Huub

    Huub Guest

    >>Codesample: s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g
    >>Input: this is a test for fun
    >>Desired output: this is a is a test a test for test for fun for fun

    >
    >
    > Apologies. I did not realize that random string of words represented
    > both your input and output.
    >
    > I still, however, don't understand what you're trying to do. In
    > precisely what manner does the output relate to the input? It looks
    > like your output has random pieces of the input interspersed into the
    > input itself. You need to define how that output is generated.
    >
    > Paul Lalli
    >


    What I'm trying to do is read 3 words, print the 3 words, lose the 1st
    word, read the 4th word, print the 3 words, lose the new 1st word, read
    the new 4th word, print the new 3 words, etc. What the script does is
    basically the same, but for letters. So far I can't figure out how to do
    it with words.
    Huub, Oct 3, 2005
    #10
  11. Babacio

    Babacio Guest

    Huub <"h.v.niekerk at hccnet.nl"> writes:

    > Please read my o.p. because I did.
    > Codesample: s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g


    This is not correct. There is an extra \ before your first ].
    Obviously, removing it does not make this regexp do what you want.
    As general advice, you should copy/paste code instead of retyping it.

    > Input: this is a test for fun
    > Desired output: this is a is a test a test for test for fun for fun


    --
    Bé erre hue ixe eu elle, Bruxelles.
    Babacio, Oct 3, 2005
    #11
  12. Paul Lalli

    Paul Lalli Guest

    Huub wrote:
    > >>Codesample: s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g
    > >>Input: this is a test for fun
    > >>Desired output: this is a is a test a test for test for fun for fun


    > > I still, however, don't understand what you're trying to do. In
    > > precisely what manner does the output relate to the input? It looks
    > > like your output has random pieces of the input interspersed into the
    > > input itself. You need to define how that output is generated.


    > What I'm trying to do is read 3 words, print the 3 words, lose the 1st
    > word, read the 4th word, print the 3 words, lose the new 1st word, read
    > the new 4th word, print the new 3 words, etc. What the script does is
    > basically the same, but for letters. So far I can't figure out how to do
    > it with words.


    Ahh, okay, now we're getting somewhere.
    $ perl -le'$_ = q{this is a test for fun};
    s/(\w+\W+)(?=((?:\w+(?:\W+|$)){2}))/$1$2/g; print;'
    this is a is a test a test for test for funfor fun
    $

    "Search for (a word, and non-word characters) that are followed by two
    instances of (a word, and (non-word characters or the end-of-string)).
    Replace whatever we matched (ie, the first word and non-word
    characters) with both the word-and-nonword we matched, and the
    word-and-nonword's we peeked ahead into."

    The lack of a space after the second-to-last 'fun' is due to the lack
    of a space after the word 'fun' in the original string, and is
    consistent with your description. (Your sample output is not).
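
    The same kind of look-ahead can also list each overlapping
    three-word window separately, which makes the overlap easier to
    see. This is only a sketch; @windows is an illustrative name:

```perl
use strict;
use warnings;

my $text = q{this is a test for fun};
my @windows;

# The look-ahead captures three words without consuming them; only the
# first word is consumed, so the next match starts at the following word
# and overlaps the previous window.
while ($text =~ /\b(?=(\w+(?:\W+\w+){2}))\w+/g) {
    push @windows, $1;
}
print "$_\n" for @windows;
```

    The last two words never start a window, because fewer than three
    words remain from there on.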

    Paul Lalli
    Paul Lalli, Oct 3, 2005
    #12
  13. Huub

    Huub Guest

    Paul Lalli wrote:
    > Huub wrote:
    >
    >>>>Codesample: s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g
    >>>>Input: this is a test for fun
    >>>>Desired output: this is a is a test a test for test for fun for fun

    >
    >
    >>>I still, however, don't understand what you're trying to do. In
    >>>precisely what manner does the output relate to the input? It looks
    >>>like your output has random pieces of the input interspersed into the
    >>>input itself. You need to define how that output is generated.

    >
    >
    >>What I'm trying to do is read 3 words, print the 3 words, lose the 1st
    >>word, read the 4th word, print the 3 words, lose the new 1st word, read
    >>the new 4th word, print the new 3 words, etc. What the script does is
    >>basically the same, but for letters. So far I can't figure out how to do
    >>it with words.

    >
    >
    > Ahh, okay, now we're getting somewhere.
    > $ perl -le'$_ = q{this is a test for fun};
    > s/(\w+\W+)(?=((?:\w+(?:\W+|$)){2}))/$1$2/g; print;'
    > this is a is a test a test for test for funfor fun
    > $
    >
    > "Search for (a word, and non-word characters) that are followed by two
    > instances of (a word, and (non-word characters or the end-of-string)).
    > Replace whatever we matched (ie, the first word and non-word
    > characters) with both the word-and-nonword we matched, and the
    > word-and-nonword's we peeked ahead into."
    >
    > The lack of a space after the second-to-last 'fun' is due to the lack
    > of a space after the word 'fun' in the original string, and is
    > consistent with your description. (Your sample output is not).
    >
    > Paul Lalli
    >


    Ok, thank you. Maybe you can tell me what a non-"word" character is?
    Characters like !,@,#,$,% ?
    Huub, Oct 3, 2005
    #13
  14. Paul Lalli

    Paul Lalli Guest

    Huub wrote:

    > Ok, thank you. Maybe you can tell me what a non-"word" character is?
    > Characters like !,@,#,$,% ?


    from perldoc perlre (which I believe you said you were reading):
    \w Match a "word" character (alphanumeric plus "_")
    \W Match a non-"word" character

    So if \w matches anything that's alphabetic, numeric, or _ (ie,
    [a-zA-Z_]), then \W matches anything that's NOT alphabetic, numeric, or
    _ (ie, [^a-zA-Z_]).

    Paul Lalli
    Paul Lalli, Oct 3, 2005
    #14
  15. Paul Lalli

    Paul Lalli Guest

    Paul Lalli wrote:
    > Huub wrote:
    >
    > > Ok, thank you. Maybe you can tell me what a non-"word" character is?
    > > Characters like !,@,#,$,% ?

    >
    > from perldoc perlre (which I believe you said you were reading):
    > \w Match a "word" character (alphanumeric plus "_")
    > \W Match a non-"word" character
    >
    > So if \w matches anything that's alphabetic, numeric, or _ (ie,
    > [a-zA-Z_]), then \W matches anything that's NOT alphabetic, numeric, or
    > _ (ie, [^a-zA-Z_]).
    >


    Arg. Those should, of course, be:
    [a-zA-Z0-9_] and [^a-zA-Z0-9_], respectively.
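
    A quick way to check this, assuming ASCII sample characters only
    (the names @samples, %is_word and %in_class are illustrative):

```perl
use strict;
use warnings;

my @samples = ('a', 'Z', '7', '_', '!', '@', ' ');
my (%is_word, %in_class);

# Compare \w with the explicit ASCII class for each sample character.
for my $char (@samples) {
    $is_word{$char}  = ($char =~ /^\w$/)           ? 1 : 0;
    $in_class{$char} = ($char =~ /^[a-zA-Z0-9_]$/) ? 1 : 0;
    printf "[%s] \\w: %d  class: %d\n", $char, $is_word{$char}, $in_class{$char};
}
```

    For these samples the two columns agree: letters, digits and "_"
    match, while "!", "@" and the space do not.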

    Paul Lalli
    Paul Lalli, Oct 3, 2005
    #15
  16. Dr.Ruud

    Dr.Ruud Guest

    Paul Lalli:

    > So if \w matches anything that's alphabetic, numeric, or _ (ie,
    > [a-zA-Z_]), then \W matches anything that's NOT alphabetic, numeric,
    > or _ (ie, [^a-zA-Z_]).


    Not all alphabets are limited to [A-Za-z].

    --
    Affijn, Ruud

    "Gewoon is een tijger."
    Dr.Ruud, Oct 3, 2005
    #16
  17. Paul Lalli

    Paul Lalli Guest

    Dr.Ruud wrote:
    > Paul Lalli:
    >
    > > So if \w matches anything that's alphabetic, numeric, or _ (ie,
    > > [a-zA-Z_]), then \W matches anything that's NOT alphabetic, numeric,
    > > or _ (ie, [^a-zA-Z_]).

    >
    > Not all alphabets are limited to [A-Za-z].


    True. I should have specified "assuming 'use locale;' is not in
    effect"

    Paul Lalli
    Paul Lalli, Oct 3, 2005
    #17
  18. Dr.Ruud

    Dr.Ruud Guest

    Paul Lalli:
    > Dr.Ruud:
    >> Paul Lalli:


    >>> So if \w matches anything that's alphabetic, numeric, or _ (ie,
    >>> [a-zA-Z_]), then \W matches anything that's NOT alphabetic, numeric,
    >>> or _ (ie, [^a-zA-Z_]).

    >>
    >> Not all alphabets are limited to [A-Za-z].

    >
    > True. I should have specified "assuming 'use locale;' is not in
    > effect"


    Xor 'use utf8;' ("Use of locales with Unicode is discouraged.").

    Or an I/O layer. (encoding pragma)

    --
    Affijn, Ruud

    "Gewoon is een tijger."
    Dr.Ruud, Oct 3, 2005
    #18
  19. Huub wrote:
    >>> Codesample: s/(?=([\w\b\]{3}))[\w\b]{1}/(\1)/g
    >>> Input: this is a test for fun
    >>> Desired output: this is a is a test a test for test for fun for fun

    >>
    >>
    >> Apologies. I did not realize that random string of words represented
    >> both your input and output.
    >>
    >> I still, however, don't understand what you're trying to do. In
    >> precisely what manner does the output relate to the input? It looks
    >> like your output has random pieces of the input interspersed into the
    >> input itself. You need to define how that output is generated.
    >>
    >> Paul Lalli
    >>

    >
    > What I'm trying to do is read 3 words, print the 3 words, lose the 1st
    > word, read the 4th word, print the 3 words, lose the new 1st word, read
    > the new 4th word, print the new 3 words, etc. What the script does is
    > basically the same, but for letters. So far I can't figure out how to do
    > it with words.


    $ perl -le'
    $_ = q/this is a test for fun/;
    print;
    s/(\w+)(?=(\W+\w+\W+\w+))/$1$2/g;
    print;
    '
    this is a test for fun
    this is a is a test a test for test for fun for fun
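
    For comparison, a sketch of the same sliding window without a
    substitution, assuming the words are separated by whitespace.
    $size, @out and $tail are illustrative names, not part of the
    one-liner above:

```perl
use strict;
use warnings;

my $size  = 3;
my @words = split ' ', q/this is a test for fun/;
my @out;

# Emit every run of $size consecutive words.
for my $start (0 .. $#words - $size + 1) {
    push @out, @words[$start .. $start + $size - 1];
}

# The substitution leaves the last $size - 1 words untouched at the end
# of the string; append them here to get the same result.
my $tail = $#words - $size + 2;
$tail = 0 if $tail < 0;
push @out, @words[$tail .. $#words];

print "@out\n";
```

    Unlike the substitution, this normalizes all separators to single
    spaces.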



    John
    --
    use Perl;
    program
    fulfillment
    John W. Krahn, Oct 3, 2005
    #19
  20. Dr.Ruud

    Dr.Ruud Guest

    John W. Krahn:
    > Huub:


    >> read 3 words, print the 3 words, lose the
    >> 1st word, read the 4th word, print the 3 words, lose the new 1st
    >> word, read the new 4th word, print the new 3 words, etc. What the
    >> script does is basically the same, but for letters. So far I can't
    >> figure out how to do it with words.



    > s/(\w+)(?=(\W+\w+\W+\w+))/$1$2/g;


    Nice translation!


    An extra dying echo:

    $ perl -le'
    $_ = q/this is a test for fun/;
    print;
    s/(\w+)(?=((?:\W+\w+){1,2}))/$1$2/g;
    print;
    '
    this is a test for fun
    this is a is a test a test for test for fun for fun fun

    --
    Affijn, Ruud

    "Gewoon is een tijger."
    Dr.Ruud, Oct 3, 2005
    #20