regular expression for wc

Discussion in 'Perl Misc' started by Zeh Mau, Apr 23, 2007.

  1. Zeh Mau

    Zeh Mau Guest

    Zeh Mau, Apr 23, 2007
    #1

  2. Zeh Mau

    Thomas J. Guest

    REs are not able to "count",

    so the answer must be: no.

    However, they can help you to separate words, as "wc" does, but you have to
    count those words yourself (in your program).

    Thomas
     
    Thomas J., Apr 23, 2007
    #2
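A minimal sketch of the division of labour described in #2, assuming a "word" is any run of non-whitespace characters (roughly what wc uses). The regex only finds the words; the program does the counting. The name count_words is hypothetical and not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: the regex separates the words, Perl counts them.
sub count_words {
    my ($text) = @_;
    # In list context //g returns every match; assigning that list to ()
    # and then to a scalar yields the number of matches.
    my $n = () = $text =~ /\S+/g;
    return $n;
}

print count_words("is it possible to count words?"), "\n";
```

The count comes from the "= () =" assignment, not from the pattern itself, which is the point made above.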

  3. Zeh Mau

    Zeh Mau Guest

    Hello Thomas,

    I use LEX to count the matches of the REs, so I only have to define
    the correct REs, but I don't know what they should look like.

    Zeh
     
    Zeh Mau, Apr 23, 2007
    #3
  4. Zeh Mau

    Mirco Wahab Guest

    Thomas J. wrote:
    > REs are not able to "count".
    >
    > so the Answer must be: No.
    >
    > However they may help you to separate words like "wc", but you have to
    > count those words by yourself (your program).


    First shot:
    <===

    use strict;
    use warnings;

    my $text='Hello,

    is it possible to create a regular expression,
    which does exactly the same as the UNIX tool wc,
    which means counting
    lines, words and all signs of a file?

    Thanks,
    Zeh Mau';

    my %count = (lines=>0, words=>0, characters=>0);
    my $re = qr/(?:
    \b(?{$count{words}+=0.25})
    |
    \n(?{++$count{lines}})
    |
    .(?{++$count{characters}})
    )
    /xms;

    1 while $text =~ /$re/g;

    print "$_ => $count{$_}\n" for keys %count;

    <===

    Needs some more thinking (I will look
    at it again this evening ;-)

    Regards

    M.
     
    Mirco Wahab, Apr 23, 2007
    #4
  5. Zeh Mau

    Zeh Mau Guest

    > Well, that's quite rude.

    Sorry, I did not know where to reach the most people,
    so I chose the groups which seemed reasonable to me. I hope I
    have not offended anyone by doing so :)
     
    Zeh Mau, Apr 23, 2007
    #5
  6. Zeh Mau

    Zeh Mau Guest

    > If you restrict yourself to what the regular expression engine can do without
    > falling back to Perl, then the answer is "no", for a very simple reason:
    > you can only match what is present in the string you match against. And
    > usually, the number of lines, words, or characters isn't present in
    > the file.


    In LEX, I may specify
    %%
    \n {CountLines++;}

    So I get the number of lines, because every match increments the variable
    CountLines.

    But how can I separate whole words from the rest of the text?

    Zeh
     
    Zeh Mau, Apr 23, 2007
    #6
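One way to answer the question in #6 is to give every kind of input its own rule, as a lex scanner would. The sketch below mimics that in Perl with \G and /gc; it is an illustration under stated assumptions, not code from the thread. It assumes a word is a run of non-whitespace characters, and the name wc_scan is hypothetical. In lex terms the three rules would be roughly \n, [ \t]+ and [^ \t\n]+.

```perl
use strict;
use warnings;

# Hypothetical helper: scan the text one token at a time, lex-style.
# Each branch plays the role of one lex rule with its action.
sub wc_scan {
    my ($text) = @_;
    my %wc = (lines => 0, words => 0, chars => 0);
    pos($text) = 0;
    while (pos($text) < length($text)) {
        if ($text =~ /\G\n/gc) {             # rule 1: a newline
            $wc{lines}++;
            $wc{chars}++;
        }
        elsif ($text =~ /\G[^\S\n]+/gc) {    # rule 2: other whitespace
            $wc{chars} += $+[0] - $-[0];
        }
        elsif ($text =~ /\G\S+/gc) {         # rule 3: a whole word
            $wc{words}++;
            $wc{chars} += $+[0] - $-[0];
        }
    }
    return %wc;
}

my %wc = wc_scan("aaa bbb ccc\n");
print "$wc{lines} lines, $wc{words} words, $wc{chars} chars\n";
```

Every character falls into exactly one of the three classes, so each pass through the loop consumes input. Note that this counts characters, whereas wc -c counts bytes.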
  7. Zeh Mau

    Mirco Wahab Guest

    Mirco Wahab wrote:
    >
    > Needs some more thinking (will look
    > at it today on evening again ;-)


    As Abigail mentioned in another post,
    Perl's regexes allow code assertions,
    so this task isn't too hard.

    The following should work as a
    poor man's wc ;-)

    [wc.pl] ==>

    use strict;
    use warnings;

    my %wc = (lines=>1, words=>0, chars=>0);
    my $re = qr/ \b (?{ $wc{words} += 0.25 })
    | \n (?{ $wc{lines} ++ })
    | . (?{ $wc{chars} ++ })
    /x;

    my $text = do { local $/; <> };   # slurp all input at once

    print map "$wc{$_} $_, ", keys %wc
    if () = $text =~ /$re/g;

    <==


    Regards

    M.
     
    Mirco Wahab, Apr 23, 2007
    #7
  8. Zeh Mau

    Ala Qumsieh Guest

    Ala Qumsieh, Apr 23, 2007
    #8
  9. Zeh Mau

    -berlin.de Guest

    Mirco Wahab <> wrote in comp.lang.perl.misc:
    > Mirco Wahab wrote:
    > >
    > > Needs some more thinking (will look
    > > at it today on evening again ;-)

    >
    > As Abigail mentioned in another post,
    > Perls Regexes allow code assertions,
    > so this task isn't too hard.
    >
    > The following should work as
    > poor-mans wc ;-)
    >
    > [wc.pl] ==>
    >
    > use strict;
    > use warnings;
    >
    > my %wc = (lines=>1, words=>0, chars=>0);
    > my $re = qr/ \b (?{ $wc{words} += 0.25 })
    > | \n (?{ $wc{lines} ++ })
    > | . (?{ $wc{chars} ++ })
    > /x;
    >
    > my $text = do { local$/; <> };
    >
    > print map "$wc{$_} $_, ", keys %wc
    > if () = $text =~ /$re/g;


    Nice.

    I don't understand why it finds four /\b/ for each word, but that's
    apparently what happens.

    You're initializing the line count to one. For me, that makes it
    come out one high.

    The character count will be missing the line feeds. Make the
    second alternative

    | \n (?{ $wc{lines} ++; $wc{chars} ++})

    Anno
     
    -berlin.de, Apr 24, 2007
    #9
  10. Zeh Mau

    Mirco Wahab Guest

    -berlin.de wrote:
    > Mirco Wahab <> wrote in comp.lang.perl.misc:
    >> my %wc = (lines=>1, words=>0, chars=>0);
    >> my $re = qr/ \b (?{ $wc{words} += 0.25 })
    >> | \n (?{ $wc{lines} ++ })
    >> | . (?{ $wc{chars} ++ })
    >> /x;

    > I don't understand why it finds four /\b/ for each word, but that's
    > apparently what happens.


    I struggled over this too, but each word has two ends
    and the first character *in front* of a word is
    /on a word boundary/, as is the first character
    *of the word*. That makes 4 \b's.

    > You're initializing the line count to one. For me, that makes it
    > come out one high.


    If you have any text, you already start on line #1,
    that's why I modified this. What you see is probably
    the last \n of a text.

    > The character count will be missing the line feeds. Make the
    > second alternative
    >
    > | \n (?{ $wc{lines} ++; $wc{chars} ++})


    OK, you are possibly right. But - I did take them out
    because "word processors" don't count them (checked in
    Word 97 under wine).

    Regards & Thanks

    Mirco
     
    Mirco Wahab, Apr 24, 2007
    #10
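The disagreement about the line count in #9 and #10 comes down to what is being counted. wc -l counts newline characters, so a counter that starts at 1 comes out one high for text that ends in a newline. A small sketch of both conventions, with hypothetical helper names:

```perl
use strict;
use warnings;

# Hypothetical helper: number of newline characters, as wc -l reports.
sub newline_count {
    my ($text) = @_;
    return ($text =~ tr/\n//);   # tr without replacement only counts
}

# Hypothetical helper: number of lines a reader would see, counting a
# final line even when it lacks a terminating newline.
sub line_count {
    my ($text) = @_;
    my $n = ($text =~ tr/\n//);
    $n++ if length($text) && $text !~ /\n\z/;
    return $n;
}

for my $text ("a\nb\n", "a\nb") {
    printf "newlines: %d, lines: %d\n",
        newline_count($text), line_count($text);
}
```

The two counts agree for newline-terminated text and differ by one when the last line is unterminated.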
  11. Zeh Mau

    -berlin.de Guest

    Mirco Wahab <> wrote in comp.lang.perl.misc:
    > -berlin.de wrote:
    > > Mirco Wahab <> wrote in comp.lang.perl.misc:
    > >> my %wc = (lines=>1, words=>0, chars=>0);
    > >> my $re = qr/ \b (?{ $wc{words} += 0.25 })
    > >> | \n (?{ $wc{lines} ++ })
    > >> | . (?{ $wc{chars} ++ })
    > >> /x;

    > > I don't understand why it finds four /\b/ for each word, but that's
    > > apparently what happens.

    >
    > I struggled over this too, but each word has two ends
    > and the first character *in front* of a word is
    > /on a word boundary/, as is the first character
    > *of the word*. Makes #4 \b's.


    Generally a zero-width pattern doesn't match twice in the same
    place. There must be something else going on. Following the /\b/
    like this

    my $str = 'aaa bbb ccc';
    while ( $str =~ /\b/g ) {
        print "$str\n";
        print ' ' x $-[ 0], "^\n";
    }

    shows the expected number of 6 (not 12) matches.

    Anno
     
    -berlin.de, Apr 24, 2007
    #11
  12. Zeh Mau

    Mirco Wahab Guest

    -berlin.de wrote:
    > Generally a zero-width pattern doesn't match twice in the same
    > place. There must be something else going on. Following the /\b/
    > like this
    >
    > my $str = 'aaa bbb ccc';
    > while ( $str =~ /\b/g ) {
    > print "$str\n";
    > print ' ' x $-[ 0], "^\n";
    > }
    >
    > shows the expected number of 6 (not 12) matches.


    Hmmm, seems so ..

    But, putting out pos() during the match shows
    how the regex engine pecks 2x around each word
    boundary:

    ...
    my $re = qr/ \b (?{ $wc{words} += 0.25, print pos().',' })
    | \n (?{ $wc{lines} ++ })
    | . (?{ $wc{chars} ++ })
    /x;
    ...

    I can't assess what's the 'deep' reason
    for such behavior, maybe somebody can
    shed light on this.

    Regards

    M.
     
    Mirco Wahab, Apr 24, 2007
    #12
  13. [A complimentary Cc of this posting was sent to

    <-berlin.de>], who wrote in article <>:
    > > my $re = qr/ \b (?{ $wc{words} += 0.25 })
    > > | \n (?{ $wc{lines} ++ })
    > > | . (?{ $wc{chars} ++ })
    > > /x;


    > I don't understand why it finds four /\b/ for each word, but that's
    > apparently what happens.


    It finds two \b per word. It also FAILS to match \b at each boundary
    - but due to bugs in the REx above, even failing attempts run += code
    (there is no "undoing" for failing attempts).

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, Apr 24, 2007
    #13
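The effect described in #13 can be observed with a small experiment. This is a sketch under assumptions, not code from the thread: the helper names are hypothetical, and the exact number of code block runs may depend on the perl version, so only the number of reported matches (6, as in #11) is taken as given. The second helper is adapted from the example in the perlre documentation, which describes "local" inside a code block as being unwound on backtracking; that is one way to get the "undoing" that a plain += lacks.

```perl
use strict;
use warnings;

# Package variables, not lexicals: older perls handle lexicals inside
# (?{ ... }) blocks unreliably.
our ($runs, $cnt, $res);

# Hypothetical helper: returns (reported matches, code block executions)
# for /\b/g with a counting code block attached.
sub boundary_runs {
    my ($str) = @_;
    $runs = 0;
    my $matches = 0;
    $matches++ while $str =~ /\b(?{ $runs++ })/g;
    return ($matches, $runs);
}

# Hypothetical helper, adapted from perlre: "local" makes each increment
# temporary, so it is unwound when the engine backtracks over it.
sub counted_prefix {
    my ($str) = @_;
    ($cnt, $res) = (0, undef);
    $str =~ m<
        (?{ $cnt = 0 })
        (?: a (?{ local $cnt = $cnt + 1 }) )*
        aaaa
        (?{ $res = $cnt })
    >x;
    return $res;
}

my ($matches, $block_runs) = boundary_runs('aaa bbb ccc');
print "matches: $matches, code block runs: $block_runs\n";
print "a's before the final aaaa: ", counted_prefix('a' x 8), "\n";
```

If the second number printed on the first line is larger than the first, the code block has also run during attempts that were later rejected, which would account for the 0.25 fudge factor in the wc regex.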
  14. Zeh Mau

    Mirco Wahab Guest

    Ilya Zakharevich wrote:
    > <-berlin.de>], who wrote in article <>:
    >> I don't understand why it finds four /\b/ for each word, but that's
    >> apparently what happens.

    >
    > It finds two \b per word. It also FAILS to match \b at each boundary
    > - but due to bugs in the REx above, even failing attempts run += code
    > (there is no "undoing" for failing attempts).


    What is meant by 'bugs in the REx'?

    Can you help out with an explanation of why the
    following prints the "pseudo correct"
    word boundaries:

    ...

    my $chars ='Ilya Zakharevich';

    my $re = qr/ \b
    | \b (?{ print '\b:'.pos().',' })
    /x;

    () = $chars =~ /$re/g;

    ...

    Hmmm ...

    Thanks & Regards

    Mirco
     
    Mirco Wahab, Apr 24, 2007
    #14
  15. [A complimentary Cc of this posting was sent to
    Mirco Wahab
    <>], who wrote in article <f0lhek$l3d$-halle.de>:
    > Ilya Zakharevich wrote:
    > > <-berlin.de>], who wrote in article <>:
    > >> I don't understand why it finds four /\b/ for each word, but that's
    > >> apparently what happens.

    > >
    > > It finds two \b per word. It also FAILS to match \b at each boundary
    > > - but due to bugs in the REx above, even failing attempts run += code
    > > (there is no "undoing" for failing attempts).

    >
    > What's meant with 'bugs in the REx'?


    As I said: there is no "undoing" for failing attempts. It does +=
    even in the cases when the match will fail immediately after this.

    > Can you help out w/explanation why the
    > following prints the "pseudo correct"
    > word boundaries:
    >
    > ...
    >
    > my $chars ='Ilya Zakharevich';
    >
    > my $re = qr/ \b
    > | \b (?{ print '\b:'.pos().',' })
    > /x;


    Did you try

    use re 'debugcolor';

    ?

    Yours,
    Ilya
     
    Ilya Zakharevich, Apr 24, 2007
    #15
