regular expression for wc

Zeh Mau · Apr 23, 2007

Please go to this thread:

http://groups.google.de/group/regex/browse_thread/thread/e25c3e39aaafcd30?hl=de

Thanks for your support,

Zeh Mau

Thomas J. · Apr 23, 2007

REs are not able to "count".

So the answer must be: no.

However, they may help you separate words the way "wc" does, but you have to
count those words yourself (in your program).

Thomas
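
A minimal sketch of that division of labour in Perl: the regexes only find the pieces, and the program does the counting. The sub name count_text and the sample string are made up for illustration, and a "word" is assumed to be a run of non-whitespace characters.

```perl
use strict;
use warnings;

# Hypothetical helper: the regexes separate the pieces, Perl counts them.
sub count_text {
    my ($text) = @_;
    my $lines = () = $text =~ /\n/g;     # number of newline characters
    my $words = () = $text =~ /\S+/g;    # runs of non-whitespace
    my $chars = length $text;            # no regex needed for this one
    return ($lines, $words, $chars);
}

my ($l, $w, $c) = count_text("one two\nthree\n");
print "$l $w $c\n";    # prints "2 3 14"
```

The `() = ... /g` idiom forces list context, so the scalar on the left receives the number of matches rather than the matches themselves.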

Zeh Mau · Apr 23, 2007

Hello Thomas,

I use LEX to count the matches of the REs. So I only have to define
the correct REs, but I don't know what they should look like.

Zeh

Mirco Wahab · Apr 23, 2007

Thomas said:
REs are not able to "count".

So the answer must be: no.

However, they may help you separate words the way "wc" does, but you have to
count those words yourself (in your program).

First shot:
<===

use strict;
use warnings;

my $text='Hello,

is it possible to create a regular expression,
which does exactly the same as the UNIX tool wc,
which means counting
lines, words and all signs of a file?

Thanks,
Zeh Mau';

my %count = (lines => 0, words => 0, characters => 0);
my $re = qr/(?:
      \b (?{ $count{words} += 0.25 })
    |
      \n (?{ ++$count{lines} })
    |
      .  (?{ ++$count{characters} })
    )
/xms;

1 while $text =~ /$re/g;

print "$_ => $count{$_}\n" for keys %count;

<===

This needs some more thinking (I will look
at it again this evening ;-)

Regards

M.

Zeh Mau · Apr 23, 2007

Well, that's quite rude.

Sorry, I did not know where to reach the most people,
so I chose the groups that seemed reasonable to me. I hope I
have not offended anyone by doing this.

Zeh Mau · Apr 23, 2007

If you restrict yourself to what the regular expression engine can do without
falling back to Perl, then the answer is "no", for a very simple reason:
you can only match what is present in the string you match against. And
usually, the number of lines, words, or characters isn't present in
the file.

In LEX, I can specify
%%
\n {CountLines++;}

That gives me the number of lines, because every match increments the variable
CountLines.

But how can I separate whole words from the rest of the text?

Zeh
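
One way to picture the LEX approach is a rule per token, tried in order at the current position. The sketch below imitates that in Perl with \G and /gc. It is not LEX itself; the sub name scan and the word class [^ \t\n]+ are assumptions about what should count as a word.

```perl
use strict;
use warnings;

# Hypothetical scanner: each branch plays the role of one LEX rule.
sub scan {
    my ($text) = @_;
    my %n = (lines => 0, words => 0, chars => 0);
    pos($text) = 0;
    while (pos($text) < length $text) {
        if ($text =~ /\G\n/gc) {              # rule 1: a newline
            $n{lines}++;
            $n{chars}++;
        }
        elsif ($text =~ /\G[^ \t\n]+/gc) {    # rule 2: a whole word
            $n{words}++;
            $n{chars} += $+[0] - $-[0];       # length of the matched word
        }
        else {                                # rule 3: any other single character
            $text =~ /\G./gcs;
            $n{chars}++;
        }
    }
    return %n;
}

my %n = scan("one two\nthree\n");
print "$n{lines} $n{words} $n{chars}\n";    # prints "2 3 14"
```

Because the word rule consumes the whole run of non-blank characters in one match, each word increments the counter exactly once, which is the same idea as a LEX rule with an action attached.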

Mirco Wahab · Apr 23, 2007

Mirco said:
This needs some more thinking (I will look
at it again this evening ;-)

As Abigail mentioned in another post,
Perl's regexes allow code assertions,
so this task isn't too hard.

The following should work as a
poor man's wc ;-)

[wc.pl] ==>

use strict;
use warnings;

my %wc = (lines => 1, words => 0, chars => 0);
my $re = qr/ \b (?{ $wc{words} += 0.25 })
           | \n (?{ $wc{lines} ++ })
           | .  (?{ $wc{chars} ++ })
           /x;

my $text = do { local $/; <> };

print map "$wc{$_} $_, ", keys %wc
    if () = $text =~ /$re/g;

<==

Regards

M.

Ala Qumsieh · Apr 23, 2007

Zeh said:
Please go to this thread:

http://groups.google.de/group/regex/browse_thread/thread/e25c3e39aaafcd30?hl=de

If you want to recreate wc in Perl, then it has been already done for you:

http://ppt.cvs.sourceforge.net/*checkout*/ppt/ppt/bin/wc

--Ala

anno4000 · Apr 24, 2007

Mirco Wahab said:
Mirco said:

This needs some more thinking (I will look
at it again this evening ;-)


As Abigail mentioned in another post,
Perl's regexes allow code assertions,
so this task isn't too hard.

The following should work as a
poor man's wc ;-)

[wc.pl] ==>

use strict;
use warnings;

my %wc = (lines => 1, words => 0, chars => 0);
my $re = qr/ \b (?{ $wc{words} += 0.25 })
           | \n (?{ $wc{lines} ++ })
           | .  (?{ $wc{chars} ++ })
           /x;

my $text = do { local $/; <> };

print map "$wc{$_} $_, ", keys %wc
    if () = $text =~ /$re/g;

Nice.

I don't understand why it finds four /\b/ for each word, but that's
apparently what happens.

You're initializing the line count to one. For me, that makes it
come out one high.

The character count will be missing the line feeds. Make the
second alternative

| \n (?{ $wc{lines} ++; $wc{chars} ++})

Anno

Mirco Wahab · Apr 24, 2007

I don't understand why it finds four /\b/ for each word, but that's
apparently what happens.

I struggled with this too, but each word has two ends,
and the first character *in front* of a word is
/on a word boundary/, as is the first character
*of the word*. That makes four \b's.

You're initializing the line count to one. For me, that makes it
come out one high.

If you have any text, you already start on line #1,
that's why I modified this. What you see is probably
the last \n of a text.

The character count will be missing the line feeds. Make the
second alternative

| \n (?{ $wc{lines} ++; $wc{chars} ++})

OK, you may be right. But I took them out
because "word processors" don't count them (checked in
Word 97 under wine).

Regards & Thanks

Mirco

anno4000 · Apr 24, 2007

Mirco Wahab said:
I struggled with this too, but each word has two ends,
and the first character *in front* of a word is
/on a word boundary/, as is the first character
*of the word*. That makes four \b's.

Generally a zero-width pattern doesn't match twice in the same
place. There must be something else going on. Following the /\b/
like this

my $str = 'aaa bbb ccc';
while ( $str =~ /\b/g ) {
    print "$str\n";
    print ' ' x $-[0], "^\n";
}

shows the expected number of 6 (not 12) matches.

Anno

Mirco Wahab · Apr 24, 2007

Generally a zero-width pattern doesn't match twice in the same
place. There must be something else going on. Following the /\b/
like this

my $str = 'aaa bbb ccc';
while ( $str =~ /\b/g ) {
    print "$str\n";
    print ' ' x $-[0], "^\n";
}

shows the expected number of 6 (not 12) matches.

Hmm, so it seems.

But printing pos() during the match shows
how the regex engine pecks twice around each word
boundary:

...
my $re = qr/ \b (?{ $wc{words} += 0.25, print pos().',' })
| \n (?{ $wc{lines} ++ })
| . (?{ $wc{chars} ++ })
/x;
...

I can't assess what's the 'deep' reason
for such behavior, maybe somebody can
shed light on this.

Regards

M.

Ilya Zakharevich · Apr 24, 2007


I don't understand why it finds four /\b/ for each word, but that's
apparently what happens.

It finds two \b per word. It also FAILS to match \b at each boundary
- but due to bugs in the REx above, even failing attempts run += code
(there is no "undoing" for failing attempts).

Hope this helps,
Ilya
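
The difference between successful matches and executions of the code block can be made visible with a small experiment. This is only a sketch: the variable names are made up, and the number of code block runs is an implementation detail of the regex engine that may differ between Perl versions, so no expected output is given for it.

```perl
use strict;
use warnings;

our $runs = 0;    # package variable, so the code block can see it on any Perl 5

my $str     = 'aaa bbb ccc';
my $matches = () = $str =~ /\b(?{ $runs++ })/g;    # count successful matches

print "successful matches: $matches\n";
print "code block runs:    $runs\n";
```

If the second number comes out larger than the first, the surplus is made up of attempts where \b matched and the code ran, but the engine then rejected the attempt, for instance because an empty match is not allowed twice at the same position.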

Mirco Wahab · Apr 24, 2007

Ilya said:
It finds two \b per word. It also FAILS to match \b at each boundary
- but due to bugs in the REx above, even failing attempts run += code
(there is no "undoing" for failing attempts).

What is meant by 'bugs in the REx'?

Can you help explain why the
following prints the "pseudo correct"
word boundaries:

...

my $chars ='Ilya Zakharevich';

my $re = qr/ \b
| \b (?{ print '\b:'.pos().',' })
/x;

() = $chars =~ /$re/g;

...

Hmmm ...

Thanks & Regards

Mirco

Ilya Zakharevich · Apr 24, 2007


What is meant by 'bugs in the REx'?

As I said: there is no "undoing" for failing attempts. It does +=
even in the cases when the match will fail immediately after this.

Can you help explain why the
following prints the "pseudo correct"
word boundaries:

...

my $chars ='Ilya Zakharevich';

my $re = qr/ \b
| \b (?{ print '\b:'.pos().',' })
/x;

Did you try

use re 'debugcolor';

?

Yours,
Ilya
