regular expression for wc

Thomas J.

REs are not able to "count" on their own,

so the answer must be: no.

However, they may help you to separate words the way "wc" does, but you have to
count those words yourself (in your program).
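
A minimal sketch of that division of labour, assuming the whole input fits in one string: the regexes only locate newlines and words, and the surrounding Perl does the counting. The helper name wc_counts and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the patterns find things, Perl counts the finds.
sub wc_counts {
    my ($text) = @_;
    my $lines = () = $text =~ /\n/g;    # count newline characters
    my $words = () = $text =~ /\S+/g;   # count runs of non-whitespace
    my $chars = length $text;           # every character, newlines included
    return ($lines, $words, $chars);
}

my ($l, $w, $c) = wc_counts("Hello Thomas,\nhow are you?\n");
print "$l $w $c\n";    # prints "2 5 27"
```

The "() =" in the middle forces list context, so the assignment yields the number of matches rather than the matches themselves.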

Thomas
 
Zeh Mau

Hello Thomas,

I use LEX to count the matches of the REs, so I only have to define
the correct REs, but I don't know what they should look like.

Zeh
 
Mirco Wahab

Thomas said:
REs are not able to "count" on their own,

so the answer must be: no.

However, they may help you to separate words the way "wc" does, but you have to
count those words yourself (in your program).

First shot:
<===

use strict;
use warnings;

my $text='Hello,

is it possible to create a regular expression,
which does exactly the same as the UNIX tool wc,
which means counting
lines, words and all signs of a file?

Thanks,
Zeh Mau';

my %count = (lines=>0, words=>0, characters=>0);
my $re = qr/(?:
\b(?{$count{words}+=0.25})
|
\n(?{++$count{lines}})
|
.(?{++$count{characters}})
)
/xms;

1 while $text =~ /$re/g;

print "$_ => $count{$_}\n" for keys %count;

<===

Needs some more thinking (will look
at it today on evening again ;-)

Regards

M.
 
Zeh Mau

Well, that's quite rude.

Sorry, I did not know where best to reach most people,
so I chose the groups which seemed reasonable to me. I hope I
have not offended anyone by doing so :)
 
Zeh Mau

If you restrict yourself to what the regular expression engine can do without
falling back to Perl, then the answer is "no", for a very simple reason:
you can only match what is present in the string you match against. And
usually, the number of lines, words, or characters isn't present in
the file.

In LEX, I may specify
%%
\n {CountLines++;}

That gives me the number of lines: every match increments the variable
CountLines.

But how can I separate whole words from the rest of the text?
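
One way to picture the rule set, sketched in Perl rather than LEX: three rules tried in order at the current position, each with its own action, where a "word" is taken to be a run of characters other than space, tab and newline. The function name lex_style_wc and the hash keys are assumptions for illustration, not anything from LEX itself.

```perl
use strict;
use warnings;

# Hypothetical scanner: each branch plays the role of one LEX rule.
sub lex_style_wc {
    my ($text) = @_;
    my %n = (lines => 0, words => 0, chars => 0);
    pos($text) = 0;
    while (pos($text) < length $text) {
        if ($text =~ /\G[^ \t\n]+/gc) {    # rule 1: a whole word
            $n{words}++;
            $n{chars} += $+[0] - $-[0];    # length of the matched word
        }
        elsif ($text =~ /\G\n/gc) {        # rule 2: end of line
            $n{lines}++;
            $n{chars}++;
        }
        elsif ($text =~ /\G./gcs) {        # rule 3: any other character
            $n{chars}++;
        }
    }
    return %n;
}

my %n = lex_style_wc("aaa bbb\nccc\n");
print "$n{lines} $n{words} $n{chars}\n";
```

The \G anchor with /gc keeps every attempt at the current position, so the counters only change when a rule has actually matched.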

Zeh
 
Mirco Wahab

Mirco said:
Needs some more thinking (will look
at it today on evening again ;-)

As Abigail mentioned in another post,
Perl's regexes allow code assertions,
so this task isn't too hard.

The following should work as a
poor man's wc ;-)

[wc.pl] ==>

use strict;
use warnings;

my %wc = (lines=>1, words=>0, chars=>0);
my $re = qr/ \b (?{ $wc{words} += 0.25 })
| \n (?{ $wc{lines} ++ })
| . (?{ $wc{chars} ++ })
/x;

my $text = do { local$/; <> };

print map "$wc{$_} $_, ", keys %wc
if () = $text =~ /$re/g;

<==


Regards

M.
 
anno4000

Mirco Wahab said:
Mirco said:
Needs some more thinking (will look
at it today on evening again ;-)

As Abigail mentioned in another post,
Perl's regexes allow code assertions,
so this task isn't too hard.

The following should work as a
poor man's wc ;-)

[wc.pl] ==>

use strict;
use warnings;

my %wc = (lines=>1, words=>0, chars=>0);
my $re = qr/ \b (?{ $wc{words} += 0.25 })
| \n (?{ $wc{lines} ++ })
| . (?{ $wc{chars} ++ })
/x;

my $text = do { local$/; <> };

print map "$wc{$_} $_, ", keys %wc
if () = $text =~ /$re/g;

Nice.

I don't understand why it finds four /\b/ for each word, but that's
apparently what happens.

You're initializing the line count to one. For me, that makes it
come out one high.

The character count will be missing the line feeds. Make the
second alternative

| \n (?{ $wc{lines} ++; $wc{chars} ++})

Anno
 
Mirco Wahab

I don't understand why it finds four /\b/ for each word, but that's
apparently what happens.

I struggled over this too, but each word has two ends
and the first character *in front* of a word is
/on a word boundary/, as is the first character
*of the word*. Makes #4 \b's.

You're initializing the line count to one. For me, that makes it
come out one high.

If you have any text, you start already on line #1,
that's why I modified this. What you see is probably
the last \n of a text.

The character count will be missing the line feeds. Make the
second alternative

| \n (?{ $wc{lines} ++; $wc{chars} ++})

OK, you are possibly right. But - I did take them out
because "word processors" don't count them (checked in
Word 97 under wine).

Regards & Thanks

Mirco
 
anno4000

Mirco Wahab said:
I struggled over this too, but each word has two ends
and the first character *in front* of a word is
/on a word boundary/, as is the first character
*of the word*. Makes #4 \b's.

Generally a zero-width pattern doesn't match twice in the same
place. There must be something else going on. Following the /\b/
like this

my $str = 'aaa bbb ccc';
while ( $str =~ /\b/g ) {
    print "$str\n";
    print ' ' x $-[ 0], "^\n";
}

shows the expected number of 6 (not 12) matches.
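
The same check can be condensed by collecting the match offsets instead of drawing them. This is only a sketch; the array name @pos is chosen here for illustration.

```perl
use strict;
use warnings;

my $str = 'aaa bbb ccc';

# Record the start offset of every zero-width \b match.
my @pos;
push @pos, $-[0] while $str =~ /\b/g;

print scalar(@pos), " boundaries at: @pos\n";
# prints "6 boundaries at: 0 3 4 7 8 11"
```

Each word contributes exactly two offsets, one at its start and one just past its end.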

Anno
 
Mirco Wahab

Generally a zero-width pattern doesn't match twice in the same
place. There must be something else going on. Following the /\b/
like this

my $str = 'aaa bbb ccc';
while ( $str =~ /\b/g ) {
    print "$str\n";
    print ' ' x $-[ 0], "^\n";
}

shows the expected number of 6 (not 12) matches.

Hmmm, seems so ..

But, putting out pos() during the match shows
how the regex engine pecks 2x around each word
boundary:

...
my $re = qr/ \b (?{ $wc{words} += 0.25, print pos().',' })
| \n (?{ $wc{lines} ++ })
| . (?{ $wc{chars} ++ })
/x;
...

I can't assess what's the 'deep' reason
for such behavior, maybe somebody can
shed light on this.

Regards

M.
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to

I don't understand why it finds four /\b/ for each word, but that's
apparently what happens.

It finds two \b per word. It also FAILS to match \b at each boundary
- but due to bugs in the REx above, even failing attempts run += code
(there is no "undoing" for failing attempts).
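
A small sketch of the general effect, separate from the wc regex above: a code block placed in the middle of a pattern runs as soon as the engine reaches it, even if the rest of that attempt then fails. The counter $attempts and the sample string are made up for illustration, and the exact number of runs may depend on what the regex optimizer does in a given Perl version.

```perl
use strict;
use warnings;

# Hypothetical counter; a package variable keeps the example portable
# across Perl versions that treat lexicals in code blocks differently.
our $attempts = 0;

my $text = 'ab1';

# \w must be followed by \d. At 'a' the \w succeeds and the code block
# runs, but \d then fails on 'b', so that attempt is abandoned. The
# increment has already happened and is not rolled back.
my $matches = () = $text =~ / \w (?{ $attempts++ }) \d /xg;

print "code block ran $attempts time(s), pattern matched $matches time(s)\n";
```

Counting in the code that consumes successful matches, rather than inside the pattern, avoids the discrepancy.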

Hope this helps,
Ilya
 
Mirco Wahab

Ilya said:
It finds two \b per word. It also FAILS to match \b at each boundary
- but due to bugs in the REx above, even failing attempts run += code
(there is no "undoing" for failing attempts).

What's meant by 'bugs in the REx'?

Can you help out w/explanation why the
following prints the "pseudo correct"
word boundaries:

...

my $chars ='Ilya Zakharevich';

my $re = qr/ \b
| \b (?{ print '\b:'.pos().',' })
/x;

() = $chars =~ /$re/g;

...

Hmmm ...

Thanks & Regards

Mirco
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Mirco Wahab
What's meant with 'bugs in the REx'?

As I said: there is no "undoing" for failing attempts. It does +=
even in the cases when the match will fail immediately after this.

Can you help out w/explanation why the
following prints the "pseudo correct"
word boundaries:

...

my $chars ='Ilya Zakharevich';

my $re = qr/ \b
| \b (?{ print '\b:'.pos().',' })
/x;

Did you try

use re 'debugcolor';

?

Yours,
Ilya
 