s modifier doesn't seem to work


fmassion

Hi everybody,

I am currently testing a string search over line breaks.

My file is UTF-8 encoded.

This is my test text (with linebreaks at the end):
----------
Das ist ein Beispiel mit 3 Sätzen
Das ist ein 1122-22-11 Format
Hier ist keine Zahl.
Hier ist kein Punkt
nur Text Hier ist nur Text ist aber nur Text
----------

This is a code extract:

foreach $satz (@satz) {
    chomp $satz;
    if ($satz =~ m/\d(?s)(.*)keine/g) {
        $satz =~ s/$&/xxxx/g;
    }
    print "$satz\n";
}

I would expect the following result for the first three lines:
'Das ist ein Beispiel mit xxxxx Zahl.'

With this search string, however, I get no match. I have entered the same expression in UltraEdit (Regex-Perl-Search) and it works correctly.

What is wrong here?
 
Peter J. Holzer

I am currently testing a string search over line breaks. [...]
This is my test text (with linebreaks at the end):
----------
Das ist ein Beispiel mit 3 Sätzen
Das ist ein 1122-22-11 Format
Hier ist keine Zahl.
Hier ist kein Punkt
nur Text Hier ist nur Text ist aber nur Text
---------- [...]
if ($satz =~ m/\d(?s)(.*)keine/g) { [...]
With this search string, however, I get no match. [...]
What is wrong here?

Read the section "Modifiers" in perldoc perlre.

hp
 
George Mpouras

I would expect the following result for the first three lines:
'Das ist ein Beispiel mit xxxxx Zahl.'

With this search string, however, I get no match. I have entered the same expression in UltraEdit (Regex-Perl-Search) and it works correctly.

What is wrong here?



while (<DATA>)
{
s/(\d|-|keine)+/xxxx/g;
print "$_"
}

__DATA__
Das ist ein Beispiel mit 3 Sätzen
Das ist ein 1122-22-11 Format
Hier ist keine Zahl.
Hier ist kein Punkt
nur Text Hier ist nur Text ist aber nur Text
 
Peter J. Holzer

Quoth "Peter J. Holzer":
if ($satz =~ m/\d(?s)(.*)keine/g) { [...]
With this search string, however, I get no match. [...]
What is wrong here?

Read the section "Modifiers" in perldoc perlre.

Read the section '(?adlupimsx-imsx)' in perldoc perlre :).

I've cancelled that article. Either I wasn't fast enough or your
news server doesn't honor cancels (without cancel-lock).

hp
 
fmassion

I think Ben has the right hint. Indeed I read the file into the array (@satz) and then I go
'foreach $satz (@satz)'
George's code doesn't work though. It returns the following result for the first 3 lines:

Das ist ein Beispiel mit xxxx Sätzen
Das ist ein xxxx Format
Hier ist xxxx Zahl.

The solution is still pending but thanks for the help.

On Saturday, 10 August 2013 11:16:58 UTC+2, (e-mail address removed) wrote:
 
fmassion

This works as expected, but I don't quite understand what happens


undef $/;
while (<DATA>) {
chomp;
print "$_<<\n";
s/\d(.*)Zahl/xxxx/sg;
print "\n$_\n"
}
It searches over the first 3 lines and outputs as expected:
'Das ist ein Beispiel mit xxxx'


On Saturday, 10 August 2013 11:16:58 UTC+2, (e-mail address removed) wrote:
 
George Mpouras

Please explain the requirements again in more detail. I cannot
understand what you expect.
 
Charles DeRykus

This works as expected, but I don't quite understand what happens


undef $/;
while (<DATA>) {
chomp;
print "$_<<\n";
s/\d(.*)Zahl/xxxx/sg;
print "\n$_\n"
}
It searches over the first 3 lines and outputs as expected:
'Das ist ein Beispiel mit xxxx'

See: perldoc perlvar --> $/

See: perldoc perlretut --> why '.' matches everything but "\n"
or
See: perldoc perlre -> Modifiers --> s Treat string as single line
 
fmassion

On Saturday, 10 August 2013 21:57:07 UTC+2, Ben Morrow wrote:
[Please quote properly: that is, put your reply underneath the bit of
text you are replying to. It's also not helpful to keep replying to
yourself; instead you should reply to the article you are, um, replying
to. You appear to be using Google Groups, which has recently started
inserting extra blank lines whenever it quotes something; if you can't
find any way of turning this off you need to remove them by hand before
posting.]



Quoth (e-mail address removed):
On Saturday, 10 August 2013 11:16:58 UTC+2, (e-mail address removed) wrote:
I am currently testing a string search over line breaks.
[...]
This is a code extract:

foreach $satz (@satz) {
chomp $satz;
if ($satz =~ m/\d(?s)(.*)keine/g) {
$satz =~ s/$&/xxxx/g;
}
print "$satz\n";
}



I would expect the following result for the first three lines:
'Das ist ein Beispiel mit xxxxx Zahl.'

With this search string, however, I get no match. I have entered the
This works as expected, but I don't quite understand what happens
undef $/;



This is documented in perldoc perlvar, under $/. Setting $/ to undef
causes <> to read the whole file in one go. This means you now have your
whole file in one string, so the s/// works over multiple lines.


while (<DATA>) {



Since you are reading the whole file, there will only ever be one entry
to loop over, so you don't really need a loop.





chomp;

With $/=undef chomp doesn't do anything.


print "$_<<\n";
s/\d(.*)Zahl/xxxx/sg;
print "\n$_\n"
It searches over the first 3 lines and outputs as expected:
'Das ist ein Beispiel mit xxxx'



Since you're only doing one substitution it would be better to use an
ordinary named variable and no loop:

my $text = <DATA>;
print "$text<<\n";

$text =~ s/\d(.*)Zahl/xxxx/sg;
print "\n$text\n";



Ben

[Sorry for not replying properly. I hope this is OK now]

I understand what 'undef $/' does but it seems to be a workaround. Basically my goal is:

1) Read a text into an array
2) Iterate through the elements of the array: 'foreach $satz (@satz)'
3) Test various search and replace Regex (as a matter of fact I am working through the Regex Cookbook of Jan Goyvaerts & Steven Levithan). In this context, one of several tests concerns the s modifier. I just wonder why it isn't possible to search for an expression which spreads over more than one line if I add this modifier. It works in UltraEdit. It works in a few other tools as well, but I can't make it work in my Perl script. If I use the undef workaround, other search expressions (e.g. with $ to mark the end of the string) won't work.

In one of the tools I use (Expresso), I see that the EOL is coded as [CR][LF]. Is this a reason for the problem with the s modifier?
 
Peter J. Holzer

On Saturday, 10 August 2013 21:57:07 UTC+2, Ben Morrow wrote:
[Please quote properly: that is, put your reply underneath the bit of
text you are replying to. It's also not helpful to keep replying to
yourself; instead you should reply to the article you are, um, replying
to. You appear to be using Google Groups, which has recently started
inserting extra blank lines whenever it quotes something; if you can't
find any way of turning this off you need to remove them by hand before
posting.]



Quoth (e-mail address removed):
On Saturday, 10 August 2013 11:16:58 UTC+2, (e-mail address removed) wrote:
I am currently testing a string search over line breaks.
[...]

This is a code extract:
foreach $satz (@satz) {
chomp $satz;
if ($satz =~ m/\d(?s)(.*)keine/g) {
$satz =~ s/$&/xxxx/g;

print "$satz\n";




I would expect the following result for the first three lines:
'Das ist ein Beispiel mit xxxxx Zahl.'

With this search string, however, I get no match. I have entered the

This works as expected, but I don't quite understand what happens

undef $/;
[...]
[Sorry for not replying properly. I hope this is OK now]

Not really. You are still quoting everything (whether it is relevant or
not) and you haven't removed the empty lines inserted by Google. So we
have to scroll/read through 130 lines of quotes which may or may not be
relevant. I dare say that not every one of us has the patience.

Do yourself and us a favour, get a real newsreader and use one of the
free news servers (e.g. albasani).

I understand what 'undef $/' does but it seems to be a workaround.
Basically my goal is:

1) Read a text into an array

What are the elements of the array? Lines?

2) Iterate through the elements of the array: 'foreach $satz (@satz)'

So in each iteration of the loop you are looking at one line in
isolation.

3) Test various search and replace Regex (as a matter of fact I am
working through the Regex Cookbook of Jan Goyvaerts & Steven
Levithan). In this context, one of several tests concerns the s
modifier. I just wonder why it isn't possible to search for an
expression which spreads over more than one line if I add this
modifier.

That's what the /s modifier does. But there actually have to be several
lines in the variable you are looking at for this to work. If the other
lines are in different variables, how can perl know that you would want
to match those other variables, too, especially if you tell it
explicitly to look only at this variable?

It works in UltraEdit. It works in a few other tools as well

That's because UltraEdit and those other tools treat the whole text as
a unit. But your script (not Perl - *your* script) splits it into many
small units and looks at each of them in isolation. None of these small
units matches.
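
A minimal sketch of that difference, using three of the sample lines from the original post. "Sätzen" is written as ASCII "Saetzen" so the snippet does not depend on the source encoding, and the names $hits and $replaced are only for illustration:

```perl
use strict;
use warnings;

# Each element ends in "\n", just as after reading a file in list context.
my @satz = (
    "Das ist ein Beispiel mit 3 Saetzen\n",
    "Das ist ein 1122-22-11 Format\n",
    "Hier ist keine Zahl.\n",
);

# Line by line: no single element contains a digit followed by "keine",
# so /s has nothing to work with.
my $hits = grep { /\d.*keine/s } @satz;
print "lines matching on their own: $hits\n";

# Joined: the whole text is one string, so /s lets "." cross the newlines.
my $alles = join '', @satz;
(my $replaced = $alles) =~ s/\d.*keine/xxxx/s;
print $replaced;
```

The first print reports 0 matching lines; the second prints "Das ist ein Beispiel mit xxxx Zahl.", which is the result the original post was expecting.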

hp
 
fmassion

On Saturday, 10 August 2013 11:16:58 UTC+2, (e-mail address removed) wrote:
[Sorry for not replying properly. I hope this is OK now]
Nope, the blank lines are still there.
Sorry to Peter, Ben and all of you. I hope it's fine now
[...]
Why do you want it in an array, rather than a single string?
Because I may want to do things only with the $satz variables which meet the regex. E.g. send them to another array or whatever. This isn't possible when I read only one big large string.
 
Peter J. Holzer

On Saturday, 10 August 2013 11:16:58 UTC+2, (e-mail address removed) wrote:
[Sorry for not replying properly. I hope this is OK now]
Nope, the blank lines are still there.
Sorry to Peter, Ben and all of you. I hope it's fine now

Not quite, but a lot better, thanks.

Because I may want to do things only with the $satz variables which
meet the regex.

Apparently none of them does.

hp
 
Charles DeRykus

On Saturday, 10 August 2013 11:16:58 UTC+2, (e-mail address removed) wrote:
[Sorry for not replying properly. I hope this is OK now]
Nope, the blank lines are still there.
Sorry to Peter, Ben and all of you. I hope it's fine now
[...]
Why do you want it in an array, rather than a single string?
Because I may want to do things only with the $satz variables which meet the regex.

E.g. send them to another array or whatever. This isn't possible when I
read only one big large string.

The problem is the match may extend over several $satz. If you wanted
to identify those individual $satz which are part of the match, you
could do something like this:

my @satz = <DATA>;
my $alles = join('', @satz);

my $match;
if ( $alles =~ /^.*\d.*Zahl.*?\n/gsmap ) {
    $match = ${^MATCH};
    foreach my $satz (@satz) {
        if ( $match =~ /$satz/ ) {
            #print "sentence is part of match: $satz"
            ...
        }
    }
}
 
Charles DeRykus

...

my @satz = <DATA>;
my $alles = join('', @satz);

my $match;
if ( $alles =~ /^.*\d.*Zahl.*?\n/gsmap ) {
$match = ${^MATCH};
foreach my $satz (@satz) {
if ( $match =~ /$satz/ ) {
#print "sentence is part of match: $satz"
...
}
}
}

You could omit /p too:

my $match;
if ( $alles =~ /^(.*\d.*Zahl.*?\n)/gsma ) {
    $match = $1;
    foreach my $satz (@satz) {
        if ( $match =~ /$satz/ ) {
            #print "sentence is part of match: $satz"
            ...
        }
    }
}
 
Charles DeRykus

Quoth Charles DeRykus:


/^\Q$satz/m

Also, this may pick up lines that were not part of the originally-
matched text. Given that the match is anchored to a whole line fore-and-
aft ($satz will contain a trailing newline) this can only happen with
whole duplicated lines, but it may still be a problem.
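
To illustrate why the quoting and the anchor matter, here is a small sketch. The strings are made up for the purpose (they are not from the original data), chosen so that an unquoted "." matches something it should not:

```perl
use strict;
use warnings;

# Hypothetical matched text and a hypothetical line that is NOT part of it.
my $match = "Preis 3x5 Euro\nkeine Zahl.\n";
my $satz  = "3.5 Euro\n";

# Unquoted, "." is a wildcard, so "3.5" also matches "3x5", and without
# an anchor the pattern may match in the middle of a line.
my $unquoted = $match =~ /$satz/       ? 1 : 0;

# \Q...\E makes every character literal; ^ with /m anchors at a line start.
my $quoted   = $match =~ /^\Q$satz\E/m ? 1 : 0;

print "unquoted: $unquoted, quoted and anchored: $quoted\n";
```

This prints "unquoted: 1, quoted and anchored: 0": the unquoted pattern reports a line that was never in the text. The caveat about whole duplicated lines still applies to the quoted form.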

And it's been bothering me since posting... here's a messier solution
to ensure sentences overlap the target begin/end positions:

use 5.012; # so each will work on arrays
....
my @satz = <DATA>;
my $alles = join('', @satz);

my ($b, $e) = (0, 0);
my @pos = map { $b= $e+1 if $e; $e += (length($_)-1); [$b,$e] } @satz;

if ( $alles =~ /^(.*\d.*Zahl.*?\n)/gsma ) {
    my($match, $begin, $end) = ($1, $-[0], $+[0]);

    while( my($i,$satz) = each @satz ) {
        next unless ($pos[$i][0] >= $begin and $pos[$i][0] <= $end)
                 or ($pos[$i][1] >= $begin and $pos[$i][1] <= $end);

        if ( $match =~ /$satz/ ) {
            print "sentence is part of match: $satz\n\n"
            ...
        }
    }
}
 
Charles DeRykus

...
use 5.012; # so each will work on arrays
...
my @satz = <DATA>;
my $alles = join('', @satz);

my ($b, $e) = (0, 0);
my @pos = map { $b= $e+1 if $e; $e += (length($_)-1); [$b,$e] } @satz;

if ( $alles =~ /^(.*\d.*Zahl.*?\n)/gsma ) {

That .* will match across newlines, so the ^ (and the /m) does nothing.
my($match, $begin, $end) = ($1, $-[0], $+[0]);

while( my($i,$satz) = each @satz ) {
next unless ($pos[$i][0] >= $begin and $pos[$i][0] <= $end)
or ($pos[$i][1] >= $begin and $pos[$i][1] <= $end);

if ( $match =~ /$satz/ ) {

You're still not quoting $satz. It's really important to quote user data
before interpolating it into a pattern. You're also not anchoring the
match at the beginning, so a line will match if it only ends with $satz.

Thanks, I missed that... very important to have it there.

print "sentence is part of match: $satz\n\n"
...
}
}
}

But this is still a great deal more complicated than

my $alles = slurp \*DATA;

while (my ($match) =
    $alles =~ /([^\n]* \d .* Zahl [^\n]*)/gsx
    # or perhaps /(.* \d (?s:.)* Zahl .*)/gx
    # or /(\N* \d .* Zahl \N*)/gsx if you've got 5.12
) {
    for my $satz (split /\n/, $match) {
        # make that /(?<=\n)/ if you don't want to chomp
        print "sentence is part of match: $satz\n\n";
    }
}

Yes, I agree that's more concise and clearer to many.
But, it'll loop endlessly :)

I think you probably meant:

while ( $alles =~ /([^\n]* \d .* Zahl [^\n]*)/gsx ) {
    my $match = $1;
    ...
}
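
For anyone wondering why the first form never ends: in list context a //g match returns all matches at once, and a list assignment used as a condition is true whenever the right-hand side produced at least one element, so the condition never becomes false. In scalar context each pass resumes at pos($alles) instead. A runnable sketch of the scalar-context form, with the sample lines inlined (ASCII "Saetzen" instead of "Sätzen") and an illustrative @lines array:

```perl
use strict;
use warnings;

my $alles = join '',
    "Das ist ein Beispiel mit 3 Saetzen\n",
    "Das ist ein 1122-22-11 Format\n",
    "Hier ist keine Zahl.\n",
    "Hier ist kein Punkt\n";

my @lines;
# Scalar-context //g: each pass continues after the previous match and
# the loop ends as soon as no further match is found.
while ( $alles =~ /([^\n]* \d .* Zahl [^\n]*)/gsx ) {
    my $match = $1;
    push @lines, split /\n/, $match;
}
print "sentence is part of match: $_\n" for @lines;
```

With this input the loop body runs once and the first three lines are reported; the fourth line is not part of the match.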
 
fmassion

Thanks to all of you for your efforts and ideas. Let me summarize the lessons I've learned in this discussion.
The task was: import a text, apply a regex which extends over a line break, and display/modify the lines matching the expression.
The original approach failed because the text was not read into one string, but split into lines in an array.
I then wanted to be able to print each individual line of the array and to use ^ and $ in line-based regular expressions.
I have tried all the suggested code. Not everything has worked. This is my current code, with which I manage to get the matched lines and the entire text:

use utf8; # allows UTF-8 files to be processed
binmode STDIN, ":utf8"; # input
binmode STDOUT, ":utf8"; # output

#undef $/; # is not required as <DATA> read into array and then joined
open(DATA,'D:\temp\a.txt') || die("File cannot be opened!\n");
seek(DATA, 3, 0);
my @satz = <DATA>;
my $alles = join('', @satz);
my $match;
if ( $alles =~ /^.*\d.*Zahl.*?\n/gsma ) {
    $match = ${^MATCH}; # I don't understand what is this ${^MATCH}
    # $match = $1; # doesn't work
    print "$match<<\n"; # prints only the match
    foreach my $satz (@satz) {
        # if ( $match =~ /$satz/ ) { # if activated prints nothing
        print "sentence is part of match: $satz\n"; # prints the entire text
        # }
    }
}
 
Rainer Weikusat

(e-mail address removed) writes:

[...]
use utf8; # allows UTF-8 files to be processed

This is only needed if your source code contains UTF-8.

binmode STDIN, ":utf8"; # input
binmode STDOUT, ":utf8"; # output

#undef $/; # is not required as <DATA> read into array and then joined

Except in 'short files' (as here), it is usually better to use

local $/;

instead. This creates a new binding for $/ while preserving the old
one which will be restored after the containing block.
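
A minimal sketch of that idiom. The helper name slurp_fh is made up for this example, and an in-memory filehandle stands in for the real file:

```perl
use strict;
use warnings;

# Hypothetical helper: read everything left on a filehandle into one string.
sub slurp_fh {
    my ($fh) = @_;
    local $/;               # undef only inside this sub; restored on return
    return scalar <$fh>;
}

# An in-memory filehandle stands in for a real file here.
my $data = "erste Zeile\nzweite Zeile\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my $alles = slurp_fh($fh);
close $fh;

print length($alles), " characters read in one go\n";
print "record separator is back to newline\n" if $/ eq "\n";
```

Because the change to $/ is confined to the sub, later line-by-line reads elsewhere in the script keep working as before.
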
open(DATA,'D:\temp\a.txt') || die("File cannot be opened!\n");

This "it didn't work" style of error reporting is a bit useless. The
message should also contain the system error code/message.

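
Something along these lines, as a sketch. The path is hypothetical and chosen so that the open is expected to fail; in a real script the usual form is simply open(...) or die "Cannot open '$file': $!\n":

```perl
use strict;
use warnings;

# Hypothetical path, chosen so that the open is expected to fail.
my $file = '/no/such/dir/a.txt';

my $message = '';
if (open my $fh, '<', $file) {
    close $fh;
}
else {
    # $! holds the system error for the failed call; name the file as well.
    $message = "Cannot open '$file': $!";
}
print "$message\n";
```

The exact wording of the system error depends on the platform and locale, but it tells you whether the file is missing, unreadable, and so on.
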
seek(DATA, 3, 0);
my @satz = <DATA>;
my $alles = join('', @satz);
my $match;
if ( $alles =~ /^.*\d.*Zahl.*?\n/gsma ) {
$match = ${^MATCH}; # I don't understand what is this ${^MATCH}

As 'perldoc perlvar' could have told you: The text which matched the
regex. At least for the perl version I'm using (5.10.1), the
documentation also says the /p match modifier is needed in order to
use this builtin variable.
# $match = $1; # doesn't work

Since the regex isn't capturing anything, that is to be expected.
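
A short sketch of the two ways to get at the matched text, using a shortened sample string (an assumption for illustration) and made-up variable names:

```perl
use strict;
use warnings;

# Shortened sample text, for illustration only.
my $alles = "mit 3 Saetzen\nkeine Zahl.\nkein Punkt\n";

# A pattern without parentheses can match, but it has no group 1, so
# nothing is captured into $1. With parentheses, the group's text is
# returned in list context (and is also available as $1).
my ($group) = $alles =~ /(\d.*Zahl)/s;

# With /p, ${^MATCH} holds the whole match; no group is needed.
my $whole;
$whole = ${^MATCH} if $alles =~ /\d.*Zahl/sp;

print "capture group: $group\n";
print "whole match:   $whole\n";
```

Both variables end up holding the same text here, from the "3" up to and including "Zahl".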
 
Charles DeRykus

Thanks to all of you for your efforts and ideas. Let me summarize the lessons I've learned in this discussion.
The task was: import a text, apply a regex which extends over a line break, and display/modify the lines matching the expression.
The original approach failed because the text was not read into one string, but split into lines in an array.
I then wanted to be able to print each individual line of the array and to use ^ and $ in line-based regular expressions.
I have tried all the suggested code. Not everything has worked. This is my current code, with which I manage to get the matched lines and the entire text:

use utf8; # allows UTF-8 files to be processed
binmode STDIN, ":utf8"; # input
binmode STDOUT, ":utf8"; # output

#undef $/; # is not required as <DATA> read into array and then joined
open(DATA,'D:\temp\a.txt') || die("File cannot be opened!\n");
seek(DATA, 3, 0);
my @satz = <DATA>;
my $alles = join('', @satz);
my $match;
if ( $alles =~ /^.*\d.*Zahl.*?\n/gsma ) {
$match = ${^MATCH}; # I don't understand what is this ${^MATCH}
# $match = $1; # doesn't work
print "$match<<\n"; # prints only the match
foreach my $satz (@satz) {
# if ( $match =~ /$satz/ ) { # if activated prints nothing
print "sentence is part of match: $satz\n"; # prints the entire text
# }
}
}

The ${^MATCH} is only valid with /p and was not needed. I'm not certain
it's at all relevant to what you're doing now either.

I think Ben's suggestion is the most promising if you want to identify
the sentences over which the match extends:

while (my ($match) =
$alles =~ /([^\n]* \d .* Zahl [^\n]*)/gsx
# or perhaps /(.* \d (?s:.)* Zahl .*)/gx
# or /(\N* \d .* Zahl \N*)/gsx if you've got 5.12
) {
for my $satz (split /\n/, $match) {
# make that /(?<=\n)/ if you don't want to chomp
print "sentence is part of match: $satz\n\n";
}
}
 