unicode (hebrew) regexp search for new line headaches

mitchell_laks

Dear Perl Gurus!

I wrote the following script with the intention to

slurp in a UTF-8 encoded Hebrew text file,
then search for a regexp
(I hard-coded an example word here),
and then select everything
"from the beginning of the line just before the start of the regexp match
until the end of the line after the end of the regexp match".

I did this by searching for the "\n" newline character in the full
file, held as a single string. Hey! memory is cheap :)

Now this script works on individual files. However, when I incorporate
it into a bigger script that uses the File::Find module, it begins to
go haywire for "random" file inputs. When I ran the bigger program
through the Perl debugger using ddd, I found that the problem seems to be
the lines

$index1=rindex(substr($singleline_file_contents,0,$length_a),"\n");
$index2=index($singleline_file_contents,"\n",$length_b);

which give me bad indexes: they seem to find more newlines than I
see in the file.

So I am stuck. I wonder if newline is my problem.

How shall I represent "newline" in this script so that I can search
for it in a string with many newlines?

So without further ado here is my beastly script! Thank you all!

use utf8;
BEGIN { $ENV{LC_ALL} = "he_IL"; }
use open ':utf8';
open (IN,"/home/mlaks/hebrew/data/BVL/BAH-DHYH.LL2.utf8");

my $find_text="ירושלים";

my @multiline_file_contents = <IN>;

my $length_a=0;
my $length_b=0;
my $index1;
my $index2;


my $singleline_file_contents =join("",@multiline_file_contents);

my $temp_file_contents =$singleline_file_contents;

while ( $temp_file_contents =~ /(.*?)($find_text)/gos) {
$length_a = length($1) + $length_b ;
$length_b = length($2) + $length_a ;



$index1=rindex(substr($singleline_file_contents,0,$length_a),"\n");
$index2=index($singleline_file_contents,"\n",$length_b);


# this will get the lines of the regexp match and include a single new line on each end

my $blug = substr( $singleline_file_contents,
$index1,$index2-$index1+1) ;

print "$blug \n";

}
 
Dr.Ruud

mitchell_laks wrote:
Now this script works on individual files. However when i incorporate
it into a bigger script that uses the File::Find module, it begins to
go haywire for "random" file inputs.

Without File::Find, it doesn't go haywire on the same individual files?

WinDOS or a Unix-variant?


Why do you create a separate $temp_file_contents?
while ( $temp_file_contents =~ /(.*?)($find_text)/gos) {

Since you are already using a regex, why not

while ( $singleline_file_contents =~ /((?=\n).*$find_text.*\n)/g) {
my $blug = $1;

(so without the /os modifiers)


On WinDOS you could try

binmode IN, ':encoding(utf8)' or die "... $!";
or
binmode IN, ':utf8' or die "... $!";

after the open(), instead of the "use open ':utf8';" before it.
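
To make the layer suggestions above concrete, here is a minimal sketch (not from the thread) that writes a small UTF-8 test file and reads it back twice: once through an :encoding(UTF-8) layer and once as raw bytes. The temporary file, the slurp() helper and the sample word are assumptions for illustration, and the byte count assumes Unix line endings.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample: the Hebrew word for Jerusalem, written with escapes
# so the script does not depend on the encoding of its own source file.
my $word = "\x{05D9}\x{05E8}\x{05D5}\x{05E9}\x{05DC}\x{05D9}\x{05DD}";

# Write the word plus a newline to a temporary file as UTF-8.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':encoding(UTF-8)' or die "binmode: $!";
print {$tmp_fh} "$word\n";
close $tmp_fh or die "close: $!";

# Illustrative helper: slurp a whole file through the given PerlIO layer.
sub slurp {
    my ($file, $layer) = @_;
    open my $fh, "<$layer", $file or die "open $file: $!";
    local $/;                      # undef record separator: read everything
    my $content = <$fh>;
    close $fh;
    return $content;
}

my $decoded = slurp($path, ':encoding(UTF-8)');   # string of characters
my $raw     = slurp($path, ':raw');               # string of bytes

binmode STDOUT, ':encoding(UTF-8)';   # avoids "Wide character in print"
print "characters: ", length($decoded), "\n";
print "bytes:      ", length($raw), "\n";
print "word found at character offset ", index($decoded, $word), "\n";
```

With the decoding layer, length(), index() and rindex() count characters; on the raw string they count bytes, so offsets computed on one kind of string cannot be reused on the other.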
 
mitchell_laks

Dear Ruud,

I am running debian linux sid.

You wrote:

Since you are already using a regex, why not
while ( $singleline_file_contents =~ /((?=\n).*$find_text.*\n)/g) {
my $blug = $1;


I looked for regexp solutions; however, note that your suggestion grabs
a whole section of the file (though not the whole thing).

I ran your code. Notice your regexp is much too greedy!

./try5.pl|wc
Wide character in print at ./try5.pl line 26, <IN> line 867.
594 6801 61932
while
wc /home/mlaks/hebrew/data/BVL/BAH-DHYH.LL2.utf8
867 10014 91483 /home/mlaks/hebrew/data/BVL/BAH-DHYH.LL2.utf8

Thus your code grabs about 3/4 of the file, not the individual lines I
need!

The code which I presented successfully selects 2 individual lines in
case the regexp lives on one line, and only selects the lines that the
regexp "encompasses".

On the other hand, it seems to go haywire when run as part of a loop,
which I have never seen before. The problem seems to be in the code

$index1=rindex(substr($singleline_file_contents,0,$length_a),"\n");
$index2=index($singleline_file_contents,"\n",$length_b);

as this is the only thing that goes wrong when running in the loop. My
only thought is that "\n" is not valid when I use Unicode strings. Is
this true or false?

Thanks for your thoughts
Mitchell
 
Anno Siegel

mitchell_laks said:
Dear Perl Gurus!

I wrote the following script with the intention to

slurp in a utf8 encoded hebrew text file
then search for a regexp
( i hard coded here an example word),
and then selects the
"from the beginning of the line just before the start of the regexp
until the end of the line after the end of the regexp".

i did this by searching for the "\n" new line character in the full
file, encoded as a single string. Hey! memory is cheap :)

Is your regex itself a multiline affair? Your use of "word" above indicates
otherwise.

For a single-line regex your approach is unnecessarily roundabout. Read
the file(s) line-wise and select the lines that contain a match.

If you must use the described method, there are better ways of finding
the position of a match than getting the length of the pre-match,
and also better ways of restricting index() and rindex() than having it
work on substrings. That makes your code inefficient, but not wrong.

Before discussing it further, I'd like to be sure it is really necessary.

Now this script works on individual files. However when i incorporate
it into a bigger script that uses the File::Find module, it begins to
go haywire for "random" file inputs. When I ran the bigger program
through the perl debugger using ddd, I find the problem seems to be
the lines

$index1=rindex(substr($singleline_file_contents,0,$length_a),"\n");
$index2=index($singleline_file_contents,"\n",$length_b);

which gives me bad indexes - they seem to find more new lines than I
see in the file.

Are you sure you're opening those files in UTF8 mode? Spurious newlines
might well appear if a multi-byte file is opened in single byte mode.

so i am stuck. I wonder if new line is my problem.

How shall I represent "new line" in this script so that I can search
for it in a string with many new lines...

use utf8;
BEGIN { $ENV{LC_ALL} = "he_IL"; }
use open ':utf8';
open (IN,"/home/mlaks/hebrew/data/BVL/BAH-DHYH.LL2.utf8");

my $find_text="ירושלי×";

That didn't survive Usenet propagation very well. I can make out a shin
and a resh among lots of garbage (but no newline).

[rest of code snipped until further clarification]

Anno
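
The "better ways" mentioned above are not spelled out in the thread. One possible reading, sketched here under that assumption, is to take the match offsets from the built-in @- and @+ arrays and to pass a position argument to rindex() and index() instead of searching a substr() copy. The function name lines_around_matches and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Return the complete line(s) spanned by each match of $pat in $text.
# $-[0] and $+[0] hold the start and end offsets of the last match.
sub lines_around_matches {
    my ($text, $pat) = @_;
    my @found;
    while ($text =~ /$pat/g) {
        my ($start, $end) = ($-[0], $+[0]);

        # Last newline strictly before the match; rindex returns -1
        # when there is none, so adding 1 yields offset 0.
        my $from = $start > 0 ? rindex($text, "\n", $start - 1) + 1 : 0;

        # First newline at or after the end of the match.
        my $to = index($text, "\n", $end);
        $to = length $text if $to < 0;

        push @found, substr($text, $from, $to - $from);
    }
    return @found;
}

my $sample = "alpha one\nbeta two\ngamma three\n";
print "[$_]\n" for lines_around_matches($sample, qr/two/);
print "[$_]\n" for lines_around_matches($sample, qr/two\ngamma/);
```

Because the offsets come from the match itself, nothing has to be carried over between loop iterations or between files.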
 
mitchell_laks

Is your regex itself a multiline affair? Your use of "word" above indicates
otherwise.

Aha! I want to use multiline regexps of course, otherwise i would
simple slurp in and search line by line! I demonstrated via a simple 1
word example searching through the freely downloadable text of the
talmud bavli/yerushalmi/rambam/tanach/onquelos for the word Jerusalem.
However i want to allow users to look for general regexps in the GUI
GTK2-Perl app i will be distributing freely via sourceforge. This will
be a free "talmud grep" with gtk2 perl gui (cool ha!).

I have a previously working crappy version that breaks up the files
into 10 (or n) line groups searches n lines at a time - however here i
want to search whole files with regexps.

I wonder if the index and rindex functions are broken. I dont
understand why the program as written and submitted works with an
individual file while the program messes up as it loops over the (1-2
hundred files) of the full set of files.

I run it through ddd using the inferior perl debugger and the only
variable that is getting messed up is the index1 an index2 and i
wonder if the "\n" is the wrong way to refer to the new line in a
unicode index expression....

I can send you all the code and the files if you want to play with it.
its about 40mb as a tar. it works if you have a gtk2-perl install on
debian or redhat too i bet.

mitchell

For a single-line regex your approach is unnecessarily roundabout. Read
the file(s) line-wise and select the lines that contain a match.

If you must use the described method, there are better ways of finding
the position of a match than getting the length of the pre-match,
and also better ways of restricting index() and rindex() than having it
work on substrings. That makes your code inefficient, but not wrong.

Before discussing it further, I'd like to be sure it is really necessary.


Are you sure you're opening those files in UTF8 mode? Spurious newlines
might well appear if a multi-byte file is opened in single byte mode.

Of course. Look at the code itself: I tell it to open in utf8 mode,
and it does.
 
Dr.Ruud

mitchell_laks:
Ruud:


i looked for regexp solutions, however note that your suggestion grabs
a whole section of the file (while not the whole thing).

Maybe because my '(?=\n)' should have been '(?<=\n)'.

Some test-code to show offsets and lengths:

#!/usr/bin/perl

use strict;
use warnings;

binmode STDIN, ':utf8';

my $find_text = "\x{0640}"x3; # change to whatever

local ($,, $/) = ("\t");
my $slurp = <STDIN>;
print length $slurp, "\n\n";

while ($slurp =~ /((?<=\n).*${find_text}.*\n)/g) {
print $-[0], length $1, "\n";
}
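
The difference between the two assertions can be checked on a small made-up string. This sketch assumes no /s modifier (as suggested earlier in the thread) and a search word without newlines; the captures() helper and the sample text are illustrative only. Without /s, "." cannot cross a newline, so the lookahead form would have to find the word starting at the newline itself and finds nothing here. The lookbehind form finds matching lines except on the very first line of the string, which has no preceding newline. A ^ anchor with /m covers that first line as well.

```perl
use strict;
use warnings;

# Collect capture group 1 from every match of $re in $text.
sub captures {
    my ($text, $re) = @_;
    my @found;
    while ($text =~ /$re/g) {
        push @found, $1;
    }
    return @found;
}

my $find_text = 'two';   # hypothetical search word
my $text      = "one two\nline two here\nnothing\n";

my @ahead  = captures($text, qr/((?=\n).*$find_text.*\n)/);    # lookahead
my @behind = captures($text, qr/((?<=\n).*$find_text.*\n)/);   # lookbehind
my @anchor = captures($text, qr/(^.*$find_text.*\n)/m);        # ^ with /m

printf "lookahead:  %d match(es)\n", scalar @ahead;
printf "lookbehind: %d match(es)\n", scalar @behind;
printf "^ with /m:  %d match(es)\n", scalar @anchor;
```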
 
Anno Siegel

mitchell_laks said:
Aha! I want to use multiline regexps of course, otherwise i would
simple slurp in and search line by line!

Meaning you will *not* have to slurp the file.

I demonstrated via a simple 1
word example searching through the freely downloadable text of the
talmud bavli/yerushalmi/rambam/tanach/onquelos for the word Jerusalem.
However i want to allow users to look for general regexps in the GUI
GTK2-Perl app i will be distributing freely via sourceforge. This will
be a free "talmud grep" with gtk2 perl gui (cool ha!).

I have a previously working crappy version that breaks up the files
into 10 (or n) line groups searches n lines at a time - however here i
want to search whole files with regexps.

You could refine that: First determine how many lines the regex requires,
then make your groups as big as that.

I wonder if the index and rindex functions are broken. I dont
understand why the program as written and submitted works with an
individual file while the program messes up as it loops over the (1-2
hundred files) of the full set of files.

Since you haven't shown code (and data) that demonstrate your problem
we can't help you understand. A blatant bug in index/rindex seems unlikely.

I run it through ddd using the inferior perl debugger and the only
                               ^^^^^^^^
It's open software. Feel free to improve it.

variable that is getting messed up is the index1 an index2 and i
wonder if the "\n" is the wrong way to refer to the new line in a
unicode index expression....

I can send you all the code and the files if you want to play with it.
its about 40mb as a tar. it works if you have a gtk2-perl install on
debian or redhat too i bet.

Thanks, but no thanks.

Here is a better way to match a pattern and extend the match to complete
lines in a multi-line string. A regular expression can do it just fine,
no need to fiddle with index/rindex after the match. Just capture as
many non-linefeeds as possible before or after the pattern:

my $pat = qr/It\nwasn't/;

my $text = do { local $/; <DATA> };
while ( $text =~ /([^\n]*$pat[^\n]*)/g ) {
print "$1\n";
}

__DATA__
I said nothing.
She smiled -- with some satisfaction, I thought. "It
wasn't exactly a best-seller."
"Is she from New England originally?"
"No, originally she's from here -- California. We both
grew up in Fresno.

Anno
 
mitchell_laks

Dear Anno and Dr. Ruud,

Thank you for your thoughtful comments.

Thanks to both of you for your help.

Here is a better way to match a pattern and extend the match to complete
lines in a multi-line string. A regular expression can do it just fine,
no need to fiddle with index/rindex after the match. Just capture as
many non-linefeeds as possible before or after the pattern:

my $pat = qr/It\nwasn't/;

my $text = do { local $/; <DATA> };
while ( $text =~ /([^\n]*$pat[^\n]*)/g ) {
print "$1\n";
}

I discovered that I had many problems. Thanks for your help.

1) My original data came from a DOS machine and I had to clean out a
bunch of control-M characters that were in the files (from \r\n as the
line ending). This may have been messing up my line-end matches.

2) Anno, I agree, and I got rid of my "cargo cult" code. I now slurp in
the full file as you recommend and use
the regexp method exclusively. I slurp your way and then I use:

while ($text =~ /^(.*?$find_text.*?)$/gom) {

in order to match and include "the whole line, beginning at the
beginning of the line where my regexp starts to match and ending at the
end of the line where my regexp stops matching".

What if I want to match starting 1 (or say 2, or n) lines before
the line of the first match, and similarly include n other "context"
lines after it, as in GNU grep, where the -C n option
allows n lines of "matching context"?

Thank you all for your help!
Mitchell
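
The control-M cleanup in point 1 is not shown in the thread. A minimal sketch of one way to do it, assuming the text has already been slurped into a string, could look like this; normalize_newlines and the sample strings are made-up names for illustration.

```perl
use strict;
use warnings;

# Convert DOS line endings (\r\n) to Unix ones (\n).
# A lone \r that is not followed by \n is left untouched.
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\r\n/\n/g;
    return $text;
}

my $dos  = "first line\r\nsecond line\r\n";
my $unix = normalize_newlines($dos);

# With /m, $ matches just before "\n", so an unstripped "\r" ends up
# inside whatever "the rest of the line" captures.
my ($dos_tail)  = $dos  =~ /^first(.*)$/m;
my ($unix_tail) = $unix =~ /^first(.*)$/m;

printf "DOS tail length:  %d\n", length $dos_tail;
printf "Unix tail length: %d\n", length $unix_tail;
```

The extra character in the DOS case is the carriage return, which "." matches like any other non-newline character.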
 
Anno Siegel

[...]
2. Anno I agree and got rid of my "cargo cult" code and now slurp in
the full file as your recommend and use
the regexp method exclusively. I slurp your way and then I use:

while($text=~/^(.*?$find_text.*?)$/gom) {

in order to match and include "the whole line beginning at the
beginning of the line where my regexp starts to match and ending at end
of the line where my regexp stops matching.

This approach has a problem, as I noted after posting. If there is
more than one match (of $find_text) on a single line, the full regex
will show (and count, if that's done) only one match. The other matches
are still shown, since they sit on the same line as the one that actually
matched, but if matches are highlighted in some way, some will never be
highlighted.

If that's a problem, go back to matching the string directly and
extend the match later. It is still easier to do with a regex
than using index and rindex. Example:

use Term::ANSIColor qw( colored );

my $text = do { local $/; <DATA> };

my $pat = qr/originally/;
while ( $text =~ /$pat/g ) {
my ( $pre, $match, $post) = $text =~ /([^\n]*)($pat)\G([^\n]*)/;
print $1, colored( $2, 'bold'), $3, "\n";
}

__DATA__
I said nothing.
She smiled -- with some satisfaction, I thought. "It
wasn't exactly a best-seller."
"Is she from New England originally?" "No, originally she's
from here -- California. We both grew up in Fresno.

This prints the line with two "originally" twice, each time highlighting
one copy. The match inside the loop is forced by \G to pick up the
same position of $pat as the original match.

what if I want to match "starting at 1 (or say 2 or n) lines before
that line of the first match and similarly include n "context" other
lines after it - as in gnu grep where we have the -C n option which
allows n lines of "matching context".

You can modify the method. For instance, making the inner match

my ( $pre, $match, $post) =
$text =~ /([^\n]*\n[^\n]*)($pat)\G([^\n]*)/;

would show one line extra before the actual match. I'll leave it to
you to extend the method to the trailing part, and for arbitrary numbers
of context lines.

Anno
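
The extension to arbitrary numbers of context lines is left open above. The sketch below is one possible way to get grep -C style output; note that it swaps in plain offset arithmetic (@-, @+ and newline counting) rather than the \G technique from the post, and that grep_with_context and the sample text are hypothetical names chosen for illustration.

```perl
use strict;
use warnings;

# Return one block of text per match of $pat: the matched line(s)
# plus up to $n lines of context before and after.
sub grep_with_context {
    my ($text, $pat, $n) = @_;

    my @lines = split /\n/, $text, -1;
    pop @lines if @lines && $lines[-1] eq '';   # empty field after a final "\n"

    my @blocks;
    while ($text =~ /$pat/g) {
        my ($start, $end) = ($-[0], $+[0]);

        # A line index is the number of newlines before that offset.
        my $before = substr($text, 0, $start);
        my $inside = substr($text, $start, $end - $start);
        my $first  = ($before =~ tr/\n//);
        my $last   = $first + ($inside =~ tr/\n//);

        # Widen by $n lines and clamp to the available lines.
        my $lo = $first - $n;
        $lo = 0 if $lo < 0;
        my $hi = $last + $n;
        $hi = $#lines if $hi > $#lines;

        push @blocks, join("\n", @lines[$lo .. $hi]);
    }
    return @blocks;
}

my $sample = "l1\nl2\nl3 hit\nl4\nl5\n";
print "$_\n--\n" for grep_with_context($sample, qr/hit/, 1);
```

Counting newlines from the start of the string for every match is simple but not fast; for large files with many matches, a precomputed list of line-start offsets would avoid the repeated scans.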
 
mitchell_laks

Dear Anno,

Thanks again for your help. I got it working, now I need to think about
it more...
A few points:

Anno Siegel wrote earlier:

my $pat = qr/originally/;
while ( $text =~ /$pat/g ) {
my ( $pre, $match, $post) = $text =~ /([^\n]*)($pat)\G([^\n]*)/;
print $1, colored( $2, 'bold'), $3, "\n";

I think you mean
my $pat = qr/originally/;
while ( $text =~ /$pat/gs ) {
my ( $pre, $match, $post) = $text =~
/([^\n]*)($pat)\G([^\n]*)/s;
print $1, colored( $2, 'bold'), $3, "\n";

in order to match across lines with the regexp. This works for me.
Thank you very much.

Moreover, I think you misunderstood an earlier comment I made.

When I talked about using ddd and the "inferior perl debugger", the
"inferior" modifies "debugger", not "perl debugger". I meant no
slur on the wonderful, powerful perl debugger...

In the lingo of ddd (the graphical debugger front end), the debugger
(whether gdb for C or C++, or perl -d for Perl) is called the
"inferior" debugger. I don't think they meant "inferior to the
'superior' front end that we wrote". :)

Perhaps the ddd people should have been more respectful and said
"foundational debugger", not "inferior debugger". :)

Mitchell
 
Anno Siegel

mitchell_laks said:
Dear Anno,

Thanks again for your help. I got it working, now I need to think about
it more...
A few points:

Anno Siegel wrote earlier:

my $pat = qr/originally/;
while ( $text =~ /$pat/g ) {
my ( $pre, $match, $post) = $text =~ /([^\n]*)($pat)\G([^\n]*)/;
print $1, colored( $2, 'bold'), $3, "\n";

I think you mean
my $pat = qr/originally/;
while ( $text =~ /$pat/gs ) {
my ( $pre, $match, $post) = $text =~
/([^\n]*)($pat)\G([^\n]*)/s;
print $1, colored( $2, 'bold'), $3, "\n";

in order to match across lines with the regexp. This works for me.
Thank you very much.

Oh, I forgot the essential /s, didn't I. Sorry for that.

Moreover, I think you misunderstood an earlier comment I made.

When i talked about using ddd and the "inferior perl debugger", the
"inferior" modifies the debugger, not the "perl debugger". I meant no
slur on the wonderful powerful perl debugger...

Yes, I misunderstood. I never use the debugger and am unacquainted
with its terminology.

We do get this kind of critique-in-a-subordinate-clause occasionally
and I find it annoying. Apologies.

Anno
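
As a cross-check on the /s discussion above: perlre describes /s as changing only what "." matches. The sketch below uses a made-up two-line string to show which of the constructs used in this thread depend on it. A literal \n and a negated character class match a newline with or without /s, and a qr// object keeps the modifiers it was compiled with when it is interpolated into another pattern.

```perl
use strict;
use warnings;

my $text = "a\nb";
my $pat  = qr/a.b/;    # compiled without /s

# Each entry: [ description, 1 if the match succeeds, 0 otherwise ]
my @checks = (
    [ 'dot, no /s',                  $text =~ /a.b/     ? 1 : 0 ],
    [ 'dot, with /s',                $text =~ /a.b/s    ? 1 : 0 ],
    [ 'literal \n, no /s',           $text =~ /a\nb/    ? 1 : 0 ],
    [ 'negated class, no /s',        $text =~ /a[^x]b/  ? 1 : 0 ],
    [ 'qr/a.b/ inside a /s pattern', $text =~ /$pat/s   ? 1 : 0 ],
);

printf "%-30s %s\n", $_->[0], $_->[1] ? 'matches' : 'no match' for @checks;
```

So a pattern built only from [^\n] classes and an interpolated qr// object behaves the same with or without /s; the modifier matters as soon as a bare "." has to cross a line boundary.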
 
