mitchell_laks
Dear Perl Gurus!
I wrote the following script with the intention to
slurp in a UTF-8 encoded Hebrew text file,
then search for a regexp
(I hard-coded an example word here),
and then select the text
"from the beginning of the line just before the start of the regexp
until the end of the line after the end of the regexp".
I did this by searching for the "\n" newline character in the full
file, held as a single string. Hey! Memory is cheap.
Now this script works on individual files. However, when I incorporate
it into a bigger script that uses the File::Find module, it begins to
go haywire for "random" file inputs. When I ran the bigger program
through the Perl debugger using ddd, I found that the problem seems to be
the lines
$index1=rindex(substr($singleline_file_contents,0,$length_a),"\n");
$index2=index($singleline_file_contents,"\n",$length_b);
which give me bad indexes: they seem to find more newlines than I
see in the file.
So I am stuck. I wonder if the newline is my problem.
How should I represent "newline" in this script so that I can search
for it in a string with many newlines?
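One way to check the newline question is to count the line-feed and carriage-return characters in the slurped string and compare the result with what an editor shows. The sketch below is only an illustration: the helper name newline_counts and the sample string are assumptions, not part of the script that follows.

```perl
use strict;
use warnings;

# Hypothetical helper: count "\n" (LF) and "\r" (CR) characters in a string.
sub newline_counts {
    my ($text) = @_;
    # tr with an empty replacement list counts characters without changing $text.
    my $lf = ($text =~ tr/\n//);
    my $cr = ($text =~ tr/\r//);
    return ($lf, $cr);
}

# Made-up sample: the first line ends in CR LF, the second in a bare LF.
my ($lf, $cr) = newline_counts("one\r\ntwo\nthree");
print "LF: $lf, CR: $cr\n";
```

A non-zero CR count would suggest DOS-style line endings, in which case each line keeps a trailing "\r" in front of its "\n".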
So without further ado here is my beastly script! Thank you all!
use utf8;
BEGIN { $ENV{LC_ALL} = "he_IL"; }
use open ':utf8';
open (IN,"/home/mlaks/hebrew/data/BVL/BAH-DHYH.LL2.utf8");
my $find_text="ירושלים";
my @multiline_file_contents = <IN>;
my $length_a=0;
my $length_b=0;
my $index1;
my $index2;
my $singleline_file_contents =join("",@multiline_file_contents);
my $temp_file_contents =$singleline_file_contents;
while ( $temp_file_contents =~ /(.*?)($find_text)/gos) {
    $length_a = length($1) + $length_b ;
    $length_b = length($2) + $length_a ;
    $index1=rindex(substr($singleline_file_contents,0,$length_a),"\n");
    $index2=index($singleline_file_contents,"\n",$length_b);
    # this will get the lines of the regexp match and include a single
    # newline on each end
    my $blug = substr( $singleline_file_contents, $index1,$index2-$index1+1) ;
    print "$blug \n";
}
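For comparison, here is a minimal sketch of the same idea that takes the match offsets from Perl's built-in @- and @+ arrays instead of adding up lengths by hand, and that handles the cases where rindex or index returns -1 (a match on the first or last line). The helper name matching_lines and the sample text are assumptions made for illustration, and unlike the script above the sketch returns each line without its surrounding newlines.

```perl
use strict;
use warnings;
use utf8;    # this source file contains Hebrew string literals

# Hypothetical helper: return the full line around every occurrence of $word.
sub matching_lines {
    my ($text, $word) = @_;
    my @lines;
    while ($text =~ /\Q$word\E/g) {
        # $-[0] and $+[0] are the character offsets where the match starts and ends.
        my ($start, $end) = ($-[0], $+[0]);
        # rindex returns -1 when no newline precedes the match, so +1 gives offset 0.
        my $from = $start > 0 ? rindex($text, "\n", $start - 1) + 1 : 0;
        # index returns -1 when no newline follows the match: use the end of the text.
        my $to = index($text, "\n", $end);
        $to = length($text) if $to < 0;
        push @lines, substr($text, $from, $to - $from);
    }
    return @lines;
}

binmode STDOUT, ':encoding(UTF-8)';
# Made-up three-line sample; only the middle line contains the word.
my $sample = "שלום עולם\nירושלים עיר הקודש\nסוף";
print "$_\n" for matching_lines($sample, "ירושלים");
```

Because all the variables are lexical to the helper, nothing carries over from one file to the next when it is called repeatedly, for example from a File::Find callback. A line that contains the word twice is returned twice, which matches the behaviour of the loop above.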