Known issues with Perl under Cygwin?

weston

Are there any known issues with Perl under Cygwin? I've just
encountered a circumstance where a script (in particular, some regular
expressions within it) is behaving oddly -- but only under Cygwin.

The script essentially reads an entire HTML file into a string, and
then attempts to bring any <span> tags that are broken across multiple
lines onto the same line, using this regular expression:

s/(<span.*)\n+([^>]*>)/$1\ $2/mig;

This appears to work under ActiveState Perl 5.8.4 and under Perl 5.8.6 on
OpenBSD. When I operate on this text:

<span lang=JA style='font-family:
&quot;MS Mincho&quot;'>(</span>

I get this result:

<span lang=JA style='font-family: &quot;MS
Mincho&quot;'>(</span>

Under Cygwin (Perl 5.8.5), however, I get this result:

&quot;MS Mincho&quot;'>(</span>

i.e., it appears to be throwing away the first backreference.

So, I decided to see what the backreferences looked like, with the
following code:

print "\n 0: [$0]";
print "\n 1: [$1]";
print "\n 2: [$2]";
print "\n 3: [$3]";
print "\n ";
print "\n ".$1.$2;
print "\n";
print "\n".$_;
print "\nEnd";

Under Cygwin, this yields some interesting output:


0: [./stripWord.pl]
]1: [<span lang=JA style='font-family:
2: [&quot;MS Mincho&quot;'>]
3: []

&quot;MS Mincho&quot;'>ont-family:

&quot;MS Mincho&quot;'>(</span>
End

Note that I have not made a mistake in placing the ] for backreference
1 at the beginning of the line -- that's how the output was given.

Note also that the concatenation of $1 and $2 is not correct.

Under the other platforms, the output is as expected:

0: [D:\HTMLCleaners\stripWord.pl]
1: [<span lang=JA style='font-family:]
2: [&quot;MS Mincho&quot;'>]
3: []

<span lang=JA style='font-family:&quot;MS Mincho&quot;'>

<span lang=JA style='font-family: &quot;MS
Mincho&quot;'>(</span>
End


Has anyone ever seen anything like this?


I'm not sure how to get Cygwin version information, but from the
prompt, 'uname -a' yields:

CYGWIN_NT-5.1 Hermes 1.5.12(0.116/4/2) 2004-11-10 08:34 i686 unknown
unknown Cygwin

And the entire script I've got is available at:

http://weston.canncentral.org/misc/procword/regexProblem.txt

And the sample html is also available at:

http://weston.canncentral.org/misc/procword/example.txt
 
William James

It looks as though you're feeding the Cygwin Perl a file
that contains not just linefeeds but carriage returns.
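
If the file does carry Windows line endings, one option is to translate them while reading. The following is only a minimal sketch: the helper name slurp_crlf is made up for illustration, and it assumes a Perl built with PerlIO layers (5.8 or later), where the ":crlf" layer converts "\r\n" to "\n" on input.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): slurp a whole file,
# translating CRLF pairs to bare LF as the data is read.
sub slurp_crlf {
    my ($path) = @_;
    open(my $fh, '<:crlf', $path) or die "Cannot open $path: $!";
    local $/;              # slurp mode: read everything in one go
    my $text = <$fh>;
    close($fh);
    return $text;
}
```

With this in place, the rest of the script would see only "\n" line endings, whichever platform produced the file.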
 
Joe Smith

weston said:
&quot;MS Mincho&quot;'>ont-family:

Has anyone ever seen anything like this?

Yep, that's what happens for
print "<span lang=JA style='font-family:\r&quot;MS Mincho&quot;".
The first backreference is not being thrown away; it is just
being overwritten when shown. Pipe the output to 'od -c' to see.
-Joe
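
For anyone without 'od' at hand, the same check can be sketched in Perl itself by printing the control characters as visible escapes. The helper name show_ctrl is hypothetical, and the sample string is just the first capture from the post with a trailing "\r" assumed.

```perl
use strict;
use warnings;

# Hypothetical helper: render carriage returns and linefeeds as the
# two-character sequences \r and \n so a terminal cannot hide them.
sub show_ctrl {
    my ($s) = @_;
    $s =~ s/\r/\\r/g;
    $s =~ s/\n/\\n/g;
    return $s;
}

# Assumed content of $1 under Cygwin: the captured text plus a stray CR.
my $capture = "<span lang=JA style='font-family:\r";
print "1: [", show_ctrl($capture), "]\n";
# prints: 1: [<span lang=JA style='font-family:\r]
```

Printed this way, the closing bracket stays at the end of the line instead of jumping back to column one.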
 
Dave Weaver

weston said:
The script essentially reads an entire HTML file into a string, and
then attempts to bring any <span> tags that are broken across multiple
lines onto the same line, using this regular expression:

s/(<span.*)\n+([^>]*>)/$1\ $2/mig;
Under Cygwin (Perl 5.8.5), however, I get this result:

&quot;MS Mincho&quot;'>(</span>

As others have pointed out, this is a line-ending problem.
In Windows, the end-of-line is marked by "\r\n", in *nix systems it's
usually just "\n". It looks like Cygwin is using the *nix definition,
but your data file has the Windows line-end sequence, so when your
regex removes the "\n", you're left with a "\r" at the end of the
line.

The simplest solution is to change your regexp to:

s/(<span.*)[\r\n]+([^>]*>)/$1\ $2/mig;

(untested)

Then it won't matter what combination of "\r" and "\n" occur.
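
One caveat with the character-class version: "." matches "\r" (it only excludes "\n"), so the greedy ".*" can still pull the carriage return into $1 and leave it in the output. A possible alternative, sketched here with a hypothetical helper name, is to normalize the line endings first and then apply the original substitution unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: normalize line endings, then join <span> tags
# that are broken across lines, using the substitution from the question.
sub join_broken_spans {
    my ($html) = @_;
    $html =~ s/\r\n?/\n/g;                        # CRLF or lone CR -> LF
    $html =~ s/(<span.*)\n+([^>]*>)/$1 $2/mig;    # original regex
    return $html;
}

my $sample = "<span lang=JA style='font-family:\r\n&quot;MS Mincho&quot;'>(</span>\r\n";
print join_broken_spans($sample);
```

Normalizing up front also means any later patterns in the script that look for "\n" do not each need a "\r" special case.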
 
