Known issues with Perl under Cygwin?

weston · Aug 27, 2005

Are there any known issues with Perl under Cygwin? I've just
encountered a circumstance where a script (in particular, some regular
expressions within it) is behaving oddly -- but only under Cygwin.

The script essentially reads an entire HTML file into a string, and
then attempts to bring any tags that are broken across multiple
lines onto the same line, using this regular expression:

s/(<span.*)\n+([^>]*>)/$1\ $2/mig;

This appears to work under ActiveState Perl 5.8.4 and under Perl 5.8.6 on
OpenBSD. When I operate on this text:

（

I get this result:

（

Under Cygwin (Perl 5.8.5), however, I get this result:

"MS Mincho"'>（

i.e., it appears to be throwing away the first backreference.

So, I decided to see what the backreferences looked like, with the
following code:

print "\n 0: [$0]";
print "\n 1: [$1]";
print "\n 2: [$2]";
print "\n 3: [$3]";
print "\n ";
print "\n ".$1.$2;
print "\n";
print "\n".$_;
print "\nEnd";

Under Cygwin, this yields some interesting output:

0: [./stripWord.pl]
]1: []
3: []

"MS Mincho"'>ont-family:

"MS Mincho"'>（
End

Note that I have not made a mistake in placing the ] for backreference
1 at the beginning of the line -- that's how the output was given.

Note also that the concatenation of $1 and $2 is not correct.

Under the other platforms, the output is as expected:

0: [D:\HTMLCleaners\stripWord.pl]
1: []
3: []



（
End

Has anyone ever seen anything like this?

I'm not sure how to get Cygwin version information, but from the
prompt, 'uname -a' yields:

CYGWIN_NT-5.1 Hermes 1.5.12(0.116/4/2) 2004-11-10 08:34 i686 unknown
unknown Cygwin

And the entire script I've got is available at:

http://weston.canncentral.org/misc/procword/regexProblem.txt

And the sample html is also available at:

http://weston.canncentral.org/misc/procword/example.txt

William James · Aug 27, 2005

It looks as though you're feeding the Cygwin Perl a file
that contains not just linefeeds but carriage returns.
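
One way to check this is to count the line-ending styles in the slurped text. This is only a minimal sketch: the sub name `count_line_endings` and the sample string are hypothetical and do not come from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: count CRLF pairs and bare LF characters in a string.
sub count_line_endings {
    my ($text) = @_;
    my $crlf = () = $text =~ /\r\n/g;        # Windows-style endings
    my $lf   = () = $text =~ /(?<!\r)\n/g;   # LF not preceded by CR
    return ($crlf, $lf);
}

# Made-up sample: one CRLF ending and one bare LF ending.
my ($crlf, $lf) = count_line_endings("one\r\ntwo\nthree");
print "CRLF: $crlf, bare LF: $lf\n";
```

A non-zero CRLF count on a file read under Cygwin would suggest the carriage returns are reaching the regex.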

Joe Smith · Aug 28, 2005

weston said:
"MS Mincho"'>ont-family:

Has anyone ever seen anything like this?

Yep, that's what happens for
print "<span lang=JA style='font-family:\r\"MS Mincho\"";
The first backreference is not being thrown away; it is just
being overwritten when shown. Pipe the output to 'od -c' to see.
-Joe
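
Without `od` at hand, roughly the same check can be done in Perl by making the control characters visible before printing. This is a sketch only: `visible` is a hypothetical helper name, and the captured string is a made-up stand-in for a backreference that ends in a carriage return.

```perl
use strict;
use warnings;

# Hypothetical helper: replace CR and LF with the printable markers \r and \n.
sub visible {
    my ($s) = @_;
    $s =~ s/\r/\\r/g;
    $s =~ s/\n/\\n/g;
    return $s;
}

# Made-up stand-in for a capture that ends in a carriage return.
my $captured = "font-family:\r";
print "[", visible($captured), "]\n";   # prints [font-family:\r]
```

Printed this way, the closing bracket stays at the end of the line instead of jumping back to column one.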

Dave Weaver · Sep 5, 2005

weston said:
The script essentially reads an entire HTML file into a string, and
then attempts to bring any tags that are broken across multiple
lines onto the same line, using this regular expression:

s/(<span.*)\n+([^>]*>)/$1\ $2/mig;

Under Cygwin (Perl 5.8.5), however, I get this result:

"MS Mincho"'>（

As others have pointed out, this is a line-ending problem.
In Windows, the end-of-line is marked by "\r\n"; in *nix systems it's
usually just "\n". It looks like Cygwin is using the *nix definition,
but your data file has the Windows line-end sequence, so when your
regex removes the "\n", you're left with a "\r" at the end of the
line.

The simplest solution is to change your regexp to:

s/(<span.*)[\r\n]+([^>]*>)/$1\ $2/mig;

(untested)

Then it won't matter what combination of "\r" and "\n" occurs.
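
One caveat with the substitution above: without the /s modifier, `.` matches any character except "\n", so it does match "\r". The greedy `.*` can therefore still pull the carriage return into $1 and leave `[\r\n]+` to match only the "\n". A variant that keeps both characters out of the first group is sketched below. The sub name `join_span_tags` and the sample string are made up for illustration, and the pattern has only been reasoned through against this kind of input.

```perl
use strict;
use warnings;

# Hypothetical helper: join a <span ...> tag that is broken across two lines.
sub join_span_tags {
    my ($html) = @_;
    # [^>\r\n]* stops the first group before any line-ending character,
    # so [\r\n]+ consumes the whole CR/LF run between the two halves.
    $html =~ s/(<span[^>\r\n]*)[\r\n]+([^>]*>)/$1 $2/ig;
    return $html;
}

# Made-up sample with a Windows line ending inside the tag.
my $sample = "<span lang=JA style='font-family:\r\n\"MS Mincho\"'>x</span>";
print join_span_tags($sample), "\n";
```

Because the second group may itself contain line breaks, a tag split over three or more lines would need the substitution applied repeatedly.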
