Paragraph separation for DOS files on Unix

  • Thread starter Oliver Heidelbach

Oliver Heidelbach

Hi,

is there any way to get Perl on Unix to work on DOS files (CRLF) when
reading in paragraphs, without having to convert the line endings
before processing the file?

The usual

$/ = "";
@paras = split(/""/,$_);

works fine for DOS files on systems with DOS format and DOS-aware
Perl and for Unix files on Unix systems and Unix-aware Perl.

However, it does not work for DOS files on Unix systems. I tried
everything that came to mind, but nothing really works.

e.g.:
$/ = "^\r*$";
@paras = split(/"^\r*$"/,$_);

$/ = "^\r\n$";
@paras = split(/"^\r\n"/,$_);

$/ = "^\s*$";
@paras = split(/"^\s*$"/,$_);

I regularly move DOS-formatted files to Unix for
further processing, and having to convert them before
touching them with Perl is a major annoyance.

If the paragraph separation cannot be done on that format
on Unix properly, is there any easy way to slurp in the file,
convert it in memory and then process the data as if it were
read from a file directly?

Thanks for any help with this.

Kind regards
Oliver Heidelbach
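
On the last question (slurp, convert in memory, then process as if reading from a file), one possible sketch is to rewrite CRLF to LF in a scalar and open a read handle on that scalar. The sample text below is a hypothetical stand-in for the slurped file contents, and the variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the slurped contents of a DOS-format file.
my $text = "alpha\r\nbeta\r\n\r\ngamma\r\n\r\n\r\ndelta\r\n";

# Convert CRLF line endings to LF in memory.
$text =~ s/\r\n/\n/g;

# Open a read handle on the scalar and read it in paragraph mode.
open my $fh, '<', \$text or die "in-memory open failed: $!";
my @paras = do { local $/ = ""; <$fh> };
close $fh;

# Trim the trailing newlines that paragraph mode leaves on each record.
s/\n+\z// for @paras;

print scalar(@paras), " paragraphs\n";
```

With a real file, the scalar would be filled by slurping the file first (for example with `$/` undefined) before the substitution.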
 

Andrew Hamm

Assuming you are using a sufficiently modern version of Perl, have a fresh
look at the "IO layers" or "disciplines" of the open call. I'm actually
struggling to find a suitable perldoc section (someone else will surely
post one), but if you have the 3rd edition of the Camel book, look at
pages 754,755 which actually appears to show almost exactly what you are
trying to do

open FH, "<:para:crlf", $filename

:para means put the file into paragraph mode, :crlf means to handle CR/LF
line terminators on your behalf.
 

Uri Guttman

OH> The usual

what usual?

OH> $/ = "";

that is for reading records.

OH> @paras = split(/""/,$_);

no records are read there. and does $_ really have "" (a pair of double
quotes) in it? for that is what you are splitting on.

OH> works fine for DOS files on systems with DOS format and DOS-aware
OH> Perl and for Unix files on Unix systems and Unix-aware Perl.

so what? the split is wacky on both platforms unless you use "" as a
paragraph separator.

OH> However, it does not work for DOS files on Unix systems. I tried
OH> everything which came to my mind, but nothing really works.

OH> e.g.:
OH> $/ = "^\r*$";
OH> @paras = split(/"^\r*$"/,$_);

OH> $/ = "^\r\n$";
OH> @paras = split(/"^\r\n"/,$_);

OH> $/ = "^\s*$";
OH> @paras = split(/"^\s*$"/,$_);

where did you learn to use "" inside a regex? burn that book. a regex is
ALREADY a double quotish string and can interpolate variables and most
string escape sequences just fine.

uri
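
To illustrate the point about quotes inside a regex, here is a small sketch. The sample string and variable names are made up for the example. The double quotes between the slashes are ordinary pattern characters, so the first split looks for a literal pair of quotes.

```perl
use strict;
use warnings;

# Hypothetical sample: two paragraphs separated by one blank line.
my $text = "one\n\ntwo";

# The quote characters are part of the pattern, so this searches for a
# literal pair of double quotes, finds none, and returns the string unsplit.
my @quoted = split /""/, $text;

# Without the quotes, the pattern matches the blank line between paragraphs.
my @plain = split /\n\n/, $text;

printf "%d vs %d\n", scalar(@quoted), scalar(@plain);   # prints "1 vs 2"
```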
 

Oliver Heidelbach

Uri Guttman said:
OH> @paras = split(/"^\s*$"/,$_);

where did you learn to use "" inside a regex? burn that book. a regex is
ALREADY a double quotish string and can interpolate variables and most
string escape sequences just fine.

uri


Thanks a lot, that was the problem.

$/ = "";
@paras = split(/^\r*$/,$_);

works as it should.

I don't know why I wanted the regexp inside the quotes :(

Kind regards
Oliver Heidelbach
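
For anyone who prefers the split-based route on a slurped string, a hedged alternative is to split on runs of two or more line endings, each with an optional CR. In Perl, `^` and `$` match at interior line boundaries only under the `/m` modifier, so this sketch avoids anchors altogether. The sample text and names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a DOS-format file that was slurped as one string.
my $text = "alpha\r\nbeta\r\n\r\ngamma\r\n\r\n\r\ndelta\r\n";

# Split wherever two or more line endings (LF or CRLF) occur in a row.
my @paras = split /(?:\r?\n){2,}/, $text;

# Normalize the line endings left inside each paragraph and drop the
# trailing newline of the last one.
for my $p (@paras) {
    $p =~ s/\r\n/\n/g;
    $p =~ s/\n+\z//;
}

print scalar(@paras), " paragraphs\n";
```

The same pattern also works on files with Unix line endings, since the CR is optional.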
 

Tad McClellan

Oliver Heidelbach said:
I am regularly shifting DOS formatted files to Unix for
further processing and having to convert those before
touching with Perl is a major annoyance.


If you use the correct mode ("ASCII" or "text") when FTPing the
files, then you won't have any annoying conversions to do because
the FTP program will do them for you.
 

Ben Morrow

Andrew Hamm said:
Assuming you are using a sufficiently modern version of Perl, have a fresh
look at the "IO layers" or "disciplines" of the open call. I'm actually
struggling to find a suitable perldoc section (someone else will surely
post one),

Err... the obvious place? perldoc -f open points you at perldoc open and
perldoc PerlIO.

Andrew Hamm said:
but if you have the 3rd edition of the Camel book, look at
pages 754,755 which actually appears to show almost exactly what you are
trying to do

open FH, "<:para:crlf", $filename

:para means put the file into paragraph mode, :crlf means to handle CR/LF
line terminators on your behalf.

The stuff in the Camel about IO 'disciplines' is mostly incorrect in
detail, as it hadn't been written at that time. They are now called
'layers', and there is no :para. :crlf, however, will do what the OP
wants (though not what I think it should do... :) ).

Ben
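
A minimal sketch of the `:crlf` approach combined with paragraph mode via `$/`, assuming a Perl with PerlIO layers (5.8 or later). The temporary file and its contents are hypothetical and exist only to make the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small DOS-format sample file (hypothetical test data).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;    # make sure the CRLF bytes are written exactly as given
print {$out} "first line\r\nsecond line\r\n\r\nnext paragraph\r\n";
close $out or die "close failed: $!";

# Read it back through the :crlf layer, which turns CRLF into LF on input,
# so paragraph mode sees ordinary blank lines.
open my $in, '<:crlf', $path or die "cannot open $path: $!";
my @paras = do { local $/ = ""; <$in> };
close $in;

# Trim the trailing newlines that paragraph mode leaves on each record.
s/\n+\z// for @paras;

print scalar(@paras), " paragraphs\n";
```

The layer handles the line endings and `$/ = ""` handles the paragraph splitting, so no separate conversion step is needed.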
 

Andrew Hamm

Ben Morrow said:
Err... the obvious place? perldoc -f open points you at perldoc open
and perldoc PerlIO.

My perldoc doesn't detail much about the disciplines/layers... maybe I
didn't hunt long enough.

Ben Morrow said:
The stuff in the Camel about IO 'disciplines' is mostly incorrect in
detail, as it hadn't been written at that time. They are now called
'layers', and there is no :para. :crlf, however, will do what the OP
wants (though not what I think it should do... :) ).

ok. It did sound really bleeding edge from the Camel book - in fact I
think it includes a disclaimer that it's not even complete yet. I haven't
had need to use any disciplines in my work.

hahahah - PerlIO literally. Damn. Interesting stuff - it's quite changed
from Camel book days.
 