fromdos dos2unix in ruby

Krzysztof Cierpisz

how can I achieve in ruby the result of running:
fromdos dos_file.txt unix_file.txt

or in vim:
set ff=unix

?

thanks,
chris
 
krzysztof cierpisz

just to add, I need to do that conversion under windows.

thanks,
chris
 
Dominik Honnef

krzysztof cierpisz said:
just to add, I need to do that conversion under windows.

Well, you would read from the input file, replace the dos/windows line
endings with unix ones and write to the output file.
 
krzysztof cierpisz

Well, you would read from the input file, replace the dos/windows line
endings with unix ones and write to the output file.

I tried with following dos2unix.rb script

###### dos2unix.rb ######################
out = File.open(ARGV[1],"w")

File.open(ARGV[0]).each {|line|
out << line.gsub!(/\r$/,'')
}

out.close
#########################################

this:
ruby dos2unix.rb u8nl_utf8_tab.dos.txt d

works fine on Linux (d with length 408 bytes) but not on Windows, on
Windows d is a file with 0 bytes

input file u8nl_utf8_tab.dos.txt looks like this:
col1,col2|~|
"first line of cell 1
second line of cell 1",only line in 2|~|
"Czy specjalny telefon przeznaczony dla dzieci w wieku od 3 do 7 lat
podbije rynek? Jest prosty, bezpieczny i ma tylko 4 klawisze.
Sprzedawać go chce między innymi telefonia ojca Rydzyka. więcej
","Copyright © World Group.
Реклама
Help
Сделать World стартовой"|~|
äöüб фыва,"asdf,Ñжх"|~|

thanks,
chris
 
Robert Klemme

2009/8/18 krzysztof cierpisz said:
Well, you would read from the input file, replace the dos/windows line
endings with unix ones and write to the output file.

I tried with following dos2unix.rb script

###### dos2unix.rb ######################
out = File.open(ARGV[1],"w")

File.open(ARGV[0]).each {|line|
 out << line.gsub!(/\r$/,'')
}

out.close
#########################################

this:
ruby dos2unix.rb u8nl_utf8_tab.dos.txt d

works fine on Linux (d with length 408 bytes) but not on Windows, on
Windows d is a file with 0 bytes

You are not closing the File object properly so your output might
never get flushed to disk...

Cheers

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
krzysztof cierpisz

I tried with following dos2unix.rb script
###### dos2unix.rb ######################
out = File.open(ARGV[1],"w")
File.open(ARGV[0]).each {|line|
 out << line.gsub!(/\r$/,'')
}
out.close
#########################################

this:
ruby dos2unix.rb u8nl_utf8_tab.dos.txt d
works fine on Linux (d with length 408 bytes) but not on Windows, on
Windows d is a file with 0 bytes

You are not closing the File object properly so your output might
never get flushed to disk...

Cheers

robert

can you let me know how to close it properly?

thanks,
chris
 
Rob Biedenharn

2009/8/18 krzysztof cierpisz said:
Well, you would read from the input file, replace the dos/windows
line
endings with unix ones and write to the output file.

I tried with following dos2unix.rb script

###### dos2unix.rb ######################
out = File.open(ARGV[1],"w")

File.open(ARGV[0]).each {|line|
out << line.gsub!(/\r$/,'')

You open the file with the default mode of 'r' here so the File class
is going to do the line-ending conversion for you. Then you use
String#gsub! which returns nil when no changes are made. You are never
going to get output this way.
You are not closing the File object properly so your output might
never get flushed to disk...

Cheers

robert


Try something like this:

buffer = ''
File.open(ARGV[1], 'wb') do |out|           # open for writing binary
  File.open(ARGV[0], 'rb') do |input|       # open for reading binary
    while input.read(1024, buffer)          # read up to 1024 bytes into buffer
      out.write buffer.gsub(/\r\n/, "\n")   # change ending and write out
    end
  end                                       # end of block closes input
end                                         # end of block closes output

-Rob

P.S. This is untested straight from my head.


Rob Biedenharn http://agileconsultingllc.com
(e-mail address removed)
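
The gsub! point above can be checked without touching any files. The following is only a small sketch: the sample strings are made up, and a StringIO stands in for the output file. It shows that gsub! returns nil when nothing matched, while gsub always returns a string, and that appending nil with << writes nothing, which would fit the 0-byte file seen on Windows.

```ruby
require 'stringio'

# Made-up sample lines. The second is what text-mode reading on Windows
# would hand to Ruby, because the \r has already been filtered out.
line_with_cr    = "col1,col2\r\n"
line_without_cr = "col1,col2\n"

# gsub! changes the string in place and returns nil if nothing matched.
bang_hit  = line_with_cr.dup.gsub!(/\r$/, '')     # "col1,col2\n"
bang_miss = line_without_cr.dup.gsub!(/\r$/, '')  # nil

# gsub returns a string whether or not anything was replaced.
plain_hit  = line_with_cr.gsub(/\r$/, '')         # "col1,col2\n"
plain_miss = line_without_cr.gsub(/\r$/, '')      # "col1,col2\n"

# << writes the to_s of its argument, and nil.to_s is the empty string.
sink = StringIO.new
sink << bang_miss
```

So swapping gsub! for gsub, or calling gsub! on its own line and then writing the line, avoids the silent loss of output.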
 
krzysztof cierpisz

Try something like this:
buffer = ''
File.open(ARGV[1], 'wb') do |out|           # open for writing binary
  File.open(ARGV[0], 'rb') do |input|       # open for reading binary
    while input.read(1024, buffer)          # read up to 1024 bytes into buffer
      out.write buffer.gsub(/\r\n/, "\n")   # change ending and write out
    end
  end                                       # end of block closes input
end                                         # end of block closes output

-Rob

thanks Rob,

I just added binary mode to what I had, and now it's working under
Windows as well.
I am always forgetting about "b" mode under windows.

thanks
chris
 
Robert Dober

2009/8/18 krzysztof cierpisz said:
Well, you would read from the input file, replace the dos/windows line
endings with unix ones and write to the output file.


I tried with following dos2unix.rb script

###### dos2unix.rb ######################
out = File.open(ARGV[1],"w")

File.open(ARGV[0]).each {|line|
 out << line.gsub!(/\r$/,'')

You open the file with the default mode of 'r' here so the File class is
going to do the line-ending conversion for you. Then you use String#gsub!
which returns nil when no changes are made. You are never going to get
output this way.
You are not closing the File object properly so your output might
never get flushed to disk...

Cheers

robert


Try something like this:

buffer = ''
File.open(ARGV[1], 'wb') do |out|           # open for writing binary
  File.open(ARGV[0], 'rb') do |input|       # open for reading binary
    while input.read(1024, buffer)          # read up to 1024 bytes into buffer
      out.write buffer.gsub(/\r\n/, "\n")   # change ending and write out

Fancy little bug here, Rob, do you spot it?

What if \r is the 1024th char? Then \r and \n land in different reads and the gsub never sees the pair.
This will happen, one day ;)
Unless the file is huge I would try
File.open....
  File.open ...
    out.print input.read.gsub( /\r\n/, "\n" )
  end
end

If performance can be an issue we could use File#each with 10.chr as a separator

input.each 10.chr do | line |
  out.print line.sub( /\r\n\z/, 10.chr )
end

HTH
Robert
-- 
module Kernel
  alias_method :λ, :lambda
end
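
The chunk-boundary problem is easy to reproduce with a tiny chunk size. This is a rough sketch, not code from the thread: the method names naive_convert and carry_convert are invented for illustration, StringIO stands in for the binary-mode file, and holding back a trailing \r is just one possible repair.

```ruby
require 'stringio'

# Chunked conversion as in the snippet under discussion.
# A \r\n pair split across two reads is never matched.
def naive_convert(data, chunk_size)
  input = StringIO.new(data)
  out = +''
  while (chunk = input.read(chunk_size))
    out << chunk.gsub(/\r\n/, "\n")
  end
  out
end

# One possible repair: hold back a trailing \r until the next chunk arrives.
def carry_convert(data, chunk_size)
  input = StringIO.new(data)
  out = +''
  carry = +''
  while (chunk = input.read(chunk_size))
    chunk = carry + chunk
    carry = chunk.end_with?("\r") ? chunk.slice!(-1, 1) : +''
    out << chunk.gsub(/\r\n/, "\n")
  end
  out << carry   # a lone \r at the very end of the input is kept as-is
end

sample = "ab\r\ncd\r\n"
naive = naive_convert(sample, 3)   # reads "ab\r", "\ncd", "\r\n" => "ab\r\ncd\n"
fixed = carry_convert(sample, 3)   # => "ab\ncd\n"
```

With a chunk size of 3 the first pair is split and survives in the naive version, while the carry version converts both.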
 
Xavier Noria

2009/8/18 Robert Dober said:
If performance can be an issue we could use File#each with 10.chr as a separator

    input.each 10.chr do | line |
        out.print line.sub( /\r\n\z/, 10.chr )
    end

Just for the record... in Ruby "\n" == 10.chr on all platforms. I find
"\n" to be more obvious.
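
For anyone who wants to check the equivalence in irb, a quick sketch (the variable names are only for illustration):

```ruby
# "\n" is always the single byte 10 in Ruby, whatever the platform's
# on-disk newline convention is; likewise "\r" is byte 13.
lf_same  = ("\n" == 10.chr) && ("\n" == "\012")
cr_same  = ("\r" == 13.chr) && ("\r" == "\015")
lf_bytes = "\n".bytes   # [10]
```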
 
Robert Dober

Just for the record... in Ruby "\n" == 10.chr on all platforms. I find
"\n" to be more obvious.
I wanted to point out the subtle bug because I thought it useful. But
I hate backslashes and use 10.chr often, this however is not good
practice, because it is unconventional, it is just me ;).
In the infinitesimal hope that 10.chr is useful for some folks anyway.
Cheers
Robert




-- 
module Kernel
  alias_method :λ, :lambda
end
 
Robert Klemme

I wanted to point out the subtle bug because I thought it useful. But
I hate backslashes and use 10.chr often, this however is not good
practice, because it is unconventional, it is just me ;).
In the infinitesimal hope that 10.chr is useful for some folks anyway.

I would let Ruby do the line detection to avoid the issue Robert pointed
out. For the record, this is what I'd probably be doing:

WIN_LE = "\r\n".freeze

File.open ARGV[0] do |input|
  File.open ARGV[1], "wb" do |out|
    input.each do |line|
      line.chomp!
      out.print line, WIN_LE
      # or:
      # out.write(line)
      # out.write(WIN_LE)
    end
  end
end

In this particular case I would not use File.foreach because then "out"
is created even if the input file isn't there.

Kind regards

robert
 
Xavier Noria

WIN_LE = "\r\n".freeze

File.open ARGV[0] do |input|
  File.open ARGV[1], "wb" do |out|
    input.each do |line|
      line.chomp!
      out.print line, WIN_LE

Hey but this is dos2unix :).

You can't read in text-mode just like that in a portable way, because
chomp! only chomps "\n".

If you can assume the program is gonna run only on Windows then the
solution is trivial: read in text-mode, and write in binary mode. No
chomping or gsubs needed, just read and write.

If the program has to be portable then you need to deal with the
spurious \015 that may come up.
 
Robert Dober

WIN_LE = "\r\n".freeze

File.open ARGV[0] do |input|
  File.open ARGV[1], "wb" do |out|
    input.each do |line|
      line.chomp!
      out.print line, WIN_LE

Hey but this is dos2unix :).

You can't read in text-mode just like that in a portable way, because
chomp! only chomps "\n".

If you can assume the program is gonna run only on Windows then the
solution is trivial: read in text-mode, and write in binary mode. No
chomping or gsubs needed, just read and write.

If the program has to be portable then you need to deal with the
spurious \015 that may come up.

Yup, I thought my code solved the issue, tell Ruby that a line ends
with "\n" ( that was tough to type ;) in each and replace a potential
"\r" before?
But maybe this does not work on binary files under Windows, no way to
test, sorry.

Cheers
Robert
 
Xavier Noria

Yup, I thought my code solved the issue, tell Ruby that a line ends
with "\n" ( that was tough to type ;) in each and replace a potential
"\r" before?
But maybe this does not work on binary files under Windows, no way to
test, sorry.

The idea is good, but this topic is brittle (though easy when you get
the facts straight).

Problem is on CRLF platforms the I/O system filters out the CR of any
CRLF pair before the string arrives in Ruby land. That is, if you work
in text-mode. In fact that is the definition of text-mode, that the
conversion is on.

When you write in text mode on a CRLF platform, the I/O system
monitors the stream of bytes, and inserts a CR every time it sees an
LF. Unconditionally.

On Unix these conversions do not happen, text-mode and binary-mode are
the same, and Unix uses LF on disk to mean a newline.

And the point is those conversions happen in text-mode *no matter
which is the input record separator*, so in those solutions the file
opened for reading should be opened in binary mode anyway. If you
don't do this, a file that has on disk

\r\r\n

will go up as \r\n on Windows, and that gsubed to \n, so you've lost a
\r that didn't belong to the newline.

In a portable script you have to work in binary mode, and in a
Windows-only script it is enough to read in text-mode and write
verbatim in binary-mode.
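
A minimal sketch of the portable, binary-mode route for files that fit in memory. The method name dos2unix and the temporary file names are made up for the example; File.binread and File.binwrite open in binary mode, so no platform newline translation happens on either side.

```ruby
require 'tmpdir'

# Hypothetical helper: read the bytes verbatim, replace CRLF with LF,
# write the bytes verbatim.
def dos2unix(src_path, dst_path)
  data = File.binread(src_path)
  File.binwrite(dst_path, data.gsub("\r\n", "\n"))
end

# Usage on a throwaway file that contains the \r\r\n case from above.
converted = Dir.mktmpdir do |dir|
  src = File.join(dir, 'dos.txt')
  dst = File.join(dir, 'unix.txt')
  File.binwrite(src, "one\r\ntwo\r\r\nthree\n")
  dos2unix(src, dst)
  File.binread(dst)   # "one\ntwo\r\nthree\n"
end
```

The \r that did not belong to a newline survives, and lines that already ended in a bare \n are left alone.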
 
Robert Klemme

2009/8/18 Xavier Noria said:
WIN_LE = "\r\n".freeze

File.open ARGV[0] do |input|
  File.open ARGV[1], "wb" do |out|
    input.each do |line|
      line.chomp!
      out.print line, WIN_LE

Hey but this is dos2unix :).

Ooops, make that then

LE = "\n".freeze

and of course

out.print line, LE

You can't read in text-mode just like that in a portable way, because
chomp! only chomps "\n".

No.

$ allruby -e 'p "a\r\n".chomp'
CYGWIN_NT-5.1 padrklemme1 1.5.25(0.156/4/2) 2008-06-12 19:34 i686 Cygwin
========================================
ruby 1.8.7 (2008-08-11 patchlevel 72) [i386-cygwin]
"a"
========================================
ruby 1.9.1p129 (2009-05-12 revision 23412) [i386-cygwin]
"a"
If you can assume the program is gonna run only on Windows then the
solution is trivial: read in text-mode, and write in binary mode. No
chomping or gsubs needed, just read and write.

If the program has to be portable then you need to deal with the
spurious \015 that may come up.

String#chomp does that nicely.

Kind regards

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
Robert Dober

The idea is good, but this topic is brittle (though easy when you get
the facts straight).

Problem is on CRLF platforms the I/O system filters out the CR of any
CRLF pair before the string arrives in Ruby land. That is, if you work
in text-mode. In fact that is the definition of text-mode, that the
conversion is on.

When you write in text mode on a CRLF platform, the I/O system
monitors the stream of bytes, and inserts a CR every time it sees an
LF. Unconditionally.

On Unix these conversions do not happen, text-mode and binary-mode are
the same, and Unix uses LF on disk to mean a newline.

And the point is those conversions happen in text-mode *no matter
which is the input record separator*, so in those solutions the file
opened for reading should be opened in binary mode anyway. If you
don't do this, a file that has on disk

But I did open it in binary mode, did I not?
Anyway, if I had a typo in my snippet, thanx for the correction.

The only issue I can see is the following

Newline = "\n" || 10.chr || "\012" || ";-)"

File.open( "...", "rb"){ | f |
f.each( Newline ) { ...
####### ^
####### Does this work on Windows?

Cheers
Robert
 
Xavier Noria

String#chomp does that nicely.

Oh you are right. I thought chomp removed only the input record separator,
but I see in the Pickaxe that when $/ has been left untouched it also
removes a trailing \r or \r\n.
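
The chomp behaviour can be sketched as follows, assuming $/ is still at its default. The sample strings are made up, and StringIO stands in for a file opened in binary mode.

```ruby
require 'stringio'

# With the default separator, chomp strips one trailing \n, \r or \r\n.
default_crlf = "a\r\n".chomp        # "a"
default_lf   = "a\n".chomp          # "a"
default_cr   = "a\r".chomp          # "a"

# With some other explicit separator the line ending is left alone.
other_sep    = "a\r\n".chomp("|")   # "a\r\n"

# Line-based dos2unix built on chomp. Note that this also appends "\n"
# to a final line that had no line ending at all.
out = +''
StringIO.new("x\r\ny\r\n").each_line { |line| out << line.chomp << "\n" }
```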
 
