Does Ruby need a "line separator" class?

Wes Gamble

I've run into a problem where Ruby can't handle newlines on Windows
because the regexp is explicitly looking for \n and not \r\n.

In the Java world, there is a system property to represent line
separator so that you can write code that is cross-platform with respect
to line separation on Unix/Windows/Mac. Is there an equivalent
abstraction of the newline character in Ruby? If not, where does it
belong?

For some reason, I thought I read somewhere that sometimes the "\n"
character is overloaded in this way (to represent a "newline" regardless
of platform), but not sure if I'm misremembering.

Thanks,
Wes
 
Xavier Noria

I've run into a problem where Ruby can't handle newlines on Windows
because the regexp is explicitly looking for \n and not \r\n.

It shouldn't look for CRLFs. The rules of the game in languages that
inherit the newline normalization approach from C (those include C++,
and Perl, for instance, but not Java) are that if you work in text
mode and the text file follows runtime conventions, you only read and
print "\n"s.

That's because there's an intermediate IO layer that transforms CRLF
into LF in CRLF platforms on reading, and LF back to CRLF on writing.

In Java this is handled in a different way: "\n" is not portable in
Java. Portable code in Java uses method calls like println. But in
Ruby a portable regexp that assumes text mode and data with the
runtime platform conventions for newlines has to use "\n"; no CR
ever gets into the string.
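
As a small illustration of that rule, the sketch below runs one pattern written against "\n" over the same two lines with LF and with CRLF terminators. The sample strings and variable names are made up for the example; the LF string stands in for what a text-mode read is expected to hand to the program on a CRLF platform.

```ruby
# Hypothetical sample data: the same two lines with LF and CRLF terminators.
lf_text   = "alpha\nbeta\n"      # stands in for the result of a text-mode read
crlf_text = "alpha\r\nbeta\r\n"  # raw bytes as stored on a CRLF platform

# A pattern that only knows about "\n".
line_pattern = /\Aalpha\nbeta\n\z/

lf_matches   = !!(lf_text =~ line_pattern)
crlf_matches = !!(crlf_text =~ line_pattern)

puts "LF text matches:   #{lf_matches}"
puts "CRLF text matches: #{crlf_matches}"
```

The pattern only fails when CRs reach the string, which is the case when the data does not follow the conventions of the platform or was read in binary mode.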

-- fxn
 
Wes Gamble

Xavier,

That's interesting.

In a pure Ruby (Rails) app, I've had to modify regexps to handle the
\r\n sequence so that my regexps will work in a Windows environment.
I'm guessing that this is related to the "file follows runtime
conventions" in your post. Meaning that the file that I'm processing
(which is actually sourced externally) did not conform to C runtime
conventions when it was written.

In general, this seems simple enough to handle, you just allow for
optional \r \n combinations in your regexp (assuming setting the
multiline flag for the regexp), like so:

[^\r\n]*
[\r\n]*
(\r*\n*)
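
A possible variation on the patterns above, sketched with made-up input: an alternation that tries CRLF first treats each terminator as exactly one line break, so empty lines are preserved. The variable names are hypothetical.

```ruby
# Hypothetical input mixing CRLF, LF, and a lone CR.
mixed = "one\r\ntwo\nthree\rfour"

# CRLF is listed first so that a CR LF pair counts as a single terminator.
line_break = /\r\n|\r|\n/

lines = mixed.split(line_break)
p lines

# An empty line between two terminators survives the split.
with_blank = "a\n\nb".split(line_break)
p with_blank
```

A character class followed by * (such as [\r\n]*) would instead swallow a run of terminators in one go, which merges blank lines.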

Wes
 
Xavier Noria

This has come up in the JRuby project fairly frequently since Java
wants to normalize line-terminators internally to the underlying
platform, rather than normalizing to \n and handling conversion on
read-write. Xavier, are you saying that Ruby has in its IO layer code
to convert from CRLF to LF on input/output, and this is the primary
means of normalizing newlines? We have had in our bug tracker a patch
that resolves JRuby's newline issues in a similar way, but had not
committed it pending research into whether this would be appropriate
and sufficient.

If I am not mistaken, in Ruby that is delegated to stdio. After a
quick code inspection I think the exact point where that is done is
in the call to write():

r = write(fileno(f), RSTRING(str)->ptr+offset, l);

That's in the function io_fwrite(), line 455 of io.c in Ruby 1.8.4.

In Perl that was delegated to stdio as well until 5.8.0, where the
I/O layer was replaced with PerlIO, which is now responsible for
that filtering on CRLF platforms.

-- fxn
 
Xavier Noria

In a pure Ruby (Rails) app, I've had to modify regexps to handle the
\r\n sequence so that my regexps will work in a Windows environment.
I'm guessing that this is related to the "file follows runtime
conventions" in your post. Meaning that the file that I'm processing
(which is actually sourced externally) did not conform to C runtime
conventions when it was written.

Yes, that is an important point.

When we talk about portability as far as newlines are concerned we are
assuming the newline conventions of the platform and the data match.
A portable line-oriented script might fail if it is running on Linux
processing text files from a FAT32 partition that were generated by
some Windows program. There are a lot of common situations where
conventions may not match. A portable line-oriented script is not
supposed to handle those situations; a robust line-oriented script
should do something sensible with foreign conventions.

Web programming is one of them, because you cannot assume anything about
the input that comes from a text area or an uploaded text file, for
instance. In that case you had better normalize first (written on the fly):

normalized_text_area = text_area.gsub(/\015\012/, "\n").gsub(/\015/, "\n")
# Now normalized_text_area holds the normalized text and all standard
# line-oriented idioms will work on it.

In Ruby we are done because "\n" is "\012" everywhere; in Perl that
gets slightly more complicated because "\n" is "\015" on Mac OS before
X. But you see the idea and why you do that.
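
To make the payoff concrete, here is a minimal sketch that applies the two-pass normalization to a made-up CRLF string and then uses ordinary line-oriented idioms on the result. The input string and variable names are assumptions for the example.

```ruby
# Hypothetical text-area input that arrives with CRLF terminators.
text_area = "first line\r\nsecond line\r\n"

normalized = text_area.gsub(/\015\012/, "\n").gsub(/\015/, "\n")

# Line-by-line processing now sees only "\n".
lines = normalized.each_line.map { |line| line.chomp }

# The $ anchor matches just before "\n", so a stray CR gets in its way.
last_words_raw        = text_area.scan(/(\w+)$/).flatten
last_words_normalized = normalized.scan(/(\w+)$/).flatten

p lines
p last_words_raw
p last_words_normalized
```

On the raw input the anchored pattern finds nothing, because a "\r" sits between the last word and the "\n" on every line.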

-- fxn (<-- whose article about newlines for O'Reilly is about to
appear)
 
Xavier Noria

A large part of our problem is that we currently tend to normalize
everything to \n....all the time. That has the effect of also writing
out \n to the filesystem for newlines, which as you describe above
causes problems when trying to re-read. So for the case in question,
we run Rails...it generates files with newlines...we normalize those
newlines to \n and write such to disk...and then future use of those
files (in this case, ERB templates) fails because the newlines aren't
handled correctly (i.e. we can't normalize \r\n to \n again because
they're already \n on disk).

If those files are only handled by that application there is no
problem because \ns are precisely what the script should see.

For instance, if you pass a Unix text file to a line-oriented script
running on Windows the script will work as long as it only reads.
That's because LFs not following a CR are left untouched by the I/O
layer, and by a happy coincidence LFs are what readline expects. So
everything works, by chance, but works.

The problem is that the application generates text files that do not
follow the conventions of the platform, and other programs may assume
they do.

-- fxn
 
Xavier Noria

normalized_text_area = text_area.gsub(/\015\012/, "\n").gsub(/\015/, "\n")

Just for the archives, this normalizes in Ruby with only one pass:

normalized_text_area = text_area.gsub(/\015\012?/, "\n")

though it is less explicit. Let me add, while we are at it, that if
the text is Unicode it may come with a few more codes for newlines.
All in all this is a PITA like character encodings, but it is what
we've got for historical reasons.
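
For Unicode text, one way the one-pass idea might be extended is sketched below. Which extra code points to treat as newlines is a judgment call; this example assumes NEL (U+0085), LINE SEPARATOR (U+2028), and PARAGRAPH SEPARATOR (U+2029), and it relies on \u escapes in regexp literals, so it needs Ruby 1.9 or later and UTF-8 strings. The sample text is made up.

```ruby
# Hypothetical sample mixing ASCII and Unicode line terminators.
text = "a\r\nb\rc\u0085d\u2028e\u2029f\ng"

# CR LF, lone CR, NEL, LINE SEPARATOR, and PARAGRAPH SEPARATOR
# are all rewritten to "\n" in a single pass.
normalized = text.gsub(/\r\n?|[\u0085\u2028\u2029]/, "\n")

pieces = normalized.split("\n")
p pieces
```

Whether separators such as U+2028 should really be folded into "\n" depends on the application, so the character class here is only a starting point.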

-- fxn
 
Wes Gamble

I was thinking about this a little more.

Why wouldn't JRuby just take advantage of the Java runtime's
normalization facility in this case, using the JVM's notion of "newline"
on the particular platform to handle I/O?

Is the JRuby issue that only _some_ of the code that is doing I/O is
pure Java and some other set of the code is Ruby so that trying to
always use the JVM "line separator" concept won't work?

Wes
 
Wes Gamble

In this particular case, could
java.lang.System.getProperty("line.separator") be used to handle
platform-specific reading/writing? That way, you get to piggyback on
the multiplatform support built into Java. If the low-level I/O code is
centralized, it seems like this would be the way to go.

Are there performance implications for this approach? Seems like you
could just grab all of the system specific newline properties from the
System object upon the initialization of the JRuby interpreter and just
refer to them later.

Wes
 
