Does Ruby need a "line separator" class?

Discussion in 'Ruby' started by Wes Gamble, Jul 31, 2006.

  1. Wes Gamble

    Wes Gamble Guest

    I've run into a problem where Ruby can't handle newlines on Windows
    because the regexp is explicitly looking for \n and not \r\n.

    In the Java world, there is a system property to represent line
    separator so that you can write code that is cross-platform with respect
    to line separation on Unix/Windows/Mac. Is there an equivalent
    abstraction of the newline character in Ruby? If not, where does it
    belong?

    For some reason, I thought I read somewhere that sometimes the "\n"
    character is overloaded in this way (to represent a "newline" regardless
    of platform), but not sure if I'm misremembering.

    Thanks,
    Wes

    --
    Posted via http://www.ruby-forum.com/.
    Wes Gamble, Jul 31, 2006
    #1

  2. Wes Gamble

    Xavier Noria Guest

    On Jul 31, 2006, at 5:40 PM, Wes Gamble wrote:

    > I've run into a problem where Ruby can't handle newlines on Windows
    > because the regexp is explicitly looking for \n and not \r\n.


    It shouldn't look for CRLFs. The rules of the game in languages that
    inherit the newline normalization approach from C (those include C++
    and Perl, for instance, but not Java) are that if you work in text
    mode and the text file follows runtime conventions, you only read and
    print "\n"s.

    That's because there's an intermediate IO layer that transforms CRLF
    into LF in CRLF platforms on reading, and LF back to CRLF on writing.

    In Java this is handled differently: "\n" is not portable in Java, and
    portable Java code uses method calls like println. But in Ruby, a
    portable regexp that assumes text mode and data following the runtime
    platform's newline conventions has to use "\n"; no CR ever gets into
    the string.
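
A minimal Ruby sketch of the text-mode versus binary-mode distinction described above. The file name prefix and variable names are made up for illustration. What the text-mode read returns depends on the platform: on a CRLF platform such as Windows the I/O layer is expected to hand back bare "\n"s, while on Unix the bytes come through unchanged.

```ruby
require "tempfile"

# Write a file with CRLF line endings byte-for-byte (binary mode, no translation).
file = Tempfile.new("crlf_demo")   # hypothetical file name prefix
file.binmode
file.write("one\r\ntwo\r\n")
file.close

# Binary-mode read: always the raw bytes, CRs included.
raw = File.binread(file.path)

# Text-mode read: on Windows the CRLF pairs should arrive as "\n";
# on Unix nothing is translated, so the CRs are still there.
text = File.read(file.path)

# A regexp written against "\n" only, as recommended for text mode.
# It finds both words only when no CR made it into the string.
words = text.scan(/^\w+$/)

puts raw.inspect
puts text.inspect
puts words.inspect
```

If `words` comes back empty, the data did not follow the conventions of the platform the script is running on, which is the situation discussed later in the thread.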

    -- fxn
    Xavier Noria, Jul 31, 2006
    #2

  3. Wes Gamble

    Wes Gamble Guest

    Xavier,

    That's interesting.

    In a pure Ruby (Rails) app, I've had to modify regexps to handle the
    \r\n sequence so that my regexps will work in a Windows environment.
    I'm guessing that this is related to the "file follows runtime
    conventions" in your post. Meaning that the file that I'm processing
    (which is actually sourced externally) did not conform to C runtime
    conventions when it was written.

    In general, this seems simple enough to handle, you just allow for
    optional \r \n combinations in your regexp (assuming setting the
    multiline flag for the regexp), like so:

    [^\r\n]*
    [\r\n]*
    (\r*\n*)
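
A small Ruby sketch of the same idea, assuming made-up sample data with mixed line endings. Because `\r` and `\n` are named explicitly, the patterns behave the same with or without the multiline flag, which in Ruby only changes whether `.` matches a newline.

```ruby
# Hypothetical sample: one record per line, mixed line endings.
data = "name=alice\r\nname=bob\nname=carol\r\n"

# Anchored on "$" alone, the CR before "\n" blocks the match on CRLF lines.
strict = data.scan(/^name=(\w+)$/).flatten

# Allowing an optional CR before the end of line accepts both conventions.
tolerant = data.scan(/^name=(\w+)\r?$/).flatten

p strict    # => ["bob"]
p tolerant  # => ["alice", "bob", "carol"]
```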

    Wes



    --
    Posted via http://www.ruby-forum.com/.
    Wes Gamble, Jul 31, 2006
    #3
  4. Wes Gamble

    Wes Gamble Guest

    Wes Gamble, Jul 31, 2006
    #4
  5. Wes Gamble

    Xavier Noria Guest

    On Jul 31, 2006, at 6:15 PM, Charles O Nutter wrote:

    > This has come up in the JRuby project fairly frequently since Java wants to
    > normalize line-terminators internally to the underlying platform, rather
    > than normalizing to \n and handling conversion on read-write. Xavier, are
    > you saying that Ruby has in its IO layer code to convert from CRLF to LF on
    > input/output, and this is the primary means of normalizing newlines? We have
    > had in our bug tracker a patch that resolves JRuby's newline issues in a
    > similar way, but had not committed it pending research into whether this
    > would be appropriate and sufficient.


    If I am not mistaken, in Ruby that is delegated to stdio. After a
    quick code inspection I think the exact point where that is done is
    in the call to write():

    r = write(fileno(f), RSTRING(str)->ptr+offset, l);

    That's in the function io_fwrite(), line 455 of io.c in Ruby 1.8.4.

    In Perl that was delegated to stdio as well until 5.8.0, where the I/O
    layer was replaced with PerlIO, which is now responsible for that
    filtering on CRLF platforms.

    -- fxn
    Xavier Noria, Jul 31, 2006
    #5
  6. Wes Gamble

    Xavier Noria Guest

    On Jul 31, 2006, at 6:23 PM, Wes Gamble wrote:

    > In a pure Ruby (Rails) app, I've had to modify regexps to handle the
    > \r\n sequence so that my regexps will work in a Windows environment.
    > I'm guessing that this is related to the "file follows runtime
    > conventions" in your post. Meaning that the file that I'm processing
    > (which is actually sourced externally) did not conform to C runtime
    > conventions when it was written.


    Yes, that is an important point.

    When we talk about portability as far as newlines are concerned, we are
    assuming the newline conventions of the platform and the data match.
    A portable line-oriented script might fail if it is running on Linux
    processing text files from a FAT32 partition that were generated by
    some Windows program. There are a lot of common situations where
    conventions may not match. A portable line-oriented script is not
    supposed to handle those situations; a robust line-oriented script
    should do something sensible with foreign conventions.

    Web programming is one of them, because you cannot assume anything about
    the input that comes from a text area or an uploaded text file, for
    instance. In that case you had better normalize first (written on the fly):

    normalized_text_area = text_area.gsub(/\015\012/, "\n").gsub(/\015/, "\n")
    # Now normalized_text_area holds the normalized text, and all standard
    # line-oriented idioms will work on it.

    In Ruby we are done because "\n" is "\012" everywhere; in Perl that
    gets slightly more complicated because "\n" is eq "\015" on pre-X
    MacOS. But you see the idea and why you do that.
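
As a rough illustration of what the normalization buys you, here is the two-pass version wrapped in a helper and followed by an ordinary line-oriented idiom. The helper name and the sample input are hypothetical.

```ruby
# Hypothetical helper wrapping the two gsub calls from the post.
def normalize_newlines(text)
  text.gsub(/\015\012/, "\n").gsub(/\015/, "\n")
end

# Hypothetical form input mixing DOS, old Mac, and Unix line endings.
text_area = "dos line\r\nold mac line\runix line\n"

normalized = normalize_newlines(text_area)

# After normalization, splitting into lines needs no special cases.
lines = normalized.lines.map(&:chomp)

p "\n" == "\012"  # => true
p lines           # => ["dos line", "old mac line", "unix line"]
```

The order of the two passes matters: CRLF pairs have to be collapsed first, otherwise the second pass would turn each pair into two newlines.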

    -- fxn (<-- whose article about newlines for O'Reilly is about to
    appear)
    Xavier Noria, Jul 31, 2006
    #6
  7. Wes Gamble

    Xavier Noria Guest

    On Jul 31, 2006, at 7:27 PM, Charles O Nutter wrote:

    > A large part of our problem is that we currently tend to normalize
    > everything to \n....all the time. That has the effect of also writing out \n
    > to the filesystem for newlines, which as you describe above causes problems
    > when trying to re-read. So for the case in question, we run Rails...it
    > generates files with newlines...we normalize those newlines to \n and write
    > such to disk...and then future use of those files (in this case, ERB
    > templates) fails because the newlines aren't handled correctly (i.e. we
    > can't normalize \r\n to \n again because they're already \n on disk).


    If those files are only handled by that application there is no
    problem because \ns are precisely what the script should see.

    For instance, if you pass a Unix text file to a line-oriented script
    running on Windows, the script will work as long as it only reads.
    That's because LFs not following a CR are left untouched by the I/O
    layer, and by a happy coincidence LF is what readline expects. So
    everything works; by chance, but it works.

    The problem is that the application generates text files that do not
    follow the conventions of the platform, and other programs may assume
    they do.
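
One possible way to sidestep the mismatch, sketched in Ruby under the assumption that the program knows which convention its consumers expect: convert explicitly and write in binary mode, so the result does not depend on the host platform's text-mode translation. The helper name, file name prefix, and sample text are hypothetical.

```ruby
require "tempfile"

# Hypothetical helper: force CRLF endings regardless of what the input uses.
# Existing CRLF pairs are matched as a unit, so they are not doubled.
def to_crlf(text)
  text.gsub(/\r\n?|\n/, "\r\n")
end

report = to_crlf("line one\nline two\r\nline three\n")

out = Tempfile.new("crlf_out")   # hypothetical file name prefix
out.close

# Binary mode writes the bytes exactly as given, on any platform.
File.binwrite(out.path, report)
written = File.binread(out.path)

p written  # => "line one\r\nline two\r\nline three\r\n"
```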

    -- fxn
    Xavier Noria, Jul 31, 2006
    #7
  8. Wes Gamble

    Xavier Noria Guest

    On Jul 31, 2006, at 6:54 PM, Xavier Noria wrote:

    > normalized_text_area = text_area.gsub(/\015\012/, "\n").gsub(/\015/, "\n")


    Just for the archives, this normalizes in Ruby with only one pass:

    normalized_text_area = text_area.gsub(/\015\012?/, "\n")

    though it is less explicit. Let me add, while we are at it, that if
    the text is Unicode it may come with a few more codes for newlines.
    All in all this is a PITA, like character encodings, but it is what
    we've got for historical reasons.
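
A hedged sketch of extending the one-pass version to some Unicode line terminators. It assumes the text is UTF-8, and the choice of code points (NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR) is an assumption of this example, not something the post specifies. The constant and method names are hypothetical.

```ruby
# CRLF, lone CR, NEL (U+0085), LINE SEPARATOR (U+2028), PARAGRAPH SEPARATOR (U+2029).
# Which code points count as newlines is an assumption of this sketch.
FOREIGN_NEWLINE = /\r\n?|[\u0085\u2028\u2029]/

# Replace every foreign line terminator with "\n" in a single pass.
# Assumes text is a UTF-8 string.
def normalize_unicode_newlines(text)
  text.gsub(FOREIGN_NEWLINE, "\n")
end

sample = "a\r\nb\rc\u0085d\u2028e\u2029f\n"

p normalize_unicode_newlines(sample)  # => "a\nb\nc\nd\ne\nf\n"
```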

    -- fxn
    Xavier Noria, Jul 31, 2006
    #8
  9. Wes Gamble

    Wes Gamble Guest

    I was thinking about this a little more.

    Why wouldn't JRuby just take advantage of the Java runtime's
    normalization facility in this case, using the JVM's notion of "newline"
    on the particular platform to handle I/O?

    Is the JRuby issue that only _some_ of the code that is doing I/O is
    pure Java and some other set of the code is Ruby so that trying to
    always use the JVM "line separator" concept won't work?

    Wes


    --
    Posted via http://www.ruby-forum.com/.
    Wes Gamble, Jul 31, 2006
    #9
  10. Wes Gamble

    Wes Gamble Guest

    In this particular case, could
    java.lang.System.getProperty("line.separator") be used to handle
    platform-specific reading/writing? That way, you get to piggyback on
    the multiplatform support built into Java. If the low-level I/O code is
    centralized, it seems like this would be the way to go.

    Are there performance implications for this approach? Seems like you
    could just grab all of the system specific newline properties from the
    System object upon the initialization of the JRuby interpreter and just
    refer to them later.

    Wes


    --
    Posted via http://www.ruby-forum.com/.
    Wes Gamble, Jul 31, 2006
    #10
