File.new and encoding

Discussion in 'Ruby' started by Achim Domma (SyynX Solutions GmbH), Nov 29, 2005.

  1. Hi,

    I'm still quite new to Ruby, but have written a simple code generator.
    The generator opens some files and combines them into a new one. The
    resulting file is encoded as ISO-8859-1, but it looks like Ruby writes
    a UTF-8 marker to the beginning of the file. Is that possible?

    How can I tell Ruby which encoding to use when I write to text files?

    Any pointers to documentation are welcome, but I didn't find anything
    useful using Google.

    regards,
    Achim
     
    Achim Domma (SyynX Solutions GmbH), Nov 29, 2005
    #1

  2. Achim Domma (SyynX Solutions GmbH) wrote:
    > Hi,
    >
    > I'm still quite new to Ruby, but have written a simple code generator.
    > The generator opens some files and combines them into a new one. The
    > resulting file is encoded as ISO-8859-1, but it looks like Ruby writes
    > a UTF-8 marker to the beginning of the file. Is that possible?


    What's a UTF-8 marker? I only know the two-byte UTF-16 marker, but AFAIK
    there is no marker for UTF-8. Did I miss something?

    > How can I tell Ruby which encoding to use when I write to text files?
    >
    > Any pointers to documentation are welcome, but I didn't find
    > anything useful using Google.


    Encoding is not an easy issue with Ruby. I guess by default it uses the
    default encoding of your environment, but you can specify certain
    (Japanese) encodings with the command line option -K. HTH
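
    For later readers, a minimal sketch of choosing the output encoding
    explicitly. It assumes Ruby 1.9 or later, where the mode string passed
    to File.open may carry an encoding name; the 1.8 series that was current
    when this thread was written has no such option. The file name and the
    sample text are made up for illustration.

```ruby
require "tmpdir"

# Assumption: Ruby 1.9+. "w:ISO-8859-1" sets the file's external encoding,
# so strings are transcoded to ISO-8859-1 as they are written.
path = File.join(Dir.tmpdir, "encoding_demo.txt")  # hypothetical file name

File.open(path, "w:ISO-8859-1") do |target|
  target.write("Gr\u00FC\u00DFe")  # a UTF-8 string containing two non-ASCII letters
end

# Read the raw bytes back: each non-ASCII letter occupies a single byte.
bytes = File.binread(path).bytes
File.delete(path)
```

    No byte order mark is added in this mode; the file holds only the
    transcoded text.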

    Kind regards

    robert
     
    Robert Klemme, Nov 29, 2005
    #2

  3. Hi,

    At Wed, 30 Nov 2005 00:17:29 +0900,
    Robert Klemme wrote in [ruby-talk:167988]:
    > > I'm still quite new to Ruby, but have written a simple code generator.
    > > The generator opens some files and combines them into a new one. The
    > > resulting file is encoded as ISO-8859-1, but it looks like Ruby writes
    > > a UTF-8 marker to the beginning of the file. Is that possible?
    >
    > What's a UTF-8 marker? I only know the two-byte UTF-16 marker, but AFAIK
    > there is no marker for UTF-8. Did I miss something?


    That would be a UTF-8 encoded BOM (byte order mark), but Ruby itself
    never writes one automatically.
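
    To make "UTF-8 encoded BOM" concrete, here is a small sketch that prints
    the bytes in question. It assumes Ruby 1.9 or later for the \u escape
    and String#encode; the variable names are only illustrative.

```ruby
# U+FEFF is the byte order mark; encoded as UTF-8 it becomes three bytes.
bom = "\uFEFF"
bom_bytes = bom.encode("UTF-8").bytes

# Print the bytes in hexadecimal.
puts bom_bytes.map { |b| format("%02X", b) }.join(" ")
# prints "EF BB BF"
```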

    > > How can I tell ruby which encoding to use, if I write to textfiles?


    Can't you show the code?

    --
    Nobu Nakada
     
    Nobu Nakada, Nov 29, 2005
    #3
  4. Nobu Nakada wrote:

    > That would be a UTF-8 encoded BOM (byte order mark), but Ruby itself
    > never writes one automatically.

    [...]
    > Can't you show the code?


    While trying to reproduce the problem in a smaller example, I figured
    out that I'm reading the BOM from one of my source files. Sorry for the
    confusion. I'm doing something like:

    File.open("target", "w") do |target|
      File.open("source", "r") do |source|
        source.each_line do |line|
          # ... some processing ...
          target.write(line)
        end
      end
    end


    The source file seems to contain the BOM, and it is written to target.
    Any hints on how to strip the BOM?
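
    One way this could be done, sketched at the byte level: check whether
    the first line starts with the three BOM bytes and drop them before
    writing. The helper name copy_without_bom is hypothetical, StringIO
    objects stand in for the real files, and String#b assumes Ruby 2.0 or
    later.

```ruby
require "stringio"

UTF8_BOM = "\xEF\xBB\xBF".b.freeze  # the three BOM bytes as a binary string

# Hypothetical helper: copy lines from source to target, dropping a
# UTF-8 BOM if the first line starts with one.
def copy_without_bom(source, target)
  source.each_line.with_index do |line, index|
    if index.zero? && line.b.start_with?(UTF8_BOM)
      line = line.byteslice(3..-1)  # skip the three BOM bytes
    end
    # ... some processing ...
    target.write(line)
  end
end

# StringIO objects stand in for the "source" and "target" files.
source = StringIO.new(UTF8_BOM + "first line\nsecond line\n")
target = StringIO.new(String.new)
copy_without_bom(source, target)
result = target.string
```

    From Ruby 1.9 on, opening the file with the mode "r:BOM|UTF-8" should
    also skip a leading BOM, provided the source really is UTF-8.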

    regards,
    Achim
     
    Achim Domma (SyynX Solutions GmbH), Nov 29, 2005
    #4
  5. Achim Domma (SyynX Solutions GmbH) wrote:

    > I'm doing something like:
    >
    > File.open("target", "w") do |target|
    >   File.open("source", "r") do |source|
    >     source.each_line do |line|
    >       # ... some processing ...
    >       target.write(line)
    >     end
    >   end
    > end


    Have you looked at 'iconv' in the standard library?

    http://www.ruby-doc.org/stdlib/libdoc/iconv/rdoc/classes/Iconv.html

    Assuming all your input files were ISO-8859-1, and you wanted your output file in UTF-8, your example might look something like (untested):

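    require 'iconv'

    File.open("target", "w") do |target|
      # Convert from ISO-8859-1 (input) to UTF-8 (output).
      Iconv.open('UTF-8', 'ISO-8859-1') do |converter|
        File.open("source", "r") do |source|
          source.each_line do |line|
            # ... some processing ...
            target.write(converter.iconv(line))
          end
        end
        # Passing nil flushes any state the converter is still holding.
        target << converter.iconv(nil)
      end
    end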

    Iconv should deal with BOMs, stripping them out or adding them in where necessary. I'm not sure whether it will complain if it finds a BOM mid-stream (as you open your second and subsequent input files). If so, you could just instantiate a new Iconv to deal with each input.
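
    For readers on newer Rubies: Iconv was removed from the standard
    library in Ruby 2.0, so the snippet above needs the separate iconv gem
    there. A rough equivalent using String#encode (Ruby 1.9 or later) is
    sketched below. It is a swapped-in technique, not the Iconv API; the
    helper name combine_as_utf8 is hypothetical, StringIO objects stand in
    for the input files, and every input is assumed to be ISO-8859-1.

```ruby
require "stringio"

# Hypothetical helper: append every ISO-8859-1 input to target as UTF-8.
def combine_as_utf8(sources, target)
  sources.each do |source|
    source.each_line do |line|
      # Tag the raw bytes as ISO-8859-1, then transcode them to UTF-8.
      utf8_line = line.force_encoding("ISO-8859-1").encode("UTF-8")
      # ... some processing ...
      target.write(utf8_line)
    end
  end
end

# StringIO objects stand in for two ISO-8859-1 encoded input files.
inputs = [
  StringIO.new("caf\xE9\n".b),   # 0xE9 is e-acute in ISO-8859-1
  StringIO.new("stra\xDFe\n".b)  # 0xDF is sharp s in ISO-8859-1
]
target = StringIO.new(String.new(encoding: "UTF-8"))
combine_as_utf8(inputs, target)
combined = target.string
```

    Each line is converted on its own, so no converter state carries over
    from one input to the next. ISO-8859-1 itself has no BOM, so an input
    that starts with the bytes EF BB BF is more likely UTF-8 already and
    would need its BOM stripped separately.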

    HTH
    alex
     
    Alex Fenton, Nov 29, 2005
    #5