Ruby 1.9: mixed encoding in file

Discussion in 'Ruby' started by Vít Ondruch, Aug 7, 2009.

  1. Hello

    I wonder if it is possible to enforce the encoding of a string in Ruby 1.9.
    Let's say I have the following example:

    C:\enc>echo p 'test'.encoding > encoding.rb
    C:\enc>ruby encoding.rb
    #<Encoding:US-ASCII>

    That's fine. But what if I'd like to have ASCII, UTF-8, and strings
    with other encodings in a single file, e.g.:

    C:\enc>echo p 'zufällige_žluťoučký'.encoding > encoding.rb
    C:\enc>ruby encoding.rb
    encoding.rb:1: invalid multibyte char (US-ASCII)

    I know that for this particular case I could use the encoding directive
    at the top of the file, but I would like to see something along the
    following lines:

    String.new 'zufällige_žluťoučký', Encoding.CP852

    That is: read the content between the quotes as binary and interpret it
    according to the specified encoding.

    Vit
    --
    Posted via http://www.ruby-forum.com/.
     
    Vít Ondruch, Aug 7, 2009
    #1

  2. James Gray (Guest)

    On Aug 7, 2009, at 8:49 AM, Vít Ondruch wrote:

    > Hello


    Hello.

    > I wonder if it is possible to enforce encoding of string in ruby 1.9.
    > Let say I have following example:
    >
    > C:\enc>echo p 'test'.encoding > encoding.rb
    > C:\enc>ruby encoding.rb
    > #<Encoding:US-ASCII>
    >
    > Thats fine. But what if I like to have in single file ASCII, UTF-8 or
    > strings with other encodings, i.e.
    >
    > C:\enc>echo p 'zufällige_žluťoučký'.encoding > encoding.rb
    > C:\enc>ruby encoding.rb
    > encoding.rb:1: invalid multibyte char (US-ASCII)
    >
    > I know that for this particular case I could use directive on top of
    > the file, but I would like to see something in following manner:
    >
    > String.new 'zufällige_žluťoučký', Encoding.CP852
    >
    > It means read the content in between quotes binary and interpret it
    > according to specified encoding.


    The problem with an idea like this is that before your String is ever
    created, the code to create it must be read (correctly) by Ruby's
    parser and formed into a proper String literal. That would be
    impossible to do if String literals could be in any random Encoding.

    You have a couple of options though:

    * Just set an Encoding like UTF-8 for the source code, enter
    everything in UTF-8, and transcode it into the needed Encoding. This
    would make your example something like:

    # encoding: UTF-8
    cp852 = "zufällige_žluťoučký".encode("CP852") # literal in UTF-8

    * Have one or more data files the program reads needed String objects
    from. Those files can be in any Encoding you need and you can specify
    it to IO operations, so your String objects are returned with that
    Encoding.
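    The second option can be sketched in a few lines. This is only an
    illustration; the temp-file name and the CP852 sample content below
    are mine, not from the thread:

```ruby
# encoding: UTF-8
require "tmpdir"

# Write a sample data file in CP852 (binwrite avoids any transcoding on output).
path = File.join(Dir.tmpdir, "strings.cp852")
File.binwrite(path, "zufällige_žluťoučký".encode("CP852"))

# Read it back, telling IO the external encoding; the String is tagged CP852.
word = File.read(path, encoding: "CP852")
p word.encoding            # => #<Encoding:CP852>

# Or transcode on the way in with an "external:internal" encoding pair.
word = File.read(path, encoding: "CP852:UTF-8")
p word.encoding            # => #<Encoding:UTF-8>
```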

    I hope that helps.

    James Edward Gray II
     
    James Gray, Aug 7, 2009
    #2

  3. James Gray wrote:
    > On Aug 7, 2009, at 8:49 AM, Vít Ondruch wrote:
    >
    >> Hello

    >
    > Hello.
    >
    >> C:\enc>echo p 'zufällige_žluťoučký'.encoding > encoding.rb
    >> according to specified encoding.

    > The problem with an idea like this is that before your String is ever
    > created the code to create it must be read (correctly) by Ruby's
    > parser and formed into a proper String literal. That would be
    > impossible to do if String literals could be in any random Encoding.


    Yes, I understand that you have to parse the file. However, if I am
    right, you still have to read the file as binary when you are looking
    for an encoding directive at the top of the file. So from my point of
    view, it shouldn't be a big problem to read up to the first quote,
    assuming the file is stored in the encoding declared at the top of the
    file, then read whatever is between the quotes as binary and decide
    later how to interpret that binary data, based on the encoding given
    in the second parameter of the string constructor.

    >
    > You have a couple of options though:
    >
    > * Just set an Encoding like UTF-8 for the source code, enter
    > everything in UTF-8, and transcode it into the needed Encoding. This
    > would make your example something like:
    >
    > # encoding: UTF-8
    > cp852 = "zufällige_žluťoučký".encode("CP852") # literal in
    > UTF-8
    >
    > * Have one or more data files the program reads needed String objects
    > from. Those files can be in any Encoding you need and you can specify
    > it to IO operations, so your String objects are returned with that
    > Encoding.


    Both your suggestions are valid of course, but I consider them
    solutions far from ideal. They bring far more complexity than desired.

    >
    > I hope that helps.
    >
    > James Edward Gray II


    Of course my idea could be considered naive and there might be many
    technical issues with the parser, etc. which prevent the
    implementation. Nevertheless, it would be a nice feature.

    Thank you for your suggestions anyway.

    Vit
     
    Vít Ondruch, Aug 7, 2009
    #3
  4. James Gray (Guest)

    On Aug 7, 2009, at 9:47 AM, Vít Ondruch wrote:

    > James Gray wrote:
    >> On Aug 7, 2009, at 8:49 AM, Vít Ondruch wrote:
    >>
    >>> Hello

    >>
    >> Hello.
    >>
    >>> C:\enc>echo p 'zufällige_žluťoučký'.encoding > encoding.rb
    >>> according to specified encoding.

    >> The problem with an idea like this is that before your String is ever
    >> created the code to create it must be read (correctly) by Ruby's
    >> parser and formed into a proper String literal. That would be
    >> impossible to do if String literals could be in any random Encoding.

    >
    > Yes, I understand that you have to parse the file. However, if I am
    > right, you still have to read the file binary in case you are looking
    > for some encoding directive on top of file.


    You don't really have to:

    $ cat source_encoding.rb
    # encoding: UTF-8

    output = ""
    open(__FILE__, "r:US-ASCII") do |source|
      first_line = source.gets
      if first_line =~ /coding:\s*(\S+)/
        source.set_encoding($1)
      else
        output << first_line
      end
      output << source.read
    end
    p [output.encoding, output[0...20] + "…"]
    $ ruby_dev source_encoding.rb
    [#<Encoding:UTF-8>, "\noutput = \"\"\nopen(__…"]

    James Edward Gray II
     
    James Gray, Aug 7, 2009
    #4
  5. James Gray wrote:
    > On Aug 7, 2009, at 9:47 AM, Vít Ondruch wrote:
    >
    > You don't really have to:
    >


    It is disturbing that this approach will fail as soon as the file is
    UTF-16 encoded or has a BOM for UTF-8, etc.

    Vit
     
    Vít Ondruch, Aug 7, 2009
    #5
  6. James Gray (Guest)

    On Aug 7, 2009, at 10:20 AM, Vít Ondruch wrote:

    > James Gray wrote:
    >> On Aug 7, 2009, at 9:47 AM, Vít Ondruch wrote:
    >>
    >> You don't really have to:
    >>

    >
    > It is disturbing that this approach will fail as soon as the file is
    > UTF-16 encoded or it has BOM for UTF-8, etc.


    You are not allowed to set the source encoding to a non-ASCII
    compatible encoding, if memory serves. That eliminates any issues
    with encodings like UTF-16. This makes perfect sense as there's no
    way to reliably support the magic encoding comment unless we can count
    on being able to read at least that far.

    A BOM could be handled similarly to what I showed. You need to open
    the file in ASCII-8BIT and check the beginning bytes, then you could
    switch to US-ASCII and finish reading the first line (or to the second
    if a shebang line is included), then switch encodings again if needed
    and finish processing.
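    That BOM check might look roughly like this; the helper name and the
    table of recognized BOMs below are an illustration of the idea, not
    Ruby's actual implementation:

```ruby
# Map leading byte sequences (BOMs) to encoding names. The .b calls force
# ASCII-8BIT so the comparisons below are byte-for-byte.
BOMS = {
  "\xEF\xBB\xBF".b => "UTF-8",
  "\xFF\xFE".b     => "UTF-16LE",
  "\xFE\xFF".b     => "UTF-16BE",
}

def sniff_encoding(path)
  head = File.open(path, "rb") { |f| f.read(3).to_s }  # binary read of the first bytes
  BOMS.each { |bom, name| return name if head.start_with?(bom) }
  "US-ASCII"  # no BOM: fall back and scan the first line for a magic comment
end
```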

    James Edward Gray II
     
    James Gray, Aug 7, 2009
    #6
  7. > You are not allowed to set the source encoding to a non-ASCII
    > compatible encoding, if memory serves.


    Where is it documented please?

    > That eliminates any issues
    > with encodings like UTF-16. This makes perfect sense as there's no
    > way to reliably support the magic encoding comment unless we can count
    > on being able to read at least that far.


    I should say that XML parsers can handle such cases, i.e. when the
    XML header is in a different encoding than the rest of the document.

    > A BOM could be handled similarly to what I showed. You need to open
    > the file in ASCII-8BIT and check the beginning bytes, then you could
    > switch to US-ASCII and finish reading the first line (or to the second
    > if a shebang line is includes), then switch encodings again if needed
    > and finish processing.


    Maybe this technique could be used for reading UTF-16 encoded files,
    if needed? However, this is too far from my initial post :)

    >
    > James Edward Gray II


    Vit
     
    Vít Ondruch, Aug 7, 2009
    #7
  8. James Gray (Guest)

    On Aug 7, 2009, at 10:41 AM, Vít Ondruch wrote:

    >> You are not allowed to set the source encoding to a non-ASCII
    >> compatible encoding, if memory serves.

    >
    > Where is it documented please?


    I'm not sure it's officially documented yet.

    Ruby does throw an error in this scenario though:

    $ ruby_dev
    # encoding: UTF-16BE
    ruby_dev: UTF-16BE is not ASCII compatible (ArgumentError)

    and:

    $ ruby_dev -e 'puts "\uFEFF# encoding: UTF-16BE".encode("UTF-16BE")' | ruby_dev
    -:1: invalid multibyte char (UTF-8)

    I believe this is the relevant code from Ruby's parser:

    static void
    parser_set_encode(struct parser_params *parser, const char *name)
    {
        int idx = rb_enc_find_index(name);
        rb_encoding *enc;

        if (idx < 0) {
            rb_raise(rb_eArgError, "unknown encoding name: %s", name);
        }
        enc = rb_enc_from_index(idx);
        if (!rb_enc_asciicompat(enc)) {
            rb_raise(rb_eArgError, "%s is not ASCII compatible", rb_enc_name(enc));
        }
        parser->enc = enc;
    }

    >> That eliminates any issues
    >> with encodings like UTF-16. This makes perfect sense as there's no
    >> way to reliably support the magic encoding comment unless we can
    >> count on being able to read at least that far.

    >
    > Needed to say that XML parsers can handle such cases, i.e. when xml
    > header is in different encoding than the rest of document.


    I doubt we can say that universally. :)

    Also, what you said isn't very accurate. For example, "in a different
    encoding than the rest of the document" is not a possible occurrence
    according to the XML 1.1 specification
    (http://www.w3.org/TR/2006/REC-xml11-20060816/) which states:

    "It is a fatal error when an XML processor encounters an entity with
    an encoding that it is unable to process. It is a fatal error if an
    XML entity is determined (via default, encoding declaration, or
    higher-level protocol) to be in a certain encoding but contains byte
    sequences that are not legal in that encoding."

    All XML parsers are required to assume UTF-8 unless told otherwise and
    to be able to recognize UTF-16 by a required BOM. Beyond that, they
    are not required to recognize any other encodings, though they may of
    course. Their encoding declaration can be expressed in ASCII and,
    since they assume UTF-8 by default, this is similar to what Ruby
    does. It allows a switch to an ASCII-compatible encoding.

    XML processors may do more. For example, they can accept a different
    encoding from an external source to support things like HTTP headers
    and MIME types. Ruby doesn't really have access to such sources at
    execution time, so that option doesn't apply to the case we are
    discussing. However, XML processors may also recognize other BOMs
    and Ruby could do this.

    >> A BOM could be handled similarly to what I showed. You need to open
    >> the file in ASCII-8BIT and check the beginning bytes, then you could
    >> switch to US-ASCII and finish reading the first line (or to the
    >> second if a shebang line is included), then switch encodings again
    >> if needed
    >> and finish processing.

    >
    > May be this technique could be used for reading UTF-16 encoded
    > files, if needed?


    Yes, Ruby could recognize BOMs for non-ASCII compatible encodings to
    support them. A BOM would be required in this case though, just as it
    is in an XML processor that doesn't have external information.

    Ruby doesn't currently do this, as near as I can tell.

    Note that this would not give what you proposed in your initial
    message: multiple encodings in the same file. Ruby doesn't support
    that and isn't ever likely to. An XML processor that supports such
    things is in violation of its specification as I understand it.

    Besides, not many text editors that I'm aware of make it super easy to
    edit in multiple encodings. :)

    James Edward Gray II
     
    James Gray, Aug 7, 2009
    #8
    On 8/7/09, Vít Ondruch <> wrote:
    > file, but I would like to see something in following manner:
    >
    > String.new 'zufällige_žluťoučký', Encoding.CP852


    You seem to be asking for the ability to have individual string
    literals have encoding different from that of the program as a whole.
    Why not this:

    #encoding: ascii-8bit
    'zufällige_žluťoučký'.force_encoding 'cp852'
    'some utf8 data'.force_encoding 'utf-8'
    'some sjis data'.force_encoding 'sjis'

    I am far from an expert on encodings, but in my (admittedly minimalist
    and perhaps inadequate) testing, this seems to basically work.
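    One way to do that sort of testing is String#valid_encoding?, which
    reports whether the bytes form a legal sequence in the forced
    encoding. A small sketch (the sample bytes here are mine):

```ruby
# Force CP852 onto raw bytes; 0x84 is a defined CP852 character (ä),
# so the result is a valid CP852 string.
bytes = "zuf\x84llige".dup.force_encoding("CP852")
p bytes.valid_encoding?   # => true

# The same check catches sequences that are illegal in the target
# encoding, e.g. a UTF-8 lead byte without its continuation byte.
bad = "\xC3\x28".dup.force_encoding("UTF-8")
p bad.valid_encoding?     # => false
```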

    There are going to be holes in this; data in non-ASCII compatible
    encodings in particular may give trouble. However, if the string data
    does not contain the bytes 0x27 (ascii ') or 0x5C (ascii \) there will
    be no problem. Whether this will work in particular circumstances
    given a known encoding and data to be represented in it is unknown in
    general, but surely very often the case. If it's the single quote
    character that causes the problem, you can switch to a different
    character using the %q[] quote syntax. In extremis, a single quoted
    here document may be called for:

    <<-'end'
    lotsa ' and \ here, but ruby don't care
    end

    This form of string has the advantage of having no special characters
    at all, and you can choose the sequence of bytes that makes up the
    string terminator to be anything you want. (but you do end up with an
    extra (ascii) newline at the end...)

    Another challenge will be editing this file. There's no editor out
    there that could actually display this kind of thing correctly; you'll
    have to become proficient at editing it as binary, or at least find an
    editor that can tolerate arbitrary binary chars in its ascii.
     
    Caleb Clausen, Aug 7, 2009
    #9
  10. Vít Ondruch wrote:
    > I know that for this particular case I could use directive on top of the
    > file, but I would like to see something in following manner:
    >
    > String.new 'zufällige_žluťoučký', Encoding.CP852


    It's not pretty, but

    str = "zuf\x84llige_\xA7lu\x9Cou\x9Fk\xEC".force_encoding("CP852")

    will probably do the job.
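    Those escapes need not be computed by hand. Here is one possible way
    to derive them from a readable UTF-8 literal; the helper name is made
    up for illustration:

```ruby
# encoding: UTF-8

# Transcode a UTF-8 literal to CP852, then render each non-ASCII byte
# as a \xNN escape suitable for pasting into a double-quoted literal.
def cp852_escapes(utf8_string)
  utf8_string.encode("CP852").bytes.map { |b|
    b < 0x80 ? b.chr : format('\x%02X', b)
  }.join
end

escaped = cp852_escapes("zufällige_žluťoučký")
puts escaped
```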
     
    Brian Candler, Aug 7, 2009
    #10
  11. Caleb Clausen wrote:
    > On 8/7/09, Vít Ondruch <> wrote:
    >> file, but I would like to see something in following manner:
    >>
    >> String.new 'zufällige_žluťoučký', Encoding.CP852

    >
    > You seem to be asking for the ability to have individual string
    > literals have encoding different from that of the program as a whole.
    > Why not this:
    >
    > #encoding: ascii-8bit
    > 'zufällige_žluťoučký'.force_encoding 'cp852'
    > 'some utf8 data'.force_encoding 'utf-8'
    > 'some sjis data'.force_encoding 'sjis'


    Hmmm, that is a good idea!!!

    Which leads me to the question: why is the default encoding US-ASCII
    instead of ASCII-8BIT?

    > Another challenge will be editing this file. There's no editor out
    > there that could actually display this kind of thing correctly; you'll
    > have to become proficient at editing it as binary, or at least find an
    > editor than can tolerate arbitrary binary chars in its ascii.


    It's almost the same challenge if you want to edit a single file in a
    different encoding than your system encoding, so it's not really
    relevant; on the contrary, it could be even easier, because in my
    case I don't care much about the content, since I need more encodings
    for testing.
     
    Vít Ondruch, Aug 7, 2009
    #11
