French sentences appearing weird in Rails Website

Ritvvij Parrikh

I have a Rails app. One of my clients is importing French text, which
is displayed incorrectly. See the example below:

1. str = "--- \nFrench: \"3. Combien de r\\xC3\\xA9gions y a-t-il au Cameroon?\"\nEnglish: 3. How many regions are there in Cameroon?\n"

Can someone assist please?

I am thinking along the following lines:

2. str = str.gsub('"', '')

3. **Need to add a line which replaces \\ in the str above to just
\**

4. str = str.force_encoding("iso-8859-1")

5. str = str.encode('UTF-8')

In step 3, I was thinking of something like

str = str.gsub(/\\\\/, "\\")

Or, if possible, somehow capture the output of puts (or a similar
function) back into str, for example:

---

French: 3. Combien de r\xC3\xA9gions y a-t-il au Cameroon?

English: 3. How many regions are there in Cameroon?

but not even that works. Can someone please assist?
 
Simon Krahnke

* Ritvvij Parrikh said:
I have a Rails app. One of my clients is importing French Text which
is appearing weirdly. Check below example:

1. str = "--- \nFrench: \"3. Combien de r\\xC3\\xA9gions y a-t-il
au Cameroon?\"\nEnglish: 3. How many regions are there in Cameroon?\n"

Can someone assist please?

I am thinking on following lines:

2. str = str.gsub('"', '')

3. **Need to add a line which replaces \\ in the str above to just
\**

4. str = str.force_encoding("iso-8859-1")

No, "\xc3\xa9" is UTF-8, not ISO-8859-1. At least, that makes much more
sense in UTF-8.

5. str = str.encode('UTF-8')

In step 3, I was thinking of something like

str = str.gsub(/\\\\/, "\\")

Yeah.
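
One caveat for readers following along: in the Ruby literal from step 1, `\\x` is a single backslash followed by `x`, so a pattern like `/\\\\/` (two consecutive backslashes) has nothing to match. Below is a minimal sketch of an alternative, assuming the goal is to turn each literal `\xNN` sequence into the byte it names and then tag the result as UTF-8. The helper name `unescape_hex_bytes` is made up for illustration and is not part of Ruby or Rails.

```ruby
# The imported text contains the literal characters  \ x C 3 \ x A 9,
# not the two bytes 0xC3 0xA9.
str = "--- \nFrench: \"3. Combien de r\\xC3\\xA9gions y a-t-il au Cameroon?\"\nEnglish: 3. How many regions are there in Cameroon?\n"

# Hypothetical helper: replace every literal \xNN with the byte NN,
# working on raw bytes, then declare the result to be UTF-8.
def unescape_hex_bytes(text)
  bytes = text.dup.force_encoding(Encoding::BINARY)
  bytes = bytes.gsub(/\\x([0-9A-Fa-f]{2})/) { [$1.hex].pack('C') }
  bytes.force_encoding(Encoding::UTF_8)
end

fixed = unescape_hex_bytes(str)
puts fixed
puts "valid UTF-8? #{fixed.valid_encoding?}"
```

Note that `force_encoding` only relabels the bytes; it is appropriate here because the bytes are already UTF-8, as pointed out above.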

mfg, simon .... l
 
Charles Calvert

I have a Rails app. One of my clients is importing French Text which
is appearing weirdly. Check below example:

1. str = "--- \nFrench: \"3. Combien de r\\xC3\\xA9gions y a-t-il
au Cameroon?\"\nEnglish: 3. How many regions are there in Cameroon?\n"

As Simon said, this text is encoded in UTF-8. You need to process it
as such. Are you using 1.8 or 1.9?

[snip rest]
 
Simon Krahnke

* Charles Calvert said:
On Wed, 15 May 2013 04:30:45 -0700 (PDT), Ritvvij Parrikh


As Simon said, this text is encoded in UTF-8. You need to process it
as such. Are you using 1.8 or 1.9?

Are there versions of 1.8 that support encodings for strings?

mfg, simon .... l
 
Charles Calvert

Are there versions of 1.8 that support encodings for strings?

For file i/o, the only option of which I'm aware is the iconv library
(http://ruby-doc.org/stdlib-1.8.7/libdoc/iconv/rdoc/Iconv.html).

1.9, on the other hand, has built-in support for encoded strings and
conversion for file i/o. Here's some demo code that I wrote for a
talk that I gave on Unicode in Ruby:

#!/usr/bin/env ruby
# encoding: UTF-8

# Write a file. The string literal is UTF-8 because of the magic comment.
File.open('utf8.txt', 'w') do |file|
  puts "Writing a UTF-8 file"
  file.write('Tomás')
  puts ""
end

# Read it back, declaring the external (on-disk) encoding.
File.open('utf8.txt', 'r:UTF-8') do |file|
  puts "Reading the UTF-8 file"
  puts "File external encoding: #{file.external_encoding}"
  puts "File contains:"
  line_count = 1
  file.each_line do |line|
    puts "#{line_count}: #{line}"
    line_count += 1
    puts ""
  end
end

# Read it again, transcoding to UTF-16LE in memory (internal encoding).
File.open('utf8.txt', 'r:UTF-8:UTF-16LE') do |file|
  puts "Reading the UTF-8 file and storing in memory as UTF-16 little endian"
  puts "File external encoding: #{file.external_encoding}"
  puts "File internal encoding: #{file.internal_encoding}"
  puts "In memory representation contains:"
  line_count = 1
  file.each_line do |line|
    puts "#{line_count}: contains #{line.size} characters and #{line.bytesize} bytes in encoding #{line.encoding.name}"
    line_count += 1
  end
  puts ""
end
 
Simon Krahnke

* Charles Calvert said:
For file i/o, the only option of which I'm aware is the iconv library
(http://ruby-doc.org/stdlib-1.8.7/libdoc/iconv/rdoc/Iconv.html).

1.9, on the other hand, has built-in support for encoded strings and
conversion for file i/o. Here's some demo code that I wrote for a
talk that I gave on Unicode in Ruby:

#!/usr/bin/env ruby
# encoding: UTF-8

File.open('utf8.txt', 'w') do |file|
puts "Writing a UTF-8 file"
file.write('Tomás')

That String is UTF-8 because of the default encoding specified in the
encoding magic comment above.

But why is the File written in UTF-8, because of the same reason?

Thanks for the examples.

mfg, simon .... l
 
Charles Calvert

[snip]
1.9, on the other hand, has built-in support for encoded strings and
conversion for file i/o. Here's some demo code that I wrote for a
talk that I gave on Unicode in Ruby:

#!/usr/bin/env ruby
# encoding: UTF-8

File.open('utf8.txt', 'w') do |file|
puts "Writing a UTF-8 file"
file.write('Tomás')

That String is UTF-8 because of the default encoding specified in the
encoding magic comment above.

Correct.

But why is the File written in UTF-8, because of the same reason?

I believe so, though I haven't checked the source to verify.

Thanks for the examples.

You're welcome.
 
Simon Krahnke

* Charles Calvert said:
That String is UTF-8 because of the default encoding specified in the
encoding magic comment above.
Correct.

But why is the File written in UTF-8, because of the same reason?

I believe so, though I haven't checked the source to verify.

But you can make it explicit, like you did for reading, can't you? I
think that would be a good idea, to keep things local. Someone might
change the encoding of the file, and then the file will have a different
encoding. Some other application might try to read the file as UTF-8,
though.

For string literals there is no way to declare the encoding locally.
Let's just hope that whoever changes the encoding doesn't think it
is magically done by just changing the comment.

mfg, simon .... l
 
Charles Calvert

But you can make it explicit, like you did for reading, can't you?

Yes, as well as specifying an in-memory encoding that is different
from the file's encoding on disk.

I think that would be a good idea, to keep things local. Someone
might change the encoding of the file, and then the file will have
a different encoding.

Except that specifying the encoding doesn't transform the data if the
actual encoding is something other than what you specified. Maybe I
misunderstood you.

Some other application might try to read the file as UTF-8, though.

Yes. You have to be careful with encodings. :)

For string literals there is no way to declare the encoding locally.

No, but you can escape them (e.g. "\x00\x50\x00\x65\x00\xF1\x00\x61")
if you need a literal in an encoding other than the default.

Let's just hope that whoever changes the encoding doesn't think it
is magically done by just changing the comment.

True.
 
Simon Krahnke

I've looked through the code, and it looks to me like the default is
Encoding.default_external, which seems to be initialized from the locale,
not the source file's encoding. I can't find a way to get the source
file's encoding from within Ruby.
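
For what it's worth, the `__ENCODING__` keyword reports the encoding of the current source file, and string literals are tagged with it. A small sketch, assuming Ruby 1.9 or later; the variable names are only for illustration.

```ruby
# __ENCODING__ evaluates to the encoding of the current source file
# (set by the magic comment, or the interpreter's default for sources).
source_encoding = __ENCODING__

# A plain string literal is tagged with the source encoding.
literal_encoding = "Tomas".encoding

# The default for I/O is a separate setting, normally derived from the locale.
io_default = Encoding.default_external

puts "source file: #{source_encoding}"
puts "literal:     #{literal_encoding}"
puts "I/O default: #{io_default}"
```

The first two values always agree; the third can differ from them, which is the distinction being discussed here.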

Yes, as well as specifying an in-memory encoding that is different
from the file's encoding on disk.

puts and the like seem to just dump that internal encoding out, right?

Except that specifying the encoding doesn't transform the data if the
actual encoding is something other than what you specified. Maybe I
misunderstood you.

That was based on false premises anyway. The internal encoding doesn't
inform the default encoding of files written, the locale does.

Yes. You have to be careful with encodings. :)

Which, too, should expect the file to be encoded according to what the
locale says.

No, but you can escape them (e.g. "\x00\x50\x00\x65\x00\xF1\x00\x61")
if you need a literal in an encoding other than the default.

But that string will still have an encoding attached to it that says it
is in the file's encoding.

I've seen people on Usenet who seemed to think that.

mfg, simon .... l
 
Charles Calvert

I've looked through the code, and it looks to me like the default is
Encoding.default_external, which seems to be initialized from the locale,
not the source file's encoding. I can't find a way to get the source
file's encoding from within Ruby.

That makes sense from what I've seen. Detecting the encoding of a
file without a BOM is a tricky process, and there are libraries to do
it, so building it into the core seems like overkill.

puts and the like seem to just dump that internal encoding out, right?

The internal encoding of the string, yes.

That was based on false premises anyway. The internal encoding doesn't
inform the default encoding of files written, the locale does.

From my testing, it appears to be the encoding of the string written
to the file, rather than the locale.
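
A quick way to check this is sketched below: it writes the same UTF-8 string twice, once with a bare 'w' mode and once with an explicit external encoding, then compares the bytes on disk. It assumes Encoding.default_internal is left at its default of nil; the file names are arbitrary.

```ruby
require 'tmpdir'

name = "Tom\u00E1s"   # "Tomás" as a UTF-8 string (6 bytes)

plain_bytes, latin_bytes = Dir.mktmpdir do |dir|
  plain = File.join(dir, 'plain.txt')
  latin = File.join(dir, 'latin1.txt')

  # No encoding given: the string's bytes are written unchanged.
  File.open(plain, 'w') { |f| f.write(name) }

  # Explicit external encoding: the string is transcoded on write.
  File.open(latin, 'w:ISO-8859-1') { |f| f.write(name) }

  [File.binread(plain).bytes, File.binread(latin).bytes]
end

puts "bare 'w':         #{plain_bytes.inspect}"
puts "'w:ISO-8859-1':   #{latin_bytes.inspect}"
```

Spelling out the encoding in the mode string is what makes the on-disk format independent of whatever encoding the string happens to carry.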

Which, too, should expect the file to be encoded according to what the
locale says.

I never assume when it comes to user input. :)

But that string will still have an encoding attached to it that says it
is in the file's encoding.

String#force_encoding is useful there.
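
A short sketch of that, assuming the escaped bytes from the earlier example are meant as UTF-16BE (which is how they read) and that the source file itself is UTF-8.

```ruby
# Escaped literal: the bytes are UTF-16BE, but the string is tagged with
# the source file's encoding, so it is mislabeled at this point.
raw = "\x00\x50\x00\x65\x00\xF1\x00\x61"
puts "tagged as:  #{raw.encoding}"
puts "valid?      #{raw.valid_encoding?}"

# force_encoding changes only the label, not the bytes.
text = raw.dup.force_encoding(Encoding::UTF_16BE)
puts "retagged:   #{text.encoding}, valid? #{text.valid_encoding?}"

# encode actually converts the bytes to another encoding.
utf8 = text.encode(Encoding::UTF_8)
puts "as UTF-8:   #{utf8} (#{utf8.bytesize} bytes)"
```

The split matters: use force_encoding when the label is wrong, and encode when the bytes need to change.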
 
