How to detect string charset

Simone Carletti

Hi list,
I ran a deep search through this group and other resources online, but
I have been unable to find whether there is a way to guess the charset
of a string in Ruby 1.8.6.

I need to ensure a string is always UTF-8 encoded, but Iconv requires
the developer to specify both the input and output charsets.
On the other hand, Kconv provides a #guess() method but doesn't
support Latin or Western encodings.
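
Concretely, the mismatch looks like this (a sketch against the 1.8
stdlib; the sample string is just a placeholder):

  require 'iconv'
  require 'kconv'

  # Iconv is fine when the source charset is already known:
  Iconv.iconv('UTF-8', 'ISO-8859-1', "caf\xE9").join  # => "café"

  # Kconv can guess, but only among ASCII/JIS/EUC/SJIS/UTF-8;
  # for Latin input it will never answer "ISO-8859-1":
  Kconv.guess("caf\xE9")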

Any suggestion?
 
Xavier Noria

> Hi list,
> I ran a deep search through this group and other resources online, but
> I have been unable to find whether there is a way to guess the charset
> of a string in Ruby 1.8.6.
>
> I need to ensure a string is always UTF-8 encoded, but Iconv requires
> the developer to specify both the input and output charsets.
> On the other hand, Kconv provides a #guess() method but doesn't
> support Latin or Western encodings.

The best way is to be aware of the charset at every data I/O boundary
and do the necessary housekeeping.

If that's not possible, for example when working with arbitrary text
files, the best approximation I am aware of in Ruby is the charguess
library.

-- fxn
 
Austin Ziegler

> I ran a deep search through this group and other resources online, but
> I have been unable to find whether there is a way to guess the charset
> of a string in Ruby 1.8.6.
>
> I need to ensure a string is always UTF-8 encoded, but Iconv requires
> the developer to specify both the input and output charsets.
> On the other hand, Kconv provides a #guess() method but doesn't
> support Latin or Western encodings.
>
> Any suggestion?

Kconv can guess because the encodings for the set of Asian written
languages are distinctive (they don't share much with the Latin
character set). What you want is nearly impossible without a
large body of text for analysis, and even then the best commercial
programs are taking stabs at probabilities. (Here's an example: how do
you tell the difference between ISO-8859-1 and ISO-8859-15
programmatically? IIRC, the only difference between them is that -15
supports the Euro symbol, replacing a different symbol from -1.)

You're better off seeking a slightly different approach.

-austin
 
Simone Carletti

> Kconv can guess because the encodings for the set of Asian written
> languages are distinctive (they don't share much with the Latin
> character set). What you want is nearly impossible without a
> large body of text for analysis, and even then the best commercial
> programs are taking stabs at probabilities. (Here's an example: how do
> you tell the difference between ISO-8859-1 and ISO-8859-15
> programmatically? IIRC, the only difference between them is that -15
> supports the Euro symbol, replacing a different symbol from -1.)
>
> You're better off seeking a slightly different approach.
>
> -austin

If I'm right, both ISO-8859-1 and ISO-8859-15 belong to the Latin
family, so I can convert them the same way using
Iconv.iconv('UTF-8', 'LATIN1', 'a string').join.

My goal is not to detect each single charset, but to convert all
strings from an input into UTF-8.


In the meantime I was reading the code of rFeedParser, the Ruby
implementation of Python's FeedParser.
I just discovered it depends on a project called https://rubyforge.org/projects/rchardet/

I gave it a look, and it seems to do exactly what I was looking for.
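
If it behaves like the Python original, usage would be roughly this
(a sketch; the sample string and the confidence threshold are my own):

  require 'rubygems'
  require 'rchardet'
  require 'iconv'

  input = "Un caff\xE8, per favore"  # ISO-8859-1 bytes of unknown origin
  guess = CharDet.detect(input)      # => {'encoding' => ..., 'confidence' => ...}

  if guess['encoding'] && guess['confidence'].to_f > 0.6  # threshold is my choice
    utf8 = Iconv.iconv('UTF-8', guess['encoding'], input).join
  end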

Is anyone using this library?
 
Lionel Bouton

Simone said:
> If I'm right, both ISO-8859-1 and ISO-8859-15 belong to the Latin
> family, so I can convert them the same way using
> Iconv.iconv('UTF-8', 'LATIN1', 'a string').join.

You'll probably lose the € (euro) sign from ISO-8859-15 sources, as
LATIN1 is simply an alias for ISO-8859-1.
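
A quick sketch of the failure mode (byte 0xA4 is the Euro sign in
ISO-8859-15 but the generic currency sign in ISO-8859-1):

  require 'iconv'

  euro = "\xA4"  # the Euro sign, as encoded by ISO-8859-15

  Iconv.iconv('UTF-8', 'ISO-8859-15', euro).join  # => "€"
  Iconv.iconv('UTF-8', 'LATIN1', euro).join       # => "¤" -- no error raised,
                                                  #    just silently wrong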
> My goal is not to detect each single charset, but to convert all
> strings from an input into UTF-8.

In fact it's the same problem: if you don't know the original charset,
you can't convert properly to UTF-8.
> In the meantime I was reading the code of rFeedParser, the Ruby
> implementation of Python's FeedParser.
> I just discovered it depends on a project called https://rubyforge.org/projects/rchardet/
>
> I gave it a look, and it seems to do exactly what I was looking for.
>
> Is anyone using this library?

I use chardet 0.9.0; I believe the two work more or less the same.

I use it as a fallback mechanism when I can't reliably get the original
charset from feeds. Some feeds actually claim to be UTF-8 encoded but
contain invalid byte sequences (your database isn't happy when you try
to feed it something like that...). It becomes a mess when you find out
that each item in a feed may use a different charset, because people
aggregate different sources without checking their charsets themselves...

The behavior I'm using is (see the sketch below):
1/ Try the advertised charset with Iconv.iconv('UTF-8', charset, data),
   even if charset =~ /^utf-?8$/i.
   Succeeds? -> done.
   Fails (raises an exception)? -> continue.
2/ Use chardet to guess the charset.
3/ Convert with Iconv.iconv('UTF-8', guessed_charset, data).
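
Something along these lines, here written against rchardet's
CharDet.detect (a sketch; the method name and the bare error handling
are illustrative, not my production code):

  require 'rubygems'
  require 'rchardet'
  require 'iconv'

  # Convert +data+ to UTF-8, trusting the advertised charset first and
  # falling back to a chardet guess when the conversion blows up.
  def to_utf8(data, advertised)
    Iconv.iconv('UTF-8', advertised, data).join
  rescue Iconv::Failure
    guess = CharDet.detect(data)['encoding']
    raise "no usable charset guess" unless guess
    Iconv.iconv('UTF-8', guess, data).join
  end

  to_utf8("caf\xE9", 'UTF-8')  # the advertised charset lies; falls back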

Good luck, you're in for a lot of pain...

Lionel
 
Simone Carletti

> I use it as a fallback mechanism when I can't reliably get the original
> charset from feeds.

That's a great example, thank you.
Unfortunately I don't have a real charset header to check. :( I must
rely only on the input string.


> Good luck, you're in for a lot of pain...
>
> Lionel

Thanks, Lionel! :D
 
Michal Suchanek

> That's a great example, thank you.
> Unfortunately I don't have a real charset header to check. :( I must
> rely only on the input string.

You can ask a crystal ball as well.

The multibyte encodings can often be distinguished by their structure
- UTF-8, perhaps UTF-16, the Asian encodings. If something passes for
a valid string in a multibyte encoding, it very likely is a string in
that encoding.
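
For instance, a cheap structural check for UTF-8 in 1.8 is to round-trip
the string through Iconv and see whether it raises (a sketch):

  require 'iconv'

  # True if +str+ is structurally valid UTF-8: Iconv raises on any byte
  # sequence that doesn't follow the UTF-8 encoding rules.
  def valid_utf8?(str)
    Iconv.iconv('UTF-8', 'UTF-8', str)
    true
  rescue Iconv::Failure
    false
  end

  valid_utf8?("caf\xC3\xA9")  # => true  (well-formed UTF-8)
  valid_utf8?("caf\xE9")      # => false (a bare Latin-1 byte)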

However, the Latin 8-bit encodings are all the same - 7-bit ASCII with
some mess attached in the upper 128 positions. By converting from any
of these you get perfectly valid UTF-8, but different gibberish each
time. You can sometimes tell the ISO variants from the Windows variants
because some control characters sit at different positions - and those
should not appear in text. But that does not help you at all - you
still don't know which of the Latin encodings you got.

If you know the language (and it's one of the few supported ones), you
can use enca. If the language is not supported, you can build the filter
yourself: collect the set of accented characters (those with the 8th bit
set) used in your language and encode them in the different candidate
encodings (the DOS and Windows codepages, the ISO encoding, any other
legacy encodings). You get sets of bytes that usually overlap but
contain some unique bytes; when you see one of those bytes, you know
which encoding you are dealing with (see the sketch below).
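
A rough illustration of that filter (the character list and the
candidate encodings are my own assumptions, picked for a Western
European language):

  require 'iconv'

  candidates = ['ISO-8859-1', 'ISO-8859-15', 'CP1252']
  accented   = ["\xC3\xA8", "\xC3\xA9", "\xE2\x82\xAC"]  # è, é, € in UTF-8

  # Map each encoding to the bytes those characters become in it.
  sets = {}
  candidates.each do |enc|
    sets[enc] = accented.map { |ch|
      Iconv.iconv(enc, 'UTF-8', ch).join rescue nil  # skip unrepresentable chars
    }.compact
  end

  # A byte unique to one candidate identifies it when seen in the input:
  # here only ISO-8859-15 maps the Euro to 0xA4, only CP1252 maps it to 0x80.
  candidates.each do |enc|
    others = (candidates - [enc]).map { |e| sets[e] }.flatten
    p [enc, sets[enc] - others]
  end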

Good luck :)

Michal
 
