How to detect string charset

Simone Carletti

Hi list,
I ran a deep search through this group and other resources online, but
I have been unable to find whether there is a way to guess the charset
of a string in Ruby 1.8.6.

I need to ensure a string is always UTF-8 encoded, but Iconv requires
the developer to specify both the input and output charsets.
On the other hand, Kconv provides a #guess() method but doesn't
support Latin or Western encodings.
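
Concretely, the mismatch looks like this (a sketch against the 1.8
stdlib; the sample string is just a placeholder):

  require 'iconv'
  require 'kconv'

  # Iconv is fine when the source charset is already known:
  Iconv.iconv('UTF-8', 'ISO-8859-1', "caf\xE9").join  # => "café"

  # Kconv can guess, but only among ASCII/JIS/EUC/SJIS/UTF-8;
  # for Latin input it will never answer "ISO-8859-1":
  Kconv.guess("caf\xE9")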

Any suggestion?
 
Xavier Noria

> Hi list,
> I ran a deep search through this group and other resources online, but
> I have been unable to find whether there is a way to guess the charset
> of a string in Ruby 1.8.6.
>
> I need to ensure a string is always UTF-8 encoded, but Iconv requires
> the developer to specify both the input and output charsets.
> On the other hand, Kconv provides a #guess() method but doesn't
> support Latin or Western encodings.

The best way is to be aware of the charset at every data I/O boundary
and do the necessary housekeeping.

If that's not possible, for example when working with arbitrary text
files, the best approximation I am aware of in Ruby is the charguess
library.

-- fxn
 
Austin Ziegler

> I ran a deep search through this group and other resources online, but
> I have been unable to find whether there is a way to guess the charset
> of a string in Ruby 1.8.6.
>
> I need to ensure a string is always UTF-8 encoded, but Iconv requires
> the developer to specify both the input and output charsets.
> On the other hand, Kconv provides a #guess() method but doesn't
> support Latin or Western encodings.
>
> Any suggestion?

Kconv can guess because the encodings for the set of Asian written
languages are distinctive (they don't share much with the Latin
character set). What you want is nearly impossible without a
large body of text for analysis, and even then the best commercial
programs are taking stabs at probabilities. (Here's an example: how do
you tell the difference between ISO-8859-1 and ISO-8859-15
programmatically? IIRC, the only difference between them is that -15
supports the Euro symbol, replacing a different symbol from -1.)

You're better off seeking a slightly different approach.

-austin
 
Simone Carletti

> Kconv can guess because the encodings for the set of Asian written
> languages are distinctive (they don't share much with the Latin
> character set). What you want is nearly impossible without a
> large body of text for analysis, and even then the best commercial
> programs are taking stabs at probabilities. (Here's an example: how do
> you tell the difference between ISO-8859-1 and ISO-8859-15
> programmatically? IIRC, the only difference between them is that -15
> supports the Euro symbol, replacing a different symbol from -1.)
>
> You're better off seeking a slightly different approach.
>
> -austin

If I'm right, both ISO-8859-1 and ISO-8859-15 belong to the Latin
family, so I can convert them the same way using
Iconv.iconv('UTF-8', 'LATIN1', 'a string').join.

My goal is not to detect each single charset, but to convert all
strings from an input into UTF-8.


In the meantime I was reading the code of rFeedParser, the Ruby
implementation of Python's FeedParser.
I just discovered it depends on a project called https://rubyforge.org/projects/rchardet/

I gave it a look, and it seems to do exactly what I was looking for.
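
If it behaves like the Python original, usage would be roughly this
(a sketch; the sample string and the confidence threshold are my own):

  require 'rubygems'
  require 'rchardet'
  require 'iconv'

  input = "Un caff\xE8, per favore"  # ISO-8859-1 bytes of unknown origin
  guess = CharDet.detect(input)      # => {'encoding' => ..., 'confidence' => ...}

  if guess['encoding'] && guess['confidence'].to_f > 0.6  # threshold is my choice
    utf8 = Iconv.iconv('UTF-8', guess['encoding'], input).join
  end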

Is anyone using this library?
 
Lionel Bouton

Simone said:
> If I'm right, both ISO-8859-1 and ISO-8859-15 belong to the Latin
> family, so I can convert them the same way using
> Iconv.iconv('UTF-8', 'LATIN1', 'a string').join.

You'll probably lose the € (euro) sign from ISO-8859-15 sources, as
LATIN1 is simply an alias for ISO-8859-1.
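
A quick sketch of the failure mode (byte 0xA4 is the Euro sign in
ISO-8859-15 but the generic currency sign in ISO-8859-1):

  require 'iconv'

  euro = "\xA4"  # the Euro sign, as encoded by ISO-8859-15

  Iconv.iconv('UTF-8', 'ISO-8859-15', euro).join  # => "€"
  Iconv.iconv('UTF-8', 'LATIN1', euro).join       # => "¤" -- no error raised,
                                                  #    just silently wrong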
> My goal is not to detect each single charset, but to convert all
> strings from an input into UTF-8.

In fact it's the same problem: if you don't know the original charset,
you can't convert properly to UTF-8.
> In the meantime I was reading the code of rFeedParser, the Ruby
> implementation of Python's FeedParser.
> I just discovered it depends on a project called https://rubyforge.org/projects/rchardet/
>
> I gave it a look, and it seems to do exactly what I was looking for.
>
> Is anyone using this library?

I use chardet 0.9.0; I believe the two work more or less the same.

I use it as a fallback mechanism when I can't reliably get the original
charset from feeds. Some feeds actually claim to be UTF-8 encoded but
contain invalid byte sequences (your database isn't happy when you try
to feed it something like that...). It becomes a mess when you find out
that each item in a feed may use a different charset, because people
aggregate different sources without checking their charsets themselves...

The behavior I'm using is (see the sketch below):
1/ Try the advertised charset with Iconv.iconv('UTF-8', charset, data),
   even if charset =~ /^utf-?8$/i.
   Succeeds? -> done.
   Fails (raises an exception)? -> continue.
2/ Use chardet to guess the charset.
3/ Convert with Iconv.iconv('UTF-8', guessed_charset, data).
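
Something along these lines, here written against rchardet's
CharDet.detect (a sketch; the method name and the bare error handling
are illustrative, not my production code):

  require 'rubygems'
  require 'rchardet'
  require 'iconv'

  # Convert +data+ to UTF-8, trusting the advertised charset first and
  # falling back to a chardet guess when the conversion blows up.
  def to_utf8(data, advertised)
    Iconv.iconv('UTF-8', advertised, data).join
  rescue Iconv::Failure
    guess = CharDet.detect(data)['encoding']
    raise "no usable charset guess" unless guess
    Iconv.iconv('UTF-8', guess, data).join
  end

  to_utf8("caf\xE9", 'UTF-8')  # the advertised charset lies; falls back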

Good luck, you're in for a lot of pain...

Lionel
 
Simone Carletti

> I use it as a fallback mechanism when I can't reliably get the original
> charset from feeds.

That's a great example, thank you.
Unfortunately I don't have a real charset header to check. :( I must
rely only on the input string.


> Good luck, you're in for a lot of pain...
>
> Lionel

Thanks, Lionel! :D
 
Michal Suchanek

> That's a great example, thank you.
> Unfortunately I don't have a real charset header to check. :( I must
> rely only on the input string.

You can ask a crystal ball as well.

The multibyte encodings can often be distinguished by their structure
- UTF-8, perhaps UTF-16, the Asian encodings. If something passes for
a valid string in a multibyte encoding, it very likely is a string in
that encoding.
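
For instance, a cheap structural check for UTF-8 in 1.8 is to round-trip
the string through Iconv and see whether it raises (a sketch):

  require 'iconv'

  # True if +str+ is structurally valid UTF-8: Iconv raises on any byte
  # sequence that doesn't follow the UTF-8 encoding rules.
  def valid_utf8?(str)
    Iconv.iconv('UTF-8', 'UTF-8', str)
    true
  rescue Iconv::Failure
    false
  end

  valid_utf8?("caf\xC3\xA9")  # => true  (well-formed UTF-8)
  valid_utf8?("caf\xE9")      # => false (a bare Latin-1 byte)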

However, the Latin 8-bit encodings are all the same - 7-bit ASCII with
some mess attached in the upper 128 positions. By converting from any
of these you get perfectly valid UTF-8, but different gibberish each
time. You can sometimes tell the ISO variants from the Windows variants
because some control characters sit at different positions - and those
should not appear in text. But that does not help you at all - you
still don't know which of the Latin encodings you got.

If you know the language (and it's one of the few supported ones), you
can use enca. If the language is not supported, you can build the filter
yourself: collect the set of accented characters (those with the 8th bit
set) used in your language and encode them in the different candidate
encodings (the DOS and Windows codepages, the ISO encoding, any other
legacy encodings). You get sets of bytes that usually overlap but
contain some unique bytes; when you see one of those bytes, you know
which encoding you are dealing with (see the sketch below).
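
A rough illustration of that filter (the character list and the
candidate encodings are my own assumptions, picked for a Western
European language):

  require 'iconv'

  candidates = ['ISO-8859-1', 'ISO-8859-15', 'CP1252']
  accented   = ["\xC3\xA8", "\xC3\xA9", "\xE2\x82\xAC"]  # è, é, € in UTF-8

  # Map each encoding to the bytes those characters become in it.
  sets = {}
  candidates.each do |enc|
    sets[enc] = accented.map { |ch|
      Iconv.iconv(enc, 'UTF-8', ch).join rescue nil  # skip unrepresentable chars
    }.compact
  end

  # A byte unique to one candidate identifies it when seen in the input:
  # here only ISO-8859-15 maps the Euro to 0xA4, only CP1252 maps it to 0x80.
  candidates.each do |enc|
    others = (candidates - [enc]).map { |e| sets[e] }.flatten
    p [enc, sets[enc] - others]
  end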

Good luck :)

Michal
 
