Saving web pages: charset problems and symbol problems


Sak Na rede

Hi all!

I think a lot of Ruby scripts are written for web crawling, web scraping,
and other web applications. I'm working with the web too: I'm trying to
save the text of many different websites. At the moment I'm trying to
solve two problems:

1 - How to standardize the charset of web pages. There are a lot of
different charsets, and I think there must be a better solution than
checking the charset of every page and converting it to the proper
charset each time. (By the way, what is the best method to detect the
charset of a file? The file command is not very good, I think.)
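
To make point 1 concrete, here is a minimal sketch of the "convert everything to one charset" idea, assuming Ruby 1.9 or later (where String#encode and String#force_encoding exist) and assuming UTF-8 is the target. The helper name to_utf8, the declared parameter, and the fallback order are assumptions for illustration, not something from a library. It does not detect the charset reliably; it only tries the declared one and then a few common guesses.

```ruby
# Hypothetical helper: convert the raw bytes of a page to UTF-8.
# `declared` is the charset name from the HTTP Content-Type header or the
# HTML meta tag, if there is one (assumption: it may be missing or wrong).
def to_utf8(raw, declared = nil)
  # Assumed fallback order: declared charset, then UTF-8, then Windows-1252.
  candidates = [declared, "UTF-8", "Windows-1252"].compact
  candidates.each do |name|
    begin
      text = raw.dup.force_encoding(name)       # relabel the bytes, no conversion yet
      return text.encode("UTF-8") if text.valid_encoding?
    rescue ArgumentError, EncodingError
      next # unknown charset name or unconvertible bytes: try the next guess
    end
  end
  # Last resort: every byte is defined in ISO-8859-1, so this cannot fail,
  # although the result may be wrong for pages in other charsets.
  raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end
```

For example, to_utf8("caf\xE9".b, "ISO-8859-1") returns the UTF-8 string "café". Trying UTF-8 before the single-byte charsets is a deliberate choice here, because random single-byte text is rarely valid UTF-8, while almost any bytes are accepted by a single-byte charset.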

2 - How to convert HTML to plain text. I use Hpricot, but a lot of very
strange symbols remain in the output, like "€" or "â€". Which is the
most commonly used method?
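
For point 2, a rough sketch using only the Ruby standard library (Ruby 1.9 or later assumed), not Hpricot. The names html_to_text and repair_mojibake are hypothetical. The first helper strips tags with regular expressions, which is crude and does not decode entities such as &amp;. The second rests on an assumption about the strange symbols: "€" is exactly what the UTF-8 bytes of "€" look like when they are read as Windows-1252, so the text may have been decoded with the wrong charset rather than badly converted from HTML.

```ruby
# Hypothetical helper: crude HTML-to-text conversion with regular expressions.
# It does not decode entities and will be fooled by unusual markup.
def html_to_text(html)
  text = html.gsub(%r{<(script|style)\b.*?</\1>}mi, " ") # drop script/style bodies
  text = text.gsub(/<[^>]+>/, " ")                        # drop the remaining tags
  text.gsub(/\s+/, " ").strip                             # collapse whitespace
end

# Hypothetical helper: undo "UTF-8 bytes read as Windows-1252" damage.
# Returns the input unchanged when it does not look like that kind of damage.
def repair_mojibake(text)
  # Map each character back to its Windows-1252 byte, then reread as UTF-8.
  fixed = text.encode("Windows-1252").force_encoding("UTF-8")
  fixed.valid_encoding? ? fixed : text
rescue EncodingError
  text # contains characters outside Windows-1252, so it was not this damage
end
```

With these definitions, repair_mojibake("€") returns "€", while correctly decoded text such as "café" comes back unchanged. If the repair works on the saved text, the real fix is probably to decode the page with the right charset before parsing it.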

Thanks a lot
 
