Ways to filter bad (Unicode) characters

R

Ramza Brown

I am parsing a collection of URLs, some of which seem to contain Chinese,
Indian, and other Unicode characters. My question: how can I filter those
out while still leaving room for alphanumeric characters and the
characters typical of a URL or title?

For example I might get a URL with:
http://????????????-????????
title = ??????

where each ? represents some Unicode character.

I want to filter these out, but leave room for non-alphanumeric characters:

http://www.yahoo.com

--
Berlin Brown
(ramaza3 on freenode)
http://www.newspiritcompany.com
http://www.newspiritcompany.com/newforums
also checkout alpha version of botverse:
http://www.newspiritcompany.com:8086/universe_home
 
B

baumanj

How about:

new_url = ''
url.each_byte { |b| new_url << b if b < 128 }

That should keep all the ASCII bytes and drop all the non-ASCII ones.
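On Ruby 1.9 and later, where strings carry encodings, the same filter can also be written with a regex (a sketch; the example URL here is invented for illustration):

```ruby
# A made-up URL containing one non-ASCII character (e-acute).
url = "http://www.yahoo.com/caf\u00e9"

# Keep only ASCII characters, dropping anything outside the 7-bit range.
ascii_only = url.gsub(/[^[:ascii:]]/, '')
# ascii_only => "http://www.yahoo.com/caf"
```

The `[[:ascii:]]` POSIX bracket expression matches exactly the characters with codepoints below 128, so this does the same job as the byte loop without building the string byte by byte.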
 
D

Dave Burt

In that case:

invalid = false
url.each_byte {|b| invalid = true if b > 127 }

Or:

require 'enumerator'

class String
  def seven_bit_clean?
    # On Ruby 1.8, enum_for (from 'enumerator') is needed to treat the
    # bytes as an Enumerable; on 1.9+, plain each_byte.all? works too.
    enum_for(:each_byte).all? { |b| b <= 127 }
  end
end
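For example, the predicate could be used like this (a self-contained sketch assuming Ruby 1.9+, where `each_byte` without a block returns an Enumerator; the sample strings are invented):

```ruby
# Restating the monkey-patch so this example runs on its own.
class String
  def seven_bit_clean?
    each_byte.all? { |b| b <= 127 }
  end
end

puts "http://www.yahoo.com".seven_bit_clean?  # true: all bytes are ASCII
puts "caf\u00e9".seven_bit_clean?             # false: e-acute encodes as bytes > 127
```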

Cheers,
Dave
 
