Ways to filter bad (Unicode) characters

R

Ramza Brown

I am parsing a collection of URLs, some of which seem to contain Chinese,
Indian, and other Unicode characters. My question: how can I filter those
out while still leaving room for alphanumeric characters and the
characters typical of a URL or title?

For example I might get a URL with:
http://????????????-????????
title = ??????

where each ? represents some Unicode character.

I want to filter these out, but leave room for non-alphanumeric characters:

http://www.yahoo.com

--
Berlin Brown
(ramaza3 on freenode)
http://www.newspiritcompany.com
http://www.newspiritcompany.com/newforums
also checkout alpha version of botverse:
http://www.newspiritcompany.com:8086/universe_home
 
B

baumanj

How about:

new_url = ''
url.each_byte { |b| new_url << b if b < 128 }

That should keep all the ASCII bytes and drop all the non-ASCII ones.
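On Ruby 1.9 and later, where strings carry encodings, the same filter can also be written with a regex (a sketch; the example URL here is invented for illustration):

```ruby
# A made-up URL containing one non-ASCII character (e-acute).
url = "http://www.yahoo.com/caf\u00e9"

# Keep only ASCII characters, dropping anything outside the 7-bit range.
ascii_only = url.gsub(/[^[:ascii:]]/, '')
# ascii_only => "http://www.yahoo.com/caf"
```

The `[[:ascii:]]` POSIX bracket expression matches exactly the characters with codepoints below 128, so this does the same job as the byte loop without building the string byte by byte.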
 
D

Dave Burt

In that case:

invalid = false
url.each_byte {|b| invalid = true if b > 127 }

Or:

require 'enumerator'

class String
  def seven_bit_clean?
    # On Ruby 1.8, enum_for (from 'enumerator') is needed to treat the
    # bytes as an Enumerable; on 1.9+, plain each_byte.all? works too.
    enum_for(:each_byte).all? { |b| b <= 127 }
  end
end
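For example, the predicate could be used like this (a self-contained sketch assuming Ruby 1.9+, where `each_byte` without a block returns an Enumerator; the sample strings are invented):

```ruby
# Restating the monkey-patch so this example runs on its own.
class String
  def seven_bit_clean?
    each_byte.all? { |b| b <= 127 }
  end
end

puts "http://www.yahoo.com".seven_bit_clean?  # true: all bytes are ASCII
puts "caf\u00e9".seven_bit_clean?             # false: e-acute encodes as bytes > 127
```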

Cheers,
Dave
 
