Detect non-ascii substrings in a file


killy971

I have files encoded in Shift_JIS that mainly contain JSP source
code (ASCII), but sometimes also contain non-ASCII strings
(Japanese words).

So I would like to know if there is a way with Ruby to:
- detect files containing something other than ASCII,
- extract the non-ASCII strings that were found.

Thank you!
 

Ron Fox

Any character that has the top bit clear is potentially valid ASCII,
though if you take away the non-printing characters there's an
additional exclusion set.
According to http://en.wikipedia.org/wiki/Shift-JIS, testing for
character codes with the top bit set should indicate either katakana
or double-byte characters. See the chart there for which ranges are
double byte, which are single byte, and which are not legal.
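
As a rough sketch of this in Ruby (not a definitive implementation): rather than testing raw bytes yourself, you can tag the data as Shift_JIS and let Ruby split it into characters, then keep the ones that are not ASCII. This also sidesteps the fact that the second byte of a double-byte Shift_JIS character can fall in the ASCII range. The helper names `non_ascii_runs` and `non_ascii_in_file` are made up for illustration, and bytes that are not legal Shift_JIS are assumed to be worth reporting, so they show up as non-ASCII runs too.

```ruby
# Hypothetical helper: returns every run of consecutive non-ASCII characters
# found in a string of Shift_JIS-encoded bytes.
def non_ascii_runs(bytes)
  # Tag the raw bytes as Shift_JIS so Ruby iterates by character, not by byte.
  text = bytes.dup.force_encoding(Encoding::Shift_JIS)

  text.each_char
      .chunk { |ch| ch.ascii_only? }       # group consecutive chars by ASCII-ness
      .reject { |ascii, _chars| ascii }    # drop the pure-ASCII groups
      .map { |_ascii, chars| chars.join }  # rebuild each non-ASCII run
end

# Hypothetical helper: reads a file as raw bytes and returns its non-ASCII runs.
# An empty array means every byte in the file was 7-bit ASCII.
def non_ascii_in_file(path)
  non_ascii_runs(File.binread(path))
end
```

Something like `non_ascii_in_file("page.jsp")` (a hypothetical file name) would then give an array of Shift_JIS strings; call `encode("UTF-8")` on each one if you want to print them on a UTF-8 terminal.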

RF
 
