Accent-insensitive regexp?

Davi Barbosa

Hello,
Is there any way to make a regexp accent-insensitive, so that /a/ matches
ã and Ã?

If not, can anyone suggest a solution to my problem?
I'm making a search web page with mod_ruby, so I wrote an
accent- and case-insensitive SQL query, and this works fine (with
latin1_swedish_ci). Now I want to highlight what the user searched for.
To achieve this I'm doing something *like*:
string.gsub(/search/i,'<span class="highlight">\0</span>')
This works fine if neither search nor the relevant part of string has
accents, but if either one has an accent the regexp doesn't match, so the
entry is not highlighted.

I know that with
Iconv.conv("ascii//translit","UTF-8",str)
I can remove all the accents from str, so I can remove the accents from
'search' without any problem. But if I remove the accents from string
to do the highlighting, I need to put them back later to display it to the
user.
Does anyone have any idea?

Thank you
 
Ken Bloom

I would try replacing the letters that may carry an accent with periods
(which match any single character) when searching. This will give some
false positives, so I would use gsub with a block to do a more specific
conditional test.

Suppose the search is for ole (typed without the accent, while the real
hits have an accent on the e), and the text is in a language that allows
accents only on the letter e.

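require 'iconv'
$KCODE = 'u' # Ruby 1.8: makes . match a whole UTF-8 character, not one byte

query = 'ole'
# Replace each letter that may carry an accent with a wildcard.
pattern = Regexp.compile(query.gsub(/[e]/, '.')) #=> /ol./

translit = Iconv.conv("ascii//translit", "UTF-8", query) #=> "ole"

# text holds the string being searched
highlighted = text.gsub(pattern) do |match|
  # The regular expression only gets close enough and captures
  # the actual text we're concerned about;
  # the if test does the actual exact comparison.
  if Iconv.conv("ascii//translit", "UTF-8", match) == translit
    "<span class=\"highlight\">#{match}</span>"
  else
    match
  end
end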

Of course, there may be some locale tricks that I'm missing that would
make this much easier.
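
For anyone reading this on a current Ruby, where Iconv no longer ships with the standard library, here is a minimal sketch of the same wildcard-then-verify idea. It swaps in String#unicode_normalize (Ruby 2.2+) for the Iconv transliteration, which is my substitution and not what was posted above. The names strip_accents and highlight_with_wildcards and the default accentable set are hypothetical, and the sketch assumes the text uses precomposed characters (it normalizes to NFC first).

```ruby
# Hypothetical helper: decompose, then drop the combining marks.
# Stands in for Iconv's //TRANSLIT; it only handles accents, not
# other transliterations.
def strip_accents(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# accentable is an assumed set of letters that may carry an accent.
def highlight_with_wildcards(text, query, accentable = /[aeiou]/i)
  base = strip_accents(query)
  return text if base.empty?

  # Wildcard for accentable letters, literal (escaped) for the rest.
  source = base.each_char.map { |ch|
    ch.match?(accentable) ? "." : Regexp.escape(ch)
  }.join
  pattern = Regexp.new(source, Regexp::IGNORECASE)
  target = base.downcase

  text.unicode_normalize(:nfc).gsub(pattern) do |match|
    # The regexp only gets close; this comparison is the exact test.
    if strip_accents(match).downcase == target
      %(<span class="highlight">#{match}</span>)
    else
      match
    end
  end
end

puts highlight_with_wildcards("Ol\u00E9, ole, old", "ole")
```

Because gsub consumes matches left to right, a false positive can in principle swallow text that overlaps a real hit, so treat this as an approximation.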
 
Davi Barbosa

Thank you for your answer, but I'm working with a lot of languages, so I
can't know in advance which letters may carry an accent.

For the moment, I just discovered that I can't remove the accents with
Iconv as I said before: here it works only under irb. I described
this problem here: http://www.ruby-forum.com/topic/70827#738081
Another problem with UTF-8 under Ruby is that Ruby can't index the string
correctly, because it indexes by byte. For example, 'áb'[1..1] gives the
second half of 'á'. I discovered a workaround using the Unicode version
of regexp:
$KCODE = 'u'
'áb'.split(//m) == ["á", "b"]

Without these problems, I think I know how to do it without false
matches, using an ugly loop.
If str and regexp are the versions without accents, str =~ regexp gives
the position of the match and str[regexp].length gives its length. With
these two numbers, it's possible to do the highlighting in the original
string. It's something like:
require 'iconv'
ascii_string = Iconv.conv('US-ASCII//TRANSLIT', 'UTF-8', string)
ascii_search = Iconv.conv('US-ASCII//TRANSLIT', 'UTF-8', search)
regexp = Regexp.new(Regexp.escape(ascii_search), true) # true = ignore case
position = (ascii_string =~ regexp) # nil when there is no match
size = ascii_string[regexp].length
# Exclusive range: 0..(position-1) would return the whole string when position is 0.
highlighted = ascii_string[0...position] + '<span class="highlight">' +
  ascii_string[position, size] + '</span>' + ascii_string[(position + size)..-1]

Of course, it needs some modifications to put this in a loop (and I need
to use the array version of the string to index it correctly).
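
The position-and-length idea above could look roughly like this on a current Ruby. This is only a sketch under assumptions: it uses String#unicode_normalize and String#grapheme_clusters (Ruby 2.5+) in place of Iconv and the split(//) trick, the method names are hypothetical, and it does not HTML-escape the text. It keeps a table that maps every position in the accent-free string back to a character of the original, so the highlight is cut from the original string and the accents survive.

```ruby
# Hypothetical helper: decompose, then drop the combining marks.
def strip_accents(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

def highlight_accent_insensitive(text, query)
  needle = strip_accents(query).downcase
  return text.dup if needle.empty?

  chars = text.grapheme_clusters   # the "array version" of the string
  plain = +""                      # accent-free, lowercased copy
  origin = []                      # origin[k] = index in chars that produced plain[k]
  chars.each_with_index do |ch, i|
    folded = strip_accents(ch).downcase
    plain << folded
    folded.length.times { origin << i }
  end

  out = +""
  cursor = 0                       # next index in chars not yet copied
  pos = 0                          # where to resume searching in plain
  while (start = plain.index(needle, pos))
    first = origin[start]
    last = origin[start + needle.length - 1]
    if first >= cursor
      out << chars[cursor...first].join
      out << %(<span class="highlight">) << chars[first..last].join << "</span>"
      cursor = last + 1
      pos = start + needle.length
    else
      pos = start + 1              # overlaps text already emitted; skip it
    end
  end
  out << chars[cursor..-1].join
end

puts highlight_accent_insensitive("S\u00E3o Paulo e SAO", "sao")
```

The origin table is what removes the need to assume that every character transliterates to exactly one ASCII character.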
 
Ken Bloom

You can use a StringScanner (require 'strscan') to properly do this in a
loop, because StringScanner#pos will tell you the starting position of
the match, where String#scan will not.
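
A minimal sketch of such a loop, assuming the string and the search term have already been reduced to ASCII (the method name match_spans is hypothetical). After scan_until, StringScanner#pos sits just past the match, so the start is pos minus matched_size. Note that pos counts bytes, which is the same as characters only because the input is ASCII.

```ruby
require 'strscan'

# Returns [start, size] pairs for every case-insensitive occurrence
# of ascii_search in ascii_string.
def match_spans(ascii_string, ascii_search)
  return [] if ascii_search.empty?   # an empty pattern would never advance

  regexp = Regexp.new(Regexp.escape(ascii_search), Regexp::IGNORECASE)
  scanner = StringScanner.new(ascii_string)
  spans = []
  while scanner.scan_until(regexp)
    size = scanner.matched_size
    spans << [scanner.pos - size, size]   # pos is just past the match
  end
  spans
end

p match_spans("Sao Paulo, sao", "sao")
```

Each pair can then be used to cut the corresponding slice out of the array version of the original string.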

Consider whether Ruby 1.9.0 is stable enough for your purposes because it
handles Unicode natively and should save you from needing to have a
vector version of the string.
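
To illustrate the difference, here is a small sketch of how a Ruby from 1.9 onward treats the 'áb' example. It assumes the string is UTF-8 and that á is the single precomposed code point U+00E1; the variable names are just for illustration.

```ruby
s = "\u00E1b"                     # "áb"

char_count  = s.length            # counts characters
byte_count  = s.bytesize          # counts bytes in the UTF-8 encoding
second_char = s[1]                # indexes by character
half        = s.byteslice(1, 1)   # indexes by byte, as Ruby 1.8 did

puts char_count                   # 2
puts byte_count                   # 3
puts second_char                  # b
puts half.valid_encoding?         # false: this is the second half of á
p s.chars == ["\u00E1", "b"]      # true, no split(//) needed
```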
 
