Regexp for matching UTF-8 characters without close tag

Jesse P.

Hi all,

I'm trying to solve this problem:

string = "\302</u>"
TEXT_PATTERN = /\A([^<]*)/um
text_data = string.match(TEXT_PATTERN).to_s
=> "\302</u>"

As you can see, the regular expression incorrectly captures not only
the text part but also the closing tag, whereas what is supposed to be
captured is just "\302".

This problem is actually part of the REXML::Source#match method
(http://www.germane-software.com/projects/rexml/browser/trunk/src/
rexml/source.rb?rev=1266#L104) and causes REXML to parse UTF-8
documents incorrectly sometimes.

Any ideas why the pattern matching doesn't work? I don't see anything
wrong with the regular expression, although I'm not sure what the \A
character class is for.

Best regards,

Jesse
 
Tiziano Merzi

Jesse said:
Hi all,

I'm trying to solve this problem:

string = "\302</u>"
TEXT_PATTERN = /\A([^<]*)/um
text_data = string.match(TEXT_PATTERN).to_s
=> "\302</u>"

A solution may be:

require 'iconv'
string = "\302</u>" # not valid UTF-8: the ISO-8859-1 character \302 is \303\202 in UTF-8
string = Iconv.conv("UTF-8","ISO-8859-1",string)
TEXT_PATTERN = /\A([^<]*)/um
text_data = string.match(TEXT_PATTERN).to_s
text_data = Iconv.conv("ISO-8859-1","UTF-8",text_data)

puts text_data

\A -> beginning of the string (it is an anchor, not a character class)
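
To illustrate the anchor: in Ruby, \A matches only at the very start of the string, while ^ matches at the start of every line. A minimal sketch, using a made-up two-line sample string that is not from the posts above:

```ruby
# Hypothetical sample string, chosen only to show the difference.
sample = "first\nsecond"

# ^ anchors at the start of any line, so "second" is found at index 6.
line_anchor_index = sample =~ /^second/

# \A anchors only at the start of the whole string, so nothing matches.
string_anchor_index = sample =~ /\Asecond/

# \A does match what the string actually begins with.
start_index = sample =~ /\Afirst/

p line_anchor_index    # 6
p string_anchor_index  # nil
p start_index          # 0
```

So in TEXT_PATTERN the \A only pins the match to the start of the remaining input; it has no bearing on whether "<" is excluded.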
 
Jesse P.

Hi Tiziano,

My apologies. It seems that I have oversimplified the problem due to my
lack of understanding of UTF-8.

The actual string is an xml file I obtained from flickr at
http://api.flickr.com/services/rest...@N00&api_sig=6a39aab2fb665e24d2b6e1cef9d0be27:
An excerpt is as follows:
<?xml version="1.0" encoding="utf-8" ?>
<rsp stat="ok">
<person id="55669962@N00" nsid="55669962@N00" isadmin="0" ispro="0"
iconserver="136" iconfarm="1">
<username>(_.·´¯`·â"¢â(tm) â'ª Emirates Wizard â'ªâ(tm) â"¢Â·Â</username>
<realname />
<mbox_sha1sum>0b88a178b28c40ff81d44c5ae475438abec2009c</mbox_sha1sum>
<location />
<photosurl>http://www.flickr.com/photos/emirates_wizard/</photosurl>
<profileurl>http://www.flickr.com/people/emirates_wizard/</profileurl>
<mobileurl>http://m.flickr.com/photostream.gne?id=5467956</mobileurl>
<photos>
<firstdatetaken>2006-07-16 15:22:42</firstdatetaken>
<firstdate>1162548449</firstdate>
<count>36</count>
</photos>
</person>
</rsp>

The part of the XML that causes the problem is the <username> tag,
whose content, written as a Ruby string with octal escapes, is:
"<username>(_.\302\267\302\264\302\257`\302\267\342\204\242\342\231
\342\202\252 Emirates Wizard \342\202\252\342\231
\342\204\242\302\267\302</username>"
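
For readers unfamiliar with the octal notation, here is a small sketch of how those escapes map to UTF-8 characters, and why the final \302 stands out. It assumes a Ruby version with encoding-aware strings (1.9 or later) and a UTF-8 source file; the variable names are only illustrative:

```ruby
# Octal escapes are plain bytes: \302\267 is 0xC2 0xB7,
# the UTF-8 encoding of MIDDLE DOT (U+00B7).
middle_dot = "\302\267"

# \342\204\242 is 0xE2 0x84 0xA2, TRADE MARK SIGN (U+2122).
trademark = "\342\204\242"

# The byte right before </username> is a lone \302 (0xC2): a lead byte
# with no continuation byte after it, so the string is not valid UTF-8.
dangling = "\302</username>"

p middle_dot == "\u00B7"       # true
p trademark == "\u2122"        # true
p middle_dot.valid_encoding?   # true
p dangling.valid_encoding?     # false
```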

Note that the XML declares its contents to be UTF-8. When I use
REXML to process this XML, after it has processed the
"<username>" tag it is left with
string = "(_.\302\267\302\264\302\257`\302\267\342\204\242\342\231
\342\202\252 Emirates Wizard \342\202\252\342\231
\342\204\242\302\267\302</username>"

I just checked: if I match this string against
TEXT_PATTERN = /\A([^<]*)/um, I get the text and also the close tag.
TEXT_PATTERN = /\A([^<]*)/um
text_data = string.match(TEXT_PATTERN).to_s
=> "(_.\302\267\302\264\302\257`\302\267\342\204\242\342\231
\342\202\252 Emirates Wizard \342\202\252\342\231
\342\204\242\302\267\302</username>"

Assuming that the XML contains some malformed data (some bytes are not
actually valid UTF-8), is there any way I can process the XML as it is
and only treat the malformed data differently? (For example, you
mentioned that the \302 character is not a UTF-8 character.)
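
One possible way to treat only the malformed bytes differently is sketched below. This is not an approach proposed in this thread: it assumes Ruby 2.1 or later, where String#scrub is available, and the "?" replacement character is an arbitrary choice.

```ruby
text_pattern = /\A([^<]*)/um

raw = "\302</u>"   # the lone 0xC2 byte makes this invalid UTF-8

# Replace each invalid byte sequence with "?", leaving valid text untouched.
cleaned = raw.valid_encoding? ? raw : raw.scrub("?")

# Capture group 1 is everything before the first "<".
text_data = cleaned[text_pattern, 1]

p cleaned     # "?</u>"
p text_data   # "?"
```

Valid multi-byte characters elsewhere in the document are left alone, so only the broken byte is changed before the text reaches the parser.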

Best regards,

Jesse




Tiziano Merzi

Jesse said:
Hi Tiziano,

My apologies. It seems that I have oversimplified the problem due to my
lack of understanding of UTF-8.

The actual string is an xml file I obtained from flickr at
http://api.flickr.com/services/rest...@N00&api_sig=6a39aab2fb665e24d2b6e1cef9d0be27:

You have a broken UTF-8 document (see Matz's reply).

I see two solutions:

a)

require 'rexml/document'
require 'iconv'

data = your xml document
data = Iconv.conv("UTF-8","ISO-8859-1",data)
doc = REXML::Document.new(data)

You must then convert the data read from doc from UTF-8 back to ISO-8859-1 before using it:

username = Iconv.conv("ISO-8859-1","UTF-8",doc. )

b) change the encoding of the xml

require 'rexml/document'

data = your xml document
data = data.gsub(/encoding="utf-8"/i, 'encoding="iso-8859-1"')
doc = REXML::Document.new(data)

Note that neither a) nor b) works if the document also contains valid
UTF-8 characters outside 7-bit ASCII (for example accented Latin
letters such as è, ò, etc.)
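
Iconv is no longer part of Ruby's standard library (it was removed in Ruby 2.0). On such versions, String#encode with an explicit source encoding can express the same round trip. The sketch below mirrors solution a) on the short test string from the start of the thread, and also shows the caveat just mentioned; the variable names are illustrative only.

```ruby
text_pattern = /\A([^<]*)/um

raw = "\302</u>"

# Same idea as Iconv.conv("UTF-8", "ISO-8859-1", raw): read the bytes as
# ISO-8859-1 and re-encode them as UTF-8, so 0xC2 becomes 0xC3 0x82.
as_utf8 = raw.encode("UTF-8", "ISO-8859-1")
text_data = as_utf8[text_pattern, 1]

# Convert the captured text back to the original byte.
restored = text_data.encode("ISO-8859-1")

p as_utf8.bytes.first(2)   # [195, 130], i.e. \303\202
p restored.bytes           # [194], i.e. \302

# The caveat: a valid two-byte UTF-8 character is read as two
# ISO-8859-1 characters and ends up double-encoded.
e_grave = "\303\250"       # U+00E8 encoded as UTF-8
double_encoded = e_grave.encode("UTF-8", "ISO-8859-1")
p double_encoded.length    # 2, no longer a single character
```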
 
Jesse P.

Thanks Tiziano :)

