Regexp for matching UTF-8 characters without close tag

Jesse P. · Jan 5, 2008

Hi all,

I'm trying to solve this problem:

string = "\302"
TEXT_PATTERN = /\A([^<]*)/um
text_data = string.match(TEXT_PATTERN).to_s
=> "\302"

As you can see, the regular expression incorrectly captures not only
the text part but also the closing tag, whereas what is supposed to be
captured is just "\302".

This problem is actually part of the REXML::Source#match method
(http://www.germane-software.com/projects/rexml/browser/trunk/src/rexml/source.rb?rev=1266#L104)
and sometimes causes REXML to parse UTF-8 documents incorrectly.

Any ideas why the pattern matching doesn't work? I don't see anything
wrong with the regular expression, although I'm not sure what the \A
character class is for.

Best regards,

Jesse

Tiziano Merzi · Jan 5, 2008

Jesse said:
Hi all,

I'm trying to solve this problem:

string = "\302"
TEXT_PATTERN = /\A([^<]*)/um
text_data = string.match(TEXT_PATTERN).to_s
=> "\302"

A solution may be:

require 'iconv'
string = "\302" # a lone \302 is not valid UTF-8; U+00C2 encoded as UTF-8 is \303\202
string = Iconv.conv("UTF-8","ISO-8859-1",string)
TEXT_PATTERN = /\A([^<]*)/um
text_data = string.match(TEXT_PATTERN).to_s
text_data = Iconv.conv("ISO-8859-1","UTF-8",text_data)

puts text_data
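
Iconv was removed from the standard library in Ruby 2.0, so the round trip above will not load on current interpreters. As a rough sketch only, assuming Ruby 1.9 or later (where strings carry an encoding) and a sample input with a closing tag that I made up for illustration, the same idea can be written with String#force_encoding and String#encode. The /u flag is dropped here because the string's own encoding drives the match.

```ruby
# Sample input, assumed for illustration: one stray byte plus a closing tag.
raw = "\302</name>".b                        # raw bytes, no encoding claimed

# A lone \302 (0xC2) is a UTF-8 lead byte with no continuation byte after it.
as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)
lone_byte_valid = as_utf8.valid_encoding?    # false

# Reinterpret the bytes as ISO-8859-1 and transcode to UTF-8,
# which is what Iconv.conv("UTF-8", "ISO-8859-1", string) did.
latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1)
utf8   = latin1.encode(Encoding::UTF_8)      # bytes \303\202 followed by the tag

# The pattern now stops at "<" because the string is valid UTF-8.
text_data = utf8.match(/\A([^<]*)/m).to_s

# Convert the captured text back to the original single byte.
round_trip = text_data.encode(Encoding::ISO_8859_1).b
```

Like the Iconv version, this treats every byte as a Latin-1 character, so it only makes sense when the data is not really UTF-8.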

\A -> beginning of the string (^ is the beginning of a line)
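
To make the anchor concrete, here is a tiny sketch (the sample text is made up) showing that \A only matches at the very start of the string, while ^ also matches after a newline:

```ruby
text = "first\nsecond"

at_string_start = text =~ /\Asecond/   # nil: "second" is not at the start of the string
at_line_start   = text =~ /^second/    # 6: "second" starts a line
first_at_start  = text =~ /\Afirst/    # 0
```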

Jesse P. · Jan 5, 2008

Hi Tiziano,

My apologies. It seems that I oversimplified the problem due to my
lack of understanding of UTF-8.

The actual string is an xml file I obtained from flickr at
http://api.flickr.com/services/rest...@N00&api_sig=6a39aab2fb665e24d2b6e1cef9d0be27:
An excerpt is as follows:
<?xml version="1.0" encoding="utf-8" ?>
<rsp stat="ok">
<person id="55669962@N00" nsid="55669962@N00" isadmin="0" ispro="0"
iconserver="136" iconfarm="1">
<username>(_.Â·Â´Â¯`Â·â"¢â(tm) â'ª Emirates Wizard â'ªâ(tm) â"¢Â·Â</username>
<realname />
<mbox_sha1sum>0b88a178b28c40ff81d44c5ae475438abec2009c</mbox_sha1sum>
<location />
<photosurl>http://www.flickr.com/photos/emirates_wizard/</photosurl>
<profileurl>http://www.flickr.com/people/emirates_wizard/</profileurl>
<mobileurl>http://m.flickr.com/photostream.gne?id=5467956</mobileurl>
<photos>
<firstdatetaken>2006-07-16 15:22:42</firstdatetaken>
<firstdate>1162548449</firstdate>
<count>36</count>
</photos>
</person>
</rsp>

The part of the XML that is causing the problem is the <username>
tag, which as a Ruby string with octal escapes is:
"<username>(_.\302\267\302\264\302\257`\302\267\342\204\242\342\231
\342\202\252 Emirates Wizard \342\202\252\342\231
\342\204\242\302\267\302</username>"

Note that the XML says that the contents are in UTF-8. So when I use
REXML to process this XML, after it processes the tag
"<username>", it is left with
string = "(_.\302\267\302\264\302\257`\302\267\342\204\242\342\231
\342\202\252 Emirates Wizard \342\202\252\342\231
\342\204\242\302\267\302</username>"

I just checked, and if I match this string with TEXT_PATTERN =
/\A([^<]*)/um, I get the text and also the close tag.
TEXT_PATTERN = /\A([^<]*)/um
text_data = string.match(TEXT_PATTERN).to_s
=> "(_.\302\267\302\264\302\257`\302\267\342\204\242\342\231
\342\202\252 Emirates Wizard \342\202\252\342\231
\342\204\242\302\267\302</username>"
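
One plausible reading of this result, offered as an assumption about how the Ruby 1.8 regex engine behaves in /u mode rather than something confirmed in this thread: the engine takes each character's length from its lead byte without checking the continuation bytes. The trailing \302 announces a two-byte character, so the "<" after it is swallowed as the second byte and [^<] never sees it. The helper names below are made up; the sketch only simulates that stepping on raw bytes.

```ruby
# Length of a UTF-8 sequence as implied by its lead byte alone.
def implied_length(lead)
  if    lead <  0x80 then 1
  elsif lead >= 0xF0 then 4
  elsif lead >= 0xE0 then 3
  elsif lead >= 0xC0 then 2
  else  1                     # stray continuation byte: step over it
  end
end

# Split raw bytes into "characters" without validating continuation bytes.
def naive_chars(bytes)
  chars = []
  i = 0
  while i < bytes.size
    n = implied_length(bytes[i])
    chars << bytes[i, n]
    i += n
  end
  chars
end

chars     = naive_chars("\302</username>".bytes)
swallowed = chars.first             # [0xC2, 0x3C]: "<" is glued to \302
sees_lt   = chars.include?([0x3C])  # false: no standalone "<" remains
```

With no standalone "<" left, a negated class like [^<] has nothing to stop at, which matches the over-long capture shown above.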

Assuming that the XML contains some malformed data (bytes that are not
actually valid UTF-8), is there any way that I can process the XML as it
is and only treat the malformed data differently? (E.g. you mentioned
that a lone \302 is not a UTF-8 character.)

Best regards,

Jesse

Tiziano Merzi · Jan 5, 2008

Jesse said:
Hi Tiziano,

My apologies. It seems that I oversimplified the problem due to my
lack of understanding of UTF-8.

The actual string is an xml file I obtained from flickr at
http://api.flickr.com/services/rest...@N00&api_sig=6a39aab2fb665e24d2b6e1cef9d0be27:

You have a broken UTF-8 document (read Matz's reply).

I see two solutions:

a)

require 'rexml/document'
require 'iconv'

data = your_xml_document  # a String holding the XML
data = Iconv.conv("UTF-8", "ISO-8859-1", data)
doc = REXML::Document.new(data)

You must convert the data in doc from UTF-8 back to ISO-8859-1 before using it:

username = Iconv.conv("ISO-8859-1","UTF-8",doc. )

b) change the encoding declared in the XML

require 'rexml/document'

data = your_xml_document  # a String holding the XML
data = data.gsub(/encoding="utf-8"/i, 'encoding="iso-8859-1"')
doc = REXML::Document.new(data)

Anyway, a) and b) don't work if the document contains valid UTF-8
characters outside 7-bit ASCII (for example the Latin letters è, ò, etc.)
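
For documents that mix valid UTF-8 with a few broken bytes, a different approach is to replace only the malformed sequences and leave everything else alone. This is a sketch under assumptions: it needs Ruby 2.1 or later for String#scrub (not available when this thread was written), the input is shortened from the excerpt above, and the "?" replacement is an arbitrary choice.

```ruby
# Shortened sample: valid two-byte characters plus one truncated \302 at the end.
raw = "(_.\302\267 Emirates Wizard \302\267\302</username>".b

utf8 = raw.dup.force_encoding(Encoding::UTF_8)
was_valid = utf8.valid_encoding?   # false: the final \302 has no continuation byte

# Replace only the malformed byte sequences; valid characters are untouched.
clean = utf8.scrub("?")

# On the cleaned string the pattern stops at the closing tag.
text_data = clean.match(/\A([^<]*)/m).to_s
```

The same scrub call could be applied to the whole document before handing it to REXML::Document.new, at the cost of losing whatever character the truncated bytes were meant to be.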

Jesse P. · Jan 6, 2008

Thanks Tiziano

