Regexp for matching UTF-8 characters without close tag

Discussion in 'Ruby' started by Jesse P., Jan 5, 2008.

  1. Jesse P.

    Jesse P. Guest

    Hi all,

    I'm trying to solve this problem:

    string = "\302</u>"
    TEXT_PATTERN = /\A([^<]*)/um
    text_data = string.match(TEXT_PATTERN).to_s
    => "\302</u>"

    As you can see, the regular expression incorrectly captures not only
    the text part but also the closing tag, whereas what is supposed to be
    captured is just "\302".

    This problem is actually part of the REXML::Source#match method
    (http://www.germane-software.com/projects/rexml/browser/trunk/src/
    rexml/source.rb?rev=1266#L104) and causes REXML to parse UTF-8
    documents incorrectly sometimes.

    Any ideas why the pattern matching doesn't work? I don't see anything
    wrong with the regular expression, although I'm not sure what the \A
    character class is for.

    Best regards,

    Jesse
     
    Jesse P., Jan 5, 2008
    #1

  2. A solution may be:

    require 'iconv'
    string = "\302</u>" # not valid UTF-8: the character \302 is \303\202 in UTF-8
    string = Iconv.conv("UTF-8","ISO-8859-1",string)
    TEXT_PATTERN = /\A([^<]*)/um
    text_data = string.match(TEXT_PATTERN).to_s
    text_data = Iconv.conv("ISO-8859-1","UTF-8",text_data)

    puts text_data

    \A -> beginning of the string (it is an anchor, not a character class)
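
    To make the \A anchor concrete, here is a small sketch comparing it
    with ^. The sample text and variable names are made up for
    illustration; only the two anchors come from the discussion above.

```ruby
# Hypothetical sample text with two lines.
text = "first line\nsecond line"

# \A anchors at the very start of the string only.
at_string_start = text =~ /\Asecond/

# ^ anchors at the start of any line, so it also matches after "\n".
at_line_start = text =~ /^second/

p at_string_start  # nil, because "second" is not at the start of the string
p at_line_start    # index of "second", right after the newline
```

    So in TEXT_PATTERN the \A only pins the match to the start of the
    remaining input; it has no influence on which bytes [^<]* accepts.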
     
    Tiziano Merzi, Jan 5, 2008
    #2

  3. Jesse P.

    Jesse P. Guest

    Hi Tiziano,

    My apologies. It seems that I have oversimplified the problem due to my
    lack of understanding of UTF-8.

    The actual string is an XML file I obtained from Flickr at
    http://api.flickr.com/services/[email protected]&api_sig=6a39aab2fb665e24d2b6e1cef9d0be27:
    An excerpt is as follows:
    <?xml version="1.0" encoding="utf-8" ?>
    <rsp stat="ok">
    <person id="[email protected]" nsid="[email protected]" isadmin="0" ispro="0"
    iconserver="136" iconfarm="1">
    <username>(_.·´¯`·â"¢â(tm) â'ª Emirates Wizard â'ªâ(tm) â"¢Â·Â</
    username>
    <realname />
    <mbox_sha1sum>0b88a178b28c40ff81d44c5ae475438abec2009c</mbox_sha1sum>
    <location />
    <photosurl>http://www.flickr.com/photos/emirates_wizard/</photosurl>
    <profileurl>http://www.flickr.com/people/emirates_wizard/</
    profileurl>
    <mobileurl>http://m.flickr.com/photostream.gne?id=5467956</mobileurl>
    <photos>
    <firstdatetaken>2006-07-16 15:22:42</firstdatetaken>
    <firstdate>1162548449</firstdate>
    <count>36</count>
    </photos>
    </person>
    </rsp>

    The part of the XML that is causing the problem is in the <username>
    tag, which in Ruby is represented with octal escapes as:
    "<username>(_.\302\267\302\264\302\257`\302\267\342\204\242\342\231
    \342\202\252 Emirates Wizard \342\202\252\342\231
    \342\204\242\302\267\302</username>"

    Note that the XML says that the contents are in UTF-8. So when I use
    REXML to process this XML, after it processes the tag
    "<username>", it is left with
    string = "(_.\302\267\302\264\302\257`\302\267\342\204\242\342\231
    \342\202\252 Emirates Wizard \342\202\252\342\231
    \342\204\242\302\267\302</username>"

    I just checked, and if I match this string with
    TEXT_PATTERN = /\A([^<]*)/um, I get the text and also the close tag.
    TEXT_PATTERN = /\A([^<]*)/um
    text_data = string.match(TEXT_PATTERN).to_s
    => "(_.\302\267\302\264\302\257`\302\267\342\204\242\342\231
    \342\202\252 Emirates Wizard \342\202\252\342\231
    \342\204\242\302\267\302</username>"

    Assuming that the XML has some malformed data (some bytes are not
    actually valid UTF-8), is there any way that I can process the XML as
    it is and only treat the malformed data differently? (e.g. you
    mentioned that the \302 character is not a UTF-8 character)
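
    One way to check that assumption is to ask Ruby whether the bytes
    form valid UTF-8. The sketch below assumes a Ruby with
    encoding-aware strings (1.9 or later) and uses a shortened,
    hypothetical stand-in for the <username> contents: a valid "·"
    (\302\267) followed by a lone \302 just before the closing tag. A
    likely explanation for the original symptom is that \302 announces
    a two-byte character, so a UTF-8-aware matcher treats the following
    "<" as part of that character.

```ruby
# Shortened stand-in for the <username> contents (hypothetical sample).
raw = "\302\267\302</username>".dup.force_encoding(Encoding::UTF_8)

# The trailing \302 is a lead byte with no continuation byte after it.
is_valid = raw.valid_encoding?
puts is_valid

# On a binary copy, "<" is always seen as its own byte, so the match
# stops right before the closing tag.
text_bytes = raw.b[/\A[^<]*/]
p text_bytes.bytes

# The first two bytes on their own are a well-formed character.
leading = text_bytes[0, 2].force_encoding(Encoding::UTF_8)
puts leading.valid_encoding?
```

    The byte-level match returns the text without the closing tag, but
    the result still contains the stray \302 and cannot be treated as
    UTF-8 text until that byte is dealt with.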

    Best regards,

    Jesse



     
    Jesse P., Jan 5, 2008
    #3
  4. You have a broken UTF-8 document (I read Matz's reply).

    I see two solutions:

    a)

    require 'rexml/document'
    require 'iconv'

    data = xml_source # placeholder: your XML document as a String
    data = Iconv.conv("UTF-8","ISO-8859-1",data)
    doc = REXML::Document.new(data)

    Before using data taken from doc, you must convert it back from
    UTF-8 to ISO-8859-1:

    username = Iconv.conv("ISO-8859-1","UTF-8",doc. )

    b) change the encoding of the xml

    require 'rexml/document'

    data = xml_source # placeholder: your XML document as a String
    data = data.gsub(/encoding="utf-8"/i, 'encoding="iso-8859-1"')
    doc = REXML::Document.new(data)

    Note that neither a) nor b) works if the document also contains valid
    UTF-8 characters outside 7-bit ASCII (for example accented Latin
    letters such as è, ò, etc.)
     
    Tiziano Merzi, Jan 5, 2008
    #4
  5. Jesse P.

    Jesse P. Guest

    Thanks Tiziano :)

     
    Jesse P., Jan 6, 2008
    #5
