OpenSP API, Unicode character byte offsets

Discussion in 'XML' started by Phillip Farber, Aug 20, 2003.

  1. Hello,

    I'm posting here with a somewhat technical question in the hope
    of finding someone with experience coding C++ against the SP_API
    in OpenSP 1.5.

    I have an app that uses the SP_API to parse XML and record file
    offsets for elements and attribute values. It works fine with
    ISO-8859-1 encoded data. With UTF-8 encoded XML, however, there
    are two problems. First, it corrupts element and attribute names
    that contain characters encoded as UTF-8 multi-byte sequences.
    Second, it gives me only character offsets, not byte offsets, and
    character offsets are useless to me when I need to do low-level
    I/O on the data.

    My XML begins with:

    <?xml version="1.0" encoding="utf-8"?>

    and I'm setting these envvars in my main program:

    (void)putenv("SP_CHARSET_FIXED=YES");
    (void)putenv("SP_ENCODING=XML");

    The parser gets the *character* count right but the *byte* count
    (or length) wrong. E.g. it tells me that the element name
    composed of the 3 Greek characters U+03D5 U+03AC U+03C9 has
    length=3 but the number of bytes the element occupies in my XML
    file is 6, i.e. 2 bytes per character. I can't find anything in
    the API that will return me the *byte* offset. What have I
    missed?
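To make the character/byte mismatch concrete, here is a minimal sketch that computes how many bytes a run of code points occupies once encoded as UTF-8. It assumes the parser hands back one Unicode code point per character; `std::uint32_t` is only a stand-in for the parser's character type, and `utf8_length` is a hypothetical helper, not part of the SP_API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bytes UTF-8 needs to encode a single Unicode code point.
// (Hypothetical helper; std::uint32_t stands in for the parser's character type.)
std::size_t utf8_length(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);      // anything larger is not a Unicode code point
    if (cp < 0x80)    return 1;  // ASCII
    if (cp < 0x800)   return 2;  // includes the Greek block
    if (cp < 0x10000) return 3;  // rest of the Basic Multilingual Plane
    return 4;                    // supplementary planes
}

// Total UTF-8 byte length of a run of n code points.
std::size_t utf8_length(const std::uint32_t *p, std::size_t n)
{
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < n; i++)
        bytes += utf8_length(p[i]);
    return bytes;
}
```

For the element name above (U+03D5 U+03AC U+03C9) this gives 6 bytes for 3 characters, which matches what is in the file.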

    Further, when I ask for an attribute's name in order to process
    it, I get back only as many bytes as there are characters in the
    name. That is wrong for a multi-byte encoding like UTF-8, and the
    bytes I do get back are not valid UTF-8.
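One possible explanation for "as many bytes as characters" is that each wide character is being narrowed to a single `char` somewhere along the way. That is an assumption, not something confirmed against the OpenSP source. The sketch below contrasts that kind of narrowing with a proper re-encoding to UTF-8. All names here (`append_utf8`, `to_utf8`, `narrow_each`) are hypothetical, and `std::uint32_t` again stands in for the parser's character type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Append the UTF-8 encoding of one code point to out (hypothetical helper).
void append_utf8(std::string &out, std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);  // anything larger is not a Unicode code point
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Re-encode n code points as a UTF-8 byte string.
std::string to_utf8(const std::uint32_t *p, std::size_t n)
{
    std::string out;
    for (std::size_t i = 0; i < n; i++)
        append_utf8(out, p[i]);
    return out;
}

// Lossy conversion that keeps only the low byte of each code point.
// Shown only to illustrate the suspected failure mode.
std::string narrow_each(const std::uint32_t *p, std::size_t n)
{
    std::string out;
    for (std::size_t i = 0; i < n; i++)
        out += static_cast<char>(p[i] & 0xFF);
    return out;
}
```

With the three Greek characters as input, `to_utf8` yields 6 bytes, while `narrow_each` yields 3 bytes that are not a valid UTF-8 sequence. That matches the symptom described, but it does not prove the cause.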

    Here, simplified, is my attribute event handler:

    void XRegionEventHandler::attributes (
        const AttributeList &attributes,
        const StorageObjectSpec *el_storageObj,
        unsigned long &epos )
    {
        char name[MAX_NAME];
        const StorageObjectSpec *attr_storageObj;
        size_t nAttributes = attributes.size();
        unsigned long spos;  // start offset (not set in this simplified listing)

        for (size_t i = 0; i < nAttributes; i++)
        {
            const Text *text;
            const StringC *string;
            const AttributeValue *value = attributes.value(i);

            if (value)
            {
                switch (value->info(text, string))
                {
                case AttributeValue::cdata:
                    {
                        TextIter iter(*text);
                        TextItem::Type type;
                        const Char *p;
                        size_t length;
                        const Location *loc;

                        while (iter.next(type, p, length, loc))
                        {
                            // length is a count of characters, not bytes
                            epos = spos + length - CHAR_SIZE;

                            name << attributes.name(i);

                            // process "name" here ...
                        }
                    }
                    break;

                default:
                    break;
                }
            }
        }
    }

    I've walked through the overloaded << operator in the line:

    name << attributes.name(i);

    and found that the value returned by attributes.name(i) is already corrupt.

    Has anyone successfully parsed UTF-8 encoded multi-byte XML and
    retrieved byte offsets and the UTF-8 encoded form of the element
    and attribute names?
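One workaround, independent of the parser, is to translate a character offset into a byte offset by scanning the raw UTF-8 bytes of the file. The sketch below assumes the whole file is in memory as valid UTF-8, and that the character offset counts characters from the start of that same buffer with no newline normalization or entity expansion in between. Whether the offsets reported by the parser satisfy that assumption would need checking. `byte_offset` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Byte offset of the character at index char_offset in a UTF-8 buffer.
// Returns buf.size() if char_offset is at or past the last character.
// (Hypothetical helper; assumes buf holds valid UTF-8.)
std::size_t byte_offset(const std::string &buf, std::size_t char_offset)
{
    // A character offset can never exceed the byte length of the buffer.
    assert(char_offset <= buf.size());

    std::size_t chars = 0;
    for (std::size_t i = 0; i < buf.size(); i++) {
        // Continuation bytes have the form 10xxxxxx; any other byte
        // begins a new character.
        bool starts_char =
            (static_cast<unsigned char>(buf[i]) & 0xC0) != 0x80;
        if (starts_char) {
            if (chars == char_offset)
                return i;
            chars++;
        }
    }
    return buf.size();
}
```

This is a linear scan, so for many lookups in a large file it would be cheaper to walk the buffer once in step with the parse events than to rescan from the start each time.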

    Any help much appreciated and thanks,

    Phil
    ----
    Phillip Farber, Programmer
    Digital Library Production Service
    Hatcher Graduate Library, University of Michigan
     
