OpenSP API, Unicode character byte offsets

Phillip Farber

Hello,

I'm posting here with a somewhat technical question in the hope
of finding someone with experience coding C++ against the SP_API
in OpenSP 1.5.

I have an app that uses the SP_API to parse XML and record file
offsets for elements and attribute values. It works fine with
ISO-8859-1 encoded data. With UTF-8 encoded XML, however, there
are two problems. First, it corrupts element and attribute names
that contain characters encoded as UTF-8 multi-byte sequences.
Second, it only gives me character (as opposed to byte) offsets,
which are useless when I need to do low-level I/O on the data.

My XML begins with:

<?xml version="1.0" encoding="utf-8"?>

and I'm setting these envvars in my main program:

(void)putenv("SP_CHARSET_FIXED=YES");
(void)putenv("SP_ENCODING=XML");

The parser gets the *character* count right but the *byte* count
(or length) wrong. E.g. it tells me that the element name
composed of the 3 Greek characters U+03D5 U+03AC U+03C9 has
length=3 but the number of bytes the element occupies in my XML
file is 6, i.e. 2 bytes per character. I can't find anything in
the API that will return me the *byte* offset. What have I
missed?
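To make the character/byte mismatch concrete, here is a minimal sketch of deriving UTF-8 byte lengths from code points. It does not touch the OpenSP API. The function names utf8ByteLength and utf8TextByteLength are hypothetical, and the sketch assumes each value is a complete Unicode code point rather than half of a UTF-16 surrogate pair.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes one Unicode code point occupies when encoded as UTF-8.
// Assumption: cp is a full code point, not a UTF-16 surrogate half.
std::size_t utf8ByteLength(char32_t cp)
{
    assert(cp <= 0x10FFFF);      // beyond the Unicode range
    if (cp < 0x80)    return 1;  // ASCII
    if (cp < 0x800)   return 2;  // includes Greek, e.g. U+03D5
    if (cp < 0x10000) return 3;  // rest of the Basic Multilingual Plane
    return 4;                    // supplementary planes
}

// Total UTF-8 byte length of a run of code points.
std::size_t utf8TextByteLength(const std::u32string &text)
{
    std::size_t bytes = 0;
    for (char32_t cp : text)
        bytes += utf8ByteLength(cp);
    return bytes;
}
```

For the element name above, utf8TextByteLength(U"\u03D5\u03AC\u03C9") yields 6 for a character length of 3. Summing the same way over every character that precedes a position would give a byte offset only if the file is UTF-8 throughout, carries no byte order mark, and the character count includes every character actually present in the file (for instance, no line-ending normalization). Those conditions are assumptions here, not something the API is known to guarantee.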

Further, take the case of an attribute name. When I ask for the
attribute's name in order to process it, I get back only as many
bytes as there are characters in the name. That is wrong for a
multi-byte encoding like UTF-8, and the bytes I do get back are
not valid UTF-8.
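Getting back exactly one byte per character is the pattern one would see if wide character values were copied into a char buffer one byte each, although whether that is what happens here is only a guess. The sketch below contrasts such a narrowing copy with an explicit UTF-8 encoding step. It is independent of OpenSP; appendUtf8, toUtf8 and narrowLossy are hypothetical names, and the input is assumed to be a sequence of complete Unicode code points.

```cpp
#include <cassert>
#include <string>

// Append the UTF-8 encoding of one code point to out.
void appendUtf8(std::string &out, char32_t cp)
{
    assert(cp <= 0x10FFFF);  // beyond the Unicode range
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encode a whole run of code points as UTF-8.
std::string toUtf8(const std::u32string &text)
{
    std::string out;
    for (char32_t cp : text)
        appendUtf8(out, cp);
    return out;
}

// What a narrowing copy does: keeps only the low 8 bits of each value,
// so the result has one byte per character and is not UTF-8.
std::string narrowLossy(const std::u32string &text)
{
    std::string out;
    for (char32_t cp : text)
        out += static_cast<char>(cp & 0xFF);
    return out;
}
```

For the three Greek characters U+03D5 U+03AC U+03C9, toUtf8 produces the six bytes CF 95 CE AC CF 89, while narrowLossy produces the three bytes D5 AC C9, which is not a valid UTF-8 sequence.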

Here, simplified, is my attribute event handler:

void XRegionEventHandler::attributes (
    const AttributeList &attributes,
    const StorageObjectSpec *el_storageObj,
    unsigned long &epos )
{
    char name[MAX_NAME];
    const StorageObjectSpec *attr_storageObj;
    size_t nAttributes = attributes.size();
    unsigned long spos;   // never assigned in this simplified listing

    for (size_t i = 0; i < nAttributes; i++)
    {
        const Text *text;
        const StringC *string;
        const AttributeValue *value = attributes.value(i);

        if (value)
        {
            switch (value->info(text, string))
            {
            case AttributeValue::cdata:
                {
                    TextIter iter(*text);
                    TextItem::Type type;
                    const Char *p;
                    size_t length;
                    const Location *loc;

                    while (iter.next(type, p, length, loc))
                    {
                        // length is a count of characters, not bytes
                        epos = spos + length - CHAR_SIZE;

                        // my own overloaded operator<< copies the name
                        name << attributes.name(i);

                        // process "name" here ...
                    }
                }
                break;

            default:
                break;
            }
        }
    }
}

I've walked through the overloaded << operator in the line:

name << attributes.name(i);

and found that the name returned by attributes.name(i) is
already corrupt.

Has anyone successfully parsed UTF-8 encoded multi-byte XML and
retrieved byte offsets and the UTF-8 encoded form of the element
and attribute names?

Any help much appreciated and thanks,

Phil
 
