Phillip Farber
Hello,
I'm posting here with a somewhat technical question in the hope
of finding someone with experience coding C++ against the SP_API
in OpenSP 1.5.
I have an app that uses the SP_API to parse XML and record file
offsets for elements and attribute values. It works fine with
ISO-8859-1 encoded data. With UTF-8 encoded XML, however, it has
two problems: element and attribute names containing characters
that encode as multi-byte UTF-8 sequences come back corrupted,
and the offsets I get are character offsets rather than byte
offsets, which are useless to me for low-level I/O on the data.
My XML begins with:
<?xml version="1.0" encoding="utf-8"?>
and I'm setting these environment variables in my main program:
(void)putenv("SP_CHARSET_FIXED=YES");
(void)putenv("SP_ENCODING=XML");
The parser gets the *character* count right but the *byte* count
(or length) wrong. E.g. it tells me that the element name
composed of the 3 Greek characters U+03D5 U+03AC U+03C9 has
length=3 but the number of bytes the element occupies in my XML
file is 6, i.e. 2 bytes per character. I can't find anything in
the API that will return me the *byte* offset. What have I
missed?
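For illustration, the character-count versus byte-count mismatch can be reproduced without OpenSP at all. The following is a minimal sketch, assuming the name is available as a sequence of Unicode code points; the function names utf8_char_length and utf8_name_length are hypothetical and are not part of the SP_API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of OpenSP): number of bytes needed to
// encode one Unicode code point as UTF-8. Returns 0 for values that
// cannot be encoded (surrogates and anything above U+10FFFF).
std::size_t utf8_char_length(std::uint32_t cp)
{
    if (cp < 0x80) return 1;                      // ASCII range
    if (cp < 0x800) return 2;                     // e.g. Greek letters
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   // surrogates
    if (cp < 0x10000) return 3;
    if (cp <= 0x10FFFF) return 4;
    return 0;
}

// Hypothetical helper: byte length of a whole name given as code points.
std::size_t utf8_name_length(const std::vector<std::uint32_t> &name)
{
    std::size_t total = 0;
    for (std::uint32_t cp : name)
        total += utf8_char_length(cp);
    return total;
}
```

Calling utf8_name_length({0x03D5, 0x03AC, 0x03C9}) gives 6 for a name whose character count is 3, which is the discrepancy described above.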
Further, when I ask for an attribute's name in order to process
it, I get back only as many bytes as there are characters in the
name. That is wrong for a multi-byte encoding like UTF-8, and the
bytes I do get back are corrupt when read as UTF-8.
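One bytes-per-character result like this would be expected if each wide character were narrowed to a single char instead of being re-encoded; whether that is what happens inside the parser is an assumption here, not something confirmed. As a minimal sketch of the re-encoding step, assuming the name can be obtained as 32-bit code points (a stand-in for OpenSP's Char, not its actual definition), with hypothetical helper names append_utf8 and to_utf8:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of OpenSP): append the UTF-8 encoding
// of one code point to out. Unencodable values (surrogates, values
// above U+10FFFF) are replaced by U+FFFD.
void append_utf8(std::string &out, std::uint32_t cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        cp = 0xFFFD;

    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: encode a whole name as a UTF-8 byte string.
std::string to_utf8(const std::vector<std::uint32_t> &name)
{
    std::string out;
    for (std::uint32_t cp : name)
        append_utf8(out, cp);
    return out;
}
```

For the Greek name U+03D5 U+03AC U+03C9, to_utf8 produces the six bytes CF 95 CE AC CF 89, so the size of the returned string is the byte length needed for file I/O.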
Here, simplified, is my attribute event handler:
void XRegionEventHandler::attributes(
    const AttributeList &attributes,
    const StorageObjectSpec *el_storageObj,
    unsigned long &epos)
{
    char name[MAX_NAME];
    const StorageObjectSpec *attr_storageObj;
    size_t nAttributes = attributes.size();
    unsigned long spos;

    for (size_t i = 0; i < nAttributes; i++)
    {
        const Text *text;
        const StringC *string;
        const AttributeValue *value = attributes.value(i);
        if (value)
        {
            switch (value->info(text, string))
            {
            case AttributeValue::cdata:
                {
                    TextIter iter(*text);
                    TextItem::Type type;
                    const Char *p;
                    size_t length;
                    const Location *loc;
                    while (iter.next(type, p, length, loc))
                    {
                        epos = spos + length - CHAR_SIZE;
                        name << attributes.name(i);
                        // process "name" here ...
                    }
                }
                break;
            default:
                break;
            }
        }
    }
}
I've stepped through the overloaded << operator in the line:
    name << attributes.name(i);
and the name returned by attributes.name(i) is already corrupt
at that point.
Has anyone successfully parsed UTF-8 encoded multi-byte XML and
retrieved byte offsets and the UTF-8 encoded form of the element
and attribute names?
Any help much appreciated and thanks,
Phil