Mapforce: mapping to CSV without column header line inserts hex FF FE FF FE

Lukas

Hi Group,

In Mapforce 2005 R3, when mapping to CSV with the "First row contains
field names" option UN-checked on the CSV target component settings,
the characters (hex) FF FE FF FE are inserted in the beginning of the
first line when running Java code autogenerated by Mapforce.

In the output tab of the Mapforce application, this problem doesn't
occur. I've not checked whether it occurs when running C#, C++ or XSLT
autogenerated code.

I've encountered this problem when mapping XML to CSV and CSV to CSV.

Does anyone know whether this is a known bug? Is it fixed in a
later release?
Any known workarounds?

Not holding my breath,

Lukas
 
Lukas

Correction:

My editor was displaying those bytes incorrectly.
The bytes inserted are actually:

EF BB BF
 
Peter Flynn

Lukas said:
Hi Group,

In Mapforce 2005 R3, when mapping to CSV with the "First row contains
field names" option UN-checked on the CSV target component settings,
the characters (hex) FF FE FF FE are inserted in the beginning of the
first line when running Java code autogenerated by Mapforce.

In the output tab of the Mapforce application, this problem doesn't
occur. I've not checked whether it occurs when running C#, C++ or XSLT
autogenerated code.

I've encountered this problem when mapping XML to CSV and CSV to CSV.

Does anyone know whether this is a known bug? Is it fixed in a
later release?
Any known workarounds?

It's not a bug, it's part of XML. It's the Byte Order Mark (BOM) which
is designed to signal to a processor before processing starts which
16-bit character encoding is in use. It's being output because your
processor is emitting UCS-2 which is probably unnecessary unless you
are using a very wide range of character repertoire planes. Check the
Mapforce output settings and switch to UTF-8 instead.

///Peter
 
Richard Tobin

Lukas said:
My editor was displaying those bytes incorrectly.
The bytes inserted are actually:

EF BB BF

I can't help you directly, but EF BB BF is the UTF-8 code for a
byte-order mark (or "BOM"). Maybe you can look that up in the manual
for your software.
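
To check which mark a given output file starts with, the leading bytes can be compared against the known BOM patterns. The following is a minimal Java sketch, not Mapforce code: the class and method names are invented for illustration, and it only covers UTF-8 and the two UTF-16 byte orders (UTF-32 is ignored).

```java
/** Sketch: classify the first bytes of a file by their byte-order mark. */
class BomSniffer {

    /** Returns the encoding implied by a leading BOM, or "none" if no BOM is found. */
    static String sniff(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return "none";
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned, since Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] csvStart = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', ',', 'b'};
        System.out.println(sniff(csvStart)); // UTF-8
    }
}
```

In practice the first few bytes of the CSV file would be read from disk and passed to `sniff`.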

-- Richard
 
Lukas

Sorry for the confusion. The sequence was actually EF BB BF (UTF-8 BOM,
as Richard notes).

What confuses me about the UTF-8 BOM issue:

A) In XML: Since I'm using UTF-8, which is a 7 bit encoding, and the
xml processing instruction says so explicitly, why would I want to have
nasty binary at the start of my document?

B)
* In Text (CSV): some articles claim that Windows Notepad handles the
BOM gracefully, but in our project the issue would've not even been
raised if our editors had not displayed spurious characters;
... "" (if you view this in ISO 8859-1) in Notepad, a dot in
Ultraedit 8.2. When switching to hex in Ultraedit, completely wrong
values are being displayed through the length of the doc.

* The issue did not occur when (in Mapforce) the option "First row
contains field names" was checked for the output CSV, although we
viewed the output files with the same editors.

* Mapforce ITSELF doesn't handle the BOM gracefully. If the CSV output
with BOM from one Mapforce code-gen mapping is fed as input to another,
the BOM is visible in the first field and trips up functions operating
on that field.
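
One possible workaround for the last point is to drop the mark before the data reaches the mapping. Java's standard UTF-8 decoder does not remove a leading BOM; it hands it through as the character U+FEFF, which would match the symptom of the BOM showing up in the first field. Whether the generated Mapforce code reads its input this way is an assumption. The sketch below uses invented names and works on an in-memory byte array for brevity.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: decode UTF-8 data and discard a leading byte-order mark. */
class BomStripper {
    static final char BOM = '\uFEFF';

    /** Decodes UTF-8 bytes and drops one leading U+FEFF if present. */
    static String decodeUtf8WithoutBom(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_8);
        if (!text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        byte[] csv = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'i', 'd', ',', 'x'};
        String firstField = decodeUtf8WithoutBom(csv).split(",")[0];
        System.out.println(firstField.equals("id")); // true
    }
}
```

The same check could be applied to the first field only, if pre-processing the whole file is not an option.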
 
Lukas

Sorry, something doesn't display in my last post. It's meant to read:

...

[ASCII-art drawing of the three characters: i with diaeresis, double greater-than, inverted question mark]

(if you view this in ISO 8859-1) in Notepad, a dot ...
 
Richard Tobin

Lukas said:
A) In XML: Since I'm using UTF-8, which is a 7 bit encoding, and the
xml processing instruction says so explicitly, why would I want to have
nasty binary at the start of my document?

UTF-8 is not a 7-bit encoding! It corresponds to ASCII for characters
up to 127, but uses bytes with the high bit set to encode the rest of
Unicode.
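
A small Java sketch makes the difference visible (class and method names are invented for illustration): an ASCII letter encodes to a single byte below 0x80, while other characters encode to two or more bytes that all have the high bit set.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: show the UTF-8 bytes of a string as hex. */
class Utf8Bytes {

    /** Returns the UTF-8 encoding of s as space-separated hex bytes. */
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("A"));      // 41        (same as ASCII)
        System.out.println(hex("\u00E9")); // C3 A9     (e with acute accent)
        System.out.println(hex("\u20AC")); // E2 82 AC  (euro sign)
    }
}
```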

Lukas said:
* In Text (CSV): some articles claim that Windows Notepad handles the
BOM gracefully, but in our project the issue would've not even been
raised if our editors had not displayed spurious characters;
... "" (if you view this in ISO 8859-1) in Notepad

I don't know anything about Notepad, but if you see those characters -
i with diaeresis, double greater-than, inverted question mark - it
means that the program is interpreting the document as 8859-1 rather
than UTF-8. Of course, the whole point of the UTF-8 BOM is to let it
know that it's in UTF-8!
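
The effect can be reproduced in a few lines of Java. This is only an illustrative sketch with invented names: the same three bytes are decoded once as ISO 8859-1 and once as UTF-8, and the resulting code points are printed.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: decode the UTF-8 BOM bytes under two different encodings. */
class BomMisread {
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    static String asUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Lists the UTF-16 code units of s in U+XXXX notation. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Three separate characters: i with diaeresis, double greater-than, inverted question mark.
        System.out.println(codePoints(asLatin1(UTF8_BOM))); // U+00EF U+00BB U+00BF
        // One character: the byte-order mark itself.
        System.out.println(codePoints(asUtf8(UTF8_BOM)));   // U+FEFF
    }
}
```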

-- Richard
 
Peter Flynn

Lukas said:
Sorry for the confusion. The sequence was actually EF BB BF (UTF-8
BOM, as Richard notes).

What confuses me about the UTF-8 BOM issue:

A) In XML: Since I'm using UTF-8, which is a 7 bit encoding,

Whoah there. UTF-8 uses all 8 bits in the byte. Where did you get the
information that it's 7-bit? The only 7-bit encoding in widespread
use is US-ASCII.

Lukas said:
and the
xml processing instruction says so explicitly, why would I want to
have nasty binary at the start of my document?

To identify that it is UTF-8 as opposed to UTF-16 or UTF-32.
If your XML software can't handle it, it's broken and should be
replaced.

Lukas said:
B)
* In Text (CSV): some articles claim that Windows Notepad handles the
BOM gracefully, but in our project the issue would've not even been
raised if our editors had not displayed spurious characters;
... "" (if you view this in ISO 8859-1) in Notepad, a dot in
Ultraedit 8.2. When switching to hex in Ultraedit, completely wrong
values are being displayed through the length of the doc.

While most plaintext editors will display ASCII or ISO-8859-1
adequately, large numbers of them spit blood when faced with anything
else. Notepad is suitable for shopping lists and not much else.

Lukas said:
* The issue did not occur when (in Mapforce) the option "First row
contains field names" was checked for the output CSV, although we
viewed the output files with the same editors.

* Mapforce ITSELF doesn't handle the BOM gracefully. If the CSV output
with BOM from one Mapforce code-gen mapping is fed as input to
another, the BOM is visible in the first field and trips up functions
operating on that field.

Sounds like Mapforce is broken and you should complain to the vendor.

///Peter
 
Shmuel (Seymour J.) Metz

Richard Tobin, on 12/14/2005, said:
I don't know anything about Notepad, but if you see those characters -
i with diaeresis, double greater-than, inverted question mark - it
means that the program is interpreting the document as 8859-1 rather
than UTF-8. Of course, the whole point of the UTF-8 BOM is to let it
know that it's in UTF-8!

Why would you need a BOM for UTF-8? It's only needed for characters
larger than an octet, e.g., UTF-16, raw UCS4.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to (e-mail address removed)
 
Richard Tobin

Shmuel (Seymour J.) Metz said:
Why would you need a BOM for UTF-8? It's only needed for characters
larger than an octet, e.g., UTF-16, raw UCS4.

It also serves to indicate the encoding, as well as which byte-order
variant.

-- Richard
 
Shmuel (Seymour J.) Metz

Richard Tobin, on 12/19/2005, said:
It also serves to indicate the encoding, as well as which byte-order
variant.

What byte-order variant? UTF-8 uses a stream of 8-bit bytes (octets),
not a stream of 16-bit bytes; there is no byte ordering issue. The BOM
is needed for UTF-16 and raw Unicode, not for UTF-8.

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>
 
Richard Tobin

Shmuel (Seymour J.) Metz said:
What byte-order variant? UTF-8 uses a stream of 8-bit bytes (octets),
not a stream of 16-bit bytes; there is no byte ordering issue.

The obvious use of a BOM - as the name implies - is to indicate which
byte order variant of an encoding is being used. It is *also* used to
indicate the encoding itself. Obviously for UTF-8 only this second
function is relevant.
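
Both functions can be seen by encoding the BOM character, U+FEFF, in different encodings. The Java sketch below uses invented names and is only an illustration: the two UTF-16 variants give mirror-image byte pairs (byte order), while UTF-8 gives a three-byte sequence that differs from both (encoding signature).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Sketch: the bytes that U+FEFF turns into under different encodings. */
class BomBytes {

    /** Returns the encoding of U+FEFF in the given charset as hex bytes. */
    static String bomIn(Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : "\uFEFF".getBytes(charset)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(bomIn(StandardCharsets.UTF_16BE)); // FE FF
        System.out.println(bomIn(StandardCharsets.UTF_16LE)); // FF FE
        System.out.println(bomIn(StandardCharsets.UTF_8));    // EF BB BF
    }
}
```

The explicit `UTF_16BE` and `UTF_16LE` charsets are used here because Java's plain `UTF_16` encoder writes a BOM of its own, which would double the mark.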

-- Richard
 
