New to XML

jodleren

Hi

I thought XML would be simpler. My problem: I am storing some news
stories in XML, for example:

<?xml version="1.0" ?>
<article>
<date>20081111</date>
<author>name with non english characters</author>
<header1>text with non english characters</header1>
</article>

The problem is the non-English characters: how do I store e.g. &oslash;
or &otilde; in there?

WBR
Sonnich
 
Martin Honnen

jodleren said:
<?xml version="1.0" ?>
<article>
<date>20081111</date>
<author>name with non english characters</author>
<header1>text with non english characters</header1>
</article>

the problem is non english characters - how do I store e.g. &oslash;
or &otilde; in there?

XML uses and supports Unicode, so simply use an editor that supports
Unicode to edit and save your XML documents. That way you can use the
characters directly and don't need any character or entity references.
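
As a rough illustration of this advice (a sketch only: Python and its standard xml.etree.ElementTree module are assumptions, and the author name and file name are made-up placeholders), the following writes a document as UTF-8 with the characters typed literally and reads it back:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical article content; the author name is a placeholder.
xml_text = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<article>\n"
    "<date>20081111</date>\n"
    "<author>Søren Õun</author>\n"
    "</article>\n"
)

# Save the text as UTF-8, which is what a Unicode-aware editor would do.
path = os.path.join(tempfile.mkdtemp(), "article.xml")
with open(path, "w", encoding="utf-8") as f:
    f.write(xml_text)

# The parser decodes the literal characters; no entity references are needed.
author = ET.parse(path).getroot().findtext("author")
print(author)
```

The only requirement is that the bytes on disk really are in the encoding the XML declaration names (or UTF-8/UTF-16 if none is named).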
 
jodleren

XML uses and supports Unicode so simply use an editor that supports
Unicode to edit and save your XML documents, that way you can use
characters directly and don't need any character or entity references.

Well, that does not work either. Both cases fail:


<?xml version="1.0" standalone="yes"?>
<document>
<aphorism>Shit happens</aphorism>
<author>unknown</author>
<language>English</language>
<more>Ø</more>
</document>



<?xml version="1.0" standalone="yes"?>
<document>
<aphorism>Shit happens</aphorism>
<author>unknown</author>
<language>English</language>
<more>&Oslash;</more>
</document>


and they fail at the same line: &Oslash;, even &amp;Oslash; (someone
suggested that), and Ø all fail. How do I overcome this?

WBR
Sonnich
 
Martin Honnen

jodleren said:
Well, that does not work either. Both cases fail:


<?xml version="1.0" standalone="yes"?>
<document>
<aphorism>Shit happens</aphorism>
<author>unknown</author>
<language>English</language>
<more>Ø</more>
</document>

Works fine for me: http://home.arcor.de/martin.honnen/xml/test2008120403.xml

If you still think there are problems then you need to explain exactly
what you have tried and why you think it failed. I am afraid "does not
work" does not tell us what you have tried exactly and what kind of
failure you think there is. You have managed to use the character "Ø"
literally in your Usenet post, why should that pose a problem in an XML
document?
 
Philippe Poulard

Hi,

jodleren wrote:
<more>Ø</more>

If you write such a character directly, you have to declare the charset
that your editor used to save the file:
<?xml version="1.0" encoding="[the-encoding-that-contains-theOslash]"?>
(Note that if you don't specify the encoding, the default is UTF-8 or
UTF-16; in UTF-8 the Ø is stored as the 2 bytes C3 98, shown here in
hex.)
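
A small sketch of the byte-level point, assuming Python's standard xml.etree.ElementTree as the parser (not something used in this thread): it shows the two encodings of Ø and what happens when ISO-8859-1 bytes are parsed with and without a matching declaration.

```python
import xml.etree.ElementTree as ET

# Ø is U+00D8: one byte in ISO-8859-1, two bytes in UTF-8.
latin1_bytes = "Ø".encode("iso-8859-1")
utf8_bytes = "Ø".encode("utf-8")
print(latin1_bytes.hex(), utf8_bytes.hex())

body = "<document><more>Ø</more></document>"

# No encoding declared, UTF-8 bytes: the default applies and parsing works.
ok_default = ET.fromstring(body.encode("utf-8")).findtext("more")

# No encoding declared, ISO-8859-1 bytes: the lone D8 byte is not valid UTF-8.
try:
    ET.fromstring(body.encode("iso-8859-1"))
    undeclared_failed = False
except ET.ParseError:
    undeclared_failed = True

# The same ISO-8859-1 bytes with a matching declaration parse fine.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body.encode("iso-8859-1")
ok_declared = ET.fromstring(declared).findtext("more")
print(ok_default, undeclared_failed, ok_declared)
```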

Otherwise, you can insert a numeric character reference, which works
whatever the encoding used: &#xD8; (or &#216;) stands for Ø.

<more>&Oslash;</more>

This doesn't work because XML is not HTML: an HTML parser relies on a
hard-coded table of entities that maps Oslash to U+00D8, but in XML
you have to declare this mapping explicitly (with ENTITY in the DTD).
I don't recommend that practice (trust me: don't do that).

XML has 5 predefined entities: &amp; &quot; &apos; &lt; &gt;

"&amp;Oslash;" means that you explicitly want the text sequence
"&Oslash;" and not an entity reference.

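The three cases can be checked with a short sketch, assuming Python's standard xml.etree.ElementTree (the helper name parse_more is made up for this illustration):

```python
import xml.etree.ElementTree as ET

def parse_more(fragment):
    """Return the text of <more>, or None if the document is not well-formed."""
    doc = "<document><more>" + fragment + "</more></document>"
    try:
        return ET.fromstring(doc).findtext("more")
    except ET.ParseError:
        return None

undeclared = parse_more("&Oslash;")       # HTML-style name, never declared: error
numeric_hex = parse_more("&#xD8;")        # numeric character reference (hex)
numeric_dec = parse_more("&#216;")        # numeric character reference (decimal)
escaped_amp = parse_more("&amp;Oslash;")  # literal text, not a reference
print(undeclared, numeric_hex, numeric_dec, escaped_amp)
```
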
--
Regards,

///
(. .)
--------ooO--(_)--Ooo--------
| Philippe Poulard |
-----------------------------
http://reflex.gforge.inria.fr/
Have the RefleX !
 
jodleren

Works fine for me: http://home.arcor.de/martin.honnen/xml/test2008120403.xml

If you still think there are problems then you need to explain exactly
what you have tried and why you think it failed. I am afraid "does not
work" does not tell us what you have tried exactly and what kind of
failure you think there is. You have managed to use the character "Ø"
literally in your Usenet post, why should that pose a problem in an XML
document?

I understand the Unicode part now.

<from ie>
The error I get when the file is _not_ saved as Unicode:
The XML page cannot be displayed
Cannot view XML input using XSL style sheet. Please correct the error
and then click the Refresh button, or try again later.
--------------------------------------------------------------------------------
An invalid character was found in text content. Error processing
resource 'file:///Y:/html2/2770/articles/test.xml'. Line ...
<more>
</from ie>

When I open the file in Notepad I can save it as Unicode, and I have to
do so; saving it as an ordinary text document does not work.
This might cause problems later on, so it would be easier for me to
use &oslash; instead. Would that be possible in any way?

WBR
Sonnich
 
Martin Honnen

jodleren said:
When I open the file in Notepad I can save it as Unicode, and I have to
do so; saving it as an ordinary text document does not work.
This might cause problems later on, so it would be easier for me to
use &oslash; instead. Would that be possible in any way?

I strongly suggest using a Unicode encoding like UTF-8 or UTF-16; those
are the encodings XML parsers have to support.
If you want to use another encoding, you simply need to declare it
in the XML declaration, e.g.
<?xml version="1.0" encoding="ISO-8859-1"?>
is certainly possible.

As for using an entity reference, you would need to declare the entities
first in a document type definition. See
http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent for how to do that. But
be aware that non-validating parsers might not read any external
resources, so you would need to include the definitions in the internal
subset to ensure that any XML parser knows the entities.
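
A minimal sketch of the internal-subset approach, assuming Python's standard xml.etree.ElementTree as the parser; the two ENTITY declarations use the same code points as xhtml-lat1.ent, and the sample text is made up:

```python
import xml.etree.ElementTree as ET

# The entities are declared in the internal subset of the DOCTYPE,
# so the parser does not need to fetch any external resource.
doc = (
    '<?xml version="1.0"?>\n'
    "<!DOCTYPE document [\n"
    '  <!ENTITY Oslash "&#216;">\n'
    '  <!ENTITY oslash "&#248;">\n'
    "]>\n"
    "<document><more>&Oslash;rsted and s&oslash;n</more></document>"
)

text = ET.fromstring(doc).findtext("more")
print(text)
```

Every document that uses the names has to carry these declarations, which is one reason saving as UTF-8 and typing the characters directly is usually less work.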
 
Asger Joergensen

Hi jodleren
jodleren said:
Well, that does not work either. Both cases fail:


<?xml version="1.0" standalone="yes"?>



and they fail at the same line: &Oslash;, even &amp;Oslash; (someone
suggested that), and Ø all fail. How do I overcome this?

I come from Denmark, so I know about the Ø. What you need to do is this:

The header should look like this:

<?xml version="1.0" encoding="ISO-8859-1" ?>
if you save in non-Unicode

or like this if you save in Unicode:
<?xml version="1.0" encoding="UTF-8" ?>

These characters are not allowed in the text in XML files:
& " ' < >
They are reserved for tags and must be translated to
&amp; &quot; &apos; &lt; &gt;

If you use UTF-8, you can use all other characters.

If you use ISO-8859-1, you have to stay within ISO-8859-1.
You can see which characters that covers if you open charmap.exe and choose
Windows: Western under Advanced.
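
To illustrate what "staying within ISO-8859-1" means in practice, here is a hedged sketch using Python's standard xml.etree.ElementTree (an assumption, as is the sample text): when serializing to ISO-8859-1, that library keeps Ø as a single byte but has to fall back to a numeric character reference for a character such as the euro sign, which ISO-8859-1 does not contain.

```python
import xml.etree.ElementTree as ET

root = ET.Element("document")
more = ET.SubElement(root, "more")
more.text = "Ø costs 5 €"  # Ø is in ISO-8859-1, the euro sign is not

# Serialized as ISO-8859-1: an XML declaration naming the encoding is added,
# Ø becomes the byte D8, and the euro sign becomes the reference &#8364;.
latin1_doc = ET.tostring(root, encoding="ISO-8859-1")

# Serialized as UTF-8: every character is written directly.
utf8_doc = ET.tostring(root, encoding="UTF-8")

# Both forms parse back to the same text.
round_trip = ET.fromstring(latin1_doc).findtext("more")
print(latin1_doc)
print(utf8_doc)
```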

Kind regards
Asger
 
jodleren

Hi jodleren




I come from Denmark, so I know about the Ø. What you need to do is this:

The header should look like this:

<?xml version="1.0" encoding="ISO-8859-1" ?>
if you save in non-Unicode

or like this if you save in Unicode:
<?xml version="1.0" encoding="UTF-8" ?>

These characters are not allowed in the text in XML files:
& " ' < >
They are reserved for tags and must be translated to
&amp; &quot; &apos; &lt; &gt;

If you use UTF-8, you can use all other characters.

If you use ISO-8859-1, you have to stay within ISO-8859-1.
You can see which characters that covers if you open charmap.exe and choose
Windows: Western under Advanced.

Hi there

Thanks for the reply, it seems to work. I am still wondering about
all the characters an article might contain, so maybe I will
convert everything to UTF-8 after all. But I can do that later;
for now I can move on with the project.

Thanks for the help

Best regards
Sonnich
 
Peter Flynn

Asger Joergensen wrote:
[...]
These characters are not allowed in the text in XML files
& " ' < >
they are reserved for tags and they must be translated to
&amp; &quot; &apos; &lt; &gt;

No, only & and < are forbidden in text unless escaped. The characters
" ' > are just text and do not require escaping, although > acquires a
special meaning in a start-tag or end-tag, and " and ' are bound by
rules of matching and nesting when used in attributes.
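
A quick check of this rule, sketched with Python's standard library (xml.etree.ElementTree and xml.sax.saxutils.escape are assumptions for illustration; the helper name well_formed is made up):

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def well_formed(text):
    """Return True if the text can be used as element content as-is."""
    try:
        ET.fromstring("<t>" + text + "</t>")
        return True
    except ET.ParseError:
        return False

plain_ok = well_formed('say "hi", it\'s 3 > 2')  # quotes and > are just text
bare_amp = well_formed("fish & chips")           # unescaped & is an error
bare_lt = well_formed("2 < 3")                   # unescaped < is an error

# escape() replaces & and < (and also >) with the predefined entities.
escaped = escape("fish & chips, 2 < 3")
print(plain_ok, bare_amp, bare_lt, escaped)
```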

///Peter
 
Asger Joergensen

Hi Peter

Peter said:
Asger Joergensen wrote:
[...]
These characters are not allowed in the text in XML files
& " ' < >
they are reserved for tags and they must be translated to
&amp; &quot; &apos; &lt; &gt;

No, only & and < are forbidden in text unless escaped. The characters
" ' > are just text and do not require escaping, although > acquires a special meaning in a start-tag or end-tag, and " and ' are bound by rules of matching and nesting when used in attributes.

You are of course right, BUT it is common / good practice to escape
all five.

http://www.w3schools.com/xml/xml_syntax.asp

Kind regards
Asger
 
Peter Flynn

Asger said:
Hi Peter

Peter said:
Asger Joergensen wrote:
[...]
These characters are not allowed in the text in XML files
& " ' < >
they are reserved for tags and they must be translated to
&amp; &quot; &apos; &lt; &gt;
No, only & and < are forbidden in text unless escaped. The characters
" ' > are just text and do not require escaping, although > acquires a special meaning in a start-tag or end-tag, and " and ' are bound by rules of matching and nesting when used in attributes.

You are of course right, BUT it is common / good practice to escape
all five.

Possibly. It depends what system you are writing for. If you are writing
normal text, you probably want to avoid " and ' as quotes completely,
and use real (curly) open-and-close quotes (single and double) and keep
the ' for an apostrophe. The > occurs very rarely in normal text. When
used in its mathematical sense, it will of course be inside some kind of

The W3Schools pages are not always reliable or accurate (these ones are OK).

///Peter
 
Split TAG content on O-slash

I'm dealing with a similar problem.
The O-slash is used in my XML tag,

like: <DESCRIPTION>Powers Ø 12,7mm EV</DESCRIPTION>

When I parse this with PHP, it results in 2 tags:
<DESCRIPTION>Powers</DESCRIPTION>
<DESCRIPTION>Ø 12,7mm EV</DESCRIPTION>

When I remove the O-slash, everything is fine.

How can I solve this?
I've tried Unicode and ISO-8859-1 as well,
and placed

xml_parser_set_option($xml_parser,XML_OPTION_SKIP_WHITE,1);
xml_parser_set_option($xml_parser,XML_OPTION_TARGET_ENCODING, "ISO-8859-1");

in my code,
but I still get the 2 tags.

Please help.
 
