New to XML

Discussion in 'XML' started by jodleren, Dec 4, 2008.

  1. jodleren

    jodleren Guest


    I thought XML would be simpler... My problem: I am storing some news
    stories in XML, say:

    <?xml version="1.0" ?>
    <author>name with non english characters</author>
    <header1>text with non english characters</header1>

    The problem is non-English characters: how do I store e.g. &oslash;
    or &otilde; in there?

    jodleren, Dec 4, 2008

  2. XML uses and supports Unicode so simply use an editor that supports
    Unicode to edit and save your XML documents, that way you can use
    characters directly and don't need any character or entity references.
    Martin Honnen, Dec 4, 2008
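To make the advice above concrete, here is a minimal sketch in Python (not a language used in the thread) that saves a document as UTF-8 with the characters typed directly and reads it back. The root element name `story`, the file name and the sample text are assumptions for illustration. The snippet in the question has two top-level elements, while XML allows exactly one root, so a wrapper element is added here.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical document modelled on the <author>/<header1> example above.
# <story> is an assumed wrapper: XML requires a single root element.
xml_text = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<story>\n"
    "  <author>Jørgen Østergård</author>\n"
    "  <header1>Blåbærgrød på menuen</header1>\n"
    "</story>\n"
)

# Save the file as UTF-8, which is what "an editor that supports Unicode" does.
path = os.path.join(tempfile.mkdtemp(), "story.xml")
with open(path, "w", encoding="utf-8") as f:
    f.write(xml_text)

# Parse it back; the non-English characters arrive intact, no references needed.
root = ET.parse(path).getroot()
author = root.findtext("author")
header = root.findtext("header1")
print(author)
print(header)
```

The important part is that the bytes on disk and the encoding the parser assumes (declared, or UTF-8 by default) agree.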

  3. jodleren

    jodleren Guest

    Well, that does not work either. Both cases fail:

    <?xml version="1.0" standalone="yes"?>
    <aphorism>Shit happens</aphorism>

    <?xml version="1.0" standalone="yes"?>
    <aphorism>Shit happens</aphorism>

    and they fail at the same line - both & and even &amp;slash; (someone
    suggested that) and Ø fail.... how do I overcome this?

    jodleren, Dec 4, 2008
  4. Works fine for me:

    If you still think there are problems then you need to explain exactly
    what you have tried and why you think it failed. I am afraid "does not
    work" does not tell us what you have tried exactly and what kind of
    failure you think there is. You have managed to use the character "Ø"
    literally in your Usenet post, so why should that pose a problem in an XML
    document?
    Martin Honnen, Dec 4, 2008
  5. Hi,

    jodleren wrote:
    If you write such a character directly, you have to declare the charset
    that your editor used to save the file:
    <?xml version="1.0" encoding="[the-encoding-that-contains-theOslash]"?>
    (Note that if you don't specify the encoding, the default is UTF-8 or
    UTF-16, so in a UTF-8 file you can also write the Ø as the 2 bytes C3
    98, shown here in hex.)

    Otherwise, you can insert a character reference, whatever the encoding used.
    A named reference such as &Oslash; doesn't work because XML is not HTML: an
    HTML parser relies on a hardcoded table of entities that maps Oslash to
    U+00D8, but with XML you have to declare this mapping explicitly (with
    ENTITY in the DTD). I don't recommend that practice, though (trust me:
    don't do that).

    XML contains 5 predefined entities: &amp; &quot; &apos; &lt; &gt;

    "&amp;Oslash;" means that you explicitly want the text sequence
    "&Oslash;" and not an entity reference.


    (. .)
    | Philippe Poulard |
    Have the RefleX !
    Philippe Poulard, Dec 4, 2008
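The distinctions in the post above can be checked with a short sketch. This uses Python's standard `xml.etree.ElementTree`, which is an assumption of this example and not something the thread used. The numeric references `&#xD8;` and `&#216;` are the standard hexadecimal and decimal forms for U+00D8.

```python
import xml.etree.ElementTree as ET

# A numeric character reference works regardless of the file's encoding.
by_hex = ET.fromstring("<aphorism>&#xD8;</aphorism>").text
by_dec = ET.fromstring("<aphorism>&#216;</aphorism>").text

# The five predefined entities need no declaration.
predefined = ET.fromstring("<t>&amp; &quot; &apos; &lt; &gt;</t>").text

# An HTML-style named entity is undefined in plain XML, so the parser rejects it.
try:
    ET.fromstring("<aphorism>&Oslash;</aphorism>")
    named_entity_ok = True
except ET.ParseError:
    named_entity_ok = False

# Escaping the ampersand yields the literal text "&Oslash;", not the character.
literal = ET.fromstring("<aphorism>&amp;Oslash;</aphorism>").text

print(by_hex, by_dec, predefined, named_entity_ok, literal)
```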
  6. Philippe Poulard, Dec 4, 2008
  7. jodleren

    jodleren Guest

    The unicode part I realise now...

    <from ie>
    The error I get when _not_ unicode-saved...
    The XML page cannot be displayed
    Cannot view XML input using XSL style sheet. Please correct the error
    and then click the Refresh button, or try again later.
    An invalid character was found in text content. Error processing
    resource 'file:///Y:/html2/2770/articles/test.xml'. Line ...
    </from ie>

    When I open the file in Notepad, I can save it as Unicode, and I have to
    do so. An ordinary text document does not work.
    This might cause problems ahead, so it would be easier for me to
    use &oslash; instead. Would that be possible in any way?

    jodleren, Dec 4, 2008
  8. I strongly suggest using Unicode encodings like UTF-8 or UTF-16, as those
    are what XML parsers have to support.
    If you want to use another encoding, you simply need to declare it
    in the XML declaration, e.g.
    <?xml version="1.0" encoding="ISO-8859-1"?>
    is certainly possible.

    As for using an entity reference, you would need to declare the entities
    first in a document type definition. See for how to do that. But
    be aware that non-validating parsers might not read any external
    resources so you would need to include the definition in the internal
    subset to ensure that any XML parser knows the entities.
    Martin Honnen, Dec 4, 2008
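Both options from the post above can be sketched as follows, again in Python with the standard `xml.etree.ElementTree` (an assumption of this example). The entity names `oslash` and `Oslash` and the sample text are illustrative. The middle case mimics the situation that typically produces an "invalid character" error: the file is saved in a single-byte encoding but no encoding is declared.

```python
import xml.etree.ElementTree as ET

# Bytes as an editor would save them in ISO-8859-1 (Ø is the single byte 0xD8),
# with a matching encoding declaration.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<aphorism>Ø</aphorism>"
).encode("iso-8859-1")
declared = ET.fromstring(latin1_doc).text

# The same byte without a declaration: the parser assumes UTF-8 and rejects 0xD8.
try:
    ET.fromstring("<aphorism>Ø</aphorism>".encode("iso-8859-1"))
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

# Entities declared in the internal subset, so no external DTD has to be read.
with_entity = (
    "<!DOCTYPE aphorism [\n"
    '  <!ENTITY oslash "&#xF8;">\n'
    '  <!ENTITY Oslash "&#xD8;">\n'
    "]>\n"
    "<aphorism>&Oslash;l og r&oslash;dgr&oslash;d</aphorism>"
)
expanded = ET.fromstring(with_entity).text

print(declared, undeclared_ok, expanded)
```

Whether other parsers expand internal-subset entities the same way should be checked for the tool in question; the behaviour shown is that of Python's expat-based parser.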
  9. Hi jodleren
    I come from Denmark, so I know about the Ø. What you need to do is this.

    The header should look either like this:

    <?xml version="1.0" encoding="ISO-8859-1" ?>
    if you save in a non-Unicode encoding,

    or like this if you save in Unicode:
    <?xml version="1.0" encoding="UTF-8" ?>

    These characters are not allowed in the text in XML files:
    & " ' < >
    They are reserved for markup and must be translated to
    &amp; &quot; &apos; &lt; &gt;

    If you use UTF-8, you can use all other characters.

    If you use ISO-8859-1, you will have to stay within ISO-8859-1.
    You can see what that covers if you run charmap.exe and choose
    Windows: Western under Advanced.

    Kind regards
    Asger Joergensen, Dec 5, 2008
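What "staying within ISO-8859-1" means in practice can be sketched like this. The helper name `fits_iso_8859_1` and the sample text are hypothetical. Note that the Windows "Western" code page (Windows-1252) is close to ISO-8859-1 but not identical: the euro sign, for instance, exists in Windows-1252 and not in ISO-8859-1.

```python
import xml.etree.ElementTree as ET


def fits_iso_8859_1(text: str) -> bool:
    """Return True if every character in text exists in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


print(fits_iso_8859_1("Ø ø å æ õ"))  # Danish letters and õ are in ISO-8859-1
print(fits_iso_8859_1("5 €"))        # the euro sign is not

# When Python's ElementTree serialises to ISO-8859-1, characters outside the
# encoding are written as numeric character references instead of failing.
elem = ET.Element("header1")
elem.text = "Prisen på øl: 5 €"
latin1_bytes = ET.tostring(elem, encoding="iso-8859-1")
utf8_bytes = ET.tostring(elem, encoding="utf-8")
print(latin1_bytes)
print(utf8_bytes)
```

With UTF-8 no such fallback is needed, which is one practical reason for the recommendation to prefer it.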
  10. jodleren

    jodleren Guest


    Thanks for the reply, it seems to work. I am still wondering about
    all the characters an article can contain, so maybe I will
    convert everything to UTF-8 after all. But I can do that later;
    for now I can move on with the project.

    Thanks for the help

    jodleren, Dec 5, 2008
  11. jodleren

    Peter Flynn Guest

    Asger Joergensen wrote:
    No, only & and < are forbidden in text unless escaped. The characters
    " ' > are just text and do not require escaping, although > acquires a
    special meaning in a start-tag or end-tag, and " and ' are bound by
    rules of matching and nesting when used in attributes.

    Peter Flynn, Dec 13, 2008
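The point about which characters actually need escaping in text content can be verified with a small sketch, using Python's standard library (an assumption of this example). The helper name `parses` is hypothetical. One exception worth knowing: the sequence `]]>` may not appear literally in text, which is why some tools escape > as well.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def parses(xml_text: str) -> bool:
    """Return True if xml_text is accepted by the parser as well-formed."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Quotes and > are ordinary text inside element content.
quotes_and_gt = parses("""<t>"quoted" 'single' a > b</t>""")

# A bare & or < in element content is not well-formed.
bare_amp = parses("<t>Tom & Jerry</t>")
bare_lt = parses("<t>a < b</t>")

# saxutils.escape handles &, < and > by default; quotes are left alone.
escaped = escape("a < b & c > d")

print(quotes_and_gt, bare_amp, bare_lt, escaped)
```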
  12. Hi Peter

    You are of course right, BUT it is common / good practice to escape
    all five.

    Kind regards
    Asger Joergensen, Dec 13, 2008
  13. jodleren

    Peter Flynn Guest

    Possibly. It depends what system you are writing for. If you are writing
    normal text, you probably want to avoid " and ' as quotes completely,
    and use real (curly) open-and-close quotes (single and double) and keep
    the ' for an apostrophe. The > occurs very rarely in normal text. When
    used in its mathematical sense, it will of course be inside some kind of
    The W3Schools pages are not always reliable or accurate (these ones are OK).

    Peter Flynn, Dec 14, 2008
  14. jodleren


    Jan 15, 2010
    Split TAG content on O-slash

    I'm dealing with a similar problem.
    In my XML tag the O-slash is used,

    like: <DESCRIPTION>Powers Ø 12,7mm EV</DESCRIPTION>

    When I parse this with PHP, it results in 2 tags.

    When I remove the O-slash, everything is fine.

    How can I solve this?
    I've tried Unicode and ISO-8859-1 as well,
    and placed

    xml_parser_set_option($xml_parser,XML_OPTION_TARGET_ENCODING, "ISO-8859-1");

    in my code....
    but still get the 2 tags

    please help
    JanMoek, Jan 15, 2010
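One possible explanation, which the thread does not confirm: event-based parsers in the expat family are allowed to deliver the text of a single element in several callbacks, so a handler that treats every callback as a complete value would see the description in pieces. In PHP that would be the handler registered with `xml_set_character_data_handler`. The sketch below shows the usual remedy, collecting the pieces and joining them when the element ends, using Python's `xml.parsers.expat` as a stand-in; the handler and variable names are hypothetical.

```python
import xml.parsers.expat

xml_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<DESCRIPTION>Powers Ø 12,7mm EV</DESCRIPTION>"
).encode("utf-8")

chunks = []        # every piece of text the parser hands over
current = []       # pieces belonging to the element currently being read
descriptions = []  # completed element texts


def start(name, attrs):
    # A new element begins: forget text collected for the previous one.
    current.clear()


def chars(data):
    # May be called more than once for the text of a single element.
    chunks.append(data)
    current.append(data)


def end(name):
    # Only at the end tag is the element's text known to be complete.
    if name == "DESCRIPTION":
        descriptions.append("".join(current))


parser = xml.parsers.expat.ParserCreate()
parser.buffer_text = False  # do not merge pieces; deliver them as they come
parser.StartElementHandler = start
parser.CharacterDataHandler = chars
parser.EndElementHandler = end
parser.Parse(xml_bytes, True)

print(len(chunks), descriptions)
```

How many pieces arrive depends on the parser and its version, so the code should not rely on any particular count, only on the joined result.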
