New to XML

Discussion in 'XML' started by jodleren, Dec 4, 2008.

  1. jodleren

    jodleren Guest

    Hi

    I thought XML was simpler... My problem: I am storing some news
    stories in XML, say:

    <?xml version="1.0" ?>
    <article>
    <date>20081111</date>
    <author>name with non english characters</author>
    <header1>text with non english characters</header1>
    </article>

    The problem is non-English characters: how do I store e.g. &oslash;
    or &otilde; in there?

    WBR
    Sonnich
     
    jodleren, Dec 4, 2008
    #1

  2. jodleren wrote:

    > <?xml version="1.0" ?>
    > <article>
    > <date>20081111</date>
    > <author>name with non english characters</author>
    > <header1>text with non english characters</header1>
    > </article>
    >
    > the problem is non english characters - how do I store e.g. &oslash;
    > or &otilde; in there?


    XML uses and supports Unicode so simply use an editor that supports
    Unicode to edit and save your XML documents, that way you can use
    characters directly and don't need any character or entity references.
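
    A minimal sketch of that advice in Python, assuming the standard
    xml.etree.ElementTree parser. The article content and the helper name
    save_and_reload are made up for illustration; the point is only that
    the file is written as UTF-8 and the characters survive the round trip.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical article content; the names are illustrative only.
ARTICLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<article>\n"
    "<date>20081111</date>\n"
    "<author>Søren Ødegård</author>\n"
    "<header1>Nyheder på dansk</header1>\n"
    "</article>\n"
)


def save_and_reload(xml_text: str) -> ET.Element:
    """Write xml_text to a temporary file as UTF-8, then parse it back."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    try:
        # Encode explicitly so the bytes on disk match the declaration.
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
            handle.write(xml_text)
        return ET.parse(path).getroot()
    finally:
        os.remove(path)


root = save_and_reload(ARTICLE)
print(root.findtext("author"))
```

    What matters is that the encoding used when saving and the encoding
    the parser assumes (UTF-8 here) are the same.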


    --

    Martin Honnen
    http://JavaScript.FAQTs.com/
     
    Martin Honnen, Dec 4, 2008
    #2

  3. jodleren

    jodleren Guest

    On Dec 4, 5:11 pm, Martin Honnen wrote:
    > jodleren wrote:
    > > <?xml version="1.0" ?>
    > > <article>
    > >  <date>20081111</date>
    > >  <author>name with non english characters</author>
    > >  <header1>text with non english characters</header1>
    > > </article>

    >
    > > the problem is non english characters - how do I store e.g. &oslash;
    > > or &otilde; in there?

    >
    > XML uses and supports Unicode so simply use an editor that supports
    > Unicode to edit and save your XML documents, that way you can use
    > characters directly and don't need any character or entity references.


    Well, that does not work either. Both cases fail:


    <?xml version="1.0" standalone="yes"?>
    <document>
    <aphorism>Shit happens</aphorism>
    <author>unknown</author>
    <language>English</language>
    <more>Ø</more>
    </document>



    <?xml version="1.0" standalone="yes"?>
    <document>
    <aphorism>Shit happens</aphorism>
    <author>unknown</author>
    <language>English</language>
    <more>&Oslash;</more>
    </document>


    and they fail at the same line - &Oslash;, even &amp;Oslash; (someone
    suggested that), and a literal Ø all fail. How do I overcome this?

    WBR
    Sonnich
     
    jodleren, Dec 4, 2008
    #3
  4. jodleren wrote:

    >> XML uses and supports Unicode so simply use an editor that supports
    >> Unicode to edit and save your XML documents, that way you can use
    >> characters directly and don't need any character or entity references.

    >
    > Well, that does not work either. Both cases fail:
    >
    >
    > <?xml version="1.0" standalone="yes"?>
    > <document>
    > <aphorism>Shit happens</aphorism>
    > <author>unknown</author>
    > <language>English</language>
    > <more>Ø</more>
    > </document>


    Works fine for me: http://home.arcor.de/martin.honnen/xml/test2008120403.xml

    If you still think there are problems then you need to explain exactly
    what you have tried and why you think it failed. I am afraid "does not
    work" does not tell us what you have tried exactly and what kind of
    failure you think there is. You have managed to use the character "Ø"
    literally in your Usenet post, why should that pose a problem in an XML
    document?

    --

    Martin Honnen
    http://JavaScript.FAQTs.com/
     
    Martin Honnen, Dec 4, 2008
    #4
  5. Hi,

    jodleren wrote:
    > <more>Ø</more>


    If you write such a character directly, you have to declare the
    encoding that your editor used:
    <?xml version="1.0" encoding="[the-encoding-that-contains-theOslash]"?>
    (Note that if you don't specify the encoding, the default is UTF-8 or
    UTF-16, so in UTF-8 the Ø has to be stored as the 2 bytes C3 98,
    shown here in hex.)

    Otherwise, you can insert a character reference, whatever the encoding used:
    <more>&#216;</more>

    > <more>&Oslash;</more>


    This doesn't work because XML is not HTML. An HTML parser relies on a
    hardcoded library of entities that maps Oslash to U+00D8, but with XML
    you have to declare this mapping explicitly (with ENTITY in the DTD).
    I don't recommend that practice, though (trust me: don't do that).
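
    The difference between the two forms can be sketched in Python,
    assuming the standard xml.etree.ElementTree parser (other parsers
    should behave the same way, but that is an assumption). The variable
    names are only illustrative.

```python
import xml.etree.ElementTree as ET

# A numeric character reference works in any encoding, decimal or hex.
decimal_ref = ET.fromstring("<more>&#216;</more>")
hex_ref = ET.fromstring("<more>&#xD8;</more>")
print(decimal_ref.text == hex_ref.text == "\u00d8")

# The HTML name Oslash is not declared anywhere, so the parser rejects it.
try:
    ET.fromstring("<more>&Oslash;</more>")
    undefined_entity_rejected = False
except ET.ParseError as error:
    undefined_entity_rejected = True
    print("parse error:", error)

# In UTF-8 the literal character is the two bytes C3 98.
utf8_bytes = "\u00d8".encode("utf-8")
literal = ET.fromstring(b"<more>\xc3\x98</more>")
print(utf8_bytes.hex(), literal.text == "\u00d8")
```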

    XML contains 5 hard-coded entities: &amp; &quot; &apos; &lt; &gt;

    "&amp;Oslash;" means that you explicitly want the text sequence
    "&Oslash;" and not an entity reference.

    --
    Cordialement,

    ///
    (. .)
    --------ooO--(_)--Ooo--------
    | Philippe Poulard |
    -----------------------------
    http://reflex.gforge.inria.fr/
    Have the RefleX !
     
    Philippe Poulard, Dec 4, 2008
    #5
  7. jodleren

    jodleren Guest

    On Dec 4, 6:21 pm, Martin Honnen wrote:
    > jodleren wrote:
    > >> XML uses and supports Unicode so simply use an editor that supports
    > >> Unicode to edit and save your XML documents, that way you can use
    > >> characters directly and don't need any character or entity references.

    >
    > > Well, that does not work either. Both cases fail:

    >
    > > <?xml version="1.0" standalone="yes"?>
    > > <document>
    > >  <aphorism>Shit happens</aphorism>
    > >  <author>unknown</author>
    > >  <language>English</language>
    > >  <more>Ø</more>
    > > </document>

    >
    > Works fine for me:http://home.arcor.de/martin.honnen/xml/test2008120403.xml
    >
    > If you still think there are problems then you need to explain exactly
    > what you have tried and why you think it failed. I am afraid "does not
    > work" does not tell us what you have tried exactly and what kind of
    > failure you think there is. You have managed to use the character "Ø"
    > literally in your Usenet post, why should that pose a problem in an XML
    > document?


    The unicode part I realise now...

    <from ie>
    The error I get when _not_ unicode-saved...
    The XML page cannot be displayed
    Cannot view XML input using XSL style sheet. Please correct the error
    and then click the Refresh button, or try again later.
    --------------------------------------------------------------------------------
    An invalid character was found in text content. Error processing
    resource 'file:///Y:/html2/2770/articles/test.xml'. Line ...
    <more>
    </from ie>

    When I open the file in Notepad, I can save it as Unicode, and I have to
    do so. An ordinary text document does not work.
    This might cause problems ahead, therefore it would be easier for me to
    use &oslash; instead. Would that in any way be possible?

    WBR
    Sonnich
     
    jodleren, Dec 4, 2008
    #7
  8. jodleren wrote:

    > When I open the file in notepad, I can save it as unicode, I have to
    > do so. An ordanirary text document does not do it.
    > This might cause problems ahead, therefor it would be easier for me to
    > use &oslash; instead. Would that in any way be possible?


    I strongly suggest using Unicode encodings like UTF-8 or UTF-16; those
    are what XML parsers have to support.
    If you want to use other encodings then you simply need to declare them
    in the XML declaration, e.g.
    <?xml version="1.0" encoding="ISO-8859-1"?>
    is certainly possible.

    As for using an entity reference, you would need to declare the entities
    first in a document type definition. See
    http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent for how to do that. But
    be aware that non-validating parsers might not read any external
    resources so you would need to include the definition in the internal
    subset to ensure that any XML parser knows the entities.
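
    As a rough sketch of the internal-subset approach, here is a document
    that declares two of the Latin-1 entities itself, parsed with Python's
    xml.etree.ElementTree. The choice of parser and the sample text are
    assumptions for illustration; only the two entities that are actually
    used are declared.

```python
import xml.etree.ElementTree as ET

# The DOCTYPE's internal subset maps each entity name to a character
# reference, so no external DTD file has to be fetched.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE document [
  <!ENTITY Oslash "&#216;">
  <!ENTITY oslash "&#248;">
]>
<document>
<more>&Oslash;l og r&oslash;dgr&oslash;d</more>
</document>
"""

root = ET.fromstring(DOCUMENT)
print(root.findtext("more"))
```

    Every document that uses the names needs to carry these declarations,
    which is one reason plain Unicode is usually less work.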



    --

    Martin Honnen
    http://JavaScript.FAQTs.com/
     
    Martin Honnen, Dec 4, 2008
    #8
  9. Hi jodleren
    jodleren wrote:

    > Well, that does not work either. Both cases fail:
    >
    >
    > <?xml version="1.0" standalone="yes"?>


    > <more>Ø</more>


    > <more>&Oslash;</more>


    >
    > and they fail at the same line - both & and even &amp;slash; (someone
    > suggested that) and Ø fail.... how do I overcome this?


    I come from Denmark so I know about the Ø and what You need to do is:

    The header should look either like this:

    <?xml version="1.0" encoding="ISO-8859-1" ?>
    if You save in NON-Unicode

    or like this if You save in unicode:
    <?xml version="1.0" encoding="UTF-8" ?>

    These characters are not allowed in the text in XML files
    & " ' < >
    they are reserved for tags and they must be translated to
    &amp; &quot; &apos; &lt; &gt;

    If You use UTF-8 You can use all other characters

    If You use ISO-8859-1 You will have to stay within ISO-8859-1
    You can see what that is if You use the charmap.exe and choose
    Windows: Western under advanced.
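
    A small sketch of why the header has to match how the file was saved,
    assuming Python's xml.etree.ElementTree as the parser. The same
    document is built three times: ISO-8859-1 bytes with a matching
    declaration, UTF-8 bytes with a matching declaration, and ISO-8859-1
    bytes with no encoding declared at all.

```python
import xml.etree.ElementTree as ET

BODY = "<document><more>\u00d8</more></document>"

# Saved as ISO-8859-1 ("non-Unicode"): the Ø is the single byte D8.
latin1_declared = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>' + BODY.encode("iso-8859-1")
)
# Saved as UTF-8: the Ø is the two bytes C3 98.
utf8_declared = (
    b'<?xml version="1.0" encoding="UTF-8"?>' + BODY.encode("utf-8")
)
# ISO-8859-1 bytes without a declaration: the parser assumes UTF-8.
latin1_undeclared = b'<?xml version="1.0"?>' + BODY.encode("iso-8859-1")

for label, data in [("latin1 declared", latin1_declared),
                    ("utf-8 declared", utf8_declared)]:
    print(label, ET.fromstring(data).findtext("more") == "\u00d8")

try:
    ET.fromstring(latin1_undeclared)
    mismatch_rejected = False
except ET.ParseError as error:
    mismatch_rejected = True
    print("latin1 undeclared:", error)
```

    The third case is the same kind of mismatch as a file saved as plain
    text but read as UTF-8.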

    Kind regards
    Asger
     
    Asger Joergensen, Dec 5, 2008
    #9
  10. jodleren

    jodleren Guest

    On Dec 5, 1:52 pm, "Asger Joergensen" wrote:
    > Hi jodleren
    >
    > jodleren wrote:
    > > Well, that does not work either. Both cases fail:

    >
    > > <?xml version="1.0" standalone="yes"?>
    > >  <more>Ø</more>
    > >  <more>&Oslash;</more>

    >
    > > and they fail at the same line - both & and even &amp;slash; (someone
    > > suggested that) and Ø fail.... how do I overcome this?

    >
    > I come from Denmark so I know about the Ø and what You need to do is:
    >
    > The header should look either like this:
    >
    > <?xml version="1.0" encoding="ISO-8859-1" ?>
    > if You save in NON-Unicode
    >
    > or like this if You save in unicode:
    > <?xml version="1.0" encoding="UTF-8" ?>
    >
    > Thes characters are not alowed in the text in XML files
    >  & " ' < >
    > they are reserved for tags and they must be translated to
    > &amp; &quot; &apos; &lt; &gt;
    >
    > If You use UTF-8 You can use all other characters
    >
    > If You use ISO-8859-1 You will have to stay within ISO-8859-1
    > You can see what that is if You use the charmap.exe and chose
    > Windows:Wester under advanced.


    Hi there

    Thanks for the answer, it seems to work. I am still wondering about
    all the characters an article can contain, though, so maybe I will
    convert everything to UTF-8 after all. But I can do that later; for
    now I can get on with the project.

    Thanks for the help

    Best regards
    Sonnich
     
    jodleren, Dec 5, 2008
    #10
  11. Peter Flynn

    Peter Flynn Guest

    Asger Joergensen wrote:
    [...]
    > Thes characters are not alowed in the text in XML files
    > & " ' < >
    > they are reserved for tags and they must be translated to
    > &amp; &quot; &apos; &lt; &gt;


    No, only & and < are forbidden in text unless escaped. The characters
    " ' > are just text and do not require escaping, although > acquires a
    special meaning in a start-tag or end-tag, and " and ' are bound by
    rules of matching and nesting when used in attributes.
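
    A quick way to check this, sketched in Python with
    xml.etree.ElementTree and the escape helper from xml.sax.saxutils.
    The function name is_well_formed and the sample strings are made up
    for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Quotes and > are accepted in element text without escaping.
plain = ET.fromstring("<text>\"double\" 'single' 3 > 2</text>")
print(plain.text)

# A bare & or < in text is rejected; escaping them fixes it.
raw = "AT&T: 1 < 2"
escaped = escape(raw)
print(is_well_formed("<text>" + raw + "</text>"))
print(is_well_formed("<text>" + escaped + "</text>"))
```

    Note that escape() also rewrites > as &gt;, which is harmless.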

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
     
    Peter Flynn, Dec 13, 2008
    #11
  12. Hi Peter

    Peter Flynn wrote:

    > Asger Joergensen wrote:
    > [...]
    > > Thes characters are not alowed in the text in XML files
    > > & " ' < >
    > > they are reserved for tags and they must be translated to
    > > &amp; &quot; &apos; &lt; &gt;

    >
    > No, only & and < are forbidden in text unless escaped. The characters
    > " ' > are just text and do not require escaping, although > acquires a special meaning in a start-tag or end-tag, and " and ' are bound by rules of matching and nesting when used in attributes.


    You are of course right, BUT it is common / good practice to escape
    all five.

    http://www.w3schools.com/xml/xml_syntax.asp

    Kind regards
    Asger
     
    Asger Joergensen, Dec 13, 2008
    #12
  13. Peter Flynn

    Peter Flynn Guest

    Asger Joergensen wrote:
    > Hi Peter
    >
    > Peter Flynn wrote:
    >
    >> Asger Joergensen wrote:
    >> [...]
    >>> Thes characters are not alowed in the text in XML files
    >>> & " ' < >
    >>> they are reserved for tags and they must be translated to
    >>> &amp; &quot; &apos; &lt; &gt;

    >> No, only & and < are forbidden in text unless escaped. The characters
    >> " ' > are just text and do not require escaping, although > acquires a special meaning in a start-tag or end-tag, and " and ' are bound by rules of matching and nesting when used in attributes.

    >
    > You are of cource right, BUT it is commen / good practise to escape
    > all five.


    Possibly. It depends what system you are writing for. If you are writing
    normal text, you probably want to avoid " and ' as quotes completely,
    and use real (curly) open-and-close quotes (single and double) and keep
    the ' for an apostrophe. The > occurs very rarely in normal text. When
    used in its mathematical sense, it will of course be inside some kind of
    <math> element; either way it is a matter of personal preference whether
    you use it raw or in the form of a character reference.

    > http://www.w3schools.com/xml/xml_syntax.asp


    The W3Schools pages are not always reliable or accurate (these ones are OK).

    ///Peter
     
    Peter Flynn, Dec 14, 2008
    #13
  14. JanMoek

    JanMoek

    Split TAG content on O-slash

    I'm dealing with a similar problem.
    In my XML tag the O-slash is used,

    like: <DESCRIPTION>Powers Ø 12,7mm EV</DESCRIPTION>

    When I parse this with PHP it results in 2 tags:
    <DESCRIPTION>Powers</DESCRIPTION>
    <DESCRIPTION>Ø 12,7mm EV</DESCRIPTION>

    When I remove the O-slash everything is fine.

    How can I solve this?
    I've tried Unicode and ISO-8859-1 as well,
    and placed

    xml_parser_set_option($xml_parser,XML_OPTION_SKIP_WHITE,1);
    xml_parser_set_option($xml_parser,XML_OPTION_TARGET_ENCODING, "ISO-8859-1");

    in my code, but I still get the 2 tags.

    Please help.
     
    JanMoek, Jan 15, 2010
    #14
