How many illegal characters for JDOM?

Discussion in 'Java' started by Carfield Yim, Oct 28, 2009.

  1. Carfield Yim

    Carfield Yim Guest

    First I got the exception message " is not legal for a JDOM character
    content: 0x0 is not a legal XML character.", so I stripped all "\0"
    characters. Then I got " is not legal for a JDOM character content:
    0x1 is not a legal XML character." and " is not legal for a JDOM
    character content: 0x2 is not a legal XML character.".

    So... how many characters are illegal for JDOM? Is there an easy way to handle them all?
    Carfield Yim, Oct 28, 2009
    #1

  2. Mayeul

    Mayeul Guest

    Carfield Yim wrote:
    > First I see exception message " is not legal for a JDOM character
    > content: 0x0 is not a legal XML character.", ok, then I trim all "\0"
    > character. Then, I get " is not legal for a JDOM character content:
    > 0x1 is not a legal XML character." and " is not legal for a JDOM
    > character content: 0x2 is not a legal XML character.".
    >
    > So.... how many illegal character for JDOM? Any easy way to parse all?


    I am actually not sure, as I couldn't find any JDOM reference about it,
    but I think it is safe to assume from the error messages that any
    illegal XML character is also an illegal JDOM character.

    U+0000, U+0001 and U+0002 certainly are illegal XML characters, and it
    seems a good idea for JDOM to reject them.

    According to the XML specification
    (the W3C server is overloaded again; search for the XML specification on
    Google, then view the cached page):
    http://209.85.229.132/search?q=cache:fdujgnyF_v4J:www.w3.org/TR/REC-xml/


    The valid XML characters match this construction:

    Character Range

    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
    [#x10000-#x10FFFF]
    /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */


    It's up to you to count whatever isn't in this construction.
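
    For what it's worth, the count can be worked out mechanically. The sketch
    below is only an illustration: the class and method names are made up, and
    isXmlChar simply restates the Char production quoted above rather than
    calling anything from JDOM.

```java
class XmlCharCount {
    /** True if cp matches the Char production of the XML 1.0 specification. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Counts the Unicode code points that the Char production excludes. */
    static int countIllegal() {
        int illegal = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (!isXmlChar(cp)) {
                illegal++;
            }
        }
        return illegal;
    }

    public static void main(String[] args) {
        // 29 control characters + 2048 surrogates + U+FFFE and U+FFFF
        System.out.println(countIllegal()); // prints 2079
    }
}
```

    So 2079 of the 1,114,112 Unicode code points are excluded, which is why
    removing them one value at a time gets tedious quickly.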

    > Any easy way to parse all?


    Not sure. Excluding surrogate blocks while keeping non-BMP characters
    should be tricky with a regexp.
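
    In Java specifically it may be less painful than it sounds, because
    java.util.regex can match whole code points. The following is a hedged
    sketch, not something taken from JDOM: the class name and the strip helper
    are invented, and the pattern spells the non-BMP range with literal
    surrogate pairs (U+10000 to U+10FFFF) so that valid pairs are treated as
    single characters.

```java
import java.util.regex.Pattern;

class XmlRegexFilter {
    // Negated class: anything outside the XML 1.0 Char production.
    // "\uD800\uDC00" is U+10000 and "\uDBFF\uDFFF" is U+10FFFF.
    private static final Pattern ILLEGAL = Pattern.compile(
            "[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF]");

    /** Removes every character that XML 1.0 does not allow. */
    static String strip(String text) {
        return ILLEGAL.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // NUL and the unpaired surrogate go; the non-BMP smiley stays.
        String dirty = "a\u0000b\uD83D\uDE00c\uD800d";
        System.out.println(strip(dirty).equals("ab\uD83D\uDE00cd"));
    }
}
```

    A plain loop over code points does the same job and is probably easier to
    get right.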

    To be honest, I'm kinda wondering what you are trying to build a DOM
    from. It's not every day that I have to filter out illegal characters and
    am not allowed to just discard the input as invalid.

    --
    Mayeul
    Mayeul, Oct 28, 2009
    #2

  3. Carfield Yim

    Carfield Yim Guest

    ..
    >
    > To be honest, I'm kinda wondering what you are trying to build a DOM
    > from. It's not everyday that I have to filter out illegal characters and
    > am disallowed to just discard the input as invalid.


    I cannot control my source, so yes, I would like to discard exactly those
    characters from the input...
    Carfield Yim, Oct 28, 2009
    #3
  4. Mayeul

    Mayeul Guest

    Carfield Yim wrote:
    > .
    >> To be honest, I'm kinda wondering what you are trying to build a DOM
    >> from. It's not everyday that I have to filter out illegal characters and
    >> am disallowed to just discard the input as invalid.

    >
    > I cannot control my source so exactly I would like to discard those
    > characters from the input source...


    I wish you luck, then.

    Not sure it helps, but Verifier.isXMLCharacter(int) from JDOM checks
    whether a character is a valid XML character (this same method is called
    to raise the error you got).

    Note it takes an int, not a char, as parameter. This is because it
    handles non-BMP characters. You might want to do that too.
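
    To illustrate the int-versus-char point, here is a minimal sketch of
    filtering by code point. The names CodePointFilter and keepLegal are
    hypothetical, and isXmlChar is a local stand-in that restates the XML
    Char production so the example runs without JDOM; with JDOM on the
    classpath one would presumably pass Verifier::isXMLCharacter instead.

```java
import java.util.function.IntPredicate;

class CodePointFilter {
    /** Keeps only the code points that isLegal accepts. */
    static String keepLegal(String text, IntPredicate isLegal) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);   // joins a valid surrogate pair into one int
            i += Character.charCount(cp);   // 1 char for BMP, 2 for non-BMP
            if (isLegal.test(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    /** Stand-in for JDOM's check: the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        String dirty = "a\u0000b\uD83D\uDE00c\uD800d";
        String clean = keepLegal(dirty, CodePointFilter::isXmlChar);
        System.out.println(clean.equals("ab\uD83D\uDE00cd"));
    }
}
```

    Testing each char separately would reject both halves of a surrogate
    pair, so legal non-BMP characters would be lost.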

    --
    Mayeul
    Mayeul, Oct 28, 2009
    #4
  5. Carfield Yim

    Carfield Yim Guest


    > I wish you lucks, then.
    >
    > Not sure it helps, but Verifier.isXMLCharacter(int) from JDOM will check
    > a character is a valid XML character (this same method is called to
    > raise the error you got.)
    >
    > Note it takes an int, not a char, as parameter. This is because it
    > handles non-BMP characters. You might want to do that too.
    >
    > --
    > Mayeul


    Fixed. I can actually reuse the JDOM API to check whether a character is
    valid for an XML document or JDOM text. Here is the code sample:


    // Requires: import org.jdom.Verifier;
    final String tempText;
    final StringBuilder content = new StringBuilder();
    if (item instanceof FileItem) {
        tempText = HeadItem.extendedDesc((FileItem) item);
    } else {
        tempText = item.getDesc();
    }

    // Adapted from the JDOM library: walk the text one code point at a time
    // and keep only the characters that are legal in XML.
    for (int i = 0, len = tempText.length(); i < len; ) {
        // codePointAt() combines a valid surrogate pair into one code point;
        // an unpaired surrogate comes back as itself and is rejected below.
        final int cp = tempText.codePointAt(i);
        i += Character.charCount(cp);
        if (Verifier.isXMLCharacter(cp)) {
            content.appendCodePoint(cp);
        }
    }
    Carfield Yim, Dec 1, 2009
    #5
