persian languages charset, and what DOCTYPE?

Discussion in 'HTML' started by Simon, Apr 8, 2006.

  1. Simon

    Simon Guest

    Hi,

    I was asked to have a look at a page that apparently does not display
    Persian language.
    The obvious 2 problems is that the pages does not have doctype or Charest.

    But if I add a Charest and/or DOCTYPE, (any of them), in the page then the
    whole page changes, (the width changes).

    http://journalhome.com/razavi

    I have tried it with FF and IE and they both look ok without DOCTYPE and
    Charest.

    So what defaults are being used? Because whatever I add, it does not display
    properly.

    Also the user claims that we don't support "persian languages". But I can
    see everything fine, (I don't understand what it says, but it 'looks' ok).

    So, what Charest/DOCTYPE should I add without breaking the current display?
    And why would "persian language" not work for some users?

    Simon
    --
    http://urlkick.com/
    Free URL redirection service. Turns a long URL into a much shorter one.
     
    Simon, Apr 8, 2006
    #1

  2. On Sat, 8 Apr 2006, Simon wrote:

    > I was asked to have a look at a page that apparently does not display
    > Persian language.

    ....
    > http://journalhome.com/razavi


    Most of it looks plausible to me, although I don't read Persian
    (Farsi).

    > The obvious 2 problems is that the pages does not have doctype or
    > Charest.


    Absence of DOCTYPE is no reason for a browser to fail to display,
    although it'll mean the page is rendered in quirks mode by browsers
    which do that sort of thing.

    Well, a glance at the source indicates that it's been extruded by some
    MS Office tool, so I wouldn't expect much.

    > I have tried it with FF and IE and they both look ok without DOCTYPE
    > and Charest.


    You seem to be consistent in mis-typing that MIME attribute name :-}

    > So what defaults are being used?


    Most of the content appears to have been included as numeric character
    references (&#number;) instead of actual coded characters; so specifying
    any character encoding (charset=) which includes us-ascii would be
    sufficient to get that rendered correctly.

    > Because whatever I add it does not display properly.


    You're not giving very much of a clue as to what kind of "improperly"
    you/they are seeing.

    There aren't many actual coded characters in the document, which makes
    it hard to do diagnostics on that aspect. I can't find an encoding
    which is consistent with them all.

    My guess is that it's not all in the same encoding, and, as such, is
    hopelessly broken. My hunch is that it's in a mixture of Windows-1256
    and utf-8, but as I can't actually read Farsi, I could be wrong.
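As a rough illustration of checking whether one encoding is consistent with all of the coded characters, here is a minimal Python sketch. The function name `decodable_as` and the sample bytes are made up for this example and are not taken from the page in question. Note the limitation: most 8-bit encodings accept nearly any byte sequence, so a clean decode only shows that an encoding is possible, not that it is the right one.

```python
def decodable_as(data: bytes, candidates: list[str]) -> list[str]:
    """Return the candidate encodings under which data decodes without error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


# Hypothetical sample: the same Persian word stored once as UTF-8 and
# once as Windows-1256 (Python codec name "cp1256") in a single document.
word = "\u062a\u0633\u062a"
pure = word.encode("utf-8")
mixed = word.encode("utf-8") + b" " + word.encode("cp1256")

print(decodable_as(pure, ["utf-8", "cp1256"]))
print(decodable_as(mixed, ["utf-8", "cp1256"]))
```

For the mixed sample, UTF-8 is rejected because the Windows-1256 bytes are not valid UTF-8 sequences, which is the kind of inconsistency described above.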

    > Also the user claims that we don't support "persian languages".


    Pardon? Who's "we", and why should that be a limitation on alt.html?

    As for DOCTYPE, there isn't one that fits the kind of garbage that
    gets extruded by MS. Whichever of the W3C DOCTYPEs you use, you're
    going to get handfuls of validation errors against it. If their
    software doesn't supply one, I'd recommend leaving it that way - well,
    what I would *really* recommend is changing to some software that's
    capable of generating valid HTML, but presumably that isn't an option
    for you.
     
    Alan J. Flavell, Apr 8, 2006
    #2

  3. Simon

    Simon Guest

    >
    >> The obvious 2 problems is that the pages does not have doctype or
    >> Charest.

    >
    > Absence of DOCTYPE is no reason for a browser to fail to display,
    > although it'll mean the page is rendered in quirks mode by browsers
    > which do that sort of thing.
    >
    > Well, a glance at the source indicates that it's been extruded by some
    > MS Office tool, so I wouldn't expect much.


    I guess so.

    >
    >> I have tried it with FF and IE and they both look ok without DOCTYPE
    >> and Charest.

    >
    > You seem to be consistent in mis-typing that MIME attribute name :-}


    Sorry, I didn't check my spell checker.

    >
    >> So what defaults are being used?

    >
    > Most of the content appears to have been included as numeric character
    > references (&#number;) instead of actual coded characters; so specifying
    > any character encoding (charset=) which includes us-ascii would be
    > sufficient to get that rendered correctly.
    >
    >> Because whatever I add it does not display properly.

    >
    > You're not giving very much of a clue as to what kind of "improperly"
    > you/they are seeing.


    I am not certain what else to say really, if I add any doctype the width of
    the document changes, (with horizontal scrollbar).
    If I add any charset the same happens.
    If I have neither charset or Doctype the display is as you see it.

    >
    > There aren't many actual coded characters in the document, which makes
    > it hard to do diagnostics on that aspect. I can't find an encoding
    > which is consistent with them all.
    >
    > My guess is that it's not all in the same encoding, and, as such, is
    > hopelessly broken. My hunch is that it's in a mixture of Windows-1256
    > and utf-8, but as I can't actually read Farsi, I could be wrong.


    So, at best i could use "Windows-1256" and that might work. I would have to
    ask the user to try as it is their template.

    >
    >> Also the user claims that we don't support "persian languages".

    >
    > Pardon? Who's "we", and why should that be a limitation on alt.html?


    We, http://www.journalhome.com as the host, nothing to do with alt.html. I
    am only asking for help here.
    I am just suprised that it displays the code on some machine, (by the looks
    of it yours and mine), and it does not work on other machines.
    I am guessing that the user browser understands the &#; but the machine does
    not have the fonts to actually display them.

    >
    > As for DOCTYPE, there isn't one that fits the kind of garbage that
    > gets extruded by MS. Whichever of the W3C DOCTYPEs you use, you're
    > going to get handfuls of validation errors against it. If their
    > software doesn't supply one, I'd recommend leaving it that way - well,
    > what I would *really* recommend is changing to some software that's
    > capable of generating valid HTML, but presumably that isn't an option
    > for you.


    A bit strange that both browsers seem to display ok without a DOCTYPE, what
    do they use?

    Thanks

    Simon
    --
    http://urlkick.com/
    Free URL redirection service. Turns a long URL into a much shorter one.
     
    Simon, Apr 8, 2006
    #3
  4. Simon wrote:

    > I am not certain what else to say really, if I add any doctype the width of
    > the document changes, (with horizontal scrollbar).


    Really _any_ doctype? Anyway, adding a doctype that throws some browsers
    into non-quirks (or "standard") mode may surely change something in the
    layout.

    > If I add any charset the same happens.


    That sounds rather odd. _Any_ charset? Anyway, there's the _content_
    problem that some of the content is apparently distorted, since it's
    data in some strange and unspecified encoding. This should have higher
    priority in the repair list.

    > So, at best i could use "Windows-1256" and that might work. I would have to
    > ask the user to try as it is their template.


    What exactly are you working with? Trying to fix the page, or to help
    someone view it despite its being broken? In the latter case, you need
    to know the language used on the page and try different encodings and
    see if one of them looks right. In the former case, the information
    producer should be requested to specify the encoding or to convert the
    data to format.

    > We, http://www.journalhome.com as the host,


    If you are the host, then it is your responsibility to inform authors
    about the way(s) to make your server send the correct Content-Type
    information, with a charset parameter as specified by the author. As the
    second best approach, send no charset information (as now) and allow
    authors to use .htaccess or similar technique.
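To make the charset parameter concrete, here is a small Python sketch that reads it out of a Content-Type header value, using the standard library's `email.message.Message`. The helper name `charset_from_content_type` and the header values are assumptions for illustration; they are not what journalhome.com actually sends.

```python
from email.message import Message


def charset_from_content_type(header_value: str):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and returns None when
    # the header carries no charset parameter.
    return msg.get_content_charset()


# Hypothetical header values
print(charset_from_content_type("text/html; charset=Windows-1256"))
print(charset_from_content_type("text/html"))
```

The second call corresponds to the "send no charset information" case, where the browser is left to guess or to rely on what the document itself declares.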

    It is _not_ your responsibility as a service provider to find out the
    encoding of a document or even to help authors to decide on the encoding
    they'll use - assuming, of course, that you have not promised such a
    service. It might be a good idea to offer some general guidance, as
    courtesy, but surely you need to know about such matters well before being
    able to help others.

    > I am just suprised that it displays the code on some machine, (by the looks
    > of it yours and mine), and it does not work on other machines.


    Which "it" displays which "code" in which sense?

    > I am guessing that the user browser understands the &#; but the machine does
    > not have the fonts to actually display them.


    That's quite possible, but how does that relate to the other problems
    you have mentioned? It's a user-side problem, and authors may wish to
    consider them at a general level when making their own decisions.

    > A bit strange that both browsers seem to display ok without a DOCTYPE, what
    > do they use?


    Browsers don't use DOCTYPEs for anything but misguided guesses on
    whether they should display the page in an intentionally broken manner
    (i.e., DOCTYPE sniffing).

    As a service provider, you don't need to worry about DOCTYPEs (except of
    course on your own pages). They are to be provided by authors. You just
    need to take care so that your server software does not add any extra
    stuff at the start of the document, as some "free" providers do, thereby
    messing up DOCTYPE detection. It seems that this is not a problem in
    your case.
     
    Jukka K. Korpela, Apr 8, 2006
    #4
  5. Simon wrote:
    > Hi,
    >
    > I was asked to have a look at a page that apparently does not display
    > Persian language.
    > The obvious 2 problems is that the pages does not have doctype or Charest.


    DOCTYPE has nothing to do with character representation. If the document
    is served with a correct HTTP content type header, then a content-type
    META tag is irrelevant.

    Your page looks mostly fine in my Firefox, which thinks that your page
    is encoded as Windows 1252, which lacks Arabic/Persian support
    altogether. But it doesn't matter what encoding is claimed, as long as
    ASCII is a subset of it, because the characters are encoded as numeric
    character references. The only flaws are a number of question marks that
    were obviously meant to be something else, and the appearance in two
    places of "ØªØ³Øª2", once after the date at the top, and once as the
    first item in list of Recent Posts. The first one appears in the page
    source as "ØªØ³Øª2" and the second appears as
    "&#216;&#170;&#216;&#179;&#216;&#170;2", the character entity
    representation of the same thing.
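A minimal Python sketch of both points, assuming the intended text is the four characters U+062A U+0633 U+062A followed by "2": numeric character references are plain ASCII, so they survive whatever ASCII-compatible charset is claimed, whereas raw UTF-8 bytes read as Windows-1252 turn into seven unrelated Latin characters.

```python
import html

# Numeric character references are pure ASCII, so the claimed charset
# does not change what they decode to.
ncr = "&#1578;&#1587;&#1578;2"
text = html.unescape(ncr)
for charset in ("ascii", "cp1252", "cp1256", "utf-8"):
    assert html.unescape(ncr.encode("ascii").decode(charset)) == text

# The same text stored as raw UTF-8 bytes but read as Windows-1252
# comes out as mojibake.
mojibake = text.encode("utf-8").decode("cp1252")
print(text, len(text))
print(mojibake, len(mojibake))
```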
     
    Harlan Messinger, Apr 8, 2006
    #5
  6. On Sat, 8 Apr 2006, Harlan Messinger wrote:

    > else, and the appearance in two places of "ØªØ³Øª2", once after the
    > date at the top, and once as the first item in list of Recent Posts.
    > The first one appears in the page source as "ØªØ³Øª2"


    Yes, I'd spotted that, and noted that if interpreted as utf-8 it turns
    out as Arabic-script characters, which made it seem as if that part
    had been inserted into it incorrectly.

    > and the second appears as
    > "&#216;&#170;&#216;&#179;&#216;&#170;2", the character entity
    > representation of the same thing.


    Blimey, so it does! I hadn't spotted that at first look. So it's
    worse than just broken!!

    Furthermore, I now see loads of hrefs like these:

    http://journalhome.com/razavi/21877/تست2.html

    *Shudder*

    For what it's worth - coming back to the ØªØ³Øª2 which we saw, if I
    convert[1] that from utf-8 to us-ascii encoding then the result reads:

    &#1578;&#1587;&#1578;2

    which can be decoded e.g with my trusty decoding ring (;-) at
    http://ppewww.ph.gla.ac.uk/~flavell/unicode/unidata06.html
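The manual conversion can be approximated in a few lines of Python. This is only a sketch, assuming the seven characters were UTF-8 bytes misread as Latin-1; the variable names are illustrative.

```python
# The seven Latin-1 characters as they appeared on the page
seen = "\u00d8\u00aa\u00d8\u00b3\u00d8\u00aa2"

# Undo the misreading: back to the raw bytes, then decode them as UTF-8
recovered = seen.encode("latin-1").decode("utf-8")

# Re-encode as us-ascii, writing anything outside ASCII as a numeric
# character reference
as_ascii = recovered.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(as_ascii)
```

The `xmlcharrefreplace` error handler plays the role of saving in an encoding that does not cover Arabic: every character the target cannot represent is written out as `&#NNNN;`.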


    At this kind of third-hand remove from the original complainant, and
    with me only understanding the theory of the character representation,
    without being able to read Farsi - nor have the slightest inclination
    to tangle with the mess that comes out of MS's attempts to extrude
    something resembling HTML, I'm afraid I can't go much further than to
    say that these pages seem to be dreadfully broken; it's a wonder that
    anything comes out as intended.

    good luck (you-all will need it!)

    [1] by "convert" I mean, in Seamonkey (nee Mozilla), manually set
    View> Encoding to utf-8, then File> Edit Page, then in Composer,
    "Save and change character encoding". Unfortunately it doesn't
    offer us-ascii as an option, but any 8-bit encoding which doesn't
    cover Arabic would suffice for this purpose - e.g Armenian, Thai,
    whatever you like. (Perhaps we should ask the Mozilla folks to
    support saving in us-ascii explicitly?).
     
    Alan J. Flavell, Apr 8, 2006
    #6
  7. Neredbojias

    Neredbojias Guest

    To further the education of mankind, "Simon"
    declaimed:

    >> You're not giving very much of a clue as to what kind of "improperly"
    >> you/they are seeing.

    >
    > I am not certain what else to say really, if I add any doctype the
    > width of the document changes, (with horizontal scrollbar).
    > If I add any charset the same happens.
    > If I have neither charset or Doctype the display is as you see it.


    One issue is that the lack of a doctype puts browsers in "quirks mode".
    The markup of the page is definitely not "strict" markup and there are
    errors as well. Your main problem is archaic and invalid html.

    --
    Neredbojias
    Infinity can have limits.
     
    Neredbojias, Apr 8, 2006
    #7
  8. On Sat, 8 Apr 2006, Alan J. Flavell wrote:

    > On Sat, 8 Apr 2006, Harlan Messinger wrote:
    >
    > > else, and the appearance in two places of "ØªØ³Øª2", once after
    > > the date at the top, and once as the first item in list of Recent
    > > Posts. The first one appears in the page source as "ØªØ³Øª2"

    >
    > Yes, I'd spotted that,


    I've just noticed that Goo-groups has completely garbled this part of
    the thread, as displayed in its normal view, by re-interpreting the
    above string of seven Latin-1 characters as posted and seen by us,
    into a string of four utf-8 characters - as we would assume had been
    intended (but not actually achieved) by the original web page. Thus
    completely obscuring the problem which we were discussing!!! Grrrrrr.

    Curiously, if the usenet postings are viewed in goo-groups by using
    their "Show original" option, instead of their default thread display,
    then they come out in the way that you (and I) posted, i.e exhibiting
    the problem that we were discussing. Their unsolicited
    re-interpretation of the character encoding in their thread view,
    *IGNORING* the explicit specification of charset=iso-8859-1 which
    appears in both of our posting headers, gives us yet another reason to
    advocate that anyone seriously considering goo-groups as a usenet
    interface would be better advised to Get a Real News Reader.

    Here are the characters again, but this time interleaved with spaces.
    Let's see how goo-groups will garble this: "Ø ª Ø ³ Ø ª 2".

    sigh.
     
    Alan J. Flavell, Apr 9, 2006
    #8
