Multiple Language Website

Discussion in 'HTML' started by GS, Jun 15, 2005.

  1. GS

    GS Guest

    Hi there. I hope this is the right place for what should be a simple
    question.

    I have a website that is in English and is now also in Arabic. I am
    creating the Arabic-language content now, and am having a few problems
    getting the content to display properly.

    When I edit the files with the Arabic characters on my Windows box, in,
    say, Notepad, the Arabic gets stripped unless I save it as a Unicode
    document (ANSI strips the Arabic and converts the chars into question
    marks). Now, when I upload the Unicode document to my webserver, instead
    of parsing the document normally, it just displays the actual contents
    of the file, literally (it is a PHP page, so you see the <??> and other
    actual code being displayed). Any idea what I am doing wrong? I am not
    sure what the problem might be (i.e. file format, FTP transfer mode,
    web-server config, etc.), so I thought I would start here.

    I am using the meta tag:
    <meta http-equiv="Content-Type" content="text/html;charset=windows-1252">

    Should I be using:
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8"> ?

    Will this cure the code display issue?

    Thank you for any help you can offer,

    GS
     
    GS, Jun 15, 2005
    #1

  2. "GS" <> wrote:

    > When I edit the files with the Arabic characters on my Windows box,
    > in, say, Notepad, the Arabic gets stripped unless I save it as a
    > Unicode document


    Why do you use Notepad? There are nice multilingual editors available,
    with much better features.

    > (ANSI strips the Arabic and converts the chars
    > into question marks).


    No, the American National Standards Institute does not strip anything.
    But Microsoft software, which falsely calls a Microsoft proprietary
    encoding "ANSI", does something like that, since that encoding has no
    codes for any Arabic letters.

    > Now, when I upload the Unicode document to
    > my webserver, instead of parsing the document normally, it is just
    > displaying the actual contents of the file, literally (it is a PHP
    > page, so you see the <??> and other actual code being displayed).


    If you want real help, post a real URL. It will not tell us everything,
    especially when PHP is involved, but it is a start. Also please specify
    the browser(s) you used for testing.

    > Any idea what I am doing wrong? I am not sure what the problem
    > might be (i.e. file format, ftp transfer mode, web-server config,
    > etc) so I thought I would start here.


    Well, we cannot even know what the FTP transfer mode was. Surely it
    should have been binary.

    > I am using the meta tag:
    > <meta http-equiv="Content-Type"
    > content="text/html;charset=windows-1252">


    This may matter, or it may not, depending on the actual HTTP headers.
    It is certainly wrong, anyway, if the encoding is UTF-8 and not
    windows-1252. _Why_ do you use it?

    > Should I be using:
    > <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
    > ?
    >
    > Will this cure the code display issue?


    You mean you did not test that before posting?

    Of course, testing would not prove much. But if your document is, in
    fact, UTF-8 encoded, as it sounds, then surely it should not contain a
    meta tag that says otherwise. On the other hand, a meta tag is neither
    necessary nor sufficient - it will be overridden by actual HTTP
    headers, if they specify the encoding.
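
    By the way, a quick way to see which charset the server actually
    announces is to fetch just the response headers. A minimal sketch,
    assuming PHP 5 with allow_url_fopen enabled, and a placeholder URL:

    <?php
    // Ask only for the response headers of the page in question
    // (the URL below is just a placeholder).
    $headers = get_headers('http://www.example.com/page.php');

    foreach ($headers as $h) {
        // Print the Content-Type line, e.g.
        // "Content-Type: text/html; charset=iso-8859-1"
        if (stripos($h, 'Content-Type:') === 0) {
            echo $h, "\n";
        }
    }
    ?>

    Whatever charset appears there is what browsers will honour, regardless
    of any meta tag inside the document.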


    --
    Yucca, http://www.cs.tut.fi/~jkorpela/
    Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html
     
    Jukka K. Korpela, Jun 15, 2005
    #2

  3. GS

    GS Guest

    "Jukka K. Korpela" <> wrote in message
    news:Xns9676CB51DAF18jkorpelacstutfi@193.229.0.31...
    > "GS" <> wrote:
    >
    > > When I edit the files with the Arabic characters on my Windows box,
    > > in, say, Notepad, the Arabic gets stripped unless I save it as a
    > > Unicode document

    >
    > Why do you use Notepad? There are nice multilingual editors available,
    > with much better features.


    Simply because I only had access to a locked-down machine that I was unable
    to install a better editor on. Any suggestions?

    > > (ANSI strips the Arabic and converts the chars
    > > into question marks).

    >
    > No, the American National Standards Institute does not strip anything.
    > But Microsoft software, which falsely calls a Microsoft proprietary
    > encoding "ANSI", does something like that, since that encoding has no
    > codes for any Arabic letters.


    My apologies, I meant Microsoft's "ANSI" then.

    > > Now, when I upload the Unicode document to
    > > my webserver, instead of parsing the document normally, it is just
    > > displaying the actual contents of the file, literally (it is a PHP
    > > page, so you see the <??> and other actual code being displayed).

    >
    > If you want real help, post a real URL. It will not tell us everything,
    > especially when PHP is involved, but it is a start. Also please specify
    > the browser(s) you used for testing.
    >


    Browsers: IE 6.x, Firefox 1.0.3

    Don't have a URL right now, as I took down the test page due to the code
    being shown.

    > > Any idea what I am doing wrong? I am not sure what the problem
    > > might be (i.e. file format, ftp transfer mode, web-server config,
    > > etc) so I thought I would start here.

    >
    > Well, we cannot even know what the FTP transfer mode was. Surely it
    > should have been binary.
    >


    FTP mode was indeed binary, sorry for not mentioning that. As I did
    mention, I am just starting to try to figure this out. I imagined someone
    in here had at one time had this exact problem and would know exactly
    what was going on.

    > > I am using the meta tag:
    > > <meta http-equiv="Content-Type"
    > > content="text/html;charset=windows-1252">

    >
    > This may matter, or it may not, depending on the actual HTTP headers.
    > It is certainly wrong, anyway, if the encoding is UTF-8 and not
    > windows-1252. _Why_ do you use it?


    I use windows-1252 because I have seen it suggested in other places that
    it should be used to alert browsers to incoming text that may have many
    different character variations, including right-to-left. Looking at many
    different Arabic websites, they seem to make use of this meta tag as well.

    >
    > > Should I be using:
    > > <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
    > > ?
    > >
    > > Will this cure the code display issue?

    >
    > You mean you did not test that before posting?
    >


    I did, but it made no difference at the time, and I was not sure whether
    it was needed. This should have been broken out into a second question. I
    should have asked:

    If I want to display English and Arabic on the same page, which meta tag
    will be more appropriate, and does this meta tag override what the webserver
    sends for a header (which you answered below, thank you)?

    Currently, my Apache webserver is sending
    Content-Type: text/html; charset=iso-8859-1. Is this an appropriate
    header for displaying Arabic, etc.?

    > Of course, testing would not prove much. But if your document is, in
    > fact, UTF-8 encoded, as it sounds, then surely it should not contain a
    > meta tag that says otherwise. On the other hand, a meta tag is neither
    > necessary nor sufficient - it will be overridden by actual HTTP
    > headers, if they specify the encoding.
    >
    >
    > --
    > Yucca, http://www.cs.tut.fi/~jkorpela/
    > Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html
    >
    >
     
    GS, Jun 15, 2005
    #3
  4. GS wrote:

    > I hope this is the right place for what should be a simple question.


    No - post to <news:comp.infosystems.www.authoring.html>

    > I have a website that is in English and now in Arabic. I am creating the
    > Arabic language content now, and am having a few problems getting the
    > content to display properly.


    Read first
    http://ppewww.ph.gla.ac.uk/~flavell/charset/text-direction.html
    and then post any further questions to
    <news:comp.infosystems.www.authoring.html>
     
    Andreas Prilop, Jun 15, 2005
    #4
  5. Toby Inkster

    Toby Inkster Guest

    GS wrote:

    > Should I be using:
    > <meta http-equiv="Content-Type" content="text/html;charset=utf-8"> ?


    Perhaps.

    > Will this cure the code display issue?


    No.

    I'm guessing that you have a file naming or server configuration issue.
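
    For instance, raw PHP showing up in the browser usually means the server
    never ran the file through PHP at all: the file may have been uploaded
    under the wrong extension, or the PHP handler may not be mapped. One
    possible example of the latter, for a typical Apache + mod_php setup
    (a sketch only - the exact directive varies between hosts and versions):

    AddType application/x-httpd-php .php

    Your hosting provider's documentation will say what, if anything, is
    actually needed.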

    --
    Toby A Inkster BSc (Hons) ARCS
    Contact Me ~ http://tobyinkster.co.uk/contact
     
    Toby Inkster, Jun 15, 2005
    #5
  6. N Cook

    N Cook Guest

    "GS" <> wrote in message
    news:qJYre.28952$...
    > Hi there. I hope this is the right place for what should be a simple
    > question.
    >
    > I have a website that is in English and is now also in Arabic. I am
    > creating the Arabic-language content now, and am having a few problems
    > getting the content to display properly.
    >
    > When I edit the files with the Arabic characters on my Windows box, in,
    > say, Notepad, the Arabic gets stripped unless I save it as a Unicode
    > document (ANSI strips the Arabic and converts the chars into question
    > marks). Now, when I upload the Unicode document to my webserver, instead
    > of parsing the document normally, it just displays the actual contents
    > of the file, literally (it is a PHP page, so you see the <??> and other
    > actual code being displayed). Any idea what I am doing wrong? I am not
    > sure what the problem might be (i.e. file format, FTP transfer mode,
    > web-server config, etc.), so I thought I would start here.
    >
    > I am using the meta tag:
    > <meta http-equiv="Content-Type" content="text/html;charset=windows-1252">
    >
    > Should I be using:
    > <meta http-equiv="Content-Type" content="text/html;charset=utf-8"> ?
    >
    > Will this cure the code display issue?
    >
    > Thank you for any help you can offer,
    >
    > GS
    >
    >


    Probably related to the problem I had and have now solved.

    Foreign Unicode script in a file corrupted the Google-cached version of
    an otherwise English page. I downloaded the hex editor XVI32 from
    http://www.chmaas.handshake.de/delphi/freeware/xvi32/xvi32.htm
    That allowed me to remove the two characters ÿþ / hex FE,FF / decimal
    255,254 / y with diaeresis and thorn, which clog up the front of the
    file and which you cannot see, let alone edit out, in Word or Notepad.
    Apparently this is added to denote that the file contains Unicode; it
    is the BOM (Byte Order Mark) and also the Zero Width Non-Breaking
    Space (ZWNBSP). The Google cache interprets this as inter-character
    spaces throughout the cached version, with consequent loss of HTML
    action. The preview pane on Google is also corrupted because of the
    spaces mangling the HTML. I'm surprised there is nothing on Google's
    FAQ pages about this. Putting "ÿþ" and "h t m l" into Google produced
    206,000 hits. Randomly sampling 5x10 of those showed 44 were mangled,
    so perhaps about 180,000 files are affected this way.

    With the hex editor I also used "Replace All" to change the
    inter-character 00 bytes to nothing (blank/empty), which also halves
    the file size. Then it is a matter of converting the foreign code
    characters, like hex code [ 05D2 ], to decimal code [ &#1490 ], which
    the Google cache seems to like, and browsers too (a scripted way to do
    all of this is sketched at the end of this post).

    For smallish amounts of text the conversion goes like this: in Word,
    convert all end-of-line ^p to * to concatenate, then break up into
    lines of about 100 characters. Submit each line in turn to Google
    (much more than 100 characters is a Google illegal op) and it returns
    the search string in &#....; form; highlight it and copy it back. In
    Word, convert * back to ^p, saving as non-Unicode text in a
    non-Unicode HTML file, and compare the result as viewed in a browser
    with a .png, .gif or .jpg image of the script, to check. Then add it
    to the English file.

    For a load of foreign text, use the block routine in XVI32 and copy
    the hex to Word as a .txt file after removing FE,FF, converting all
    the 00 bytes to 0D0A, and converting any spaces/punctuation to 2020 or
    whatever, as 4 characters. That gives a file of lines of 4 characters
    after converting 0D0A to ^p. Then make a macro for converting adjacent
    quad alphanumeric characters to decimal numbers. Finally change ^p to
    ;&# and tidy up the punctuation etc.

    I used this Yale file as a model, which in part reads correctly as
    foreign script in a browser and is cached by Google correctly:
    http://pclt.cis.yale.edu/pclt/encoding/
    It uses a bare minimum of HTML, e.g. not even a LANG designation.

    So with hindsight: just save the foreign hex text as a Unicode file,
    convert it to decimal form before adding it to the full English file,
    and then you can continue to save as ANSI and retain correct caching
    of the HTML on Google.

    For anyone else with this problem, but where there is no foreign text
    in the file and it was accidentally saved as Unicode: without a hex
    editor you will not see the ÿþ or the double zeros that Google sees.
    Suggestion: rename your file from XYZ.htm to XYZ_old.htm, view it in
    Internet Explorer and click View / Source, "Select All" the text, copy
    it into Notepad, and name the file XYZ.htm, saving as ANSI and not
    Unicode. If you want to check the file, download the XVI32 hex editor
    (link above) - it's only about 500 KByte, so it only takes a couple of
    minutes - and compare the two versions of your file.
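
    (For what it's worth, the whole manual route above - stripping the BOM
    and the 00 bytes and converting to decimal character references - can
    in principle be scripted. A rough sketch, assuming PHP 5 with the
    mbstring extension and made-up file names:

    <?php
    // Read the file as saved by Notepad ("Unicode" = UTF-16 with a BOM).
    $raw = file_get_contents('arabic_utf16.txt');   // made-up name

    // Work out the byte order from the BOM, then drop the BOM itself.
    if (substr($raw, 0, 2) === "\xFF\xFE") {
        $from = 'UTF-16LE';
        $raw  = substr($raw, 2);
    } elseif (substr($raw, 0, 2) === "\xFE\xFF") {
        $from = 'UTF-16BE';
        $raw  = substr($raw, 2);
    } else {
        $from = 'UTF-16LE';   // Notepad's usual byte order, assumed
    }

    // 'HTML-ENTITIES' turns the non-ASCII characters into character
    // references, so the result is plain ASCII and survives "ANSI" saves.
    $ascii = mb_convert_encoding($raw, 'HTML-ENTITIES', $from);

    file_put_contents('arabic_entities.txt', $ascii);
    ?>

    The output can then be pasted into the English page without any hex
    editing.)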
     
    N Cook, Jun 15, 2005
    #6
  7. "GS" <> wrote:

    >> Why do you use Notepad? There are nice multilingual editors
    >> available, with much better features.

    >
    > Simply because I only had access to a locked-down machine that I
    > was unable to install a better editor on. Any suggestions?


    I think you should try and find a computer that you have some control
    over, if you wish to create Arabic Web pages seriously, or any Web
    pages seriously. Ultimately it's a matter of your convenience only, but
    still.

    > Don't have a URL right now, as I took down the test page due to the
    > code being shown.


    Umm... the URL would have let us see what the server really sends.

    > I use windows-1252 because I have seen it suggested in other places
    > that it should be used to alert browsers to incoming text that may
    > have many different character variations, including right-to-left.


    Pardon? Where? Windows-1252 means Windows Latin 1, which has no Arabic
    letters, so either you misunderstood something, or those sites do
    something that overrides this error.

    > If I want to display English and Arabic on the same page, which
    > meta tag will be more appropriate,


    This is a whole new question. As a rule, don't mix languages. There are
    millions of people who know English but no Arabic, or vice versa. Why
    would you throw a foreign language at them? There are some excuses,
    most notably a link to an Arabic version of the page in the English
    version, or vice versa.

    Mixing English and Arabic isn't really much of a problem at the
    encoding level, since any encoding that lets you use Arabic letters
    lets you use English letters as well. It would be more difficult if you
    wanted to combine French and Arabic, for example.

    Forget meta tags, at least for now. Select an encoding, and specify it
    in HTTP headers. It could be UTF-8, or it could be ISO-8859-6, for
    example. Other things being equal, use UTF-8.

    > Currently, my Apache webserver is sending
    > Content-Type: text/html; charset=iso-8859-1. Is this an appropriate
    > header for displaying Arabic, etc.?


    No, because the ISO-8859-1 repertoire is a subset of the windows-1252
    (or "Microsoft ANSI") repertoire and thus does not contain any Arabic
    letters. The server should be configured to send e.g.
    Content-Type: text/html; charset=utf-8
    if your files are UTF-8 encoded. If you cannot do that, check whether
    you can make the server send _no_ charset parameter in that header;
    _then_ you can effectively specify the encoding in a meta tag. If you
    cannot do even that, i.e. the server persistently claims that everything
    is ISO-8859-1, then your only option (apart from getting a better
    server) for writing Arabic pages is to write all Arabic characters using
    character references, like &#1575; (the Arabic letter alef, ا). It's
    possible, but awkward, at least if you have no nice tool that lets you
    write normal Arabic and then converts it to a format with character
    references.
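
    If the server is Apache and per-directory overrides are allowed, the
    charset can often be set with a one-line .htaccess entry. A sketch
    only - whether your particular host permits this is another matter:

    AddDefaultCharset UTF-8

    or, to stop Apache from adding a default charset parameter (so that a
    meta tag can take effect):

    AddDefaultCharset Off

    Either way, check afterwards what Content-Type header the server
    really sends.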

    --
    Yucca, http://www.cs.tut.fi/~jkorpela/
    Pages about Web authoring: http://www.cs.tut.fi/~jkorpela/www.html
     
    Jukka K. Korpela, Jun 15, 2005
    #7
  8. Toby Inkster

    Toby Inkster Guest

    Jukka K. Korpela wrote:

    > The server should be configured to send e.g.
    > Content-Type: text/html; charset=utf-8
    > if your files are UTF-8 encoded. If you cannot do that, check if you
    > can make the server send _no_ charset parameter in that header


    The OP has already stated that he's using PHP. In which case, sending an
    appropriate header is as simple as putting this in an include file (say
    "headers.php"):

    <?php

    $ua = $_SERVER['HTTP_USER_AGENT'];

    // Very old browsers (e.g. NCSA Mosaic) are reported to mishandle the
    // charset parameter, so omit it for them and send it to everyone else.
    if (preg_match('/^Mosaic/', $ua))
    {
        header("Content-Type: text/html");
    }
    else
    {
        header("Content-Type: text/html; charset=utf-8");
    }

    ?>

    and then including it at the top of every file like this:

    <?php require_once "headers.php"; ?>
    <!DOCTYPE ....

    --
    Toby A Inkster BSc (Hons) ARCS
    Contact Me ~ http://tobyinkster.co.uk/contact
     
    Toby Inkster, Jun 16, 2005
    #8
  9. Dan

    Dan Guest

    N Cook wrote:
    > Apparently this is added to denote that the file contains Unicode; it
    > is the BOM (Byte Order Mark) and also the Zero Width Non-Breaking
    > Space (ZWNBSP). The Google cache interprets this as inter-character
    > spaces throughout the cached version, with consequent loss of HTML
    > action.


    Sounds like the page was encoded in a 16-bit encoding such as UTF-16LE
    (where every character takes two bytes) rather than a variable-size
    encoding where the characters in the US-ASCII range take only one byte.
    Perhaps the server wasn't sending proper headers to indicate this
    encoding.
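
    You can see where the "spaces" come from by looking at the raw bytes.
    A tiny sketch, assuming PHP with the mbstring extension:

    <?php
    // 'html' in UTF-16LE: every ASCII letter is followed by a 00 byte.
    echo bin2hex(mb_convert_encoding('html', 'UTF-16LE', 'ASCII'));
    // prints 680074006d006c00
    ?>

    Read as if it were a one-byte-per-character encoding, those 00 bytes
    are what turn "html" into "h t m l".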

    > nothing on Google's FAQ pages about this. Putting "ÿþ" and "h t m l"
    > into Google produced 206,000 hits.


    I looked at one of the sites reachable by this, and the server was
    sending the proper header of UTF-16LE, but the HTML document had a
    bogus meta tag incorrectly claiming the encoding was iso-8859-1. By
    the standards, browsers will ignore the meta tag when there's an actual
    HTTP header, but perhaps it confuses search engines.

    > Then it is a matter of converting the foreign code characters, like
    > hex code [ 05D2 ], to decimal code [ &#1490 ], which the Google cache
    > seems to like, and browsers too.


    Actually, you should include a semicolon at the end of numeric
    character references.

    > For smallish amounts of text the conversion goes like this: in Word,
    > convert all end-of-line ^p to * to concatenate, then break up into
    > lines of about 100 characters. Submit each line in turn to Google
    > (much more than 100 characters is a Google illegal op) and it returns
    > the search string in &#....; form; highlight it and copy it back. In
    > Word, convert * back to ^p, saving as non-Unicode text in a
    > non-Unicode HTML file, and compare the result as viewed in a browser
    > with a .png, .gif or .jpg image of the script, to check. Then add it
    > to the English file.


    That sounds like a really clumsy way of doing it compared to using a
    decent editor that lets you choose which character encoding to save as.
    And I wouldn't let MS Word anywhere near a document I intend to place
    on the Web; that program (and anything else from Microsoft) is bad news
    for standards compliance.

    > Suggestion: rename your file from XYZ.htm to XYZ_old.htm


    I prefer the extension .html myself, not the dumbed-down three-letter
    version designed to be compatible with 10-year-obsolete operating
    systems that can't handle longer filenames.

    > View it in Internet Explorer and click View / Source,


    Or, you can use a *decent* browser instead. I use Mozilla.

    --
    Dan
     
    Dan, Jun 16, 2005
    #9
  10. N Cook

    N Cook Guest

    "Dan" <> wrote in message
    news:...
    > [Dan's reply quoted in full; snipped]


    This was the 'reply' (human or bot?) I got back from emailing Google
    help:

    ______

    Thank you for your note.

    Thank you for your reply. We're happy to hear that this problem has been
    resolved. If we can assist you in the future, please don't hesitate to
    write.

    Regards,
    The Google Team

    Regards,
    The Google Team
     
    N Cook, Jun 16, 2005
    #10
  11. On Wed, 15 Jun 2005, N Cook wrote:

    > Probably related to the prob. i had and now solved


    let's see...

    > I downloaded the hex editor XVI32 from
    > http://www.chmaas.handshake.de/delphi/freeware/xvi32/xvi32.htm
    > That allowed me to remove the two characters ÿþ / hex FE,FF / decimal
    > 255,254 / y with diaeresis and thorn, which clog up the front of the
    > file,


    Hang on. That's not "two characters", that's a two-byte sequence which you
    can read about (well, from what follows maybe you already did) at the
    Unicode FAQ, http://www.unicode.org/faq/utf_bom.html#22

    FE,FF designates the data as being in UTF-16BE format, so if the data is
    not in fact in UTF-16BE format then something has gone unpleasantly wrong
    with it before you got it, and I'd recommend finding out what went wrong,
    because messing around with corrupt data after the event is not a very
    robust way to engineer anything.

    Btw, I would not recommend serving out data in any utf-16 format to the
    web even if you did have it... (if you need a Unicode format for the web
    then utf-8 is the recommended choice).
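
    If you are stuck with such a file, converting it once, properly, is
    less fragile than hex-editing bytes out of it after the event. A rough
    sketch, assuming PHP 5 with iconv, a made-up file name, and that the
    file really does start with a BOM:

    <?php
    $raw = file_get_contents('page_utf16.html');   // made-up name

    // The first two bytes are the BOM: FF FE = little-endian,
    // FE FF = big-endian. Drop them and re-encode the rest as UTF-8.
    $order = (substr($raw, 0, 2) === "\xFE\xFF") ? 'UTF-16BE' : 'UTF-16LE';
    $utf8  = iconv($order, 'UTF-8', substr($raw, 2));

    file_put_contents('page_utf8.html', $utf8);
    ?>

    ... and then, of course, make sure the server labels the result as
    charset=utf-8.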

    > Apparently this is added to denote that the file contains Unicode; it
    > is the BOM (Byte Order Mark) and also the Zero Width Non-Breaking
    > Space (ZWNBSP).


    Well, it means one or the other, yes, although it certainly doesn't mean
    ZWNBSP if the coding is iso-8859-1 or windows-1252 or utf-8. I think the
    FAQ tries to clear this up a bit.

    In short it seems your server was saying one thing and the data was trying
    to say something else, and some recipients (such as Google) were
    understandably confused.


    metacomment: my normal news server does not take alt.* groups, so I rarely
    participate here. I'd be happy to continue any interesting features of
    the discussion in relevant comp.* groups such as
    comp.infosystems.www.authoring.html

    But I think you've had enough input from A.Prilop and J.Korpela to take
    things forward already.
     
    Alan J. Flavell, Jun 16, 2005
    #11
