Is it OK to change $ENV{'QUERY_STRING'} before "use CGI;" is called?

Discussion in 'Perl Misc' started by Raymundo, Mar 4, 2007.

  1. Raymundo (Mar 4, 2007)

    Dear,

    When a web browser sends a GET request, I can get "keywords" or
    "param"eters using the CGI module, as you know:
    $q = new CGI;
    $name = $q->param('name');

    However, when the browser's request includes multi-byte characters, they
    can be encoded in UTF-8 or in EUC-KR (in Korea, for example), according
    to an option in the browser ("Send URL in UTF-8" in IE,
    "network.standard-url.encode-utf8" in FF, etc.).

    At first, I tried to check each value that I got from $q->param(), like
    this:

    $name = $q->param("name");
    $name = check_and_convert($name);
    ....

    sub check_and_convert {
        # this subroutine guesses the encoding of the parameter using Encode::Guess;
        # if it is not UTF-8, it converts the parameter to a UTF-8 encoded string and returns it
    }
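
    For example, such a subroutine might look roughly like this (an untested
    sketch; it assumes EUC-KR is the only non-UTF-8 encoding I need to handle):

    use Encode qw(encode);
    use Encode::Guess;

    sub check_and_convert {
        my ($octets) = @_;
        return $octets unless defined $octets && length $octets;

        # suspects: euc-kr, plus Encode::Guess's defaults (ascii, utf8, ...)
        my $guess = guess_encoding($octets, 'euc-kr');
        return $octets unless ref $guess;                 # ambiguous or unknown: leave it alone
        return $octets if $guess->name =~ /utf-?8/i;      # already UTF-8

        return encode('UTF-8', $guess->decode($octets));  # e.g. EUC-KR -> UTF-8
    }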


    But there are so many parameters, and so much code using them, that I
    found it almost impossible, or at least very inconvenient, to check
    every place where a parameter is fetched.


    Second, I tried to "check and convert" the $ENV{QUERY_STRING} value before
    a CGI object is created:

    # convert QUERY_STRING to UTF-8 here
    $ENV{QUERY_STRING} = check_and_convert($ENV{QUERY_STRING});
    # then create CGI object
    $q = new CGI;
    # I can get name=XXX and XXX is encoded in UTF-8
    $name = $q->param("name");

    In this case, I think, I don't need to check each parameter anywhere in
    the following code... all the values are now UTF-8 encoded.

    As far as I have tested, it looks successful. But I'm not sure whether
    such an approach is good and safe. (I think it's somewhat tricky to
    change an environment variable inside the script.)

    Is there any other environment variable, or anything else, that I should
    check before "new CGI;" is called? Can I be sure that I won't lose
    any information when I change QUERY_STRING?

    Any advice would be appreciated. I'm sorry I'm not good at English.
    Raymundo, South Korea.
     

  2. Anno (Mar 4, 2007)

    Raymundo <> wrote in comp.lang.perl.misc:
    > Dear,
    >
    > When a web browser sends a GET request, I can get "keywords" or
    > "param"eters using the CGI module, as you know:
    > $q = new CGI;
    > $name = $q->param('name');
    >
    > However, when the browser's request includes multi-byte characters, they
    > can be encoded in UTF-8 or in EUC-KR (in Korea, for example), according
    > to an option in the browser ("Send URL in UTF-8" in IE,
    > "network.standard-url.encode-utf8" in FF, etc.).
    >
    > At first, I tried to check each value that I got from $q->param(), like
    > this:
    >
    > $name = $q->param("name");
    > $name = check_and_convert($name);
    > ...
    >
    > sub check_and_convert {
    >     # this subroutine guesses the encoding of the parameter using Encode::Guess;
    >     # if it is not UTF-8, it converts the parameter to a UTF-8 encoded string and returns it
    > }
    >
    >
    > But there are so many parameters, and so much code using them, that I
    > found it almost impossible, or at least very inconvenient, to check
    > every place where a parameter is fetched.
    >
    >
    > Second, I tried to "check and convert" the $ENV{QUERY_STRING} value before
    > a CGI object is created:
    >
    > # convert QUERY_STRING to UTF-8 here
    > $ENV{QUERY_STRING} = check_and_convert($ENV{QUERY_STRING});
    > # then create CGI object
    > $q = new CGI;
    > # I can get name=XXX and XXX is encoded in UTF-8
    > $name = $q->param("name");
    >
    > In this case, I think, I don't need to check each parameter anywhere in
    > the following code... all the values are now UTF-8 encoded.
    >
    > As far as I have tested, it looks successful. But I'm not sure whether
    > such an approach is good and safe. (I think it's somewhat tricky to
    > change an environment variable inside the script.)
    >
    > Is there any other environment variable, or anything else, that I should
    > check before "new CGI;" is called? Can I be sure that I won't lose
    > any information when I change QUERY_STRING?
    >
    > Any advice would be appreciated. I'm sorry I'm not good at English.
    > Raymundo, South Korea.


    You should avoid changing the environment like that. Use the interface
    that CGI provides. The ->Vars method gives you a hash that contains
    the parameter values keyed by their names. Convert it as follows
    (untested):

    my $param = $q->Vars;
    $_ = check_and_convert( $_) for values %$param;

    This supposes that check_and_convert() leaves null bytes alone. If you
    are not sure of that, use

    $_ = join( "\0", map check_and_convert( $_), split /\0/, $_)
        for values %$param;

    See perldoc CGI for the significance of null bytes in the values.
    Either way you will convert all values in one go. Use the converted
    hash instead of the ->param method for parameter access.

    Anno
     

  3. Ben Morrow (Mar 5, 2007)

    Quoth "Raymundo" <>:
    > Dear,
    >
    > When a web browser sends a GET request, I can get "keywords" or
    > "param"eters using the CGI module, as you know:
    > $q = new CGI;
    > $name = $q->param('name');
    >
    > However, when the browser's request includes multi-byte characters, they
    > can be encoded in UTF-8 or in EUC-KR (in Korea, for example), according
    > to an option in the browser ("Send URL in UTF-8" in IE,
    > "network.standard-url.encode-utf8" in FF, etc.).
    >
    > At first, I tried to check each value that I got from $q->param(), like
    > this:
    >
    > $name = $q->param("name");
    > $name = check_and_convert($name);
    > ...
    >
    > sub check_and_convert {
    >     # this subroutine guesses the encoding of the parameter using Encode::Guess;
    >     # if it is not UTF-8, it converts the parameter to a UTF-8 encoded string and returns it
    > }


    I would not recommend using Encode::Guess. It isn't safe.

    For a detailed explanation of I18N form submission, see
    http://xrl.us/u68e . Executive summary: serve forms as 'text/html;
    charset=utf-8' and assume the results are in UTF-8. You should decode
    *after* getting the values from CGI->param.
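
    For instance, something along these lines (an untested sketch; "name" is
    just an example parameter):

    use CGI;
    use Encode qw(decode);

    my $q = CGI->new;

    # serve the page (and its form) as UTF-8, so browsers submit UTF-8 back
    print $q->header(-type => 'text/html', -charset => 'utf-8');

    # decode *after* fetching the raw octets from param()
    my $name = decode('UTF-8', scalar $q->param('name'));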

    Ben

    --
    'Deserve [death]? I daresay he did. Many live that deserve death. And some die
    that deserve life. Can you give it to them? Then do not be too eager to deal
    out death in judgement. For even the very wise cannot see all ends.'
     
  4. Raymundo (Mar 6, 2007)

    Thank you, Anno and Ben.

    Anno's suggestion:
    my $param = $q->Vars;
    $_ = check_and_convert( $_) for values %$param;
    works well with GET requests. But it causes a problem with POST requests,
    such as file uploads. I don't know why; I just guess it's because
    check_and_convert affects the contents of the POST request. (If I comment
    out the check_and_convert line, the script works well.)

    I'm only interested in GET requests, because a POST request includes a
    "charset=" field in its header, and I can convert the encoding of the
    contents if needed. So I'm planning to add an if clause:
    if ($q->request_method() eq "GET") {
        my $param = $q->Vars;
        $_ = check_and_convert( $_) for values %$param;
    }



    Ben, would you please tell me why Encode::Guess isn't safe? Does it
    have a security problem?

    Anyway,

    > For a (detailed) explanation of details of I18N form submission, see
    > http://xrl.us/u68e. Executive summary: serve forms as 'text/html;
    > charset=utf-8' and assume the results are in UTF-8.


    The script already does so when it prints forms and receives POST data
    from them, and that seems to be working well.

    The problem is related to GET requests, that is, when the URL includes
    multi-byte characters. The W3C recommends that multi-byte chars in a URL
    should be %-encoded
    (http://www.w3.org/TR/REC-html40/interact/forms.html#form-content-type).
    But I still want to support visitors who type the URL with their fingers
    (they would not like to type "%EC%90.." :) and other web pages that link
    to my page without using a %-encoded string.

    ....

    Returning to my first post in this thread... is it such a bad idea to
    change the environment variable QUERY_STRING? It solves every problem
    here, and it requires only one additional line of code. I think the
    change affects only the script and its child processes, and the
    script doesn't fork any child processes.

    Raymundo, South Korea.
     
  5. Ben Morrow (Mar 6, 2007)

    Quoth "Raymundo" <>:
    >
    > Ben, would you please tell me why Encode::Guess isn't safe? Does it
    > have a security problem?


    Not security, per se; it's just that it's impossible to reliably
    distinguish between (say) UTF-8 and ISO8859-1 that just happens to look
    like UTF-8.
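
    For example (a tiny illustration): the two octets 0xC3 0xA9 are valid
    UTF-8 (for "é") and equally valid ISO8859-1 (for "Ã©"), so a guesser that
    is given both as suspects has no way to pick one:

    use Encode::Guess;

    my $octets = "\xC3\xA9";                          # UTF-8 "é", or ISO8859-1 "Ã©"
    my $guess  = guess_encoding($octets, 'latin1');   # latin1 plus the default suspects (incl. utf8)

    print ref $guess ? "guessed: " . $guess->name . "\n"
                     : "ambiguous: $guess\n";         # this branch is the likely outcome here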

    > > For a (detailed) explanation of details of I18N form submission, see
    > > http://xrl.us/u68e. Executive summary: serve forms as 'text/html;
    > > charset=utf-8' and assume the results are in UTF-8.


    Also, if you read the page linked, you will see that many browsers do...
    rather stupid things when the user enters text into a form that is not
    representable in the encoding of the page. Since UTF-8 can represent
    everything, it doesn't have that problem.

    > The script already does so when it prints forms and receives POST data
    > from them, and that seems to be working well.
    >
    > The problem is related to GET requests, that is, when the URL includes
    > multi-byte characters. The W3C recommends that multi-byte chars in a URL
    > should be %-encoded
    > (http://www.w3.org/TR/REC-html40/interact/forms.html#form-content-type).
    > But I still want to support visitors who type the URL with their fingers
    > (they would not like to type "%EC%90.." :) and other web pages that link
    > to my page without using a %-encoded string.


    Well... a not-url-encoded URL is invalid. At least Firefox appears to
    automatically translate (say) a URL typed into the address bar into its
    correct URL-escaped form before submitting it to the server; I don't
    know what IE or Konq/Safari or Opera do.

    > Returning to my first post in this thread... is it such a bad idea to
    > change the environment variable QUERY_STRING? It solves every problem
    > here, and it requires only one additional line of code. I think the
    > change affects only the script and its child processes, and the
    > script doesn't fork any child processes.


    If you're using CGI.pm to process QUERY_STRING, then you should stick to
    that. Messing about is just asking for trouble. What is the problem with
    decoding the submitted values afterwards? (It can still be one line or
    so of code, if you do it right. See Anno's example.)

    Ben

    --
    I must not fear. Fear is the mind-killer. I will face my fear and
    I will let it pass through me. When the fear is gone there will be
    nothing. Only I will remain.
    Frank Herbert, 'Dune'
     
  6. Raymundo (Mar 6, 2007)

    Oops... I wrote a reply. It took about three hours. (It's too difficult
    for me to write in English.) I posted it an hour ago but I can't see it
    even now. I'm afraid it's lost :'(

    I'll rewrite my last reply...


    At first, thank you Ben for your kind advice.


    In fact, the Perl script that I'm modifying is not my own code. It is
    UseModWiki (http://www.usemod.com/cgi-bin/wiki.pl) and I've been
    modifying it for my personal homepage. (But I'm just a novice in Perl,
    so it's not easy :)

    In a wiki site, the URL of each page consists of the script URL and "the
    title of that page", like ".../wiki.pl?Perl". I'm Korean and my wiki
    has many pages whose names are in Korean.




    > Well... a not-url-encoded URL is invalid. At least Firefox appears to
    > automatically translate (say) a URL typed into the address bar into its
    > correct URL-escaped form before submitting it to the server; I don't
    > know what IE or Konq/Safari or Opera do.


    As you said, multi-byte characters in a URL are invalid. I know that :'(
    So a URL-encoded URL is the answer. However, see the following URLs:
    1: .../wiki.pl?Linux <- everyone can tell it is the page about "Linux"
    2: .../wiki.pl?%EB%A6%AC%EB%88%85%EC%8A%A4 <- can anyone guess what
    the title of this page is? :-/ It's "Linux" in Korean
    3: .../wiki.pl?리눅스 <- (If you can't see the Korean chars, please see
    http://gypark.pe.kr/upload/linux_in_korean.gif ) Everyone who can read
    Korean can tell it is the page about Linux. (I'll type "LINUX(ko)" for
    this word from now on)

    URL 2 is valid, but its appearance is so... :-/ And I must give up
    the big advantage of a wiki, that the URL represents the content.

    URL 3 is said to be invalid. But I still want to support it. That is,
    when someone types that URL in the address bar of a browser, or
    clicks a link to URL 3 on another site, I want my wiki.pl
    script to show the proper page, "LINUX(ko)".

    Fortunately, web browsers like FF, IE, and Safari convert the URL into
    %-encoded form before they submit it, as you said. Therefore, I think,
    the URL containing multi-byte chars is not the main issue, because the
    server will receive a %-encoded request. The problem is that, as I said
    in my first article, the %-encoded form of "LINUX(ko)" is not unique.
    It can be "%EB%A6%AC%EB%88%85%EC%8A%A4" (a UTF-8 sequence) or
    "%B8%AE%B4%AA%BD%BA" (EUC-KR, used in Korea). The browsers choose which
    encoding to use according to an option (for FF,
    "network.standard-url.encode-utf8" in "about:config"). The server can't
    choose it and can't even know explicitly which was chosen, which is the
    reason that wiki.pl has to "guess".




    > > Returning to my first post in this thread... is it such a bad idea to
    > > change the environment variable QUERY_STRING? It solves every problem
    > > here, and it requires only one additional line of code. I think the
    > > change affects only the script and its child processes, and the
    > > script doesn't fork any child processes.

    >
    > If you're using CGI.pm to process QUERY_STRING, then you should stick to
    > that. Messing about is just asking for trouble. What is the problem with
    > decoding the submitted values afterwards? (It can still be one line or
    > so of code, if you do it right. See Anno's example.)



    "The problem with decoding the submitted values afterward" is...
    (following are come from my testing results. it may be fixed but I'm
    not so expert in Perl)

    1) There are hundreds of lines that call "->param()". I don't think
    it's a good idea to insert so many "guess_and_convert()" calls after
    those lines.

    1-1) In fact, those lines actually call the "GetParam()" subroutine, and
    GetParam() calls ->param inside it. So one solution could be to insert
    guess_and_convert() into GetParam(). However, GetParam() fetches the
    value of a parameter not only from GET requests but also from POST
    requests and even from saved files. For now, I'm not sure it's OK to
    modify GetParam(). In addition, it seems inefficient to call the
    convert routine every time a single parameter is fetched.

    2) Concerning Anno's example: it looks good because it calls the convert
    routine only once. However, it shows some problems while processing
    POST requests, like file uploads, receiving trackbacks, etc. I tried
    to debug it but failed to find out why. I think the second best way is
    to apply that code with an additional if clause:
    if ($q->request_method() eq "GET").

    3) In the original code, there are some lines that access
    $ENV{QUERY_STRING} directly, without calling CGI functions. I need to
    apply "guess_and_convert" to those lines.



    So I cling to Q_S, like this. :) As far as I know (please correct me
    if I am wrong):
    1) Q_S is related only to GET requests. (All the forms in wiki.pl call
    "wiki.pl" without appending any URL query when they submit.)

    2) Q_S may be in the form of "keywords" or
    "param1=value1&param2=value2...". guess_and_convert() will not change
    the important characters like "&", "=", "+", and it will not change any
    other ASCII characters either; it will only change the multi-byte chars.
    Because those characters have already been %-encoded by the browser,
    this change is just a change in the number and the sequence of the "%HH"
    runs (see the sketch after this list). There is, I think, no problem when
    the CGI object is created and initialized from Q_S.

    3) Changing Q_S affects only the running script and its child
    processes.

    4) Since I began to test my approach, no problem has shown up so far.
    (Of course, this can't be proof that it will never cause a problem,
    which is why I asked for your advice on Usenet. :)

    5) Most of all, I expect that I won't need to care about it when the
    rest of the code is updated (at least until browser behaviour, or the
    CGI module, changes dramatically).
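
    To be concrete, the idea of guess_and_convert() over Q_S is roughly the
    following (a simplified, untested sketch with a made-up name, not my exact
    code; it only touches runs of %8x-%Fx escapes and raw high-bit bytes, so
    ASCII escapes such as "%3B" stay exactly as they are):

    use Encode qw(encode);
    use Encode::Guess;

    sub guess_and_convert_qs {
        my ($qs) = @_;
        # only touch runs that can carry multi-byte text: %HH escapes with a
        # high first hex digit, and raw bytes >= 0x80 (as IE sends for EUC-KR)
        $qs =~ s{((?:%[89A-Fa-f][0-9A-Fa-f]|[\x80-\xFF])+)}{
            my $run = $1;
            (my $bytes = $run) =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;   # unescape the run
            my $guess = guess_encoding($bytes, 'euc-kr');
            if (ref $guess && $guess->name !~ /utf-?8/i) {
                $bytes = encode('UTF-8', $guess->decode($bytes));      # e.g. EUC-KR -> UTF-8
            }
            join '', map { sprintf '%%%02X', ord } split //, $bytes;   # re-escape as %HH
        }ge;
        return $qs;
    }

    # before the first CGI.pm call:
    # $ENV{QUERY_STRING} = guess_and_convert_qs($ENV{QUERY_STRING});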

    If anyone gives me concrete examples of problems that may appear
    when I convert the encoding of Q_S, I'll give up my way immediately...



    Raymundo
     
  7. Ben Morrow (Mar 6, 2007)

    Quoth "Raymundo" <>:
    > In fact, the Perl script that I'm modifying is not my own code. It is
    > UseModWiki (http://www.usemod.com/cgi-bin/wiki.pl) and I've been
    > modifying it to use it for my personal homepage. (But I'm just a
    > novice in Perl so it's not easy :)


    It would have been helpful if you'd mentioned this at the start. :)

    > In wiki site, the URL of each page consists of script URL and "the
    > title of that page", like ".../wiki.pl?Perl". I'm a Korean and my wiki
    > has many pages whose names are in Korean.
    >
    > > Well... a not-url-encoded URL is invalid. At least Firefox appears to
    > > automatically translate (say) a URL typed into the address bar into its
    > > correct URL-escaped form before submitting it to the server; I don't
    > > know what IE or Konq/Safari or Opera do.

    >
    > As you said, multi-byte characters in URL is invalid. I know it :'( So
    > url-encoded URL is the answer. However, see the following URLs:
    > 1: .../wiki.pl?Linux <- Everyone can know it is the page about "Linux"
    > 2: .../wiki.pl?%EB%A6%AC%EB%88%85%EC%8A%A4 <- Can anyone guess what
    > the title of this page is?? :-/ It's "Linux" in Korean

    [ I've stripped the top-bit-set characters: my newsreader appears to
    have mangled them ]
    > 3: .../wiki.pl? <- (If you can't see the Korean chars, plz see
    > http://gypark.pe.kr/upload/linux_in_korean.gif ) Everyone who are able
    > to read Korean can know it is the page about Linux. (I'll type
    > "LINUX(ko)" for this word from now on)
    >
    > URL 2 is valid, but its appearance is so.... :-/ And I must give up
    > the big advantage of wiki, "URL represent the content"
    >
    > URL 3 is said to be invalid. But I still want to support it. That is,
    > when someone types that URL in the address bar of a browser, or
    > someone clicks the link to URL 3 in other site,


    Is it common practice for people to write links to URLs with multibyte
    chars in them? Since the actual link itself is not user-visible (the
    text of the link is, but that's quite different) there's no reason not
    to encode it correctly, is there? Of course, if it *is* common practice,
    you may well want to handle it (if you can), regardless of its
    incorrectness.

    > I want my wiki.pl script show the proper page, "LINUX(ko)".


    Firstly, let me say that I entirely sympathise with this desire :). It
    is a major failing in the design of URLs that they are so unfriendly to
    people whose native language is not English.

    That said, I do not think you can win here :). At least my copy of FF
    will convert .../wiki.pl?KOREAN_CHARS into %-encodings *in the address
    bar* before it submits the URL. IE6 appears to do the opposite: that is,
    AFAICT it both displays the URL as typed in the address bar and actually
    submits a multi-byte URL to the server. Your Q_S munging will need to be
    quite subtle, to handle cases like .../wiki.pl?foo%3bbar, and correctly
    distinguish them from .../wiki.pl?foo;bar, which presumably means
    something quite different.

    > Fortunately, web browsers like FF, IE, and Safari convert the URL into
    > %-encoded form before they submit it, as you said. Therefore, I think,
    > the URL containing multi-byte chars is not the main issue, because the
    > server will receive a %-encoded request. The problem is that, as I said
    > in my first article, the %-encoded form of "LINUX(ko)" is not unique.
    > It can be "%EB%A6%AC%EB%88%85%EC%8A%A4" (a UTF-8 sequence) or
    > "%B8%AE%B4%AA%BD%BA" (EUC-KR, used in Korea). The browsers choose which
    > encoding to use according to an option (for FF,
    > "network.standard-url.encode-utf8" in "about:config"). The server can't
    > choose it and can't even know explicitly which was chosen, which is the
    > reason that wiki.pl has to "guess".


    OK, so you're in an impossible situation and you're trying to do the
    best you can. Encode::Guess may be your best option here :).

    > > > Returning to my first post in this thread... is it such a bad idea to
    > > > change the environment variable QUERY_STRING? It solves every problem
    > > > here, and it requires only one additional line of code. I think the
    > > > change affects only the script and its child processes, and the
    > > > script doesn't fork any child processes.

    > >
    > > If you're using CGI.pm to process QUERY_STRING, then you should stick to
    > > that. Messing about is just asking for trouble. What is the problem with
    > > decoding the submitted values afterwards? (It can still be one line or
    > > so of code, if you do it right. See Anno's example.)

    >
    > "The problem with decoding the submitted values afterward" is...
    > (following are come from my testing results. it may be fixed but I'm
    > not so expert in Perl)
    >
    > 1) There are hundreds of lines that call "->param()". I don't think
    > it's good idea to insert so many "guess_and_convert()" after those
    > lines.
    >
    > 1-1) In fact, those lines actually call the "GetParam()" subroutine, and
    > GetParam() calls ->param inside it. So one solution could be to insert
    > guess_and_convert() into GetParam(). However, GetParam() fetches the
    > value of a parameter not only from GET requests but also from POST
    > requests and even from saved files. For now, I'm not sure it's OK to
    > modify GetParam(). In addition, it seems inefficient to call the
    > convert routine every time a single parameter is fetched.


    I would say the Right Answer in this case is to write your own GetParam
    sub which calls the original GetParam, and then applies your
    Encode::Guess logic. If the script isn't changing the values of the
    parameters, only accessing them, you can avoid the multiple guessing by
    using the Memoize module on your sub.
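
    Something like this (untested; I'm assuming GetParam's interface is
    GetParam($name, $default) as in stock UseModWiki, and check_and_convert
    is your guessing routine):

    use Memoize;

    # call sites use GetParamUtf8() instead of GetParam()
    sub GetParamUtf8 {
        my ($name, $default) = @_;
        return check_and_convert( GetParam($name, $default) );
    }

    # cache the results, so each parameter is guessed and converted only once
    memoize('GetParamUtf8');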

    > 2) Concerning Anno's example: it looks good because it calls the convert
    > routine only once. However, it shows some problems while processing
    > POST requests, like file uploads, receiving trackbacks, etc. I tried
    > to debug it but failed to find out why. I think the second best way is
    > to apply that code with an additional if clause:
    > if ($q->request_method() eq "GET").


    What sort of problems? If your guessing routine is guessing incorrectly
    for some of your real data, this indicates it's not safe to use it
    anyway.

    > 3) In the original code, there are some lines that access
    > $ENV{QUERY_STRING} directly, without calling CGI functions. I need to
    > apply "guess_and_convert" to those lines.


    Well, that's just evil :). My standard recommendation at this point
    would be to throw out whatever it is you're using and find something
    that's decently written.

    > So I cling to Q_S like this. :) As far as I know: (please correct me
    > if I am wrong)
    > 1) Q_S is related to only GET request. (All the forms in wiki.pl calls
    > "wiki.pl" without any appending URL query when it submits)


    You may be correct in this case that your wiki.pl only uses a query
    string for GET requests. It is certainly possible to POST to a URL with
    a query string.
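
    (As an aside: for a POST request, CGI.pm keeps the two sources separate;
    param() reads the POST body, while url_param() always reads the query
    string, so that case can still be handled explicitly if it ever matters.
    A rough sketch:)

    use CGI;
    my $q = CGI->new;

    # for a POST to .../wiki.pl?action=edit :
    my $action = $q->url_param('action');   # always comes from QUERY_STRING
    my $text   = $q->param('text');         # comes from the POST body on a POST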

    > 2) Q_S may be in the form of "keywords" or
    > "param1=value1&param2=value2...". guess_and_convert() will not change
    > the important characters like "&", "=", "+". It will not change any
    > other ASCII characters. It will just change the multi-byte chars.
    > Because those characters have been already encoded by browser, this
    > change is just the change of the number and the sequence of the "%HH"
    > runs. There is, I think, no problem when CGI object is created and
    > initialized using Q_S.


    Err... OK. You must make sure you alter Q_S *before* any CGI.pm calls
    are made, though.

    > 3) Changing Q_S affects only the running script and its child
    > processes.


    I don't know what happens under mod_perl, if you ever move your script
    to that environment. Under standard CGI, this is certainly true.
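
    If that ever becomes a concern, one way to limit the scope of the change
    (an untested sketch; guess_and_convert_qs is just a placeholder name for
    whatever conversion routine you end up with) is to localize the override
    so it is undone when the enclosing block exits:

    {
        # guess_and_convert_qs() is the conversion routine, whatever it ends up being
        local $ENV{QUERY_STRING} = guess_and_convert_qs($ENV{QUERY_STRING});
        my $q = CGI->new;            # sees the converted query string
        # ... handle the request with $q ...
    }                                # the original QUERY_STRING is restored here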

    It seems to me that you are trying to take a piece of rather
    badly-written code you don't really understand, and alter it to do
    something that isn't really possible anyway. Given that you're in that
    much of a mess, a simple edit of $ENV{QUERY_STRING} may well be the best
    way out :).

    Ben

    --
    All persons, living or dead, are entirely coincidental.
    Kurt Vonnegut
     
  8. Raymundo (Mar 7, 2007)

    > > 3: .../wiki.pl? <- (If you can't see the Korean chars, plz see
    > >http://gypark.pe.kr/upload/linux_in_korean.gif) Everyone who are able
    > > to read Korean can know it is the page about Linux. (I'll type
    > > "LINUX(ko)" for this word from now on)

    >
    > > URL 2 is valid, but its appearance is so.... :-/ And I must give up
    > > the big advantage of wiki, "URL represent the content"

    >
    > > URL 3 is said to be invalid. But I still want to support it. That is,
    > > when someone types that URL in the address bar of a browser, or
    > > someone clicks the link to URL 3 in other site,

    >
    > Is it common practice for people to write links to URLs with multibyte
    > chars in them? Since the actual link itself is not user-visible (the
    > text of the link is, but that's quite different) there's no reason not
    > to encode it correctly, is there? Of course, if it *is* common practice,
    > you may well want to handle it (if you can), regardless of its
    > incorrectness.


    Do you mean this case?
    [a href="actual link itself"] text of the link [/a]
    (I replaced the "less than" and "greater than" signs with brackets, so
    that no smart(?) newsreader processes it as a real link.)

    Yes, you're right. In that case the URL is hidden from the user, so it
    doesn't matter that the URL is "...%EB%A6". And this is very typical in
    plain HTML documents.

    However, many recent CGI tools, like blogs (MovableType, TatterTools,
    etc.) and almost all wikis (as far as I know), provide an "auto-linking"
    feature. Someone posts an article in plain text to his/her blog, then
    the blog tool looks for URL patterns in the text, converts them to
    "a href" links, and prints them in its HTML output. In this case, the
    "text of the link" is equal to the "actual link".

    Another example: wikis provide the concept of "interwiki" for convenient
    linking. That is, when I submit the text:
    UseMod:UseModWiki
    Google:UseModWiki (even though Google is not a wiki...)
    in the HTML output they are converted automatically to the following
    links, respectively:
    [a href="http://www.usemod.com/cgi-bin/wiki.pl?UseModWiki"]UseMod:UseModWiki[/a]
    [a href="http://www.google.com/search?q=UseModWiki"]Google:UseModWiki[/a]
    (The mapping table between an interwiki name like "Google:" and the
    real URL like "http://www.google.com/search?q=" is stored in a file
    on the server.)

    In this case, someone may want to put a link to my page in his wiki.
    Then "Raymundo:LINUX(ko)" is much (x100) easier for him, and more
    understandable to other visitors, than
    "Raymundo:%EB%A6%AC%EB%88%85%EC%8A%A4".

    I've already modified my wiki so that it encodes the actual link when
    it processes interwiki names. But it's impossible to force every
    developer of every wiki in the world to do the same. :)
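
    Roughly, the encoding step looks like this (a simplified sketch, not the
    actual UseModWiki code; the mapping table is hard-coded here instead of
    being read from its file):

    use URI::Escape qw(uri_escape_utf8);

    # in the real script the mapping table is read from a file on the server
    my %interwiki = (
        UseMod => 'http://www.usemod.com/cgi-bin/wiki.pl?',
        Google => 'http://www.google.com/search?q=',
    );

    # turn "Google:UseModWiki" into a link, %-escaping the page name so the
    # generated URL is always valid
    sub interwiki_link {
        my ($site, $page) = @_;
        return '<a href="' . $interwiki{$site} . uri_escape_utf8($page) . '">'
             . "$site:$page</a>";
    }

    # e.g. print interwiki_link('Google', 'UseModWiki'), "\n";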

    Anyway, this type of link may be common practice nowadays, in my
    opinion.



    > > I want my wiki.pl script show the proper page, "LINUX(ko)".

    >
    > Firstly, let me say that I entirely sympathise with this desire :). It
    > is a major failing in the design of URLs that they are so unfriendly to
    > people whose native language is not English.
    >
    > That said, I do not think you can win here :). At least my copy of FF
    > will convert .../wiki.pl?KOREAN_CHARS into %-encodings *in the address
    > bar* before it submits the URL. IE6 appears to do the opposite: that is,
    > AFAICT it both displays the URL as typed in the address bar and actually
    > submits a multi-byte URL to the server. Your Q_S munging will need to be
    > quite subtle, to handle cases like .../wiki.pl?foo%3bbar, and correctly
    > distinguish them from .../wiki.pl?foo;bar, which presumably means
    > something quite different.



    I agree that IE6 acts differently (and strangely). This is the Apache
    access_log when a request URL includes "wiki/LINUX(ko)":

    "GET /wiki/\xb8\xae\xb4\xaa\xbd\xba" <- IE, EUC-KR
    "GET /wiki/%B8%AE%B4%AA%BD%BA <- FF, EUC-KR
    "GET /wiki/%EB%A6%AC%EB%88%85%EC%8A%A4" <- IE and FF, UTF-8

    I don't know why IE's requests take different forms as the encoding
    differs. It does URL-encode if its option is set to send UTF-8 requests,
    but it doesn't if the option is unchecked. But as far as I have
    tested, my wiki.pl showed no difference between a request coming
    from FF and one coming from IE.

    I'll consider what you mentioned with the ";" and "%3b" example and test
    it more.



    > > 2) Concerning Anno's example: it looks good because it calls the convert
    > > routine only once. However, it shows some problems while processing
    > > POST requests, like file uploads, receiving trackbacks, etc. I tried
    > > to debug it but failed to find out why. I think the second best way is
    > > to apply that code with an additional if clause:
    > > if ($q->request_method() eq "GET").

    >
    > What sort of problems? If your guessing routine is guessing incorrectly
    > for some of your real data, this indicates it's not safe to use it
    > anyway.


    I agree, and I tried to find the exact problem and the reason for it.


    I'll describe here what I have found so far:

    First, Anno's code was to change the values of the CGI->Vars hash:

    $q = new CGI;
    # convert
    my $param = $q->Vars;
    $_ = check_and_convert($_) for values %$param;


    The file-uploading and trackback features are not part of the original
    file. I added them myself about two years ago, taking code from
    examples on the web.

    For file uploading, wiki.pl prints the form, which includes:

    print $q->start_form('post', "$ScriptName", 'multipart/form-data') . "\n"
        . "<input type='hidden' name='action' value='upload'>"
        . "<input type='hidden' name='upload' value='1'>" . "\n"
        . $q->filefield("upload_file", "", 60, 80) . "\n"      # <-- file selection field
        . "&nbsp;&nbsp;" . "\n"
        . $q->submit('Upload') . "\n"
        . $q->endform;

    The user is supposed to click the "open" button, choose a file in the
    file selection window, and click the "Upload" button to submit.

    To save the file on the server, the following code is used:

    $file = $q->upload('upload_file');
    open(FILE, ">file_in_local_disk_of_server");
    binmode FILE;
    while (<$file>) {
        print FILE $_;   # read from the client's file and write to the server's disk
    }
    close(FILE);


    I put "die;" for check:

    $file = $q->upload('upload_file');
    die "[$file]"; # here
    open(FILE, ">file_in_local_disk_of_server");

    If I don't convert Vars, the script dies printing "[D:\download\text.txt]".
    But when Vars is converted, the script dies printing "[]". That means
    $file has lost the information that it is a file handle.

    How can I keep it as a valid file handle? Even without converting, I
    found that any write access causes the same problem:

    my $param = $q->Vars;
    $$param{'upload_file'} .= "";    # no other string appended, but it loses the file handle
    or even
    $$param{'upload_file'} = $$param{'upload_file'};    # it also loses the file handle!!! :-O


    So there is nothing that check_and_convert() can do; modifying ->Vars
    itself causes the problem. If I have to choose this approach anyway,
    I can do it like this:

    my $param = $q->Vars;
    foreach (keys %$param) {
        # don't try to assign to $param->{'upload_file'}
        $$param{$_} = guess_and_convert($$param{$_}) if ($_ ne "upload_file");
    }

    But there is no guarantee that all the other parameters are ordinary
    strings.




    > > So I cling to Q_S like this. :) As far as I know: (please correct me
    > > if I am wrong)
    > > 1) Q_S is related to only GET request. (All the forms in wiki.pl calls
    > > "wiki.pl" without any appending URL query when it submits)

    >
    > You may be correct in this case that your wiki.pl only uses a query
    > string for GET requests. It is certainly possible to POST to a URL with
    > a query string.


    Yes, I'll have to consider that in the future. And I still believe it
    doesn't matter, because the query string in a URL is just a string
    anyway, which can't carry any invisible information (like $file above).


    > > 2) Q_S may be in the form of "keywords" or
    > > "param1=value1&param2=value2...". guess_and_convert() will not change
    > > the important characters like "&", "=", "+". It will not change any
    > > other ASCII characters. It will just change the multi-byte chars.
    > > Because those characters have been already encoded by browser, this
    > > change is just the change of the number and the sequence of the "%HH"
    > > runs. There is, I think, no problem when CGI object is created and
    > > initialized using Q_S.

    >
    > Err... OK. You must make sure you alter Q_S *before* any CGI.pm calls
    > are made, though.


    I agree.



    > > 3) Changing Q_S affects only the running script and its child
    > > processes.

    >
    > I don't know what happens under mod_perl, if you ever move your script
    > to that environment. Under standard CGI, this is certainly true.



    That's the type of answer I want! I've never thought about mod_perl or
    anything like it. (Actually, I have no idea what it is.)


    > It seems to me that you are trying to take a piece of rather
    > badly-written code you don't really understand, and alter it to do
    > something that isn't really possible anyway. Given that you're in that
    > much of a mess, a simple edit of $ENV{QUERY_STRING} may well be the best
    > way out :).
    >
    > Ben
    >


    I plan to check and test more things and choose what to do.

    I thank you for your constant help. Have a nice day!

    Raymundo, South Korea.
     
