Wonky HTTP behavior?

Discussion in 'Java' started by Twisted, Nov 13, 2006.

  1. Twisted

    Twisted Guest

    I'm fiddling with possibly making a custom web browser with a few
    extras (such as auto-retrying broken downloads (of pages, images,
    files...), one-click ad blocking, one-click referrer spoofing, referrer
    spoofing when retrieving non-text/foo content-types by default,
    user-agent spoofing, etc.) and a few unwanted things ditched (such as
    flash, ActiveX, VBScript, and some javascript capabilities). Other
    notions include page loading not being interrupted by a slow-to-run
    script (typical with ad banner code) and making offsite include loading
    use a really short-fuse timeout (again due to synchronous ad loads that
    are too slow).

    So far, the bare-bones Mosaic-alike (no javascript at all, basic normal
    behavior with text, html, and images only) is sort-of working. The
    rendering engine needs a load of work, but then, it would. It's
    something I think I can cope with. OTOH, the backend is doing something
    strange as I discovered after a string of 400 Bad Request error page
    retrievals.

    Wireshark (fork of Ethereal) reveals these dumps of the HTTP headers
    from a bog-standard GET request from clicking the first interesting
    link at http://www.movingtofreedom.org (the one that displays the rest
    of today's blog entry there):


    Using Firefox:


    Request Method: GET
    Request URI:
    /2006/11/12/a-round-of-gnu-linux-heading-in-to-the-back-nine-part-2/
    Request Version: HTTP/1.1
    Host: www.movingtofreedom.org\r\n
    User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.8)
    Gecko/20061025 Firefox/1.5.0.8\r\n
    Accept:
    text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\n
    Accept-Language: en-us,en;q=0.5\r\n
    Accept-Encoding: gzip,deflate\r\n
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n
    Keep-Alive: 300\r\n
    Connection: keep-alive\r\n
    Referer: http://www.movingtofreedom.org/\r\n
    Cookie:
    comment_author_email_bbd0d17eb26d1ffea56c7ae59736eeff=nobody%40nowhere.net;
    __utmz=40810430.1158107358.1.1.utmccn=(direct)|utmcsr=(direct)|utmcmd=(none);
    __utma=40810430.2073169434.1158107358.1163248340.1163417340.39;
    comment_autho



    Using the results of tonight's hackathon:

    Request Method: GET
    Request URI:
    /2006/11/12/a-round-of-gnu-linux-heading-in-to-the-back-nine-part-2/
    Request Version: HTTP/1.1
    User-Agent: Mozilla/5.0 (compatible; myuseragent)
    Accept-Language: en-us,en;q=0.5\r\n
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n
    Keep-Alive: 300\r\n
    Referer: http://www.movingtofreedom.org/\r\n
    Host: www.movingtofreedom.org\r\n
    Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2\r\n
    Connection: keep-alive\r\n
    Content-type: application/x-www-form-urlencoded\r\n


    Yuck. WTF is this? The cookie is missing, which is to be expected (I
    haven't used the comment form from this thing, partly because it won't
    even render forms properly yet), but the Accept: headers are all
    jumbled in with the others, the Host: header is near the bottom,
    Connection: keep-alive and Keep-Alive: 300 are separated, the Referer:
    isn't near the bottom where it (apparently) belongs, and there's that
    weird "Content-type:" header, which is coming from Christ alone knows
    where and is probably the cause of the 400 errors, plus God alone
    knows what more subtle problems (a different format of returned page
    sometimes? Getting incorrectly identified as a broken bot?).

    I don't want people using my new browser getting lots of timeouts,
    refused connections, and 403 errors because some webmaster thought it
    was a misbehaving bot instead of a human being and told the world to
    block the user agent! At least it's only a test user-agent; obviously
    when it's done the user-agent will be changed to something a lot less
    lame. Getting mistaken for a bad bot might also land my IP range in a
    block list somewhere, which would put a crimp in surfing, not to
    mention further testing of this thing, as well as inconveniencing up
    to 255 other customers of my ISP at a time as an extra added bonus
    feature.

    The code generating the initial connection is:

    uc = (HttpURLConnection) u.url.openConnection();
    uc.setInstanceFollowRedirects(true); // Auto-follow what redirects we can.
    uc.setRequestProperty("User-Agent", Main.USER_AGENT);
    uc.setRequestProperty("Accept-Language", "en-us,en;q=0.5");
    uc.setRequestProperty("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
    uc.setRequestProperty("Keep-Alive", "300");
    if (u.prevURL != null) {
        uc.setRequestProperty("Referer", u.prevURL);
    }
    if (Main.COOKIE_ENABLE) {
        String k = getCookiesFor(u.url.getHost());
        if (k != null) uc.setRequestProperty("Cookie", k);
    }
    uc.connect();


    (Here, "u" references an object that encapsulates a resource fetch
    request. All of this is in a try block and some other cruft I don't
    think is relevant here. And yes, my main class is named
    "foo.bar.baz.Main"; and yes maybe that's lame; so sue me.)

    As you can see, I'm not supplying "Content-type" anywhere. The headers
    I *am* supplying are all being put, in order, right after the request
    line. The request line, Host, and Accept are automatic, but that's
    fine with me. Same with Connection, though it was failing to send the
    Keep-Alive header until I added it manually.

    So, three questions:
    1. Is the "Content-type" header what's screwing things up and causing
    400 errors or worse?
    2. How do I get rid of it?
    3. Is the header rearrangement (relative to what Firefox outputs) a
    likely source of problems?

    Regarding number 2, I should note that I already tried these:

    uc.setRequestProperty("Content-type", null)
    uc.setRequestProperty("Content-type", "")

    which produced garbled results (as sniffed with Wireshark) and even
    more 400 errors from hosts that didn't give them before.
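    One diagnostic that might help narrow this down: dumping
    getRequestProperties() before connect() shows which headers are queued
    locally, so anything that appears on the wire but not in that dump is
    being injected later by the HttpURLConnection implementation itself.
    A minimal sketch (the example.com URL is just a placeholder; nothing
    is sent over the network before connect()):

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;

public class HeaderDump {
    public static void main(String[] args) throws Exception {
        // openConnection() does not touch the network, so this is safe offline.
        HttpURLConnection uc =
            (HttpURLConnection) new URL("http://www.example.com/").openConnection();
        uc.setRequestProperty("User-Agent", "Mozilla/5.0 (compatible; myuseragent)");
        uc.setRequestProperty("Keep-Alive", "300");

        // Headers queued locally before connect(); compare this against the
        // Wireshark capture to see what the implementation adds on its own.
        for (Map.Entry<String, List<String>> e :
                uc.getRequestProperties().entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```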

    Regarding number 3, given the rampancy of browser discrimination, I
    intend to include user-agent spoofing functionality. If someone spoofs
    Firefox and the headers are in the wrong order, might the spoof be
    exposed? If I can, I want to include header ordering/characteristics
    more generally in "spoof profiles", with at least ones for Firefox and
    Internet Exploder; for now I'm simply trying to masquerade as Firefox
    as accurately as possible (except for, ironically, the user-agent
    header contents themselves) as a proof of concept. Wireshark shows me
    that my efforts are falling way short of the bar there so far.
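    If exact header order ever becomes part of a spoof profile,
    HttpURLConnection gives no guarantee about it; the only sure way I
    know of is to drop down to a raw Socket and write the request line and
    headers yourself. A rough sketch of the idea (buildRequest is a name I
    made up, and the header order below just copies the Firefox dump from
    earlier in this post; response parsing is left out):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;

public class RawGet {

    // Build the request by hand so the header order is exactly what we
    // choose, rather than whatever the URLConnection stack decides.
    static String buildRequest(String host, String path) {
        StringBuilder sb = new StringBuilder();
        sb.append("GET ").append(path).append(" HTTP/1.1\r\n");
        sb.append("Host: ").append(host).append("\r\n");
        sb.append("User-Agent: Mozilla/5.0 (compatible; myuseragent)\r\n");
        sb.append("Accept: text/html,image/gif,image/jpeg,*/*;q=0.2\r\n");
        sb.append("Accept-Language: en-us,en;q=0.5\r\n");
        sb.append("Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n");
        sb.append("Keep-Alive: 300\r\n");
        sb.append("Connection: keep-alive\r\n");
        sb.append("\r\n"); // blank line ends the header block
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String host = "www.example.com";
        String req = buildRequest(host, "/");
        System.out.print(req);
        if (args.length > 0) { // pass any argument to actually send it
            Socket s = new Socket(host, 80);
            OutputStream out = s.getOutputStream();
            out.write(req.getBytes("ISO-8859-1"));
            out.flush();
            // ... read and parse the response from s.getInputStream() ...
            s.close();
        }
    }
}
```

    The obvious cost is reimplementing everything HttpURLConnection
    otherwise handles (chunked transfer, redirects, keep-alive reuse), so
    this only makes sense if header fidelity really matters.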

    Note: I'm aware that there are Firefox extensions. I'm aware that any
    Joe can program one (in theory) and that there are some for spoofing
    and disabling some evil scripts and the like. I'm also aware that none
    of the latter do precisely what I am looking for, and I am *un*aware of
    how to program a Firefox extension. Learning the API and tools would
    probably take longer than it took to make this Java user agent, for
    which 80% of the work is done for me anyway (protocol implementations
    and much html parsing and rendering) by the standard library. Which
    means I'm coding more of a "browser extension" than a "browser" anyway,
    using tools I am already familiar with...and of course there's now a
    sunk investment of time and effort...
     
    Twisted, Nov 13, 2006
    #1

  2. Twisted

    Twisted Guest

    Twisted wrote:
    [snip]

    Upgrading to 1.5.0_09 (from 06) and pointing Eclipse at the 1.5.0_09
    JRE didn't change this behavior.
     
    Twisted, Nov 13, 2006
    #2

  3. Ian Wilson

    Ian Wilson Guest

    Twisted wrote:
    > I'm fiddling with possibly making a custom web browser


    <snip>

    > the backend is doing something
    > strange as I discovered after a string of 400 Bad Request error page
    > retrievals.
    >
    > Wireshark (fork of Ethereal) reveals


    <snip>

    > the Accept: headers are
    > all jumbled in with the others, the Host: header is near the bottom,
    > Connection: keep-alive and Keep-Alive: 300 are separated, the Referer:
    > isn't near the bottom where it (apparently) belongs, and there's that
    > weird "Content-type:" header, which is coming from Christ alone knows
    > where and is probably the cause of the 400 errors


    <snip>


    "The order in which header fields with differing field names are
    received is not significant."

    http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4
     
    Ian Wilson, Nov 13, 2006
    #3
  4. Twisted

    Twisted Guest

    Ian Wilson wrote:
    > "The order in which header fields with differing field names are
    > received is not significant."
    >
    > http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4


    I know *that*. "Not significant" with regard to the server's
    interpretation of the request. It may be more relevant with regard to
    identifying, and perhaps discriminating against, different
    user-agents. If someone surfs using this thing disguised as Internet
    Exploder, say, to a site that tries to enforce IE-only, it would be
    nice if the server had as little to go on as possible in trying to
    guess that it wasn't really IE. Rearranged headers could theoretically
    give the game away (though they'd require the hypothetical
    discriminator to get up to their armpits in the low-level guts of the
    server).

    You're right that it can probably be let slide without incident.

    There's still the bogus "Content-type" header that's sneaking in there
    and appears to be genuinely wreaking havoc, though. And that doesn't
    even involve any elaborate spoofing tricks to resist browser
    discrimination -- just ordinary Web browsing.
     
    Twisted, Nov 13, 2006
    #4
  5. Twisted

    Twisted Guest

    Twisted wrote:
    > Twisted wrote:
    > [snip]
    >
    > Upgrading to 1.5.0_09 (from 06) and pointing Eclipse at the 1.5.0_09
    > JRE didn't change this behavior.


    I found a reference to something like this on the bug blog at sun.com.
    It claimed it was fixed in 1.6.something. Hrm, 1.6.something? A quick
    poke around Java sites finds 1.6.something buried in a beta section,
    apparently because it's still in beta. Upgraded to 1.6.0 and pointed
    Eclipse at the new JRE, also changed the project JRE System Library to
    the 1.6 one, and lo and behold the problem goes away.

    Header order hopefully won't be a problem. (It shouldn't be according
    to the spec, but it may give away that my browser isn't firefox, IE, or
    whatever people sometimes disguise it as...)

    Let's also hope this beta doesn't expire, or blow up in my face due to
    a new and insidious bug or something. Time to browse the docs for any
    new library features. I've wanted to be able to load hashmaps and the
    like from object streams without going through contortions to avoid
    "erased type" cast warnings for a loooong time...
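    (On that last point, the usual workaround is to quarantine the
    unavoidable unchecked cast inside one small @SuppressWarnings helper
    so the warning doesn't spread through the calling code. A sketch;
    readMap is my own name for the helper:)

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

public class MapStream {

    // ObjectInputStream.readObject() returns a raw Object, so the cast to
    // a parameterized type is inherently unchecked; confine the warning here.
    @SuppressWarnings("unchecked")
    static <K, V> HashMap<K, V> readMap(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        return (HashMap<K, V>) in.readObject();
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, Integer> m = new HashMap<String, Integer>();
        m.put("hits", 42);

        // Round-trip through an in-memory stream to demonstrate.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(buf);
        oos.writeObject(m);
        oos.close();

        ObjectInputStream ois =
            new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray()));
        HashMap<String, Integer> copy = readMap(ois);
        System.out.println(copy.get("hits")); // prints 42
    }
}
```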
     
    Twisted, Nov 13, 2006
    #5
  6. RedGrittyBrick

    RedGrittyBrick Guest

    Twisted wrote:
    > The code generating the initial connection is:
    >
    > uc = (HttpURLConnection)u.url.openConnection();
    > uc.setInstanceFollowRedirects(true); // Auto follow what redirects we
    > can.
    > uc.setRequestProperty("User-Agent", Main.USER_AGENT);
    > uc.setRequestProperty("Accept-Language", "en-us,en;q=0.5");
    > uc.setRequestProperty("Accept-Charset",
    > "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
    > uc.setRequestProperty("Keep-Alive", "300");
    > if (u.prevURL != null) {
    > uc.setRequestProperty("Referer", u.prevURL);
    > }
    > if (Main.COOKIE_ENABLE) {
    > String k = getCookiesFor(u.url.getHost());
    > if (k != null) uc.setRequestProperty("Cookie", k);
    > }
    > uc.connect();
    >
    >


    > So, three questions:
    > 1. Is the "Content-type" header what's screwing things up and causing
    > 400 errors or worse?
    > 2. How do I get rid of it?
    > 3. Is the header rearrangement (relative to what Firefox outputs) a
    > likely source of problems?
    >


    I tried your code as below and got
    Response code: 200 'OK'.

    ------------------------------------------------------------------------
    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.MalformedURLException;
    import java.net.URL;

    public class WebGet {

        public static void main(String[] args) throws IOException {

            /*
             * Contrived class so Twisted's code snippet
             * can be used unchanged.
             */
            class Foo {
                URL url;
                String prevURL;
                Foo() {
                    try {
                        url = new URL("http://www.movingtofreedom.org/" +
                                "2006/11/12/" +
                                "a-round-of-gnu-linux-" +
                                "heading-in-to-the-back-nine-part-2/");
                    } catch (MalformedURLException e) {
                        e.printStackTrace();
                    }
                }
            }
            Foo u = new Foo();
            HttpURLConnection uc;

            // Twisted's code starts here
            // ..........................................................
            uc = (HttpURLConnection) u.url.openConnection();
            uc.setInstanceFollowRedirects(true);
            uc.setRequestProperty("User-Agent", Main.USER_AGENT);
            uc.setRequestProperty("Accept-Language", "en-us,en;q=0.5");
            uc.setRequestProperty("Accept-Charset",
                    "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
            uc.setRequestProperty("Keep-Alive", "300");
            if (u.prevURL != null) {
                uc.setRequestProperty("Referer", u.prevURL);
            }
            if (Main.COOKIE_ENABLE) {
                String k = getCookiesFor(u.url.getHost());
                if (k != null) uc.setRequestProperty("Cookie", k);
            }
            uc.connect();
            // ..........................................................
            // Twisted's code ends here

            System.out.println("Response code: " + uc.getResponseCode() +
                    " '" + uc.getResponseMessage() + "'.");
        } // main()

        private static String getCookiesFor(String host) {
            return null;
        }
    } // WebGet

    class Main {
        static String USER_AGENT = "Foo";
        static Boolean COOKIE_ENABLE = false;
    }
    ------------------------------------------------------------------------
     
    RedGrittyBrick, Nov 13, 2006
    #6
  7. Twisted

    Twisted Guest

    RedGrittyBrick wrote:
    [snip]

    Cute, but that particular site was not producing 400s except on the
    occasions when I tried to set the Content-Type header to null or an
    empty string.

    Problem's solved anyway -- upgrading to the 1.6 beta seems to have
    fixed it, which means it was a bona fide bug in Sun's code, not in
    mine.

    I also had it generating 403s when it hit a site with links off
    http://www.foo.com/ like <img src=../images/foo.jpg>. Images that
    worked in Firefox were showing as broken in my thingie, and the 403s
    showed up in a logfile along with the malformed URLs that were causing
    them: http://www.foo.com/../images/foo.jpg. It looks like the
    URL(context, string) constructor is mishandling a corner case here, so
    now I construct the URL, turn it back into a string, manually eat any
    ../s and then any remaining ./s, and then turn it back into a URL
    again.
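    For the curious, that eat-the-dot-segments step can be done on the
    path portion with a simple segment stack, clamping ".." at the root
    the way mainstream browsers do. A sketch of the idea (collapsePath is
    my own name for it; it only handles the path, not query or fragment):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PathFix {

    // Collapse "." and ".." segments in a URL path, treating ".." at the
    // root as a no-op (which is how browsers recover from URLs like
    // http://www.foo.com/../images/foo.jpg).
    static String collapsePath(String path) {
        Deque<String> stack = new ArrayDeque<String>();
        for (String seg : path.split("/")) {
            if (seg.isEmpty() || seg.equals(".")) {
                continue;         // skip empty and "." segments
            } else if (seg.equals("..")) {
                stack.pollLast(); // pop one level; no-op at the root
            } else {
                stack.addLast(seg);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (String seg : stack) sb.append('/').append(seg);
        if (sb.length() == 0 || path.endsWith("/")) sb.append('/');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapsePath("/../images/foo.jpg")); // /images/foo.jpg
        System.out.println(collapsePath("/a/b/../c/./d"));      // /a/c/d
    }
}
```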

    Seems that the java.net.* stuff is in a state of flux for some reason,
    with lots of bugs coming and going. Which is damned odd seeing as how
    java.net.* is old, damn old, dating back to java 1.1 if not 1.0...

    Now I wonder if tomorrow's blog entry at movingtofreedom is going to be
    commenting on his server logs and the strange user agents cropping up
    of late -- "myuseragent" here, and "Foo" there...
     
    Twisted, Nov 13, 2006
    #7
  8. Twisted

    Twisted Guest

    And now of course I have new problems, starting with the thing blowing
    up on exit. I'm getting abends in the VM when the debugger detaches on
    an otherwise normal user-initiated terminate, and the "are you sure"
    dialog pops up twice in a row now. (I have one on window close and one
    on actual process terminate; it looks like they're no longer being
    cascaded.) To top it off, I have an obscure NPE to track down that
    didn't happen before, and one of the imageio methods is throwing
    internally-generated IllegalArgumentExceptions from time to time (my
    calls into imageio look OK, and the exception is coming from more than
    one stack frame beneath my code; it looks like it chokes on some
    network- or disk-originated data, which should really throw an
    IOException flavour instead).

    Ah, the joys of upgrading. Old bugs gone and new ones to track down
    (and probably one or two of my own that weren't provoked by the older
    JVM and library code).

    On the whole it works better though. That wonky header is gone, as are
    the spurious 400 Bad Request errors. The only one of those I've seen
    since was from a genuinely malformed URL due to broken HTML on one web
    page I browsed. Even cookies seem to be working properly now. :) (Of
    course, with forms broken, all I get to test it on is *#&!
    ad.doubleclick.net tracking cookies!)
     
    Twisted, Nov 14, 2006
    #8
