Wonky HTTP behavior?

Twisted

I'm fiddling with possibly making a custom web browser with a few
extras (such as auto-retrying broken downloads of pages, images,
files, and so on; one-click ad blocking; one-click referrer spoofing;
referrer spoofing by default when retrieving non-text/foo
content-types; user-agent spoofing; etc.) and a few unwanted things
ditched (such as Flash, ActiveX, VBScript, and some JavaScript
capabilities). Other notions include page loading not being
interrupted by a slow-to-run script (typical of ad banner code) and
giving offsite include loading a really short-fuse timeout (again due
to synchronous ad loads that are too slow).

So far, the bare-bones Mosaic-alike (no JavaScript at all; basic
normal behavior with text, HTML, and images only) is sort of working.
The rendering engine needs a load of work, but then, it would. That's
something I think I can cope with. OTOH, the backend is doing
something strange, as I discovered after a string of 400 Bad Request
error page retrievals.

Wireshark (the successor to Ethereal) reveals these dumps of the HTTP
headers from a bog-standard GET request, from clicking the first
interesting link at http://www.movingtofreedom.org (the one that
displays the rest of today's blog entry there):


Using Firefox:


Request Method: GET
Request URI: /2006/11/12/a-round-of-gnu-linux-heading-in-to-the-back-nine-part-2/
Request Version: HTTP/1.1
Host: www.movingtofreedom.org\r\n
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.8) Gecko/20061025 Firefox/1.5.0.8\r\n
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\n
Accept-Language: en-us,en;q=0.5\r\n
Accept-Encoding: gzip,deflate\r\n
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n
Keep-Alive: 300\r\n
Connection: keep-alive\r\n
Referer: http://www.movingtofreedom.org/\r\n
Cookie: comment_author_email_bbd0d17eb26d1ffea56c7ae59736eeff=nobody%40nowhere.net; __utmz=40810430.1158107358.1.1.utmccn=(direct)|utmcsr=(direct)|utmcmd=(none); __utma=40810430.2073169434.1158107358.1163248340.1163417340.39; comment_autho



Using the results of tonight's hackathon:

Request Method: GET
Request URI: /2006/11/12/a-round-of-gnu-linux-heading-in-to-the-back-nine-part-2/
Request Version: HTTP/1.1
User-Agent: Mozilla/5.0 (compatible; myuseragent)
Accept-Language: en-us,en;q=0.5\r\n
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n
Keep-Alive: 300\r\n
Referer: http://www.movingtofreedom.org/\r\n
Host: www.movingtofreedom.org\r\n
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2\r\n
Connection: keep-alive\r\n
Content-type: application/x-www-form-urlencoded\r\n


Yuck. WTF is this? The cookie is missing, which is to be expected (I
haven't used the comment form from this thing, partly because it
won't even render forms properly yet), but... the Accept: headers are
all jumbled in with the others, the Host: header is near the bottom,
Connection: keep-alive and Keep-Alive: 300 are separated, the
Referer: isn't near the bottom where it (apparently) belongs, and
there's that weird "Content-type:" header, which is coming from
Christ alone knows where and is probably the cause of the 400 errors,
plus God alone knows what subtler problems (a different format of
returned page sometimes? Getting incorrectly identified as a broken
bot?).

I don't want people using my new browser getting lots of timeouts,
refused connections, and 403 errors because some webmaster thought it
was a misbehaving bot instead of a human being and told the world to
block the user agent! At least it's only a test user-agent; obviously
when it's done, the user-agent will be changed to something a lot
less lame. Getting mistaken for a bad bot might also land my IP range
on a block list somewhere, which would put a crimp in surfing, not to
mention further testing of this thing, as well as inconveniencing up
to 255 other customers of my ISP at a time as an extra added bonus
feature.

The code generating the initial connection is:

uc = (HttpURLConnection) u.url.openConnection();
uc.setInstanceFollowRedirects(true); // Auto-follow what redirects we can.
uc.setRequestProperty("User-Agent", Main.USER_AGENT);
uc.setRequestProperty("Accept-Language", "en-us,en;q=0.5");
uc.setRequestProperty("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
uc.setRequestProperty("Keep-Alive", "300");
if (u.prevURL != null) {
    uc.setRequestProperty("Referer", u.prevURL);
}
if (Main.COOKIE_ENABLE) {
    String k = getCookiesFor(u.url.getHost());
    if (k != null) uc.setRequestProperty("Cookie", k);
}
uc.connect();


(Here, "u" references an object that encapsulates a resource fetch
request. All of this sits inside a try block, along with some other
cruft I don't think is relevant here. And yes, my main class is named
"foo.bar.baz.Main"; and yes, maybe that's lame; so sue me.)

As you can see, I'm not supplying "Content-type" anywhere. The
headers I *am* supplying are all being put, in order, right after the
request line. The request line, Host, and Accept are automatic, but
that's fine with me. Same with Connection, though it was failing to
provide the Keep-Alive header until I added that one manually.
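
(Side note on why a sniffer is needed at all: as far as I know,
URLConnection.getRequestProperties() only reports the headers set
explicitly via setRequestProperty(), not the ones the implementation
adds on its own, so a quick self-check like this sketch -- class name
and values hypothetical -- will never show the stray Content-type:

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;

public class HeaderPeek {
    public static void main(String[] args) throws Exception {
        HttpURLConnection uc = (HttpURLConnection)
                new URL("http://www.movingtofreedom.org/").openConnection();
        uc.setRequestProperty("User-Agent", "Mozilla/5.0 (compatible; myuseragent)");
        uc.setRequestProperty("Keep-Alive", "300");
        // Must be called before connect(); only explicitly-set
        // properties appear here, so anything HttpURLConnection adds
        // behind my back is invisible to the program itself.
        for (Map.Entry<String, List<String>> e
                : uc.getRequestProperties().entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}

Only the wire tells the whole story; hence Wireshark.)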

So, three questions:
1. Is the "Content-type" header what's screwing things up and causing
400 errors or worse?
2. How do I get rid of it?
3. Is the header rearrangement (relative to what Firefox outputs) a
likely source of problems?

Regarding number 2, I should note that I already tried these:

uc.setRequestProperty("Content-type", null)
uc.setRequestProperty("Content-type", "")

which produced garbled results (as sniffed with Wireshark) and even
more 400 errors from hosts that didn't give them before.
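
(For anyone wanting to reproduce this with HttpURLConnection out of
the loop entirely, here's a minimal raw-socket GET sketch -- nothing
here is from my actual code; it just writes the request by hand, so
the wire carries exactly what you typed, and you can add or drop a
Content-type line to see how a server reacts to it in isolation:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;

public class RawGet {
    public static void main(String[] args) throws Exception {
        // Hand-rolled request: no library-added headers at all.
        Socket sock = new Socket("www.movingtofreedom.org", 80);
        Writer w = new OutputStreamWriter(sock.getOutputStream(), "ISO-8859-1");
        w.write("GET / HTTP/1.1\r\n"
                + "Host: www.movingtofreedom.org\r\n"
                + "Connection: close\r\n"
                + "\r\n");
        w.flush();
        BufferedReader r = new BufferedReader(
                new InputStreamReader(sock.getInputStream(), "ISO-8859-1"));
        System.out.println(r.readLine()); // status line, e.g. HTTP/1.1 200 OK
        sock.close();
    }
}

)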

Regarding number 3, given the rampancy of browser discrimination, I
intend to include user-agent spoofing functionality. If someone
spoofs Firefox and the headers are in the wrong order, might the
spoof be exposed? If I can, I want to include header ordering and
other characteristics in "spoof profiles" more generally, with at
least profiles for Firefox and Internet Exploder; for now I'm simply
trying to masquerade as Firefox as accurately as possible (except,
ironically, for the user-agent header contents themselves) as a proof
of concept. Wireshark shows me that my efforts are falling way short
of the bar there so far.
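
To make the "spoof profile" notion concrete, here's a rough sketch of
what I have in mind (the class and method names are hypothetical; the
header values are lifted from the Firefox dump above; and note that
HttpURLConnection itself gives no control over header order, so
actually honoring the ordering would need a lower-level HTTP layer):

import java.util.LinkedHashMap;
import java.util.Map;

// A named, ordered set of headers; LinkedHashMap preserves the
// insertion order, which is the whole point of the profile.
public class SpoofProfile {
    final String name;
    final Map<String, String> headers = new LinkedHashMap<String, String>();

    SpoofProfile(String name) { this.name = name; }

    SpoofProfile header(String field, String value) {
        headers.put(field, value);
        return this;
    }

    // Headers and order as captured from Firefox 1.5.0.8 above.
    static SpoofProfile firefox15() {
        return new SpoofProfile("Firefox 1.5")
            .header("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.8) Gecko/20061025 Firefox/1.5.0.8")
            .header("Accept", "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5")
            .header("Accept-Language", "en-us,en;q=0.5")
            .header("Accept-Encoding", "gzip,deflate")
            .header("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7")
            .header("Keep-Alive", "300")
            .header("Connection", "keep-alive");
    }
}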

Note: I'm aware that there are Firefox extensions. I'm aware that (in
theory) any Joe can program one, and that there are some for spoofing
and for disabling some evil scripts and the like. I'm also aware that
none of the latter do precisely what I am looking for, and I am
*un*aware of how to program a Firefox extension. Learning the API and
tools would probably take longer than it took to make this Java user
agent, for which 80% of the work (protocol implementations and much
of the HTML parsing and rendering) is done for me anyway by the
standard library. Which means I'm coding more of a "browser
extension" than a "browser" anyway, using tools I'm already familiar
with... and of course there's now a sunk investment of time and
effort...
 
Twisted

Twisted wrote:
[snip]

Upgrading to 1.5.0_09 (from 06) and pointing Eclipse at the 1.5.0_09
JRE didn't change this behavior.
 
Ian Wilson

Twisted said:
I'm fiddling with possibly making a custom web browser

the backend is doing something
strange as I discovered after a string of 400 Bad Request error page
retrievals.

Wireshark (the successor to Ethereal) reveals

the Accept: headers are
all jumbled in with the others, the Host: header is near the bottom,
Connection: keep-alive and Keep-Alive: 300 are separated, the Referer:
isn't near the bottom where it (apparently) belongs, and there's that
weird "Content-type:" header, which is coming from Christ alone knows
where and is probably the cause of the 400 errors

<snip>


"The order in which header fields with differing field names are
received is not significant."

http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4
 
Twisted

Ian said:
"The order in which header fields with differing field names are
received is not significant."

http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4

I know *that*. "Not significant" with regard to the server's
interpretation of the request. It's maybe more relevant with regard
to identifying, and perhaps discriminating against, different
user-agents. If someone surfs using this thing disguised as Internet
Exploder, say, to a site that tries to enforce IE-only, it would be
nice if the server had as little to go on as possible in trying to
guess that it wasn't really IE. Rearranged headers could
theoretically give the game away (though they'd require the
hypothetical discriminator to get up to their armpits in the
low-level guts of the server).

You're right that it can probably be let slide without incident.

There's still the bogus "Content-type" header that's sneaking in there
and appears to be genuinely wreaking havoc, though. And that doesn't
even involve any elaborate spoofing tricks to resist browser
discrimination -- just ordinary Web browsing.
 
Twisted

Twisted said:
Twisted wrote:
[snip]

Upgrading to 1.5.0_09 (from 06) and pointing Eclipse at the 1.5.0_09
JRE didn't change this behavior.

I found a reference to something like this in the bug database at
sun.com. It claimed the bug was fixed in 1.6.something. Hrm,
1.6.something? A quick poke around Java sites finds 1.6.0 buried in a
beta section, apparently because it's still in beta. I upgraded to
1.6.0, pointed Eclipse at the new JRE, changed the project's JRE
System Library to the 1.6 one, and lo and behold, the problem goes
away.

Header order hopefully won't be a problem. (It shouldn't be,
according to the spec, but it may give away that my browser isn't
Firefox, IE, or whatever people sometimes disguise it as...)

Let's also hope this beta doesn't expire, or blow up in my face due to
a new and insidious bug or something. Time to browse the docs for any
new library features. I've wanted to be able to load hashmaps and the
like from object streams without going through contortions to avoid
"erased type" cast warnings for a loooong time...
 
RedGrittyBrick

Twisted said:
The code generating the initial connection is:

uc = (HttpURLConnection)u.url.openConnection();
uc.setInstanceFollowRedirects(true); // Auto follow what redirects we
can.
uc.setRequestProperty("User-Agent", Main.USER_AGENT);
uc.setRequestProperty("Accept-Language", "en-us,en;q=0.5");
uc.setRequestProperty("Accept-Charset",
"ISO-8859-1,utf-8;q=0.7,*;q=0.7");
uc.setRequestProperty("Keep-Alive", "300");
if (u.prevURL != null) {
uc.setRequestProperty("Referer", u.prevURL);
}
if (Main.COOKIE_ENABLE) {
String k = getCookiesFor(u.url.getHost());
if (k != null) uc.setRequestProperty("Cookie", k);
}
uc.connect();
So, three questions:
1. Is the "Content-type" header what's screwing things up and causing
400 errors or worse?
2. How do I get rid of it?
3. Is the header rearrangement (relative to what Firefox outputs) a
likely source of problems?

I tried your code as below and got
Response code: 200 'OK'.

------------------------------------------------------------------------
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

public class WebGet {

    public static void main(String[] args) throws IOException {

        /*
         * Contrived class so Twisted's code snippet
         * can be used unchanged.
         */
        class Foo {
            URL url;
            String prevURL;
            Foo() {
                try {
                    url = new URL("http://www.movingtofreedom.org/" +
                            "2006/11/12/" +
                            "a-round-of-gnu-linux-" +
                            "heading-in-to-the-back-nine-part-2/");
                } catch (MalformedURLException e) {
                    e.printStackTrace();
                }
            }
        }
        Foo u = new Foo();
        HttpURLConnection uc;

        // Twisted's code starts here
        // ..........................................................
        uc = (HttpURLConnection) u.url.openConnection();
        uc.setInstanceFollowRedirects(true);
        uc.setRequestProperty("User-Agent", Main.USER_AGENT);
        uc.setRequestProperty("Accept-Language", "en-us,en;q=0.5");
        uc.setRequestProperty("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
        uc.setRequestProperty("Keep-Alive", "300");
        if (u.prevURL != null) {
            uc.setRequestProperty("Referer", u.prevURL);
        }
        if (Main.COOKIE_ENABLE) {
            String k = getCookiesFor(u.url.getHost());
            if (k != null) uc.setRequestProperty("Cookie", k);
        }
        uc.connect();
        // ..........................................................
        // Twisted's code ends here

        System.out.println("Response code: " + uc.getResponseCode() +
                " '" + uc.getResponseMessage() + "'.");
    } // main()

    private static String getCookiesFor(String host) {
        return null;
    }
} // WebGet

class Main {
    static String USER_AGENT = "Foo";
    static boolean COOKIE_ENABLE = false;
}
------------------------------------------------------------------------
 
Twisted

RedGrittyBrick wrote:
[snip]

Cute, but that particular site was not producing 400s except on the
occasions when I tried to set the Content-Type header to null or an
empty string.

Problem's solved anyway -- upgrading to the 1.6 beta seems to have
fixed it, which means it was a bona fide bug in Sun's code, not in
mine.

I also had it generating 403s when it hit a site with links off
http://www.foo.com/ like <img src=../images/foo.jpg>. Images that
worked in Firefox were showing as broken in my thingie, and the 403s
showed up in a logfile along with the malformed URLs that were
causing them: http://www.foo.com/../images/foo.jpg. It looks like the
URL(context, string) constructor is mishandling a corner case here,
so now I construct the URL, turn it back into a string, manually eat
any ../s and then any remaining ./s, and then turn it back into a URL
again.
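
A rough sketch of that workaround (clean() and its regexes are my
illustration of the approach, not the actual browser code; as far as
I can tell, java.net.URI.normalize() is no help here, since under its
RFC 2396 rules a root-level /../ is preserved rather than collapsed):

import java.net.MalformedURLException;
import java.net.URL;

public class UrlCleaner {

    // Resolve a (possibly relative) link against a base URL, then
    // collapse "." and ".." path segments -- including a stray
    // root-level "/../", which is the corner case above. Query
    // strings and anchors are ignored for brevity.
    static URL clean(URL base, String link) throws MalformedURLException {
        URL resolved = new URL(base, link);
        String path = resolved.getPath();
        String prev;
        do {
            prev = path;
            path = path.replaceFirst("^/\\.\\./", "/");      // "/../x"   -> "/x"
            path = path.replaceFirst("/[^/]+/\\.\\./", "/"); // "/a/../x" -> "/x"
            path = path.replace("/./", "/");                 // "/./"     -> "/"
        } while (!path.equals(prev));
        return new URL(resolved.getProtocol(), resolved.getHost(),
                resolved.getPort(), path);
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://www.foo.com/index.html");
        System.out.println(clean(base, "../images/foo.jpg"));
        // prints http://www.foo.com/images/foo.jpg
    }
}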

It seems the java.net.* stuff is in a state of flux for some reason,
with lots of bugs coming and going. Which is damned odd, seeing as
how java.net.* is old, damn old, dating back to Java 1.1 if not
1.0...

Now I wonder if tomorrow's blog entry at movingtofreedom is going to be
commenting on his server logs and the strange user agents cropping up
of late -- "myuseragent" here, and "Foo" there...
 
Twisted

And now of course I have new problems, starting with the thing
blowing up on exit. I'm getting abends in the VM when the debugger
detaches on an otherwise normal user-initiated terminate, and the
"are you sure" prompt pops up twice in a row now. (I have one on
window close and one on actual process terminate; it looks like
they're no longer being cascaded.) To top it off, I have an obscure
NPE to track down that didn't used to happen, and one of the
javax.imageio methods is throwing internally-generated
IllegalArgumentExceptions from time to time (my calls into ImageIO
look OK, and the exception is coming from more than one stack frame
beneath my code; it looks like it chokes on some network- or
disk-originated data, which should really throw an IOException
flavour instead).

Ah, the joys of upgrading. Old bugs gone and new ones to track down
(and probably one or two of my own that didn't get provoked by the
older JVM and library code).

On the whole it works better though. That wonky header is gone, as are
the spurious 400 Bad Request errors. The only one of those I've seen
since was from a genuinely malformed URL due to broken HTML on one web
page I browsed. Even cookies seem to be working properly now. :) (Of
course, with forms broken, all I get to test it on is *#&!
ad.doubleclick.net tracking cookies!)
 
