Bizarre network application problem

Ron Albright

I have a small PC running Fedora Core 3 with kernel 2.6.12-1.1380_FC3. On
it is a Java application that collects data from a serial port and sends
it to a server application. It can do this over a LAN (possibly a
broadband internet connection) or a dialup internet connection. In some
cases it has a fixed IP and in some it uses DHCP. The connection can be
either through a normal socket or an SSL socket.

It works fine under all conditions except using SSL over the LAN when the
network interface is configured using DHCP. It works with a normal socket
over dialup or over the LAN with a fixed IP or DHCP. It works with the SSL
socket over dialup or over the LAN with a fixed IP but not DHCP. The
switch from fixed IP to DHCP can be on the same subnet with the
only change being the ifcfg-eth0 script and bringing the network interface
down and up. Switch it back and it works. I can bring up the network
interface with DHCP and switch the application from normal socket to SSL
without any other changes and it stops working. Switch it back and it
works.

The application has excellent logging and comprehensive exception
handling. Both sides show the connection but the app thread just seems to
freeze immediately after with no program exceptions of any kind.

A different small PC running Red Hat 7.3 and a somewhat older version
of the app works fine even over SSL using DHCP.

I have no idea where to even start looking for this one. Any pointers on
things to check would be appreciated.
 
Pep

muttley said:
run the older app on the new machine.

I think it's your machine.

Not necessarily.

Run the older app on FC3 and see what happens but also run the newer app on
RH7.3 to compare the results.

If the older app runs on FC3 then it points to the newer app being wrong.

Also, if the newer app runs on the RH7.3 then it points to the FC3 o/s being
wrong.

What are the differences between the older and newer apps?

Pep.
 
Chris Uppal

Ron said:
It works fine under all conditions except using SSL over the LAN when the
network interface is configured using DHCP. It works with a normal socket
over dialup or over the LAN with a fixed IP or DHCP. It works with the SSL
socket over dialup or over the LAN with a fixed IP but not DHCP. [...]
I have no idea where to even start looking for this one. Any pointers on
things to check would be appreciated.

After trying "muttley"s suggestion (and assuming that it didn't show up the
problem), I'd be inclined to focus very carefully on what was happening with
DHCP. First ensure that the settings supplied by DHCP are /exactly/ the same
as the static configuration (or it may be easier to change the static
configuration to be /exactly/ what DHCP would supply). Don't forget to
double-check things like the DNS and gateway info that may be supplied by DHCP.
I seem to remember that even the hostname can be supplied by DHCP, if so check
that too -- check /everything/. Does the problem still manifest ? Then fire
up Ethereal and look /very/ hard at the network behaviour, taking particular
note of the IP headers and the like (i.e. don't only look at the payload). For
comparison do the same thing with SSL turned off. Obviously there will be some
differences caused by the need to set up and use encryption, but a lot of the
traffic /should/ be identical (DNS lookups and the like). Where is the
difference ?
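
One way to make the DHCP versus static comparison concrete is to dump what the JVM itself sees under each configuration and diff the two outputs. The following is only a minimal sketch and was not used in this thread; the class name NetworkSnapshot and its method are hypothetical. It covers interface addresses and the local host name only, so the DNS servers and gateway would still need to be checked by other means (for example /etc/resolv.conf and netstat -r).

```java
import java.net.InetAddress;
import java.net.InterfaceAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

/** Hypothetical helper: prints network settings as the JVM sees them, for diffing. */
class NetworkSnapshot {

    /** Returns one sorted line per interface address: name, address, prefix length. */
    static List<String> describeInterfaces() throws SocketException {
        List<String> lines = new ArrayList<>();
        Enumeration<NetworkInterface> interfaces = NetworkInterface.getNetworkInterfaces();
        if (interfaces == null) {
            return lines; // the JVM reported no interfaces at all
        }
        for (NetworkInterface nif : Collections.list(interfaces)) {
            for (InterfaceAddress addr : nif.getInterfaceAddresses()) {
                lines.add(nif.getName() + " " + addr.getAddress().getHostAddress()
                        + "/" + addr.getNetworkPrefixLength());
            }
        }
        Collections.sort(lines); // stable order so two runs can be compared line by line
        return lines;
    }

    public static void main(String[] args) throws Exception {
        for (String line : describeInterfaces()) {
            System.out.println(line);
        }
        // The local host name lookup may consult DNS, so it can be slow on a broken setup.
        try {
            InetAddress local = InetAddress.getLocalHost();
            System.out.println("hostname " + local.getHostName() + " -> " + local.getHostAddress());
        } catch (UnknownHostException e) {
            System.out.println("hostname lookup failed: " + e.getMessage());
        }
    }
}
```

Running it once under DHCP and once under the static configuration, then comparing the two outputs, shows any difference in what the application's runtime is given.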

-- chris
 
Ron Albright

After trying "muttley"s suggestion (and assuming that it didn't show up the
problem), I'd be inclined to focus very carefully on what was happening with
DHCP.

I ran the new app on the old hardware/OS and still had the problem.
Running the old app on the new hardware/OS would take some significant
work. To Pep: The difference is the older version uses RMI while the newer
uses a straight SSL socket. But they both use identical code, keys and
passphrases to set up the socket factories.

First ensure that the settings supplied by DHCP are /exactly/ the same
as the static configuration (or it may be easier to change the static
configuration to be /exactly/ what DHCP would supply). Don't forget to
double-check things like the DNS and gateway info that may be supplied by DHCP.
I seem to remember that even the hostname can be supplied by DHCP, if so check
that too -- check /everything/. Does the problem still manifest ?

I've checked everything under ifconfig and netstat -r and they are
identical. What's confusing is under DHCP even when the app connection is
hanging, all other network functions seem to work including ssh to the
server. The hostname wasn't set and DHCP was supplying one but I set the
hostname so it was the same under DHCP and fixed IP and it made no
difference.

I found that when I boot with DHCP, kill the app (the app starts on boot),
take the network interface down, set a fixed IP, and bring the network
interface back up, the app still won't talk until I do a reboot with the
fixed IP configuration.

Then fire
up Ethereal and look /very/ hard at the network behaviour, taking particular
note of the IP headers and the like (i.e. don't only look at the payload). For
comparison do the same thing with SSL turned off. Obviously there will be some
differences caused by the need to set up and use encryption, but a lot of the
traffic /should/ be identical (DNS lookups and the like). Where is the
difference ?

I used snoop and did this four ways: DHCP with SSL (no talk), DHCP without
SSL (talks), fixed IP with SSL (talks), and the above scenario prior to the
reboot (no talk). The only difference I see is 16 DNS-related packets
when DHCP is used, but this is the case both with and without SSL. Also,
all communications in the app use IPs, not domain names. In both cases where
it didn't talk there were 16 packets exchanged between the app box and
the server. I compared these 16 with the first 16 packets exchanged when
it did talk with SSL and a fixed IP. It appears the 16 packet headers are
identical in all cases.

The DNS packets seem to be related to the hostname assigned by the DHCP
server. I'm using a USR8200 for a router and DHCP. I'm also using its
brain dead DNS as a local server. I haven't set up real DNS for the domain
configured in the router. Maybe it has something to do with the SSL layer
failing the hostname and domain assigned by the USR8200. I'm going to try
connecting the app box directly to a brain dead cheap router so only a
minimum of information is picked up by DHCP. It's the only thing I can
think of right now.
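
A quick way to test the DNS suspicion from Java is to time a reverse lookup of the server's IP on the affected box. This is a hedged sketch only: whether the JVM's SSL code or the app's logging actually triggers such a lookup is not established in this thread, and the class name ReverseLookupTimer and the default address are placeholders. If the lookup takes many seconds under DHCP but returns at once with the fixed IP, that would at least support the theory.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical diagnostic: measures how long a reverse DNS lookup takes. */
class ReverseLookupTimer {

    /** Result of one reverse lookup: the name returned and how long it took. */
    static final class Result {
        final String name;
        final long millis;

        Result(String name, long millis) {
            this.name = name;
            this.millis = millis;
        }
    }

    static Result timeReverseLookup(String ipLiteral) throws UnknownHostException {
        // With an IP literal, getByName only parses the address; no DNS query is made yet.
        InetAddress address = InetAddress.getByName(ipLiteral);
        long start = System.nanoTime();
        // This call performs the reverse lookup; if it fails it returns the IP as text.
        String name = address.getCanonicalHostName();
        long millis = (System.nanoTime() - start) / 1_000_000;
        return new Result(name, millis);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder default; pass the server's IP as the first argument.
        String ip = args.length > 0 ? args[0] : "127.0.0.1";
        Result r = timeReverseLookup(ip);
        System.out.println(ip + " -> " + r.name + " in " + r.millis + " ms");
    }
}
```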

I appreciate the pointers so far and any more ideas will be greatly
appreciated because I'm still completely stumped.
 
Chris Uppal

Ron said:
I found that when I boot with DHCP, kill the app (the app starts on boot),
take the network interface down, set a fixed IP, and bring the network
interface back up, the app still won't talk until I do a reboot with the
fixed IP configuration.

I admit that I'm clutching at straws here, but that sounds a bit suspicious.
What happens if you remove the app from the startup sequence entirely, and only
run it once you are /sure/ that DHCP, /dev/random, etc, are all fully set up ?

I should have asked before. When you say the application freezes, what do you
mean ?

More specifically: Is it hanging in a send, or in a receive, or (even)
somewhere else ? What can you see when you run it under a debugger ? When you
sniff the network, where did the last packet get sent, was it from the app or
to it, was it to/from the server, or somewhere else ? Is the answer to that
question consistent with the answer to the first one (it might be a deadlock
between app and server caused by buffering, so that both ends "think" it's the
other's turn to speak next) ?

If that doesn't suggest anything, and your DNS investigations don't turn up a
hint, then I'm afraid I've run out of ideas.

-- chris
 
Ron Albright

Man, I hope you're in Europe somewhere. The thought of getting up before
5:30 AM (central time USA) and actually having complex thought processes
working is a scary one for me. Unless you're still on the night before.

Hooking it up to a simpleton Linksys firewall router worked. I'm guessing
the incomplete DNS setup was causing some sort of security problem with
the SSL at some level. What really bothers me about this is that it was
locking up rather than failing clean. Unfortunately since it's working I
doubt I'll have the time to follow up on finding where the bug is. I
included some more info below for thread completeness for anyone
interested.
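
On the point about locking up rather than failing clean: a blocking socket read waits forever by default, and setting a read timeout turns a silent peer into a SocketTimeoutException that the existing exception handling could log. The sketch below is an illustration under assumptions, not the app's code; the class and method names are hypothetical, and it uses a plain socket against a local test server that never writes. It would not explain the hang, and it would not help if the whole process is stuck somewhere other than the socket read.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical demo: a read timeout makes a stalled connection fail visibly. */
class ReadTimeoutSketch {

    /** Connects with a connect timeout and arms a read timeout on the socket. */
    static Socket connectWithTimeouts(InetSocketAddress target, int timeoutMillis)
            throws IOException {
        Socket socket = new Socket();
        socket.connect(target, timeoutMillis); // a timeout of 0 would mean wait forever
        socket.setSoTimeout(timeoutMillis);    // blocking reads now throw SocketTimeoutException
        return socket;
    }

    /** A local server accepts but never writes; returns true if the client read timed out. */
    static boolean readTimesOutAgainstSilentServer(int timeoutMillis) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = connectWithTimeouts(
                     new InetSocketAddress(server.getInetAddress(), server.getLocalPort()),
                     timeoutMillis);
             Socket accepted = server.accept()) {
            InputStream in = client.getInputStream();
            try {
                in.read(); // without setSoTimeout this would block indefinitely
                return false;
            } catch (SocketTimeoutException expected) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("read timed out: " + readTimesOutAgainstSilentServer(500));
    }
}
```

For the SSL case, one option is to connect a plain socket this way and then layer SSL over it with SSLSocketFactory.createSocket(Socket, String, int, boolean), so the timeout also covers reads during the handshake.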

Thanks to everyone for the ideas.

What happens if you remove the app from the startup sequence entirely, and only
run it once you are /sure/ that DHCP, /dev/random, etc, are all fully set up ?

It's being started in rc.local so it should be the last thing started, but
even if that wasn't the case I had tried shutting down and restarting the
app.

I should have asked before. When you say the application freezes, what do you
mean ?

More specifically: Is it hanging in a send, or in a receive, or (even)
somewhere else ?
What can you see when you run it under a debugger ?

Unfortunately my development environment is all Windows here. The
deployment is on fully automated headless small form factor Linux PCs.
This problem doesn't manifest itself in the development environment. I
have no debugging tools on the deployment computers. From the logging I
can tell it is getting past the Socket.connect() and the
ServerSocket.accept() but not past reading or writing anything from or
to the connection. In other words it appears threads on both sides go to
read and write and never return. After the connection the app writer
thread is supposed to send stuff to the server which the server then
acknowledges. The stuff is never being sent. The confusing part
is there are several log messages that should be spit out after the
connect but prior to any actual writes on the socket. These are not
happening. The reader thread is doing a read on the socket immediately
after the connection is established. So the best guess is the read call on
the socket is locking not only that thread but the entire process.

When you
sniff the network, where did the last packet get sent, was it from the app or
to it, was it to/from the server, or somewhere else ?

The last packet was from the app to the server. That was one of the things
I focused on in the packets. Maybe one or more of the protocol setup
packets was being directed to a wrong address but the MAC addresses were
correct in every packet. In successful cases the 17th packet was also from
the app to the server. So it would appear the server was waiting on
something from the app that was never being sent.

Is the answer to that
question consistent with the answer to the first one (it might be a deadlock
between app and server caused by buffering, so that both ends "think" it's the
other's turn to speak next) ?

I'm not sure what level of buffering you're referring to but I would think
it would have to be at the OS level since the entire process appears to be
locking at the first operation (a read) on the socket.
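
Without a debugger on the deployment boxes, one way to check whether every thread really is stuck at the socket read is to have the JVM log its own thread stacks. This is a minimal sketch assuming a reasonably modern JVM; ThreadDumpSketch and its methods are hypothetical names, not part of the app. It relies on Thread.getAllStackTraces(), which returns the current stack of every live thread.

```java
import java.util.Map;

/** Hypothetical helper: logs where every thread in this JVM currently is. */
class ThreadDumpSketch {

    /** Formats the name, state and current stack of every live thread. */
    static String dumpAllThreads() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append("\" state=")
               .append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    /** Starts a daemon thread that writes a dump to stderr every intervalMillis. */
    static Thread startWatchdog(long intervalMillis) {
        Thread watchdog = new Thread(() -> {
            try {
                while (true) {
                    Thread.sleep(intervalMillis);
                    System.err.println(dumpAllThreads());
                }
            } catch (InterruptedException stop) {
                // Interrupted: stop dumping and let the thread end.
            }
        }, "thread-dump-watchdog");
        watchdog.setDaemon(true); // must not keep the JVM alive on its own
        watchdog.start();
        return watchdog;
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

If the watchdog keeps printing while the reader and writer threads sit in the same frames, the process as a whole is not frozen and the dump shows which call each thread is blocked in. If the watchdog also goes quiet, that points below the Java level. On a HotSpot JVM on Linux, sending SIGQUIT to the process (kill -QUIT with the JVM's process ID) should also print a thread dump to the JVM's standard output without any code change.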
 
Chris Uppal

Ron said:
Man, I hope you're in Europe somewhere. The thought of getting up before
5:30 AM (central time USA) and actually having complex thought processes
working is a scary one for me. Unless you're still on the night before.

<chuckle/>

Not to worry, I'm in the UK.

What really bothers me about this is that it was
locking up rather than failing clean. Unfortunately since it's working I
doubt I'll have the time to follow up on finding where the bug is. I
included some more info below for thread completeness for anyone
interested.

A shame not to nail it, but such are the pressures of commercial life...

-- chris
 
