Fixing socket.makefile()

Bryan Olson

Here's the problem: Suppose we use:

import socket
[...]
f = some_socket.makefile()

Then:

f.read() is efficient, but verbose, and incorrect (or at
least does not play well with others);

f.readline() is correct, but verbose and inefficient.

To justify the "verbose" part, just look at the code in the
Python library's socket.py. Below, I'll explain playing well
with others, and then (in)efficiency.

Consider the operations:

f = some_socket.makefile()
ch = f.read(1)
print "The first char is", ch
ch = some_socket.recv(1)
print "The second char is", ch

The code above does *not* (usually) print the first and second
characters from the socket.

The problem is that makefile() returns a Python object that has
its own local buffer. The recv() call reads directly from the
socket, oblivious to any data queued in the file object's
buffer. The problem is not limited to recv(); select(), and
perhaps other calls, will ignore the buffer and look directly at
the socket. Output buffering appears to have a similar problem.
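Here's a minimal sketch of the select() side of the problem. It assumes
a connected pair of sockets; socket.socketpair() (Unix, Python 2.4+) is
just the quickest way to get one:

import select
import socket

a, b = socket.socketpair()     # any connected pair of sockets will do
a.sendall("first line\nsecond line\n")

f = b.makefile()
print f.readline()   # typically pulls *both* lines into f's buffer

# The OS-level socket is now drained, so select() reports nothing
# readable, even though "second line\n" is waiting in f's buffer.
readable, _, _ = select.select([b], [], [], 0)
print "select sees data:", bool(readable)   # usually False
print f.readline()                          # yet this still returns a line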

Now look up socket.makefile().readline(). It gets one byte at a
time. It will get the byte from the Python buffer if the buffer
is non-empty, otherwise it will try to recv() one byte at a
time, directly from the socket. By itself, readline() never
over-reads the socket; if select() and recv() would work
correctly before the readline(), they'll work after. While
correct, reading one byte at a time is painfully slow.

The Python Library Reference is silent on whether the
socket.makefile operations are supposed to interact correctly
with the direct socket operations. If they are supposed to play
well together, then read() is wrong. If they are not, then
readline() is absurdly slow.

Enough of my whining. The good news is that we can have both
efficiency and correctness, and we can fix the bloat at the same
time. Operating systems already do efficient buffering for
sockets. That efficiency varies, but any smart operating system
copies buffers to user-space in large chunks, and answers
recv()'s from the buffers without system calls, when possible.
Python's socket module now supports MSG_PEEK, which enables
Python code to examine a socket's native buffer.
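For example, a short sketch (again assuming a connected pair of sockets;
socket.socketpair() is used here only for brevity):

import socket

a, b = socket.socketpair()
a.sendall("hello\n")

peeked = b.recv(1024, socket.MSG_PEEK)   # look, but leave the data queued
taken = b.recv(1024)                     # a normal recv() still sees it all
print peeked == taken                    # True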

Below my sig, I show code to replace the corresponding member
functions in the class socket._fileobject. The updated version
passes the tests in test_socket.py.

Make sense? Worth doing? I thought I'd talk it up here before
jumping into the devel list.


--
--Bryan



# Replacement member functions for the class socket._fileobject.
# Inside socket.py, sys and MSG_PEEK are already in scope; the two
# imports below are only needed to run these at module level elsewhere.
import sys
from socket import MSG_PEEK

# class _fileobject(object):

def __init__(self, sock, mode='rb', bufsize=-1):
    self._sock = sock
    if bufsize <= 0:
        bufsize = self.default_bufsize
    self.bufsize = bufsize
    self.softspace = False

def read(self, size=-1):
    # Read up to size bytes (everything until EOF if size <= 0),
    # letting the operating system's socket buffer do the buffering.
    if size <= 0:
        size = sys.maxint
    blocks = []
    while size > 0:
        b = self._sock.recv(min(size, self.bufsize))
        size -= len(b)
        if not b:
            break
        blocks.append(b)
    return "".join(blocks)

def readline(self, size=-1):
    # Peek with MSG_PEEK, then recv() exactly up to and including the
    # newline, so nothing beyond the line is consumed from the socket.
    if size < 0:
        size = sys.maxint
    blocks = []
    read_size = min(20, size)
    found = 0
    while size and not found:
        b = self._sock.recv(read_size, MSG_PEEK)
        if not b:
            break
        found = b.find('\n') + 1
        length = found or len(b)
        size -= length
        blocks.append(self._sock.recv(length))
        read_size = min(read_size * 2, size, self.bufsize)
    return "".join(blocks)

def write(self, data):
    self._sock.sendall(str(data))

def writelines(self, lines):
    # This version mimics the current writelines, which calls
    # str() on each line, but comments that we should reject
    # non-string non-buffers. Let's omit the next line.
    lines = [str(s) for s in lines]
    self._sock.sendall(''.join(lines))

def flush(self):
    pass
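
(If anyone wants to try this without editing socket.py, one rough option,
sketched below under the assumption that the functions above are defined
at module level as shown, is to rebind them onto the class at runtime:

import socket

# Experimental only: _fileobject is an internal class, and other methods
# of the original class may expect attributes this __init__ does not set.
socket._fileobject.__init__ = __init__
socket._fileobject.read = read
socket._fileobject.readline = readline
socket._fileobject.write = write
socket._fileobject.writelines = writelines
socket._fileobject.flush = flush

After that, new objects returned by some_socket.makefile() use the
replacement methods.)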
 
Alan Kennedy

[Bryan Olson]
> The problem is that makefile() returns a Python object that has
> its own local buffer. The recv() call reads directly from the
> socket, oblivious to any data queued in the file object's
> buffer. The problem is not limited to recv(); select(), and
> perhaps other calls, will ignore the buffer and look directly at
> the socket. Output buffering appears to have a similar problem.
and

> The Python Library Reference is silent on whether the
> socket.makefile operations are supposed to interact correctly
> with the direct socket operations. If they are supposed to play
> well together, then read() is wrong. If they are not, then
> readline() is absurdly slow.

I'm glad you asked these questions ;-)

I also am interested in the answers, because I'm just coming to the end
of my implementation of cpython 2.3 compatible socket, select and asyncore
modules for jython, i.e. asynchronous socket support, using the new
java.nio APIs in jdk1.4+.

Points to make in relation to jython include

1. The problem you describe doesn't arise very often, I think. Most
users who use makefile() on sockets are going to use the file-based
interface exclusively and not the underlying socket interface.

2. The problem does not exist in jython, because jython implements the
socket.makefile() method by returning wrappers around the java.net.Socket's
InputStream and OutputStream, meaning that calls through either the file
or the socket interface go through the same underlying streams.

3. I am eager to have the behaviour of cpython explicitly defined, since
I am working hard to make my jython implementation 100% cpython
compatible, right down to the exceptions. I want all cpython socket code
to not know that it's running on jython.

4. I'm particularly interested in seeing documentation on how read and
write operations on socket.makefile()s should behave when the socket is
in non-blocking mode: Should it raise an exception? Which exception? The
same exception on every platform?
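
For instance, a quick probe of what cpython does today (just a sketch,
using socket.socketpair() for brevity, and not a claim about what the
behaviour *should* be):

import socket

a, b = socket.socketpair()
b.setblocking(0)
f = b.makefile()
try:
    print f.readline()
except socket.error, e:
    # With nothing pending, the underlying recv() raises socket.error
    # (EAGAIN on Linux), which readline() simply propagates.
    print "socket.error:", e

Whether that exception is the intended, documented behaviour, and whether
it is the same on every platform, is exactly what I'd like to see pinned
down.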

P.S. To those who know I've been working on this for *ages* now: apologies
(Hi Irmen :) My finances have prevented me from spending too much time
working on this voluntary project. However, you may be encouraged to
know that I now have it passing most of the cpython 2.3 test_socket.py
unit tests (including the ones that use select.select). It's only a
matter of a month or two more now .....

Out of interest: Does anyone know if developing asynch-socket support
for jython is the sort of work that might fall under the auspices of the
PSF grant scheme?
 
Donn Cave

Bryan Olson said:
> The problem is that makefile() returns a Python object that has
> its own local buffer. The recv() call reads directly from the
> socket, oblivious to any data queued in the file object's
> buffer. The problem is not limited to recv(); select(), and
> perhaps other calls, will ignore the buffer and look directly at
> the socket. Output buffering appears to have a similar problem.
>
> Now look up socket.makefile().readline(). It gets one byte at a
> time. It will get the byte from the Python buffer if the buffer
> is non-empty, otherwise it will try to recv() one byte at a
> time, directly from the socket. By itself, readline() never
> over-reads the socket; if select() and recv() would work
> correctly before the readline(), they'll work after. While
> correct, reading one byte at a time is painfully slow.

I don't get this. Has socket.py changed this much since 2.2?
The readline I'm looking at says self._sock.recv(self._rbufsize),
so you would only get this behavior if you specified a buffer
size of 1 or less, and read() does the same - so you could do
this to yourself, but it isn't specific to readline.

At any rate, I think it would put this in better perspective
to recall that pipes, terminals and in general any "slow"
device has the same issues, and that they work out the same
in Python as in the original C, with socket file descriptors
in place of socket objects and stdio file pointers in place
of file objects.

It's definitely a problem, and some kind of solution might be
well received, but it needs to be portable (so forget MSG_PEEK
unless you're really confident that it will be supported on
every platform that now supports sockets to some useful degree),
and it would be nice to apply to the problem in general and not
just sockets. I think the root of the problem really is that
select() doesn't look at process buffers in fileobject instances,
and it can't be made to do that because that information isn't
available from the stdio file pointer underneath the fileobject.
So, you need a replacement for fileobject, to start with.
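
To make that concrete, here is a hypothetical sketch of such a
replacement: a file-like wrapper that still buffers in user space, but
exposes its buffer, so application code can check it before trusting
select() on the raw socket. (BufferedSocketFile is an illustrative name,
not anything in the library.)

class BufferedSocketFile:

    def __init__(self, sock, bufsize=8192):
        self._sock = sock
        self._bufsize = bufsize
        self._buf = ""

    def pending(self):
        # Bytes already read from the socket but not yet handed out.
        return len(self._buf)

    def readline(self):
        while 1:
            pos = self._buf.find('\n')
            if pos >= 0:
                line, self._buf = self._buf[:pos + 1], self._buf[pos + 1:]
                return line
            data = self._sock.recv(self._bufsize)
            if not data:
                line, self._buf = self._buf, ""
                return line
            self._buf = self._buf + data

The calling code would then only block in select() when pending()
returns zero.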

Donn Cave, (e-mail address removed)
 
John J. Lee

Alan Kennedy said:
> I also am interested in the answers, because I'm just coming to the end
> of my implementation of cpython 2.3 compatible socket, select and
> asyncore modules for jython, i.e. asynchronous socket support, using
> the new java.nio APIs in jdk1.4+.

Hooray!

[...]
> P.S. To those who know I've been working on this for *ages* now: apologies
> (Hi Irmen :) My finances have prevented me from spending too much
> time working on this voluntary project. However, you may be encouraged
> to know that I now have it passing most of the cpython 2.3
> test_socket.py unit tests (including the ones that use
> select.select). It's only a matter of a month or two more now .....

Having Pyro running on both ends of the Jython / CPython divide will be
very handy.

> Out of interest: Does anyone know if developing asynch-socket support
> for jython is the sort of work that might fall under the auspices of
> the PSF grant scheme?
[...]

I don't see why not. This sort of fundamental-but-undramatic stuff is
really valuable.

Maybe the PSF should look into funding research aimed at cloning
Martin v. Loewis? Or, if he's really a bot, maybe he could be
reimplemented using your new code, for superior scalability?


John
 
Bryan Olson

Donn Cave wrote:
[...]
> I don't get this. Has socket.py changed this much since 2.2?
> The readline I'm looking at says self._sock.recv(self._rbufsize),
> so you would only get this behavior if you specified a buffer
> size of 1 or less, and read() does the same - so you could do
> this to yourself, but not specially just with readline.

Hi Donn; yes, looks like I got confused on that one.

> At any rate, I think it would put this in better perspective
> to recall that pipes, terminals and in general any "slow"
> device has the same issues, and that they work out the same
> in Python as in the original C, with socket file descriptors
> in place of socket objects and stdio file pointers in place
> of file objects.

And it gets worse. I've seen layered handlers with buffers of
buffers of buffers.

> It's definitely a problem, and some kind of solution might be
> well received, but it needs to be portable (so forget MSG_PEEK
> unless you're really confident that it will be supported on
> every platform that now supports sockets to some useful degree),

Hold on ... Google...Google...Google... Well, support for
MSG_PEEK seems to be universal except for a couple reported bugs
and versions of BeOS without BONE (BeOS Network Environment).
I've never used BeOS, but apparently BeOS'ers are used to the
idea that they need BONE to get network stuff working.

Actually testing against the wide range of platforms is beyond
my own capabilities.

> and it would be nice to apply to the problem in general and
> not just sockets.

Agreed, but for now I'd like to call that out-of-scope. I came
upon this particular problem when writing an HTTP/1.1 thingy.
The socket module works well, but I found the higher-level
library classes not-so-useful.

> I think the root of the problem really is that
> select() doesn't look at process buffers in fileobject instances,
> and it can't be made to do that because that information isn't
> available from the stdio file pointer underneath the fileobject.
> So, you need a replacement for fileobject, to start with.

Really we want a general, portable, extensible event handler.
It should be unified with all the asynchronous things, such
as socket/file activity, thread locks and semaphores, and GUI
event loops.
 
