Writing a Server in Python

So with this new job, my boss occasionally gives me little “homework” assignments. I don’t work full-time, so I enjoy doing it and it helps me better understand what it is we’re doing.

So anyway, my latest task is to write a server in Python. After a bit of learning about sockets and threads I seem to have created an echo server that can connect and do stuff. That’s all well and good.

But from his description, it sounds like he wants me to send objects. He mentioned that I need to use the pickle library for this. Still no big deal. As I understand it, pickle converts objects into byte streams and vice versa. Doesn't seem like a problem thus far.
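
Just to check my understanding, a round trip would look something like this (the job itself is made up):

import pickle

job = {'command': 'resize', 'input': 'chunk-001'}   # made-up job

payload = pickle.dumps(job)        # object -> byte string, ready for send()
restored = pickle.loads(payload)   # byte string -> an equal object
assert restored == job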

Here's the application: we're using a computing cloud to do some computation. To control the behavior of all of the machines, I guess we're going to write a server that can be filled with jobs; the client machines will take these tasks, execute them, and report back the answer, or data, or whatever.

But I guess I don’t completely understand the need for passing objects. Wouldn’t a simple string with a command be good enough? Say I make a list of commands in the server app, and every time a client connects, it takes a command and runs it locally. On the server side, it can remove the task from the list.

What am I missing here?

One thing I'm thinking is that objects might be helpful for feedback. I could create an object that has something like a command ID, which would allow feedback to be tied to a particular command. Anyone have any other ideas of what I need to be anticipating here? I don't exactly have a problem that warrants it, I don't think, so I guess I'm asking: what am I missing? Why is he telling me to do it that way?
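
For concreteness, here's roughly the shape of object I have in mind (the names are pure guesses on my part):

class Job(object):
    def __init__(self, job_id, command):
        self.job_id = job_id      # lets feedback refer to this specific command
        self.command = command

    def make_report(self, status):
        # whatever the client sends back carries the same ID, so the
        # server knows which command the feedback belongs to
        return {'job_id': self.job_id, 'status': status}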

I think the need for passing objects strongly depends on your application. If you’re just executing commands locally on the client, then just passing strings should be sufficient. However, if you’re building up complex data objects, you don’t want to have to send all the information needed to rebuild them locally.

Actually, just writing this post has helped me think of something to be honest. Here’s what I’m thinking.

We're using Amazon EC2 for our clients and S3 as storage, so I can already break a task down into two components: the command and the address of the data. Maybe also the address of the sender, so a confirmation of completion can be sent back?
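
In code, I'm picturing something like this (field names and values are made up):

from collections import namedtuple

# what to run, where its input lives on S3, and where to send the
# completion notice
Task = namedtuple('Task', ['command', 'data_url', 'reply_addr'])

task = Task(command='analyze',
            data_url='s3://some-bucket/input/chunk-001',
            reply_addr=('10.0.0.5', 21567))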

One reason to not use strings is to cut down on network volume. That is, a double value is 8 bytes; a string representation of a double value might be much larger (especially if one uses non-scientific notation).
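
A quick illustration using the standard struct module:

import struct

value = 123456.789012345

packed = struct.pack('d', value)   # a double is always exactly 8 bytes
text = repr(value)                 # its decimal string is about twice that

print len(packed), len(text)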

Another reason might be simplicity. It's much easier (and less error-prone) to have a single-object write/read, relying on the language's mechanisms, than to convert data to string form and then parse it back in.

As to other things to consider, many of which are likely already on your mind: the most important is failure handling. That is, what happens if and when a client chokes? You don't want to lose the submitted job; rather, you want it resubmitted, so you can't just "remove it from the task list". Some others: submitting jobs by identifiable batches (allowing you to subdivide the entire job set into more manageable chunks), security (probably not really a worry in this case), some sort of probing/monitoring capability (e.g., status/progress), synchronization (required by distributed tasks, compounded if multi-threading), and data structure efficiency.
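
To make the failure-handling point concrete, here's a sketch of the sort of bookkeeping I mean (all names hypothetical):

import time

PENDING, OUT, DONE = range(3)

class JobTracker(object):
    """A job isn't dropped when handed out, only marked OUT;
    if the client chokes, the job goes back to PENDING."""
    def __init__(self, jobs, timeout=300):
        self.state = dict((job, PENDING) for job in jobs)
        self.sent_at = {}
        self.timeout = timeout

    def check_out(self, job):
        self.state[job] = OUT
        self.sent_at[job] = time.time()

    def complete(self, job):
        self.state[job] = DONE

    def resubmit_stale(self):
        # any job that's been out too long is assumed lost and requeued
        now = time.time()
        for job, status in self.state.items():
            if status == OUT and now - self.sent_at[job] > self.timeout:
                self.state[job] = PENDING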

I just finished writing such a beast, albeit in Java. Feel free to PM me if you’d like and we can exchange thoughts.

This shouldn’t be necessary, as the server will likely be waiting for the results of any and all active clients. Unless either: (1) the system is essentially stateless, and is explicitly designed to decouple the send/receive or (2) the application is such that there are no results (but that’d be kinda weird).

I don’t think you’d be very concerned with (1), because I’d expect EC2 to operate like a cluster (i.e., a uniform system view) and also to be pretty reliable. It’d be a great feature if you’re relying on arbitrary computers on a potentially unreliable network (e.g., SETI@home or the World Community Grid), though.

BTW – what’s an “S3”?

S3 is the persistent storage service that Amazon offers. EC2 instances don't have persistent storage of their own.

So I started thinking a bit more and here’s what I’m thinking of doing now.

write an object that has a method that does the following (rough sketch below):
1. download the data
2. do some stuff to it
3. upload the results
4. send some sort of confirmation signal to the server

the server will hold on to each "work order" until two flags are true (isSent and isCompleted)
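
Something like this sketch (the method bodies are just placeholders for the real S3/computation code):

class WorkOrder(object):
    def __init__(self, order_id, data_url):
        self.order_id = order_id
        self.data_url = data_url
        self.is_sent = False        # set by the server when the order goes out
        self.is_completed = False   # set by the server on confirmation

    def download(self):
        return 'raw data from %s' % self.data_url   # stand-in for the S3 fetch

    def process(self, data):
        return data.upper()                         # stand-in computation

    def upload(self, result):
        pass                                        # stand-in for the S3 put

    def run(self):
        result = self.process(self.download())
        self.upload(result)
        return self.order_id   # sent back to the server as the confirmation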

any thoughts?
ETA: Better to have the server cache all work orders and simply write them to a log when they are completed, instead of throwing them out completely.

Here’s a thought – I can’t tell if it’s exactly what you’re thinking or not, but it’s how I’m interpreting the above. The “object” is the client; once started (presumably with the IP address of the server as an argument), a client initiates contact with the server and then performs the above steps. Granted, you have to start each client separately, but that’s easily scriptable. And, of course, you need to synchronize the job pick-up (a simple mutex would do; initially, there’d be a sizable waiting queue, but, over time, I’d expect it would resolve to a pretty even distribution of job completion/pick-up).

The best part is, this decentralizes/reduces the server’s control, making the server more of a data repository than an active distributor. It resolves a bunch of bookkeeping issues (e.g., keeping track of what clients are available, how to assign jobs to clients, etc.). Nice.

Yes. Of course, you’ll probably also want to keep track of which client picks up a job and when. That way, you can identify if any jobs aren’t being completed. You can also pretty easily put some statistics gathering structures in the server (e.g., jobs currently out, who has them, who’s completed how many jobs, time taken for each job, etc.).

After writing to the log, you’ll probably want to remove the jobs to free up their memory. A comprehensive log would allow you to re-create the job from scratch if necessary (e.g., for debugging purposes), and disk space is relatively cheap.
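
In helper form, something like this (hypothetical names, following the work-order sketch above):

import pickle

def retire_job(job, active_jobs, logfile):
    pickle.dump(job, logfile)        # full record, so the job can be re-created
    logfile.flush()                  # e.g., for debugging later
    del active_jobs[job.order_id]    # the log is now the only copy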

Yes, I was thinking of each individual box as something that could just sit around and ask for “clients” and then ask for more when they’re done. But the server would need to know if it ever is completed, so I would need to send some kind of response back.

Btw, why would I need to deal with mutexes? Do I need to have it multithreaded? If the server is simply dealing out objects one at a time, then who cares? Or are you talking about the problem of adding to the queue and removing at the same time? I can see the need for two threads of operation there.

So, I’ll do a little terminology refinement here, just to make sure we’re on the same page. Both “server” and “client” refer to single, independent processes (likely one per “host”, although the hosts in your case will be virtual machines, IIRC). What’s being asked for is a “job”; I’ll assume one job per client.

No matter what the system structure, you start up the server, which has a set of jobs that need doing. From my last post, I was thinking that the server sits there waiting for clients to connect (much like a web server), but does no client allocation of its own accord. Instead, you start up a bunch of clients, each of which initiates a connection to the server, at which point a job is allocated (preview: this is where the mutex comes in). The server does some minor bookkeeping (e.g., marks the “isSent” flag) and then sits and waits.

Each client does its thing; once its job is completed, the client then returns its result to the server. The server then does some more minor bookkeeping (e.g., marks the isCompleted flag) and likely stores the result someplace (possibly in a log, possibly in memory for later organization, such as putting the jobs back in order).

Note that there are various options for connection management here: sockets can be maintained over the life of the application (you improve efficiency by not having to construct and destroy them, although then you need to handle broken sockets), they can be opened/closed at the start/end of a single job, they can be opened/closed at each point of contact, etc. Your choice.

Alternatively, you could set it up to work the way I did mine, which is that the server is in control, allocating jobs to clients pro-actively. That means that the server has to do a lot more internal bookkeeping, including already knowing what clients are available, when they’re active, and what job they’re doing. It makes sense for me to do it this way, given my requirements; seems to me you can avoid some of the work by offloading control to the clients.

The mutex is simply so that any single job only gets allocated once. Since you’ll have many clients making requests, likely overlapping, you need some way to do that (unless you don’t mind having individual jobs potentially done by multiple clients). A similar thing might apply to receiving the results; for instance, you might have to lock a file for output so that results don’t get mish-mashed together.
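
In Python, that can be as simple as this toy version:

import threading

jobs = ['job-%d' % i for i in range(100)]   # toy job list
jobs_lock = threading.Lock()

def next_job():
    # only one thread can hold the lock at a time, so two overlapping
    # requests can never pop the same job
    with jobs_lock:
        if jobs:
            return jobs.pop()
        return None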

As far as multi-threading goes, I don’t see a need for it in the clients; but I think you’d suffer a substantial performance hit if you didn’t multi-thread the server. Finally, from what you’ve described, I don’t really think there’s a need for simultaneous addition/removal to a queue; rather, I think you’ll have your set of jobs to allocate at startup (so it’ll always be removal, could be a queue, list, stack, whatever), and the results will be gathered as they’re available (always addition). I’m not seeing the need for a producer/consumer model here.

Was that understandable?

Thanks for that, DS, I'll have to dig into this a bit deeper. I just spoke to my boss, and here are the key points he told me. First, he told me not to focus too hard on the object that gets sent. Well, that's actually done already, as far as my testing purposes go. I've got an object now that, once properly initialized, will pull a file from the server, do something to it, and put it in the results folder.

So the test object is ostensibly done. Now when I asked if I need to use multiple threads, he said no. What he did say to look into was the “select” command for sockets.

Quick question about sockets since I've gotten your attention here. The following is some sample socket code I've been using; I'll interrupt it at a few points to ask specific questions, if you wouldn't mind answering. Note that the indentation was kind of messed up as I found it; it didn't run without some cleanup on my part.



from socket import *

HOST = 'localhost'
PORT = 21567
BUFSIZ = 1024         # receive buffer size, in bytes
ADDR = (HOST, PORT)

serversock = socket(AF_INET, SOCK_STREAM)   # a TCP socket
serversock.bind(ADDR)


Binding to 'localhost'? That makes me feel like it could only communicate internally. Where do I get the address of the other machine?



serversock.listen(2)   # allow up to 2 queued (not-yet-accepted) connections


Okay, this is the biggest mystery here for me. I read that this is something called a backlog. Apparently this can be a value between 1 and 5 (on most systems). Hypothetically, in a single-threaded server, what would happen if, say, 4 people decided to connect at once? Would this simply keep a queue of at most five connections?

I'm just guessing here, but does this continue to function while I'm in the loop below? So if I fill up my queue here, then after one client disconnects, can another be added to the list? That suggests to me that it's setting the state of the socket, and isn't something that needs to be called again to enable more connections. Is that right? Does this simply define the way the socket behaves in the loop below?



while 1:
    print 'waiting for connection...'
    clientsock, addr = serversock.accept()


I was a bit surprised to see that this doesn't spin forever; for some reason the loop stops at serversock.accept().


        

    print '...connected from:', addr

    while 1:
        data = clientsock.recv(BUFSIZ)


Could you explain why there needs to be a BUFSIZ argument there? I realize it was set to 1024; I'm just curious why it's needed.


        

        if not data:
            break
        clientsock.send('echoed: ' + data)

    clientsock.close()

serversock.close()


If I am right about the socket.listen() command, then couldn't one open several sockets? If that were the case, then I can understand why he'd want to use the select command.
Anyway, thanks again for your help. Python's weird, as I'm used to C++, but I do really like it. I feel bad about coming in here with programming-related questions, but the truth is that this is one of the best places! Besides, I figure it's better to have this conversation where someone might be able to find it later. Who knows? Maybe someone will find it useful.

First, you’re welcome. Next, it sounds like a good piece of the work is already done, which is good.

Ah, OK. Before going forward, I have to add a disclaimer that I’m not a Python programmer. The fundamentals don’t change, but I might not have ready answers to implementation questions.

Perhaps looking at this echo server would be helpful. There should be little difference between returning a string and an object.

Since the client is initiating the contact, the server doesn't need the address of the other machine. My guess is that 'localhost' is simply for localized testing; it can be changed easily when put on EC2. You could, however, have the client pass its address after contact, if you didn't want to dig it out of the connection information (in C, IIRC, you'd find the info in the sockaddr structure; in Python, I'm not sure where it is, but I'm sure it's there).
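
A minimal illustration (I'm guessing at the Python specifics, per my disclaimer above):

from socket import socket, AF_INET, SOCK_STREAM

PORT = 21567
serversock = socket(AF_INET, SOCK_STREAM)

# the empty string means "all interfaces", so remote clients can reach this
# once it's on EC2; 'localhost' accepts local connections only
serversock.bind(('', PORT))
serversock.listen(5)

# accept() hands back the peer's address, so the server never needs to know
# client addresses in advance
clientsock, (client_host, client_port) = serversock.accept()
print 'connected from', client_host, client_port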

Yes. This page (I linked to the non-blocking select section, but it covers listen and backlog in section 2) gives a good synopsis. Read that and then ask about anything that’s not clear.

There is an infinite loop; it’s the while 1:.

You need someplace to store the incoming data; this sets aside 1K of memory in which to put it.

The listen indicates a server socket, which waits for a connection, handles it, then repeats (effectively acting as several sockets, even though it’s actually only one that operates sequentially). If lots of data is expected or there will be extended interaction between client/server, one would use multi-threading here (or spawn subprocesses). In your case, it sounds like there will be a connection and a single object (or a short stream of data; it’s not clear to me) will be sent. Since it’s such a small amount of data, multi-threading isn’t necessary.
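
For reference, a single-threaded select-based version of the echo loop might look roughly like this (again hedging, since I'm not a Python guy):

import select
from socket import socket, AF_INET, SOCK_STREAM

PORT = 21567
BUFSIZ = 1024

serversock = socket(AF_INET, SOCK_STREAM)
serversock.bind(('', PORT))
serversock.listen(5)

watched = [serversock]
while 1:
    # select() blocks until at least one watched socket is readable,
    # so a single thread can service many clients
    readable, _, _ = select.select(watched, [], [])
    for sock in readable:
        if sock is serversock:
            # the listening socket is "readable" when a client is connecting
            clientsock, addr = serversock.accept()
            watched.append(clientsock)
        else:
            data = sock.recv(BUFSIZ)
            if data:
                sock.send('echoed: ' + data)
            else:
                watched.remove(sock)   # an empty recv means the client hung up
                sock.close()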

Although there’s no assurance that anyone will be able to answer specific questions, I like posting technical questions here, as I have a high level of trust in the answers I might receive. In communication theory parlance, when there’s a signal, it’s almost a given that there’ll be very little noise. Even beyond the “fighting ignorance” charge, that’s close to priceless in and of itself.

I don't know if this is what you're looking for, but you might want to look into a framework that takes care of the annoying server-side boilerplate for you. I like Twisted - it manages the threading and socket connections, and hands you the socket to do with as you please. (You can also take advantage of a huge client/server library so you don't need to re-invent an HTTP server, for example.) It's all done with an event-based API, which I find makes it easier.

Of course, YMMV.

Here are some simple examples of Twisted Core (the async client/server API): echo server, telnet chat server, SSL, etc.
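
For a taste, the classic Twisted echo server is about this small (using the same port as your snippet):

from twisted.internet import protocol, reactor

class Echo(protocol.Protocol):
    def dataReceived(self, data):
        # called whenever bytes arrive; Twisted owns the accept/select loop
        self.transport.write(data)

class EchoFactory(protocol.Factory):
    def buildProtocol(self, addr):
        return Echo()   # one protocol instance per connection

reactor.listenTCP(21567, EchoFactory())
reactor.run()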

This is probably a good idea; it’s always productive (for me, anyway) to learn by example. And I have to say that, now that I have a firmer grasp of the task, I think I’ve been viewing this as more complex than it needs to be; I’m pretty sure it’s a case of projecting, substituting my design specs and reqs for what Merkwurdigliebe actually needs. For instance, as it turns out, my mention of mutexes to synchronize the job queue was unnecessary – the select will serve the purpose.

Hopefully, chatting about it is beneficial in and of itself, at least from a learning perspective.