This purpose of this document is to outline the design of the Kylimar MUD Server. It shall be refined over time as the design is developed. The goal is to design a server that can support between 10,000 and 20,000 simultaneous connections on a machine that roughly one to two gigahertz of processing power, one to two gigabytes of main memory, and an network connection that can support the required bandwidth.
The following references have been and will continue to be useful as the design moves forward:
These are terms and definitions used in this document.
The initial platform for development will be Linux with a 2.4 kernel. It is thought that upgrading to the 2.6 kernel will not present any problems. It is hoped that the architecture and implementation will make the server very portable to other platforms. However, since the initial destination platform is Linux 2.4, some decisions will be slightly tailored to that.
Through this project the following libraries and APIs will be used:
The Vstr library was chosen because it has excellent support for network I/O. Combining the MUD Server with a string library means that we are not reliant on the host system's default C string library, which may or may not be broken.
The epoll API was chosen because it demonstrates superior performance over other similar APIs as of this writing. Asynchronous I/O was avoided because it is just threading hidden by the GNU libc library. Also, it was decided that epoll should be used in an Edge Triggered way to prevent epoll from constantly returning a socket that was already being processed. However, it is still possible for an event to arrive on a socket that is already being processed. For details, see the Threading section below.
In designing a data storage system for the Kylimar MUD Server, we first consider a database. It has many useful features:
However, databases have some downsides:
The downsides of a complete database system seem fairly substantial. Therefore another idea was put forth: bundle the Kylimar MUD Server with its own Data Management Backend Server. A server like this would have these benefits:
The biggest issues facing such a system would be transactions, both for data integrity between system hiccups and for locking. These issues have yet to be addressed.
The Kylimar MUD Server will be multithreaded in order to handle the required number of simultaneous Client connections. However, it will not have one thread dedicated to each client connection, as that is not scalable. The system cannot handle more than few hundred threads, less if the threads are very active. There will be at least three threads, configurable to many more:
Of the worker threads, only one may be polling for events at a time. This is enforced by a pthread mutex. Normally, a Client connection is polled for EPOLLIN, EPOLLPRI, EPOLLERR, and EPOLLHUP, where EPOLLPRI, EPOLLERR, and EPOLLHUP indicate a connection needing to be closed. When there is data to be sent to a Client, the thread determining that there is data to be sent will call epoll_ctl and add the EPOLLOUT event, despite the fact that another thread MAY be waiting for events. This is not a problem. Once all data is sent to the Client, epoll_ctl is called again and the EPOLLOUT event is removed. Remember, we are using epoll in an Edge Triggered paradigm.
It is possible for events to happen in this way:
epoll_wait.epoll_wait; it does not see the event from before even though it has not yet been processed because we are in Edge Triggered mode.
The solution to this problem is that once an event is seen, the processing thread must still retrieve a lock on the Client connection before processing occurs. Once the lock is retrieved, the Client connection is processed, even though the data causing the original event to be fired may have been eaten by the previous lock holder. This is why all I/O must be marked as non-blocking. If the event is noticed, but the read returns EAGAIN, no processing is really done, except the wasted locking and unlocking of the Client connection just to determine that there is nothing to be done. Hopefully, event processing will occur fast enough and events will enough infrequently enough that in practice this situation hardly ever happens.
The sweeper thread handles memory deallocation. It runs occasionally, sleeping between runs. When it runs, it deallocates any memory that has not been used for some considerable time. For details about what "some considerable time" is, see the Memory Management section below. The sweeper thread will probably run once every five seconds or so (sleep(5)), but that interval will be tunable by the Administrator.
Worker threads will also be involved in listening for connections. When the worker notices an event on the listening socket, it will call accept() to retrieve the new connection. It will then perform some minor initialization on the connection before adding it to the list of connections that the workers are waiting for events on.
To minimize context switches, we only allow the number of threads competing for processor time to do work to be the same as the number of processors (or maybe +1). This is tunable by the administrator; we do not attempt to detect the number of processors. This is enforced by a counting semaphore. Any thread performing "work" (i.e. not sleeping or waiting for an event) must have retrieved the semaphore by calling sem_wait(). When any thread has finished doing "work" and is about to sleep or wait for an event, it will release the semaphore by calling sem_post().
There will be another thread which is a maintainance thread. It will run at regular intervals, being the heartbeat or tick of the server. The administator will tune the interval, but the default (and suggested) value will be thirty seconds. This thread will be devoted almost entirely to tasks specific to the application. Some examples tasks might be:
It is expensive to call the memory allocator. Therefore, we will try to avoid the allocator as much as reasonably possible. When a thread needs more memory, it will request it from a list of unused objects.
The following are object types that will be pooled and cleaned by the sweeper.
| Object Type | When Requested | When Released |
|---|---|---|
| Client Connection Object | Upon initial client accept() | Upon client close() |
| User Data Object | Upon first associated Client authentication | Upon last associated Client close() |
String objects are not contained in the list because the Vstr library maintains its own internal pool.
Pooled objects will be contained in two lists. When an object is requested, it is first pulled off the "A" list. If none exist on the "A" list, then it is pulled off the "B" list. If none exist on the "B" list, then it is allocated from the system memory. When an object is released, it is added to the "A" list. When the sweeper runs, it creates a new "A" list, moves the old "A" list to the "B" list, and deallocates the memory associated with the old "B" list.
Steps for handling a new connection:
accept() is called and returns a new connection.O_NONBLOCK. If it is not, set that flag.epoll_ctl.Steps for handling the termination of a connection, initiated by the server:
close() the Client connection socket.Steps for handling the termination of a connection, initiated by the remote end:
close() the Client connection socket.epoll_ctl call? Is any locking needed?epoll_ctl?sleep(), alarm(), select(), or some other way to delay?I would greatly appreciate any comments and suggestions you might have. This is definitely an ongoing project. My mailing address is ben at this domain.
Copyright (C) 2005 Ben Love. All Rights Reserved.