Friday, August 20, 2004

 

High performance sockets


Introduction

In this article, I write about developing a scalable, high-performance network server application for Windows. Windows Server 2003 is the target platform for this discussion. I recommend combining overlapped I/O, I/O completion ports and the WinSock extended functions as a high-performance solution.

The other choices

Before we discuss our high performance solution, let us briefly discuss some other options.

○ Use select() to wait for data, then process it.
This is probably the most commonly used model. The problem with this model is that every select() call takes sets of handles for read, write and exception events. These sets are scanned on every select() call, their corresponding kernel structures are modified, and the sets are rewritten when select() returns. The programmer has to rebuild these sets every time a socket is created or closed, assuming you pass only a copy of your handle sets to each select() call. You can optimize this behavior, but the overhead is still high; with more than about 100 sockets you would see significant performance degradation.

○ Use WSAAsyncSelect() and a message loop
In this method, you associate every socket with a window message using the WSAAsyncSelect() call and use a message loop to retrieve the messages. Now you are going to face the typical message loop issues: it is a single, thread-specific queue. If you share this message loop with non-socket Window messages, you would see latency issues.

○ Use WSAEventSelect() and WaitForMultipleObjects()
Here, you associate each socket with an event kernel object. The problem here is that you are limited to 64 handles per WaitForMultipleObjects() call, so you may need to use multiple threads. You also need to rebuild the handle array whenever a socket is added or deleted. A minimal sketch follows.


Now let us get back to the topic and introduce my solution. To start with, I am going to introduce overlapped I/O and I/O completion ports. Let us discuss overlapped I/O first.

Overlapped operations

Like I/O completion ports, overlapped I/O is not socket specific; both are part of the I/O subsystem in Windows. You should also remember that sockets are true file handles. Creating a socket is similar to opening a serial port device, or for that matter any other device. When a socket is created, the user-level socket library (ws2_32.dll and mswsock.dll) uses the AFD.sys kernel driver to ultimately open \Device\Tcp (or \Device\Udp). So there is a kernel-level file handle and an associated file object for every socket. Why do I talk about this? Because you need a file handle to use overlapped I/O and I/O completion ports.

Overlapped I/O enables asynchronous execution of socket operations. Basically, you tell WinSock to initiate a socket operation and call you back when the operation is completed. The socket operation can be a send, receive, connect or accept. There are three ways you can be notified of the result: events, callbacks and completion ports. Let us do a walkthrough with sending data. Here is the function signature of WSASend().

int WSASend(SOCKET s, LPWSABUF lpBuffers, DWORD dwBufferCount, LPDWORD lpNumberOfBytesSent, DWORD dwFlags, LPWSAOVERLAPPED lpOverlapped, LPWSAOVERLAPPED_COMPLETION_ROUTINE lpCompletionRoutine);

So along with the usual parameters for sending data, we specify a WSAOVERLAPPED structure. It is an opaque structure except for the hEvent field. If you want event notification, assign the handle of a manual-reset event kernel object to this field. If you want callback notification, write your own callback function and pass its address as the last argument. This callback function is invoked as a user-mode APC, so your thread has to wait in an alertable wait state. If you specify a callback routine, the hEvent field is ignored.
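
Here is a minimal sketch of an overlapped send with event notification (error handling trimmed; sock, dataPtr and dataLen are assumptions for this example):

WSABUF buf;
buf.buf = dataPtr;                  // data to send
buf.len = dataLen;                  // number of bytes

WSAOVERLAPPED ov;
ZeroMemory(&ov, sizeof(ov));
ov.hEvent = WSACreateEvent();       // manual-reset event

DWORD bytesSent = 0, flags = 0;
int rc = WSASend(sock, &buf, 1, &bytesSent, 0, &ov, NULL);
if (rc == SOCKET_ERROR && WSAGetLastError() != WSA_IO_PENDING) {
    // immediate failure: no completion notification will arrive
} else {
    // do other work, then wait for the completion
    WSAWaitForMultipleEvents(1, &ov.hEvent, TRUE, WSA_INFINITE, FALSE);
    WSAGetOverlappedResult(sock, &ov, &bytesSent, FALSE, &flags);
}
WSACloseEvent(ov.hEvent);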

Events and callback routines are not highly scalable methods. When you use events, you are limited to 64 per WaitFor… call, so for more than 64 events you need multiple threads, along with the issues we discussed earlier. Callbacks are delivered only to the thread that initiated the operation, so they do not scale either.

I/O Completion Ports

An I/O completion port (IOCP) is not a port at all; it is a per-process message queue. Windows adds and removes I/O completion messages to and from this queue. It is a special queue: only the Windows I/O manager (and its related utilities) knows how to deal with it, and the message format and size are fixed and undocumented, so you cannot access the queue structures directly.

The I/O manager adds a message to this queue whenever an I/O operation (IRP) completes. You can add your own completion message using the PostQueuedCompletionStatus() function, and you can retrieve messages using the GetQueuedCompletionStatus() function.

To create an IOCP, call CreateIoCompletionPort(); this creates the IOCP queue. To start with, no file handle is associated with this queue (this is not completely true). To associate a file handle with the IOCP, call CreateIoCompletionPort() again, passing the file handle as a parameter. This is the confusing part about IOCPs: the second CreateIoCompletionPort() call doesn't really create anything, it just stores a reference to the IOCP in the kernel file object associated with your file handle. Now, whenever an I/O operation finishes on that file handle (successfully or not), Windows knows which IOCP to use. There can be only one IOCP associated with a file handle, and there is no documented way to disassociate an IOCP from a file handle. When you close the file handle, it is no longer associated with that IOCP. Use CloseHandle() to delete the IOCP itself; it will be deleted after all referring file handles are closed.
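
A minimal sketch of both calls (MAX_THREADS, sock and connInfo are assumptions for this example):

// First call: create the port itself; the last parameter is the
// concurrency value (how many threads may run at once).
HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, MAX_THREADS);

// Second call: associate a socket (a true file handle) with the port.
// connInfo becomes the completion key returned with every message.
CreateIoCompletionPort((HANDLE)sock, iocp, (ULONG_PTR)connInfo, 0);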

IOCPs are meant to be used with thread pools. When you create an IOCP, you can specify how many threads are associated with it. Say you associate 5 threads with an IOCP: then there can be at most 5 running threads that were woken from the GetQueuedCompletionStatus() function. I say running because, if a worker thread goes ahead and enters a wait state (after being released from an IOCP wait), the Windows scheduler detects this and schedules one more thread in. Don't use this fact as a design feature; it is insurance against accidental or unexpected blocking. A worker thread looks something like the sketch below.
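
This is a minimal sketch of a worker thread entry point (names are my own):

DWORD WINAPI WorkerThread(LPVOID param)
{
    HANDLE iocp = (HANDLE)param;
    for (;;) {
        DWORD bytes = 0;
        ULONG_PTR key = 0;
        LPOVERLAPPED ov = NULL;
        BOOL ok = GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE);
        if (!ok && ov == NULL)
            break;          // the port was closed or the wait failed
        // ok == FALSE with a non-NULL ov means a failed I/O operation;
        // dispatch on key/ov here (see the sample code walkthrough).
    }
    return 0;
}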

WinSock extended functions

Extended functions are Windows-specific socket functions (not part of standard Unix/POSIX). They are generally high-performance alternatives to the corresponding WinSock functions. Among these, I would like to talk about ConnectEx and AcceptEx. ConnectEx lets you associate an overlapped structure with a connect operation, so you don't have to wait for the connect to complete; you will be notified through events or IOCPs. AcceptEx does the same for the accept operation.

When you call the standard accept() function, WinSock creates a socket handle for the new connection before accept() returns, but with AcceptEx you have to create this new socket before you call AcceptEx. This lets you pre-allocate socket handles when your program starts and keep reusing them. To reuse a socket handle, call DisconnectEx instead of closing the socket with the closesocket() function. Since creating a socket is a relatively expensive operation, reusing them saves time.

Unlike standard WinSock functions, these extended functions are not directly linked against the WinSock library. You have to get the address of each function before calling it. You can use code similar to the following to achieve this. Here we are using WSAIoctl with the SIO_GET_EXTENSION_FUNCTION_POINTER option to retrieve the address.

GUID guidAcceptEx = WSAID_ACCEPTEX;
LPFN_ACCEPTEX AcceptExFunction = NULL;
DWORD bytesReturned = 0;

INT rc = WSAIoctl(listenSocket, SIO_GET_EXTENSION_FUNCTION_POINTER,
                  &guidAcceptEx, sizeof(guidAcceptEx),
                  &AcceptExFunction, sizeof(AcceptExFunction),
                  &bytesReturned, NULL, NULL);
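
Once retrieved, the pointer is called like the documented AcceptEx. A minimal sketch (acceptSocket and ov are assumptions; the buffer sizes follow the documented sockaddr-plus-16-bytes rule):

char addrBuf[2 * (sizeof(SOCKADDR_IN) + 16)];
DWORD bytes = 0;

BOOL ok = AcceptExFunction(listenSocket, acceptSocket, addrBuf,
                           0,                           // no receive with accept
                           sizeof(SOCKADDR_IN) + 16,    // local address space
                           sizeof(SOCKADDR_IN) + 16,    // remote address space
                           &bytes, &ov);
// FALSE with WSAGetLastError() == ERROR_IO_PENDING means the accept is
// outstanding; the completion arrives through the IOCP.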


More on overlapped IO

Before looking at some sample code, I would like to say a few more things about overlapped I/O. When you call an overlapped send or receive, the data buffer is specified as an array of WSABUF structures. Each WSABUF entry specifies the address and length of a buffer, and you can specify several of them in a single call. When you use WSABUF structures, Windows locks those buffers for the duration of the operation and transfers data directly to and from them. This eliminates the need for an intermediate copy.
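
For example, a gather send of a header and payload in one call might look like this (sock, hdr, payload, payloadLen and ov are assumptions for this sketch):

WSABUF bufs[2];
bufs[0].buf = (char *)&hdr;     // protocol header structure
bufs[0].len = sizeof(hdr);
bufs[1].buf = payload;          // payload bytes
bufs[1].len = payloadLen;

DWORD sent = 0;
WSASend(sock, bufs, 2, &sent, 0, &ov, NULL);   // one call, no copying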

When you initiate an overlapped operation, it can complete immediately, fail immediately, or be left pending. If the operation fails immediately, you will not be notified through the overlapped notification mechanism; you are notified only if the operation completes successfully (either immediately or later) or fails later. This is very convenient, because you can process the result of an overlapped operation in a single location (at the notification point) instead of doing it at both the invocation point and the notification point.

One more point to remember: the Windows I/O subsystem is inherently asynchronous, and Windows does extra work to implement synchronous behavior for you. So there is no penalty or cost associated with overlapped I/O.

Sample code

This is a simple echo server to demonstrate the concepts discussed earlier. Here is a brief code walkthrough:

Structures:
○ SERVER_INFO: Per-server structure
○ CONNECTION_INFO: Per-TCP-connection structure
○ GENERIC_OVERLAP_INFO: Describes an overlapped operation
○ DATA_OVERLAP_INFO: Describes an overlapped send/receive operation

Function StartServer(): -- Called to start the server.
○ Allocate memory for the SERVER_INFO structure
○ Open a listening socket using WSASocket
○ Get function pointers for AcceptEx and DisconnectEx using WSAIoctl
○ Create the I/O completion port
○ Bind the listening socket to the server's TCP port
○ Call listen
○ Call our CreateConnection function to create a connection
○ Repeat CreateConnection for OVERLAP_CONNECTIONS connections
○ Create a thread with ControlPortThreadEntry as the entry point
○ Repeat for MAX_THREADS threads

Function CreateConnection():
○ Allocate memory for the CONNECTION_INFO structure
○ Open a socket to be used in AcceptEx
○ Associate this socket handle with the server's I/O completion port
○ Call the AcceptEx function with the overlapped structure defined in the CONNECTION_INFO structure

Function ControlPortThreadEntry(): -- Entry point for an IOCP thread
○ In a forever loop, wait for a new IOCP message by calling GetQueuedCompletionStatus
○ Use the key from GetQueuedCompletionStatus as a pointer to the CONNECTION_INFO structure
○ For each message received, determine the type of overlapped operation; to do this, get the containing record of the overlapped structure (see the sketch after this list)
○ For an ACCEPT message, call the HandleAcceptExComplete function
○ For a RECEIVE message, call the HandleReceiveComplete function
○ For a SEND message, call the HandleSendComplete function
○ For a TIMER message, call the HandleTimeoutEvent function
○ For a DISCONNECT message, call the HandleDisconnectExComplete function
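
The containing-record step can be sketched like this, building on the worker loop shown earlier (the exact structure layout and OP_* constants are assumptions for illustration):

typedef struct {
    OVERLAPPED overlapped;
    int        operation;       // OP_ACCEPT, OP_RECEIVE, OP_SEND, ...
} GENERIC_OVERLAP_INFO;

// inside the loop, after GetQueuedCompletionStatus fills key and ov:
CONNECTION_INFO *conn = (CONNECTION_INFO *)key;
GENERIC_OVERLAP_INFO *info =
    CONTAINING_RECORD(ov, GENERIC_OVERLAP_INFO, overlapped);

switch (info->operation) {
case OP_ACCEPT:     HandleAcceptExComplete(conn);     break;
case OP_RECEIVE:    HandleReceiveComplete(conn);      break;
case OP_SEND:       HandleSendComplete(conn);         break;
case OP_TIMER:      HandleTimeoutEvent(conn);         break;
case OP_DISCONNECT: HandleDisconnectExComplete(conn); break;
}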

Function HandleAcceptExComplete(): -- Handles completion of an AcceptEx operation; a new connection has arrived
○ Start an overlapped receive operation using WSARecv
○ Start a timeout timer, to protect against a long idle connection

Function HandleReceiveComplete(): -- Data has been received, process it.
○ Echo the received data back by calling WSASend()
○ If the socket has been closed, call the EndConnection function
○ Stop the timeout timer (it may be restarted)

Function HandleSendComplete():
○ Just start another receive operation

Function HandleDisconnectExComplete():
○ The DisconnectEx call has completed; go ahead and start another overlapped AcceptEx operation

Function EndConnection(): -- Called when a connection is done
○ Start an overlapped DisconnectEx operation
○ Stop the receive timeout timer if it is running

Function TimerCallback(): -- Called when the receive timeout timer fires
○ Submit a message to the IOCP using PostQueuedCompletionStatus; we do this to synchronize access through the thread pool

Please email svanam@hotmail.com for source code.


Thursday, August 19, 2004

 

Multi threaded programming

Introduction

The major issue in multi-threaded programming is synchronizing access to shared resources, otherwise known as locking. I talk about locking issues in this post.

Locking In General

Before getting to Windows-specific locking issues, some general thoughts on locking are in order.

The most important rule in locking is: lock your data, not your code. Meaning, don't try to identify critical areas of code that require serialization; instead, identify the shared resources and lock access to those resources. Locking code seriously affects scalability.

It is important to realize that locking is a critical issue affecting the scalability of your program. The worst-case locking method is to use a single global lock for everything, which means effectively only one thread is operating at a time. This is clearly not what you want.

The best thing you can do to improve scalability is to not use locks at all. It is almost impossible to achieve this, but you should work hard to get closer to this goal. One effective method is to use immutable objects. When you design a data structure, divide it into shared and non-shared units. Each thread can lock the shared unit, extract the non-shared unit (or a copy of it), and release the lock. That thread can then continue doing its work without holding any lock.

An exercise in locking

Let us explore how one can implement locking in a typical network server application. In this server, requests received from the network are stored in a receive queue. Multiple threads extract and process requests from this queue and send a response packet to the clients. Our goal is to minimize locking of the most important resource here, the queue itself. Let us look at different locking strategies.

The least efficient option would be to require each thread to lock the receive queue for the entire processing of a request. Since the lock is held for the entire length of processing, no other thread can get in between, and this effectively becomes a single-threaded solution.

In the case where the requests in the receive queue are independent of each other and need not be processed in the order received: each thread can lock the queue, remove a request, unlock the queue and then process the request. Here the queue is locked only until the request is removed, and no lock is held during processing. This solution scales better; a minimal sketch follows.
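
Here is a minimal sketch of that pattern (queueLock, recvQueue and the queue helper functions are assumptions for this example):

REQUEST *req = NULL;

EnterCriticalSection(&queueLock);       // lock only the queue
req = DequeueRequest(recvQueue);        // returns NULL if the queue is empty
LeaveCriticalSection(&queueLock);

if (req != NULL)
    ProcessRequest(req);                // no lock held during processing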

In the case where requests are to be processed in the order received: we can follow the earlier strategy, but we need to make sure the send operation is serialized. To achieve this, we can create a send queue and associate a thread with it. We fill the send queue in the same order the packets were received (we can use partially filled send packets for this; the send thread waits for each send packet to be filled in completely before sending it). You would probably need to use events to communicate with the send queue thread.

In the case where packets for different channels are received in the same queue, and the requests within a single channel are to be processed in order: you can design it so that only a single thread operates on a channel at a time (there is no thread affinity here; a thread starting on one channel completes all the pending requests on that channel). Instead of allocating a thread per channel, a thread working on a particular channel picks up all the pending requests of that channel. To achieve this, a thread pulls a request from the main queue and looks at a separate per-channel request queue for that channel. If the per-channel queue is empty, the thread processes the request; if not, it adds the request to the per-channel queue and goes back to the main receive queue. On completing a request, each thread looks at the per-channel queue of the request just completed; if a request is found there, it processes that request, repeating until the queue is empty. This is a slightly complicated method with its own problems. If you have more channels than threads, some channels may starve. If processing a request is time consuming, throughput will be slow, with a possibility of idle threads. It is going to be a balancing act, depending on your situation.

Just as a note of caution, when you work with queues, always consider the head-of-line blocking issue: a problem with the first element in the queue blocks the other entries in the queue from being processed.

Locking order

Some developers recommend a locking order. To avoid deadlocks, they recommend that you always grab lock A before you grab lock B. In my opinion, if you have to define a locking order, you made a mistake in your design. If you keep your locks fine-grained and minimize the time you hold them, you should never run into this problem. Again, remember the rule: lock your data, not your code.

Locking and Callbacks

In a typical server application, you need to call some external functions while processing data. Since you don't know what those functions might do, you should not be holding your lock during these callbacks. A common way to solve this problem is to release the lock before calling the external function and then grab the lock again afterwards. The problem with this approach is that when you grab the lock again, the associated resource (and the lock itself) might have been deleted. You need reference counts and flags to solve this problem, and the code gets complicated and buggier.

A better approach is to complete all your processing, release the lock, and then invoke the callback, as sketched below. You can achieve this by cloning data and using multiple locks.
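
A minimal sketch of this approach, with hypothetical names:

EnterCriticalSection(&obj->lock);
CALLBACK_FN cb  = obj->callback;    // copy out what the callback needs
void       *arg = obj->context;
obj->state = STATE_NOTIFIED;        // finish all protected work first
LeaveCriticalSection(&obj->lock);

if (cb != NULL)
    cb(arg);                        // invoked with no lock held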

If you use fine-grained locks, you can even invoke the callback while still holding the lock. You can use this technique if you have access to the source code of the callback function and can make sure it doesn't do anything abnormal, like sleeping.

Reference Counts

Reference counts are a widely used technique for solving resource sharing problems. They are so easy to use that developers don't hesitate to use them. Reference counts are long-term locks: they span multiple calls, and they are mainly used to prevent a resource from being deleted accidentally. You increment the reference count for the duration of an object's use and decrement it when you no longer need the resource. When an object is created, its reference count is 1; when the user deletes it, the reference count is just decremented, and the object is actually deleted only when the reference count reaches zero.
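
A minimal sketch of the mechanism, using the interlocked operations discussed later (the structure and names are assumptions):

typedef struct {
    volatile LONG refCount;     // starts at 1 when the resource is created
    /* ... resource fields ... */
} RESOURCE;

void AddRef(RESOURCE *r)
{
    InterlockedIncrement(&r->refCount);
}

void Release(RESOURCE *r)
{
    if (InterlockedDecrement(&r->refCount) == 0)
        free(r);                // last reference gone: delete the resource
}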

But reference counts are a debugging nightmare: while they are easy to code, they are very difficult to maintain. With bad coding, a resource can be deleted accidentally or never deleted at all. So try to avoid reference counts if possible. If you have to use them, keep a separate flag for each user of the resource: when you increment the reference count, set the associated flag, and when you decrement it, clear the flag. This way, when you are debugging, you can find out who is holding the resource.

If the number of users of a resource is small and they are all part of the same application, you can use flags instead of reference counts. They may be more difficult to code, but they present fewer opportunities for bugs.

Locking in Windows

Windows provides a rich set of locking primitives. You can use many kernel objects for locking, the mutex being the most common. You can also use events and semaphores for locking. There is enough documentation on MSDN on how to use these kernel objects. If you need to learn more about the object manager and kernel objects, read the Inside Windows book by Mark Russinovich and David Solomon.

If your application contains multiple processes, you need to use kernel objects. But if your application is simply multi-threaded, you can (and should) use critical sections. I wish Microsoft had named CriticalSection something like UserLock; the current name gives the impression that you should lock code instead of data.

A critical section is basically a data structure you define along with the other fields of your own data structure. It is a smart object: it uses a combination of interlocked operations and kernel objects (interlocked operations are discussed later). Locking a critical section may not make a kernel transition, so it is faster than a kernel object; the kernel object is used only if the critical section is already held by another thread. This is the reason you should use a critical section in a multi-threaded program instead of a kernel object. Since a critical section uses addresses from its own process, it can't be used across processes.

Critical sections are recursive: if a thread already holds a critical section, trying to grab it again from the same thread will succeed.

To use a critical section, first initialize it by calling InitializeCriticalSection(). To lock the critical section, call EnterCriticalSection(); to unlock it, call LeaveCriticalSection(). At the end of your program, call DeleteCriticalSection() to release its resources.
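
A minimal sketch, embedding the critical section next to the data it protects (names are assumptions):

typedef struct {
    CRITICAL_SECTION lock;      // the lock lives with the data it protects
    int              counter;
} SHARED_DATA;

SHARED_DATA g_data;             // InitializeCriticalSection(&g_data.lock) at
                                // startup, DeleteCriticalSection() at exit

void IncrementCounter(void)
{
    EnterCriticalSection(&g_data.lock);
    g_data.counter++;           // protected access
    LeaveCriticalSection(&g_data.lock);
}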

Interlocked operations

Interlocked operations take advantage of the CPU's LOCK prefix to implement 32-bit operations atomically (there are 64-bit variants too). The most common of these are InterlockedIncrement and InterlockedDecrement; they increment and decrement a 32-bit value at an address atomically, which means nothing can change the value in the middle of the operation. If the operation were not atomic, some other thread might change the value between your read and your write.

Interlocked operations come in handy when you need to synchronize access to a single 32-bit value. They are fast, but I still prefer critical sections, primarily because interlocked operations can't protect their containing data structure. For simple things like incrementing a reference count, they are useful. Critical sections make the code more readable, and in a complicated project they are safer (with multiple programmers working on a single project, it's easier to introduce a bug with interlocked operations).

To get a better perspective on interlocked operations, learn how they are implemented on a cache-coherent CPU like the Pentium in a multi-CPU environment. One thing you will notice is that an interlocked operation invalidates and reloads the cache line, so your L1 and L2 caches are not used as much as they could be. Since memory access occurs in units of a cache line, data adjacent to your interlocked variable in the same cache line may also be invalidated and reloaded from memory.


Monday, August 16, 2004

 

Basics of .Net

Introduction

This post discusses various .Net features. These are mostly basic things one should be aware of.


Design Philosophy

Productivity is the most important feature of .Net. Productivity is the number one goal; speed and memory conservation are secondary. That means the CLR does more work for you, so your software design methods have to change accordingly.

The first thing to note is that a .Net assembly is much more than an executable. It carries rich metadata describing various aspects of the assembly. Attributes are an easy way to add your own information to the assembly, and using reflection, other assemblies can read your metadata. Also note that all the classes in your assembly are fully described, including the names of private members.

The execution environment is dynamic. Don't hesitate to allocate memory, and don't worry about memory fragmentation; the CLR does excellent memory management. Memory usage of a .Net program is generally high, and there is a lot of memory copying involved. For a C programmer this is scary; just get over it.

When you develop a .Net application, your goal is to develop a reliable and secure application. Memory overhead and speed of execution should be secondary goals; if they are your primary goals, you are better off sticking with C/C++.

Managed Code

A .Net assembly (in intermediate language) needs to be converted to native (x86) code by the Common Language Runtime (CLR) before it is executed. The CLR does the compilation from intermediate language to native code, so it knows exactly what your assembly is doing. Even after the compilation is done, the CLR doesn't give complete control of the processor to your assembly. The CLR is the running program; your assembly is just a set of library calls for the CLR. Your assembly is managed by the framework; that's why it's called managed code.

Contrast this with native code: it has full control of the processor and can do anything it wants (limited, of course, by OS rights management).

Let me put it this way, managed code is digital and native code is analog. It is easy to manipulate digital data but not analog signals.

Common Type System

A .Net assembly is just a collection of types: classes, interfaces and so on. Any class with a static Main method can be marked as the entry point for the assembly. While an entry point is typically how an assembly starts execution, it doesn't have to be that way. Some other assembly can load your assembly and use one of its types, bypassing your entry point completely. So it is safe to think of your assembly as just a collection of classes, interfaces, value types and other basic types, with a suggested entry point.

For any instance of your class, the type information can be easily obtained using reflection.

Meta data

All the types (classes, interfaces) you define in your C# program are fully described in your assembly; only a few things are lost during compilation from C# to the assembly. A .Net assembly is not a binary in the traditional sense: unless you use some obfuscation technique, your assembly is practically readable source code.

This metadata is a fundamental feature of .Net. It enables your classes to be fully described in the assembly, so another assembly can learn about your types dynamically and invoke your class's methods dynamically. Contrast this with calling a function in a native code DLL: the caller has to know the function signature during development; it cannot be learnt dynamically.

Each assembly also has metadata describing itself, including its digital certificate, version number, company name and other things. By including a version number in each assembly, DLL hell is avoided, even though version hell is introduced.

Reflection

Reflection is the process of exposing information about types, and it is a widely used feature in .Net. The CLR itself learns about your assembly using reflection.

The types in an assembly are fully described using the Type class. With this class you can learn about a class's members, methods, attributes and so on. You can call a class's methods or change the values of its members using reflection.

Reflection is not an esoteric feature; it is an integral part of the CLR. Don't be shy about using it.



IO Streams

To do any I/O in .Net, you should be comfortable with the streams framework. Using streams is well documented in MSDN, but the most important thing to know about streams is the pluggable model of doing I/O. Another stream can be attached above or below your stream, enabling the user to build a chain of streams. A pluggable stream doesn't necessarily know that another stream is attached on top or underneath. Not all streams are pluggable; some streams can only be at the bottom of the chain.

For example, to build a compressed, encrypted network data stream: obtain a network stream from a connected socket, create an encryption stream and attach it on top of the network stream, then create a compression stream and attach it on top of that. The end result is a single stream that you can pass to your network application. When the application writes data to this stream, the data is compressed, encrypted, and sent reliably to the other end, all without the application doing any extra work for compression or encryption.



Environment

Every .Net installation is meant to co-exist with other installations. By default, .Net files are installed in Windows\Microsoft.Net\Framework\ under their own version-dependent directory.
Compilers for C#, VB.Net and VJ are installed by default.
The global assembly cache (GAC) contains the system .Net assemblies.
There are no header files in .Net. The C# compiler doesn't compile each C# file separately; it combines all the files together and compiles the resulting source.



Garbage collection

In a .Net environment, dynamic memory allocation is very common. Compared to native code, memory allocation in .Net is faster. So when programming in .Net, don't worry about memory allocation overhead.



Every class instance in .Net is accessed through a reference (a pointer). All instances are tracked by the CLR, and when they are no longer referenced, their memory is reclaimed. This process is called garbage collection (GC).

GC runs when memory is running low. It frees memory that has no references to it. For example, if the only variable pointing to a class instance goes out of scope, there is no reference left to that instance, so it can be freed.

GC may change the memory address of a class instance. So don't count on the memory address of a class instance staying the same throughout the application's execution.

GC is a time-consuming process, and managed threads are paused while GC runs. Keep this in mind if you are designing a real-time system.

Arrays and Structures

The order of members in a class is not preserved in memory. For example, if you define a byte array between two integers in your class, the CLR may place the two integers together in memory and put the byte array after them.

All arrays have their length stored with the data. This length is fixed once the array has been created, and it is just the capacity of the array, not the number of valid elements in it. For example, if you have a network buffer declared as a byte array, you may need a separate variable to keep track of the actual number of bytes in the array.

Asynchronous Execution

Asynchronous execution is another widely used feature in .Net. It enables the calling thread to continue execution while the called function is still executing.
In asynchronous execution, the called function is split into two parts: the begin method and the end method.

To use asynchronous execution, you pass a callback function (a delegate) to the begin method along with a context value (any object). The begin method returns an IAsyncResult instance. When the called function is complete, it invokes the delegate and passes the context value as a parameter. The callback then has to call the end method with the IAsyncResult instance returned by the begin method. The purpose of this IAsyncResult instance is to link the begin and end methods.

Code Access Security

Since .Net is a managed code environment, the CLR can identify code and enforce access control. The CLR defines an extensive set of rights that the user can configure, and applies them against the set of evidence the CLR obtains from the assembly.

Code Access Security (CAS) is still evolving; maybe in the Longhorn time frame it will be fully appreciated.

Code DOM

CodeDOM is used to generate code dynamically. Microsoft doesn't document this feature well, but MSDN has a working sample to start with.

With CodeDOM you can define your methods and members and dynamically generate code in any of the .Net languages. By default, code generators for C# and VB.Net are included. CodeDOM is difficult to get used to; the amount of work involved in generating even a simple class is quite high. But once you get used to it, it can be very useful.


