Newsletter
RSS

Blog posts of '2014' 'July'

.NET CLR Synchronization Mechanisms Performance Comparison

Multi threaded high throughput applications and services must choose the synchronization mechanisms carefully to gain an optimum throughput. I have ran several tests on different computers so I can compare common synchronization mechanisms in the Microsoft .NET CLR. Here are the results:

Test Results

Processor Empty Loop Interlocked Locked Polling Reader Writer Lock
Read / Write
 AMD Opteron 4174 HE 2.3 GHz 10.4 ms 30.6 ms 90.8 ms 1095 ms  208 ms  / 182 ms
 AMD Athlon 64 X2 5600+ 2.9 GHz 7,1 ms 19.9 ms 37.8 ms 546 ms 88.9 ms / 82.8 ms
 Intel Core 2 Quad Q9550 2.83 GHz 4.3 ms 19.3 ms 56.2 ms 443 ms  99 ms / 82.6 ms
Azure A1 (Intel Xeon E5-2660 2.2 GHz) 8.0 ms 19.9 ms 57.5 ms 502 ms 108 ms / 104 ms
Processor Auto Reset
Event
Manual Reset
Event
Semaphore Mutex Thread Static
 AMD Opteron 4174 HE 2.3 GHz 2927 ms 3551 ms 2732 ms 2764 ms 12.8 ms
 AMD Athlon 64 X2 5600+ 2.9 GHz 1833 ms 2052 ms 1870 ms 1733 ms 18,2 ms
 Intel Core 2 Quad Q9550 2.83 GHz 1015 ms 1328 ms 1169 ms 1099 ms 13.2 ms
Azure A1 (Intel Xeon E5-2660 2.2 GHz) 1576 ms 2215 ms 1760 ms 1847 ms 8.4 ms

As we see, there are huge differences in the throughput of the synchronization mechanisms. Next i wil discuss each mechanism and its use cases.

.NET Synchronization Mechanisms

Interlocked (Interlocked.Increment …)

Interlocked access has 3-4 times the overhead than unsynchronized access. It is best, when a single variable must be modified. If more than one variable must be modified, consider the use of the lock statement. On standard hardware a throughput of 400.000.000 Interlocked operations per second can be achieved.

Locked (C# lock Statement)

Locked access is about 5-10 times slower than unsynchronized access. It is quite equal to Interlocked when two variables must be modified. From a general perspective it is fast and there is no need for tricks (malicious double locking) to avoid it. On standard hardware a throughput of 200.000.000 lock operations per second can be achieved.

Polling (Thread.Sleep)

The overhead of Thread.Sleep itself has an overhead of about 100 times, which is quite equal to other complex synchronization mechanisms like events and semaphores. Polling with Thread.Sleep(0) can achieve a throughput of about 1.500.000 operations per second. But it burns a lot of CPU cycles. Due to the fact that Thread.Sleep(1) will sleep about 1.5 ms, you can’t get much more than 750 operations per second with a real polling mechanism that doesn’t burn a lot of CPU cycles. Polling is a no go for high throughput synchronization mechanisms.       

Reader Writer Lock (ReaderWriterLockSlim)

A reader / writer lock is about two times slower than locked access. It has an overhand of about 15-20 times to the unsynchronized access. The acquiring of the lock itself has nearly no difference if you choose read or write access. It has a throughput of nearly 100.000.000 lock operations per second on standard hardware. It is meant for for resources with a lot of readers and one (or only a view) writers and in this scenarios you should use it.

Auto Reset Event (AutoResetEvent)

An auto reset event is about 100-150 times slower than the unsynchronized access. It has an throughput of about 700.000 operations per second.

Manual Reset Event (ManualResetEvent)

An manual reset event is about 150-200 times slower than the unsynchronized access. It has an throughput of about 500.000 operations per second. You should check if your architecture requires the manual reset, and consider to use the faster auto reset event if possible.

Semaphore (SemaphoreSlim)

An semaphore is about 100-150 times slower than the unsynchronized access and has an throughput of about 700.000 operations per second. There is no alternative to its functions

Mutex (Mutex)

A mutex is about 100-150 times slower than the unsynchronized access and has an throughput of about 700.000 operations per second. There

Thread Static Field ([ThreadStatic] static)

Last but not least the thread static field. It isn’t a synchronization mechanism, but it is a great way to avoid synchronization at all. With a thread static each tread can keep its own object to work with. So there is possibly no need to synchronize with other thread. Thread static field access is about 2-3 times slower than unsynchronized access.

Conclusion

If you have to synchronize a piece of code that runs much longer than 0.1 milliseconds or has a throughput less than 10.000 operations per second, don’t worry about the performance of the synchronization mechanisms. they are all fast enough and will have nearly no performance impact to the rest of your code. It is more important to avoid over-synchronization than to choose the fastest method. If you have to synchronize small operations with a throughput of 100.000+ operations per second, the synchronization mechanism must be optimized to gain not too much impact on the throughput.

Source Code

Here is the C# source code for the tests:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Threading;
namespace ThreadingPerformance
{
    public class StaticFieldTestClass
    {
        public static int StaticField;
        [ThreadStatic]
        public static int ThreadStaticField;
    }
    public class Program
    {
        public static int mtResource; // Resource and methods are public to prevent compiler optimizations
        private static object resourceLock;
        private static ReaderWriterLockSlim resourceReaderWriterLockSlim;
        private static AutoResetEvent resourceAutoResetEvent;
        private static ManualResetEvent resourceManualResetEvent;
        private static Semaphore resourceSemaphore;
        private static Mutex resourceMutex;
        public static void Main(string[] args)
        {
            Console.WriteLine("Performance Tests");
            Console.WriteLine("  Stopwatch Resolution (nS): " + (1000000000.0 /Stopwatch.Frequency).ToString());
        resourceLock = new object();
        resourceReaderWriterLockSlim = new ReaderWriterLockSlim();
        resourceAutoResetEvent = new AutoResetEvent(true);
        resourceManualResetEvent = new ManualResetEvent(true);
        resourceSemaphore = new Semaphore(1,1);
        resourceMutex = new Mutex();
            RunTests(1000000);
            Console.WriteLine("Tests Finished, press any key to stop...");
            Console.ReadKey();
        }
        public static void RunTests(int iterations)
        {
            Console.WriteLine("  Iterations: " + iterations.ToString());
            Stopwatch watch = new Stopwatch();
            Console.WriteLine();
            Console.WriteLine("  Simple (Empty) Call - Bias for all tests");
            SimpleCall();
            SimpleCall();
            watch.Reset();
            watch.Start();
            for (int i = 0; i < iterations; ++i)
                SimpleCall();
            watch.Stop();
            Console.WriteLine("  Simple (Empty) Call Elapsed Time (mS): " + ((double)watch.ElapsedTicks / Stopwatch.Frequency * 1000).ToString());
            Console.WriteLine();
            Console.WriteLine("  Interlocked Call");
            InterlockedCall();
            InterlockedCall();
            watch.Reset();
            watch.Start();
            for (int i = 0; i < iterations; ++i)
                InterlockedCall();
            watch.Stop();
            Console.WriteLine("  Interlocked Call Elapsed Time (mS): " + ((double)watch.ElapsedTicks / Stopwatch.Frequency * 1000).ToString());
            Console.WriteLine();
            Console.WriteLine("  Locked Call");
            LockCall();
            LockCall();
            watch.Reset();
            watch.Start();
            for (int i = 0; i < iterations; ++i)
                LockCall();
            watch.Stop();
            Console.WriteLine("  Locked Call Elapsed Time (mS): " + ((double)watch.ElapsedTicks / Stopwatch.Frequency * 1000).ToString());
            Console.WriteLine();
            Console.WriteLine("  Polling Call");
            PollingCall();
            PollingCall();
            watch.Reset();
            watch.Start();
            for (int i = 0; i < iterations; ++i)
                PollingCall();
            watch.Stop();
            Console.WriteLine("  Polling Call Elapsed Time (mS): " + ((double)watch.ElapsedTicks / Stopwatch.Frequency * 1000).ToString());
            Console.WriteLine();
            Console.WriteLine("  ReaderWriterLockSlim Read Call");
            ReaderWriterLockSlimReadCall();
            ReaderWriterLockSlimReadCall();
            watch.Reset();
            watch.Start();
            for (int i = 0; i < iterations; ++i)
                ReaderWriterLockSlimReadCall();
            watch.Stop();
            Console.WriteLine("  ReaderWriterLockSlim Read Call Elapsed Time (mS): " + ((double)watch.ElapsedTicks / Stopwatch.Frequency * 1000).ToString());
            Console.WriteLine();
            Console.WriteLine("  ReaderWriterLockSlim Write Call");
            watch.Reset();
            watch.Start();
            for (int i = 0; i < iterations; ++i)
                ReaderWriterLockSlimWriteCall();
            watch.Stop();
            Console.WriteLine("  ReaderWriterLockSlim Write Call Elapsed Time (mS): " + ((double)watch.ElapsedTicks / Stopwatch.Frequency * 1000).ToString());
            Console.WriteLine();
            Console.WriteLine("  AutoResetEvent Call");
            watch.Reset();
            watch.Start();
            for (int i = 0; i < iterations; ++i)
                AutoResetEventCall();
            watch.Stop();
            Console.WriteLine("  AutoResetEvent Call Elapsed Time (mS): " + ((double)watch.ElapsedTicks / Stopwatch.Frequency * 1000).ToString());
            Console.WriteLine();
            Console.WriteLine("  ManualResetEvent Call");
            watch.Reset();
            watch.Start();
            for (int i = 0; i < iterations; ++i)
                ManualResetEventCall();
            watch.Stop();
            Console.WriteLine("  ManualResetEvent Call Elapsed Time (mS): " + ((double)watch.ElapsedTicks / Stopwatch.Frequency * 1000).ToString());
            Console.WriteLine();
            Console.WriteLine("  Semaphore Call");
            watch.Reset();
            watch.Start();
            for (int i = 0; i < iterations; ++i)
                SemaphoreCall();
            watch.Stop();
            Console.WriteLine("  Semaphore Call Elapsed Time (mS): " + ((double)watch.ElapsedTicks / Stopwatch.Frequency * 1000).ToString());
            Console.WriteLine();
            Console.WriteLine("  Mutex Call");
            watch.Reset();
            watch.Start();
            for (int i = 0; i < iterations; ++i)
                MutexCall();
            watch.Stop();
            Console.WriteLine("  Mutex Call Elapsed Time (mS): " + ((double)watch.ElapsedTicks / Stopwatch.Frequency * 1000).ToString());
            Console.WriteLine();
            Console.WriteLine("  StaticField Setter Call");
            StaticFieldSetter();
            StaticFieldSetter();
            watch.Reset();
            watch.Start();
            for (int i = 0; i < iterations; ++i)
                StaticFieldSetter();
            watch.Stop();
            Console.WriteLine("  StaticField Setter Call Elapsed Time (mS): " + ((double)watch.ElapsedTicks / Stopwatch.Frequency * 1000).ToString());
            Console.WriteLine();
            Console.WriteLine("  ThreadStaticField Setter");
            ThreadStaticFieldSetter();
            ThreadStaticFieldSetter();
            watch.Reset();
            watch.Start();
            for (int i = 0; i < iterations; ++i)
                ThreadStaticFieldSetter();
            watch.Stop();
            Console.WriteLine("  ThreadStaticField Setter Call Elapsed Time (mS): " + ((double)watch.ElapsedTicks / Stopwatch.Frequency * 1000).ToString());
            Console.WriteLine();
            Console.WriteLine("  StaticField Getter Call");
            StaticFieldGetter();
            StaticFieldGetter();
            watch.Reset();
            watch.Start();
            for (int i = 0; i < iterations; ++i)
                StaticFieldGetter();
            watch.Stop();
            Console.WriteLine("  StaticField Getter Call Elapsed Time (mS): " + ((double)watch.ElapsedTicks / Stopwatch.Frequency * 1000).ToString());
            Console.WriteLine();
            Console.WriteLine("  ThreadStaticField Getter");
            ThreadStaticFieldGetter();
            ThreadStaticFieldGetter();
            watch.Reset();
            watch.Start();
            for (int i = 0; i < iterations; ++i)
                ThreadStaticFieldGetter();
            watch.Stop();
            Console.WriteLine("  ThreadStaticField Getter Call Elapsed Time (mS): " + ((double)watch.ElapsedTicks / Stopwatch.Frequency * 1000).ToString());
        }
        public static void SimpleCall()
        {
            ++mtResource;
        }
        public static void InterlockedCall()
        {
            Interlocked.Increment(ref mtResource);
        }
        public static void LockCall()
        {
            lock (resourceLock)
            {
                ++mtResource;
            }
        }
        public static void PollingCall()
        {
            Thread.Sleep(0);
            lock (resourceLock)
            {
                ++mtResource;
            }
        }
        public static void ReaderWriterLockSlimReadCall()
        {
            resourceReaderWriterLockSlim.EnterReadLock();
            ++mtResource;
            resourceReaderWriterLockSlim.ExitReadLock();
        }
        public static void ReaderWriterLockSlimWriteCall()
        {
            resourceReaderWriterLockSlim.EnterWriteLock();
            ++mtResource;
            resourceReaderWriterLockSlim.ExitWriteLock();
        }
        public static void AutoResetEventCall()
        {
            resourceAutoResetEvent.WaitOne();
            ++mtResource;
            resourceAutoResetEvent.Set();
        }
        public static void ManualResetEventCall()
        {
            resourceManualResetEvent.WaitOne();
            resourceManualResetEvent.Reset();
            ++mtResource;
            resourceManualResetEvent.Set();
        } 
        public static void SemaphoreCall()
        {
            resourceSemaphore.WaitOne();
            ++mtResource;
            resourceSemaphore.Release();
        }  
        public static void MutexCall()
        {
            resourceMutex.WaitOne();    
            ++mtResource;
            resourceMutex.ReleaseMutex();
        }
        public static void StaticFieldSetter()
        {
            StaticFieldTestClass.StaticField = ++mtResource;
        }
        public static void StaticFieldGetter()
        {
            mtResource += StaticFieldTestClass.StaticField;
        }
        public static void ThreadStaticFieldSetter()
        {
            StaticFieldTestClass.ThreadStaticField = ++mtResource;
        }
        public static void ThreadStaticFieldGetter()
        {
           mtResource += StaticFieldTestClass.ThreadStaticField;
        } 
    }
}

 

TimeSpan Calculation based on DateTime is a Performance Bottleneck

Some so small like DateTime.Now can be a bottleneck. On a typical windows system the Environment.TickCount is at least 100 times faster. You don’t believe it? Try it yourself! Here is the test code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace TimerPerformance
{
    using System.Diagnostics;
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Performance Tests");
            Console.WriteLine("  Stopwatch Resolution (nS): " + (1000000000.0 / Stopwatch.Frequency).ToString());
            RunTests();
            Console.WriteLine("Tests Finished, press any key to stop...");
            Console.ReadKey();
        }
        public static long DummyValue;
        public static void RunTests()
        {
            const int loopEnd = 1000000;
            Stopwatch watch = new Stopwatch();
            Console.WriteLine();
            Console.WriteLine("Reference Loop (NOP) Iterations: " + loopEnd);
            watch.Reset();
            watch.Start();
            for (int i = 0; i < loopEnd; ++i)
            {
                DummyValue += i;
            }
            watch.Stop();
            Console.WriteLine("  Reference Loop (NOP) Elapsed Time (ms): " + ((double)watch.ElapsedTicks / Stopwatch.Frequency * 1000).ToString());
            Console.WriteLine();
            Console.WriteLine("Query Environment.TickCount");
            watch.Reset();
            watch.Start();
            for (int i = 0; i < loopEnd; ++i)
            {
                DummyValue += Environment.TickCount;
            }
            watch.Stop();
            Console.WriteLine("  Query Environment.TickCount Elapsed Time (ms): " + ((double)watch.ElapsedTicks / Stopwatch.Frequency * 1000).ToString());
            Console.WriteLine();
            Console.WriteLine("Query DateTime.Now.Ticks");
            watch.Reset();
            watch.Start();
            for (int i = 0; i < loopEnd; ++i)
            {
                DummyValue += DateTime.Now.Ticks;
            }
            watch.Stop();
            Console.WriteLine("  Query DateTime.Now.Ticks Elapsed Time (ms): " + ((double)watch.ElapsedTicks / Stopwatch.Frequency * 1000).ToString());
            Console.WriteLine();
            Console.WriteLine("Query Stopwatch.ElapsedTicks");
            watch.Reset();
            watch.Start();
            for (int i = 0; i < loopEnd; ++i)
            {
                DummyValue += watch.ElapsedTicks;
            }
            watch.Stop();
            Console.WriteLine("  Query Stopwatch.ElapsedTicks Elapsed Time (ms): " + ((double)watch.ElapsedTicks / Stopwatch.Frequency * 1000).ToString());
        }
        
    }
}

Here we have the Results for some machines (1.000.000 Iterations, in milliseconds):

Hardware Empty Loop Environment.TickCount DateTime.Now.Ticks
 AMD Opteron 4174 HE 2.3 GHz  8.7 ms  16.6 ms  2227 ms
 AMD Athlon 64 X2 5600+ 2.9 GHz 6.8 ms  15.1 ms  1265 ms
 Intel Core 2 Quad Q9550 2.83 GHz 2.1 ms  4.9 ms  557.8 ms
Azure A1 (Intel Xeon E5-2660 2.2 GHz) 5.2 ms 19.9 ms 168.1 ms

Ok, the single request will only take about 1-2 microseconds for the DateTime.Now call. This is a maximum throughput of 500.000 to 1.000.000 calls per second. In contrast the Environment.TickCount has a maximum throughput of about 600.000.000 calls per second. If a particular operation needs 10 timestamps, it has an maximum throughput of 50.000 operations only because of DateTime.Now.For example an HTTP request that measures response time and throughput (data transfer rate) needs a timestamp for every chunk of data it receives from the web server. Until the operation completes there are at least 3 timestamps (begin, response, end) to measure response time and download time. If the throughput (data transfer rate) is measured it all depends on how many chunks are received. This is even bad for multi-threaded access. Both, Environment.Tick Count and DateTime.Now are shared resources. All calls must go to the synchronization mechanism of them, which means they are nor parallelized.   

Real systems like the Crawler-Lib Engine can perform 20.000 – 30.000 HTTP requests per second on relatively good hardware. So it is obvious that the time measurement cause an impact on the maximum throughput.

Some will argue, that DateTime.Now is much more precise than Environment.TickCount. This is partially true. Here we have a code snippet that measures the granularity of timestamps:

if( Environment.TickCount > int.MaxValue - 60000) throw new InvalidOperationException("Tick Count will overflow in the next minute, test can't be run");
var startTickCount = Environment.TickCount;
var currentTickCount = startTickCount;
int minGranularity = int.MaxValue;
int maxGranularity = 0;
while (currentTickCount < startTickCount + 1000)
{
    var tempMeasure = Environment.TickCount;
    if (tempMeasure - currentTickCount > 0)
    {
        minGranularity = Math.Min(minGranularity, tempMeasure - currentTickCount);
        maxGranularity = Math.Max(maxGranularity, tempMeasure - currentTickCount);
    }
    currentTickCount = tempMeasure;
    Thread.Sleep(0);
}
Console.WriteLine("Environment.TickCount Min Granularity: " + minGranularity + ", Max Granularity: " + maxGranularity + " ms");
Console.WriteLine();
var startTime = DateTime.Now;
var currentTime = startTime;
double minGranularityTime = double.MaxValue;
double maxGranularityTime = 0.0;
while (currentTime < startTime + new TimeSpan(0, 0, 1))
{
    var tempMeasure = DateTime.Now;
    if ((tempMeasure - currentTime).TotalMilliseconds > 0)
    {
        minGranularityTime = Math.Min(minGranularityTime, (tempMeasure - currentTime).TotalMilliseconds);
        maxGranularityTime = Math.Max(maxGranularityTime, (tempMeasure - currentTime).TotalMilliseconds);
    }
    currentTime = tempMeasure;
    Thread.Sleep(0);
}
Console.WriteLine("DateTime Min Granularity: " + minGranularityTime + ", Max Granularity: " + maxGranularityTime + " ms");

Running this on several machines shows that Environment.TickCount has a granularity of about 16ms (15.6 ms) which is the default  system wide timer resolution. The system wide timer resolution can be changed with the timeBeginPeriod function down to 1ms, but this is not generally recommended because it affects all applications. DateTime.Now has on some machines the granularity of 16ms on other machines a better granularity down to 1ms. But it is never much better than 1ms. If you need to measure smaller times, you have to use the System.Diagnostics.Stopwatch class witch is in fact an high resolution timer.  

As a consequence the Crawler-Lib Framework uses Environment.TickCount for timestamps that are needed to measure durations for responses or tasks or whatever. Soon we release the Crawler-Lib Core library for free, which contains a TickTimestamp class that can be used for duration and throughput computations.