Department of InformatiX
Microsoft .NET Micro Framework Tools & Resources

I spent so long wondering whether to do things this way or that way that I finally decided to settle it once and for all and measure the real performance differences. These are by no means scientific tests from a lab you'd see on TV – just a simple application anyone can run. I was interested in the results, so I thought someone might find them useful too.

Testing

The tests were done on Meridian/P, using a release build, with a debugger attached. The code of the console application was as follows:

using System;
using Microsoft.SPOT;

namespace MFConsoleApplication1
{
    public class Program
    {
        public static void Main()
        {
            DateTime start;
            TimeSpan end1, end2, end3;

            start = DateTime.Now;
            Test1();
            end1 = DateTime.Now - start;

            start = DateTime.Now;
            Test1();
            end2 = DateTime.Now - start;

            start = DateTime.Now;
            Test1();
            end3 = DateTime.Now - start;

            Debug.Print("Test1: " + end1 + "," + end2 + "," + end3);

            start = DateTime.Now;
            Test2();
            end1 = DateTime.Now - start;

            start = DateTime.Now;
            Test2();
            end2 = DateTime.Now - start;

            start = DateTime.Now;
            Test2();
            end3 = DateTime.Now - start;

            Debug.Print("Test2: " + end1 + "," + end2 + "," + end3);
        }

        ...
    }
}

Results

The differences we are interested in are really small – microseconds or less. Such small intervals are difficult to measure, so we have to repeat the operations many times to get some conclusive numbers. I chose to repeat the instructions 100,000 times, so that the results are around one second. This is long enough to hide the CLR's scheduling, interrupts and other noise, and short enough not to fall asleep during testing. Moreover, as you can see, I ran every test three times to verify consistency between results.
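For anyone who wants to reproduce this repeat-and-time pattern outside the Micro Framework, here is a minimal sketch in standard C++; the xor in the loop body is a hypothetical stand-in for whatever tiny operation is being measured:

```cpp
#include <chrono>

// Repeat a tiny operation 100,000 times so its cost adds up to something
// measurable, and return the elapsed time in microseconds. The xor here
// is just a hypothetical stand-in for the operation under test; volatile
// keeps the compiler from optimizing the loop away.
inline long long TimeXorLoop()
{
    volatile int sink = 0;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000; i++)
        sink = sink ^ 1;
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

As in the article, call it three times in a row and compare the readings to check for consistency.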

I've sorted the cases in ascending order by the difference observed.

Negating boolean variables.

This one is actually the only one that surprised me, although the difference is really negligible. Negation seems to be more costly than xor. I wonder whether !b would be more costly than b == false...

Xor:
private static void Test1()
{
    bool b = false;
    for (int i = 0; i < 100000; i++)
        b ^= true;
}

Negation:
private static void Test2()
{
    bool b = false;
    for (int i = 0; i < 100000; i++)
        b = !b;
}

Xor:      1,018.4682 ms, 1,018.2285 ms, 1,017.6350 ms
Negation: 1,021.5566 ms, 1,020.7945 ms, 1,020.7081 ms
Average gain: 2.9091 ms

Incrementing variable.

A classical C++ lesson. The post-increment is expected to be slower because it needs to allocate a temporary copy of the value, increment the value, and return the copy; the pre-increment just increments the value and returns it directly. Note that the compilation was optimized and the result is not used, so the difference may usually be bigger. Still, interestingly enough, there is a tiny one.

Pre-increment:
private static void Test1()
{
    int a = 0, b = 0;
    for (int i = 0; i < 100000; i++)
        ++a;
}

Post-increment:
private static void Test2()
{
    int a = 0, b = 0;
    for (int i = 0; i < 100000; i++)
        a++;
}

Pre-increment:  921.2818 ms, 921.0234 ms, 920.4767 ms
Post-increment: 924.2776 ms, 923.6813 ms, 923.5972 ms
Average gain: 2.9247 ms
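The semantic difference is easy to demonstrate outside the benchmark. A minimal C++ sketch (the behavior is the same in C#):

```cpp
// Post-increment yields the old value and then increments;
// pre-increment increments first and yields the new value.
inline int PostIncrementResult() { int a = 5; return a++; } // returns 5, a becomes 6
inline int PreIncrementResult()  { int a = 5; return ++a; } // returns 6
```

When the returned value is discarded, an optimizing compiler can usually eliminate the temporary copy entirely, which is why the measured gap above is so small.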

Parameter size.

The processors we run the .NET Micro Framework on are 32-bit, which means the registers are 32-bit. So naturally, there is extra work when dealing with long (and ulong and double) types, which are 64-bit. But what about smaller types? Does it cost anything to pass, e.g., bytes?

Byte parameter:
private static void Test1()
{
    for (int i = 0; i < 100000; i++)
        ByteMethod(0);
}

private static void ByteMethod(byte b) { }

Int32 parameter:
private static void Test2()
{
    for (int i = 0; i < 100000; i++)
        IntMethod(0);
}

private static void IntMethod(int i) { }

Byte parameter:  1,438.9812 ms, 1,434.6358 ms, 1,431.9616 ms
Int32 parameter: 1,430.8419 ms, 1,431.9995 ms, 1,427.5788 ms
Average gain: 5.0528 ms

Shifting vs. dividing.

Shifting is very easy to implement in hardware (using flip-flops), while division is much more difficult to realize. Thus it is not surprising that the former is faster.

Shifting:
private static void Test1()
{
    int a = 0, b = 0;
    for (int i = 0; i < 100000; i++)
        b = a >> 1;
}

Dividing:
private static void Test2()
{
    int a = 0, b = 0;
    for (int i = 0; i < 100000; i++)
        b = a / 2;
}

Shifting: 921.2823 ms, 921.1136 ms, 920.4909 ms
Dividing: 991.2865 ms, 990.5353 ms, 990.4636 ms
Average gain: 69.7995 ms
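One caveat worth knowing before replacing divisions with shifts: the two forms agree only for non-negative operands. A minimal C++ sketch (C# defines the same arithmetic shift for int; in C++ the arithmetic right shift of a negative value is only guaranteed since C++20, though virtually all compilers behave this way):

```cpp
// a >> 1 rounds toward negative infinity, while a / 2 rounds toward
// zero, so the two agree only for non-negative values of a.
inline int HalfByShift(int a)  { return a >> 1; }
inline int HalfByDivide(int a) { return a / 2; }
```

For example, HalfByShift(6) and HalfByDivide(6) both give 3, but HalfByShift(-7) gives -4 while HalfByDivide(-7) gives -3.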

Odd number testing.

The remainder of a division is a bit faster to get than the quotient, but masking is way faster than shifting. So examining the last bit has a clear favorite:

Modulo:
private static void Test1()
{
    int a = 0, b = 0;
    for (int i = 0; i < 100000; i++)
        a = b % 2;
}

Masking:
private static void Test2()
{
    int a = 0, b = 0;
    for (int i = 0; i < 100000; i++)
        a = b & 1;
}

Modulo:  985.8108 ms, 985.4435 ms, 984.9099 ms
Masking: 900.0148 ms, 899.3200 ms, 899.3257 ms
Average gain: 85.8346 ms
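A related caveat for odd-number tests: with negative operands, n % 2 yields -1 for odd values (the remainder takes the sign of the dividend), so the modulo version must compare against zero rather than 1. A minimal C++ sketch of both predicates (assuming two's complement, which C# guarantees):

```cpp
// Both predicates report whether n is odd, i.e. whether its lowest bit
// is set. Comparing n % 2 != 0 (rather than == 1) keeps the modulo
// version correct for negative n, where n % 2 evaluates to -1.
inline bool IsOddByModulo(int n) { return n % 2 != 0; }
inline bool IsOddByMask(int n)   { return (n & 1) != 0; }
```

Both agree for every int, including negatives, which the naive n % 2 == 1 test would not.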

Iterations.

So here it is: how much slower is foreach than for? For those who don't know why: foreach needs to create an instance of an enumerator, and on every cycle its MoveNext() method is called and the Current property is read.

static byte[] array = new byte[100000];

For:
private static void Test1()
{
    int a = 0;
    for (int i = 0; i < 100000; i++)
        a = array[i];
}

Foreach:
private static void Test2()
{
    int a = 0;
    foreach (int value in array)
        a = value;
}

For:     1,207.0700 ms, 1,205.8343 ms, 1,205.6557 ms
Foreach: 1,431.5822 ms, 1,430.1819 ms, 1,430.3134 ms
Average gain: 224.5058 ms

Swapping variables.

Until recently I thought that swapping variables using xor is cool. Then, at university, I was told that there are people who think that swapping variables using xor is cool – though it is usually slower. Damn – indeed. (I still reserve the right to think that swapping pointers to large objects using xor is cool.)

Xor:
private static void Test1()
{
    int a = 0, b = 0;
    for (int i = 0; i < 100000; i++)
    {
        a = a ^ b;
        b = a ^ b;
        a = a ^ b;
    }
}

Temporary variable:
private static void Test2()
{
    int a = 0, b = 0;
    for (int i = 0; i < 100000; i++)
    {
        int c = b;
        b = a;
        a = c;
    }
}

Xor:                1,409.4336 ms, 1,408.6215 ms, 1,408.3136 ms
Temporary variable: 1,116.5805 ms, 1,115.8080 ms, 1,115.7313 ms
Average gain: 292.7497 ms
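Besides being slower here, the xor swap has a correctness trap that the temporary-variable swap does not: if both operands refer to the same variable, the first xor zeroes it and the value is lost. A minimal C++ sketch:

```cpp
// Xor swap: works for two distinct variables, but zeroes the value when
// both references alias the same variable, because a ^ a == 0.
inline void XorSwap(int& a, int& b) { a ^= b; b ^= a; a ^= b; }

// Temporary-variable swap: aliasing-safe.
inline void TempSwap(int& a, int& b) { int c = b; b = a; a = c; }
```

XorSwap(x, x) leaves x equal to 0, while TempSwap(x, x) leaves it unchanged – something to keep in mind when the swap arguments come from pointers or indices that might coincide.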

Field vs. property.

The second obvious thing, after the foreach stuff, is to prefer fields to properties (properties are compiled into get_ and set_ methods), but I hadn't realized the difference was that significant. It's like 12 μs per single call!

static Test test = new Test();

class Test
{
    public int Field;
    public int Property { get { return Field; } }
}

Field:
private static void Test1()
{
    int a = 0;
    for (int i = 0; i < 100000; i++)
        a = test.Field;
}

Property:
private static void Test2()
{
    int a = 0;
    for (int i = 0; i < 100000; i++)
        a = test.Property;
}

Field:    1,106.4459 ms, 1,104.4360 ms, 1,104.3929 ms
Property: 2,277.5279 ms, 2,281.7123 ms, 2,281.3857 ms
Average gain: 1,175.1170 ms
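To see why the property costs so much, it helps to remember what it compiles to. A C++ sketch of the difference (the method name mirrors the accessor the C# compiler generates automatically):

```cpp
// A field read is a direct memory load; a property read is a call to a
// compiler-generated getter method, which the .NET Micro Framework
// interpreter does not inline away.
struct Test
{
    int Field = 42;
    int get_Property() const { return Field; } // what `Property` compiles into
};
```

test.Field loads the value directly, while test.Property goes through the equivalent of test.get_Property() – a full method call – on every access.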

Conclusion

Apart from the differences between the test cases, we also gained some idea of how costly the operations are relative to each other, and noticed that every test repetition resulted in a faster overall time. Anyway, everything here is for informative purposes only. Although embedded devices should be designed with performance in mind, think twice about whether the code readability is worth the effort. If your application's usability depends on saving a few microseconds, then something else is probably already wrong.

 

Bonus chapter: Bit toggling

How fast can you toggle an output pin? Or, more practically, how short an impulse can you make? And most importantly, if it is not fast enough, will the porting kit help you?

The code to test this is pretty easy:

OutputPort port = new OutputPort(pin, false);
bool value = false;

while (true)
{
    port.Write(value);
    value ^= true;
}

And here are the results (release build, debugger detached):

Managed code, both pulses width of 18.4 μs.

So basically, with a 100 MHz processor, you can usually produce pulses around 20 μs wide using managed code (about 25 kHz).

However, here comes the fact that the .NET Micro Framework is not a real-time system: nobody can guarantee such timing. Nothing prevents the runtime from switching threads in between, or from running the garbage collector. Even if you tried to prevent that, a native interrupt could still come in during the native part of the Write method. See, the cycle above, in a console application containing no other code, produced not only the nice picture we have already seen, but also these (and quite often):

Managed code, low pulse 58.4 μs, high pulse 26.8 μs. Managed code, low pulse 17.6 μs, high pulse 77.6 μs.

Okay, so you need faster pulses, for example to implement the 1-Wire® bus, which represents a 1 by a pulse at most 15 μs wide. What now? Obviously, you can't do that in managed code on this hardware. So you can either get a faster processor, or try to move the code to the native side. But how fast can you work there?

Actually, there are several layers in the porting kit which you can use for this purpose, and it is always a trade-off between speed and abstraction (hardware independence). The shortest pulse possible would require you to find out which processor the hardware platform uses, get its datasheet, find out which register contains the state of the pin you are interested in, and use assembly instructions to toggle it. But hey, this is the .NET Micro Framework – let's keep it hardware independent even on the native side:

while (TRUE)
{
    ::CPU_GPIO_SetPinState(portId, TRUE);
    ::CPU_GPIO_SetPinState(portId, FALSE);
}

And here we go:

Native code, both pulses width of 2.16 μs.

2.16 μs (about 231 kHz for the full square wave) is pretty good, isn't it? Just out of curiosity: because the above snippet does not disable the processor's interrupts either, you again don't get a 100% regular signal:

Native code, low pulse 2.20 μs, high pulse 2.24 μs.

(I have observed pulses from 2.12 μs to 2.56 μs – still pretty good variance.) But this is expected; there is PWM hardware intended for generating regular signals, not this fancy code!
