Yesterday I tried a mutex-based implementation of an atomic counter to see how it compares.
My first test was to read the atomic counter 2^20 (1024x1024) times from each core in a tight loop. Times are wall-clock on the host.
RxC   Interrupt   Mutex
----  ----------  ------
1x1   0.193488s   0.114833s
1x2   0.317739s   0.122575s
2x1   0.317739s   0.121737s
4x1   0.393244s   0.298871s
1x4   0.393244s   0.361574s
4x2   0.542462s   1.122173s
2x4   0.543283s   0.903163s
4x4   0.849627s   3.493985s
Interesting to note that the orientation of a single line of cores makes a difference. It may have something to do with using core 0,0 as the location of the mutex. Also of interest is that the 4x4 case accesses the atomic counter 16x as many times as the 1x1 case - here the low-bandwidth interrupt implementation scales better than linearly compared to the mutex implementation, because the requesting core effectively batches up multiple requests if they come too fast, rather than everything having to be serialised remotely.
But as can be seen - once more than four cores are in play the interrupt-driven routine starts to win out comfortably. This is despite it effectively blocking the host core while the others are running.
But this isn't a very useful test, because no practical software simply increments a remote counter in a tight loop. So for something more realistic I added a delay to the loop using one of the ctimers.
So looking at the most congested case of all cores busy:
4x4

Delay   Interrupt   Mutex
------  ----------  ------
10      0.965649s   3.138225s
100     1.083733s   3.919083s
200     1.630165s   3.693539s
300     1.780689s   3.792168s
400     2.297966s   3.666745s
500     2.448892s   3.563474s
1000    3.840059s   1.851269s
2000    4.923238s   3.402963s
So the cross-over point is somewhere around 1000 cycles worth of work per iteration, at least on a 16-core machine. 1000 cycles isn't much.
Given this data, it's probably better to go with the mutex implementation after all. It requires about 1/3 of the code and doesn't load the servicing core nearly as much. Oh well, worth a try. (Hang on, that doesn't let the host participate ... argh.)
I have no direct data on the mesh load or on the fairness of the count distribution. Intuitively the mutex-based implementation won't be as fair because of the way the routing works, but once each core has enough work to do the network shouldn't be particularly busy.
I'm hoping a future hardware revision may include an option on the ctimer to count reads of the ctimer register - which would effectively provide two atomic counters per core in hardware. In the meantime there's always that FPGA (maybe that's something I'll look at; I just have no experience there).