So it turns out the main problem with the performance issues in my last post wasn't the cache being overloaded after all ... but the compiler running out of registers to unroll a couple of nested loops I asked it to.
It turns out because of a cut and paste job I left in some test code which was limiting the register count for the compiler, and the register spillage was a bit of a disaster and causing the order of magnitude drop. With this fixed things look a lot more promising, although at ~4ms it is still a little slower than I would like and I can't see much being able to be done to improve it.
It's a bit weird that an unconstrained compile for a specific device would choose to use all registers it needs for the code-unrolling, ignoring the device specifics, but I guess this is some particular of the nvidia opencl implementation.
There is still some problems with the texture cache being hit too hard anyway - the integral image has to be stored using the UINT32 data type, and it needs at least 8 lookups per feature tested which is a pretty heavy load to start with. With the other feature tester I could just use 8 bit images which fits better into the cache it seems (contrary to what i'd believed until this point).
Investigations ongoing ...