Michael Zucchi

 B.E. (Comp. Sys. Eng.)

  also known as zed
  & handle of notzed


android (44)
beagle (63)
biographical (87)
blogz (7)
business (1)
code (63)
cooking (30)
dez (7)
dusk (30)
ffts (3)
forth (3)
free software (4)
games (32)
gloat (2)
globalisation (1)
gnu (4)
graphics (16)
gsoc (4)
hacking (434)
haiku (2)
horticulture (10)
house (23)
hsa (6)
humour (7)
imagez (28)
java (224)
java ee (3)
javafx (48)
jjmpeg (77)
junk (3)
kobo (15)
libeze (7)
linux (5)
mediaz (27)
ml (15)
nativez (8)
opencl (119)
os (17)
parallella (97)
pdfz (8)
philosophy (26)
picfx (2)
playerz (2)
politics (7)
ps3 (12)
puppybits (17)
rants (137)
readerz (8)
rez (1)
socles (36)
termz (3)
videoz (6)
wanki (3)
workshop (3)
zcl (1)
zedzone (21)
Thursday, 06 October 2011, 15:31

Images vs Arrays 4

Update 7/10/11: I uploaded the array convolution generator to socles

And so it goes ...

I've got a fairly convoluted convolution algorithm for performing a complex wavelet transform and I was looking to re-do it. Part of that re-doing is to move to using arrays rather than image types.

I got a bit side-tracked whilst revisiting convolutions again ... I started with the generator from socles for separable convolution and modified it to work with arrays too. Then I tried a couple of ideas and timed a whole bunch of runs.

One idea I wanted to try was using a rolling buffer to reduce the memory load for the Y convolution. I also wanted to see if using more work-items in a local workgroup to simplify the local memory load would help or hinder. Otherwise it was pretty much just getting an array implementation working. As is often the case I haven't fully tested these actually work, but i'm reasonably confident they should as i fixed a few bugs along the way.

The candidates

This is a simple implementation which uses local memory and a work-group size of 64x4. 128x4 words of data are loaded into the local memory, and then 64x4 results are generated in parallel purely from the local memory.
This uses no local memory, and just steps through the addresses vertically, producing 64x4 results concurrently. As all memory loads are coalesced it runs quite well.
This version tries to use extra work-items just to load the memory, after wards only using 64x4 threads. In some testing I had for small jobs this seemed to be a win, but for larger jobs it is a big hit to concurrency.
This version uses a 64x4 `rolling buffer' to cache image values for all items in the work-group. For each row of the convolution, the data is loaded once rather than 4x.
imagex, imagey
Is from the socles implementation in ConvolveXYGenerator which uses local memory to cache input data.
simplex, simpley
Is from the socles implementation in ConvolveXYGenerator which relies on the texture cache only.
Is a version of convolvex_a which attempts to only load the amount of memory it needs, rather than doing a full work-group width each time.
Is a version of convolvex_a which uses simple vector types for the local cache, rather than flattening all access to 32-bits to avoid bank conflicts. It is particularly poor with 4-channel input.

The array code implements CLAMP_TO_EDGE for source reads. The image code uses a 16x16 worksize, the array code 64x4. The image data is FLOAT format, and 1, 2, or 4 channels wide. The array data is float, float2, or float4. Images and arrays represent a 512x512 image. GPU is Nvidia GTX 480.


The timing results - all timings are in micro-seconds as taken from computeprof. Most were invoked for 1, 2, or 4 channels and a batch size of 1 or 4. Image batches are implemented by multiple invocations.

                        batch=1                 batch= 4
channels                1       2       4       1       2       4

convolvex_a             42      58      103     151     219     398
convolvey_a             59      70      110     227     270     429

convolvex_b             48      70      121     182     271     475
convolvey_b             85      118     188     327     460     738

imagex                  61      77      110     239     303     433
imagey                  60      75      102     240     301     407

simplex                 87      88      169
simpley                 87      87      169

convolvex_a (limit)     44      60      95      160     220     366
convolvex_a (vec)               58      141


When I get time and work out how i want to do it i'll drop the array code into socles.

Tagged opencl, socles.
Java v OpenCL/CPU | Images vs Arrays 3
Copyright (C) 2019 Michael Zucchi, All Rights Reserved. Powered by gcc & me!