Michael Zucchi

 B.E. (Comp. Sys. Eng.)

  also known as zed
  & handle of notzed


android (44)
beagle (63)
biographical (87)
blogz (7)
business (1)
code (63)
cooking (30)
dez (7)
dusk (30)
ffts (3)
forth (3)
free software (4)
games (32)
gloat (2)
globalisation (1)
gnu (4)
graphics (16)
gsoc (4)
hacking (434)
haiku (2)
horticulture (10)
house (23)
hsa (6)
humour (7)
imagez (28)
java (224)
java ee (3)
javafx (48)
jjmpeg (77)
junk (3)
kobo (15)
libeze (7)
linux (5)
mediaz (27)
ml (15)
nativez (8)
opencl (119)
os (17)
parallella (97)
pdfz (8)
philosophy (26)
picfx (2)
playerz (2)
politics (7)
ps3 (12)
puppybits (17)
rants (137)
readerz (8)
rez (1)
socles (36)
termz (3)
videoz (6)
wanki (3)
workshop (3)
zcl (1)
zedzone (21)
Monday, 13 February 2012, 14:40

PDFZ searching

So after finding and fixing the bug in my outline binding - a very stupid paste-o - I added a table of content navigator to PDFReader, and then I had a look at search.

Which ... I managed to get working, at least as a start:

As can be seen, it's a little flaky - mupdf is adding spaces here and there in the recovered text, and i'm not sure i'm processing the EOL marker properly (and possibly I have a bug in the search trie code too). But as I said - it's a start.

I decided to use a Trie for the search (Aho-Corasick algorithm) - because I know it's an efficient algorithm, and because I know there was a good implementation in evolution. So I grabbed an old copy from the GPL sources and modified it to work on the mupdf fz_text_span code. Thanks Jeff ;-) Basically it's a state machine that can match multiple (possibly overlapping) words whilst only ever advancing the search stream one character at a time.

I tried to copy the emacs mode of searching to some extent:

I put some code in there to abort the search if the page is changed while it's still searching (because I hooked the search into the page loader/renderer), but on the documents i've tried it on it's been so fast I haven't been able to test it ...

This is one of the first times I used JNI to create complex Java objects from C - the array of results for a given page. It turns out it's fairly clean and simple to do.

I guess the next thing is to see if i can integrate the search functionality into ReaderZ. Time to write one of those horrible on-screen keyboards I guess ...

But for now ... the weather's way too nice to be inside, so I think it's off to the garden, beer in hand ...

Tagged hacking, java, mediaz, pdfz.
Australian politics and the reporting thereof. | PDFZ stuff
Copyright (C) 2019 Michael Zucchi, All Rights Reserved. Powered by gcc & me!