Willkommen, Bienvenue, Welcome... Sziasztok!

Welcome to The Lotus Position, an intermittent collection of extempore navel gazings, ponderings, whinges, whines, pontifications and diatribes.

Everything is based on a Sample of One: these are my views, my experiences... caveat lector... read the Disclaimer

The Budapest Office - Castro Bisztro, Madach ter
Ponder, Scribble, Ponder (Photo Erdotahi Aron)

Monday 11 May 2009

"Nearly" Done

Yes, "IT" is nearly done (watch this space, seriously) - but I mean the story, not the whole project (still have corrections and proofreading to do but it will be the first time there has been an end-end story). However this post is not so much about literary achievement as nerdulent smugness...

As part of the aforementioned forthcoming proofing process the book will be checked for various grammatical, syntactic, rhythmic, stylistic and other infelicities - such as overuse of certain phrases and words, and the use of words just too obscure to be easily digested.

To this end I wrote some concordance code - i.e. a program that works through the whole text and accumulates lists of all single words, all two-word pairs and all three-word combinations, linking each back to the source text; the lists are then sorted alphabetically and placed in a spreadsheet (which links back to the text so that I can inspect individual occurrences).
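For the terminally curious, the accumulation step looks roughly like the sketch below - a simplified reconstruction rather than the actual code, assuming the text has already been split into a Words() array and using a Scripting.Dictionary to collect the position of every word, pair and triple:

    ' Simplified sketch only - not the real code. Assumes the text has
    ' already been split into a Words() array of individual words.
    Sub BuildConcordance(Words() As String)
        Dim singles As Object, pairs As Object, triples As Object
        Set singles = CreateObject("Scripting.Dictionary")
        Set pairs = CreateObject("Scripting.Dictionary")
        Set triples = CreateObject("Scripting.Dictionary")

        Dim i As Long
        For i = LBound(Words) To UBound(Words)
            AddOccurrence singles, Words(i), i
            If i + 1 <= UBound(Words) Then
                AddOccurrence pairs, Words(i) & " " & Words(i + 1), i
            End If
            If i + 2 <= UBound(Words) Then
                AddOccurrence triples, Words(i) & " " & Words(i + 1) & " " & Words(i + 2), i
            End If
        Next i

        ' ...then sort each dictionary's keys alphabetically and write
        ' them to the spreadsheet, using the stored positions as links
        ' back into the text.
    End Sub

    Private Sub AddOccurrence(dict As Object, phrase As String, position As Long)
        ' Keep a list of word positions for each phrase so individual
        ' occurrences can be inspected later.
        If Not dict.Exists(phrase) Then dict.Add phrase, New Collection
        dict(phrase).Add position
    End Sub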

Now, when I wrote the code, the sort algorithm was very simple - and slow. So I improved it with some cool tricks that sorted all three groups at once and did a few other neat things, but since I didn't really investigate sort algorithms, even the improved code was slow - just not quite as slow as it had been (and it's running in VBA as well): there's a note in the code to the effect that "by the time the book reaches 100,000 words this will be soooooo slow".

Now the book is >300,000 words. I started a concordance running on Saturday morning on the old, old 1.4 GHz Athlon-powered desktop PC, knowing it was going to take an age, but even I was shocked. Eventually I discovered that the sort alone took over 29 hours to finish (and tabulating the results into Excel took a further 8 hours or so), but even while it was running it seemed desirable to rewrite the code.

I'd found and implemented a QuickSort a long time ago, and decided to rejig it for the present purpose. It took a couple of hours, but after some preliminary testing (and still the other PC was chugging away, indirectly consuming gigatonnes of CO2 even as it performed a glacial sort) I was ready to try it out on "IT".
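For reference, the shape of the thing is the standard in-place QuickSort - the sketch below is a generic textbook version on a string array, not my actual routine:

    ' Generic in-place QuickSort on a string array - a textbook version,
    ' not the routine from the book project.
    Sub QuickSortStrings(a() As String, ByVal first As Long, ByVal last As Long)
        Dim i As Long, j As Long
        Dim pivot As String, tmp As String

        i = first
        j = last
        pivot = a((first + last) \ 2)

        ' Partition: move everything below the pivot to the left,
        ' everything above it to the right.
        Do While i <= j
            Do While a(i) < pivot
                i = i + 1
            Loop
            Do While a(j) > pivot
                j = j - 1
            Loop
            If i <= j Then
                tmp = a(i): a(i) = a(j): a(j) = tmp
                i = i + 1
                j = j - 1
            End If
        Loop

        ' Recurse into the two partitions.
        If first < j Then QuickSortStrings a, first, j
        If i < last Then QuickSortStrings a, i, last
    End Sub

    ' Typical call, sorting the whole word list:
    ' QuickSortStrings wordList, LBound(wordList), UBound(wordList)

The win comes from the divide-and-conquer partitioning: on average QuickSort does on the order of n log n comparisons, instead of the roughly n squared that a naive sort needs - which is exactly the sort of difference that turns 29 hours into minutes.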

Everything sorted in about three and a half minutes! Direct comparison showed that the new sort was 500x faster, but after allowing for the fact that the new code was running on my Core 2 Duo laptop, which is about 5x faster than the old desktop, that's still a 100x improvement!

Result: QuickSort really is quick, even with very large arrays.

However, the interesting result from perusing the 80 MB of concordances generated was that the vocabulary of the book seems quite limited, and there weren't as many outrageous words as I had thought.

There are about 15,000 unique words, a number that comes down to about 9,000 once the roots of the words have been identified, so that "decay", "decays", "decayed", "decaying" etc. are counted as one item of vocabulary (I used Porter Stemming code I found somewhere and ported to VBA). So, maybe it won't be quite as linguistically challenging as I had feared... at least vocabularistically.
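The root-counting itself is conceptually trivial - just collapse the word list by stem, something like the sketch below, where PorterStem stands in for the ported stemming routine (the name is a placeholder, not the real code):

    ' Rough sketch of the root-counting: collapse the unique word list by
    ' stem. PorterStem is a placeholder for the ported Porter Stemming
    ' routine - the name and signature here are not the real code.
    Function CountRoots(uniqueWords() As String) As Long
        Dim roots As Object
        Set roots = CreateObject("Scripting.Dictionary")

        Dim i As Long, stem As String
        For i = LBound(uniqueWords) To UBound(uniqueWords)
            stem = PorterStem(LCase$(uniqueWords(i)))  ' placeholder stemmer call
            If Not roots.Exists(stem) Then roots.Add stem, 0
            roots(stem) = roots(stem) + 1              ' "decay", "decays", "decayed" share one root
        Next i

        CountRoots = roots.Count
    End Function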

That's all for now - have to Do if it will be Done.

Marathon Stuff.
