Is anyone curious about vocabulary?
Not in the snobby-stiff-upper-lip kind of way :P Rather, in a just-interested-in-words kind of way. I have an idea for a dynamic website that will process a given chunk of text and render pretty plots about the set of words it contains. Some characteristics of words that I’m inexplicably curious about are the following.
- Alliterative words that begin with the same letter in a sentence or a sequence of sentences in a paragraph. Strictly speaking, these are words that begin with the same initial sound, which is usually (though not always) the same as the first letter. If you haven’t noticed by now, the aforementioned italicised words all begin with s :P It would be fun to find out what their radix tree looks like, similar to the diagram in the Wikipedia article on radix trees. Such lexicographic prefixes of words are just one way of exploring vocabulary or a set of words - that’s how dictionaries are organised, obviously.
- Automatic extraction of word stems to link together families of words and their genealogy. A lot of words obviously have the same meaning but with minor variations due to inflections (i.e., verb conjugation and noun declension). There are some nifty algorithms in the world of natural language processing that do an impressive job (e.g., Porter and Lancaster stemmers).
- Generating the concordance and collocation of a given word or phrase. This can provide an interesting view of how certain phrases and words are used differently or similarly within a piece of text or across multiple texts. Automated translators, such as the ones from Google and Microsoft, rely on this feature of various spoken languages to easily extract common meanings of different phrases (e.g., “Happy Birthday” is a common sentiment that comes about primarily due to the collocation of two or more disjoint words in most languages). Once again, there are cool algorithms for doing this kind of analysis.
- Word frequencies are always interesting, especially when rendered as a word cloud like the one from Wordle. Moreover, the existence of a power-law distribution (strictly speaking, a Zipf distribution) in word frequency plots is among the defining statistical signatures of natural languages compared to synthetic or programming languages. There are fascinating links to how the redundancy of spoken languages is tied to their respective grammars and how these rules have a fractal-like structure that recursively divides into self-similar trees.
- Browsing the meaning of words by traversing their hyponym and hypernym hierarchies. WordNet is an impressive lexical database from Princeton University that contains over 117K synsets (sets of synonyms) with intricate links depicting the various networks of synonyms and lemmas in English nouns, verbs, adjectives and adverbs. If I wanted to trivialise it, then I would say that it’s a souped-up thesaurus :P But it’s far more insightful and informative than that. It’s a sort of ontological map of meanings and concepts that underlie the English language whereby words are just convenient labels for abstractions residing in our minds.
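To make a few of these ideas concrete, here is a rough Python sketch of the simplest ones: word frequencies, first-letter alliteration and a prefix tree. Everything in it is my own assumption rather than any library's API: the function names (`tokenize`, `word_frequencies`, `alliterations`, `build_trie`) are made up for illustration, the tokeniser is a crude regular expression, alliteration is approximated by the first letter rather than the first sound, and the tree is a plain character trie rather than a compressed radix tree.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and extract words (letters, optional inner apostrophe)."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)?", text.lower())


def word_frequencies(text):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common()


def alliterations(sentence, min_words=2):
    """Group the distinct words of a sentence by first letter.

    Only groups with at least `min_words` words are kept. This uses the
    first letter as a stand-in for the first sound, which is an approximation.
    """
    groups = defaultdict(list)
    for word in tokenize(sentence):
        if word not in groups[word[0]]:
            groups[word[0]].append(word)
    return {letter: words for letter, words in groups.items()
            if len(words) >= min_words}


def build_trie(words):
    """Build a nested-dict prefix tree; the key "$" marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return trie


if __name__ == "__main__":
    sample = "She sells sea shells by the sea shore"
    print(word_frequencies(sample)[:3])
    print(alliterations(sample))
    print(build_trie(["sea", "see"]))
```

Sorting the output of `word_frequencies` by rank and plotting rank against count on log-log axes is the usual way to eyeball the Zipf-like shape mentioned above. Stemming, collocations and WordNet lookups would need a proper NLP library, so they are left out of this sketch.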
These are just the ones that I can think of immediately. There are more that will come to me soon after I’ve hit “Create post” :P But anyway, the idea for a dynamic website is purely to have fun with HTML5 and JavaScript since they’re ridiculously cool technologies that I’m just itching to dive right into.
There’s an interesting library for doing natural language processing in JavaScript called natural. It appears to me that cool new tools like CouchDB, which combines a web server and an elegant new database model into one package, make it very worthwhile to fiddle around with web development nowadays. In fact, CouchDB is lightweight enough to run on a standard laptop with support for all the different platforms (Mac, Windows and Linux). Even more lightweight and FAST is Node.js, which is built on Chrome’s V8 engine and comes with a built-in web server (but no database). All these tools make it near-trivial to run a full-blown website on a laptop or tablet device for the shortest possible latency and full privacy when it comes to storing data. I guess I just have to start coding rather than blogging about it :P LOL



