Alif Wahid

Posts tagged with "vocabulary"

Is anyone curious about vocabulary?

Not in the snobby-stiff-upper-lip kind of way :P Rather, in a just-interested-in-words kind of way. I have an idea for a dynamic website that will process a given chunk of text and render pretty plots about the set of words it contains. Some characteristics of words that I’m inexplicably curious about are the following.

  • Alliterative words that begin with the same letter in a sentence or a sequence of sentences in a paragraph. Strictly speaking, alliteration is about words that begin with the same initial sound, which often (though not always) coincides with the first letter. If you haven’t noticed by now, the aforementioned italicised words all begin with :P It would be fun to find out what their radix tree looks like, similar to the diagram in the Wikipedia article on radix trees. Such lexicographic prefixes of words are just one way of exploring a vocabulary or a set of words - that’s how dictionaries are organised, obviously.
  • Automatic extraction of word stems to link together families of words and their genealogy. A lot of words obviously have the same meaning but with minor variations due to inflections (i.e., verb conjugation and noun declension). There are some nifty algorithms in the world of natural language processing that do an impressive job (e.g., Porter and Lancaster stemmers).
  • Generating the concordance and collocation of a given word or phrase. This can provide an interesting view of how certain phrases and words are used differently or similarly within a piece of text or across multiple texts. Automated translators, such as the ones from Google and Microsoft, rely on this feature of various spoken languages to easily extract common meanings of different phrases (e.g., “Happy Birthday” is a common sentiment that comes about primarily due to the collocation of two or more disjoint words in most languages). Once again, there are cool algorithms for doing this kind of analysis.
  • Word frequencies are always interesting, especially when rendered as a word cloud like the one from Wordle. Moreover, the existence of a power-law distribution (strictly speaking, a Zipf distribution) in word frequency plots is among the defining statistical signatures of natural languages compared to synthetic or programming languages. There are fascinating links to how the redundancy of spoken languages is tied to their respective grammars, and how these rules have a fractal-like structure that recursively divides into self-similar trees.
  • Browsing the meaning of words by traversing their hyponym and hypernym hierarchies. WordNet is an impressive lexical database from Princeton University that contains over 117K sets of synonyms (synsets), with intricate links between them covering English nouns, verbs, adjectives and adverbs. If I wanted to trivialise it, then I would say that it’s a souped-up thesaurus :P But it’s far more insightful and informative than that. It’s a sort of ontological map of meanings and concepts that underlie the English language whereby words are just convenient labels for abstractions residing in our minds.
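To make the word-frequency item a little more concrete, here is a minimal Python sketch that counts words and lists them by rank. The tokenising regex and the function names are assumptions made for illustration, and the sample sentence is made up. A Zipf-like text would show counts falling off roughly in proportion to 1/rank, which a tiny sample like this cannot demonstrate.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lower-cased words. The regex tokeniser is a simplifying assumption."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def rank_frequency(counts):
    """Return (rank, word, count) tuples, most frequent word first."""
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


# Made-up sample text, only to show the shape of the output.
sample = "the cat and the dog and the bird"
counts = word_frequencies(sample)
table = rank_frequency(counts)

for rank, word, count in table:
    print(rank, word, count)
```

The rank-and-count pairs are what you would put on a log-log plot to look for the straight line that a power law produces.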

These are just the ones that I can think of immediately. There are more that will come to me soon after I’ve hit “Create post” :P But anyway, the idea for a dynamic website is purely to have fun with HTML5 and JavaScript since they’re ridiculously cool technologies that I’m just itching to dive right into.

There’s an interesting library for doing various natural language processing tasks in JavaScript called natural. It appears to me that cool new tools like CouchDB, which combines a web server and an elegant new database model into one package, make it very worthwhile to fiddle around with web development nowadays. In fact, CouchDB is lightweight enough to run on a standard laptop with support for all the different platforms (Mac, Windows and Linux). Even more lightweight and FAST is Node.js, a JavaScript runtime built on Chrome’s V8 engine that comes with a built-in HTTP server module (but no database). All these tools make it near-trivial to run a full-blown website on a laptop or tablet device for the shortest possible latency and full privacy when it comes to storing data. I guess I just have to start coding rather than blogging about it :P LOL

Visualising Vocabularies

Have you ever seen someone’s vocabulary? We can hear what others speak, and read what they write; but I’ve often thought about what someone’s vocabulary might look like. Maybe quite colourful? Or perhaps, a little dense in some way? Over the years, I’ve learned that my own vocabulary is, in fact, quite small. It’s a cause for some sadness, seeing as it stems from my rigidly logical style of writing. Sometimes it reminded me of a musical analogy, whereby a small vocabulary is akin to a narrow vocal range for a singer. Anyway, recently I thought of trying to visualise vocabularies in some way, and this post is basically going to blabber on about what I’ve come up with, so far.

The common currency definition of the term vocabulary is simply the full set of words either spoken or written by a person. This leaves out anything to do with frequencies, as to how often particular words may be used by a person. But I guess everyone sort of intuitively understands that about vocabulary anyway. The interesting question then becomes, how do you paint a picture of this set of unique words used by a person? This can’t be the same as a word cloud since that’s conditional upon frequencies, and results in a corresponding scaling of the illustrated words’ sizes. What about visual ideas of colour, texture and density?

Well, I’m particularly interested in the idea of density, which I associate with how broad and how deep a person’s use of words beginning with various letters is. For example, I tend to write overwhelmingly in straightforward prose, so my use of words beginning with ‘k’, ‘j’ and ‘x’ is quite rare (ignoring the fact that there are not that many words that begin with those letters). Even when I do use a k-word, for instance, there wouldn’t be much density to speak of in terms of the number of different k-words that I might have used, or the longest for that matter. Consequently, once you start to describe a person’s vocabulary by this idea of rooting words to their initial letters and then seeing how they branch out into a tree, you begin to get a tangible structure that manifestly represents that person’s vocabulary.

In fact, this idea is not new. In the realm of computer programming, tree data structures are used for all kinds of text processing and manipulation. And it certainly comes to me as no surprise that there is a beautifully rich and diverse set of techniques for visualising vocabularies; unfortunately you don’t see these tools brought out into the open often enough. So I’m going to try and demonstrate one such technique that is available, and paint pictures of the vocabularies of Shakespeare and Joyce in order to show you the vast differences in their respective use of words.

This technique is called a Radix Tree, or sometimes a PATRICIA Trie, in the computer programming world. The name hints at its purpose: words are grouped by their roots. It is an abstract tree structure that represents strings by their common prefixes. Strings can be any arbitrary sequence of characters but I’m only going to use words. So a Radix Tree is something that organises words by pulling out lexicographic prefixes that are common - just like how a dictionary is organised. In other words, it is a lexicographic ordering of words so that the common prefixes become easily visible and shared across words.
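For the curious, here is a minimal Python sketch of how such a tree can be built by inserting words one at a time. It is only an illustration, not the implementation behind the plots in this post; the class name, the function names and the four sample words are all assumptions.

```python
class RadixNode:
    def __init__(self, label=""):
        self.label = label      # text fragment on the edge from the parent
        self.children = {}      # first character of a child's label -> child
        self.is_word = False    # True if a stored word ends at this node


def common_prefix_length(a, b):
    """Length of the longest prefix shared by strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def insert(root, word):
    """Insert word, splitting an existing node when only part of its label matches."""
    node, rest = root, word
    while rest:
        child = node.children.get(rest[0])
        if child is None:
            # Nothing shares a prefix with the remainder: hang it off as a leaf.
            leaf = RadixNode(rest)
            leaf.is_word = True
            node.children[rest[0]] = leaf
            return
        k = common_prefix_length(rest, child.label)
        if k < len(child.label):
            # Partial match: pull the shared prefix out into a new middle node.
            middle = RadixNode(child.label[:k])
            child.label = child.label[k:]
            middle.children[child.label[0]] = child
            node.children[rest[0]] = middle
            child = middle
        node, rest = child, rest[k:]
    node.is_word = True


def stored_words(node, prefix=""):
    """Yield every stored word in dictionary order."""
    prefix += node.label
    if node.is_word:
        yield prefix
    for key in sorted(node.children):
        yield from stored_words(node.children[key], prefix)


def count_nodes(node):
    """Number of labelled nodes below node (the node itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


root = RadixNode()  # unlabelled root
for w in ["dense", "density", "depth", "deep"]:
    insert(root, w)

print(list(stored_words(root)))
print(count_nodes(root))
```

With these four words the shared prefix "de" is pulled out first, then "ns" is shared between "dense" and "density", which is exactly the dictionary-like grouping described above.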

That’s enough words spent describing something so abstract; now it’s time for a diagram! Below is the plot that I copied from the Wikipedia page on Radix Trees (it comes with a creative commons license, CC-BY v2.5). The example in this diagram uses seven r-words: romane, romanus, romulus, rubens, ruber, rubicon and rubicundus, in order to build a radix tree that contains 13 individual nodes and shares various prefixes among the words stored within it. One way to read this radix tree is to start at the bottom nodes that are numbered, which are also called leaves, and then work your way up the tree by following the parent of each node. Along the way, you simply keep prepending each label that you come across until you reach the root of the tree and retrieve the original word. You can also start at the root and work your way down each child node by appending the labels. I personally just like to go up a tree, not down. Have a go with this diagram tracing out the seven words that are stored, and convince yourself that there’s no false magic in this. It’s quite fun actually.
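If you would rather let a program do the tracing, here is a small Python sketch of the bottom-up reading described above. The tree is written out by hand from the seven words; the node names are made up, and numbering the leaves 1 to 7 in the order the words are listed is an assumption about the diagram.

```python
# node id -> (label, parent id). The root has an empty label and no parent.
# Internal node ids are hypothetical names; leaves are numbered 1 to 7.
NODES = {
    "root": ("", None),
    "r": ("r", "root"),
    "rom": ("om", "r"),
    "rub": ("ub", "r"),
    "roman": ("an", "rom"),
    "rube": ("e", "rub"),
    "rubic": ("ic", "rub"),
    1: ("e", "roman"),
    2: ("us", "roman"),
    3: ("ulus", "rom"),
    4: ("ns", "rube"),
    5: ("r", "rube"),
    6: ("on", "rubic"),
    7: ("undus", "rubic"),
}


def trace_up(node_id):
    """Walk from a node to the root, prepending each label along the way."""
    word = ""
    while node_id is not None:
        label, parent = NODES[node_id]
        word = label + word
        node_id = parent
    return word


words = [trace_up(leaf) for leaf in range(1, 8)]
labelled_nodes = len(NODES) - 1  # everything except the unlabelled root

print(words)
print(labelled_nodes)
```

Counting every labelled node and leaving out the empty root gives the 13 nodes mentioned above.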

The planar diagram corresponding to a radix tree is, in and of itself, a vivid visualisation of vocabularies. In this toy example, the vocabulary only contains seven words and the diagram is at least sketchable by hand, if not neatly drawable by a program. But the problem arises when you want to visualise thousands of words for each letter in the alphabet. Imagine the number of branches that you could have in such a radix tree, and just how deep the number of levels might actually go! There is not enough space in a 2D plane to put all of the little nodes and connect them with edges and create something resembling a tree. You could do it if hard pressed, but it would look like a jumbled black box from all of the overlapping lines and dots. Unfortunately, this is a problem in any kind of data visualisation where scalability is the major bottleneck because there are far too many dimensions to fit into a 2D or 3D visual space.

So what can we do? Well, we have to rely on statistics to reduce some of these dimensions and plot the relationships between key parameters while ignoring the rest. There’s no other way of reducing dimensions - you have to throw information away to make room! The trick is to throw away the redundant information that does not add much value to the nature of the underlying tree structure. So in the case of a radix tree, what we have is an easy to see relationship between the depth of the tree and the breadth of the tree. This is because we can trace the level of a node down the tree (i.e., its depth), and for any given depth in the tree we can count the number of nodes that are actually at that level (i.e., the breadth of the tree). This gives us a 2D relationship for any given tree. We can then extend this across many radix trees that are each rooted at a specific letter of the alphabet. In the same way that the diagram above is rooted at ‘r’, you can have rooted radix trees for any of the other letters depending on what words are present in a vocabulary. This would extend the 2D depth-breadth relationship to 3D and still make it easily plottable in a graph for visualisation.
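As a rough illustration of that depth-breadth count, here is a Python sketch that walks the seven-word ‘r’ tree and tallies how many nodes sit at each level. The nested-dictionary representation and the convention that the root letter sits at depth 0 are both assumptions chosen for brevity.

```python
from collections import Counter

# The 'r' tree from the seven example words, written as label -> subtree.
R_TREE = {
    "r": {
        "om": {"an": {"e": {}, "us": {}}, "ulus": {}},
        "ub": {"e": {"ns": {}, "r": {}}, "ic": {"on": {}, "undus": {}}},
    }
}


def breadth_by_depth(tree, depth=0, counts=None):
    """Count how many nodes sit at each depth (root letter = depth 0)."""
    if counts is None:
        counts = Counter()
    for subtree in tree.values():
        counts[depth] += 1
        breadth_by_depth(subtree, depth + 1, counts)
    return counts


profile = breadth_by_depth(R_TREE)
for depth in sorted(profile):
    print(depth, profile[depth])
```

The whole tree collapses into a handful of numbers, one breadth per depth, which is the information that survives the reduction.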

And that’s what I did, by taking the set of unique words from Hamlet and Ulysses in order to build their corresponding radix trees (that is, one tree for each letter of the alphabet for each piece of text using all of the unique words found). I then computed the distribution of the depth and breadth of these trees and plotted them as a density map (sometimes called a heat map). These figures are shown below for Ulysses and Hamlet respectively. The x-axis is the tree depth parameter while the y-axis is the root letter that identifies a tree. The density of the colour mapping corresponds to the breadth of the trees at the given level of depth. As you can see, deep blue colours are close to zero while darkish reds are close to the saturation point of the data. Note that the colour scale of both plots is normalised to the inherent dynamic range of the larger data set out of the two texts, which is of course Ulysses.
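The pipeline just described can be sketched in a few lines of Python. This is a simplified stand-in and not the code that produced the plots: the tokenising regex, the function names, the choice to root every tree at its single letter (depth 0) and the tiny sample text are all assumptions.

```python
import os
import re
import string


def build_subtree(suffixes):
    """Nested-dict radix tree (label -> subtree) for a set of word remainders."""
    groups = {}
    for s in suffixes:
        if s:  # an empty remainder means a word ended at the parent node
            groups.setdefault(s[0], []).append(s)
    tree = {}
    for first in sorted(groups):
        group = groups[first]
        prefix = os.path.commonprefix(group)  # longest shared prefix, by character
        tree[prefix] = build_subtree({s[len(prefix):] for s in group})
    return tree


def breadth_by_depth(subtree):
    """Breadth at each depth, starting with 1 for the root letter itself."""
    breadths = [1]
    level = [subtree]
    while True:
        count = sum(len(node) for node in level)
        if count == 0:
            return breadths
        breadths.append(count)
        level = [child for node in level for child in node.values()]


def density_matrix(text):
    """Map each letter a-z to a zero-padded list of breadths indexed by depth."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    rows = {}
    for letter in string.ascii_lowercase:
        starts = [w for w in words if w[0] == letter]
        if starts:
            rows[letter] = breadth_by_depth(build_subtree({w[1:] for w in starts}))
        else:
            rows[letter] = []  # no tree at all for this letter
    width = max(len(row) for row in rows.values())
    return {letter: row + [0] * (width - len(row)) for letter, row in rows.items()}


sample = "romane romanus romulus rubens ruber rubicon rubicundus a an"
matrix = density_matrix(sample)
print(matrix["r"])
print(matrix["a"])
```

Each row of the result corresponds to one letter on the y-axis and each column to one depth on the x-axis, so the dictionary can be handed to any heat-map plotting routine once the rows are stacked in alphabetical order.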

Personally, I find these plots to be quite rich with information, as well as being colourful pictures that let me see what shapes the vocabularies of these two giants actually take on. For any given letter in the alphabet, you simply trace across horizontally and get a measurement of the depth-breadth relationship corresponding to the underlying radix tree. If a tree is quite bushy with lots of branches at a given depth, then that means that there are lots of unique words without common prefixes in that region (i.e., large vocabulary). Similarly, if a tree is naturally pruned for any given depth, then that means that there are lots of common prefixes shared among words, which in turn means that there are few unique words in that region (i.e., small vocabulary). The combined effect in 3D space is the manifestation of dense blobs for certain groups of unique words and their prefixes. Notice how the radix trees corresponding to ‘j’, ‘k’ and ‘x’ are not dense at all in the Ulysses plot. Alternatively, there’s a dense blob in the region of ‘a’, ‘b’, ‘c’ and ‘d’ collectively.

Can you see how Joyce is so much broader and denser in his vocabulary than Shakespeare (at least in this partial comparison with Hamlet, as opposed to ALL of the plays)? The saturation of Ulysses’ density map is so much higher that Hamlet’s density map is barely visible! Consequently, by doing a direct visual comparison, the absolute scale of the difference in breadth and depth is evident. However, to be fair to Shakespeare, we should at least do a relative comparison of the breadths and depths by rescaling the Hamlet plot to saturate at a much lower level, around 300 instead of 1500 (which is a reduction by a factor of 5). So there’s Hamlet rescaled below. Interestingly, very much the same kinds of dense blobs manifest in Hamlet as in Ulysses, except they are roughly 5 times smaller in scale. That is not surprising given that Ulysses has ~30,000 unique words compared to the ~5,000 unique words of Hamlet.
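The rescaling itself is simple arithmetic, sketched here in Python. The function name and the sample breadth values are made up for illustration; only the two saturation levels, 1500 and 300, come from the plots discussed above.

```python
def rescale(matrix, saturation):
    """Map breadth counts to the range 0..1, clipping anything above saturation."""
    return [[min(value, saturation) / saturation for value in row] for row in matrix]


# Made-up breadth values, just to show the effect of the two saturation levels.
sample = [[0, 150, 300, 1500]]

shared_scale = rescale(sample, 1500)  # colour scale set by the larger data set
own_scale = rescale(sample, 300)      # colour scale lowered by a factor of 5

print(shared_scale)
print(own_scale)
```

On the shared scale a breadth of 300 only reaches a fifth of the colour range, which is why the smaller text looks washed out until it is given its own scale.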

I guess I haven’t succeeded in as much visual appeal as I was subconsciously looking for. Ah well, that happens with any kind of endeavour. Even though these plots are not as instantly understandable as a word cloud, I think they convey something deeper and broader about the writing style and content of these authors. The difference between them is evident in a quantified multi-dimensional manner, which speaks densely to the way they used language in order to achieve distinct effects of their choice: Joyce being the supreme word inventor, and Shakespeare being that unparalleled succinct poet. Just a word of caution about these plots before I conclude: they are preliminary, as my implementation of the radix tree data structure has not been reviewed by anyone. It is plausible that some life-threatening bugs are hiding in my code somewhere even though I’ve tested and checked the data quite a bit (within the confines of a hobby project that is). So please don’t launch into a war using my data, for I will bear no responsibility for your silliness :P