Alif Wahid

Posts tagged with "semantics"

Primitive thoughts on tracking changes in a document

The change tracking feature of MS Word is a real mixed bag for me. I mostly understand (and thoroughly appreciate) the algorithmic trickery involved in implementing such a feature, so my unexamined opinion of it is positively biased. But whenever I actually put it through its paces in a workflow that requires merging two or more versions of a document, I realise just how clumsy and needlessly counter-intuitive it is. Therein lies a dilemma that I want to resolve.

The process of merging two documents in order to derive a third one is based on the formal concept of edit distance. Imagine two primitive operations that can be performed on a document: insert and delete. You can insert any number of characters into a document and/or delete any number of characters from it, in any interleaved sequence. If you take two adjacent versions of a document, say X and Y, then there exists a shortest sequence of insertions and deletions for transforming X into Y, and vice versa. The proof is not hard but editing mathematics on Tumblr is too hard.
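To make the idea concrete, here is a minimal sketch of how the length of that shortest sequence can be computed when insert and delete are the only operations. The function name `indel_distance` is my own invention for illustration, and the dynamic programming approach is the standard textbook one, not necessarily what Word does internally.

```python
def indel_distance(x: str, y: str) -> int:
    """Fewest single-character insertions and deletions that turn x into y."""
    # prev[j] is the distance between the prefix of x processed so far
    # and the first j characters of y. With an empty prefix of x, the
    # only option is to insert all j characters.
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning i characters into nothing takes i deletions
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                # Matching characters cost nothing.
                curr.append(prev[j - 1])
            else:
                # Either delete cx from x, or insert cy into x.
                curr.append(1 + min(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(indel_distance("kitten", "sitting"))  # 5
```

The two nested loops mean the running time grows with the product of the two lengths, which is fine for sentences but slow for whole books.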

The length of this sequence of operations is a quantitative measure of the difference between the two versions. It happens to be a proper metric: it is symmetric, and it satisfies the triangle inequality when a third version is introduced, which means that it can be used to lay out a document's history as a space in which versions sit nearer to or further from one another. In practical terms, the process of merging two documents is reduced to primitive operations like inserting and deleting characters in an interleaved sequence, whereby it is guaranteed that one can always transform a document in a finite number of steps, and that the shortest such sequence can be found efficiently (i.e., in polynomial running time). Then the pertinent question is who should perform all of these steps all of the time: computers or users?

I’m inclined to think that it’s the computer’s job to do all those tedious comma insertions and typo deletions and all the other trivial steps of editing. To be fair, this is possible by telling Word to accept and apply all of the changes between two versions of a document. But more often than not, Word gets it utterly wrong and mangles the document to the extent that I have to revise it from start to finish out of sheer distrust. Hence it is counter-productive and counter-intuitive to think of this as a positively useful feature. It is not! Rather, it is time consuming and annoying, to say the least.

The operations that I would prefer to perform myself are much less primitive, since the computer is much better at doing primitive stuff. For instance, I want to view the semantic difference between two documents in terms of chapters, sections, paragraphs, figures, tables etc. Not commas, spaces, carriage returns, typos and the rest. Therefore, I want a formal conception of semantic distance instead of edit distance, so that I can operate at a higher level of abstraction which is less tedious and more productive. I wonder if such a formalism exists already? Any pointers?

I guess it necessarily requires setting out the structure of the document in some standard form, since the semantic meta-data cannot be legible to a computer otherwise. A book is a good standard structure. It’s usually organised into chapters with short headings, sections within chapters, sub-sections within sections, and so on. Thus the difference between two adjacent versions/editions of a book ought to be expressible in the form of this hypothetical semantic distance that I’m postulating.

I think it must be intuitively far less tedious to view (and merge) two paragraphs displayed side by side as opposed to the differing characters within them, which generate the edit distance. By extension of this structural analogy, it must also be intuitively far less tedious to view the difference in two tables of contents, side by side, in order to quickly sense the overall semantic distance between them. So I suspect that this conception will most likely be a hierarchical one at different levels of semantic abstraction whereas the formalism for edit distance is necessarily flat at the syntactical level of individual characters. I have to ponder some more before the dilemma might go away, although the general idea seems sound thus far.
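As a rough illustration of what operating at those higher levels might look like, here is a minimal sketch that compares whole headings first and then whole paragraphs within the sections that both versions share. Everything in it is an assumption made for the example: the names `coarse_diff` and `semantic_diff`, the representation of a document as a dict mapping each heading to a list of paragraphs, and the sample text. It leans on `difflib.SequenceMatcher` from the Python standard library, which happily compares lists of strings as well as strings of characters. It is not the formalism I am looking for, just a hint of the hierarchy.

```python
from difflib import SequenceMatcher


def coarse_diff(old, new):
    """Return (tag, old_items, new_items) for each changed run of whole items."""
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def semantic_diff(doc_old, doc_new):
    """Diff the headings first, then the paragraphs of sections found in both."""
    report = {
        "sections": coarse_diff(list(doc_old), list(doc_new)),
        "paragraphs": {},
    }
    for heading in doc_old:
        if heading in doc_new:
            changes = coarse_diff(doc_old[heading], doc_new[heading])
            if changes:
                report["paragraphs"][heading] = changes
    return report


# Hypothetical tables of contents for two versions of a document.
toc_v1 = ["Introduction", "Background", "Method", "Results"]
toc_v2 = ["Introduction", "Method", "Evaluation", "Results"]
print(coarse_diff(toc_v1, toc_v2))

# Hypothetical documents: heading -> list of paragraphs.
doc_v1 = {
    "Introduction": ["Why merging is tedious."],
    "Method": ["Compare characters.", "Accept all changes."],
}
doc_v2 = {
    "Introduction": ["Why merging is tedious."],
    "Method": ["Compare paragraphs.", "Accept all changes."],
    "Results": ["Fewer steps for the user."],
}
print(semantic_diff(doc_v1, doc_v2))
```

A changed paragraph shows up as a single "replace" of one paragraph by another, however many characters differ inside it, which is exactly the side-by-side view I have in mind.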

Visualising Hyponym/Hypernym Hierarchies

Hyponym and hypernym are linguistic jargon that I will allow Wikipedia to explain. What matters for the sake of this post is that two or more visually different words can have semantic relationships of some kind. For instance, the statement “red is a colour” asserts that “red” and “colour” are related in some specific manner (usually called an is-a relationship in Computer Science). A thesaurus is useful for a similar reason: it exposes another kind of relationship, namely the one between different words that have the same meaning (i.e., synonyms).

Anyhow, it turns out that you can elegantly visualise these relationships as a network (or in strict mathematical terms, as a graph containing vertices and edges) using the NLTK, NetworkX and Matplotlib libraries for the Python programming language. NLTK ships with a lexical database called WordNet, which contains 155,287 English words and 117,659 synonym sets. These sets are organised into hierarchies with a root word, for example “car”. More specific terms (hyponyms) then branch out (and merge) from that root, and their own hyponyms branch out from them, and so on. Thus you have this idea of a network or graph. So, I plotted quite a few such hyponym/hypernym hierarchies and here’s the graph corresponding to “fear”.
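For anyone who wants to try this, here is a minimal sketch of the general approach. It is not the exact code I used. The helper names (`hyponym_edges`, `wordnet_children`, `draw`) and the toy hierarchy are invented for illustration, and I am assuming that the noun sense `fear.n.01` is the right starting point for “fear”. The walk itself only needs a function that returns the children of a node, so it can be tried without WordNet installed. The last two helpers import NLTK, NetworkX and Matplotlib only when they are called.

```python
def hyponym_edges(root, children):
    """Walk the hierarchy below `root`, returning (parent, child) edges."""
    edges, seen, stack = [], {root}, [root]
    while stack:
        node = stack.pop()
        for child in children(node):
            edges.append((node, child))
            if child not in seen:  # branches can merge, so expand each node once
                seen.add(child)
                stack.append(child)
    return edges


def wordnet_children(synset_name):
    """Names of the hyponyms of a WordNet synset (needs NLTK and its WordNet corpus)."""
    from nltk.corpus import wordnet as wn
    return [s.name() for s in wn.synset(synset_name).hyponyms()]


def draw(edges):
    """Plot the edges as a labelled network (needs NetworkX and Matplotlib)."""
    import matplotlib.pyplot as plt
    import networkx as nx
    graph = nx.DiGraph(edges)
    positions = nx.spring_layout(graph, seed=0)
    nx.draw_networkx(graph, pos=positions, node_size=50, font_size=8)
    plt.axis("off")
    plt.show()


# Toy hierarchy, made up for this example (not taken from WordNet).
toy = {
    "colour": ["red", "blue"],
    "red": ["crimson", "purple"],
    "blue": ["purple"],
}
toy_edges = hyponym_edges("colour", lambda word: toy.get(word, []))
print(sorted(toy_edges))

# With the libraries and the WordNet corpus installed:
# draw(hyponym_edges("fear.n.01", wordnet_children))
```

Keeping the walk separate from the WordNet lookup and the plotting makes it easy to swap in a different layout when the labels overlap.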

Unfortunately I haven’t, as yet, figured out a way to align the labels and edges correctly so that they don’t overlap (which obviously makes the graph hard to read). Nevertheless, the network structure is quite interesting. Here’s the actual list of words that NLTK printed out as the synonym set of “fear”.

horror
hysteria
intimidation
apprehension
trepidation
gloom
foreboding
presage
shadow
chill
suspense
alarm
timidity
shyness
diffidence
hesitance
unassertiveness
cold_feet
panic
swivet
scare
frisson
creeps
stage_fright

It appears that WordNet actually includes phrases like “cold feet” and defines them as words. I’m not sure that I agree entirely with this idea, but there certainly is a synonymous relationship in this particular instance.

Well, I’ve got tonnes of plots like the one above and there simply isn’t enough space to put them here :( The best way to see them is to start using these tools yourself and generate your own plots. I’ve used a slightly modified chunk of code from the NLTK book available here.