Shakespearean Statistics - Encore
In my previous post, Shakespearean Statistics, I analysed the raw text of Hamlet, Macbeth and Othello to see whether the pervasive Zipf’s Law held (a non-rigorous visual inspection suggested that it did). In particular, the resulting frequency distributions of letters and words showed that some letters and words are two to three orders of magnitude more frequent than the rest. The basic conclusion was that Shakespearean text is no different from other text in terms of information redundancy, and that empirical laws like Zipf’s hold well. Obviously, such statistics leave out the inherent meaning and connotations of words, which one might say give that incomparable Shakespearean flavour. So this is an encore to that post, with the goal of examining the individual unique words of these plays based on the frequencies I counted.
The interesting thing that I thought of was to query the word frequency distributions for the “top N” words. The choice of N was totally arbitrary, but “top” here means ranked in descending order of word frequency, beginning with the most frequent. So, let’s begin with the top 10 words from Hamlet along with their number of appearances (note that words are treated as case insensitive throughout my analysis in these two posts).
the = 1142
and = 964
to = 737
of = 669
i = 567
you = 546
a = 531
my = 513
hamlet = 462
in = 436
I suppose these frequent words are not surprising, since English grammar necessitates their presence in most sentences. The frequency of “the” is interesting, since it helps explain why the consonant ‘t’ was so prominent in the distribution of letter frequencies shown in the previous post. Also not surprising is the protagonist, “hamlet”, featuring in the top 10 most frequent words, although word frequency is a crude measure of the significance of a noun or name in these plays. Nevertheless, the high frequency of “hamlet” suggests that a large number of dialogues involve him directly and, consequently, his name appears quite often as a speaker label on the left-hand margin.
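For the curious, a top N query like the one above can be sketched in a few lines of Python. This is only a minimal sketch and not necessarily the exact procedure behind the numbers in this post: the function name top_n_words and the tokenisation rule (lowercase everything, then treat runs of letters with optional internal apostrophes as words) are assumptions, and a different rule would shift the counts slightly.

```python
import re
from collections import Counter

def top_n_words(text, n=10):
    """Return the n most frequent words in text, ignoring case."""
    # Normalise typographic apostrophes, then lowercase for case insensitivity.
    text = text.lower().replace("\u2019", "'")
    # A word is a run of letters, optionally with internal apostrophes (e.g. do't).
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text)
    # most_common sorts by descending frequency, i.e. the "top" ranking.
    return Counter(words).most_common(n)

sample = "To be, or not to be, that is the question."
for word, count in top_n_words(sample, n=3):
    print(f"{word} = {count}")
```

Feeding in the full text of a play instead of the sample line, and raising n, gives progressively longer lists to browse.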
Looking at the next few dozen words in this list is not very interesting, since they’re all basic articles, conjunctions, pronouns and auxiliary verbs (all predictable from English grammar). However, the next character’s name to feature frequently is “horatio” (157 times), followed by “claudius” (120 times) and “polonius” (119 times). I don’t think there’s any major significance in that ordering apart from the fact that they’re all important characters in this tragedy. Interestingly, “gertrude” (95 times) and “ophelia” (86 times) are much further down this list despite their significance as the leading female characters. However, “queen” features more prominently with 118 appearances. I guess that compensates for the lower count of “gertrude”, since Hamlet and other leading characters spend a lot of time referring to Gertrude in the third person as “Queen”.
Certain words that are representative of that period seem to feature frequently. For instance, “thou” (103 times), “thy” (87 times), “tis” (73 times), “hath” (62 times) and “thee” (58 times). They’re very much archaic nowadays but provide that essential Elizabethan taste accompanying Shakespearean English. Another indicator of this period is a set of infrequent words that have apostrophes substituted for certain syllables and letters to achieve accentual effects and manipulate the tempo of speech. I came across “do’t” (9 times), “on’t” (8 times), “to’t” (7 times), “seal’d” (6 times), “drown’d” (5 times), “kill’d” (5 times), “damn’d” (5 times), “honour’d” (4 times), “return’d” (4 times), “ne’er” (3 times), and quite a few others.
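Words like these can be picked out of a text mechanically. The following is a rough sketch, assuming that an elided word is simply any run of letters with an apostrophe in the middle; the function name elided_words is made up for illustration. Note that this rule also catches ordinary modern contractions such as “I’ll”, and misses forms with a leading apostrophe, so the output still needs a manual look.

```python
import re
from collections import Counter

def elided_words(text):
    """Count words containing an internal apostrophe, e.g. do't or kill'd."""
    # Normalise typographic apostrophes so both forms are matched.
    text = text.lower().replace("\u2019", "'")
    # Letters, one apostrophe, then more letters.
    return Counter(re.findall(r"[a-z]+'[a-z]+", text))

sample = "He is kill'd, and I'll do't; ne'er was a man so damn'd. Kill'd!"
for word, count in elided_words(sample).most_common():
    print(f"{word} ({count} times)")
```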
One can go on making more detailed inferences based on the frequency of first-person pronouns compared with second- and third-person ones, since that might reveal something about the underlying character relationships and the actual characterisation process. For example, notice how both “i” and “you” are in the top 10 list above with over 500 appearances each; could they indicate something about the way Shakespeare built the relationship between two characters? This is not what I’m looking to do right now, however (but I might come back to it in the future, as it could be quite revelatory). I wonder if linguists and scholars have done this analysis already?
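As a starting point for that kind of pronoun comparison, one could total the counts by grammatical person. The sketch below is purely illustrative: the pronoun groupings (including which archaic forms to include) and the function name pronoun_counts are my own assumptions, and ambiguous words are not handled in any clever way.

```python
import re
from collections import Counter

# Assumed groupings; archaic forms sit alongside modern ones.
PRONOUNS = {
    "first": {"i", "me", "my", "mine", "we", "us", "our"},
    "second": {"you", "your", "thou", "thee", "thy", "thine", "ye"},
    "third": {"he", "him", "his", "she", "her", "they", "them", "their"},
}

def pronoun_counts(text):
    """Total the occurrences of pronouns in text, grouped by person."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {
        person: sum(counts[word] for word in words)
        for person, words in PRONOUNS.items()
    }

sample = "I love thee, and thou lovest me; he knows it not."
print(pronoun_counts(sample))
```

Running this per character or per scene, rather than over a whole play, would probably be needed before the totals say anything about relationships.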
Moving on to Macbeth, a similar pattern emerges: the most frequent word is “the” (732 times), while the title character follows soon after with 278 appearances (6th most frequent in this case). The next characters on this list are “macduff” (106 times), “banquo” (71 times) and “malcolm” (58 times). That seems reasonable enough to me. Interestingly, “lady” features prominently in this play with 95 appearances. I guess that’s a proxy for the lead female character, since she was always addressed by her married name (as far as I can recall). The word “murderer” is in the top 100 with 34 appearances. Again, reasonable enough given that the plot of this play involves regicide.
Moving on to Othello, the same pattern emerges, but the first character to appear in the top 10 is the greatest prick of all time - “iago” - with 359 appearances. Interestingly, the most frequent word is actually “i” (830 times), followed by “and” (791 times) and “the” (757 times). The protagonist appears 321 times, followed shortly thereafter by his muse, “desdemona”, appearing 224 times. Some nouns and adjectives that struck me in the top 100 list were “good” (93 times), “love” (77 times) and “heaven” (60 times). A product of the plot again, perhaps?
So, what can we glean from this encore exercise of querying the top N words in each play as ranked by their frequencies? Well, the redundancy is evident once again. It is not a characteristic Shakespearean “thing” but an inherent feature of the English language. I don’t think anything major has been revealed above, apart from the prominence of character names in these top N lists, which suggests a very crude correlation between their frequencies and their plot significance. I think a bit of digging around is necessary to see what other people have done in this field, especially on the possibility of word frequencies revealing a hidden dimension of the characterisation process in Shakespearean tragedies (and plays in general). Comments and suggestions are most welcome in this regard.