Identifying the individual words that uniquely define a book’s vocabulary is fairly easy. But what about sentence-level analysis? What if we wanted to find “the most unique sentence” in the entire book?
The primary challenge here is deciding just what qualifies as a unique sentence. After a little thought, I came up with a couple of reasonable-seeming definitions:
1) The total uniqueness of a sentence is equal to the sum of the uniqueness of every non-stop-word in that sentence
2) The mean uniqueness of a sentence is equal to the sum of the uniqueness of every non-stop-word in that sentence divided by the number of non-stop-words in that sentence
Here a stop word is one of the uninteresting bits of grammatical glue that hold English sentences together, such as “an” or “the”. Ignoring these is easy enough, as spaCy automatically labels them for us.
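The two sentence-level definitions above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used for the analysis: `sentence_scores`, the `word_uniqueness` lookup table, and the tiny `stop_words` set are hypothetical names invented for the example, and it assumes every non-stop-word already has a score. In a real spaCy pipeline the stop-word check would come from each token's `is_stop` attribute instead.

```python
def sentence_scores(words, word_uniqueness, stop_words):
    """Return (total, mean) uniqueness over a sentence's non-stop-words.

    words           -- list of lowercase word strings in the sentence
    word_uniqueness -- dict mapping each word to its uniqueness score (assumed given)
    stop_words      -- set of words to ignore (stand-in for spaCy's is_stop flag)
    """
    scores = [word_uniqueness[w] for w in words if w not in stop_words]
    if not scores:
        # A sentence made only of stop words has nothing to score.
        return 0.0, 0.0
    total = sum(scores)
    return total, total / len(scores)


# Made-up scores, purely for illustration.
example_scores = {"hound": 3.0, "moor": 5.0}
total, mean = sentence_scores(
    ["the", "hound", "on", "the", "moor"], example_scores, {"the", "on"}
)
print(total, mean)
```

Note that the mean divides by the number of non-stop-words only, so padding a sentence with “the” and “an” changes neither score.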
Hopefully that seems like a reasonable approach to sentences. But how do we define the uniqueness of a word? Once again there are a few reasonable possibilities.
1) The book-specific uniqueness of a word is proportional to how often a word is used in a given book compared to modern language in general. Ex: A word with a book frequency that is ten times higher than its “normal” usage frequency is more unique than a word with only double the baseline frequency.
2) The general uniqueness of a word is inversely proportional to how often it is seen in normal writing. Ex: A word that shows up only once every hundred thousand words is more unique than a word that shows up once every ten thousand words.
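Taken at their most literal, these two definitions amount to a ratio and a reciprocal. The sketch below is just that literal reading, with hypothetical function names, and it assumes both frequencies are expressed as a fraction of all words and that the general frequency is never zero (a word missing from the reference corpus would need special handling).

```python
def book_specific_uniqueness(book_freq, general_freq):
    """How many times more often a word appears in the book than in general usage."""
    return book_freq / general_freq


def general_uniqueness(general_freq):
    """Reciprocal of the general frequency: the rarer the word, the higher the score."""
    return 1.0 / general_freq


# The two examples from the definitions, with an arbitrary baseline frequency.
baseline = 1 / 10_000
ten_times = book_specific_uniqueness(10 * baseline, baseline)
double = book_specific_uniqueness(2 * baseline, baseline)
print(ten_times > double)

one_in_100k = general_uniqueness(1 / 100_000)
one_in_10k = general_uniqueness(1 / 10_000)
print(one_in_100k > one_in_10k)
```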
But how do we actually calculate these uniqueness scores? A direct comparison might not be the best strategy. For instance, we could calculate general uniqueness as the inverse of word frequency, such that a one-in-a-hundred-thousand word is worth a hundred thousand points while a one-in-ten-thousand word is worth ten thousand points.
This would mean that a single very rare word would absolutely overpower anything short of a dozen fairly rare words, and that just doesn’t feel quite right. To my mind, at least, a sentence with several slightly unusual words is much more noteworthy than an otherwise normal sentence with a single very strange word.
The obvious solution is base ten logarithms. By comparing everything in terms of powers of ten a very rare word is now only worth one more point than a rare word, and an extremely rare word worth only one more than that. This helps solve the problem of having a single strange word skew our sentence results.
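To see the effect with actual numbers, here is a small sketch comparing raw inverse-frequency scores with their base ten logarithms. The function name is hypothetical and this shows only the general-frequency case; it is one plausible way to apply the logarithm, not necessarily the exact formula behind the results further down.

```python
import math


def log_general_uniqueness(general_freq):
    """Base ten log of the inverse frequency: one point per power of ten of rarity."""
    return math.log10(1.0 / general_freq)


very_rare = 1 / 100_000   # shows up once every hundred thousand words
fairly_rare = 1 / 10_000  # shows up once every ten thousand words

# Sentence A contains one very rare word; sentence B contains two fairly rare words.
raw_a = 1 / very_rare
raw_b = 2 * (1 / fairly_rare)
log_a = log_general_uniqueness(very_rare)
log_b = 2 * log_general_uniqueness(fairly_rare)

print(raw_a > raw_b)  # raw scores: the single very rare word wins
print(log_b > log_a)  # log scores: the two fairly rare words win
```

On the raw scale sentence A scores roughly 100,000 against 20,000 for sentence B. On the log scale the scores are roughly 5 against 8, so the sentence with several unusual words comes out ahead.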
Before I show off the results of this logarithmic ranking, there is a similar problem for mean sentence uniqueness that needs to be tackled: very short sentences with a single very rare word.
For example: “Have you been on the moor already?” has the highest average non-stop-word uniqueness score because “moor” is individually very rare and there are basically no other words to pull the average down.
But that’s not very interesting, so let’s add a requirement that a sentence has to have a minimum number of non-stop-words to qualify for our analysis. We already know that the median sentence length in The Hound of the Baskervilles is 13 words. Most sentences include at least a handful of stop words such as “a” or “the” so let’s carve that down a bit and say we only care about sentences with at least 8 non-stop-words in them.
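That cutoff is a one-line filter. The sketch below uses a hypothetical helper name and a hand-picked stop-word set chosen only so the example runs on its own; spaCy's actual stop-word labels (the `is_stop` attribute on each token) may classify some of these words differently.

```python
MIN_CONTENT_WORDS = 8  # the threshold chosen above


def has_enough_content(words, stop_words, minimum=MIN_CONTENT_WORDS):
    """True if the sentence has at least `minimum` non-stop-words."""
    content_words = [w for w in words if w.lower() not in stop_words]
    return len(content_words) >= minimum


# Illustrative stop-word set, not spaCy's real list.
stop_words = {"have", "you", "been", "on", "the", "already"}
short_sentence = ["Have", "you", "been", "on", "the", "moor", "already"]
print(has_enough_content(short_sentence, stop_words))
```

With this stop-word set the only word left to count is “moor”, so the sentence falls well short of the threshold and drops out of the ranking.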
That leads to much more interesting results. And now, with all those definitions and caveats out of the way, we can finally see the statistical results for the most unique sentences in The Hound of the Baskervilles:
Sentence with the highest total uniqueness score based on book word frequencies
Rolling pasture lands curved upward on either side of us, and old gabled houses peeped out from amid the thick green foliage, but behind the peaceful and sunlit countryside there rose ever, dark against the evening sky, the long, gloomy curve of the moor, broken by the jagged and sinister hills.
Sentence with the highest mean uniqueness score based on book word frequencies
Our wagonette had topped a rise and in front of us rose the huge expanse of the moor, mottled with gnarled and craggy cairns and tors.
Sentence with the highest total uniqueness score based on general word frequencies
But it was not the sight of her body, nor yet was it that of the body of Hugo Baskerville lying near her, which raised the hair upon the heads of these three dare-devil roysterers, but it was that, standing over Hugo, and plucking at his throat, there stood a foul thing, a great, black beast, shaped like a hound, yet larger than any hound that ever mortal eye has rested upon.
Sentence with the highest mean uniqueness score based on general word frequencies
Our wagonette had topped a rise and in front of us rose the huge expanse of the moor, mottled with gnarled and craggy cairns and tors.
It’s mildly interesting to note that the most unique sentence based on mean word uniqueness is the same regardless of whether we are looking at general word frequencies or at the relative word frequencies calculated by comparing the book frequencies to general frequencies. Mere coincidence, or does this say something worthwhile about Conan Doyle’s writing style?
No clue!