Moby Dick is a very very long book, partially because the author enjoys going off on tangents and partially because he insists on providing thorough background details on the biology of whales and the nature of the whaling industry.
And that’s fine! The book’s unique appeal lies precisely in the fact that it’s a strange mix of philosophy and dreams and history and tall-tale ramblings, all told from the perspective of a rather introspective and somewhat unreliable sailor.
But what if we didn’t want that? What if we just wanted to read about the whale hunts and the high sea disasters and Ahab’s struggles with his own obsessive madness?
The obvious answer would be to read an abridged version of the story. And the most obvious way to get one of those would be to buy one from a human.
But what if we’d rather save a few bucks and have a computer do it for us? And what if we’re short on time and only have a couple afternoons? What can we accomplish?
We’re short on time, so starting from scratch is obviously unacceptable. Fortunately, people much smarter than I am have published all sorts of free-to-use tools across the internet, such as the machine learning library scikit-learn.
The approach is simple enough. We use spaCy to split the book into sentences, randomly choose a chunk of them, and manually decide whether each one is related to the main plot or not. Once that is done we convert the sentences into a so-called “bag of words” representation, which turns each sentence into a large vector of word counts that scikit-learn can do math with.
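Here’s a rough sketch of those first steps. The filename is a stand-in for wherever your copy of the book happens to live, and you’ll need spaCy’s small English model installed:

```python
import spacy
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load("en_core_web_sm")
nlp.max_length = 2_000_000  # Moby Dick is ~1.2M characters, over spaCy's default 1M limit

# Read the whole book and let spaCy's sentence segmenter split it up.
with open("moby_dick.txt", encoding="utf-8") as f:
    doc = nlp(f.read())
sentences = [sent.text.strip() for sent in doc.sents]

# The "bag of words" step: CountVectorizer builds a vocabulary from every word
# in the book and turns each sentence into a sparse vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)
```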
We can then use a Naive Bayes classifier that will take our big set of word vectors along with our manual “plot or not” labels and try to find an equation that accurately sorts sentence vectors into the right category.
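The labeling and training might look something like this, continuing from the sketch above. The interactive y/n prompt is just one way to handle the manual part:

```python
import random
from sklearn.naive_bayes import MultinomialNB

# Pick a random chunk of sentences and label each one by hand.
sample_ids = random.sample(range(len(sentences)), 100)
labels = []
for i in sample_ids:
    answer = input(f"{sentences[i]}\n  plot? [y/n] ")
    labels.append(1 if answer.strip().lower().startswith("y") else 0)

# Fit the classifier on just the labeled rows of the word-count matrix.
classifier = MultinomialNB()
classifier.fit(X[sample_ids], labels)
```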
Once scikit-learn has found the best equation it can, we can abridge the book by passing every sentence in the book (not just our sample sentences) through the classifier equation and keeping only those that get sorted into the “plot” class.
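In code that last step is barely anything, again continuing from the sketches above (the output filename is arbitrary):

```python
# Run every sentence in the book through the trained classifier
# and keep only the ones it sorts into the "plot" class (label 1).
predictions = classifier.predict(X)
abridged = [sent for sent, keep in zip(sentences, predictions) if keep == 1]

with open("moby_dick_abridged.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(abridged))
```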
Of course, the accuracy of the equation created by scikit-learn will depend to some degree on how much data it has to work with. If we only provide a handful of labeled sentences from a single chapter about blacksmithing then the system won’t have any chance at successfully classifying sentences about whales and ships and storms.
After throwing together some code to do the above I decided to do a little experiment by running my code not once but three times. The first time I labeled only 10 randomly chosen sentences. The second time I labeled 100 sentences. Finally I labeled 1000 random sentences, which is nearly a full tenth of the book. I quit at that point, as any more than that and we’re starting to miss the point of automation (plus I had no desire to re-read half of Moby Dick out of order).
Let’s start by examining the lengths of the resulting abridgements:
The 10 sentence training set reduced the book to roughly 5000 sentences.
The 100 sentence training set reduced the book to roughly 6500 sentences.
The 1000 sentence training set reduced the book to roughly 7250 sentences.
At first this seems a little counterintuitive. Wasn’t giving the algorithm more data points supposed to make it better? Why did the results get less and less abridged?
But abridging a book isn’t just about making it short. Reducing the entire book down to the single sentence “Such a funny, sporty, gamy, jesty, joky, hoky-poky lad, is the Ocean, oh!” would be very efficient but wouldn’t give us a very good idea about what the book was actually about.
So let’s talk about how good each abridgment was.
One of my criteria for an abridgement was removing the more academic explanations from the book, none of which stands out more than Chapter 32, in which Herman Melville classifies and describes all the different types of whales known to man at the time.
The 10 sentence training set reduced Chapter 32 to 110 lines.
The 100 sentence training set reduced Chapter 32 to 121 lines.
The 1000 sentence training set reduced Chapter 32 to 77 lines.
So we can see here that while the 1000 sentence training set resulted in the longest overall abridgment it did the best job at removing the specific kind of sentences we didn’t want.
The obvious follow-on question is: How well did each model do at preserving the sentences we actually wanted to keep?
First off, the book-defining line “From hell’s heart I stab at thee; for hate’s sake I spit my last breath at thee” was only preserved by the 1000 sentence training set. Both the 10 and 100 sentence training sets marked it as unimportant and left it out. That’s a pretty big deal.
For another example consider this passage from the original work, in which the crew of a passing whaling ship tell Captain Ahab about their encounter with Moby Dick:
When this interlude was over, Captain Mayhew began a dark story concerning Moby Dick; not, however, without frequent interruptions from Gabriel, whenever his name was mentioned, and the crazy sea that seemed leagued with him.
It seemed that the Jeroboam had not long left home, when upon speaking a whale-ship, her people were reliably apprised of the existence of Moby Dick, and the havoc he had made.
Greedily sucking in this intelligence, Gabriel solemnly warned the captain against attacking the White Whale, in case the monster should be seen; in his gibbering insanity, pronouncing the White Whale to be no less a being than the Shaker God incarnated; the Shakers receiving the Bible.
But when, some year or two afterwards, Moby Dick was fairly sighted from the mast-heads, Macey, the chief mate, burned with ardour to encounter him; and the captain himself being not unwilling to let him have the opportunity, despite all the archangel’s denunciations and forewarnings, Macey succeeded in persuading five men to man his boat.
With them he pushed off; and, after much weary pulling, and many perilous, unsuccessful onsets, he at last succeeded in getting one iron fast.
Meantime, Gabriel, ascending to the main-royal mast-head, was tossing one arm in frantic gestures, and hurling forth prophecies of speedy doom to the sacrilegious assailants of his divinity.
Now, while Macey, the mate, was standing up in his boat’s bow, and with all the reckless energy of his tribe was venting his wild exclamations upon the whale, and essaying to get a fair chance for his poised lance, lo!
a broad white shadow rose from the sea; by its quick, fanning motion, temporarily taking the breath out of the bodies of the oarsmen.
Next instant, the luckless mate, so full of furious life, was smitten bodily into the air, and making a long arc in his descent, fell into the sea at the distance of about fifty yards.
Not a chip of the boat was harmed, nor a hair of any oarsman’s head; but the mate for ever sank.
Now let’s examine the 10 sentence training set abridgment:
I tell thee again, Gabriel, that—” But again the boat tore ahead as if dragged by fiends.
a broad white shadow rose from the sea; by its quick, fanning motion, temporarily taking the breath out of the bodies of the oarsmen.
Next instant, the luckless mate, so full of furious life, was smitten bodily into the air, and making a long arc in his descent, fell into the sea at the distance of about fifty yards.
The biggest problem here is that we lose context. One second the captain is talking to Ahab and then suddenly a white whale shows up and kills a sailor. The fact that the death happened inside a story the captain was telling to Ahab is lost, leading to quite a bit of confusion in the story’s timeline.
However, both the 100 and 1000 sentence training sets manage to properly recognize that the majority of the passage is indeed part of an interesting story:
When this interlude was over, Captain Mayhew began a dark story concerning Moby Dick; not, however, without frequent interruptions from Gabriel, whenever his name was mentioned, and the crazy sea that seemed leagued with him.
It seemed that the Jeroboam had not long left home, when upon speaking a whale-ship, her people were reliably apprised of the existence of Moby Dick, and the havoc he had made.
But when, some year or two afterwards, Moby Dick was fairly sighted from the mast-heads, Macey, the chief mate, burned with ardour to encounter him; and the captain himself being not unwilling to let him have the opportunity, despite all the archangel’s denunciations and forewarnings, Macey succeeded in persuading five men to man his boat.
With them he pushed off; and, after much weary pulling, and many perilous, unsuccessful onsets, he at last succeeded in getting one iron fast.
Meantime, Gabriel, ascending to the main-royal mast-head, was tossing one arm in frantic gestures, and hurling forth prophecies of speedy doom to the sacrilegious assailants of his divinity.
Now, while Macey, the mate, was standing up in his boat’s bow, and with all the reckless energy of his tribe was venting his wild exclamations upon the whale, and essaying to get a fair chance for his poised lance, lo!
Next instant, the luckless mate, so full of furious life, was smitten bodily into the air, and making a long arc in his descent, fell into the sea at the distance of about fifty yards.
Not a chip of the boat was harmed, nor a hair of any oarsman’s head; but the mate for ever sank.
Between these two examples I think we have pretty solidly demonstrated why the filter that filters most is not necessarily the filter that filters best.
Of course, even the better abridgments aren’t very good. You can’t really shorten a story by picking and choosing sentences. You have to read the story, understand it, carefully decide what to discard, and then modify what remains so it still holds together.
But it was nonetheless fun to see what we could achieve in a single day with the help of open source software.