We’ve gotten pretty good at finding words that an author uses unusually often. But what about the opposite? What about words that the author uses less than we expect, or even not at all? After all, Cuchulain, the Hound of Ulster, was defined as much by its avoidance of contemporary English as it was by its use of mythological language.
The trick here will be not to flood ourselves with useless information. By the nature of writing, most books leave far more words unused than used. Most books only have one or two main topics but thousands upon thousands of non-topics.
Let’s not spend too much time overthinking this, though. While the “best” way to solve this problem likely involves very sophisticated algorithms trained on massive data sets, we can start out with a much more naive solution: we’ll just make a list of the most common words for a given era and then see which of those words either don’t show up in Cuchulain or show up at less than 10% of their expected frequency. The main trick will be limiting the size of the common word list so as not to produce a massive output file.
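To make the naive approach concrete, here is a minimal Python sketch. It assumes we already have word counts for the book and a dictionary of relative frequencies for the era's common words, and it reads the 10% rule as "less than 10% of the reference frequency". The function name `underused_words`, the parameter `max_ratio`, and the toy data are all made up for illustration.

```python
from collections import Counter

def underused_words(book_counts, common_freqs, max_ratio=0.10):
    """Return common words the book never uses, or uses at less than
    max_ratio of their reference frequency (hypothetical helper)."""
    total = sum(book_counts.values())
    underused = []
    for word, expected in common_freqs.items():
        # Relative frequency of the word inside the book (0 if absent).
        observed = book_counts.get(word, 0) / total if total else 0.0
        if observed < expected * max_ratio:
            underused.append(word)
    return sorted(underused)

# Toy stand-ins for the real book text and reference frequencies.
book_counts = Counter("the hound of ulster met the king and the king spoke".split())
common_freqs = {"the": 0.05, "of": 0.03, "system": 0.0005, "government": 0.0004}

print(underused_words(book_counts, common_freqs))
```

Words that never appear get an observed frequency of zero, so they always pass the check. Words that do appear are only flagged when they fall well below what the reference data predicts.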
The most obvious solution is to set a cutoff value for how frequently a word has to be used to be considered “common”. If our result set is too large we’ll make the threshold larger. If it is too small we’ll make the threshold smaller.
Based on experimentation I found that a threshold frequency of about 0.00005 seems to work fairly well. That’s 1 in every 20,000 words.
Using that threshold, here are the words that show up significantly less than expected in Cuchulain compared to modern English:
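As a rough sketch of how that cutoff shapes the common word list, the snippet below filters a dictionary of relative frequencies by threshold. The helper `common_words` and the sample frequencies are invented for illustration, and treating the cutoff as inclusive (`>=`) is an assumption.

```python
def common_words(reference_freqs, threshold=0.00005):
    """Keep only words whose relative frequency meets the threshold
    (hypothetical helper; inclusive cutoff is an assumption)."""
    return {word: freq for word, freq in reference_freqs.items() if freq >= threshold}

# Made-up relative frequencies, for illustration only.
reference_freqs = {
    "the": 0.05,
    "government": 0.0004,
    "data": 0.00006,
    "zeugma": 0.0000001,
}

default_list = common_words(reference_freqs)           # 1 in 20,000
stricter_list = common_words(reference_freqs, 0.0001)  # 1 in 10,000

print(sorted(default_list))
print(sorted(stricter_list))
```

Raising the threshold shrinks the common word list, which in turn shrinks the list of missing words we have to read through.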
- A.
- America
- American
- B
- B.
- C
- D.
- E.
- London
- M.
- National
- States
- U.S.
- United
- University
- York
- activities
- activity
- addition
- al
- amount
- analysis
- area
- areas
- associated
- available
- b
- based
- basis
- case
- cases
- class
- community
- control
- cost
- data
- de
- decision
- described
- design
- developed
- development
- economic
- effects
- energy
- et
- evidence
- example
- function
- government
- groups
- growth
- important
- include
- including
- individual
- information
- involved
- issues
- levels
- local
- major
- management
- market
- method
- model
- money
- movement
- national
- non
- particularly
- patient
- patients
- per
- percent
- personal
- physical
- policy
- political
- population
- pp.
- problem
- problems
- process
- production
- program
- provide
- provided
- quality
- questions
- really
- related
- relationship
- research
- response
- result
- results
- rights
- role
- s
- section
- services
- shown
- significant
- similar
- specific
- states
- structure
- students
- studies
- study
- system
- systems
- term
- theory
- treatment
- type
- values
- view
- wanted
- working
Based on this, it looks like modern English must be dominated by socio-economic policy reports and discussions. That sounds plausible, and it is obviously not something we’re going to see in a collection of Irish legends.
What if we compare it to word frequencies from the early 1900s when the book was written?
- A.
- America
- American
- B.
- Church
- Dr.
- E.
- Fig
- French
- Government
- London
- M.
- Mr.
- Mrs.
- No.
- St.
- States
- United
- William
- acid
- amount
- attention
- case
- cases
- cent
- church
- class
- de
- development
- etc.
- generally
- government
- important
- individual
- influence
- law
- method
- money
- obtained
- office
- p.
- per
- persons
- political
- pressure
- process
- really
- result
- results
- s
- shown
- similar
- solution
- study
- system
- thousand
- town
- view
I’m seeing a lot of legal and policy words again.
While this experiment may not have told us much about Eleanor Hull’s approach to writing about mythology, it did tell us a lot about our Google word-count-based data set. Apparently a huge proportion of human writing is related to law, government, business and money.
Which makes sense! Those things are all very important and tend to get written about a lot. It does cause some potential problems for our goal of analyzing stories instead of news and reports, but we will continue to do the best we can with the data we have.