In preparation for a blog post I'm going to write sometime this week, I found myself wanting to somehow parametrize vocabulary richness in a piece of text.
btw, Code at bottom ;)
It's an interesting problem because when you read something, it's pretty easy to see when an author is using rich vocabulary, but trying to reduce this observation to a simple number turns out to be a bit of a brainfuck. It's clearly related to word frequencies, and what we seem to want is a measure of the shape of the word frequency distribution.
Luckily, googling around for about an hour turned up a clue. Way back in 1944 a statistician called G. U. Yule cracked this problem in a book titled The Statistical Study of Literary Vocabulary. What he came up with is the so-called Yule's K characteristic.
Wikipedia has very little on these things, so we know we're treading strange, strange ground here. Persistent searching with Google Scholar turned up a version of the book that wasn't paywalled to oblivion. However, since it was on Google Books, it was missing some key pages.
Luckily what looks like a random homework for R explains perfectly how to implement Yule's K value:
A complementary way of assessing the vocabulary difficulty of texts is to measure their lexical richness. Two indices one could use are Yule's K or Yule's I. These two are defined as follows:
(1) Yule's K: $K = 10{,}000 \cdot (M_2 - M_1) \div (M_1 \cdot M_1)$
(2) Yule's I: $I = (M_1 \cdot M_1) \div (M_2 - M_1)$
where $M_1$ is the number of all word forms a text consists of and $M_2$ is the sum of the products of each observed frequency to the power of two and the number of word types observed with that frequency (cf. Oakes 1998:204). For example, if one word occurs three times and four words occur five times, $M_2 = (1 \cdot 3^2) + (4 \cdot 5^2) = 109$. The larger Yule's K, the smaller the diversity of the vocabulary (and thus, arguably, the easier the text). Since Yule's I is based on the reciprocal of Yule's K, the larger Yule's I, the larger the diversity of the vocabulary (and thus, arguably, the more difficult the text).
Unfortunately, I don't have the link anymore. It was a seriously random PDF I found online; the title seems to be "Quantitative Corpus Linguistics with R: A Practical Introduction".
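The arithmetic in the quoted example is easy to check in a few lines of Python. This is only a sketch: the `spectrum` dict and the variable names are made up for illustration, and M1 is taken to be the total number of word tokens, which is one reading of "all word forms a text consists of".

```python
# Frequency spectrum from the quoted example:
# one word occurs 3 times, four words occur 5 times.
# Maps frequency -> number of word types with that frequency (hypothetical name).
spectrum = {3: 1, 5: 4}

# M1: total number of word tokens (assumption: "all word forms" means tokens).
M1 = sum(freq * types for freq, types in spectrum.items())

# M2: sum of frequency squared times the number of types with that frequency.
M2 = sum(freq ** 2 * types for freq, types in spectrum.items())

yule_k = 10000 * (M2 - M1) / (M1 * M1)
yule_i = (M1 * M1) / (M2 - M1)

print(M1, M2)                              # 23 109
print(round(yule_k, 1), round(yule_i, 2))  # 1625.7 6.15
```

Note that K and I always multiply to 10,000, so either one carries the same information.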
In the hope that this blog post saves somebody a few hours of googling when trying to measure vocabulary richness, here's my Python implementation of Yule's K characteristic (or rather its inverse, Yule's I):
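```python
from itertools import groupby

from nltk.stem.porter import PorterStemmer


def words(entry):
    # Split on whitespace, strip digits and punctuation from both ends,
    # and drop tokens that end up empty.
    return filter(lambda w: len(w) > 0,
                  [w.strip("0123456789!:,.?(){}[]") for w in entry.split()])


def yule(entry):
    # Yule's I measure (the inverse of Yule's K measure):
    # a higher number means higher diversity, i.e. richer vocabulary.
    d = {}  # stem -> number of occurrences
    stemmer = PorterStemmer()
    for w in words(entry):
        # Lowercase before stemming so "Words" and "words" share a stem.
        w = stemmer.stem(w.lower())
        try:
            d[w] += 1
        except KeyError:
            d[w] = 1

    # M1: the number of all word forms (tokens), not the number of distinct stems.
    M1 = float(sum(d.values()))
    # M2: for each observed frequency, freq^2 times the number of stems with it.
    M2 = sum([len(list(g)) * (freq ** 2) for freq, g in groupby(sorted(d.values()))])

    try:
        return (M1 * M1) / (M2 - M1)
    except ZeroDivisionError:
        # M2 == M1 when every stem occurs exactly once (or the text is empty).
        return 0
```

For example, my first version of that function, which took M1 to be the number of distinct stems rather than the total word count, gave 21.6 for this post.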
Just wish I knew how to make that middle part more functional-like. I don't like having weird for loops strewn about my code like that.
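One possible way to make that middle part more functional is `collections.Counter` from the standard library, which does the counting without an explicit loop. A minimal sketch, assuming the words are already stripped, lowercased, and stemmed (so it doesn't need NLTK) and taking M1 as the total token count per the quoted definition; `yule_i` is a name made up for this example.

```python
from collections import Counter


def yule_i(tokens):
    # tokens: an iterable of already normalized words (assumption of this sketch).
    freqs = Counter(tokens)             # word -> frequency
    spectrum = Counter(freqs.values())  # frequency -> number of words with it

    M1 = float(sum(freqs.values()))     # total number of tokens
    M2 = sum(n * freq ** 2 for freq, n in spectrum.items())

    try:
        return (M1 * M1) / (M2 - M1)
    except ZeroDivisionError:
        # Every word occurs exactly once, or there are no words at all.
        return 0


# One word three times and four words five times, as in the quoted example.
tokens = ["a"] * 3 + ["b", "c", "d", "e"] * 5
print(round(yule_i(tokens), 2))  # 6.15
```

The second `Counter` plays the role of the `groupby(sorted(...))` step: it groups equal frequencies and counts how many words share each one.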
Who am I and who do I help? I'm Swizec Teller and I turn coders into engineers with "Raw and honest from the heart!" writing. No bullshit. Real insights into the career and skills of a modern software engineer.