Please send exports of your WordPress blogs to swizec@swizec.com! For science!


But a little background :)

Lately I’ve been noticing a certain lack of the artsy side of life around me. Despite being a great fan of theatre, real books and things of that nature, I seem to have slipped into a very techy existence, where the closest I come to appreciating good art is the pile of books lying around my desk, waiting to be read for the past two years.

Being a man of a scientific bent, I couldn’t just make such claims without at least attempting to verify them. Plus it’s a cool thing to do and I had a bit of time.

Luckily for me, there’s this brilliant dataset, a cool insight into my mind … this blog. I started writing back when I was smack dab in the middle of my last year of high school, which I consider to be the height of my habit of regularly reading real books. Over the years that habit has slowly declined. The natural thing to do, therefore, was to see if this is reflected in the way I write, which in turn reflects the way I think.

It’s the perfect experiment!

A general decline in how well read I am should be reflected in my writing style. I tracked six parameters:

  1. Length of words – more syllables mean heavier, awesomer words
  2. Sentence length – longer sentences are a sign of not speaking like a marketer or the internet
  3. Flesch-Kincaid – this is a measure of readability, or how educated one must be to understand your text
  4. Yule’s I – this is a measure of vocabulary richness, where I’m assuming a broader active vocabulary is a positive thing
  5. Length – simple, how many words and how many sentences there are in a post
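The metrics listed above can be sketched in a few lines of Python. This is a rough illustration, not the code behind the graphs: the function names, the vowel-group syllable heuristic, and the regex tokenizer are all assumptions made for the example. The Flesch-Kincaid grade level uses the standard formula, and Yule's I is computed in its commonly used form as M1² / (M2 − M1), where M1 is the total number of words and M2 is the sum of squared word frequencies.

```python
import re
from collections import Counter


def words(text):
    # Lowercased word tokens; a deliberately simple tokenizer (assumption).
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def sentences(text):
    # Split on runs of sentence-ending punctuation; crude but serviceable.
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def count_syllables(word):
    # Heuristic: each group of consecutive vowels counts as one syllable,
    # with a minimum of one per word. Real syllabification is harder.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text):
    w, s = words(text), sentences(text)
    syllables = sum(count_syllables(x) for x in w)
    return 0.39 * len(w) / len(s) + 11.8 * syllables / len(w) - 15.59


def yules_i(text):
    freq = Counter(words(text))
    m1 = sum(freq.values())                 # total number of words
    m2 = sum(f * f for f in freq.values())  # sum of squared frequencies
    if m2 == m1:
        # Every word occurs exactly once, so the ratio is undefined.
        return float("inf")
    return m1 * m1 / (m2 - m1)


sample = "The cat sat. The cat ran."
print(len(words(sample)), len(sentences(sample)))
print(flesch_kincaid_grade(sample), yules_i(sample))
```

Very short texts give extreme values on both scores, so the numbers only start to mean something on full posts.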

The graphs are somewhat slow to load; click here to see shiny graphs in full JavaScript glory.

For everyone else, here are some screenshots:

Word and sentence length

Readability

As you can see, the data is very squiggly, but there is no trend showing. The problem with this is that I just don’t know how to draw a conclusion from these graphs. They seem to suggest that in five years of blogging I haven’t progressed as a writer in the least bit.

On the other hand, they seem to suggest that despite four years of university education, my writing has stayed at a high school level, which lends some credence to the original hypothesis.

To better judge what’s going on, I need to compare my data with other blogs and see if trends are cropping up anywhere at all. So please, if you have a WordPress blog, send me an export of your posts to swizec@swizec.com. It’s for science!

And you get to show off some shiny graphs and find out about how you stack up as a writer.

  • Anonymous

    It’s an interesting idea, but I have a suspicion the data set is going to be too noisy to draw any real conclusions from without doing some cross-referencing. Each of the metrics you list, when charted on its own, doesn’t really show much of interest (and I doubt it would; they’re better for comparative analysis of two or more individuals writing about the same thing). If you do some comparisons between those graphs, though, I suspect you might find some interesting numbers.

    One thing that does worry me is the graph of average syllable length. I find it hard to believe that, with a single exception, every last post you’ve ever made has a mean syllable length of exactly 1. To me that looks, at the very least, like a floating-point to integer truncation error. Either that or it’s a median measure and not a mean.

    Flesch-Kincaid is probably the best metric and shows a strong correlation to Yule’s I, which is encouraging, but it is still, I suspect, largely useless for what you’re trying to do. The problem is that not every topic is as advanced as every other, so your vocabulary richness will tend to vary drastically. Furthermore, because each post is generally on one topic, the vocabulary used will tend to be somewhat homogeneous throughout that post. If you computed the Yule’s I of all your posts together, I think you’d find that number to be somewhat higher than that of any individual post, if only because you’d be covering a broader swath of topics and so demonstrating your full vocabulary range. Perhaps it would be more interesting to compute the Flesch-Kincaid and Yule’s I of 6 months of posts at a time.

    Another thing to consider is that the horizontal scale of your graphs is a bit misleading as your posts are scattered around but the scale appears linear. If you did some plots with an accurate x axis the graphs would be sparser but perhaps more accurate. Of course it could turn out that those graphs don’t really provide any more insight than the current ones.
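The suggestion in this comment, scoring six months of posts at a time rather than each post alone, could look roughly like the sketch below. It assumes posts are available as (date, text) pairs, which is a made-up input format for illustration; the helper names and the January to June / July to December split are likewise assumptions. Yule's I is computed as M1² / (M2 − M1) over the pooled words of each window.

```python
import re
from collections import Counter, defaultdict
from datetime import date


def tokens(text):
    # Simple lowercase word tokenizer (assumption).
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def yules_i(toks):
    freq = Counter(toks)
    m1 = len(toks)                          # total number of words
    m2 = sum(f * f for f in freq.values())  # sum of squared frequencies
    return float("inf") if m2 == m1 else m1 * m1 / (m2 - m1)


def half_year(d):
    # (year, 1) for January to June, (year, 2) for July to December.
    return (d.year, 1 if d.month <= 6 else 2)


def pooled_yules_i(posts):
    # Pool the words of every post that falls in the same half-year,
    # then score each pool once instead of scoring posts individually.
    buckets = defaultdict(list)
    for posted_on, text in posts:
        buckets[half_year(posted_on)].extend(tokens(text))
    return {window: yules_i(toks) for window, toks in sorted(buckets.items())}


posts = [
    (date(2010, 1, 5), "the cat sat"),
    (date(2010, 3, 1), "the cat ran"),
    (date(2010, 9, 1), "a dog a dog"),
]
print(pooled_yules_i(posts))
```

Because the result is keyed by calendar window rather than by post index, plotting it also gives an evenly spaced time axis.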

  • http://swizec.com Swizec

    You’re right, I should devote some attention to cross-referencing the metrics. Yule’s I and Flesch-Kincaid in particular should provide interesting data, especially because the latter already incorporates all the other metrics in itself.

    As for the syllable count, I suspect this comes from the large predominance of short, basically meaningless words in English text, since I didn’t do any stopwording. This should become much more interesting if I remove at least all prepositions and articles.

    What I think you’re suggesting are sort of n-grams of posts? That could be very interesting, I’ll give it a look and see if something interesting comes up.

    The scale is linear simply because I had too much trouble making it nonlinear. The graphs wouldn’t parse properly and I didn’t feel like ironing out all the kinks at the time, so I assumed these graphs would be good enough for a first approximation.
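Both explanations offered in this thread for the flat syllable graph, integer truncation and the weight of short function words, can be checked on a toy sentence. The sketch below is only an illustration: the tiny stopword set, the vowel-group syllable heuristic, and the function names are assumptions, not the code that produced the graphs.

```python
import re

# A tiny, hand-picked set of articles and prepositions (assumption);
# a real run would use a fuller stopword list.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "for", "with", "by", "from"}


def tokens(text):
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def count_syllables(word):
    # Heuristic: one syllable per group of consecutive vowels, minimum one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def mean_syllables(word_list):
    # True division keeps the fractional part of the average.
    return sum(count_syllables(w) for w in word_list) / len(word_list)


def content_words(text):
    # Drop articles and prepositions before measuring word length.
    return [w for w in tokens(text) if w not in STOPWORDS]


text = "the history of the theatre in a university"
all_words = tokens(text)

print(mean_syllables(all_words))            # fractional mean over every word
print(int(mean_syllables(all_words)))       # truncated to an integer
print(mean_syllables(content_words(text)))  # mean after stopword removal
```

If the truncated value is what ends up on the graph, nearly every post would land on the same integer, which would match the flat line the first comment describes.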