swizec.com

#### Senior Mindset Book

Get promoted, earn a bigger salary, work for top companies

Learn more

# Markov chain poem trainer+generator in 29 sloc of Haskell

Another step in my Automagic Poetry Generation project.

when i think we have tickets we't my abortion i drank it't my best friend melissa mahoney oh we tried to the building

~The Markov Chain, based on Amanda Palmer's Oasis

I'm making an evolutionary algorithm to generate poetry, but it needs a good base to start from. Random data is preferred, but being completely random means you waste the first couple of epochs just getting to something solid to work from - that's a waste of time.

To make everything a bit smoother, I've written a markov chain generator in Haskell that generates the initial population. The principle is very simple:

1. Read seed poem from file
2. Tokenize
3. Train markov chain
4. A generator spits out an infinitely long list based on the markov chain
5. Take X tokens
6. Detokenize
7. Done

Simple.

A quick brush up for people who don't breathe "hardcore" Comp Sci stuff: A markov chain is essentially a probabilistic state automata, usually represented with a probability matrix. You take the current state and translate it into the next state by observing the proper probabilities.

The output is a random string that looks a lot like some drunk tweets you might see on a friday night.

## Le Code

Our main function is a simple convolution of a bunch of things:

```.css-1yb0ye3{font-family:monospace;color:#728fcb;background-color:#faf8f5;font-size:0.9em;padding-left:0;padding-right:0;}.css-1yb0ye3 .comment,.css-1yb0ye3 .prolog,.css-1yb0ye3 .doctype,.css-1yb0ye3 .cdata,.css-1yb0ye3 .punctuation{color:#b6ad9a;}.css-1yb0ye3 .namespace{opacity:0.7;}.css-1yb0ye3 .tag,.css-1yb0ye3 .operator,.css-1yb0ye3 .number{color:#063289;}.css-1yb0ye3 .property,.css-1yb0ye3 .function{color:#b29762;}.css-1yb0ye3 .tag-id,.css-1yb0ye3 .selector,.css-1yb0ye3 .atrule-id{color:#2d2006;}.css-1yb0ye3 .attr-name{color:#896724;}.css-1yb0ye3 .boolean,.css-1yb0ye3 .string,.css-1yb0ye3 .entity,.css-1yb0ye3 .url,.css-1yb0ye3 .attr-value,.css-1yb0ye3 .keyword,.css-1yb0ye3 .control,.css-1yb0ye3 .directive,.css-1yb0ye3 .unit,.css-1yb0ye3 .statement,.css-1yb0ye3 .regex,.css-1yb0ye3 .at-rule{color:#728fcb;}.css-1yb0ye3 .placeholder,.css-1yb0ye3 .variable{color:#93abdc;}.css-1yb0ye3 .deleted{text-decoration-line:line-through;}.css-1yb0ye3 .inserted{text-decoration-line:underline;}.css-1yb0ye3 .italic{font-style:italic;}.css-1yb0ye3 .important,.css-1yb0ye3 .bold{font-weight:700;}.css-1yb0ye3 .important{color:#896724;}.css-1yb0ye3 .highlight{background:hsla(0, 0%, 70%, .5);}.css-o6ar0x{font-family:monospace;color:#728fcb;background-color:#faf8f5;font-size:0.9em;padding-left:0;padding-right:0;font-family:monospace;color:#728fcb;background-color:#faf8f5;font-size:0.9em;padding-left:0;padding-right:0;}.css-o6ar0x .comment,.css-o6ar0x .prolog,.css-o6ar0x .doctype,.css-o6ar0x .cdata,.css-o6ar0x .punctuation{color:#b6ad9a;}.css-o6ar0x .namespace{opacity:0.7;}.css-o6ar0x .tag,.css-o6ar0x .operator,.css-o6ar0x .number{color:#063289;}.css-o6ar0x .property,.css-o6ar0x .function{color:#b29762;}.css-o6ar0x .tag-id,.css-o6ar0x .selector,.css-o6ar0x .atrule-id{color:#2d2006;}.css-o6ar0x .attr-name{color:#896724;}.css-o6ar0x .boolean,.css-o6ar0x .string,.css-o6ar0x .entity,.css-o6ar0x .url,.css-o6ar0x .attr-value,.css-o6ar0x .keyword,.css-o6ar0x .control,.css-o6ar0x .directive,.css-o6ar0x .unit,.css-o6ar0x .statement,.css-o6ar0x .regex,.css-o6ar0x .at-rule{color:#728fcb;}.css-o6ar0x .placeholder,.css-o6ar0x .variable{color:#93abdc;}.css-o6ar0x .deleted{text-decoration-line:line-through;}.css-o6ar0x .inserted{text-decoration-line:underline;}.css-o6ar0x .italic{font-style:italic;}.css-o6ar0x .important,.css-o6ar0x .bold{font-weight:700;}.css-o6ar0x .important{color:#896724;}.css-o6ar0x .highlight{background:hsla(0, 0%, 70%, .5);}.css-o6ar0x .comment,.css-o6ar0x .prolog,.css-o6ar0x .doctype,.css-o6ar0x .cdata,.css-o6ar0x .punctuation{color:#b6ad9a;}.css-o6ar0x .namespace{opacity:0.7;}.css-o6ar0x .tag,.css-o6ar0x .operator,.css-o6ar0x .number{color:#063289;}.css-o6ar0x .property,.css-o6ar0x .function{color:#b29762;}.css-o6ar0x .tag-id,.css-o6ar0x .selector,.css-o6ar0x .atrule-id{color:#2d2006;}.css-o6ar0x .attr-name{color:#896724;}.css-o6ar0x .boolean,.css-o6ar0x .string,.css-o6ar0x .entity,.css-o6ar0x .url,.css-o6ar0x .attr-value,.css-o6ar0x .keyword,.css-o6ar0x .control,.css-o6ar0x .directive,.css-o6ar0x .unit,.css-o6ar0x .statement,.css-o6ar0x .regex,.css-o6ar0x .at-rule{color:#728fcb;}.css-o6ar0x .placeholder,.css-o6ar0x .variable{color:#93abdc;}.css-o6ar0x .deleted{text-decoration-line:line-through;}.css-o6ar0x .inserted{text-decoration-line:underline;}.css-o6ar0x .italic{font-style:italic;}.css-o6ar0x .important,.css-o6ar0x .bold{font-weight:700;}.css-o6ar0x .important{color:#896724;}.css-o6ar0x .highlight{background:hsla(0, 0%, 70%, .5);}```start_population::(RandomGen g) => g -> String -> IO String
start_population gen start = do
return . (foldr detokenize "") . (take Config.seed_length) . (produce gen start). chain . tokenize =<< readFile Config.seed_data
``````

Reading it from right to left you can see that it first reads some data, tokenizes it, makes the chain, produces some output, cuts it to the proper length and then shoves it back into a normal string.

Simple.

Because I wanted to handle punctuation and new lines, which are important in poems, I had to write my own tokenization and detokenization functions. Otherwise the built in words function would be sufficient.

``````tokenize::String -> [String]
tokenize s = Prelude.filter (\x -> x /= " " && x /= "") \$
Split.split (whenElt
(\x -> isSeparator x || isPunctuation x || x == '\n')) \$
Prelude.map toLower s

detokenize::String -> String -> String
detokenize a b
| punctuation a || punctuation b = a++b
| otherwise = a++" "++b
where punctuation = (\x -> length x > 0 && isPunctuation (x!!0))
``````

You can see that tokenize splits on pretty much everything and detokenize takes special care not to put spaces around punctuation.

Another important step is building the chain itself.

``````chain::[String] -> Map String [String]
chain [now, last] =
insert now [last] \$ singleton last []
chain (token:xs) =
insertWith (\new old -> new++old) token [xs!!0] \$ chain xs
``````

Simply put - this function builds a HashMap from a token to many tokens. The idea here is to make a note of every token that comes after some other token. To make things simpler, if a pair of tokens happens twice, it will be recorded twice.

This magically gives us the ability to properly weigh the random function that chooses what to generate next.

``````next_token::(RandomGen g) => g -> Map String [String] -> String -> (g, String)
next_token gen map s =
let choices = findWithDefault [] s map
(i, gen') = randomR (0, length choices - 1) gen
in (gen', choices!!i)

produce::(RandomGen g) => g -> String -> Map String [String] -> [String]
produce gen s map =
let (gen', next) = next_token gen map s
in s:(produce gen' next map)
``````

I have a nasty suspicion the next_token and produce functions could be merged, but I found this easier to reason about.

Next_token is the meat of our algorithm - it does nothing but take a token, find a list of its possible successors in the HashMap and return a random member of that list. To avoid any issues it will return an empty string if nothing is found.

The produce function takes care of driving next_token and makes sure it gets a fresh random generator every time.

A problem with my technique is that once you give a random generator to the markov chain, you're not getting it back. While it does ensure the result will always be fresh, you might be using a stale generator in other parts of your code if you're not careful.

Maybe I should finally look into that random monad I've been hearing about.

Either way, here's the full code, which is 38 sloc because I like including the function headers - makes code easier to think about, but I've tried and it does work without any type hints. Haskell is smarter than I am. Plus I added the whole part that only exposes start_population to the outside world, which isn't otherwise necessary.

``````module Initiators.MarkovChain (
start_population
) where

import System.Random
import Data.HashMap
import Data.List.Split as Split
import Data.Char

import Config

-- read corpus data
-- build markov chain
-- spit out data
start_population::(RandomGen g) => g -> String -> IO String
start_population gen start = do
return . (foldr detokenize "") . (take Config.seed_length) . (produce gen start). chain . tokenize =<< readFile Config.seed_data tokenize::String -> [String]
tokenize s = Prelude.filter (\x -> x /= " " && x /= "") \$
Split.split (whenElt
(\x -> isSeparator x || isPunctuation x || x == '\n')) \$
Prelude.map toLower s

detokenize::String -> String -> String
detokenize a b
| punctuation a || punctuation b = a++b
| otherwise = a++" "++b
where punctuation = (\x -> length x > 0 && isPunctuation (x!!0))

chain::[String] -> Map String [String]
chain [now, last] =
insert now [last] \$ singleton last []
chain (token:xs) =
insertWith (\new old -> new++old) token [xs!!0] \$ chain xs

next_token::(RandomGen g) => g -> Map String [String] -> String -> (g, String)
next_token gen map s =
let choices = findWithDefault [] s map
(i, gen') = randomR (0, length choices - 1) gen
in (gen', choices!!i)

produce::(RandomGen g) => g -> String -> Map String [String] -> [String]
produce gen s map =
let (gen', next) = next_token gen map s
in s:(produce gen' next map)
``````
Published on September 21st, 2012 in Amanda Palmer, Arts, Haskell, Markov chain, Randomness, Uncategorized

#### Continue reading about Markov chain poem trainer+generator in 29 sloc of Haskell

Semantically similar articles hand-picked by GPT-4

### Senior Mindset Book

Get promoted, earn a bigger salary, work for top companies

Learn more

Have a burning question that you think I can answer? Hit me up on twitter and I'll do my best.

Who am I and who do I help? I'm Swizec Teller and I turn coders into engineers with "Raw and honest from the heart!" writing. No bullshit. Real insights into the career and skills of a modern software engineer.

Want to become a true senior engineer? Take ownership, have autonomy, and be a force multiplier on your team. The Senior Engineer Mindset ebook can help 👉 swizec.com/senior-mindset. These are the shifts in mindset that unlocked my career.

Curious about Serverless and the modern backend? Check out Serverless Handbook, for frontend engineers 👉 ServerlessHandbook.dev

Want to Stop copy pasting D3 examples and create data visualizations of your own? Learn how to build scalable dataviz React components your whole team can understand with React for Data Visualization

Want to get my best emails on JavaScript, React, Serverless, Fullstack Web, or Indie Hacking? Check out swizec.com/collections

Did someone amazing share this letter with you? Wonderful! You can sign up for my weekly letters for software engineers on their path to greatness, here: swizec.com/blog

Want to brush up on your modern JavaScript syntax? Check out my interactive cheatsheet: es6cheatsheet.com

By the way, just in case no one has told you it yet today: I love and appreciate you for who you are ❤️

Created by Swizec with ❤️