Another step in my Automagic Poetry Generation project.

when i think we have tickets
we’t my abortion i drank it’t my best friend melissa mahoney
oh
we tried to the building

~The Markov Chain, based on Amanda Palmer’s Oasis

I’m making an evolutionary algorithm to generate poetry, but it needs a good base to start from. Random data is preferred, but being completely random means you waste the first couple of epochs just getting to something solid to work from – that’s a waste of time.

To make everything a bit smoother, I’ve written a markov chain generator in Haskell that generates the initial population. The principle is very simple:

  1. Read seed poem from file
  2. Tokenize
  3. Train markov chain
  4. A generator spits out an infinitely long list based on the markov chain
  5. Take X tokens
  6. Detokenize
  7. Done

Simple.

A quick brush up for people who don’t breathe “hardcore” Comp Sci stuff: A markov chain is essentially a probabilistic state automata, usually represented with a probability matrix. You take the current state and translate it into the next state by observing the proper probabilities.

The output is a random string that looks a lot like some drunk tweets  you might see on a friday night.

Le Code

Our main function is a simple convolution of a bunch of things:

start_population::(RandomGen g) => g -> String -> IO String
start_population gen start = do
  return . (foldr detokenize "") . (take Config.seed_length) . (produce gen start). chain . tokenize =<< readFile Config.seed_data

Reading it from right to left you can see that it first reads some data, tokenizes it, makes the chain, produces some output, cuts it to the proper length and then shoves it back into a normal string.

Simple.

Because I wanted to handle punctuation and new lines, which are important in poems, I had to write my own tokenization and detokenization functions. Otherwise the built in words function would be sufficient.

tokenize::String -> [String]
tokenize s = Prelude.filter (\x -> x /= " " && x /= "") $
             Split.split (whenElt
                          (\x -> isSeparator x || isPunctuation x || x == '\n')) $
             Prelude.map toLower s
 
detokenize::String -> String -> String
detokenize a b
  | punctuation a || punctuation b = a++b
  | otherwise = a++" "++b
  where punctuation = (\x -> length x > 0 && isPunctuation (x!!0))

You can see that tokenize splits on pretty much everything and detokenize takes special care not to put spaces around punctuation.

Another important step is building the chain itself.

chain::[String] -> Map String [String]
chain [now, last] =
  insert now [last] $ singleton last []
chain (token:xs) =
  insertWith (\new old -> new++old) token [xs!!0] $ chain xs

Simply put – this function builds a HashMap from a token to many tokens. The idea here is to make a note of every token that comes after some other token. To make things simpler, if a pair of tokens happens twice, it will be recorded twice.

This magically gives us the ability to properly weigh the random function that chooses what to generate next.

next_token::(RandomGen g) => g -> Map String [String] -> String -> (g, String)
next_token gen map s =
  let choices = findWithDefault [] s map
      (i, gen') = randomR (0, length choices - 1) gen
  in (gen', choices!!i)
 
produce::(RandomGen g) => g -> String -> Map String [String] -> [String]
produce gen s map =
  let (gen', next) = next_token gen map s
  in s:(produce gen' next map)

I have a nasty suspicion the next_token and produce functions could be merged, but I found this easier to reason about.

Next_token is the meat of our algorithm – it does nothing but take a token, find a list of its possible successors in the HashMap and return a random member of that list. To avoid any issues it will return an empty string if nothing is found.

The produce function takes care of driving next_token and makes sure it gets a fresh random generator every time.

A problem with my technique is that once you give a random generator to the markov chain, you’re not getting it back. While it does ensure the result will always be fresh, you might be using a stale generator in other parts of your code if you’re not careful.

Maybe I should finally look into that random monad I’ve been hearing about.

Either way, here’s the full code, which is 38 sloc because I like including the function headers – makes code easier to think about, but I’ve tried and it does work without any type hints. Haskell is smarter than I am. Plus I added the whole part that only exposes start_population to the outside world, which isn’t otherwise necessary.

module Initiators.MarkovChain (
  start_population
  ) where
 
import System.Random
import Data.HashMap
import Data.List.Split as Split
import Data.Char
 
import Config
 
-- read corpus data
-- build markov chain
-- spit out data
start_population::(RandomGen g) => g -> String -> IO String
start_population gen start = do
  return . (foldr detokenize "") . (take Config.seed_length) . (produce gen start). chain . tokenize =<< readFile Config.seed_data tokenize::String -> [String]
tokenize s = Prelude.filter (\x -> x /= " " && x /= "") $
             Split.split (whenElt
                          (\x -> isSeparator x || isPunctuation x || x == '\n')) $
             Prelude.map toLower s
 
detokenize::String -> String -> String
detokenize a b
  | punctuation a || punctuation b = a++b
  | otherwise = a++" "++b
  where punctuation = (\x -> length x > 0 && isPunctuation (x!!0))
 
chain::[String] -> Map String [String]
chain [now, last] =
  insert now [last] $ singleton last []
chain (token:xs) =
  insertWith (\new old -> new++old) token [xs!!0] $ chain xs
 
next_token::(RandomGen g) => g -> Map String [String] -> String -> (g, String)
next_token gen map s =
  let choices = findWithDefault [] s map
      (i, gen') = randomR (0, length choices - 1) gen
  in (gen', choices!!i)
 
produce::(RandomGen g) => g -> String -> Map String [String] -> [String]
produce gen s map =
  let (gen', next) = next_token gen map s
  in s:(produce gen' next map)
Enhanced by Zemanta