Swizec Teller - a geek with a hatswizec.com

Senior Mindset Book

Get promoted, earn a bigger salary, work for top companies

Senior Engineer Mindset cover
Learn more

    Markov chain poem trainer+generator in 29 sloc of Haskell

    Another step in my Automagic Poetry Generation project.

    when i think we have tickets we't my abortion i drank it't my best friend melissa mahoney oh we tried to the building

    ~The Markov Chain, based on Amanda Palmer's Oasis

    I'm making an evolutionary algorithm to generate poetry, but it needs a good base to start from. Random data is preferred, but being completely random means you waste the first couple of epochs just getting to something solid to work from - that's a waste of time.

    To make everything a bit smoother, I've written a markov chain generator in Haskell that generates the initial population. The principle is very simple:

    1. Read seed poem from file
    2. Tokenize
    3. Train markov chain
    4. A generator spits out an infinitely long list based on the markov chain
    5. Take X tokens
    6. Detokenize
    7. Done

    Simple.

    A quick brush up for people who don't breathe "hardcore" Comp Sci stuff: A markov chain is essentially a probabilistic state automata, usually represented with a probability matrix. You take the current state and translate it into the next state by observing the proper probabilities.

    The output is a random string that looks a lot like some drunk tweets you might see on a friday night.

    Le Code

    Our main function is a simple convolution of a bunch of things:

    start_population::(RandomGen g) => g -> String -> IO String
    start_population gen start = do
      return . (foldr detokenize "") . (take Config.seed_length) . (produce gen start). chain . tokenize =<< readFile Config.seed_data
    

    Reading it from right to left you can see that it first reads some data, tokenizes it, makes the chain, produces some output, cuts it to the proper length and then shoves it back into a normal string.

    Simple.

    Because I wanted to handle punctuation and new lines, which are important in poems, I had to write my own tokenization and detokenization functions. Otherwise the built in words function would be sufficient.

    tokenize::String -> [String]
    tokenize s = Prelude.filter (\x -> x /= " " && x /= "") $
                 Split.split (whenElt
                              (\x -> isSeparator x || isPunctuation x || x == '\n')) $
                 Prelude.map toLower s
    
    detokenize::String -> String -> String
    detokenize a b
      | punctuation a || punctuation b = a++b
      | otherwise = a++" "++b
      where punctuation = (\x -> length x > 0 && isPunctuation (x!!0))
    

    You can see that tokenize splits on pretty much everything and detokenize takes special care not to put spaces around punctuation.

    Another important step is building the chain itself.

    chain::[String] -> Map String [String]
    chain [now, last] =
      insert now [last] $ singleton last []
    chain (token:xs) =
      insertWith (\new old -> new++old) token [xs!!0] $ chain xs
    

    Simply put - this function builds a HashMap from a token to many tokens. The idea here is to make a note of every token that comes after some other token. To make things simpler, if a pair of tokens happens twice, it will be recorded twice.

    This magically gives us the ability to properly weigh the random function that chooses what to generate next.

    next_token::(RandomGen g) => g -> Map String [String] -> String -> (g, String)
    next_token gen map s =
      let choices = findWithDefault [] s map
          (i, gen') = randomR (0, length choices - 1) gen
      in (gen', choices!!i)
    
    produce::(RandomGen g) => g -> String -> Map String [String] -> [String]
    produce gen s map =
      let (gen', next) = next_token gen map s
      in s:(produce gen' next map)
    

    I have a nasty suspicion the next_token and produce functions could be merged, but I found this easier to reason about.

    Next_token is the meat of our algorithm - it does nothing but take a token, find a list of its possible successors in the HashMap and return a random member of that list. To avoid any issues it will return an empty string if nothing is found.

    The produce function takes care of driving next_token and makes sure it gets a fresh random generator every time.

    A problem with my technique is that once you give a random generator to the markov chain, you're not getting it back. While it does ensure the result will always be fresh, you might be using a stale generator in other parts of your code if you're not careful.

    Maybe I should finally look into that random monad I've been hearing about.

    Either way, here's the full code, which is 38 sloc because I like including the function headers - makes code easier to think about, but I've tried and it does work without any type hints. Haskell is smarter than I am. Plus I added the whole part that only exposes start_population to the outside world, which isn't otherwise necessary.

    module Initiators.MarkovChain (
      start_population
      ) where
    
    import System.Random
    import Data.HashMap
    import Data.List.Split as Split
    import Data.Char
    
    import Config
    
    -- read corpus data
    -- build markov chain
    -- spit out data
    start_population::(RandomGen g) => g -> String -> IO String
    start_population gen start = do
      return . (foldr detokenize "") . (take Config.seed_length) . (produce gen start). chain . tokenize =<< readFile Config.seed_data tokenize::String -> [String]
    tokenize s = Prelude.filter (\x -> x /= " " && x /= "") $
                 Split.split (whenElt
                              (\x -> isSeparator x || isPunctuation x || x == '\n')) $
                 Prelude.map toLower s
    
    detokenize::String -> String -> String
    detokenize a b
      | punctuation a || punctuation b = a++b
      | otherwise = a++" "++b
      where punctuation = (\x -> length x > 0 && isPunctuation (x!!0))
    
    chain::[String] -> Map String [String]
    chain [now, last] =
      insert now [last] $ singleton last []
    chain (token:xs) =
      insertWith (\new old -> new++old) token [xs!!0] $ chain xs
    
    next_token::(RandomGen g) => g -> Map String [String] -> String -> (g, String)
    next_token gen map s =
      let choices = findWithDefault [] s map
          (i, gen') = randomR (0, length choices - 1) gen
      in (gen', choices!!i)
    
    produce::(RandomGen g) => g -> String -> Map String [String] -> [String]
    produce gen s map =
      let (gen', next) = next_token gen map s
      in s:(produce gen' next map)
    
    Published on September 21st, 2012 in Amanda Palmer, Arts, Haskell, Markov chain, Randomness, Uncategorized

    Did you enjoy this article?

    Continue reading about Markov chain poem trainer+generator in 29 sloc of Haskell

    Semantically similar articles hand-picked by GPT-4

    Senior Mindset Book

    Get promoted, earn a bigger salary, work for top companies

    Learn more

    Have a burning question that you think I can answer? Hit me up on twitter and I'll do my best.

    Who am I and who do I help? I'm Swizec Teller and I turn coders into engineers with "Raw and honest from the heart!" writing. No bullshit. Real insights into the career and skills of a modern software engineer.

    Want to become a true senior engineer? Take ownership, have autonomy, and be a force multiplier on your team. The Senior Engineer Mindset ebook can help 👉 swizec.com/senior-mindset. These are the shifts in mindset that unlocked my career.

    Curious about Serverless and the modern backend? Check out Serverless Handbook, for frontend engineers 👉 ServerlessHandbook.dev

    Want to Stop copy pasting D3 examples and create data visualizations of your own? Learn how to build scalable dataviz React components your whole team can understand with React for Data Visualization

    Want to get my best emails on JavaScript, React, Serverless, Fullstack Web, or Indie Hacking? Check out swizec.com/collections

    Did someone amazing share this letter with you? Wonderful! You can sign up for my weekly letters for software engineers on their path to greatness, here: swizec.com/blog

    Want to brush up on your modern JavaScript syntax? Check out my interactive cheatsheet: es6cheatsheet.com

    By the way, just in case no one has told you it yet today: I love and appreciate you for who you are ❤️

    Created by Swizec with ❤️