Kinda okay generated text

I recently wrote a Markov chain package which included a random text generator. The generated text is not very good. Here’s a sample, based on War and Peace:

everything Make way to a child till late and was ready or to crush the Kiev And what he said Prince I will write a leather apron under this We shall go and pressed close up that’s why they are you spent every word from advancing to write to prevent his whole army their guilt which the wounded yet dies as he exclaimed Natásha of the mountains of the inn they all and to Znaim road crossed himself up to the midst of musketry on the big windows The major quietly with cards with his lips a bleeding and all

It’s just bad. Lips a bleeding and all.

I don’t have a good excuse to make a random text generator. They’re silly and mostly useless. But since I already made one, I might as well make a better one. Here’s what I ended up with:

It flopped into something moist, and the Moscovites who took their opinions from others– Ilyá Rostóv among them– remained for a long time settling into her place. Natásha involuntarily gazed at that neck, those shoulders, and chest and stomach involuntarily protruding, had that imposing and stately appearance one sees in very young children and very old people was particularly evident in her. Then it would suddenly seem to him that his long-felt wish, which constitutes the sinew of war, but only talked, and seldom listened; now he seemed to see through him, and it was with this that Daniel’s voice, excited and dissatisfied, interrupted by another, especially in clothes that were still moist.

I wouldn’t read a book of it, but it’s kinda okay. If you need a random text generator, give it a try: github.com/pboyd/randtxt

The rest of this post covers the evolution of the main algorithm.

Parts of speech

My first thought to fix the generated text was to use a part of speech (POS) tagger and insert tagged words into the chain. If you remember diagramming sentences in grade school, POS tagging is mostly the same thing. The tagger takes some text, say:

It’s not a question of where he grips it! It’s a simple question of weight ratios! A five ounce bird could not carry a 1 pound coconut.

And tags the parts of speech:

It/PRP ’s/VBZ not/RB a/DT question/NN of/IN where/WRB he/PRP grips/NNS it/PRP !/. It/PRP ’s/VBZ a/DT simple/JJ question/NN of/IN weight/NN ratios/NNS !/. A/DT five/CD ounce/NN bird/NN could/MD not/RB carry/VB a/DT 1/CD pound/NN coconut/NN ./.

(See here for a list of tags.)

It isn’t perfect (e.g. “grips” is a verb in the first sentence, but it was tagged as a plural noun). But I hoped it would lead to generated text that made more sense.

I experimented with a few POS taggers. For example, here’s text generated from a chain tagged by NLTK’s default POS tagger:

Well you do not at the situation genius utilized the Peace of the service A murmur of the Moscow to congratulate Your Majesty From the truly … Pierre He replied the village they fled across the military council was surrounded by giving final orders to them silently at that was a very prince who get what he came upon his battalion rushes straight up to cry on them to Alpátych was purely out of bathing and ruined and wishing for what happened not understand why for had not be his old countess instigation gathered round as if at the very

And here’s the Stanford POS Tagger:

he had been promoted to unite them. he could be too simple and since he was in French circle, took him with her dancing hall, too oppressive and burst in his pockets and again. Do you, in the piano in a map. know him with her mother’s room. Someday I have seen in chief. Not only redder than an officer was in the prospective purchaser was accepted. he could hear the work of Natásha was vaguely affectionate

And gopkg.in/jdkato/prose.v2:

and many of boxes and began thinking of the Russians , that dreadful if he kept asking the commissioner , he ’ll have quite useless to seem to the icon that prayer . “ But his exceedingly lofty , remember what they will remain . When your children and then . His whole bosom , the way that he shouted , I ? What a thousand men , obviously did not need to Moscow had looked the significance of marriage occupied , “ said to the policeman , outflankings , last seen the portmanteau and even thought she knew

(Apologies for the differences in punctuation; I was also working on that at the time.)

POS tagging doesn’t seem to make much difference. Theoretically, it should matter, so I’ve continued to do it.

N-grams

The biggest problem is that the generator chooses the next word based only on the previous word, which is why there are phrases like “I will write a leather apron”. Each pair of words makes sense, but the whole is nonsense when strung together.

N-grams are one way to fix that. The idea is to split the text into N-word chunks, usually 2 or 3 words but sometimes more. For example, take the first sentence from War and Peace:

Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes.

Ignoring punctuation, when N=2 that’s: Well Prince, Prince so, so Genoa, Genoa and, and Lucca, etc.

And N=3 is Well Prince so, Prince so Genoa, so Genoa and, etc.
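
In code, splitting a tokenized sentence into overlapping n-grams is only a few lines. A throwaway sketch, not part of randtxt:

// Sketch: split a slice of words into overlapping n-grams.
func ngrams(words []string, n int) [][]string {
	if len(words) < n {
		return nil
	}
	out := make([][]string, 0, len(words)-n+1)
	for i := 0; i+n <= len(words); i++ {
		out = append(out, words[i:i+n])
	}
	return out
}

Calling ngrams(strings.Fields("Well Prince so Genoa and Lucca"), 2) gives Well Prince, Prince so, so Genoa, and so on.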

I didn’t want to do it that way, because it produces duplicate words. But I did a quick test where N=2 and manually removed the duplicates:

Prince Vasíli announced his hopes and wishes about it , but it wo n’t lack suitors . “ What an argumentative fellow you are in it a new , envious glance at all surprised by what he had continually sought to substitute republican for monarchical institutions . These reasons were very insufficient and obscure , impalpable , and by the position of the victory and no clear idea of letting her slender arms hang down as if struggling with himself that it was in high spirits . “ ” So you are welcome to anything—we shall be our last

It’s better. I probably should have kept it that way. But to simplify the code, I opted for “non-overlapping N-grams”. That is, dividing the text into even-sized chunks like Well Prince, so Genoa, and Lucca, etc. Here’s the generated text from that (N=2 again):

Prince , de Hohenlohe , and now with his right hand the squadron over to the cottage he occupied with learning : that is impossible for them not to unload the carts must be a wonderful raconteur , “ said he to go up for examination ? “ ” Mon très préopinant”—”My very honorable opponent “ ) , and then more confident of victory than the winning of two or three friends—you among them—and as for the success of one army against another is the duty of Rhetor

And N=3:

, so Genoa and Lucca are now just family estates of the Buonapartes . But I warn you that if you dare to play the fool in my presence , I will teach you to behave yourself . “ But Rostóv bowed himself away from the deepest reflections to say a civil word to someone who comes in and can then return again to his own . The Emperor proposes to give all commanders of divisions the right to shoot marauders , but I cannot say so , judging by what the doctor said , it was witty and

N=3 looks great. But it actually just strung together large chunks of input text. For example, “right to shoot” only occurs once in War and Peace:

The Emperor proposes to give all commanders of divisions the right to shoot marauders, but I much fear this will oblige one half the army to shoot the other.

That’s not exactly random.

But N=2 worked alright. And after cleaning up the punctuation and capitalization, I ended up with text like:

No, I won’t promise. There was a ford one third of a regiment of Foot Guards he heard a tread of heavy boots and with belts and girdles tightened, were taking leave of her. At length appears the lieutenant general, our life is to love God. He does not want to.

If it seems disjointed, that’s because I was feeding punctuation into the chain as independent entities, so the text after every period or comma had nothing to do with the text before it.

N-grams, take 2

I was going to call the N=2 non-overlapping n-grams good enough, but then I found a paper by Valentin Kassarnig on generating congressional speeches. He was trying to generate speeches for or against particular issues, and he didn’t use a Markov chain, so it’s not really the same thing. But what stuck out to me was that he used four words to choose the next one. I suppose that’s the correct way to use n-grams.

The algorithm takes 4 words as a seed, say:

strange/JJ women/NNS lying/VBG in/IN

That seed is used to pick the next word, ponds/NNS. And the seed for the word after that is:

women/NNS lying/VBG in/IN ponds/NNS

The process continues like that: new words are added to the end and old ones are dropped from the beginning.
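
A sketch of that sliding window, where next is a hypothetical stand-in for looking up the seed in the chain and picking a weighted-random word:

func generate(seed []string, count int, next func([]string) string) []string {
	out := append([]string{}, seed...)
	for i := 0; i < count; i++ {
		// The last len(seed) words are always the seed for the next word.
		word := next(out[len(out)-len(seed):])
		out = append(out, word)
	}
	return out
}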

This was the result when I tried it with N=5:

Genoa and Lucca are now just family estates of the Buonapartes . but I warn you , if you do n’t feel strong enough to control yourself , she would reply sadly , trying to comfort her husband . among the gentry of the province Nicholas was respected but not liked . he did not concern himself with the interests of his own class , and consequently some thought him proud and others thought him stupid . the whole summer , from spring sowing to harvest , he was busy with the work on his farm . in autumn he gave himself up to

Unfortunately, it’s just copying the book again. Everything after “if you don’t” is a verbatim copy.

The problem is that 4-word sequences are frequently unique in my dataset. Kassarnig used a bag of words (that is, he ignored the order) to pick the next word (he joined them back together at the end), which should have reduced the number of unique phrases. He also had a larger dataset, and congressional speeches are repetitive anyway, so it may not have been an issue for him.
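
A bag of words can be as simple as sorting the seed before using it as a lookup key, so word order stops mattering. A hypothetical sketch:

// Sketch: sort the seed words so the lookup key ignores word order.
func bagKey(words []string) string {
	sorted := append([]string{}, words...)
	sort.Strings(sorted)
	return strings.Join(sorted, " ")
}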

But N=3 works pretty well for me:

? I would have offered you something . they ’ll cweep up to the road which gleamed white in the mist with a plaintive sound passed out of hearing . another musket missed fire but flashed in the pan . Rostóv turned his horse about and galloped off . but no , it should be , to avoid misunderstandings , that you must apologize to the colonel gave Pierre an angry look and wanted to put up with … . is everything quite all right ? said the old count felt this most . he would make that foxy old courtier feel that the childish

(And, yes, the book does include “cweep”.)

That’s the algorithm I’m stopping with. The text is still obviously auto-generated, which is a bit disappointing. But it might be fine for test data or placeholder text that just needs a realistic facade.

Markov Chains

The first time I ever heard of a Markov chain was while overhearing a conversation at work. My coworker was reading generated text from a Markov chain built from the King James Bible and commenting on how ghastly it was. Here’s a sample:

Thus saith the children of the liver of God and the tabernacle and forsake my doors of the breast and she sitteth in Seir most holy in at an homer of Pharez and ye may enjoy the needy

Hmm..

Markov chains sounded like an amusement I could live without, and I ignored them accordingly. Which was a shame, because they actually have serious applications: Google’s PageRank algorithm is a prominent example, and they’re also used for DNA sequencing, speech recognition, and a whole lot more.

Even the text generator isn’t entirely useless. My smartphone suggests the next word based on the last word entered. I don’t know how that’s built, of course, but it could be a Markov chain. Perhaps if I only sent text messages quoting the King James Bible, it would eventually suggest “children of the liver”.

Markov chains were invented by Andrey Markov, a Russian mathematician who lived in St. Petersburg toward the end of the Russian Empire. Markov was outspoken and rebellious throughout his life, which led to a feud with another mathematician, Pavel Nekrasov. Markov and Nekrasov held opposing views on everything from religion to politics, and they frequently quarrelled over all of it.

In 1902, Nekrasov published a paper in support of the Russian Orthodox Church’s stance on free will, which claimed that the law of large numbers requires independence.

In case it’s been a while since you studied statistics: the law of large numbers predicts, for example, that a coin will land on heads half the time if it’s flipped repeatedly. Since a coin flip doesn’t depend on the result of the previous flip, the flips are independent events. If I draw a card from a deck of playing cards, it has a 1 in 4 chance of being a heart. If I draw a second card (without replacing the first one), the chance of drawing a heart has changed slightly (12/51 if the first card was a heart, 13/51 if it wasn’t), so that’s a dependent event.

Markov was an atheist, and had no intention of leaving Nekrasov’s “abuse of mathematics” unchallenged. So he invented a chain of events where each event depends on the previous one, but the law of large numbers still holds, disproving Nekrasov’s claim.

Markov used this chain to study the distribution of vowels and consonants in text. It’s simple enough to do by hand. For instance, here’s the opening of (Markov’s contemporary, and fellow St. Petersburg resident) Leo Tolstoy’s War and Peace:

Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes.

I’m using c for consonant, v for vowel, w for “word breaks” (spaces and the beginning and end) and punctuation is removed:

wcvccwccvccvwcvwcvcvvwvccwcvccvwvcvwcvcwcvccwcvcvcvwvccvcvcwvcwccvwcvvcvcvccvcw

Now just count up the combinations and calculate a probability. For instance, there are 37 c’s and 21 cv’s, so 21 out of 37 consonants are followed by vowels or about 57%. Here’s all the combinations:

37 consonants:
    cv: 57% (21)
    cc: 24% (9)
    cw 19% (7)

27 vowels:
    vc: 67% (18)
    vw: 26% (7)
    vv: 7% (2)

14 spaces:
    wc: 71% (10)
    wv: 29% (4)
    ww: 0% (0)

That was 79 events, which I found tedious. Markov used a similar technique on 20,000 characters from Eugene Onegin, and subsequently analyzed 100,000 characters of a novel.

That seems like a better job for a computer. The manual algorithm translates into code more or less directly. First, the data structure:

type Chain map[interface{}]*Link

type Link struct {
	Value    interface{}
	Children []*ChildLink
}

type ChildLink struct {
	*Link
	Count int
}

This is in Go, but the language doesn’t matter much. The main point is that each Link has child Links, and each ChildLink has an associated Count of the number of times that child has been seen in relation to the parent.

Chain is just a pool of Links. The structure is recursive, and it’s important that Link values aren’t duplicated. A map of values to links is a simple way to accomplish that.

Items in a Markov chain are technically linked with a probability, not a count. But that makes it difficult to insert items, because each additional item requires recalculating the probability for its siblings. It’s simple enough to calculate the probability from Count and the sum of all the siblings’ Counts.
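
For example, a probability could be derived from the counts on demand. This method isn’t part of the code above, it’s just here to show the idea:

// Probability of moving from l to the given child, derived from counts.
func (l *Link) Probability(child *ChildLink) float64 {
	total := 0
	for _, c := range l.Children {
		total += c.Count
	}
	if total == 0 {
		return 0
	}
	return float64(child.Count) / float64(total)
}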

New Links are inserted into Chain, and a reference is added to the parent’s Children. To strengthen an existing Link, only the Count needs to be incremented.

// Increment records one more occurrence of child following parent.
func (c Chain) Increment(parent, child interface{}) {
	link := c.getOrCreateLink(parent)

	childLink := link.Find(child)
	if childLink == nil {
		childLink = &ChildLink{
			Link: c.getOrCreateLink(child),
		}
		link.Children = append(link.Children, childLink)
	}

	childLink.Count++
}

// getOrCreateLink returns the Link for value, creating it if it doesn't exist.
func (c Chain) getOrCreateLink(value interface{}) *Link {
	if _, ok := c[value]; !ok {
		c[value] = &Link{
			Value:    value,
			Children: []*ChildLink{},
		}
	}

	return c[value]
}

// Find returns the ChildLink with the given value, or nil if there isn't one.
func (l *Link) Find(value interface{}) *ChildLink {
	for _, l := range l.Children {
		if l.Value == value {
			return l
		}
	}

	return nil
}

That’s it for the Chain itself. To use it to count vowels and consonants, I’ll need to introduce another data type:

type CharClass int

const (
	Consonant CharClass = iota
	Vowel
	Space
)

var AllCharClasses = []CharClass{Consonant, Vowel, Space}

func (cc CharClass) String() string {
	switch cc {
	case Space:
		return "Space"
	case Vowel:
		return "Vowel"
	case Consonant:
		return "Consonant"
	default:
		return "Unknown"
	}
}

It’s still just 3 items: consonants, vowels, and spaces.

const Vowels = "aáàäâæeéèëêiíïîoóôöœuüúý"

// BuildChain reads text from r and builds a chain of character class transitions.
func BuildChain(r io.Reader) (Chain, error) {
	bf := bufio.NewReader(r)

	chain := make(Chain, len(AllCharClasses))

	var last CharClass = Space

	for {
		r, _, err := bf.ReadRune()
		if err != nil {
			if err == io.EOF {
				break
			}
			return nil, err
		}

		var next CharClass
		if r == ' ' {
			next = Space
		} else if unicode.IsLetter(r) {
			r = unicode.ToLower(r)
			if strings.ContainsRune(Vowels, r) {
				next = Vowel
			} else {
				next = Consonant
			}
		} else {
			continue
		}

		chain.Increment(last, next)
		last = next
	}

	return chain, nil
}

This is insufficient for anything real (e.g. I’m sure my list of vowels is incorrect), but it’s just a demonstration.

BuildChain reads one character (rune in Go’s terminology) at a time, determines its type, and feeds it into the chain under the previous character’s type.

Now to put it all together:

func main() {
	chain, err := BuildChain(os.Stdin)
	if err != nil {
		panic(err)
	}

	for _, cc := range AllCharClasses {
		link := chain[cc]
		if link == nil {
			continue
		}

		fmt.Printf("%s:\n", link.Value)

		total := float64(link.Sum())

		for _, childCC := range AllCharClasses {
			count := 0
			if child := link.Find(childCC); child != nil {
				count = child.Count
			}

			probability := float64(count) / total * 100
			fmt.Printf("%14s: %0.2f%% (%d)\n",
				childCC,
				probability,
				count)
		}

		fmt.Print("\n")
	}
}

// Sum returns the total Count of all of l's children.
func (l *Link) Sum() int {
	sum := 0
	for _, l := range l.Children {
		sum += l.Count
	}
	return sum
}

Finally, I can run it against all of War and Peace:

$ < war-and-peace.txt go run main.go 
Consonant:
     Consonant: 31.55% (493431)
         Vowel: 45.23% (707477)
         Space: 23.22% (363274)

Vowel:
     Consonant: 73.87% (698213)
         Vowel: 10.57% (99957)
         Space: 15.56% (147054)

Space:
     Consonant: 72.79% (372539)
         Vowel: 26.92% (137790)
         Space: 0.29% (1483)

Markov’s point was that a letter depends on the previous letter, which is shown above: a consonant is likely to be followed by a vowel, even though grabbing a random letter from the alphabet is more likely to produce a consonant.

The Markov chain code presented above is a simplified version of my first attempt. I refined that version a bit and built a word generator. It would read some text, build chains for letters and word lengths and then use those chains to generate nonsense words:

t hebopas shéopatow icadsanca rb l inlisee enoh obe ndw aheaprpa nce lssover
en yhetrthie soh edgoany ermewha péndhesy sh evendat hau ssh ico ngowoul

The problem was that it had to read the text each time it ran. Reading an entire book just to generate a few lines of gibberish is kinda nuts. Clearly it needed to store the chain on disk.

One approach to storing a chain on disk is basically the same as the report format above. For example, here’s the Vowel/Consonant info in JSON (without Space, just to keep it simple):

{
  "consonant": {
    "consonant": 751267,
    "vowel": 812915
  },
  "vowel": {
    "consonant": 812915,
    "vowel": 132309
  }
}

That’s enough detail to recreate the chain, and writing it out and reading it back isn’t too difficult.

The first issue with this approach is that the values are repeated. If there are lots of values, or if they’re especially large, the file size will be huge.

I wasn’t planning to store the chain in a relational database, but if I were, I’d have two tables:

values:
------------------
| ID | value     |
------------------
|  1 | consonant |
|  2 | vowel     |
------------------

value_link:
---------------------------------
| parent_id | child_id | count  |
---------------------------------
|         1 |        1 | 751267 |
|         1 |        2 | 812915 |
|         2 |        1 | 812915 |
|         2 |        2 | 132309 |
---------------------------------

If the IDs are sequential, that can be represented in JSON by parallel arrays:

{
  "values": [
    "consonant",
    "vowel"
  ],
  "links": [
    [
      { "child": 0, "count": 751267 },
      { "child": 1, "count": 812915 }
    ],
    [
      { "child": 0, "count": 812915 },
      { "child": 1, "count": 132309 }
    ]
  ]
}

Now the value is only stored once, so it can be a giant blob of text if someone wants. Since values are stored in an array, mapping an ID to a value can be done in constant time. Unfortunately, going the other way and mapping a value to an ID is slower, but that can be worked around.
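
One workaround (an assumption on my part, not necessarily what the package does) is to keep a reverse map next to the array:

// Hypothetical: values maps ID -> value, ids maps value -> ID,
// trading a little memory for constant-time lookups in both directions.
type index struct {
	values []interface{}
	ids    map[interface{}]int
}

func (ix *index) add(value interface{}) int {
	if id, ok := ix.ids[value]; ok {
		return id
	}
	if ix.ids == nil {
		ix.ids = map[interface{}]int{}
	}
	id := len(ix.values)
	ix.values = append(ix.values, value)
	ix.ids[value] = id
	return id
}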

I can re-implement the chain data structure this way too:

type Chain struct {
	values []interface{}
	links  [][]ChildLink
}

type ChildLink struct {
	ID    int
	Count int
}

(Of course, that requires a different set of functions, which I won’t go into here, but that’s essentially MemoryChain if you want to see an implementation.)

My first attempt to write the chain to disk was to take an in-memory chain and dump it into a binary file all at once. The format was the same concept as the JSON format, with an “index” area that mapped IDs to values, and a “link” area that kept relations between the IDs.

That worked, but the chain still had to exist in memory both when it was built and when it was used, which means the size is limited by available memory. That’s a shame, so I decided to make a chain that could be run entirely from disk.

I wanted the disk chain to be equivalent to the memory chain. Ideally, the code that uses it should be ignorant of how it’s being stored. For instance, there are a few different ways to walk through the chain, and it would be terrible to implement disk-based and memory-based versions of each one. Sounds like I need an interface:

type Chain interface {
	Get(id int) (interface{}, error)
	Links(id int) ([]Link, error)
	Find(value interface{}) (id int, err error)
}

type WriteChain interface {
	Add(value interface{}) (id int, err error)
	Relate(parent, child int, delta int) error
}

As long as MemoryChain and DiskChain conform to that, the same code can operate on both.
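
For example, a chain walk only needs to be written once against the interface. This is just a sketch; it assumes the Link returned by Links carries a child ID, and pick is a hypothetical helper that chooses a child at random, weighted by Count:

func walk(c Chain, start interface{}, steps int, pick func([]Link) int) ([]interface{}, error) {
	id, err := c.Find(start)
	if err != nil {
		return nil, err
	}

	out := []interface{}{start}
	for i := 0; i < steps; i++ {
		links, err := c.Links(id)
		if err != nil {
			return nil, err
		}
		if len(links) == 0 {
			break
		}
		id = pick(links)

		value, err := c.Get(id)
		if err != nil {
			return nil, err
		}
		out = append(out, value)
	}
	return out, nil
}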

Markov chains are built from a stream of input items, so it’s never clear how many unique items will be encountered until the chain has been built. I still wanted the file to contain a value index that was separate from the links. I was planning to store the index at the beginning of the file, so the challenge was storing new values in the index as they showed up, without spilling over into the links.

I knew of a few ways to approach this:

  1. When the index is full, create a new file with a bigger index and copy everything into it.
  2. Use two files.
  3. Just use SQLite.
  4. Chain several smaller index sections together.

(There must be better solutions, since databases do this kind of thing all the time, but I don’t know what they do.)

Resizing the index would be quite slow, and it potentially wastes a lot of space.

Using two files would be simple to implement, since everything could just be appended to the appropriate file. But I would still prefer a single file, since it’s easier to use.

Just using SQLite would have worked fine, but that would require C bindings, and I was trying to keep it in pure Go.

Finally, I was left with splitting up the index into multiple chunks. It had the advantage of not seeming awful. It reminded me of malloc, and how appending to a file is actually quite similar to, say, sbrk. And SSDs are increasingly common, which makes seeking fairly fast.

I decided to pretend the file was a heap, and use file offsets like memory addresses. Nothing would ever be unallocated, so new space would be allocated by appending to the file. With those primitives in place, I could implement data structures like I would elsewhere.
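
A minimal sketch of that primitive (hypothetical names; the real code does more bookkeeping):

// "Allocate" space by appending to the file, returning the offset of the
// new region. Nothing is ever freed, much like a crude sbrk.
func alloc(f *os.File, size int64) (int64, error) {
	offset, err := f.Seek(0, io.SeekEnd)
	if err != nil {
		return 0, err
	}
	_, err = f.Write(make([]byte, size))
	return offset, err
}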

The links between values would be stored in a linked list: each node would contain an ID, the number of times the value was encountered, and the offset of the next node. Each ID would be the offset of a structure containing the raw value and the offset of the head of the linked list.

It looked like this:

Item entry:
---------------------------------------------
| Value length | Value | Linked list offset |
---------------------------------------------

Link:
---------------------------------
| ID | Count | Next item offset |
---------------------------------
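
Reading a link node back might look something like this (a sketch; the field widths and byte order are assumptions, not the actual file format):

// Hypothetical on-disk link node, matching the diagram above.
type diskLink struct {
	ID    int64 // offset of the item entry this link points to
	Count int64 // number of times the transition was seen
	Next  int64 // offset of the next node, or 0 at the end of the list
}

func readLink(r io.ReaderAt, offset int64) (diskLink, error) {
	var buf [24]byte
	if _, err := r.ReadAt(buf[:], offset); err != nil {
		return diskLink{}, err
	}
	return diskLink{
		ID:    int64(binary.BigEndian.Uint64(buf[0:8])),
		Count: int64(binary.BigEndian.Uint64(buf[8:16])),
		Next:  int64(binary.BigEndian.Uint64(buf[16:24])),
	}, nil
}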

Mapping values to IDs needs to be done frequently, so quick lookups were desirable. A hashmap would work, but that would require a bunch of empty buckets which would bloat the file size. I settled on a binary search tree.

When I finally finished (and it took longer than I care to admit), I generated a chain from the words in War and Peace. The in-memory version took a few hundred milliseconds; the disk version took 13 minutes. I expected a slight slowdown, but that’s over three orders of magnitude.

I inspected it with the profiler and found that it spent all that time on Read and Write system calls. This really shouldn’t have been surprising. Every level of that binary search tree required a Seek and a Read call. Reading a value required a Seek, then a Read for the value size and another Read for the value itself. The linked list was probably the worst: it required a Seek and a Read for every node, and every node had to be read.

I was able to speed it up a little. But it was clear pretty quickly that nothing short of a redesign would help. It was just too many tiny I/O operations. Hard disks are not RAM, no matter how much I want them to be.

It makes sense that larger reads and writes would incur less overhead and therefore be faster. I never appreciated how much faster, though. Here’s a table with the results of some benchmarks. Each test wrote and then read back a 1MB file using a buffer of the given size (in bytes):

Chunk size   Time
8            460ms
32           103ms
64           50ms
1k           3.4ms
4k           0.9ms
8k           0.6ms
16k          0.4ms

My benchmarks were certainly not scientific, but they do show that it’s much faster to use larger chunk sizes than smaller ones. There must be a point where a larger chunk size doesn’t make a difference, but I should group as many I/O operations together as I can.
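
Each benchmark was roughly along these lines (a sketch, not the exact code I ran):

// Write and then read back 1MB using a fixed chunk size, timing the round trip.
func roundTrip(path string, chunk int) (time.Duration, error) {
	buf := make([]byte, chunk)
	start := time.Now()

	f, err := os.Create(path)
	if err != nil {
		return 0, err
	}
	for written := 0; written < 1<<20; written += chunk {
		if _, err := f.Write(buf); err != nil {
			return 0, err
		}
	}
	f.Close()

	f, err = os.Open(path)
	if err != nil {
		return 0, err
	}
	for {
		if _, err := f.Read(buf); err == io.EOF {
			break
		} else if err != nil {
			return 0, err
		}
	}
	f.Close()

	return time.Since(start), nil
}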

To simplify things, I dropped the on disk index. A decent tree structure in memory with hashed values would only require 32 bytes per entry, so 1 million entries is only 30MB (probably a bit more in reality). That’s small enough for the smallest cloud computing instance, or a Raspberry Pi. I think a B-tree on disk would do nicely, but that would be a large tangent on what’s already too large of a tangent. I do hope to correct this at some point (or, at least, implement the tree in memory, since it still uses a map right now).

The links between items were the biggest problem. Finding the next item in the chain requires reading every node in the linked list. Those nodes could go in an array (sort of) instead of a linked list. The only problem is that each item links to a variable number of other items, and I don’t know how many in advance. I ended up with a structure similar to the linked list, except instead of individual items it linked to variable-size “buckets” of items:

List header:
--------------------------------------------
| bucket length | item size | first bucket |
--------------------------------------------

Each bucket:
--------------------------------
| next | item 1 | item 2 | ... |
--------------------------------

This allows a bunch of items to be read at one time, and it was much faster. War and Peace only took 16s. Unfortunately, the file size was enormous (65MB for a 3.5MB input file). The number of items per bucket turns out to be hugely important. A large bucket size is fast, but wastes space with mostly empty buckets. A small bucket size will produce a smaller file, but will be slower because of the time taken to read all the extra buckets. I picked something in the middle.

The other problem was the values themselves: they still required two read calls to get the value, one for the size and one for the value itself. That can’t be avoided without fixed-length values. Lists have the same problem: the list header has to be read to tell where the bucket ends. I can’t avoid that either. But, at the very least, all of it can be combined into a single record:

---------------------------------------------------------------------
| value size | bucket length | item size | value ... | first bucket |
---------------------------------------------------------------------

That only requires two reads, one for the first three fields (which have a fixed length), and one for the value and the first list bucket.

Even after all of that, building a chain on disk was still rather slow. I had to compromise on the bucket size, which brought the time to process War and Peace to 25s. I found that it’s a lot faster to build the chain in memory and then copy it to disk. It certainly won’t work for every dataset, but when it does it’s a lot faster. This also allows the first list bucket to contain the whole list, since the correct size is known from the memory chain. After that, War and Peace took about 2s to generate on disk.

The disk chain fell a bit short in the end (it’s still too slow, and it requires an index in memory), but it seems like an alright Markov chain implementation. The code is on GitHub, if you’re interested.