-
Notifications
You must be signed in to change notification settings - Fork 0
/
sql:markov-chains.xhtml
444 lines (412 loc) · 15.6 KB
/
sql:markov-chains.xhtml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
<?xml version="1.0" encoding="utf-8" ?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>SQL(ite): markov.sql: Higher Order Markov Chains</title>
<meta name="author" content="Magnus Achim Deininger" />
<meta name="description" content="A simple and straightforward implementtion of higher order markov chains in SQL(ite)." />
<meta name="date" content="2013-06-04T17:45:00Z" />
<meta name="mtime" content="2013-06-11T19:04:00Z" />
<meta name="category" content="Software" />
<meta name="category" content="SQL" />
<meta name="unix:name" content="sql:markov-chains" />
</head>
<body>
<h1>Introduction</h1>
<p>I promised I'd expand on those other uses for <a href="/sql:sequence-generator">that SQL sequence generator</a>, so here we are: higher order markov chains! This particular piece of code I'll be introducing allows generating strings with a model you train with other strings - for example, you could train the model with first names and then tell it to generate a first name based on that.</p>
<p>Why am I doing this in SQL? Because I can! And because, actually this really is kind of straightforward. The implementation is in SQLite, as usual.</p>
<p>I've also taken the time to create a separate GIT repository for this in case anyone's interested. The repository contains a set of pre-trained models with 5000 corporation names from the US tax bureau and the 5000 most popular last names from the US Census Bureau from 1990, along with the most popular female and male first names from the same year. The provided makefile will download and train that data for you, as well.</p>
<p><em>Update: <a href="/postgresql:markov-chains.xhtml">Release 2 of markov.sql is a port to PostgreSQL; have a look at that if you prefer PostgreSQL, or if you enjoy reading about weird SQL hacks in general</a>. As part of this update I've renamed some files in the repository to make it clear which database they're supposed to be used with and I've changed the links in this article to reflect this.</em></p>
<h1>Quick Start</h1>
<p>The repository for this little project is available at: <a href="http://git.becquerel.org/jyujin/markov.git">http://git.becquerel.org/jyujin/markov.git</a> - you can either grab a tarball from there or browse the source code (only 2 files, really). Alternatively, you could also issue the following command on your shell to check out a copy of the repository:</p>
<pre><code><![CDATA[$ git clone git://git.becquerel.org/jyujin/markov.git]]></code></pre>
<p>This will create a directory called <em>markov</em> all set up and ready to play.</p>
<p>Once you have the sources, there is <a href="http://git.becquerel.org/jyujin/markov.git?a=blob_plain;f=dist/sqlite3.markov-data.sql;hb=HEAD">a pre-compiled, pre-trained SQLite database dump under dist/sqlite3.markov-data.sql in that repository</a>. To play with that, issue the following on a command line in this new <em>markov</em>-directory:</p>
<pre><code><![CDATA[$ sqlite3 markov.sqlite3 < dist/sqlite3.markov-data.sql]]></code></pre>
<p>Now for the fun part: to create names, insert the type of name you want in the <em>markovconstruct</em> table, then insert the maximum number of iterations in the <em>vautoconstruct</em> view, like this:</p>
<pre><code><![CDATA[$ sqlite3 markov.sqlite3
SQLite version 3.7.13 2012-06-11 02:05:22
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> insert into markovconstruct (id) values (3), (3), (3), (3), (3);
sqlite> insert into vautoconstruct (steps) values (40);]]></code></pre>
<p>The pre-trained IDs are: 2 for corporation names, 3 for female first names, 4 for male first names and 5 for last names. Once it's done, you should find the results in the <em>markovresult</em> table:</p>
<pre><code><![CDATA[sqlite> select * from markovresult;
3|1|alaron
3|2|roma
3|3|julia
3|4|shanie
3|5|daine]]></code></pre>
<p>The <em>markovconstruct</em> table has an optional column <em>depth</em>, which controls the order of the model that is used to create the output. The default is 3. To train new models yourself, insert the strings to train with into the <em>vtrain</em> table, like this:</p>
<pre><code><![CDATA[sqlite> insert into vtrain (id, data) values (42, 'foo'), (42, 'frob'), (42, 'barf'), (42, 'bar');]]></code></pre>
<p>That view should take care of everything. Once the <em>insert</em> operation completes the model will have been updated and you can generate strings based on it like you did with the names above.</p>
<h1>Implementation</h1>
<p>Apart from <a href="/sql:sequence-generator">the sequence generator</a>, the following SQL trains and uses the model. A <a href="http://git.becquerel.org/jyujin/markov.git?a=blob_plain;f=src/sqlite3.markov.sql;hb=HEAD">combined source file is available in the previously mentioned GIT repository</a>.</p>
<p>First we need a few tables to hold the trained model; the tables <em>markov3</em>, <em>markov2</em> and <em>markov1</em> hold this data for us - third order, second order and first order models, respectively:</p>
<pre><code><![CDATA[create table markov3
(
id integer not null,
mroid integer not null primary key,
c0 text null,
c1 text null,
c2 text null,
next text null,
cnt integer not null
);
create unique index markov3P on markov3 (id, c0, c1, c2, next);
create table markov2
(
id integer not null,
mroid integer not null primary key,
c0 text null,
c1 text null,
next text null,
cnt integer not null
);
create unique index markov2P on markov2 (id, c0, c1, next);
create table markov1
(
id integer not null,
mroid integer not null primary key,
c0 text null,
next text null,
cnt integer not null
);
create unique index markov1P on markov2 (id, c0, next);]]></code></pre>
<p>To train the models, we need somewhere to store the training data along with the parser state. Ordinarily we could omit this, but SQLite doesn't like self-recursive triggers, so we're stuck with a temporary data table.</p>
<p>To perform the actual training, all we then have to do is to examine the current state and the following symbol and increase the "cnt" in the model by one. Then the symbols are shifted through and the first symbol of the remainder of the string is used as the next symbol. We repeat that until the string is empty. NULLs are used as sentinels for start and end characters in the models.</p>
<p>Usually, in SQL, we'd use a stored procedure for this purpose - but SQLite doesn't support those. So insted we create a handful of views and create <em>instead of insert</em> triggers for them. Instead of calling these procedures, we have to insert into that view, but the net effect is the same as calling a procedure - that's a pretty handy trick in SQLite, by the way.</p>
<pre><code><![CDATA[create table trainstate
(
id integer not null,
c0 text null,
c1 text null,
c2 text null,
c3 text null,
remainder text null,
foreign key (id) references markov3(id)
);
create view vtrainstep as
select
null as step;
create view vtrain as
select
null as id,
null as data,
null as steps;
create view vautotrain as
select
null as steps;
create trigger vtrainstepInsert instead of insert on vtrainstep
for each row begin
update trainstate
set c0 = c1,
c1 = c2,
c2 = c3,
c3 = substr(remainder, 1, 1),
remainder = substr(remainder, 2);
update trainstate set c0 = null where c0 = '';
update trainstate set c1 = null where c1 = '';
update trainstate set c2 = null where c2 = '';
update trainstate set c3 = null where c3 = '';
update trainstate set remainder = null where remainder = '';
update markov3
set cnt = cnt + 1
where mroid in (select mroid
from trainstate as s, markov3 as m
where m.id is s.id and m.c0 is s.c0 and m.c1 is s.c1 and m.c2 is s.c2 and m.next is s.c3);
update markov2
set cnt = cnt + 1
where mroid in (select mroid
from trainstate as s, markov2 as m
where m.id is s.id and m.c0 is s.c1 and m.c1 is s.c2 and m.next is s.c3);
update markov1
set cnt = cnt + 1
where mroid in (select mroid
from trainstate as s, markov2 as m
where m.id is s.id and m.c0 is s.c2 and m.next is s.c3);
insert or replace into markov3
(id, c0, c1, c2, cnt, next)
select s.id, s.c0, s.c1, s.c2, 1 as cnt, s.c3 as next
from trainstate as s
left join markov3 as m on m.id is s.id and m.c0 is s.c0 and m.c1 is s.c1 and m.c2 is s.c2 and m.next is s.c3
where m.cnt is null;
insert or replace into markov2
(id, c0, c1, cnt, next)
select s.id, s.c1 as c0, s.c2 as c1, 1 as cnt, s.c3 as next
from trainstate as s
left join markov2 as m on m.id is s.id and m.c0 is s.c1 and m.c1 is s.c2 and m.next is s.c3
where m.cnt is null;
insert or replace into markov1
(id, c0, cnt, next)
select s.id, s.c2 as c0, 1 as cnt, s.c3 as next
from trainstate as s
left join markov1 as m on m.id is s.id and m.c0 is s.c2 and m.next is s.c3
where m.cnt is null;
delete from trainstate where c3 is null;
end;
create trigger vtrainInsert instead of insert on vtrain
for each row begin
insert into trainstate
(id, c0, c1, c2, remainder)
values
(new.id, null, null, null, lower(new.data));
insert into vtrainstep (step) select b from seq8 where b < coalesce(new.steps, length(new.data));
end;
create trigger vautotrainInsert instead of insert on vautotrain
for each row begin
insert into vtrainstep
(step)
select b
from seq8
where b < coalesce(new.steps, length((select remainder
from trainstate
order by length(remainder) desc
limit 1)));
end;]]></code></pre>
<p>And finally, to create strings with our shiny new model, all we do is we start with a blank state and look up the probabilities for the next character in the appropriate model. This is where that handy sequence generator comes in.</p>
<p>You may have noticed that the model doesn't contain any totals, and the probabilities aren't expressed as floating point numbers. Instead, the only bit of data in the model is the number of times that the training data had a certain character follow the current state. If this were any other language, we'd have to massage and finalise this model to create a probability (between 0 and 1) to work with, then follow the regular algorithm. But this is SQL!</p>
<p>Instead, we create a view that selects all possible characters that could come next, given the current state and the model. We then repeat each of these possible next states as often as they occured in the training session with the help of our sequence generator and the cartesian product (or <em>natural join</em> in SQL). For example, if we had "foo" as the current state, and the next character could be an "a", a "b" or a "c" with counts 1, 2 and 3, respectively, then our view would return the result rows "a", "b", "b", "c", "c", "c". From this view we pick <em>one row at random</em>, using a <em>select ... order by random() limit 1</em> clause. This has the same effect as calculating the totals, the probabilities and then choosing a number between 0 and 1 - except it's much more natural in SQL and quite a lot more straightforward. Well, as natural and straightforward as an <em>order by random()</em> select can possibly be, anyway :D.</p>
<pre><code><![CDATA[create table markovconstruct
(
id integer not null,
mvcid integer not null primary key,
c0 text null,
c1 text null,
c2 text null,
depth integer not null default 3,
data text not null default '',
foreign key (id) references markov3(id)
);
create table markovresult
(
id integer not null,
mvcid integer not null primary key,
result text not null,
foreign key (id) references markov3(id)
);
create view vmarkovprobabilities as
select
c.id as id,
c.mvcid as mvcid,
3 as depth,
m.next as next
from markovconstruct as c
left join markov3 as m on c.id is m.id and c.c0 is m.c0 and c.c1 is m.c1 and c.c2 is m.c2,
seq8 as n
where n.b < coalesce(cnt, 1)
union all
select
c.id as id,
c.mvcid as mvcid,
2 as depth,
m.next as next
from markovconstruct as c
left join markov2 as m on c.id is m.id and c.c1 is m.c0 and c.c2 is m.c1,
seq8 as n
where n.b < coalesce(cnt, 1)
union all
select
c.id as id,
c.mvcid as mvcid,
1 as depth,
m.next as next
from markovconstruct as c
left join markov1 as m on c.id is m.id and c.c2 is m.c0,
seq8 as n
where n.b < coalesce(cnt, 1)
;
create view vconstructstep as
select
null as step;
create view vautoconstruct as
select
null as steps;
create trigger vconstructstepInsert instead of insert on vconstructstep
for each row begin
update markovconstruct
set c0 = c1,
c1 = c2,
c2 = (select next
from vmarkovprobabilities as p
where p.mvcid = markovconstruct.mvcid
and p.depth = markovconstruct.depth
order by random()
limit 1)
where c2 is not null
or (c0 is null and c1 is null and c2 is null);
update markovconstruct
set data = data || coalesce(c2, '');
insert into markovresult
(id, mvcid, result)
select id, mvcid, data
from markovconstruct
where c2 is null;
delete from markovconstruct where c2 is null;
end;
create trigger vautoconstructInsert instead of insert on vautoconstruct
for each row begin
insert into vconstructstep
(step)
select b
from seq8
where b < coalesce(new.steps, 50);
end;]]></code></pre>
<h1>Sample Results</h1>
<p>This is a very primitive markov chain implementation, but it gets the job done. So I figured it'd be nice to include a few samples with this article. First, a few female first names based on the 1990 US Census Bureau data:</p>
<p><em>dani
tristine
odesire
bryn
ang
jewell
kylee
maxie
athrine
don
shari
coriannetta
hery
margarda
chella
zuletta
caretta
frana
janilseannelle
vergie
sanne
bonna
lakiaraine
antin
winny
mona
ardanna
selena
christashirleshanh
pat
kim
ren
ann
kenza
lauree
nichanne
rey
tranchel
nandra
vira
tessidy
noelleneena
cele
chae
cleonie
zenor
margenisha
sommer
shelleeshamerri
chrin</em></p>
<p>This was generated with the following one liner in the training data: <em>delete from markovresult; delete from markovconstruct; insert into markovconstruct (id) select 3 as id from seq8 where seq8.b < 50; insert into vautoconstruct (steps) values (40); select result from markovresult;</em> - if we use <em>4</em> as the id, we get male names:</p>
<p><em>conrad
jeremil
karendellenard
pedrigoberrolas
luis
bricenzo
johnnis
lyle
ross
rill
keenardo
darrolfonzo
carson
tore
doyle
vicent
wilson
kinley
joesph
santon
cecil
marcelo
tyron
wm
joseven
irwinston
brand
joan
judson
jamerlint
mckie
gail
willis
lylestefan
fletch
johnie
neil
sil
mac
den
shelby
haron
jord
fere
bur
shirley
lamario
normand
kenny
wilmerlin
</em></p>
<p>And finally, id <em>5</em> in that dataset creates last names:</p>
<p><em>clandallamy
busseno
boa
mcnaily
choward
cavez
reinriggins
mundtremblinclaross
buchen
ness
mccliff
fowley
frazier
crosby
pharterider
bean
segayler
barth
nutson
hessiter
gain
true
aquett
monds
frie
colemendrew
cosgow
wen
wald
hawkins
dard
thale
hamer
alf
dell
clangum
bland
ruck
parrower
humann
mas
varra
spearce
mire
lanill
ves
sweedlindes
marabtreauch
bron
druffmancough
</em></p>
<p>Gotta love the US Census Bureau :D. Enjoy!</p>
</body>
</html>