How to Make an Artificial Windbag

Have you ever had the pleasure of speaking with those rare individuals who can spout out reasonable-sounding technical babble, that upon closer inspection turns out to be completely meaningless?  Yes?  Ah, I see that we have met.  I hope you have found time to read my treatise on the hydromechanical properties of digital systems analysis tools?  Good, good.  I can autograph that for you if you want.  No?  Are you sure?

Academic types share a sort of superpower with car mechanics, doctors, and tax accountants.  The jargon of academia is designed to sound impressive, even when absolutely nothing is being said.  There have been lots of famous examples of supposedly revolutionary academic papers being accepted into prestigious journals, only to be revealed later as meaningless balls of jargon tied together to sound good.  For some reason, the physical science types get away with this pretty often at the expense of sociology and literary criticism journals.  Mark Twain devoted one of his great short stories to telling how he had been duped by an impressive-sounding “expert”, who was merely stringing together plausible-sounding sentences randomly.

It turns out that creating convincing-sounding English text is easy enough for a computer to do: these days, any number of websites will create random essays for you on a topic of your choice.  Some of the computer programs behind these artificial text-generators are quite complicated.  They include algorithms that take into account grammar and how often different phrases pop up in speech; they are basically little linguistics experts.  I decided to see what could be done with a very simple approach: how well could a program do if it only knows how often particular letters crop up next to each other?

For example, think of the poor eccentric letter “Q”, who almost never appears in public without his companion “U”.  The letter “U” gets around fine — you see him everywhere.  But when “Q” pops up, you can bet the deed to the house that “U” will be next.  This is an extreme example, but the same thing happens to a lesser extent with other letters — think how often you see “T” followed by “H”, thanks to words like “THE” (or “THANKS”, for that matter).  And how often do you see “J” followed by another “J”?

So I wrote a simple little computer program that “invents” words, simply by using how often letters appear next to other letters in English text.  Let’s say it has a letter “G”, and is trying to decide what letter to put down next.  It looks up which letters are found after “G” in English and chooses one at random, weighting the choice more heavily toward the letters that follow “G” more often.  So for the letter “Q”, almost all of the time it will drop a “U”.  And it just marches along, spouting out letters (and punctuation and spaces and line returns), printing out fake English as long as you want.
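Here is a minimal sketch of that weighted choice in Python.  It is not the original program, just an illustration: the FOLLOWERS table, its made-up counts, and the name pick_next are all assumptions for the example.

```python
import random

# Hypothetical follower counts: how many times each letter was seen
# right after the key letter.  The numbers are invented for illustration.
FOLLOWERS = {
    "q": {"u": 99, "a": 1},
    "g": {"e": 30, "h": 25, " ": 20, "o": 15, "r": 10},
}

def pick_next(letter, rng=random):
    """Choose the next letter at random, weighted by how often it follows `letter`."""
    options = FOLLOWERS[letter]
    letters = list(options)
    weights = list(options.values())
    # random.choices draws in proportion to the weights, so common
    # followers come up more often than rare ones.
    return rng.choices(letters, weights=weights, k=1)[0]

rng = random.Random(0)
sample = [pick_next("q", rng) for _ in range(1000)]
print(sample.count("u"), "of 1000 picks after 'q' were 'u'")
```

With counts like these, nearly every pick after “q” comes out as “u”, while the picks after “g” are spread across several letters.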

And how does it know which letters are most common?  Well, that’s easy: just grab a pen and paper and start counting.  Pick up any English-language text you want, march through it one letter at a time, and count how many times you see each pair of letters next to each other.  This is called the “training text”, and it is essentially how the little computer program teaches itself English.  The seriously nerdy among you will understand when I say that you estimate the probability distribution for pairs of letters appearing next to each other by counting occurrences in real data.  Got that?  Good, because the overwhelmingly nerdy will wonder why we stop at looking back one letter.  Well, you don’t have to.  You could, for example, look up how often the letter “H” follows the four letters “BEAC”, and repeat the process for every combination of four letters.  The overwhelming majority never appear in English (for example “FBRA”), so only a handful get any likelihood of being chosen at all.
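The pen-and-paper counting can be sketched in a few lines of Python.  This is an illustration under my own assumptions: the function names count_pairs and probabilities are invented, and the one-sentence training text stands in for a whole book.

```python
from collections import Counter, defaultdict

def count_pairs(text):
    """For each letter, count which letters come right after it."""
    table = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        table[current][following] += 1
    return table

def probabilities(counter):
    """Turn raw counts into estimated probabilities that sum to 1."""
    total = sum(counter.values())
    return {letter: n / total for letter, n in counter.items()}

# A tiny stand-in for the training text (an assumption for this example).
training_text = "the quick queen thanked the quiet thief"
table = count_pairs(training_text)
print(table["q"])                  # every 'q' here is followed by 'u'
print(probabilities(table["t"]))   # mostly 'h', thanks to "the" and friends
```

Dividing each count by the total for that letter is the “estimate the probability distribution by counting” step; with a longer training text, the estimates settle toward the letter habits of the author.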

I like to think of this as the program looking in the “neighborhood” of the current letter, to decide what to put next.  If we look backward a neighborhood of four letters, then we choose a new letter that is likely to be found right after those four (in the correct order) in real English text.  The effect is that as you make the neighborhood bigger, you constrain the fake words more and more to real English.  The letter “X” will never be randomly chosen after the four letters “BEAC”, since that combination never shows up in the training text.  But if the program looks back and sees the letters “STAR”, it has several choices: it could put down an “E”, a “T”, or even a space, since any of those pop up here and there in real English.
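To make the “neighborhood” idea concrete, here is a rough Python sketch that counts what follows every run of four letters.  The name build_model and the toy sentence are assumptions for the example; a real run would use an entire book.

```python
from collections import Counter, defaultdict

def build_model(text, order):
    """Map every run of `order` letters to a count of what followed it."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]
        model[context][text[i + order]] += 1
    return model

# Toy training text, invented for illustration.
toy_text = "the stars stare at the start of a starry night "
model = build_model(toy_text, order=4)

# Letters seen right after "star" in the toy text.
print(sorted(model["star"]))

# "beac" never occurs in the toy text, so it has no followers at all.
print(model.get("beac"))
```

Setting order=1 gives back the simple one-letter version; raising it narrows the choices after each context, which is why the output drifts closer to real words.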

The cool thing about this is that you don’t need to create grandiose algorithms that embody the latest discoveries in linguistics.  Just a couple lines of computer code will look at your training text to pick off the likely letters, and another couple lines will use the list of common letters to generate new text.  All you need to do is feed it some nice, wholesome training text, and let it do the magic.  A relatively dumb little computer program does surprisingly well at imitating English.  The output is complete gibberish, of course, but you see a lot of real English words and quite a lot more that are “English-ish”.
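For the curious, the whole thing might look something like the Python sketch below: one function to pick off the likely letters, one to generate text from them.  This is a reconstruction under my own assumptions, not the original program; the names train and generate are invented, and the short training string stands in for a full book.

```python
import random
from collections import Counter, defaultdict

def train(text, order):
    """Count which letter follows each run of `order` letters."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        model[text[i:i + order]][text[i + order]] += 1
    return model

def generate(model, order, length, rng=random):
    """Produce `length` characters of fake text from a trained model."""
    # Start from a context that really occurs in the training text.
    # (Assumes the training text was longer than `order` letters.)
    out = rng.choice(sorted(model))
    while len(out) < length:
        followers = model.get(out[-order:])
        if not followers:
            # Dead end: this context only appeared at the very end of
            # the training text, so restart from a random known context.
            out += rng.choice(sorted(model))
            continue
        letters = list(followers)
        weights = list(followers.values())
        out += rng.choices(letters, weights=weights, k=1)[0]
    return out[:length]

# Stand-in training text; a real run would load an entire book instead.
training_text = "the quick brown fox jumps over the lazy dog and the quiet cat " * 3
model = train(training_text, order=3)
print(generate(model, order=3, length=120))
```

Spaces, punctuation, and line breaks are treated as ordinary characters here, so the model decides where they go the same way it decides everything else.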

For the training text, I decided to use Mark Twain’s hilarious account of a long tour of Europe and the Holy Land, “The Innocents Abroad”.  Mark Twain truly is the Moses of smartasses everywhere.  In this actual passage from the book, Mark and company torment a poor guide trying to show them a document written by Christopher Columbus:

“Come wis me, genteelmen! — come! I show you ze letter writing by Christopher Colombo! — write it himself! — write it wis his own hand! — come!”

He took us to the municipal palace. After much impressive fumbling of keys and opening of locks, the stained and aged document was spread before us. The guide’s eyes sparkled. He danced about us and tapped the parchment with his finger:

“What I tell you, genteelmen! Is it not so? See! handwriting Christopher Colombo! — write it himself!”

We looked indifferent — unconcerned. The doctor examined the document very deliberately, during a painful pause. — Then he said, without any show of interest:

“Ah — Ferguson — what — what did you say was the name of the party who wrote this?”

“Christopher Colombo! ze great Christopher Colombo!”

Another deliberate examination.

“Ah — did he write it himself; or — or how?”

“He write it himself! — Christopher Colombo! He’s own hand-writing, write by himself!”

Then the doctor laid the document down and said:

“Why, I have seen boys in America only fourteen years old that could write better than that.”

To think, when I was a smartass in history class, I was actually holding up a rich tradition of American humorist culture.  Or something.  We’ll let the little program chew on the entire book, so the resulting fake text should resemble (hopefully) Twain’s prose style.

Let’s see what happens for a “neighborhood” of just one letter — the program only looks back one single letter to decide what letter to put down next.  We get the following impressive speech:

ombo  theanusssid verad ta outinys, o wiline. t  wey whon.  trif  be tpoal, ang ory be
shaner t s the cof hathicrt wlo utre toullit eys f
w id, s an, rmpuin at hes  mid mbe b
anonth. at, iste
be t
ce therestil oous ad we abm t colyollare he g ativad s. ongit  visthean ghere, t ffouckntous  yo wand d
and thend tyexhe on whe g t ay ald t ttrigherthatofr mstu ipove t acimatene anspe   iroleab t ymaraicof bopondored gesquthe ha, st, theduralid a
s  my l it
flion as  ichise.  hin amous. tren ire,  owal t, tsande st
petinch ps thect h thit
in
hemberr tgo on
llat of mbe ag m ways  lersthere ta scher w the icabut do thecabure dnghe l villf id pey hed bl necoty
se goud aco   thrsud the ous. an,
athelde thenge cima tl thimayf s swed worgond ati  byatye me psere andaces ore ap
pove wit  s iengouts, fehe a, in orth, oora antreyont an bey ay athofll ply mitreld weain w
s g d sulofthen brorofowecthelor dro ott str,  oritovede fuglatou at
chasmake icowan

Amazing, isn’t it?  Completely indistinguishable from actual Mark Twain prose.  Aside from the unfortunate-sounding “ffouckntous”, that is — Twain never worked blue.  But seriously — the program clearly is doing pretty poorly at coming up with realistic-sounding speech.  Virtually all of it is complete nonsense, hardly a real word in the bunch.  To its credit, it’s getting a few things right — the “words” are pretty much the right length, so the spaces are more or less in the right place.  But in general, this won’t fool anyone who isn’t seriously drunk.

It’s starting to look a little better when we increase the “neighborhood” to three letters, so the program looks back at the previous three letters:

he andsome as sun from was it is prom are up saidst the schile storican were bill the can execuries adown imply the was did them. we can her such in mothe were was, anch, so
could he can when, and see to went. and which gived the fire moves of thed seemly hador.
to seen from with way and be, afted us summita crose they squiet are gars a drivater blampeian not suprinkinnade a most
care iden
yearly news to expecut in the brainstent pure nevery  ove see tearial estlet the emplims frometir alled yet that commist,
glad officent parience towers a boy, pries it is a rought, lyings and was wall gond ave old of ment a part the pare greatrible hold by calced the
sixty of theles was fren, and bloom. the casterrants ratican
oristombs solumbold,
newspaid thesty hole  a volvere gods one offic ves you was fasciends age better with alway the fathe shoe, we had was , for feets for church betty manned the hoperfor the here numble wont its ther feet alous coup theave res

This is still not exactly English, but it’s not bad.  You actually do see English words here and there (“which gived the fire moves”…), and it’s starting to sound more realistic.  When you read it aloud, it sounds roughly English-like, even where the words aren’t real.  Imagine speaking in fake Italian, using the right accent and rhythm but without any real Italian words.  The computer program is doing the equivalent, for English.

Finally, let’s try a distance of four.  This requires about as much computer time as I’m willing to stand (which means I put the laptop down for about one “Good Eats” episode [1]).

among patient do with taken callium the satisfied by shot. the project is great suite desolately may before read see a blesome
time hour ferocious pilgrimagicians color of napolished upon to eacherinthian labor, and whole lost came not knobbing
poets and with but he honored upon the do then in thy
mendom to see a time three them now. we tops of glassenger, and before water
once commit was glandsomebody spyglassembled and throat you will week. i couples, and all columns.
them was
did which it, they do it. hers so amply abbot only have who was. ever staried his this
israel wall, they drank. some sight from that a day follows of thought still battemples, but i have obsolutional, as now. one will conce weight stablatured. when stretching, st. petermination, he had out all over though, the
grotto othere and
weak of genuity of esau
for all they short, under, their
are cabin, tradiate
i did not. if he dwell feel about beforeign scene
of the we rene, and thr

Still not quite up to par with Mark Twain, but probably good enough to write for “People”.  Most of the words are honest-to-goodness real English, and many of the nonsense words are now starting to resemble English words: check out “spyglassembled”, or “battemples”.  It looks like the program started on one word, then switched midway and finished with another.  It’s even got traces of grammar in there:  “them was did which it, they do it”.  How true.

Again, remember the program is deciding everything, even when to put in spaces, punctuation, and when to hit “return”.  And it doesn’t know anything about English grammar, or how to make a nonsense word sound plausible for English and not, say, French.  It has just kept track of how often each letter follows the letters before it, looking back up to 4 letters.  I think it’s pretty remarkable that it has done so well with an algorithm you could practically get up and running on a calculator.  Now imagine what could be done when you inject a little more intelligence into it, like rules of grammar, or some concept of how phrases fit together.  Why, you could probably write a passable essay using it.  Maybe even an educational article or two…

--------------------------------------------------------------------------------
Footnotes:

1.   The episode about pickling your own vegetables.  I really like Alton’s shows, but when the hell am I ever going to pickle my own carrots?

© 2011 TimeBlimp