I was just wondering: is it fairly trivial to generate grammatically correct sentences using a computer program (but not necessarily physically correct, i.e. "the mat sat on the cat" would be grammatically correct)? And more importantly, is it possible to consistently generate grammatically INCORRECT sentences?
What I was thinking would be to have several databases of nouns, verbs, conjunctions, adjectives, adverbs etc. (just exactly what would I need?) and then have another database with details about how to put these words together (i.e. adjective, noun, verb etc…).
That seems the easy part. Now, is there a way for a computer program to generate consistently INCORRECT sentences? Maybe not 100% of the time, but at least 90% of the time, keeping in mind that some nouns are also verbs and things like that.
Also, could somebody point me to somewhere I can get some info on generating basic sentences (i.e. what goes where and what everything is called) and, if you can, a list of common English words (about 100 - 1000), preferably with noun/verb etc. attached to them.
Grammatically incorrect? Easy: try verb verb verb noun noun noun adjective adjective adjective. Works 100% of the time.
As for getting 100% grammatically correct, that's difficult except in trivial cases. Most computer-written articles can be told apart from a human author's. One problem, as you mention, is words that can have many meanings, such as "rose". A computer could be simply programmed to generate simple sentences such as "the cat sat on the hat", but it has great difficulty "understanding" grammar to the point where it could write interesting new articles.
Okay, here's the real purpose: the computer generates a sentence and then asks the "person" (who may be either a computer or a human) whether this sentence is grammatically correct. The whole point is to differentiate the computer from the human. Thus, I need some way of generating incorrect sentences as well as correct ones.
I also need a fairly plausible method of generating incorrect sentences that would not be cracked easily (e.g. two vowels in a row would be a dead giveaway).
I don’t need a program that can “write” English, just something recognisable as a sentence.
BTW, random fact: “Buffalo buffalo, Buffalo buffalo buffalo, buffalo Buffalo buffalo” is a grammatically correct sentence.
The point you are illustrating above is the difference between syntax and semantics. “The mat sat on the cat” is syntactically correct but semantically incorrect.
Computers tend to be better at syntax than semantics, simply because syntax rules are fewer and tend to be more general - apply to groups of words - whereas semantics pretty much need to be defined for each word, and sometimes for combinations of words too.
For example, going back to the sentence above, you can’t make a semantic rule that (verb) “sat” cannot apply to (subject noun) “mat” because “the mat sat on the floor” is semantically correct. Nor can you say that (verb) “sat” cannot apply to (object noun) “cat” because “the flea sat on the cat” is also semantically correct.
Try looking up “bullshit generator” on Google (no, I’m not being funny!) for some amusing examples of how easy it is to generate syntactically correct phrases.
You also might want to do a search on context-free grammars, which is probably what you’ll have to resort to in order to make the program at all easy to write. English isn’t really a context-free language, but you should be able to capture a large subset of it through a CFG.
From what you are saying (it looks like an educational program), the best way is for you to sit down, write 10,000 correct and incorrect sentences into a simple program, and do a random selection. Believe me, it will end up being 100 times easier than what you are proposing.
It is relatively easy to automatically generate syntactically correct sentences. Generating all of them is a different matter (almost no grammar I have seen will generate “The more, the merrier.” since it has neither a noun nor a verb). It is even relatively easy to generate sentences with embedded clauses and the like. This is usually done using rewrite rules of the following sort:
S --> NP VP
NP --> Det N
Det --> the|a(n)|my|your|his|her|its|our|their
N --> hat|mat|John|Mary|…
VP --> Vi|Vt NP|Vi PP|Vt NP PP|Vg NP NP|Vg NP to NP|…
S --> S Conj S
Conj --> and|or|but
and so on. Most of these are obvious, but I should say that Vg is the class of “give” type verbs. I have seen one classification of verbs into 52 separate categories depending on their argument structures. (E.g. “choose” takes two objects, but is not in the Vg class since you cannot say, “choose x y”, but need a word like “as” or a phrase like “to be”.)
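A minimal sketch of how those rewrite rules might be implemented (the word lists and the Python representation are my own illustrative choices, not from this thread; I have also left out the recursive S --> S Conj S rule so the output stays short):

```python
import random

# Toy grammar in the style of the rewrite rules above.
# Vi = intransitive verb, Vt = transitive verb, Vg = "give"-type verb.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["Vi"], ["Vt", "NP"], ["Vg", "NP", "NP"]],
    "Det": [["the"], ["a"], ["my"], ["your"]],
    "N":   [["hat"], ["mat"], ["cat"], ["dog"]],
    "Vi":  [["slept"], ["ran"]],
    "Vt":  [["saw"], ["chose"]],
    "Vg":  [["gave"], ["sent"]],
}

def expand(symbol):
    """Recursively rewrite a symbol until only terminal words remain."""
    if symbol not in GRAMMAR:                 # a word, not a category
        return [symbol]
    production = random.choice(GRAMMAR[symbol])
    words = []
    for part in production:
        words.extend(expand(part))
    return words

print(" ".join(expand("S")))   # e.g. "my dog saw the hat"
```

Every run produces a syntactically well-formed (if often semantically silly) sentence, which is exactly the syntax/semantics split discussed earlier in the thread.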
A much harder question is whether natural languages are recursive, meaning there is an algorithm that is guaranteed to tell you in a finite number of steps whether or not a given string of words is grammatical. For example, I admit to being buffaloed by the sentence in one of the earlier posts. The best evidence is that language is probably recursive, but we do not implement such a decision procedure internally. My best guess is that when faced with an unfamiliar sentence we see if we can generate it according to our rules; if we can, we consider it grammatical, otherwise not. For example, most people will claim that the perfectly grammatical sentence “The horse raced past the barn fell.” (no commas) is not grammatical, or else try to assign it some obscure meaning.
Others have pointed out the difference between syntactic and semantic correctness. If all you want is fairly reliable syntactic correctness, it’s not that difficult. It really depends on how accurate you want it to be.
One quick-and-dirty method is to use an n-gram model. This assumes that the likelihood of a word occurring is based on the preceding n-1 words. For example, a bigram model assumes that the probability of a word depends just on the previous word, while a trigram model assumes it depends on the previous two words. Obviously this is a simplification of how language actually works.
However, it can be useful (and fun!) to try this: Using a large body of text from an author, generate an n-gram model. (Pick whatever n suits you. 3 or 4 usually works well.) This is done by simply parsing through the text and counting each time an n-gram of the chosen size occurs. In the end, you’ll have a set of numbers assigned to each set of words. From these you can create probabilities. Using those probabilities, you can randomly generate sentences that sometimes look remarkably similar in style to those of the author you created the model on.
The benefit for your purposes would be that by choosing low probability words instead of high probability ones, you could fairly reliably produce grammatically incorrect sentences.
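As a rough sketch of the counting-and-generating procedure described above (the tiny corpus here is my own stand-in; a real model needs a large body of text), a bigram (n = 2) version might look like:

```python
import random
from collections import defaultdict

# Tiny stand-in corpus; in practice you would use a large text by one author.
corpus = ("the cat sat on the mat . the cat ate the food . "
          "the dog sat on the mat .").split()

# Count which words follow each word (storing repeats makes random.choice
# pick each follower with probability proportional to its frequency).
follows = defaultdict(list)
for prev, word in zip(corpus, corpus[1:]):
    follows[prev].append(word)

def generate(start="the", length=6):
    """Random walk through the bigram table."""
    words = [start]
    for _ in range(length - 1):
        words.append(random.choice(follows[words[-1]]))
    return " ".join(words)

print(generate())
```

With a larger corpus and a larger n, the output starts to resemble the source author's style, and deliberately picking low-probability continuations gives you the "reliably wrong" sentences the original poster wants.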
Another way to generate sentences is to use an artificial grammar like Hari Seldon showed. It won’t be absolutely perfect, but it might suit your needs well. You’d need a separate way of generating non-grammatical sentences, though.
(BTW Seldon, what are “Vi”, “Vt”, etc.? I know NP, VP, S, N, Det, and PP. Verbs of some type, of course, but my mind isn’t working today. )
Context-free grammar is definitely the way to go. My favorite CFG-based computerized text generator is the complaint generator. IMO, it writes better complaints than roughly two-thirds of the letters featured in ZDNet’s “talkback” sections.
Mr. Pakin was even nice enough to send me a (presumably stock) letter explaining how it worked when I sent him my glowing praise.
Did you know that old hillbillies used to call the grain silo a “barn fell”?
On a more serious note:
If you want to make this program yourself and you have no previous programming experience, you should check out the language LOGO. It is such an easy language that I learned it in kindergarten, and yet advanced enough to do what you want. There are many free versions of Logo available, including MSWLogo, Berkeley Logo, and Elicia (which I don’t fully understand).
All the versions of LOGO differ in major ways, and you should check them all out before choosing one. I currently use a ten-year-old DOS version of the LCSI program LogoWriter; even though it has no multimedia support, limited program size (due to the 640K DOS limit), and no mouse, I have been able to do quite a lot with it.
A program to generate a grammatically correct sentence (written for LOGOWRITER 2.01) might look like the following:
to generate
name [train hit grab eat copy destroy find confirm] "verbs
name [bat tree food mouse house fork fell ] "nouns
name [I You He She It They] "pronouns
name pick :verbs "MyVerb
name pick :nouns "MyNoun
name pick :pronouns "MyPronoun
print (se :MyPronoun "will :MyVerb "the :MyNoun ". )
end
to pick :list
output item (random (count :list)) + 1 :list
end
This ends up with a lot of sentences that sound like this:
She will eat the mouse.
I will hit the bat.
They will eat the food.
She will copy the fell.
It works, and it only took me about five minutes to make (most of that spent thinking of interesting nouns).
The only reason “The more, the merrier.” is called a sentence is that people consider it one; it is the linguist’s job to explain why. (Just for the record, I am not a linguist, at least not professionally.) Yes, Vi and Vt are intransitive verb and transitive verb. Finally, compare the perfectly obvious sentence “The horse ridden past the barn fell.” The difference is that “ridden” cannot be (mis)interpreted as a past tense, only as a participle. And you can’t put in commas, since that changes the meaning from a restrictive to a descriptive adjective.
I really didn’t think it would be all that complicated. What I envisioned was a database of nouns/verbs etc. and then another database of ways to arrange these words together to form a sentence. I don’t want to include tenses, ambiguity or any “complex” stuff in the sentences, just VERY simple, easy-to-recognise-as-correct sentences.
ie. just say my DB is
Det a|the
N cat|mat
V sat|stood
something on|beside|inside
then I could say that a possible sentence structure would be [Det N V something Det N]
With this, I could generate
the cat sat on the mat,
the mat sat on the cat,
a mat sat on the cat,
a cat stood inside the mat,
the mat stood beside the cat etc.
Not all of these are semantically possible, yet they are all easily recognised as syntactically possible.
Now, would it be sufficient for generating incorrect sentences to just get a random arrangement?
i.e. using a random number generator, I got [something V something Det V N]
which would be something like
on sat beside the sat mat.
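A sketch of that idea in Python (built from the mini-DB above; "P" is my own label for the poster's "something" category). Note that a purely random shuffle can occasionally land on a grammatical order by accident, which is one reason ~90% reliability is a more realistic target than 100%:

```python
import random

# The mini-database from the post above.
DB = {
    "Det": ["a", "the"],
    "N":   ["cat", "mat"],
    "V":   ["sat", "stood"],
    "P":   ["on", "beside", "inside"],   # the "something" category
}
CORRECT = ["Det", "N", "V", "P", "Det", "N"]

def sentence(pattern):
    """Fill a category pattern with random words from the database."""
    return " ".join(random.choice(DB[cat]) for cat in pattern)

def incorrect_sentence():
    """Shuffle the category pattern until it differs from the correct one."""
    pattern = CORRECT[:]
    while pattern == CORRECT:
        random.shuffle(pattern)
    return sentence(pattern)

print(sentence(CORRECT))     # e.g. "the mat sat on the cat"
print(incorrect_sentence())  # e.g. "on the sat mat cat the"
```

The same `sentence` function serves both cases; only the pattern changes, which keeps the correct and incorrect sentences statistically indistinguishable at the word level (no "two vowels in a row" giveaways).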
What I am planning now is to give the “person” 5 sentences and ask them to choose the most correct one, so it doesn’t matter if something is “strictly correct” (e.g. the horse/barn example) as long as there is a better choice.
Well, I made a quick-and-dirty prototype using fewer than 9 different words for each category and 7 different “rules” (and, using “and”, another 49 different rules), and here is what I get:
Right:
"Jack heard Carol using his dog "
"Richard cooked a mat "
"her flower ate with their dog "
Wrong:
"smooth his to the left of black ate fat quickly and "
"ate black her ate ran heard stupidly "
"Mary kicked crankily except black "
Conjoined:
"a door ate on top of a flower except Alice cooked Mary with her dog "
"a door ate on top of a flower except their cat stood beside her dog "
"a door ate on top of a flower except their quick hat ran behind their corn "
"a door ate on top of a flower except their quick hat ran behind their black flower "
"a big mat slept with his fat cup or a door ate on top of a flower "
"Mary kicked her dog except Jack looked at their cat "
"Mary kicked Steve but a white book rolled on a door "
I think this is going pretty well so far; any thoughts on improvements?
This reminds me of the children’s game of “Consequences”, where a paper is filled out by different people, one line each, folded in such a way that no one sees what the others wrote:
Mr…
met Miss…
at…
and she said to him…
the consequence was…
and the world said…
which would yield things like: Mr. Bush met Miss Meir at the lavatory and she said to him “Don’t do that to me”. The consequence was a nasty smell and the world said “Serves him right”.
I was easily amused as a ten-year-old… I am almost as easily amused today.