Michael A. Covington, Ph.D.

Daily Notebook

Links to selected items on this page:
Futrell and Mahowald, linguistics and LLMs

This web site is protected by copyright law. Reusing pictures or text requires permission from the author.
For more topics, scroll down, press Ctrl-F to search the page, or check previous months.
For the latest edition of this page at any time, create a link to "www.covingtoninnovations.com/michael/blog"
I am a human author. Nothing on this web site is AI-generated unless specifically marked as such.

2026
April
4

Do LLMs show that linguistics is bunk?
A seminal paper by Futrell and Mahowald

This will be one of the longest and most academic Daily Notebook entries I've ever written — hang on to your hats. Two researchers have synthesized the work of many others to answer a question with implications about the significance of everything I studied in graduate school and for many years afterward. And the result is good news.

The paper I'm talking about is Futrell and Mahowald, How Linguistics Learned to Stop Worrying and Love the Language Models. I've linked to an arXiv preprint, but it's soon going to appear in a major journal. Here I want to summarize it so that non-linguists will understand at least part of it, and linguists who have been away from this subfield will understand more.

The question is whether the branch of theoretical linguistics founded by Noam Chomsky, studying sentence structure, is rendered obsolete or refuted by large language models (LLMs, chatbots). After all, we are getting high-quality natural language processing without having to feed the computers any kind of analysis, Chomskyan or not — we only give the computers large (huge) (titanically huge) samples of the language.

First, some cautions:

  • Chomsky's generative grammar and transformations have nothing to do with today's generative AI and transformers — same words, entirely different meanings.
  • None of this has anything to do with Chomsky's radical politics.
  • For brevity, I'm not giving hyperlinks to linguistic literature that I'm referring to. Everything I mention should be easy to find with a search engine. Most of it would require a library's access to academic journals.

Now then. Let's go back to 1955 or so. Scientific linguistics at the time had a good grip on phonetics, phonology, and word formation, but sentence structure seemed elusive; most descriptions of languages said relatively little about it. You have a fixed vocabulary of words and a fixed set of word forms, but a seemingly unlimited supply of sentences; certainly, you normally produce and understand sentences you've never heard before. How to proceed?

The one thing that was clear about sentence structure is that it is tree-like, like this:

Picture

Whenever elements of a sentence are moved around, substituted, or questioned, they follow the grouping shown by the tree. For example, based on this sentence, you can easily construct questions whose answer is "the cat" or "my favorite pet" but not "is my" or "cat is." (See Wells, "Immediate Constituents," 1947, and Chapter 4 of my own NLP textbook.)
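
The constituency test just described can be sketched in a few lines of Python. This is only an illustration: the sentence ("the cat is my favorite pet") is inferred from the examples in the text, and the category labels D and Adj are assumptions, not necessarily what the picture shows. A word group counts as a constituent only if it is exactly the words under some node of the tree.

```python
# A tree is either a word (str) or a tuple: (label, child, child, ...).
# The sentence and the labels D and Adj are assumptions for illustration.
TREE = (
    "S",
    ("NP", ("D", "the"), ("N", "cat")),
    ("VP",
        ("Copula", "is"),
        ("NP", ("D", "my"), ("Adj", "favorite"), ("N", "pet"))),
)

def words(tree):
    """Return the words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    result = []
    for child in tree[1:]:
        result.extend(words(child))
    return result

def constituents(tree):
    """Return every word string that is exactly the yield of some node."""
    if isinstance(tree, str):
        return set()
    found = {" ".join(words(tree))}
    for child in tree[1:]:
        found |= constituents(child)
    return found

def is_constituent(phrase, tree=TREE):
    """True if the phrase corresponds to a single node of the tree."""
    return phrase in constituents(tree)

for phrase in ["the cat", "my favorite pet", "is my", "cat is"]:
    print(phrase, "->", is_constituent(phrase))
```

The first two phrases come out as constituents and the last two do not, which matches which questions you can and cannot form.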

Enter Noam Chomsky, who proposed a way to model this precisely. The sentence structure of a language is described by a set of phrase-structure rules that say how elements go together, such as:

S → NP VP
VP → Copula NP

N → cat

A sentence is grammatical ("is generated by the rules") if everything in the tree is permitted by a phrase-structure rule.
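
Here is a minimal sketch of that definition in Python. Only the three rules shown above come from the text; the remaining rules, the example sentence, and the labels D and Adj are assumptions added to make the example complete. A tree is accepted only if every node and its children match some rule.

```python
# Each rule is (left-hand side, tuple of right-hand-side symbols).
RULES = {
    ("S", ("NP", "VP")),        # from the text
    ("VP", ("Copula", "NP")),   # from the text
    ("N", ("cat",)),            # from the text
    # Everything below is assumed, to complete the example:
    ("NP", ("D", "N")),
    ("NP", ("D", "Adj", "N")),
    ("D", ("the",)),
    ("D", ("my",)),
    ("Adj", ("favorite",)),
    ("N", ("pet",)),
    ("Copula", ("is",)),
}

def label(node):
    """A word is its own label; a subtree's label is its first element."""
    return node if isinstance(node, str) else node[0]

def is_generated(tree, rules=RULES):
    """True if every node in the tree is permitted by some rule."""
    if isinstance(tree, str):
        return True
    rhs = tuple(label(child) for child in tree[1:])
    if (tree[0], rhs) not in rules:
        return False
    return all(is_generated(child, rules) for child in tree[1:])

SUBJECT = ("NP", ("D", "the"), ("N", "cat"))
PREDICATE = ("VP", ("Copula", "is"),
             ("NP", ("D", "my"), ("Adj", "favorite"), ("N", "pet")))

GOOD = ("S", SUBJECT, PREDICATE)   # the cat is my favorite pet
BAD = ("S", PREDICATE, SUBJECT)    # is my favorite pet the cat

print(is_generated(GOOD), is_generated(BAD))
```

The second tree is rejected because no rule says S → VP NP. Note that this only checks a tree that is already given; finding the tree for a string of words is the separate problem of parsing.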

That's known as a context-free phrase-structure grammar and is mathematically well understood. But Chomsky noticed something else. That type of grammar is almost but not quite adequate for human language. There are things it doesn't handle, such as formation of English questions. A question like What did you see him climbing on yesterday? has what at the beginning and a missing noun phrase somewhere else in the sentence (in this case after on). Phrase-structure rules can't express how that works.

Chomsky's answer was to add a second type of grammar rules, transformations, that allow rearranging the output of the phrase-structure rules in specific ways. The result was called transformational-generative grammar.
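
To give the flavor of a transformation, here is a deliberately crude Python sketch. Real transformations are defined over trees, not flat word lists, and the "underlying" word order used here is an assumption made for illustration. The two functions are hypothetical stand-ins for wh-fronting and subject-auxiliary inversion.

```python
GAP = "_"  # marks the position the wh-word was moved from

def invert_subject_aux(words, subject="you", aux="did"):
    """Swap the first adjacent subject + auxiliary pair."""
    out = list(words)
    for i in range(len(out) - 1):
        if out[i] == subject and out[i + 1] == aux:
            out[i], out[i + 1] = aux, subject
            break
    return out

def front_wh(words, wh="what"):
    """Move the wh-word to the front, leaving a gap where it was."""
    if wh not in words:
        return list(words)
    i = words.index(wh)
    return [wh] + words[:i] + [GAP] + words[i + 1:]

# Assumed underlying order, with "what" where the noun phrase would be.
underlying = "you did see him climbing on what yesterday".split()
question = front_wh(invert_subject_aux(underlying))
print(" ".join(question))
```

The output has what at the front and the gap after on, which is the dependency that phrase-structure rules alone do not express.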

And why was this interesting? Because if we can figure out what the grammar of human languages is made of, we'll know what the human brain processes when we speak and understand speech. We'll also know what the human brain learns when children learn to talk. It is clear that the human brain has some built-in language capabilities; this is a way to discover something about how they work.

One cautionary note. Many linguists at the time (1950s to 1980s) mistakenly thought the phrase-structure rules and transformations were operations performed by the speaker's and hearer's mind. That need not be the case. They are just descriptions of what is computed, so to speak, not how to compute it. Confusion about this point was widespread.

The immediate impact of transformational-generative grammar ("TG" to its friends) was that we discovered a lot more about the grammar of English and other languages. Trying to write precise phrase-structure rules and transformations, we described languages explicitly in ways that had never been done before. I think the crown jewel of this was Jackendoff's X-bar theory. Paul Postal wrote a whole book about part of the way English forms subordinate clauses (On Raising). Most of what we all know about our native language was not in the grammar books!

That's where I came in (mid-1970s). Some linguists, including me, continued to be interested mainly in making grammar explicit (notably, GPSG, and Jackendoff's later "Simpler Syntax," both of which I followed closely). But by the early 80s, Chomsky and his close associates were moving toward something else: principles and parameters (P&P). The idea was that the phrase structure rules and transformations don't need to be written out because they follow from more basic principles, together with language-specific parameters that tell you, for instance, the word order of one language versus another. The rules were simplified, in a system Chomsky called minimalism.

P&P provided good tools for describing language typology and historical changes (see Ringe and Eska's book on historical linguistics). I, however, stuck with the more descriptive line of investigation and focused on computer algorithms for parsing (finding the tree structure of sentences). It was widely understood in the 1990s that parsing was the key to computer language understanding, and that the other key problem was relating tree structure to meaning (the problem that most interested me).

Now here we are — and LLMs process language extremely well without ever having been given a grammar to work from. Does that mean TG and all its relatives were bunk?

Crucially, an LLM, being trained, learns the usage of each word separately from all the others. It need not have a grammar at all. Superficially, it looks like a huge dictionary, with information only about individual words, although we can't quite see what generalizations might be hidden in its neural network layers. You'd think there'd be no grammar at all — that learning a language has been shown to be nothing but learning individual words and how they are used. If so, we theoretical linguists are out of business.

Well... It is just now becoming possible to probe what goes on inside an LLM, and Futrell and Mahowald have pulled together what is currently known. Here are their key conclusions:

  • Grammatical structure (trees, etc.) is real, and LLMs learn it.

    We have not spent our careers studying a mirage.
  • Unlike Chomskyan rules, LLMs provide for scattered exceptions and gradients of preference.

    In any Chomsky-inspired system, all the rules are crisp and precise, and exceptions are costly and should be rare. But the one thing we all know about grammar is that everything is full of scattered exceptions. Because an LLM learns the grammar of each word along with the word itself, it has no trouble learning exceptions if it has the data from which to learn them. (See M. Gross, "On the failure of generative grammar," 1979.)

    Similarly, a neural network can easily encode preferences and tendencies, rather than just allowing or disallowing structures. Linguists have long recognized things like Keenan and Comrie's noun phrase accessibility hierarchy but have not been able to express them with grammar rules. Then there are the much bigger and more familiar concepts of marginal or partial grammaticality, variation of grammar with style, etc.
  • Since LLMs and human brains learn languages in quite different ways (the LLM requiring vastly more data than the brain), grammatical structure is not a direct product of the learning mechanism. It is a property of language itself.

    That means psycholinguistics is alive and well if we want to know what goes on in our brains. Interestingly, psycholinguistic experimental methods are also useful for probing how an LLM works! And we don't have LLMs in our heads.
  • On the technical side, the way LLMs get these generalizations is apparently "double descent," a newly discovered characteristic of large machine learning models.

    Traditionally, a model has to be smaller than the training data, or it won't generalize; if it can memorize the training data, it will do so, become "overfitted," and be useless on data that was not in the training set.

    It has been found, however, that if models are much larger, and are trained well beyond the point at which overfitting should be complete, they start to generalize again. After all, they have to do something with their remaining capacity, and generalization is what happens. This may say something profound about our place in nature — a more powerful brain can do what a less powerful one does, and do it in a different way — but it also explains why LLMs learn generalizations that are not immediately needed for their training data.
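
The contrast drawn above between crisp rules and gradients of preference can be made concrete with a toy Python sketch. It is not a claim about how an LLM stores anything; it is a weighted rule table, and the numbers are hypothetical, invented purely for illustration. A crisp grammar can only say yes or no, while a weighted one can rank two structures that are both allowed.

```python
import math

# Hypothetical weights, invented for illustration only.
RULE_PROB = {
    ("NP", ("D", "N")): 0.7,
    ("NP", ("D", "Adj", "N")): 0.3,
}

def crisp_ok(rule):
    """A crisp grammar: a rule is either allowed or it is not."""
    return rule in RULE_PROB

def log_score(rules_used):
    """Sum of log weights; higher (closer to zero) means more preferred."""
    total = 0.0
    for rule in rules_used:
        p = RULE_PROB.get(rule, 0.0)
        if p == 0.0:
            return float("-inf")  # not allowed at all
        total += math.log(p)
    return total

plain = ("NP", ("D", "N"))
modified = ("NP", ("D", "Adj", "N"))
unknown = ("NP", ("N", "D"))

print(crisp_ok(plain), crisp_ok(modified), crisp_ok(unknown))
print(log_score([plain]) > log_score([modified]))
```

Both attested structures pass the yes-or-no test, but only the weighted version can say that one is preferred over the other.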

So there we stand. The story isn't over (fortunately) but I am glad to begin to see how a lot of things are coming out!

2026
April
2

A day in the big city

I learned about the Optimized AI Conference just a couple of days before it took place, but I managed to take in the last two hours of it on March 31. That involved driving to Marietta (north of Atlanta), where Melody lived and worked for a couple of years just before we got married, so I was revisiting old places that had undergone tremendous urban growth.

The conference was a success. I was delighted to see quite a few people I knew from LinkedIn, and one former FormFree colleague who was a keynote speaker and is now very well known for her SQL and data science courses and videos.

From LinkedIn:
Picture

That was in the Cumberland Galleria. Next I went for a walk in Cumberland Mall, where Melody and I used to walk around after going out to dinner, and was pleased to find it thriving. (As you know, the death of malls has bugged me; they were a feature of 1980s living that I very much enjoyed.) This one is doing plenty of business, although, like all malls, it no longer sells much but clothing; there is no bookstore, record store, or Radio Shack.

Picture

It has changed a little; Dick's Sporting Goods occupies what used to be Neiman-Marcus, where we used to look at luxury stereos and the like that we never expected to be able to afford. I'm glad to see the mall prospering, and I think the big problem in the 1980s was that about three times as many malls were built as the economy could support.

Dinner in the Food Court, then a brief visit to Micro Center (formerly MEI Micro Center) in Marietta, which is the oldest Micro Center computer store presently operating, though the chain started earlier, in Ohio. This store dates from 1988 and we visited it occasionally when it was new, although we normally go to the Duluth store now. I am glad to see them catering for hobby electronics; with the demise of both Radio Shack and the local repair-parts jobbers that we used to rely on, it has become very hard to get even the most basic components, supplies, and tools locally. For me, "locally" now means 50 miles away, but the point is, by having these things in a local store, they keep people aware of what they can do.



Off to the moon!

Last night I watched the launch of a crewed spacecraft to the moon, for the first time since 1972. I saw the launches of Apollo 16 and 17 in person, from a swamp near Cape Canaveral, and had watched many previous launches on TV. This will be an orbit of the moon, like Apollo 8, not a landing there.

This time I was using TV, but not broadcast TV. I connected to NASA's web page and used Chromecast to send my computer's video to our seldom-used TV set.

This picture is from a NASA press release:
Picture

While sending human crews is not the most efficient way to explore space, I think we need to preserve and update the technology we already have, rather than let the knowledge of Apollo be lost to posterity.

Godspeed, Artemis II.



From twisted pair to fiber

Seven and a half years ago I chronicled the end of POTS (Plain Old Telephone Service) to our house and its replacement by an Internet cable and VOIP. Soon the coaxial cable will be replaced by fiber optics, and AT&T will again be our carrier. And we will have no cable TV service at all.
