Parsing/recognition complexity: (conditional) lower bounds for some usefully expressive formal languages

Take the language \left\{a^n \mid n \in \mathbb{N}\right\}, consisting of n copies of the symbol a. This is a regular language, and regular languages can be recognized in O\left(n\right), linear time.

Take the language \left\{a^n b^n \mid n \in \mathbb{N}\right\}. This is no longer regular, but it is context-free, and context-free languages can be recognized in O\left(n^3\right), cubic time. However, this particular language can be recognized in linear time.

Take the language \left\{a^n b^n c^n \mid n \in \mathbb{N}\right\}. This is no longer context-free, but it can be generated by something called a 2-MCFGwn (more on those later), and those languages can be recognized in O\left(n^6\right) time. But even this language is still trivial to recognize in linear time!
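
(Just to drive home how trivial these canonical examples are, here's a one-pass check for the last one; a throwaway sketch of my own, names included:)

```python
def is_anbncn(s: str) -> bool:
    """Check membership in {a^n b^n c^n}: one length check plus one
    string comparison, so clearly linear time."""
    if len(s) % 3 != 0:
        return False
    k = len(s) // 3
    return s == "a" * k + "b" * k + "c" * k

assert is_anbncn("aabbcc")
assert not is_anbncn("aabcbc") and not is_anbncn("aabb")
```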

What’s going on here? These are canonical examples of their respective classes! If they’re so easy to recognize, how hard can it be to recognize other languages in their class?

I’ll start with a little bit of background, but then quickly get into the meat of this post, which answers some questions of the shape “how hard to parse can the languages of this particular grammar formalism get?”.

The k-Clique problem

There’s a famously hard problem, the k-Clique problem (“does this graph contain a clique of size k?”). The best known algorithms for it run in roughly O\left(n^{\omega k / 3}\right) time (or O\left(n^{k}\right) if you exclude “algorithms like fast matrix multiplication”, which is an informal notion), and the conjecture is that those exponents are essentially the lowest possible. This is the k-Clique hypothesis, and from what I can gather, it’s more towards the \text{P} \neq \text{NP} end of the spectrum when it comes to hypotheses in theoretical computer science, i.e. the widely believed end.
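
(For concreteness, here's the naive brute-force baseline the conjecture is measured against: try every k-subset of vertices, which is roughly O\left(n^k\right) subsets each checked in O\left(k^2\right) time. A throwaway sketch of mine, nothing more:)

```python
from itertools import combinations

def has_k_clique(edges, k):
    """Naive k-clique check: test every k-subset of vertices.
    `edges` is a set of frozenset({u, v}) pairs. Roughly O(n^k * k^2) time;
    the k-Clique hypothesis says this baseline can only be beaten by
    shaving the exponent down to about (omega/3) * k, not further."""
    vertices = {v for e in edges for v in e}
    return any(
        all(frozenset((u, v)) in edges for u, v in combinations(group, 2))
        for group in combinations(sorted(vertices), k)
    )

# A triangle with a pendant vertex: has a 3-clique, but no 4-clique.
edges = {frozenset(e) for e in [(1, 2), (2, 3), (1, 3), (3, 4)]}
assert has_k_clique(edges, 3) and not has_k_clique(edges, 4)
```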

CFGs

CFGs are a very commonly used class of grammars, for which the best known parsers are in the cubic-ish range of runtimes: O\left(n^3\right) without fast matrix multiplication (more precisely O\left(n^3 / \left(\log n\right)^2\right), but the polynomial exponent is the important thing here), and O\left(n^\omega\right) with it (where \omega \approx 2.37 is the matrix multiplication exponent).
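
(To make the cubic-ish bound concrete, here's a minimal CYK-style recognizer for grammars in Chomsky normal form; this is the standard textbook algorithm in a toy encoding of my own, not the construction from the paper I'll get to below. The nested loops over span length, span start, and split point are where the n^3 comes from.)

```python
def cyk_recognize(rules, start, tokens):
    """CYK recognizer for a CFG in Chomsky normal form.
    `rules` is a list of (lhs, rhs) pairs where rhs is either (terminal,)
    or (nonterminal, nonterminal). Runs in O(|rules| * n^3) time."""
    n = len(tokens)
    # table[i][j] = set of nonterminals that derive tokens[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tok in enumerate(tokens):
        for lhs, rhs in rules:
            if rhs == (tok,):
                table[i][i + 1].add(lhs)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for mid in range(i + 1, j):
                for lhs, rhs in rules:
                    if (len(rhs) == 2 and rhs[0] in table[i][mid]
                            and rhs[1] in table[mid][j]):
                        table[i][j].add(lhs)
    return start in table[0][n]

# Toy CNF grammar for {a^n b^n : n >= 1}: S -> A X | A B, X -> S B, A -> a, B -> b
rules = [("S", ("A", "X")), ("S", ("A", "B")), ("X", ("S", "B")),
         ("A", ("a",)), ("B", ("b",))]
assert cyk_recognize(rules, "S", list("aabb"))
assert not cyk_recognize(rules, "S", list("aab"))
```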

But those are upper bounds on the complexity, not lower. Are there CFGs that actually take that sort of cubic-ish time to parse? Turns out, if you believe the k-Clique hypothesis, then yes! This paper gives a CFG (of fixed size) and a way to encode a graph into a string (in time linear in string length) such that the language of the CFG contains that string if and only if the string describes a graph that contains a 3k-clique (that is, a clique of a size that’s a multiple of 3). According to the k-Clique hypothesis, this language cannot be recognized in time O\left(n^{\omega-\varepsilon}\right) for any \varepsilon > 0, or, in other words, in polynomial time with an exponent lower than \omega. If you won’t use things like fast matrix multiplication, the bound is instead 3.

So there we have it, proof that if we pick CFG as our grammar formalism of choice, we are theoretically at risk of ending up with grammars that take pretty much cubic time to parse.

Mildly context-sensitive grammars

There’s just one thing… CFGs aren’t actually expressive enough to capture the entirety of human language. There are languages with structures that require a more expressive grammar, as examined in this article (PDF here). Specifically, Swiss German, which has a construction where you end up with a sequence of NPs with dative or accusative case, followed by a sequence of verbs that require those cases, in the same order as the NPs. We can abstract this into the “copy” language, whose strings consist of two copies of the same string. This language is not context-free.
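
(Same theme as before, by the way: the copy language is trivial to recognize directly even though no CFG generates it. Another quick throwaway sketch:)

```python
def is_copy(s: str) -> bool:
    """Check membership in the copy language {ww}: split in the middle and
    compare the halves -- linear time, despite the language not being context-free."""
    half, rem = divmod(len(s), 2)
    return rem == 0 and s[:half] == s[half:]

assert is_copy("abab") and not is_copy("abba")
```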

The temptation is to model this with the class of context-sensitive grammars, but that is far too expressive and leads to huge computational complexity. So we want a weaker formalism, and that’s where mildly context-sensitive grammars come in.

One of the more well-studied of these is tree-adjoining grammars (TAG), which can handle the language \left\{ a^n b^n c^n d^n \mid n \in \mathbb{N}\right\} and the copy language (a string repeated twice), but not \left\{ a^n b^n c^n d^n e^n \mid n \in \mathbb{N}\right\} or the copy language of strings repeated 3 times. Another grammar formalism that isn’t quite as powerful as general context-sensitive grammars is minimalist grammars: formalizations of the sort of syntactic theories used in the minimalist program, the ones with movement you may be familiar with.

There’s also multiple context-free grammars (MCFG), for which there is a good introduction here (PDF). An MCFG with rank r and dimension d is called a d-MCFG(r), and a d-MCFGwn(r) if it’s well-nested. For well-nested MCFG and d,r \geq 2, d-MCFGwn(r) is equivalent to d-MCFGwn(2), so we can forget about the rank and just write d-MCFGwn. MCFG are quite a general formalism: general MCFG are equivalent to minimalist grammars, an extension called PMCFG (parallel MCFG) is equivalent to minimalist grammars with copying, and in fact 2-MCFGwn, which is in some sense the least complex MCFG, is equivalent to TAG! A nice property of MCFG is that you can parse any d-MCFG(r) grammar in time O\left(n^{d\left(r+1\right)}\right).
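
To make the “dimension” idea a bit more concrete, here’s a textbook-style 2-MCFG for the copy language \left\{ ww \mid w \in \left\{a,b\right\}^* \right\}, written in an informal rule notation of my own (so don’t take it as quoted from any of the linked papers): the nonterminal A carries a pair of strings, and every rule extends both components in lockstep, which is exactly what a CFG’s single-string nonterminals can’t do.

\begin{align} S\left(x_1 x_2\right) &\leftarrow A\left(x_1, x_2\right) \\ A\left(a x_1, a x_2\right) &\leftarrow A\left(x_1, x_2\right) \\ A\left(b x_1, b x_2\right) &\leftarrow A\left(x_1, x_2\right) \\ A\left(\varepsilon, \varepsilon\right) &\leftarrow \end{align}

This has dimension 2 and rank 1 (and is trivially well-nested), so it sits inside 2-MCFGwn, which fits with TAG being able to handle the copy language.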

So how hard can parsing 2-MCFGwn get, then? From the bound above, we can parse 2-MCFGwn (and thus TAG) in O\left(n^6\right) time, and there are in fact TAG parsers with that performance, and even ones that run in O\left(n^{2\omega}\right) time. But as before, that’s just an upper bound; is there a lower bound? Using a very similar technique to the paper about CFGs, this paper constructs a TAG for a language that contains a string encoding a graph iff that graph has a 6k-clique, and uses it to prove that if you believe the k-Clique hypothesis, TAGs (and by extension 2-MCFGwn) cannot be parsed in a time with a polynomial exponent lower than 2\omega, or, if you won’t use things like fast matrix multiplication, 6.

Thus, if you use TAGs, MCFG, or minimalist grammars, you run the theoretical risk of pretty much O\left(n^6\right) time complexity to parse your language!

Conclusion

Based on a pretty solid conjecture, the upper bounds for our parsing algorithms for context-free grammars and for some of the weakest mildly context-sensitive formalisms are essentially also lower bounds, at least when it comes to the polynomial degree. Luckily, for most practical grammars the situation isn’t nearly as bad (they often parse in close to linear time), but as long as you work within the unrestricted versions of these formalisms, it is possible to end up with grammars that actually hit this higher complexity, which is a bit scary.

Why mention “unrestricted”? Because, at least for CFGs, there are huge families of them that have been proven to be parseable with linear complexity. This is where we get our LR(k) and LALR parsers and so on from. So that’s an option if you want to avoid the spooky cubic monster.

What about the even more expressive formalisms? Higher-dimension MCFGwn grammars like 3-MCFGwn, or not-necessarily-well-nested MCFG and minimalist grammars, or even PMCFG and minimalist grammars with copying? What sort of lower bounds do we have for those? To my knowledge, that’s an open question! Exciting, but also a little scary. I wouldn’t be surprised if this k-Clique technique can be extended to push any minimalist grammar or MCFG to its limits.

EDIT: changed the wording a little to reflect that the languages are much bigger than only the proofs-of-clique, see followup

Please, make a forum post for this on the Logical Languages Lodge server, if you haven’t already. I’ve shared the link to this discussion on Reddit’s subreddit for engineered languages.

Made a thread in the “tech-talk” forum channel

I think your logic for establishing O(n^6) as an upper bound on validation complexity for TAGs/minimalist grammars is sound. I ask this follow-up question: will any human-produced utterances come close to these bounds in any meaningful way?

Specifically, Swiss German, which has a construction where you end up with a sequence of NPs with dative or accusative case, followed by a sequence of verbs that require those cases, in the same order as the NPs

I would hazard a guess that in spoken/written constructions, the count of the noun elements in that sequence must be under 7 or 8 items – the number commonly accepted as the “number of things” we can remember in short-term memory.

I then note that while this is a context-free language when the count is unbounded,

\left\{\alpha_1\dots\alpha_n\beta_1\dots\beta_n \mid \alpha_i,\beta_i \in \text{N}\times\text{V}, n \in \mathbb{N}\right\}

I believe that bounding the count turns this into a regular language,

\begin{align} &\left\{\alpha_1\dots\alpha_n\beta_1\dots\beta_n \mid \alpha_i,\beta_i \in \text{N}\times\text{V}, 0 < n < 8 \right\} \\ =\;&\left\{ \left(\alpha_1\alpha_2\alpha_3\alpha_4\alpha_5\alpha_6\alpha_7\beta_1\beta_2\beta_3\beta_4\beta_5\beta_6\beta_7 \vert \dots \vert \alpha_1\beta_1 \vert \emptyset \right) \mid \alpha_i,\beta_i \in \text{N}\times\text{V} \right\} \end{align}

which implies to me that any reasonably spoken utterance is going to be parseable in linear time.

I would be interested to see any pathological cases in spoken/written language that cannot be bounded this way. My gut says that these don’t exist & that the bounds on parsing complexity are mostly academic and/or useful for hardening existing parsers from DoS attacks that rely on specially-crafted input.

Also, the course I took on automata was only a quarter of a semester, so I’m butting up against the edges of my understanding. Lmk if I totally missed the mark here.

EDIT:

Why mention “unrestricted”? Because, at least for CFGs, there are huge families of them that have been proven to be parseable with linear complexity. This is where we get our LR(k) and LALR parsers and so on from. So that’s an option if you want to avoid the spooky cubic monster

I realize that what I have proposed above is one of these very restrictions – as far as I can tell I’ve created an LR(8) parseable language.

A few small corrections:

What I’m talking about here is a lower bound of essentially O\left(n^6\right), since the upper bound is given by the best known parsing algorithm. For TAGs that is indeed O\left(n^6\right), but for minimalist grammars, which can be arbitrarily more complicated (more moving parts, literally), the best upper bound I know of is O\left(2^k n^{2k+3}\right) where k is the number of “licensing features”, the ones that trigger movement (from this article).

When they’re in that sequence, I’m pretty sure you have a context-sensitive language; the \beta string would need to be reversed for it to be context-free. Not that this really matters for your point.

Speaking of, corrections over, time to address your point.

(ahh yes, my good old friend 7±2)

I will note that the language you end up with is not just regular but even finite! (Although of course you could apply the Kleene star and get a regular language; the finiteness is relevant to my next point.)

I will steal an argument from this thesis (PDF; “Generating copies: An investigation into structural identity in language and grammar” by Gregory Michael Kobele; section D-1.2, p. 251, PDF page 265): I cannot present you with any such case, since any spoken or written sentence would necessarily be of finite length. Thus, any discussion about parsing complexity necessarily must be about some ideal or generalized version of the syntax, and the question becomes one of what the “best” generalization is, something without an objective answer and therefore a matter of debate.

To try to spark that debate, I will first ask if you agree that some combination of grammar size, parsing time complexity, and parsing space complexity is a reasonable figure of merit? If so, I will argue this: with O\left(n\right) “vocabulary” rules to determine which \alpha_i goes with which \beta_j, I can define a 2-MCFG with only O\left(1\right) “syntax” rules that generates your unbounded language. Similarly, for Swiss German, if you abstract away the vocabulary and focus only on matching dative to dative and accusative to accusative, you only need O\left(1\right) rules in total in a 2-MCFGwn, specifically 5 (4 if you accept generating the empty string), to handle any number of pairs; see the MCFGs for linguists paper, page 15. A regular grammar, on the other hand, would need O\left(2^n\right) rules to handle n pairs (at least I think so: for each item you have the binary choice of accusative or dative, and you can’t track how each pair matches up). That’s the grammar size, then. If you use a queue, recognition can be done in linear time for this particular 2-MCFGwn (see the sketch below), so same as the regular grammar. Recognition space complexity is constant for the regular grammar and linear for the 2-MCFGwn.

Summing up, we have a tradeoff of grammar size O\left(1\right) and space complexity O\left(n\right) for the 2-MCFGwn versus grammar size O\left(2^n\right) and space complexity O\left(1\right) for the regular grammar. I rest my case ;)
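
To make the queue idea concrete, here’s a minimal sketch with a made-up token encoding (each token is an NP or a verb tagged with the case it carries/requires). It checks the abstracted case-matching pattern directly rather than doing grammar-driven MCFG parsing, in one linear pass with an O\left(n\right) queue.

```python
from collections import deque

def matches_cross_serial(tokens) -> bool:
    """Linear-time, queue-based check of the abstracted Swiss German pattern:
    a block of NPs (each dative or accusative) followed by a block of verbs
    whose required cases appear in the same order. Tokens are (kind, case)
    pairs with kind in {"NP", "V"} and case in {"dat", "acc"}."""
    pending = deque()       # cases of NPs still waiting for their verb
    seen_verb = False
    for kind, case in tokens:
        if kind == "NP":
            if seen_verb:   # an NP after a verb breaks the pattern
                return False
            pending.append(case)
        else:               # a verb: must match the oldest unmatched NP
            seen_verb = True
            if not pending or pending.popleft() != case:
                return False
    return not pending      # every NP found its matching verb

ok = [("NP", "dat"), ("NP", "acc"), ("V", "dat"), ("V", "acc")]
bad = [("NP", "dat"), ("NP", "acc"), ("V", "acc"), ("V", "dat")]
assert matches_cross_serial(ok) and not matches_cross_serial(bad)
```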

Ah shoot, I missed your edit! In that case I think I might be misunderstanding your \text{N} \times \text{V} notation? Because I think your language is finite and thus parseable in constant time.

Addressing corrections first:

Nope, I think you interpreted everything I was saying correctly; it was me who was incorrectly analyzing the language as if there were a Kleene star tacked onto it (hence my thought that it was regular and not finite).

I minced words here, s/upper bound/lower bound/. The argument structure you’ve presented is clear:
for some d, r >= 2,
((being in d-MCFGwn(r) => being in d-MCFGwn(2)) && d-MCFGwn(2) ~= minimalist grammars && 2-MCFGwn(2) ~= TAG)
=> (TAG has max asymptotic bound of O(n^6) by parser construction && k-cliques problem implies you can actually build a TAG grammar of O(n^6))
=> 2-MCFGwn(2) has a lower asymptotic bound of O(n^6)

Now onto content:

I will first ask if you agree that some combination of grammar size, parsing time complexity, and parsing space complexity is a reasonable figure of merit?

Sure, this is a reasonable assumption. On a real corpus I’d be more concerned with amortized time/space complexity but as established by Kobele that’s a cop-out answer.

AFAICT yes this is true – and I hand waved it away just by assuming that a set of preexisting agreeing noun/verb pairs existed. This is the \text{N}\times \text{V} I previously mentioned.

This all tracks – assuming the asymptote we care about is the one correlated with increasing the length of language input unboundedly. I suppose the practical crux here is that in parsing a real text corpus, you might be interested in using a regular grammar where possible if you know the bounds on the input or specific pairs in question, because the constant time/space complexity will be far lower. In an unbounded situation (or one where the bounds are unknown (and greater than 7 ± 2 :wink:)), you have successfully convinced me of your case.

This part is wrong: general minimalist grammars are equivalent to general MCFG, without even well-nestedness. There’s a transformation from a minimalist grammar with k licensing features to a (k-1)-MCFG (don’t recall the rank nor my source for that :sweat_smile:).

Very important, yes! In this case with n being the number of pairs, I think it works, because our sentence fragments end up being 2n words long.

I think once you add vocabulary (let’s say w words, either accusative or dative), the grammar size (which I would assume determines the constant factor) for the regular language explodes even more: now you have a grammar of size O\left(w^n\right), while there’s a 2-MCFG(2) (I’m pretty sure you lose well-nestedness here, unfortunately) with a constant number of “syntax” rules plus an O\left(w\right) number of “vocabulary” rules. Same time and space complexities as before.

By the way, this whole topic was brought about for me by thinking about parsing real sentences and about the existence of parsing algorithms where you only pay for the complexity you need, since I read this article which gives an algorithm for parsing PMCFG (equivalent to minimalist grammars with copying a la Kobele) which, despite having polynomial time complexity with a grammar-dependent degree, empirically performs in seemingly linear time on their 4 natural language grammars. So I wondered “can we push that higher? what would it take? sure is suspicious how the grammars we actually build in practice tend to be parseable in linear time despite the upper bounds on the parsing algorithms for their classes being huge polynomials”.

And as regards that latter part, I believe Earley parsing has similar behavior, where it’s cubic in the worst case, but can gracefully scale down to linear if your grammar is deterministic. It also happens to be quadratic for unambiguous grammars (not all unambiguous CFGs are deterministic), and it’s an open question (or at least was as of a decade ago) what the lower bound on parsing unambiguous CFGs is.
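
Since Earley keeps coming up, here’s a minimal textbook-style recognizer sketch of mine (not taken from any of the linked papers), assuming an ε-free grammar so the completer stays simple. The chart-per-position structure is what gives the graceful scaling mentioned above: columns stay small for well-behaved grammars and only blow up when the grammar forces them to.

```python
def earley_recognize(grammar, start, tokens):
    """Earley recognizer. `grammar` maps each nonterminal to a list of
    right-hand sides (tuples of symbols); a symbol is a nonterminal (a key
    of `grammar`) or a terminal (anything else). Assumes no epsilon rules.
    Items are (lhs, rhs, dot, origin). Returns True iff tokens is derivable."""
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(n + 1):
        worklist = list(chart[i])
        while worklist:
            lhs, rhs, dot, origin = worklist.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:
                    # predictor: expand the nonterminal right after the dot
                    for prod in grammar[sym]:
                        item = (sym, prod, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item)
                            worklist.append(item)
                elif i < n and tokens[i] == sym:
                    # scanner: the terminal matches the next input token
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:
                # completer: lhs is done, advance items that were waiting on it
                for (l2, r2, d2, o2) in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        item = (l2, r2, d2 + 1, o2)
                        if item not in chart[i]:
                            chart[i].add(item)
                            worklist.append(item)
    return any(lhs == start and dot == len(rhs) and origin == 0
               for (lhs, rhs, dot, origin) in chart[n])

grammar = {"S": [("a", "S", "b"), ("a", "b")]}   # {a^n b^n : n >= 1}
assert earley_recognize(grammar, "S", list("aaabbb"))
assert not earley_recognize(grammar, "S", list("aabbb"))
```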

I wrote the CFG here in Nearley and tried generating some strings… turns out that, sure, among the strings that encode a graph it generates exactly the ones whose graph has a 3k-clique, but it also generates a lot of other strings! Among others, ones that don’t encode a graph at all (and so far that is in fact all I’ve been able to generate). I’ll edit the post with some careful wording to reflect this.

Ahh, I assumed there would be a strategy to allow a regular language parser to nest and not require the asymptotic complexity to be multiplied by the size of the vocab (I suppose this form of nesting turns regex into a context-sensitive parser, a la matching balanced pairs in regex). If this strategy does not exist, then yeah, even in the bounded case with constant-size input it’s unlikely anything but the most trivial languages would be parsed more efficiently with a regular parser.

Sidetrack: In code, I’ve only ever written recursive-descent parsers, LR(n) parsers, PEG parsers, and whatever parser combinators are. I desperately need to put Earley parsing at the top of my list as the next thing to learn… between this and the quality of the error messages Earley parsing can return I believe there’s a lot I’m missing out on.

Pretty much this, yeah, except even those extensions still only get you CFGs, I think.

Do I ever have some book chapters for you! “Practical General Top-down Parsers”, by Anastasia Izmaylova and Ali Afroozeh, chapters 1, 2, and 7. It describes a way to write recursive-descent-flavored parsers and even parser combinators that have that sweet, sweet Earley-like “cubic worst case, but you only pay for what you use” performance, while, in case of ambiguity, generating a representation from which the entire parse forest can be efficiently extracted. I haven’t written my own CFG parser (yet?) but if I were to, this is the approach I would take.
