The Recursion Problem in Relational Algebra
Relational algebra is the theoretical backbone of every query you will ever write against a relational database. Selection, projection, join, union — five or six operations, and you can express almost any query imaginable. Almost.
There is one class of problem it cannot touch, and understanding why is one of the more instructive moments in database theory.
What Relational Algebra Is Good At
A relational algebra expression takes one or more relations (tables) as input and produces a new relation as output. The expression is finite: a fixed number of operations, applied once, returning a result. This is a feature, not a bug — it makes queries predictable, optimizable, and mathematically well-behaved.
Consider a simple employee table with columns id, name, and manager_id. Finding all employees who report to a specific manager is straightforward:
Done. One selection operation, one pass over the data.
Where It Breaks Down
Now consider a different question: Find all employees who report to manager 42, directly or indirectly — the entire subtree of the org chart.
This is a transitive closure problem. To answer it, you need to follow manager_id links until they run out. But relational algebra has no concept of “repeat this until nothing new is found.” Every expression has a fixed depth, determined at query-write time, not at query-run time.
You could manually unroll it:
\[\begin{aligned} R_1 &= \sigma_{\text{manager\_id} = 42}(\text{Employee}) \\ R_2 &= R_1 \bowtie_{\text{id} = \text{manager\_id}} \text{Employee} \\ R_3 &= R_2 \bowtie_{\text{id} = \text{manager\_id}} \text{Employee} \\ &\quad\dots \end{aligned}\]But how many levels deep is the hierarchy? You don’t know. And if you did, the query changes every time the data changes. This is not a query — it’s a program.
This is not a gap that got overlooked. Relational algebra was designed around closed-form expressions precisely because that constraint is what makes them analyzable and optimizable. Recursion requires a fixed point — “keep going until the result stops changing” — and that is fundamentally outside the algebra’s scope.
The Standard Answer: A Necessary Compromise
SQL introduced WITH RECURSIVE (common table expressions, CTE) specifically to fill this gap. It works. For org charts, bill-of-materials trees, network routing tables, and any other recursive structure, it is the right tool.
But it is worth being honest about what it is: an extension beyond relational algebra, grafted onto SQL because real data is sometimes recursive and databases need to handle it. It is not elegant theory — it is pragmatic engineering. Use it when you need it, but recognize that you have left the clean mathematical foundation behind the moment you do.
Other approaches handle this more naturally. Datalog, for instance, treats recursion as a first-class concept and sits closer to the theoretical ideal. But Datalog is not what runs in production.
Why This Matters for You
When you model a domain, the choice of data structure has consequences. A flat relation is fast, clean, and fully expressible in relational algebra. A recursive structure — a tree, a graph — requires either a recursive SQL extension, a denormalized adjacency list, or a different database paradigm entirely (graph databases like Neo4j exist for exactly this reason).
Knowing this before you design a schema means you can ask the right question early: Is this data inherently recursive? If yes, plan accordingly. If no, keep it relational and keep your queries clean.
The recursion problem is not a flaw in relational algebra. It is a precise statement of what the model is and what it is not. Understanding the boundary is half the job.