The Spark
I was scrolling Twitter late one night (as one does), when XKCD #2347 floated past my timeline again. You know the one: modern digital infrastructure as a precarious tower of blocks, all somehow balanced on one random project maintained by some dude in Nebraska.
I'd seen it dozens of times, but for whatever reason, I looked at it differently this time. It's not just a joke about fragile infrastructure, it's actually a better way to visualize dependencies!
We spend so much time in software building abstractions like UML, entity-relationship models, or architecture diagrams. They're technically correct, but they don't feel like anything. The world needs more visualizations that tap into intuition rather than requiring you to decode abstract conventions.
If we lean into that intuition, the tower metaphor becomes literal: Why don't we draw actual physical structures where things rest on what they depend on? Your application is a block sitting on library blocks, which sit on framework blocks below.
So I thought, what if I could actually build this? Take a real dependency graph and render it as a stacked tower? How hard could it be?
Famous last words.
The rest of this post is the story of what happened when I tried ...
Welcome to the Side Quest
I thought a few lines of Python would get things moving. Instead, they revealed that I had just enrolled myself in a research project.
Software dependencies form a graph (a directed acyclic graph to be precise), and most graph visualization tools draw node-link diagrams where boxes are connected by arrows. The arrows explicitly show every relationship, and you just arrange things to keep them readable. Some use layered graph drawings with nodes organized into rows.
In node-link layouts (Fig. 1a), the left-to-right order within a row is mostly about aesthetics. Sure, you try to minimize edge crossings (where two arrows pass over each other) because they make diagrams cluttered. But crossings are tolerable. In a stacked tower (Fig. 1b), edge crossings can make a proper structure impossible. Blocks can only support each other if they are in contact.
Fig. 1 (a): Traditional node-link diagram with explicit arrows
Fig. 1 (b): Tower layout where contact implies dependency
So we can't just slap blocks down in any order. We have to find the right order. And since each row's left-to-right sequence is ours to choose, this is fundamentally an ordering problem. For each horizontal layer, we need to pick the sequence that avoids crossings entirely.
Easy, right?
The Combinatorial Explosion
Except, how many orderings are we talking about?
Let's say you have 3 blocks in a row. There are 6 possible orderings (3! = 6). That's fine. But what about 5 blocks? That's 120 orderings. 10 blocks? 3,628,800 orderings. 20 blocks? 2.4 × 10^18 orderings.
This is when I learned (or re-learned, if I'm being honest) that the problem I was casually trying to solve is NP-hard. In plain English: there is no known shortcut algorithm. The only way to guarantee the optimal solution is to try a combinatorially explosive number of possibilities.
Fuck.
Finding a Battle Plan
Staring at an NP-hard problem, the obvious move is to nope out. But NP-hard isn't hopeless; it just means there is no known shortcut for every case. In practice, I could use approximation algorithms (settle for 95% optimal in 5% of the time), exploit domain-specific structure to shrink the search space, or combine exact methods for small subproblems with heuristics for the rest. The art is in figuring out what structure your specific problem has and then ruthlessly exploiting it.
So I thought: I'm surely not the first person to arrange nodes in layers and worry about their arrangement. Remember those layered graph drawings from earlier, the ones with arrows between rows? People must have been working on this for some time. Maybe they've found some domain-specific tricks I can borrow.
The Sugiyama Framework
Turns out, they have. The most influential approach is the Sugiyama Framework, published in 1981. It's a multi-step pipeline for rendering node-link diagrams of graphs that goes something like:
- Layer Assignment: Divide nodes into horizontal rows (layers).
- Crossing Minimization: Find the optimal left-to-right ordering within each layer.
- Coordinate Assignment: Assign precise x-y coordinates to nodes.
- Edge Drawing: Render the connecting lines.
For our tower rendering, the layering is natural: the origin of the tower is the root node (no upstream dependencies). Layer 1 is things that the root node depends on. Layer 2 is things that Layer 1 depends on, and so forth. The hard part is the crossing minimization (step 2).
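As a sketch of that layering step, here's how it might look in prototype-style Python (the actual implementation is in Go); the `deps` map and package names are invented for illustration. Using the longest path from the root, rather than the shortest, guarantees every edge points strictly downward:

```python
def assign_layers(root, deps):
    """Assign each node a layer: its longest dependency path from the root.

    `deps` maps a package to the packages it depends on. Longest-path
    layering keeps every edge pointing downward, which the dummy-node
    step described later relies on.
    """
    layers = {root: 0}

    def visit(node):
        for child in deps.get(node, []):
            # A child must sit at least one layer below its deepest parent.
            if layers.get(child, -1) < layers[node] + 1:
                layers[child] = layers[node] + 1
                visit(child)

    visit(root)
    return layers

deps = {"App": ["Auth", "API"], "Auth": ["Core"], "API": ["Core"]}
assign_layers("App", deps)  # App=0, Auth=1, API=1, Core=2
```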
But here's the catch: Sugiyama's crossing minimization aims to minimize edge crossings, not eliminate them entirely. For towers, that's not good enough. I needed zero crossings, not just fewer crossings.
So Sugiyama gave me a starting point, but it didn't immediately solve my problem. I realized I first needed to understand more about graph topology: which types of graphs are actually stackable (admit a crossing-free ordering), and which aren't.
I'm not a mathematician, and I'm sure there are elegant, theorem-shaped proofs out there that pin this down far better than I can.
Reducing the Problem
What I could do, though, was look for the practical knobs to turn. Once it became clear that stackability isn’t guaranteed, the next step was figuring out how to tame the unruly cases. If we can identify and fix these issues upfront, we might transform an intractable mess into something more manageable. I identified three key properties that I needed to address:
- The Transitivity Trap: If `A` depends on `B` and `B` depends on `C`, adding a direct `A → C` edge creates a redundant constraint. In stacking terms, it demands that `C` sit both two floors below `A` and directly beneath it, which makes the entire tower geometry impossible to satisfy.
- The Long-Edge Problem: When an edge jumps multiple layers at once, it behaves like a diagonal support beam slicing through the structure. These long spans complicate ordering and tend to force crossings, because the edge interacts with every layer it skips.
- The Tangle Motif: Certain dependency patterns (small "knots" in the graph) guarantee crossings no matter how cleverly you arrange the layers. These non-planar motifs inflate the search space because the layout algorithm has to navigate unavoidable overlaps.
It became obvious that I couldn’t stack anything until I dealt with these structural quirks. I needed a way to normalize the DAG first — a small reduction pipeline to tame and untangle whatever shape it arrived in.
Step 1: Transitive Cleanup
I started by removing redundant edges. If you've got App → Auth → Core, you don't also need App → Core (Fig. 2a). It's technically correct, but for our purposes it makes stacking impossible. This is actually called transitive reduction in graph theory, and there are some efficient algorithms for it. You strip out any edge whose target you can also reach by following other edges. The result is a cleaner chain (Fig. 2b) that stacks naturally (Fig. 2c).
Fig. 2 (a): The transitive trap
Fig. 2 (b): After transitive reduction
Fig. 2 (c): Stacked tower view
The graph still means the same thing. App ultimately depends on Core via Auth; I'll take it!
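Here's a minimal sketch of the idea in prototype-style Python. This is the naive reachability version, nothing like the efficient algorithms mentioned above, but it shows the rule: keep an edge only if no detour reaches the same target.

```python
def transitive_reduction(deps):
    """Drop any edge A -> C that is implied by a longer path A -> ... -> C.

    `deps` maps each node to its direct dependencies. Correct for DAGs:
    an edge is redundant iff its target is reachable without using it.
    """
    def reachable_without_edge(start, target):
        # Depth-first search from `start`, skipping the direct edge to `target`.
        stack = [c for c in deps.get(start, []) if c != target]
        seen = set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(deps.get(node, []))
        return False

    return {
        node: [c for c in children if not reachable_without_edge(node, c)]
        for node, children in deps.items()
    }

deps = {"App": ["Auth", "Core"], "Auth": ["Core"], "Core": []}
transitive_reduction(deps)  # drops App -> Core, keeps the chain
```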
Step 2: Edge Shortening
Next, I worried about edges that span multiple layers (say, API in row 0 depends on Cache in row 2, skipping Auth's level, as shown in Fig. 3a). We need to break these into single-layer hops. Why? Because in a tower, all blocks in a layer must rest on the same level floor, and support has to be continuous. There was no magic in the fix; we simply insert dummy nodes to bridge the gap (Fig. 3b):
Fig. 3 (a): Long-spanning edge
Fig. 3 (b): Bridged with dummies
Now API → Cache_1 → Cache forms a chain of adjacent edges. During ordering, we treat these dummy nodes as a single unit and keep them vertically aligned so the support line stays straight. This turns long edges into manageable, stackable segments while retaining the row-by-row character of the ordering problem. Once ordering is complete, we can visually merge the dummy blocks back into a single tall block that represents the continuous support (Fig. 4a-b).
Fig. 4 (a): Intermediate tower view
Fig. 4 (b): Final merged view
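The dummy-insertion step can be sketched like this (prototype-style Python; the dummy naming scheme `Cache_1`, `Cache_2`, ... is invented for this sketch):

```python
def insert_dummies(edges, layers):
    """Split any edge spanning more than one layer into unit-length hops.

    `edges` is a list of (parent, child) pairs; `layers` maps node -> row
    and is updated in place with the new dummy nodes.
    Returns the new edge list plus the set of dummies created.
    """
    new_edges, dummies = [], set()
    for parent, child in edges:
        span = layers[child] - layers[parent]
        if span <= 1:
            new_edges.append((parent, child))
            continue
        prev = parent
        for i in range(1, span):
            dummy = f"{child}_{i}"          # e.g. Cache_1, Cache_2, ...
            layers[dummy] = layers[parent] + i
            dummies.add(dummy)
            new_edges.append((prev, dummy))
            prev = dummy
        new_edges.append((prev, child))
    return new_edges, dummies

layers = {"API": 0, "Auth": 1, "Cache": 2}
edges = [("API", "Cache"), ("API", "Auth"), ("Auth", "Cache")]
new_edges, dummies = insert_dummies(edges, layers)
# ("API", "Cache") becomes ("API", "Cache_1") and ("Cache_1", "Cache")
```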
Step 3: Planarity Repair
Finally, I needed to eliminate inevitable tangles, so that a proper stacking is guaranteed at least in theory. Here's a nasty example case with multiple parents sharing multiple children. Say Auth connects to Logging and Metrics, while API also connects to Logging and Metrics. In graph theory, that's a complete bipartite subgraph, and it guarantees crossings no matter how you order the children.
Fig. 5: Tangled edges
Instead of trying to beat the impossible, I decided to change the structure of the tower. If two parents share the exact same children, I insert a thin "separator" beam between them: the parents rest on the beam, and the beam fans out to the children. Visually, this was an appealing compromise. It still conveys the core idea, but turns an impossible tangle into a tidy two-layer pattern: Auth and API rest on the separator, and the separator distributes support to Logging and Metrics.
Fig. 6 (a): Separator node insertion
Fig. 6 (b): Tower view
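A sketch of the separator rewrite, again in prototype-style Python; the separator naming is invented here, and this handles only the simplest case where parents share exactly the same child set:

```python
from collections import defaultdict

def insert_separators(deps):
    """Route shared-children tangles through a single separator node.

    If two or more parents depend on exactly the same set of (2+) children,
    the complete-bipartite tangle is replaced by parents -> separator ->
    children, which stacks without crossings.
    """
    by_children = defaultdict(list)
    for parent, children in deps.items():
        if len(children) >= 2:
            by_children[frozenset(children)].append(parent)

    result = dict(deps)
    for children, parents in by_children.items():
        if len(parents) < 2:
            continue            # no tangle: a single parent is fine as-is
        sep = "sep_" + "_".join(sorted(children))
        result[sep] = sorted(children)
        for parent in parents:
            result[parent] = [sep]
    return result

deps = {"Auth": ["Logging", "Metrics"], "API": ["Logging", "Metrics"]}
insert_separators(deps)
# Auth and API now each rest on the separator,
# which fans out to Logging and Metrics.
```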
These three steps leave the DAG in a very different state. The redundancy is gone, the edges all move in single-floor increments, and the worst structural knots have been smoothed into simpler shapes. In this lean, untangled form, the graph is finally ready for the real challenge: finding an ordering that stacks.
The Evolution of an Algorithm
I'd love to tell you I sketched out a brilliant algorithm on a napkin and implemented it flawlessly.
That would be a lie.
What actually happened was a long stumble through increasingly desperate strategies, each one teaching me something new about why the previous one failed. Some were hilariously naive. Some were clever but slow. Some actually worked pretty well.
Attempt #1: Brute Force
My first instinct was to architect a galaxy-brain solution no one asked for. But if LeetCode grinding taught me anything, it's this: always start with the naive solution and confidently say "we can optimize this later", lmao. Even if it's hilariously slow, building it forces you to understand the problem. You see the edges (literally, in this case). You notice patterns. Sometimes you even stumble into insights that lead to the actual solution.
So, I could literally just try everything. The algorithm is obviously straightforward:
- For each row, generate all `n!` possible permutations of its nodes.
- Take the Cartesian product across all rows to create every possible complete layout.
- For each candidate layout, build the full tower and count edge crossings.
- Track the best-so-far (minimum crossings) as you go.
- Return the winner.
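The whole loop fits in a few lines of prototype-style Python (the crossing counter here is the naive quadratic version; the graph and rows are invented for illustration):

```python
from itertools import permutations, product

def brute_force(rows, edges):
    """Try every per-row permutation; keep the layout with fewest crossings.

    `rows` lists the nodes per layer, top to bottom; `edges` holds
    (parent, child) pairs between adjacent rows. Factorial in the row
    sizes, so only usable for tiny graphs.
    """
    def crossings(layout):
        total = 0
        for upper, lower in zip(layout, layout[1:]):
            pos = {node: i for i, node in enumerate(lower)}
            # Child positions, visited in left-to-right parent order.
            seq = [pos[c] for p in upper
                          for (pp, c) in edges if pp == p and c in pos]
            # Each inversion in `seq` is one edge crossing.
            total += sum(1 for i in range(len(seq))
                           for j in range(i + 1, len(seq)) if seq[i] > seq[j])
        return total

    candidates = product(*[permutations(row) for row in rows])
    return min(candidates, key=crossings)

rows = [["App"], ["Auth", "API"], ["Core", "Cache"]]
edges = [("App", "Auth"), ("App", "API"),
         ("Auth", "Cache"), ("API", "Core")]
best = brute_force(rows, edges)  # a zero-crossing layout exists here
```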
This actually worked well for small graphs (and was trivially parallelizable), and it surfaced an important question: how do I count edge crossings without melting my CPU?
Naively, you'd compare every pair of edges between two rows and check if they cross, which is O(E²). I found that there's a much faster trick for layered graphs! Between any two adjacent rows we can:
- Fix the order of the lower row and give each node a position: `0, 1, 2, ...`
- Walk across the upper row from left to right, and for each parent, record the positions of its children in the lower row. This gives you a sequence of integers like `[2, 5, 1, 3, ...]`
- Every crossing between edges becomes an inversion in that sequence (a larger number appearing before a smaller one).
And counting inversions is a classic problem with many known solutions! I implemented a Fenwick tree (a binary indexed tree) to do it in O(E log V): as we stream through the sequence, we ask "how many edges have I seen that should have come after this one?" and update the tree structure. That's how we can easily score thousands of candidate layouts as we iterate.
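A minimal sketch of Fenwick-tree inversion counting, in prototype-style Python rather than the project's Go:

```python
def count_inversions(seq):
    """Count inversions with a Fenwick (binary indexed) tree in O(n log n).

    Each inversion in the child-position sequence corresponds to one
    edge crossing between two adjacent rows.
    """
    size = max(seq, default=0) + 2
    tree = [0] * size              # 1-based Fenwick tree over positions

    def add(i):                    # record that position i has been seen
        i += 1
        while i < size:
            tree[i] += 1
            i += i & (-i)

    def count_leq(i):              # how many seen positions are <= i
        i += 1
        total = 0
        while i > 0:
            total += tree[i]
            i -= i & (-i)
        return total

    inversions = 0
    for k, value in enumerate(seq):
        # Of the k values already seen, those strictly greater than
        # `value` each form one inversion with it.
        inversions += k - count_leq(value)
        add(value)
    return inversions

count_inversions([2, 5, 1, 3])  # 3 inversions: (2,1), (5,1), (5,3)
```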
Sticking to my LeetCode instincts and starting with the naive solution turned out to be useful. It forced me to tease out a crucial sub-problem I needed to solve.
But the factorial complexity killed me instantly. A graph with a single 10-node row? That's 10! = 3,628,800 permutations. Not great, but still manageable. Two rows of 10 nodes? That's 3.6 million × 3.6 million = 13 trillion layouts. Even at a million layouts per second, that's 150 days of compute time.
I didn’t have 150 days. I had a weekend and a dangerously optimistic attitude.
Attempt #2: The PQ-Tree
Staring at those factorial numbers, I suddenly realized I was an idiot for obvious reasons: most permutations are garbage.
Think about what's happening. Remember those dummy-node chains from normalization? When we split a long edge like API → Cache into API → Cache_1 → Cache, we did it so the support would be a clean vertical column. For that to work, Cache_1 and Cache have to stay stacked together; they're really just one long beam chopped into pieces.
But that same logic applies to ordinary parent–child relationships too. Suppose API has two children Cache and Auth that both live in the same row. Those two children form a little sibling pair. If brute force scatters them with Cache on the far left, a random node in the middle, and Auth on the far right, you get a layout that makes no structural sense.
Brute force doesn't care. It happily generates these cursed permutations where structurally-linked nodes drift apart. So here's the insight: what if we only generate the valid orderings? Skip the garbage entirely.
After some thinking and a bit of research, I realized this can be formalized using what's called the Consecutive-Ones Property (C1P). If a set of nodes "belong together" (like a chain), they must appear as a contiguous block in the row ordering. Formally, for each constraint set S, there must be some column ordering where all elements of S appear as a single contiguous block. That's the C1P we need to maintain as we iterate. But how do we generate only the C1P-valid orderings, without wasting time on the invalid ones?
Enter the PQ-Tree.
A PQ-tree is a data structure that compactly represents a family of valid permutations. Instead of storing millions of orderings explicitly, it stores the constraints that define them. It has two types of internal nodes:
- P-nodes (Permutable): Children can be arranged in any order.
- Q-nodes (seQuence): Children have a fixed order (but can be reversed as a unit).
Say you have three independent libraries: Logging, Auth, Caching. With no constraints, they're all children of a P-node, representing 6 possibilities (3! orderings).
Now add a constraint: a node on the next layer depends on both Auth and Caching. This means they must be adjacent. The PQ-tree algorithm reduces the tree by applying this constraint. It might create a Q-node linking [Auth, Caching] (fixed order, or reversed), then group that Q-node with Logging under a new P-node.
The resulting tree represents exactly 4 orderings: [Logging, Auth, Caching], [Logging, Caching, Auth], [Auth, Caching, Logging], and [Caching, Auth, Logging].
We went from 6 to 4 with one constraint. Add five more constraints? You might go from millions of permutations to just a few hundred valid ones. If a constraint is impossible to satisfy, the tree reduction fails immediately, which is a huge time-saver.
Fig. 7a shows an unconstrained PQ-tree: a single P-node whose three leaves (Logging, Auth, Caching) represent all 6 permutations. Fig. 7b shows the reduced tree with a Q-node grouping Auth and Caching, and a P-node that can permute Logging with that constrained pair: exactly the 4 valid orderings described above.
Fig. 7 (a): Single P-node with Logging, Auth, Caching
Fig. 7 (b): Q-node enforcing [Auth, Caching] adjacency
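We can sanity-check that count with a few lines of Python. To be clear, this is NOT a PQ-tree (a real PQ-tree never materializes the invalid orderings at all); it just enumerates and filters, which is enough to verify the arithmetic in the example:

```python
from itertools import permutations

def orderings_with_adjacency(nodes, must_be_adjacent):
    """Enumerate permutations where each constraint set is contiguous."""
    def satisfies(order):
        for group in must_be_adjacent:
            positions = sorted(order.index(n) for n in group)
            if positions[-1] - positions[0] != len(group) - 1:
                return False    # group is not one contiguous block
        return True

    return [p for p in permutations(nodes) if satisfies(p)]

valid = orderings_with_adjacency(
    ["Logging", "Auth", "Caching"],
    must_be_adjacent=[{"Auth", "Caching"}],
)
len(valid)  # 4 of the 6 permutations survive
```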
Aha! The smarter algorithm now looks like this:
- For each row, build a PQ-tree initialized with all nodes (fully permutable).
- Apply C1P constraints from chains and all parent-child relationships.
- Generate only the valid orderings by traversing the reduced PQ-tree.
- Take the Cartesian product of valid orderings across rows.
- Score each candidate and return the best.
Using the PQ-tree, brute force could find optimal solutions for a much larger class of graphs: anything with 6-8 nodes per row and strong chain structures.
We're so back.
Attempt #3: The Barycenter Heuristic
So PQ-trees let us handle 6-8 nodes per row. Great! Except... real-world dependency graphs don't politely cap themselves at 8 nodes. A typical microservice architecture? Easily 30-40 nodes in a single layer. And remember, PQ-trees only help if your graph has strong structures. Graphs with lots of many-to-many relationships? You're back in factorial hell.
We're not back. We're still screwed.
At this point, it was time for a mindset shift. Maybe I don't need the optimal solution. Maybe I just need something good enough that runs in reasonable time. Remember the Sugiyama Framework from earlier? It doesn't promise optimality, it uses heuristics to find decent solutions quickly. Let's steal from that playbook.
The most famous heuristic from Sugiyama is the barycenter method: a greedy algorithm that positions each node near the average position of its neighbors. The intuition is simple: if a node's connections are clustered on the left, put it on the left. If they're spread out, put it in the middle. Minimize edge length, minimize crossings!
Say row 1 has three nodes at positions 0, 1, 2. Node D in row 2 has parents at positions 0 and 2, so its barycenter is (0 + 2) / 2 = 1.0. Node E has a parent only at position 1, so its barycenter is 1.0. Node F has a parent at position 2, so its barycenter is 2.0. After sorting by barycenter, the row 2 ordering is [D, E, F] (or [E, D, F] if we break ties the other way). The key insight: we use barycenter values as sorting scores, not actual x-coordinates. Then we sweep multiple times, alternating top-down and bottom-up, to let each row iteratively influence its neighbors.
Here's the final algorithm:
- Start with a fixed ordering for the top row (say, alphabetical).
- For each node in row 2, calculate its barycenter: the average position of its parents in row 1.
- Sort row 2 by barycentric scores (ascending).
- With row 2 now fixed, calculate barycenters for row 3 based on row 2 positions.
- Repeat downward through all rows (alternating top-down and bottom-up).
Each pass is O(n log n), so even with 20+ passes, it's blazingly fast.
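A single top-down pass can be sketched in a few lines of prototype-style Python (the middle-of-row score for parentless nodes is one common convention, not the only choice):

```python
def barycenter_pass(upper, lower, parents):
    """Reorder `lower` by the average position of each node's parents.

    `parents` maps a node to its parents in the row above. Python's sort
    is stable, so ties keep their current relative order.
    """
    pos = {node: i for i, node in enumerate(upper)}

    def score(node):
        ps = [pos[p] for p in parents.get(node, []) if p in pos]
        if not ps:
            return (len(upper) - 1) / 2   # no parents: park in the middle
        return sum(ps) / len(ps)

    return sorted(lower, key=score)

upper = ["A", "B", "C"]
parents = {"D": ["A", "C"], "E": ["B"], "F": ["C"]}
barycenter_pass(upper, ["F", "D", "E"], parents)  # ['D', 'E', 'F']
```

This matches the worked example above: D and E tie at 1.0, F sorts last at 2.0.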
The method works surprisingly well on clean, hierarchical graphs, but it's still just a heuristic. Most of the time it spits out layouts that look way better than they have any right to, but there are no guarantees. On real, messy graphs it can absolutely whiff.
Attempt #4: Heuristic-Guided Constrained Search (The Synthesis)
Overall, I wasn't satisfied.
This is where I dusted off my old optimization brain from past jobs and flexed some long-forgotten branch-and-bound muscles: I had the PQ-tree (smart pruning) and the barycentric method (smart guessing). What if I could combine them? What if the heuristic could guide a structured search process with custom pruning?
Imagine exploring a giant decision tree of possible solutions. You keep track of:
- the best complete solution you've seen so far (the current "champion"), and
- for each partial solution, an optimistic lower bound on how good it could possibly get if everything else went perfectly.
If the best-case version of a partial solution is already worse than your champion, you just prune that entire branch. Don't explore its children, don't think about it again. That's branch-and-bound. Essentially, I implemented a tower-shaped version of that:
- Start with a good bound: run the barycentric heuristic to get an initial solution and use its crossing count as our best-known score.
- Build PQ-trees for each layer: apply all C1P constraints (chains, adjacency) so each tree encodes only the structurally-valid orderings.
- Search recursively layer-by-layer: for each layer, generate its valid orderings from the PQ-tree and sort them by barycentric score (most promising first).
- Track crossings incrementally: as we build the tower depth-first, compute the new crossings between the current and previous row, adding to a running total.
- Prune aggressively: if the running crossing count already exceeds our best complete solution, stop exploring that branch immediately—no point continuing if we've already lost.
- Update the champion: when we reach a complete layout with fewer crossings than the current best, replace it and tighten the pruning bound for future branches.
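The skeleton of that search, heavily simplified into prototype-style Python: plain permutations stand in for the PQ-tree's valid orderings, the naive counter stands in for the Fenwick-tree one, and `initial_best` plays the role of the barycenter heuristic's crossing count.

```python
from itertools import permutations

def branch_and_bound(rows, edges, initial_best):
    """Depth-first search over row orderings with crossing-based pruning.

    Returns (best_layout, best_crossings); layout is None if nothing
    beats `initial_best`.
    """
    def crossings_between(upper, lower):
        pos = {n: i for i, n in enumerate(lower)}
        seq = [pos[c] for p in upper
                      for (pp, c) in edges if pp == p and c in pos]
        return sum(1 for i in range(len(seq))
                     for j in range(i + 1, len(seq)) if seq[i] > seq[j])

    best = [None, initial_best]        # current champion and its score

    def search(depth, layout, running):
        if running >= best[1]:
            return                     # bound: this branch can't win
        if depth == len(rows):
            best[:] = [list(layout), running]
            return
        for order in permutations(rows[depth]):
            extra = crossings_between(layout[-1], order) if layout else 0
            search(depth + 1, layout + [list(order)], running + extra)

    search(0, [], 0)
    return best[0], best[1]

rows = [["App"], ["Auth", "API"], ["Core", "Cache"]]
edges = [("App", "Auth"), ("App", "API"),
         ("Auth", "Cache"), ("API", "Core")]
layout, score = branch_and_bound(rows, edges, initial_best=10)
# score == 0: a crossing-free ordering exists for this graph
```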
This is where things finally started to click. The algorithm was fast enough to handle much larger, messier graphs, and the layouts actually looked good. It felt like there was still a ton of room to explore: tighter bounds, fancier heuristics, parallel tricks, but I'd crossed the line from "this is an intractable NP-hard problem" to "this is actually a decently good solution".
The Final Touches
Looking back, the path from idea to working algorithm was surprisingly predictable in hindsight:
- I started with brute force, literally enumerating permutations, which instantly reminded me why factorial is a cursed word.
- To keep that from setting my laptop on fire, I added fast inversion-based crossing counting, so I could score layouts in reasonable time.
- Then I pulled in the PQ-tree to encode constraints like "these nodes must stay consecutive", letting me skip entire sets of obviously impossible orderings.
- From there, I stole the barycenter heuristic to get a cheap, surprisingly decent way to guess good row orderings on big, ugly graphs.
- Finally, I mashed it all into a search algorithm: PQ-trees define the legal moves, barycentric scores decide which branches to try first, the fast inversion-based crossing counter keeps scoring cheap, and a simple lower bound lets me ruthlessly kill anything that can't beat the best tower so far.
Not bad for a project that started with an XKCD joke!
Separation of Concerns
With the algorithm solved, everything else became trivial. The core outputs one thing: an ordered list of nodes per layer. No pixels, no styling. Just the abstract layout.
The rendering layer handles the rest independently: flow-based width allocation (wider blocks support more structure), positioning, and visual styling. I implemented two styles (clean and hand-drawn) without touching the algorithm. The hard problem stayed separate from the easy one.
Real Dependency Data
To make this work with actual projects, I built parsers for three major package ecosystems: PyPI (Python), crates.io (Rust), and npm (Node.js). Each parser hits the registry API, recursively walks the dependency tree, and builds the DAG.
Then I fetch metadata from GitHub for each package: repository owner, maintainer list, and contributor counts. This lets me enrich the visualization with real-world context about who actually maintains these packages.
The Nebraska Guy Ranking
Remember the XKCD comic about the random developer in Nebraska holding up a critical piece of internet infrastructure? I wanted to surface that in the visualization. The Nebraska ranking algorithm scans the dependency graph and scores each maintainer based on:
- Depth: Packages deeper in the tower (more dependencies above them) score higher. They're foundational.
- Role weight: Owners get 3x weight, leads get 1.5x, regular maintainers get 1x.
- Shared credit: If a package has multiple maintainers, the depth score is divided among them.
The algorithm walks every node, calculates its depth from the root, divides that by the number of maintainers, and accumulates weighted scores. The result: a ranked list of the people whose packages quietly support the entire tower. The Nebraska guy for your stack.
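A sketch of that scoring loop in prototype-style Python, using the role weights from the list above (the role names and data shapes are illustrative, not the real metadata schema):

```python
from collections import defaultdict

ROLE_WEIGHT = {"owner": 3.0, "lead": 1.5, "maintainer": 1.0}

def nebraska_ranking(depths, maintainers):
    """Rank maintainers: deeper packages count more, credit is shared.

    `depths` maps package -> depth from the root; `maintainers` maps
    package -> list of (name, role) pairs.
    """
    scores = defaultdict(float)
    for package, depth in depths.items():
        people = maintainers.get(package, [])
        if not people:
            continue
        share = depth / len(people)     # split depth credit evenly
        for name, role in people:
            scores[name] += share * ROLE_WEIGHT.get(role, 1.0)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

depths = {"App": 0, "Auth": 1, "Core": 2}
maintainers = {
    "Auth": [("alice", "owner")],
    "Core": [("bob", "maintainer"), ("carol", "maintainer")],
}
nebraska_ranking(depths, maintainers)
# alice: 1 * 3.0 = 3.0; bob and carol: 2/2 * 1.0 = 1.0 each
```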
Implementing in Go
I started prototyping this in Python, my usual go-to for "let's just see if this works" projects. But as soon as I understood the scale of the problem (factorial search spaces, millions of layout evaluations), I realized I needed something faster. Much faster.
I'd been writing Go at work lately and was looking for a side project to sharpen my skills. This seemed like the perfect excuse. Go hits a sweet spot: it's fast enough for tight algorithmic loops, but you're not fighting a borrow checker or managing memory manually. You just write code and it runs.
Spoiler: it was absolutely the right call. The performance difference over Python was staggering: orders of magnitude faster for crossing detection and layout evaluation. Go's goroutines made parallelizing the search trivial. And the simplicity of the language meant I spent my time on the actual problem.
The Result
After iterating on algorithms, parsers, and rendering pipelines, the pieces finally came together. The search algorithm finds clean orderings, the metadata enrichment surfaces the maintainers, and the visual styles bring it all to life.
And honestly? I'm pretty damn happy with how it turned out. Here's what it looks like in action for FastAPI, a popular Python web framework. Hover over blocks to see more details.
Wrapping Up
This project ended up teaching me more than graph algorithms. It reminded me what real problem-solving feels like. In the end, Stacktower wasn't about inventing one brilliant algorithm from scratch. It was about finding the right decomposition, borrowing good ideas for each piece, and stitching them together into something that fits this particular problem.
If you want to explore the code, it's all open source on GitHub. You can render your own towers. And if you've read this far, thank you for coming on this journey with me.