Problem 33: Why (I think) Nature loves power laws
This is something that has been on my mind for a while now. Sometime during the second year of my PhD at MIT, I was staring at impedance plots of battery electrodes and kept seeing a stubborn shape – a depressed semicircle in the Nyquist plane that no honest RC circuit could fit. The community’s polite workaround is to throw in a “Constant Phase Element” with impedance \(Z_{CPE} = Q^{-1}(j\omega)^{-\alpha}\) for some fractional \(\alpha\) between 0 and 1 and move on. What started bothering me was that the same little exponent kept showing up in places that had nothing to do with each other – supercapacitors, cracked solids, brain signals, glasses relaxing toward equilibrium. And not just there – we see power laws in city sizes, stock returns, earthquakes, even the weight matrices of GPTs. There was clearly something more general going on.
In this post, I want to give a “feel” for one of the cleanest mechanisms by which Nature manufactures power laws – the sandpile model of Bak, Tang, and Wiesenfeld (BTW). It’s a beautifully simple cellular automaton that, with no parameter tuning whatsoever, settles into a state where the size of avalanches follows a power law. After we get comfortable with it, I’ll try to connect this back to the maximum-entropy story (which I think is the closest we have to a “why” for the shape itself), and then we’ll look at a few places in the wild where this same kind of power law appears.
A pile of sand
Let us start with a picture in mind. Imagine an empty table on which we are dropping grains of sand, one at a time, very slowly. At first, nothing interesting happens – a small pile starts forming. But as the pile grows taller, the local slopes get steeper, and eventually a single new grain triggers a small landslide. Sometimes the landslide is tiny, just a few grains shuffling around. Sometimes it’s enormous, cascading across the entire pile in a way that feels totally disproportionate to the one grain we just added.
In their famous 1987 paper, Bak, Tang and Wiesenfeld had the elegant idea of replacing this physical sandpile with a cellular automaton that captures the same essential dynamics. Consider a 2D grid where each site \((i,j)\) stores an integer \(z_{i,j} \geq 0\), which we can think of as the local “slope” or pile height. The rules are almost embarrassingly simple,
- Drive (drop a grain). At each time step, pick a random site and increment its value by 1: \(z_{i,j} \to z_{i,j} + 1\).
- Topple. If at any site we have \(z_{i,j} \geq z_c\) (where \(z_c = 4\) for the standard 2D model), then the site becomes unstable and topples,
\[\begin{equation} z_{i,j} \to z_{i,j} - 4, \quad z_{i\pm 1,j} \to z_{i\pm 1,j} + 1, \quad z_{i,j\pm 1} \to z_{i,j \pm 1} + 1 \end{equation}\]i.e. the toppling site loses 4 grains and donates 1 to each of its 4 nearest neighbours.
- Cascade. A toppling can push a neighbour over the threshold, which topples in turn, which pushes its own neighbours over, and so on. We let this cascade play out fully (treating the boundaries as “off the table” – grains that reach the edge simply disappear) before dropping the next grain.
That’s it. There are no parameters to tune. There is no temperature, no critical point hidden somewhere that we have to dial into. And yet if we sit back and watch, what we observe is remarkable.
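If you want to watch it yourself, here is a minimal sketch of the automaton in Python – written straight from the three rules above, with the grid size and grain counts being my own arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

L = 50                       # the "table" is an L x L grid
Z_C = 4                      # toppling threshold for the standard 2D model
z = np.zeros((L, L), dtype=int)

def drop_grain(z):
    """Drive: add one grain at a random site, then let the cascade
    play out fully. Returns the avalanche size (number of topplings)."""
    i, j = rng.integers(L, size=2)
    z[i, j] += 1
    topplings = 0
    while True:
        unstable = np.argwhere(z >= Z_C)
        if len(unstable) == 0:
            break
        for i, j in unstable:
            z[i, j] -= 4
            topplings += 1
            # Donate one grain to each nearest neighbour; grains pushed
            # past the edge fall off the table and disappear.
            for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                if 0 <= ni < L and 0 <= nj < L:
                    z[ni, nj] += 1
    return topplings

# Let the pile build up and self-organize, then record avalanche sizes.
for _ in range(20000):
    drop_grain(z)
sizes = [drop_grain(z) for _ in range(50000)]
```

A log-log histogram of `sizes` should come out as a straight line over a couple of decades before the finite table size imposes a cutoff – which is exactly the power law we are about to discuss.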
Why it produces a power law (the feel)
Let me try to give the feel before getting any more technical. Two competing things are happening at the same time,
- Slow drive. We are pumping energy (sand) into the system grain by grain.
- Fast dissipation in bursts. The system releases this energy through avalanches, with grains leaving the system at the boundaries.
Initially, the pile is mostly flat and avalanches are small and local. But the slow drive keeps adding grains, and the average slope keeps creeping up. As the slopes get higher, the correlation length of the avalanches grows – a single toppling can push more of its neighbours over the edge, and the cascade can travel farther before it dies out.
Eventually, the pile reaches a statistical steady state in which the average inflow of grains exactly equals the average outflow at the boundaries. In this steady state – and this is the beautiful part – the system has spontaneously climbed to a configuration that is critical: the correlation length of avalanches is essentially infinite (limited only by the size of the table). A single new grain can trigger an avalanche of essentially any size, from one grain shuffling to a system-spanning cascade. This is what Bak called self-organized criticality: the system tunes itself to the critical point, no external knob required.
Now, why does this give a power law? Once the system is critical, scale invariance kicks in. Think of an avalanche as a tree of topplings – the original grain causes some topplings, each of those topples its own neighbours, and so on. If the average number of “child topplings” per “parent toppling” (the branching ratio) is exactly 1, we are sitting at the critical point of a branching process – the boundary between cascades that die out quickly (branching ratio below 1) and cascades that explode (above 1). Avalanches of every size \(s\) become possible, and the probability of an avalanche of size \(s\) takes the form
\[\begin{equation} P(s) \sim s^{-\tau} \end{equation}\]with \(\tau\) a critical exponent (in 2D BTW, simulations give \(\tau \approx 1.2\), see e.g. this nice review). The reason is essentially the same reason that random walks have no characteristic excursion length when they are on the boundary between drift-positive and drift-negative – there’s no preferred scale, so the only distribution consistent with that is a power law.
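The mean-field version of this argument fits in a few lines of Python. Below is a sketch of a Galton-Watson branching process with Poisson offspring (the Poisson choice and the size cap are my own simplifications); at branching ratio exactly 1 the survival function \(P(S \geq s)\) decays like \(s^{-1/2}\), i.e. \(\tau = 3/2\) in mean field – the 2D sandpile’s \(\tau \approx 1.2\) differs because lattice avalanches are not independent trees:

```python
import numpy as np

rng = np.random.default_rng(1)

def avalanche_size(branching_ratio=1.0, cap=10**6):
    """Total progeny of a Galton-Watson tree with Poisson offspring.
    branching_ratio = 1 is the critical point."""
    active, total = 1, 1
    while active and total < cap:
        children = int(rng.poisson(branching_ratio, size=active).sum())
        total += children
        active = children
    return total

sizes = np.array([avalanche_size() for _ in range(100000)])

# At criticality P(S >= s) falls off like s^(-1/2) with no knee until
# the cap; at branching ratio 0.9, say, an exponential cutoff appears.
for s in (1, 10, 100, 1000):
    print(s, (sizes >= s).mean())
```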
What makes this whole story so satisfying is that the BTW model needs nothing – no temperature, no carefully chosen coupling, no fine-tuning. The driving itself manufactures the criticality. That’s the punchline of self-organized criticality, and once you’ve seen it, you start suspecting that a lot of natural systems – which after all are also being slowly driven and occasionally relaxing in bursts – might be sitting in this kind of self-organized state without anyone having put them there on purpose.
Connecting back to maximum entropy
Now that we have a mechanistic story for how a power law shows up, I want to briefly connect it to why the shape of the steady-state distribution must be \(p(x) \propto x^{-\gamma}\) and not something else.
Recall the standard Jaynes (1957) maximum-entropy program: of all distributions consistent with what we know, pick the one that maximizes
\[S[p] = -\int p(x)\log p(x) \, dx\]If the only thing we know is the average \(\langle x \rangle\), the Lagrange multiplier dance gives an exponential, \(p(x) \propto e^{-\lambda x}\). So Jaynes by himself does not give a power law – a subtlety I think is worth flagging because I almost glossed over it.
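For completeness, the dance is one line. Add multipliers for normalization and the mean and maximize
\[\begin{equation} \mathcal{L}[p] = -\int p \log p \, dx + \lambda_0 \left( \int p \, dx - 1 \right) - \lambda \left( \int x\, p \, dx - \langle x \rangle \right) \end{equation}\]Setting the functional derivative \(\delta \mathcal{L}/\delta p = -\log p - 1 + \lambda_0 - \lambda x\) to zero gives \(p(x) \propto e^{-\lambda x}\) – an exponential, with \(\lambda\) fixed by the constraint. Nothing about this machinery produces a power law until we change what we constrain.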
The actual move that gets you a power law was made by Montroll and Shlesinger (1983) in a paper with the wonderful title “A tale of tails.” They pointed out that to get an inverse-power-law distribution from maximum entropy, you have to swap the linear constraint for an “unconventional auxiliary condition” – you constrain the average of \(\log x\) instead of the average of \(x\). That is, you know the typical order of magnitude but not the typical value. Re-doing the optimization with \(\int (\log x)\, p(x)\, dx = \langle \log x \rangle\) as the constraint, the multiplier \(\gamma\) enters in the right place, and the maximizer is
\[\begin{equation} p(x) \propto e^{-\gamma \log x} = x^{-\gamma} \end{equation}\]This dovetails very nicely with the sandpile picture. Once the BTW model is critical, there is no characteristic avalanche size – the one piece of information you would naturally try to constrain (\(\langle s \rangle\)) is essentially undefined. What the system does know is the order of magnitude on which avalanches live (set by the system size, which acts as an upper cutoff). That is exactly the regime in which the Montroll-Shlesinger constraint is the natural one, and the maximum-entropy distribution is, predictably, a power law.
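To see the role of the cutoff concretely, here is a small numerical sketch (the specific cutoff \(X\) and target value are toy numbers of my own choosing): given a measured \(\langle \log x \rangle\) and a finite support \([1, X]\) playing the role of the system size, the constraint equation pins down the exponent \(\gamma\) uniquely.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

X = 1e4          # upper cutoff, standing in for the system size

def mean_log(gamma):
    """<log x> under p(x) proportional to x^(-gamma) on [1, X]."""
    norm, _ = quad(lambda x: x**-gamma, 1, X)
    num, _ = quad(lambda x: np.log(x) * x**-gamma, 1, X)
    return num / norm

# Given a measured "order of magnitude", solve for the exponent.
target = 2.0     # i.e. avalanches typically live around e^2 ~ 7.4
gamma = brentq(lambda g: mean_log(g) - target, 1.01, 10.0)
print(f"gamma = {gamma:.3f}")
```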
So the two pictures fit together. The sandpile gives us a mechanism for arriving at a state with no preferred scale; maximum-entropy-on-\(\log x\) tells us that, once we are in such a state, the shape of the distribution is forced to be \(x^{-\gamma}\). One is the recipe; the other is the inevitable result.
Where this shows up in the wild
Once you have the sandpile picture in mind, you start seeing it everywhere. Here are a few of the best-documented examples I came across,
Earthquakes (the Gutenberg-Richter law)
The earth’s crust is, with some poetic license, a giant 3D sandpile – tectonic plates push grains of stress in slowly, and faults release them suddenly in bursts. The famous Gutenberg-Richter law (1944) says the number \(N\) of earthquakes with magnitude greater than \(M\) obeys \(\log_{10} N = a - b M\), with \(b \approx 1\) in most seismically active regions. Since magnitude is itself a logarithm of energy, this is a true power law in the released energy. For every magnitude-4 quake, expect about ten magnitude-3s and a hundred magnitude-2s. There is no characteristic earthquake.
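If you want to check this on a catalog yourself, the standard maximum-likelihood estimate of \(b\) is a one-liner (Aki’s 1965 estimator). Here is a sketch on synthetic magnitudes – the completeness threshold \(M_c\) and the catalog are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Gutenberg-Richter with b-value b means magnitudes above the
# completeness threshold M_c are exponential with rate b * ln(10).
b_true, M_c = 1.0, 2.0
mags = M_c + rng.exponential(scale=1 / (b_true * np.log(10)), size=10000)

# Aki's maximum-likelihood estimator for the b-value.
b_hat = np.log10(np.e) / (mags.mean() - M_c)
print(f"estimated b = {b_hat:.3f}")   # should land near 1
```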
Stock market returns (the inverse cubic law)
Markets too look like a sandpile – traders pump in information slowly and the price reacts in cascades. Empirically, if \(r\) is the return of a liquid stock or index, the tail of the distribution obeys
\[\begin{equation} P(|r| > x) \propto x^{-\alpha} \end{equation}\]with \(\alpha \approx 3\) ([Gopikrishnan, Plerou, Gabaix and Stanley](https://ideas.repec.org/p/arx/papers/cond-mat-9803374.html), validated across many markets). This is the reason crashes happen far more often than a Gaussian risk model would suggest – a \(5\sigma\) day in a Gaussian world is roughly a once-in-ten-millennia event, while under the inverse cubic you get one every few years. Anyone who lived through 2008 has empirical evidence on which is closer to reality.
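The numbers in that comparison are easy to reproduce. A back-of-the-envelope sketch, where the 252 trading days and the anchoring of the cubic tail at the \(2\sigma\) level are my own assumptions:

```python
from scipy.stats import norm

TRADING_DAYS = 252

# Gaussian: chance of a one-sided move beyond 5 sigma on any given day.
p_gauss = norm.sf(5.0)
print(f"Gaussian: once every {1 / p_gauss / TRADING_DAYS:,.0f} years")

# Inverse cubic: P(|r| > x) ~ x^(-3). Anchor the tail so that 2-sigma
# days happen ~5% of the time (roughly Gaussian-like in the bulk); a
# 5-sigma day then has probability 0.05 * (5/2)^(-3).
p_cubic = 0.05 * (5 / 2) ** -3
print(f"Inverse cubic: once every {1 / p_cubic / TRADING_DAYS:.1f} years")
```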
Cities and words (Zipf’s law)
If you rank cities by population from largest to smallest, the population of the \(n\)-th largest scales like \(\text{population}(n) \propto n^{-1}\). This is Zipf’s law, and the same exponent of \(\sim 1\) shows up for the frequency of words in natural language, the size of firms, and the number of citations to academic papers. The mechanism here is closer to a multiplicative random walk (Gibrat’s law of proportional growth) than a literal sandpile, but the effect is the same – multiplicative dynamics on a system with no preferred scale, and we land on a power law.
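Gibrat-style growth is easy to simulate. Here is a sketch (the lognormal shocks, the lower reflecting barrier, and the parameter values are standard modeling choices, not anything fitted to real cities); the log-drift \(\mu\) and variance \(\sigma^2\) are deliberately set so that the stationary tail exponent \(\zeta = -2\mu/\sigma^2\) comes out to exactly 1, the Zipf value:

```python
import numpy as np

rng = np.random.default_rng(7)

N, STEPS, S_MIN = 20000, 5000, 1.0
sizes = np.ones(N)                  # every "city" starts at the barrier

for _ in range(STEPS):
    # Proportional growth: each city is multiplied by a random factor...
    sizes *= rng.lognormal(mean=-0.005, sigma=0.1, size=N)
    # ...with a reflecting lower barrier so sizes can't drift to zero
    # (without the barrier you get a spreading lognormal, not a power law).
    np.maximum(sizes, S_MIN, out=sizes)

# Rank-size check: under Zipf the k-th largest city scales like 1/k,
# so these printed values should fall by roughly 10x per line.
ranked = np.sort(sizes)[::-1]
for k in (1, 10, 100, 1000):
    print(k, round(ranked[k - 1], 1))
```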
Neural network weights
This one is probably my favourite because it sits right in the middle of modern AI. If you take a single weight matrix \(W\) from inside a trained deep neural network (in practice, almost any layer of a large modern model), compute its singular values \(\sigma_i\), and plot the histogram, you find a heavy-tailed distribution well fit by \(p(\sigma) \propto \sigma^{-\mu}\). This is the heavy-tailed self-regularization story of Martin and Mahoney, and the part that genuinely surprised me is that the value of \(\mu\) correlates with the test accuracy of the model across architectures – VGG, ResNet, Transformers – without ever looking at the test data. The mechanism feels sandpile-like again: training is a massively multiplicative process where gradients flow through chains of matrix multiplications and weight updates compound across millions of steps, with no preferred scale for “how much a weight should matter.”
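Here is what the measurement loop looks like in practice – a sketch where the heavy-tailed random matrix is only a stand-in for a real trained layer, and the Hill estimator is one standard way to read off a tail exponent (Martin and Mahoney’s weightwatcher package does all of this far more carefully):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for a trained layer: a random matrix with heavy-tailed entries.
# For a real model you would pull W out of the state dict instead.
W = rng.standard_t(df=3, size=(512, 512))

# Singular value spectrum of the layer.
sigma = np.linalg.svd(W, compute_uv=False)

def hill_tail_index(x, k=50):
    """Hill estimator of the tail index from the k largest values."""
    x = np.sort(x)
    threshold = x[-k - 1]          # the (k+1)-th largest value
    return k / np.sum(np.log(x[-k:] / threshold))

# The density exponent mu in p(sigma) ~ sigma^(-mu) is the tail index
# of the survival function plus one.
print(f"mu = {hill_tail_index(sigma) + 1:.2f}")
```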
Wrapping it up
What I find genuinely amazing is how often this same little story plays out across completely unrelated domains. Slow driving, occasional bursts of dissipation, no preferred scale – and out pops \(p(x) \propto x^{-\gamma}\). The sandpile is the cleanest cartoon I know for this, and the Montroll-Shlesinger maximum-entropy argument is the cleanest justification I know for the shape that the cartoon produces.
It almost feels like Nature has a default font, and that font is the power law. Whenever a system loses its sense of “typical” – because it’s being slowly driven, or because everything compounds multiplicatively, or simply because no scale is preferred – the universe shrugs and writes things in this font. Earthquakes, crashes, cities, GPTs, and (the part that started this whole rant for me) the depressed semicircles I keep seeing in battery impedance plots all turn out to be different sentences in the same handwriting.
P.S. – The maximum-entropy story tells you when to expect a power law, not what the exponent will be. Pinning down \(\gamma\) (or \(b\), or \(\alpha\), or \(\mu\), or \(\tau\)) requires the actual physics of the system – the geometry of the lattice for the sandpile, the order book for stocks, the loss landscape for neural nets. The “feel” gives you the shape; the physics gives you the slope.
P.P.S. – The battery story I alluded to at the top is what I’ve been working on with my advisor at MIT for the past while. It turns out that the depressed semicircle in battery impedance is the macroscopic shadow of a broad spectrum of inter-particle relaxation timescales, set by the graph Laplacian of the heterogeneous wiring network inside the electrode. Same multiplicative-compounding-plus-broad-spectrum story as the examples above, just played out in lithium and carbon black instead of dollars or words. Hopefully a separate post on that one soon.