The recent 36Kr article, originally republished from QbitAI, amplified a design idea that AI architects should already be watching closely: maybe the next useful scaling gain does not come only from stacking more unique transformer layers, but from looping a smaller transformer core multiple times with controlled recurrence.
This article is inspired by the 36Kr English piece "22-Year-Old Reverse-Engineers and Open-Sources Mythos Architecture with MoE and Attention Mechanisms Inspired by DeepSeek", originally credited there to QbitAI. It also draws on Prairie et al., Scaling Laws For Stable Looped Language Models, Kohli et al., Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers, and Kye Gomez’s OpenMythos GitHub repository. The analysis below is original.
OpenMythos, Kye Gomez’s open PyTorch implementation, turns that architectural hypothesis into something concrete enough to test. It is explicitly framed as an independent, speculative reconstruction rather than a claim about any proprietary Anthropic system. That disclaimer matters. But the repo is still valuable because it shifts the conversation from rumor toward inspectable engineering.
Loop k transformer layers L times instead of stacking kL unique layers. Same effective depth, a fraction of the parameters. OpenMythos is an open PyTorch implementation of this recurrent-depth transformer hypothesis. Parcae (Prairie et al.) reports a looped 1.3B model reaching up to 87.5% of the quality of a standard transformer roughly twice its size. MoE routing and Multi-Latent Attention run inside the loop. The Claude Mythos connection is speculative, but the repo lets you test it instead of debating it.
That summary captures the excitement, but the deeper technical story is more interesting than the mythology around the name. The real question is whether recurrent-depth transformers are becoming a serious architectural option for compute-adaptive reasoning systems.
The mainstream scaling pattern of the last few years has been familiar: increase parameters, increase data, increase training FLOPs. That approach works, but it also drives up memory footprint, training complexity, and serving cost.
Looped architectures offer a different tradeoff. Instead of increasing depth by adding more unique layers, they increase effective depth by reusing a smaller block across multiple iterations. Parameter count stays closer to the smaller block, while compute scales with the number of loop iterations.
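To make the tradeoff concrete, here is rough back-of-the-envelope arithmetic under hypothetical layer dimensions (ignoring embeddings, norms, and biases); the sizes are illustrative, not taken from OpenMythos:

```python
# Rough parameter arithmetic for loop-vs-stack (hypothetical sizes).
d_model, d_ff = 2048, 8192
per_layer = 4 * d_model**2 + 2 * d_model * d_ff  # attention (Q,K,V,O) + MLP

stacked = 24 * per_layer  # 24 unique layers
looped = 4 * per_layer    # 4 shared layers looped 6 times

print(f"stacked 24L: {stacked / 1e9:.2f}B params")  # ~1.21B
print(f"looped 4Lx6: {looped / 1e9:.2f}B params")   # ~0.20B
# Both spend roughly 24 layers of compute per token;
# the looped model stores about a sixth of the weights.
```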
For AI architects, this matters because it starts to decouple two things that conventional transformers usually bind together too tightly: memory footprint and reasoning depth.
OpenMythos is not just “the same layer repeated.” Based on the public implementation notes, it combines several ideas into one stack: weight sharing across depth, conditional computation through MoE, and latent iterative refinement, all inside the same model skeleton. That combination matters because recurrence alone is not the whole hypothesis.
A standard transformer behaves like a long pipeline. Tokens enter, pass through many different parameterized layers, and then produce logits.
A recurrent-depth transformer changes that into a controlled loop. The model encodes the input, runs a shared block over the hidden state, feeds the updated hidden state back into that same block, repeats the process several times, and only then produces output. The idea is to refine the same latent representation over multiple passes instead of allocating a different parameterized layer for each stage of computation.
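As a minimal sketch of that control flow (module and argument names here are hypothetical, not the actual OpenMythos API, and causal masking is omitted for brevity), the whole idea fits in a few lines of PyTorch:

```python
import torch
import torch.nn as nn

class RecurrentDepthLM(nn.Module):
    """Minimal recurrent-depth sketch: one shared block, looped."""

    def __init__(self, vocab_size: int, d_model: int, n_heads: int, n_loops: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # One parameterized block reused across the entire depth.
        self.block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True, norm_first=True
        )
        self.n_loops = n_loops
        self.norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids: torch.Tensor, n_loops: int | None = None):
        h = self.embed(token_ids)
        loops = self.n_loops if n_loops is None else n_loops
        for _ in range(loops):
            h = self.block(h)  # same weights every pass: effective depth = loops
        return self.lm_head(self.norm(h))
```

The detail worth noticing: depth becomes a runtime argument rather than a property of the weights.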
That is why people describe this as “more thinking without more parameters.” It is not magic, but it is a meaningful design shift.
Prairie et al., in Scaling Laws For Stable Looped Language Models, address the historical weakness of looped models: instability. If hidden states are repeatedly fed through the same block without strong control, residual dynamics can explode and training can diverge.
The important contribution of Parcae is not just that it loops. It makes looping more stable by constraining the recurrence dynamics, treating the update process as a dynamical system and controlling the spectral behavior of the injection parameters.
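One way to picture that constraint, as a loose dynamical-systems illustration rather than the actual Parcae recipe (all names here are hypothetical), is to cap the spectral norm of the maps that update and re-inject state, so the linear part of each iteration is non-expansive:

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

class StabilizedLoopCell(nn.Module):
    """Illustrative loop cell whose update maps have bounded spectral norm."""

    def __init__(self, d_model: int):
        super().__init__()
        # spectral_norm caps the largest singular value near 1, so repeated
        # application of the same maps cannot exponentially amplify the state.
        self.state_proj = spectral_norm(nn.Linear(d_model, d_model))
        self.inject_proj = spectral_norm(nn.Linear(d_model, d_model))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # h: hidden state carried across loop iterations; x: re-injected input.
        update = torch.tanh(self.state_proj(h) + self.inject_proj(x))
        return self.norm(h + update)
```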
That matters because without stability, recurrent-depth transformers stay academic curiosities. With stability, they become something architects can seriously benchmark and potentially deploy.
The empirical headline is worth stating carefully. Parcae reports that a looped 1.3B model achieves up to 87.5 percent of the quality of a conventional transformer about twice its size in the studied setting. That is not identical quality at half the parameters, but it is still a striking efficiency result.
Kohli et al., in Loop, Think, & Generalize, push on a different question. Does recurrence change reasoning behavior, not just parameter efficiency?
The answer appears to be yes. The paper studies systematic generalization and depth extrapolation, two areas where vanilla transformers often struggle. Recurrent-depth transformers do better, especially when given more inference-time recurrence.
That suggests the loop is not only a compression trick. It may also be a better computational substrate for composing known facts and rules into new multi-step reasoning chains.
This point is easy to blur, but it matters. Chain-of-thought externalizes reasoning into tokens. The model writes intermediate steps into the context window, which increases token count and makes the reasoning trace visible.
A recurrent-depth transformer does the extra work inside the hidden state. It can iterate internally and emit an answer only after several latent refinement passes. That means deeper internal computation without requiring the model to serialize every step into text.
For system designers, this creates a potentially attractive path to test-time compute scaling with less token overhead.
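Reusing the RecurrentDepthLM sketch from earlier (hypothetical names again), the contrast with chain-of-thought is easy to state: extra thinking costs loop iterations, not context-window tokens.

```python
model = RecurrentDepthLM(vocab_size=32_000, d_model=512, n_heads=8, n_loops=4)
tokens = torch.randint(0, 32_000, (1, 16))

fast = model(tokens, n_loops=2)   # shallow pass for easy queries
deep = model(tokens, n_loops=12)  # more latent refinement, zero extra tokens
```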
OpenMythos becomes more interesting because the recurrent block is not purely dense and uniform.
MoE inside the loop: Sparse routing means each loop iteration does not have to activate the same internal path. Even if the outer block is weight-shared, the active expert set can differ across iterations, giving the system a plausible mix of iterative depth and conditional specialization (a minimal routing sketch follows these two points).
MLA inside the loop: If a model runs attention multiple times inside the same forward pass, efficient attention becomes even more important. Any attention mechanism that reduces KV-cache pressure or improves compute efficiency compounds across loop iterations (a toy latent-attention sketch follows as well).
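A minimal top-1 routing layer shows the conditional-computation half of that bundle; this is a generic MoE sketch under assumed shapes, not the OpenMythos router:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTopOneMoE(nn.Module):
    """Generic top-1 mixture-of-experts feed-forward layer (illustrative)."""

    def __init__(self, d_model: int, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        gate, idx = F.softmax(self.router(h), dim=-1).max(dim=-1)  # (B, T)
        out = torch.zeros_like(h)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                # The loop body is weight-shared, but the expert set that
                # fires can differ from one iteration to the next.
                out[mask] = gate[mask].unsqueeze(-1) * expert(h[mask])
        return out
```

And a toy single-head version of the latent-KV idea behind MLA, reusing the imports above: cache a small per-token latent and reconstruct keys and values from it, so cache size scales with d_latent instead of d_model. This is a deliberate simplification, not DeepSeek's exact formulation:

```python
class ToyLatentAttention(nn.Module):
    """Toy single-head latent-KV attention (simplified MLA flavor)."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)  # the small entry a KV cache would store
        self.k_up = nn.Linear(d_latent, d_model)
        self.v_up = nn.Linear(d_latent, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        q = self.q_proj(h)
        c = self.kv_down(h)                # (B, T, d_latent): compact cache entry
        k, v = self.k_up(c), self.v_up(c)  # reconstructed at attention time
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(attn)
```

Run inside a weight-shared loop, the two pieces compound: each iteration can route to different experts while paying a much smaller attention-cache bill.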
That is why this architecture should be viewed as a systems bundle, not just a recurrence trick.
The hype line is usually something like “same depth, half the parameters.” That is catchy, but it compresses too much.
A better technical statement is this: recurrent-depth transformers can approximate greater effective depth with fewer unique parameters, but quality depends on training recipe, recurrence stability, routing design, data budget, and inference-time loop strategy. The evidence so far supports strong efficiency gains, not a universal theorem that looping dominates standard transformers in every regime.
The Mythos link is interesting, but still speculative. That is fine. The real value of OpenMythos is not that it proves a rumor. The value is that it packages a serious architecture class into an inspectable PyTorch implementation that engineers can test directly.
That changes the conversation from mythology to engineering, which is where it belongs.
The most important thing about OpenMythos is not the branding. It is that it makes a meaningful architectural question concrete.
Can we get more useful reasoning depth by looping a smaller transformer core, stabilizing the recurrence, and combining that with sparse routing and efficient attention?
The answer is not fully settled. But between Parcae’s stability and scaling results, the implicit reasoning findings from Kohli et al., and an open implementation that can actually be profiled, recurrent-depth transformers have moved out of the fringe category.
For AI architects, that alone is enough reason to pay attention.