Conway's Game of Life (Life), a famous cellular automaton (CA) algorithm, exhibits complex dynamics that are highly sensitive to initial conditions. Modeling Life without knowing the system's topology presents a challenge, motivating the development of algorithms capable of generalizing across grid configurations and boundary conditions. We introduce LifeGPT, a decoder-only generative pretrained transformer (GPT) model with rotary positional embedding (RoPE) and forgetful causal masking (FCM), capable of computing a single-timestep global state transition in Life on a toroidal grid without prior knowledge of grid size or boundary conditions. LifeGPT is topology-agnostic, achieving near-perfect accuracy in capturing the deterministic rules of Life given sufficiently diverse training data. We show that recursive simulation of Life is possible by running LifeGPT within an autoregressive loop. LifeGPT can also be trained on grids of varying sizes while retaining near-perfect accuracy. Finally, we propose future research directions, such as using LifeGPT-like models to infer CA rulesets from real-world data, which could advance predictive modeling.
Cellular automata (CA) have long been a subject of profound interest within the fields of computer science and mathematics, owing to their intricate and emergent behaviors. CA algorithms are uniquely characterized by their combination of computational simplicity -- evolving solely by local state-transition rules -- and broad dynamical behavior, encompassing static, periodic, chaotic, and complex patterns, depending on the ruleset and initial condition (IC) being used. These properties render CA algorithms particularly valuable for simulating a wide array of natural phenomena, such as the propagation of forest fires, traffic flow dynamics, chemical reactions, and recrystallization. However, the inherent behavioral unpredictability of CA (which persists even after human inspection of their rulesets) has hindered advancements in these subfields, confining CA to the realm of phenomenological modeling and preventing their evolution into mature, predictive tools for systems or phenomena for which no closed-form ruleset is yet known.
One notable CA algorithm is Conway's Game of Life (Life), introduced by John H. Conway in 1970. Life (also referred to as simply "the Game of Life") has since intrigued researchers from a range of disciplines, including mathematics, computer science, and materials science. This is partly due to Life's self-organizing yet often unpredictable dynamics, which emerge from a simple 2-state, 8-neighbor transition scheme (Fig. 1). Despite Life's fame, complex dynamics are present in numerous CA, and have led many researchers to support the conjecture that most CA are "computationally irreducible." This designation suggests that their evolution cannot be perfectly predicted by any process more computationally efficient than running the CA algorithms themselves. Practically, this conjecture implies that the evolution of most CA algorithms from an arbitrary IC to an arbitrary time-step in the future cannot be described analytically. As Wolfram (2024) writes, "... we can't expect to systematically 'jump ahead' and predict [...] the system..." Therefore, any model that tries to simulate or predict the dynamics of a computationally irreducible CA is either implementing an algorithm capable of approximating CA behavior or implementing the exact CA algorithm as a recursive operation for the desired number of timesteps. The former solution requires that the model diverge from the CA algorithm's ground truth after some number of timesteps and/or for some specific initial conditions, while the latter necessitates that the model be fundamentally limited by its size to a maximum number of timesteps. Another question is whether, and how, we can begin to build artificially intelligent systems that can program new versions of CA to meet certain target behaviors, and whether such a system could develop a heightened situational "awareness," predicting the evolution of an algorithm multiple timesteps into the future without explicitly computing it recursively. While this may seem impossible given the computational irreducibility conjecture, past research has suggested that Life exhibits self-organized criticality (SOC) and persistent scaling behavior. Additionally, some CA have been shown to generate fractal-like patterns. These phenomena suggest that Life (and potentially many other CA algorithms) is at least statistically predictable -- while perfectly accurate predictions of any CA system might be unachievable without running the exact CA algorithm, being able to make global or "coarse-grained" claims about CA systems is far more reasonable. In addition, recent work demonstrates that exposing large language models (LLMs) to complex elementary CA systems (class 4), as opposed to static (class 1), periodic (class 2), or chaotic (class 3) CA systems, leads to improved downstream performance on cognitive tasks, suggesting that a fundamental understanding of complexity is intrinsic to the common notion of "intelligence." The nexus question here is: Are there "deep trends" (which may also be conceptualized as "pockets of computational reducibility") that exist across, or transcend, scales in CA that might be elucidated by deep learning models? These and related questions led us to explore the use of AI algorithms in modeling CA.
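For concreteness, the single-timestep transition rule in question (the one a model must learn from examples alone) can be written in a few lines. The following is a minimal NumPy sketch, with np.roll providing the toroidal wrap-around:

```python
import numpy as np

def life_step(grid: np.ndarray) -> np.ndarray:
    """One global Life transition on a toroidal (wrap-around) grid of 0s/1s."""
    # Count the 8 Moore neighbors; np.roll wraps indices, giving the torus.
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # Birth on exactly 3 live neighbors; survival on 2 or 3.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(grid.dtype)
```

A function of this form also serves as the ground-truth oracle when generating training pairs of ICs and their successor states.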
With increases in computing power and the growing popularity of neural networks as tools for analyzing complex systems (for which there are few practically useful analytical models), the computational irreducibility problem has led many down the road of using neural networks to try to "predict" the evolution of CA systems (often, Life is chosen as the example system), with varying degrees of success. The general assumption across this field of research is that neural networks can either learn the rules of a given CA algorithm exactly, or can learn to abstract the system's behavior well enough to produce a more-or-less accurate prediction of some far-future state based on an IC, though some work ventures beyond this paradigm.
Previous work focuses largely on either feed-forward convolutional neural networks (CNNs) or convolutional encoder-decoder models. Such approaches already encode knowledge of the significance of the spatial relationships among cells in the CA training data, namely that CA utilize state-transition rules based on the states of the cells neighboring a given cell (the Moore neighborhood). This introduces a form of inductive bias: because a CNN's architecture presumes that the data live on a 2D grid with local interactions, part of the solution is embedded in the model before its weights are even initialized, putting CNNs more than halfway to the solution from the outset. To date, little research has investigated the possibility of utilizing topology-agnostic models for predicting the evolution of CA.
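To make this inductive bias concrete: a fixed 3 × 3 convolution kernel with wrap-around boundary conditions computes exactly the Moore-neighborhood sum that Life's ruleset consumes, so the grid topology is hard-wired into the architecture before any learning occurs. A minimal illustration:

```python
import numpy as np
from scipy.signal import convolve2d

# The 3x3 kernel below *is* the r = 1 Moore neighborhood: before any
# training, the convolution already assumes a 2D grid with local coupling.
kernel = np.array([[1, 1, 1],
                   [1, 0, 1],
                   [1, 1, 1]])

grid = np.random.randint(0, 2, size=(32, 32))
# boundary="wrap" reproduces the toroidal topology in a single argument.
neighbor_counts = convolve2d(grid, kernel, mode="same", boundary="wrap")
```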
Other recent work has taken a more general approach to such problems. A noteworthy conceptual framework for extracting generative rules from complex dynamical systems, called algorithmic information calculus (AIC), was introduced by Zenil et al. and is based on algorithmic information theory (AIT). In this work, AIC was used to reconstruct phase spaces and identify causal mechanisms of discrete systems (including CA). Their approach applied perturbation analysis to quantify the algorithmic complexity of system components, enabling reconstruction of a system's generative rules without requiring explicit kinetic equations. This method provided insights into the causal structure of systems and their reprogrammability toward desired states.
Building on this foundation, Hernández-Orozco et al. introduced an algorithmic loss function to quantify the discrepancy between predicted and observed system behavior. By combining AIT with machine learning (ML), they developed a framework for learning generative rules in non-differentiable spaces, bridging discrete algorithmic theory and continuous optimization. This domain-agnostic approach enabled robust predictions and causal rule discovery in CA and similar systems.
The ability to parse complex patterns and predict causal relationships from high-dimensional data is, however, not limited to AIT approaches. Within recent years, generative pretrained transformer models (GPTs) have gained widespread popularity within the areas of natural language processing, weather prediction, speech processing/modeling, machine vision, strategic gaming, physical field modeling, protein folding, and finance, to name a few. This capability of GPTs to generate accurate predictions, solve inverse problems, make useful decisions, and/or create new and relevant content, depending on the task at hand, has been attributed to the underlying attention mechanism's remarkable ability to parse meaning from data by repeatedly updating token embeddings in a high-dimensional space. Consequently, attention allows GPTs to build sophisticated ontological understandings, which must be learned using a large amount of training data, or fine-tuned (in the case of pretrained models) on a smaller set of data. Furthermore, these understandings can be visualized in numerous ways using graphs, providing a unique level of functional transparency, in contrast to previous machine learning strategies, which are more opaque, leading to the common "black-box" comparison.
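Concretely, the mechanism in question is scaled dot-product attention: for query, key, and value matrices $Q$, $K$, and $V$ with key dimension $d_k$, each attention head computes

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,$$

so each token's embedding is updated as a data-dependent weighted average over the value vectors of the (causally preceding, in a decoder-only model) tokens.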
In this paper, we present an analysis of the transformer architecture's capability to predict the outcomes of Life, played on a 32 × 32 toroidal grid, with remarkable accuracy. We accomplish this using a decoder-only transformer model equipped with causally masked multi-headed self-attention and rotary positional embedding (RoPE), with forgetful causal masking (FCM) applied during training, trained on data representing pairs of 2D grids encoding ICs and next-game-states (NGSs); we call our model LifeGPT.
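To illustrate how a 2D game state can be presented to such a topology-agnostic sequence model, the sketch below flattens IC/NGS grids into 1D token strings. The vocabulary and delimiter choices here are illustrative assumptions, not necessarily LifeGPT's exact tokenization:

```python
import numpy as np

# Hypothetical vocabulary: '0'/'1' cell tokens plus '@' (start), ';'
# (IC/NGS separator), and '$' (end). The model sees only this 1D stream;
# 2D adjacency and the toroidal wrap-around must be inferred from data.
def encode_pair(ic: np.ndarray, ngs: np.ndarray) -> str:
    flat = lambda g: "".join(str(int(c)) for c in g.ravel())
    return "@" + flat(ic) + ";" + flat(ngs) + "$"

def decode_grid(cell_tokens: str, size: int = 32) -> np.ndarray:
    bits = np.fromiter((int(c) for c in cell_tokens), dtype=np.uint8)
    return bits.reshape(size, size)
```

Because the sequence carries no explicit coordinates, any regularity the model exploits (e.g., that cells 32 positions apart are vertical neighbors on a 32 × 32 grid) must be learned from the training data itself.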
The Life algorithm is popular within circles studying Artificial Life (ALife). ALife is a highly interdisciplinary field focused on understanding the essential attributes of life and life-like systems through computational, hardware, and biochemical means. "Soft" ALife, which is the most relevant to our work, is a subset of ALife that is entirely simulated computationally, instead of being physically realized. Many CA systems (notably, Life) have been constructed for the purpose of better understanding self-assembly, self-propagating patterns, and, more generally, complexity and chaos, all phenomenological attributes intrinsic to life as it is currently understood.
Some scholars have already begun to conceptualize GPTs -- more specifically, LLMs -- as having a "reciprocal relationship" with ALife research. Nisioti et al. identify two key relationships between the two areas of research. The authors argue that LLMs may become tools for ALife research by allowing for the generation of open-ended environments or by acting as operators of evolutionary computation. Inversely, they also argue, ALife principles such as collective intelligence, evolution, and organization may be exhibited by LLM agents. The authors emphasize the potential for LLMs to control or understand ALife systems. While the scope of our paper is limited to a replication of Life's game-state transition rules using a transformer model, our work lays the groundwork for a paradigm in which GPTs are able to fine-tune CA systems' emergent qualities through alterations of ICs and rulesets, enabling a form of ALife oversight and regulation, which the authors argue is a primary ethical consideration.
In recent months, yet another reciprocal relationship between neural networks and ALife has been elucidated. Recent work by Wolfram shows that discrete systems known as "spacetime-inhomogeneous cellular automata" (also referred to as "rule arrays") can be tuned through a discrete, Boolean-calculus-based backpropagation to create minimal models for perceptron-style neural networks. In addition, it is argued that ordinary cellular automata can similarly provide discrete models of recurrent neural networks. Through these examples, an argument is developed asserting that modern-day machine learning techniques are effective at learning simply because they can "mine" programs from an extremely large space of complex behavior, arising from computationally irreducible systems governed by relatively few, simple rules. Thus, the author argues, it is not likely that AI will intrinsically favor finding "explainable" or "understandable" models for fitting training data; rather, AI will find something that just "happen[s] to work". Furthermore, it is stated that in the context of AI development, this harnessing of computational irreducibility creates a fundamental trade-off between model interpretability/predictability and model capability. Models that are structured more obviously for human understanding will be more computationally reducible, and will therefore be forced to sample from a computational space with reduced complexity during weight-tuning. Moreover, it is suggested that models will intrinsically struggle to fit data generated by computationally irreducible programs, as finding programs that "do more or less the right thing" will not be sufficient. The work suggests that a deeper understanding of cellular automata will go hand-in-hand with the development of any kind of broadly applicable AI-science, as both necessitate the discovery and analysis of "pockets" of computationally reducible behavior inside complex systems.
Earlier work synthesizing the domains of CA and ML has largely focused on convolutional models paired with 2D CA. This is because 2D CA "time-slices" are easily represented as images, which feed-forward CNNs are especially well equipped to process. Springer and Kenyon (2021) conducted a thorough empirical analysis of the ability of feed-forward CNNs to capture the allegedly computationally irreducible dynamics of Life. Since Life abides by rules pertaining to the states of 2D nearest neighbors (the r = 1 Moore neighborhood), which are effectively equivalent to convolutional operations, the authors were able to construct a hypothetical minimal feed-forward CNN (using ReLU activation) capable of capturing the rules of Life and applying them over n time-steps. They were thereby able to define a metric for network over-completeness, m, and empirically test the relationship between CNN size and Life-learning effectiveness for various n and m. One finding was that for n ≥ 3, increasing m up to 25 was insufficient for reaching a training convergence fraction of over 50%. Furthermore, the authors found that for networks which did converge, the number of epochs necessary increased substantially with n. The authors also note that converged, minimal networks were highly sensitive to sign-flipping of initial weights, suggesting that for smaller networks, more luck in initializing suitable weights is needed to converge. The authors further reported that gradient descent was highly sensitive to the distribution parameters of the dataset, suggesting that some game examples were more useful than others for teaching the rules of Life to the model.
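To illustrate why such a minimal network exists in principle, consider the following sketch (our own construction for a single time-step, not Springer and Kenyon's exact network): one 3 × 3 convolution computes s = self + 2 × (live-neighbor count), and a cell is alive in the next state exactly when 5 ≤ s ≤ 7, an interval indicator that four shifted ReLUs express exactly on integer inputs.

```python
import torch
import torch.nn.functional as F

def life_step_relu_cnn(grid: torch.Tensor) -> torch.Tensor:
    """One exact Life step as a tiny convolution-plus-ReLU network.

    grid: (H, W) float tensor of 0s and 1s; returns the next state.
    """
    x = grid[None, None]                          # -> (1, 1, H, W)
    x = F.pad(x, (1, 1, 1, 1), mode="circular")   # toroidal boundary
    k = torch.full((1, 1, 3, 3), 2.0)             # each neighbor weighted 2
    k[0, 0, 1, 1] = 1.0                           # the cell itself weighted 1
    s = F.conv2d(x, k)                            # s = self + 2 * neighbors
    # Indicator of 5 <= s <= 7 from four shifted ReLUs (exact on integers):
    out = F.relu(s - 4) - F.relu(s - 5) - F.relu(s - 7) + F.relu(s - 8)
    return out[0, 0]
```

The existence of such an exact, tiny network makes the reported difficulty of actually learning it via gradient descent all the more striking.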
Wulff and Hertz (1992) demonstrated the use of simple neural networks with short-range connections (Σ-Π networks) to learn the dynamics of 2D CA. They concluded that even with a network sharing the same topology as the CA being studied, the dynamics of Life could not be learned effectively without weight-sharing between neurons. Their early approach was similar to many modern-day graph neural network architectures. Still, their approach encoded part of the solution to the problem in the architecture of the model, and in doing so introduced an inductive bias that likely aided the learning process. By enabling weight-sharing and employing short-range connections, the architecture of the network was itself a reflection of the grid topology and nearest-neighbor rules underlying Life.
Recent work has shown that CA systems can be trained to dynamically sustain desired patterns (such as small 2D images) by allowing a single CNN to simultaneously control the state-transition rules of all cells in the system, which is made possible in part by allowing cells to take on continuous (rather than discrete) states. By training the neural network (repeatedly running the CA growth process, computing a loss against the desired image, and backpropagating), the model's parameters are tuned so that a single cell can, over time, multiply into a cellular collective that "grows" into the desired image. This strategy is known as neural cellular automata (NCA).
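A minimal sketch of one NCA update step in this spirit is shown below; the 16-channel continuous state, Sobel-based perception, and layer widths are illustrative assumptions drawn from common NCA implementations rather than the exact architecture of the cited work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NCAStep(nn.Module):
    """One update of a neural CA: a single shared network defines the
    state-transition rule applied to every (continuous-state) cell at once."""

    def __init__(self, channels: int = 16):
        super().__init__()
        self.channels = channels
        # Fixed depthwise "perception": identity + two Sobel filters per channel.
        ident = torch.zeros(3, 3)
        ident[1, 1] = 1.0
        sobel = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]) / 8
        filters = torch.stack([ident, sobel, sobel.T])               # (3, 3, 3)
        self.register_buffer(
            "perceive", filters.repeat(channels, 1, 1)[:, None])     # (3C, 1, 3, 3)
        # Learned update rule, shared by all cells (1x1 convolutions).
        self.update = nn.Sequential(
            nn.Conv2d(3 * channels, 128, 1), nn.ReLU(),
            nn.Conv2d(128, channels, 1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (B, C, H, W); circular padding keeps the torus topology.
        y = F.conv2d(F.pad(state, (1,) * 4, mode="circular"),
                     self.perceive, groups=self.channels)
        return state + self.update(y)  # residual update of every cell
```

Because the same small network is applied at every cell, training it end-to-end amounts to searching over the space of (continuous) CA rulesets for one whose dynamics grow and sustain the target pattern.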
The outline of the paper is as follows. First, we introduce the LifeGPT architecture, evaluate its training and inference abilities, and draw conclusions about its overall learning capabilities. We then introduce the autoregressive loop, a framework that enables LifeGPT to recursively simulate Life's dynamics over multiple time steps, and evaluate LifeGPT's performance in this context. Additionally, we assess LifeGPT's ability to generalize across grids of varying sizes by employing a different training approach. Finally, we discuss future directions, including the integration of reinforcement learning (RL) techniques and internal representations such as world models, to improve inference accuracy, broaden applicability to diverse CA rulesets, and enhance LifeGPT's role as a tool for studying and simulating complex dynamical systems.
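Schematically, the autoregressive loop amounts to the following sketch, in which model.generate, encode_grid, and decode_grid are hypothetical placeholders for the trained model's sampling interface and its (de)tokenization routines:

```python
# `model.generate`, `encode_grid`, and `decode_grid` are hypothetical
# placeholders (see the tokenization sketch above), not LifeGPT's exact API.
def autoregressive_rollout(model, ic_grid, n_steps: int):
    states = [ic_grid]
    for _ in range(n_steps):
        prompt = encode_grid(states[-1])          # IC -> token sequence
        pred_tokens = model.generate(prompt)      # one global state transition
        states.append(decode_grid(pred_tokens))   # token sequence -> grid
    return states  # any per-step errors compound across iterations
```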