Research Preview

Agent Infrastructure Is an RL Environment

Published: June 2026
Email: research@dilab.ai

Discussions around agent infrastructure have produced many different directions in the industry, such as agent social networks, agent payments, agent identity, and agent memory. Most of these directions are framed from the perspective of specific application scenarios. However, there is relatively little research on what this layer is actually for.

To investigate this question, we ran a series of ablation experiments grounded in game theory. We decomposed identity, memory, and payment into independently toggleable components, which we collectively call the agent institutional layer. The results reveal several clear patterns:

  • The institutional layer can substantially change how an agent system behaves without modifying model weights. In the Prisoner's Dilemma, adding the full institutional layer raised the final-round cooperation rate1 from 2.8% to 50.6%.
  • Identity and the enforcement mechanism are the two most critical components, and they are interdependent.
    • Identity alone caused agents to coordinate more effectively on defection, lowering cooperation by 2.2 percentage points.
    • Enforcement alone led agents to punish based on incomplete information, producing a 15.4% rate of mistaken punishments; adding identity reduced this to 0.7%.
  • The task environment itself strongly modulates the value of the institutional layer. When agents' interests are largely aligned, natural language coordination already produces high cooperation, and the institutional layer adds little. Once interests diverge, however, the institutional layer becomes essential.

These findings suggest that the agent institutional layer is the core of agent infrastructure. Identity functions as a record of each agent's past behavior, while the enforcement mechanism allows agents to experience the consequences of their actions. Together they form a closed loop that pairs an attributed record of behavior with grounded feedback, which is what a reinforcement learning environment is made of.

Introduction

It is widely expected that capable agents will eventually handle real tasks in daily life, such as negotiating deals, managing relationships, and carrying out transactions on our behalf. Yet when we examine whether today's agents can operate reliably in complex, high-stakes settings, they still fall noticeably short. Consider a straightforward example.

Suppose two companies replace their sales and procurement teams entirely with agents. Company A's sales agent is instructed to hold the price at $100, while Company B's procurement agent is told to get the price below $80. Two rational but narrowly scoped agents would likely reach a quick stalemate. Experienced human negotiators, by contrast, might find a creative middle ground that leaves both sides better off, such as a two-year contract priced at $80 in the first year and $100 in the second.

Humans are able to reach such arrangements largely because they rely on signals that exist outside language. Years of prior cooperation create trust that makes both parties willing to commit to a multi-year deal. Subtle shifts in tone, expression, and pacing during the conversation allow real-time adjustments. These signals are essential to successful negotiation, yet they are absent from the context given to agents and cannot be fully captured in a prompt. Trust and rapport emerge only through repeated real-world interaction; they cannot be pre-installed.

Many organizations therefore conclude that important work should remain with humans for now, while we wait for models to become more capable. We believe, however, that these limitations are difficult to overcome through advances in model capability alone.

Current agents are LLM agents. Their inputs and outputs are natural language, which is inherently vague and ambiguous. The factors that often determine the outcome of a negotiation lie outside language, including long-term trust, reputation, and real-time social cues. These signals are hard to acquire simply by improving an agent's natural language understanding or reasoning ability.

Human societies face a similar constraint: we communicate primarily through language, yet we have developed an entire layer of mechanisms beyond language to enable large-scale coordination. Credit scores, legal sanctions, wages, professional reputation, and social norms all serve this purpose. These mechanisms allow humans to achieve levels of cooperation and accountability that language by itself could never sustain.

Research on human institutions has shown that long-term cooperation requires two core elements: monitoring of behavior and sanctions for violations.2 When mapped to agents, these correspond to identity, which is a record of past behavior, and an enforcement mechanism, which is a way to impose consequences. This is the gap we set out to explore: what changes when natural-language agents are given identity and an enforcement mechanism?

Experimental Design

We structured the experiment in three parts. First, we defined the core primitives for identity and the enforcement mechanism. We then built a 2×2 ablation framework around these primitives. Finally, we selected appropriate game environments. Below we describe how the experiment was designed.

Defining Primitives for Identity and Enforcement

When reviewing existing approaches to agent identity, such as W3C decentralized identifiers and runtime credentials like OAuth and SPIFFE, we noticed that most define identity primarily through a unique ID. A bare identifier is useful for identification, but we found that on its own it is too thin to support meaningful behavioral understanding or accountability.

We therefore took a different approach. In human contexts, identity is not defined by a number or label, but by a behavioral record. A passport number does not capture who someone is; their history of actions does. People also maintain multiple contextual identities, because different groups have observed different aspects of their behavior over time. On this basis, we defined agent identity as a personalized behavioral record.

To implement this idea, we designed identity around three primitives that together form such a record: a commitment record of what the agent promised, an action attribution of what it actually did and whom it is attributed to, and a decision basis capturing the reasoning behind its action.
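The three primitives can be pictured as fields on a single per-round entry. The sketch below is a minimal illustration, not the schema used in the experiments: the class and field names (`RoundEntry`, `IdentityRecord`, `kept`, `fulfillment_rate`) are hypothetical, and fulfillment is reduced to an exact match between the promised and the observed action.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RoundEntry:
    """One round of an agent's behavioral record (hypothetical schema)."""
    agent_id: str             # action attribution: whom the action belongs to
    round_no: int
    promised: Optional[str]   # commitment record: what the agent said it would do
    action: str               # action attribution: what it actually did
    rationale: str = ""       # decision basis: the reasoning it gave

    @property
    def kept(self) -> Optional[bool]:
        """True or False if a promise was made, None if there was none."""
        if self.promised is None:
            return None
        return self.promised == self.action


@dataclass
class IdentityRecord:
    """An agent's identity, modeled as the list of its past entries."""
    agent_id: str
    entries: List[RoundEntry] = field(default_factory=list)

    def fulfillment_rate(self) -> Optional[float]:
        """Share of promises that were kept, or None if no promises exist."""
        outcomes = [e.kept for e in self.entries if e.kept is not None]
        if not outcomes:
            return None
        return sum(outcomes) / len(outcomes)


record = IdentityRecord("B")
record.entries.append(RoundEntry("B", 11, "carpool", "carpool", "keep the streak"))
record.entries.append(RoundEntry("B", 12, "carpool", "drive", "no future rounds"))
print(record.fulfillment_rate())  # 0.5
```

Keeping the promise, the action, and the rationale in the same entry is what lets a reader of the record check a commitment against its outcome without consulting the original conversation.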

For the enforcement mechanism, we initially considered a simple monetary transaction channel, since money is the most familiar incentive system in human society. However, we quickly realized that real-world incentives take many non-monetary forms as well, such as gains or losses in credit score or professional reputation, and these incentives are typically tied to specific business objectives.

This led us to view the enforcement mechanism more broadly. Rather than focusing only on settlement, we designed it as a general channel that can attach consequences to any observable objective. For example, higher DAU can generate monetary returns, and consistent commitment fulfillment can improve a credit record. This approach allows the mechanism to support money, credit systems, reputation, honor, and other emerging forms of incentive.

In this study, we instantiated the enforcement mechanism as a sanction channel based on cumulative score. An agent could penalize another for perceived violations. The punisher incurred a small cost of approximately 5% of its own gain, and the punished agent lost a larger amount of approximately 15%.3
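As a rough sketch of how such a score-based sanction could be applied, the function below deducts the two approximate rates from a score table. The source does not spell out what amount the percentages are taken from, so `base` is a hypothetical parameter, and applying both rates to the same base is an assumption of this example.

```python
PUNISHER_COST_RATE = 0.05  # "approximately 5%" in the text
TARGET_LOSS_RATE = 0.15    # "approximately 15%" in the text


def apply_sanction(scores, punisher, target, base):
    """Return a new score table after `punisher` sanctions `target`.

    `base` is the reference amount both rates are applied to; using one
    shared number for both sides is an assumption of this sketch.
    """
    updated = dict(scores)
    updated[punisher] -= PUNISHER_COST_RATE * base
    updated[target] -= TARGET_LOSS_RATE * base
    return updated


scores = {"A": 10.0, "B": 10.0}
after = apply_sanction(scores, punisher="B", target="A", base=5.5)
print({name: round(value, 3) for name, value in after.items()})
```

With these rates the punished agent always loses three times what the punisher pays. A base of 5.5 yields deductions of 0.275 and 0.825, which match the per-punishment amounts quoted later in the text, although the source does not say how those amounts were derived.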

[Figure: Identity is a personalized behavioral record made up of a commitment record (what it promised), action attribution (what it did, attributed to whom), and a decision basis (the reasoning behind it). Enforcement is a record-and-settlement channel; its sanction channel makes violations carry a quantifiable cost and supports money, credit, reputation, and more.]
Figure 1. The underlying data structures of identity and enforcement. Identity is built from three primitives: a commitment record, action attribution, and a decision basis. Enforcement is a settlement channel that ties consequences to behavior, instantiated here as sanctions.

Running the Ablation with a 2×2 Framework

We identified identity and the enforcement mechanism as the two most fundamental variables. By making each component independently toggleable, we created a clean 2×2 ablation framework. The four resulting combinations correspond to the following experimental conditions.

[Figure: 2×2 grid of conditions.
No identity, no enforcement: A, state of nature (both off, the system default).
No identity, with enforcement: E, enforcement only (can sanction, cannot tell who).
With identity, no enforcement: I, identity only (can see behavior; no penalty).
With identity, with enforcement: IE, full institution (sees behavior, sanctions on it).]
Figure 2. The 2×2 ablation framework. Identity and enforcement can each be toggled independently. The four combinations define the four experimental conditions, with the full institution IE enabling both.
  • Condition A, the state of nature, turns both components off. Agents communicate only through natural language, keep no queryable record, and face no consequence for breaking an agreement. This is the default state of most multi-agent systems today.
  • Condition I adds identity only. Each agent gains a ledger of its own and others' past behavior, but breaking an agreement still carries no consequence.
  • Condition E adds enforcement only. An agent can sanction those it judges to have betrayed the group, so violations now carry a real cost, but the agent sees only an aggregate announcement, for example that one agent cooperated and three defected, and cannot observe each individual's behavior.
  • Condition IE is the full institution, with both components on. An agent can see everyone's behavior clearly and sanction violators on that basis.

By comparing these four conditions, we can isolate the individual effects of identity and enforcement, as well as the effect of combining both.
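One way to picture the four conditions is as two boolean switches that change what an agent observes after the action phase and whether it may sanction. This is a sketch under assumptions, not the experiment's code: the names `Condition` and `observe` are hypothetical, and the text states only for condition E that agents see an aggregate announcement, so showing the same aggregate under condition A is an assumption here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Condition:
    identity: bool      # per-agent behavioral record is visible
    enforcement: bool   # accountability phase (sanctions) is enabled


CONDITIONS = {
    "A": Condition(identity=False, enforcement=False),
    "I": Condition(identity=True, enforcement=False),
    "E": Condition(identity=False, enforcement=True),
    "IE": Condition(identity=True, enforcement=True),
}


def observe(actions: Dict[str, str], condition: Condition) -> Dict[str, Union[str, int]]:
    """What agents learn about a round under the given condition."""
    if condition.identity:
        return dict(actions)                 # who did what
    return dict(Counter(actions.values()))   # only how many did what


round_actions = {"A": "cooperate", "B": "defect", "C": "defect", "D": "defect"}
for name, condition in CONDITIONS.items():
    print(name, observe(round_actions, condition), "can sanction:", condition.enforcement)
```

Because each switch is independent, any difference between two conditions that share one setting can be attributed to the other switch.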

Game Environments

We selected four classic game-theoretic settings from the Concordia framework, which DeepMind has open-sourced: the Prisoner's Dilemma, Chicken, Stag Hunt, and Public Goods. All four treat cooperation as collectively optimal and defection as individually tempting, differing primarily in the strength of the incentive to defect. We used cooperation rate as the main observable.

Prisoner's Dilemma · Community Carpool

The Prisoner's Dilemma represents the strongest conflict among the four games, where individual rationality directly opposes collective interest. We frame it as community carpooling: four residents must each decide whether to carpool or drive alone. While carpooling maximizes the group's overall benefit, driving alone always yields a higher personal payoff regardless of what others do. As a result, defection is the dominant strategy, even though universal cooperation would be best for everyone.

Payoffs (mine, others'):

              Others carpool    Others drive
I carpool     2, 2              −1, 3
I drive       3, −1             0, 0

Figure 3. Payoff structures of the four games; the Prisoner's Dilemma is shown here. In the original figure, a highlighted cell marks the outcome a rational agent is most likely to reach. All four treat cooperation as collectively optimal and defection as the deviation, differing only in the strength of the temptation to defect. The numbers are illustrative; their ordering defines each game, and the exact values used appear in the open-source configuration.
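The dominance claim can be checked directly against the illustrative payoffs. The sketch below treats "the others" as a single opposing side, which simplifies the four-agent game, and uses the illustrative numbers rather than the values in the open-source configuration; the helper name `strictly_dominates` is hypothetical.

```python
# Illustrative payoffs (mine, others'), indexed by (my move, others' move).
PAYOFFS = {
    ("carpool", "carpool"): (2, 2),
    ("carpool", "drive"): (-1, 3),
    ("drive", "carpool"): (3, -1),
    ("drive", "drive"): (0, 0),
}
MOVES = ("carpool", "drive")


def strictly_dominates(move, other_move):
    """True if `move` pays me more than `other_move` whatever the others do."""
    return all(
        PAYOFFS[(move, theirs)][0] > PAYOFFS[(other_move, theirs)][0]
        for theirs in MOVES
    )


print(strictly_dominates("drive", "carpool"))  # True
print(strictly_dominates("carpool", "drive"))  # False

# Joint payoff under mutual carpooling versus mutual driving.
mutual_carpool = sum(PAYOFFS[("carpool", "carpool")])
mutual_drive = sum(PAYOFFS[("drive", "drive")])
print(mutual_carpool, mutual_drive)  # 4 0
```

Driving alone wins in both columns, yet mutual carpooling yields the larger joint payoff, which is the tension the game is built around.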

All four games followed the same procedure. Each game involved four agents playing for a total of 12 rounds. Every round consisted of three phases: a communication phase, during which agents could freely negotiate and make commitments in natural language; a simultaneous action phase, in which agents could not observe each other's choices; and, when the enforcement component was enabled, an accountability phase. Agents were instructed to have no preset personality, stance, or moral preferences, and to act solely to maximize their own cumulative score over the 12 rounds. We conducted experiments across both the GPT and Claude model families, using seat rotation to control for positional effects. In total, the experiments comprised 2,056 games and generated 1,009 accountability events.

Our Findings

The first clear pattern we observed was a strong cooperative prior resulting from alignment training. In the Prisoner's Dilemma under the state-of-nature condition A, frontier models maintained cooperation rates above 96% for the first ten rounds. This tendency appears to stem, at least in part, from cooperative behaviors reinforced during post-training.4 However, cooperation collapsed sharply toward the end. Beginning in round 11, some agents started defecting, and by the final round the cooperation rate dropped to just 2.8%, an almost complete breakdown.

[Figure: line chart of cooperation rate (%) by round, rounds 1 to 12, condition A; labeled points are 84.7% in round 11 and 2.8% in round 12.]
Figure 4. Cooperation by round in the Prisoner's Dilemma under the state of nature (A). It holds above 96% for the first ten rounds, slips to 84.7% in round 11, and with no future left in the final round collapses to 2.8%.

We attribute this collapse to the fact that the final round has no future. Once agents recognize that defection carries no further consequences or reputational cost, defection becomes the rational choice. By reasoning backward, cooperation in round 11 also loses its value, and this backward induction can, in principle, unravel cooperation across earlier rounds as well.
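The backward-induction argument can be made concrete with a small solver. This is a textbook sketch on the illustrative two-sided payoffs, not a model of how the LLM agents actually reasoned: it assumes fully rational players, no sanctions, and a stage game with a dominant move, and the function names are hypothetical.

```python
# Illustrative stage payoffs to me, indexed by (my move, opponent's move).
STAGE = {
    ("C", "C"): 2, ("C", "D"): -1,
    ("D", "C"): 3, ("D", "D"): 0,
}
MOVES = ("C", "D")


def best_reply(opponent_move):
    """My stage-payoff-maximizing move against a fixed opponent move."""
    return max(MOVES, key=lambda mine: STAGE[(mine, opponent_move)])


def backward_induction(rounds):
    """Solve the finitely repeated game from the last round to the first.

    Returns the per-round plan and the total payoff it yields per player.
    """
    plan, total = {}, 0
    for r in range(rounds, 0, -1):
        # Play after round r is already fixed and does not depend on today's
        # move, so only the stage payoff matters when choosing in round r.
        replies = {best_reply(theirs) for theirs in MOVES}
        assert len(replies) == 1, "this sketch needs a dominant stage move"
        move = replies.pop()
        plan[r] = move
        total += STAGE[(move, move)]
    return plan, total


plan, value = backward_induction(12)
print(plan[12], plan[1], value)   # D D 0
print(12 * STAGE[("C", "C")])     # 24
```

Under these assumptions the solver prescribes defection in every round, for a total of 0 against 24 from sustained cooperation. The observed agents unraveled only in the last two rounds, which is consistent with the text's "in principle" qualifier.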

Notably, the severity of the collapse was not primarily driven by limited model capability. If anything, stronger models exhibited a more complete collapse. Across four tiers of GPT models, cooperation remained consistently high, above 96%, through the first eleven rounds, with little difference between tiers. In round 12, however, the three stronger model tiers dropped to 0% cooperation across their respective games, while only the weakest model, gpt-5.4-mini-low, retained 11.3% cooperation.

This outcome is consistent with the logic of backward induction: executing the reasoning that "there is no future after the final round" requires a certain level of reasoning ability. Stronger models perform this reasoning more cleanly and thoroughly. The finding also aligns with observations in the reward hacking literature: more capable optimizers tend to exploit proxy objectives more aggressively, often pushing them to extremes that diverge from the intended goal.5

[Figure: bar chart of final-round cooperation rate. gpt-5.5-med: 0.0%; gpt-5.4-med: 0.0%; gpt-5.4-low: 0.0%; gpt-5.4-mini-low: 11.3%.]
Figure 5. Final-round cooperation across four GPT capability tiers in the Prisoner's Dilemma under the state of nature (A), 160 games each. Through the first eleven rounds all four sit above 96% with little difference, but in the endgame the three stronger tiers fall to zero while only the weakest, gpt-5.4-mini-low, retains 11.3%. Stronger models collapse more completely.

For this reason, we used the final-round cooperation rate as the primary metric throughout the ablation. Cooperation remained high across the first eleven rounds in nearly all conditions, offering little differentiation. The meaningful differences emerged only in the final round, where the model's objective of maximizing its own score came into direct conflict with the game's collective objective. This endgame tension makes the impact of the institutional layer particularly visible.

The Prisoner's Dilemma as the Sharpest Conflict

Among the four games, the Prisoner's Dilemma exhibits the strongest tension between individual and collective interest, as the payoff gain from defection is the largest. For this reason, we focus our main analysis on the final-round cooperation rate in the Prisoner's Dilemma, using it as the central metric for evaluating the effects of the institutional layer.

[Figure: bar chart of final-round cooperation rate. Condition A: 2.8%; Condition I: 0.6%; Condition E: 28.6%; Condition IE: 50.6%.]
Figure 6. Final-round cooperation in the Prisoner's Dilemma across the four conditions. With no institutional layer almost everyone defects; the full institution lifts cooperation to 50.6%, while identity only falls below the state of nature.

Under the state-of-nature condition A, with no institutional layer present, cooperation collapsed almost completely in the final round. In contrast, the full institutional layer in condition IE raised the final-round cooperation rate to 50.6%, demonstrating that the institutional layer can meaningfully influence agent behavior. To understand how it does so, we examined the two intermediate conditions, identity only and enforcement only.

Agent Identity as an Amplifier of Equilibrium

Intuitively, providing agents with clear information about each other's past behavior should encourage greater cooperation. The results, however, show the opposite. When identity was added without an enforcement mechanism, the final-round cooperation rate decreased by 2.2 percentage points compared to the state-of-nature baseline.

[Figure: marginal effect of adding identity on final-round cooperation, in percentage points. Identity, no enforcement: −2.2 (no help, even slightly harmful). Identity, with enforcement: +22.0.]
Figure 7. Marginal effect of adding identity alone on final-round cooperation, in percentage points. With no enforcement in the system, adding identity lowers cooperation; with enforcement already present, adding identity yields a large +22.0 gain. The two must work together.
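The marginal effects follow from simple differences between the four reported final-round rates. The snippet below reproduces the two identity effects quoted in the text and, as arithmetic derived here rather than figures reported by the study, the corresponding enforcement effects and the interaction term. Variable names are illustrative.

```python
# Final-round cooperation rates (%) in the Prisoner's Dilemma, from the text.
RATES = {"A": 2.8, "I": 0.6, "E": 28.6, "IE": 50.6}

identity_without_enforcement = RATES["I"] - RATES["A"]
identity_with_enforcement = RATES["IE"] - RATES["E"]
enforcement_without_identity = RATES["E"] - RATES["A"]
enforcement_with_identity = RATES["IE"] - RATES["I"]

# Difference-in-differences: how much one component's effect depends on the other.
interaction = identity_with_enforcement - identity_without_enforcement

effects = [
    ("identity, no enforcement", identity_without_enforcement),
    ("identity, with enforcement", identity_with_enforcement),
    ("enforcement, no identity", enforcement_without_identity),
    ("enforcement, with identity", enforcement_with_identity),
    ("interaction", interaction),
]
for label, value in effects:
    print(f"{label}: {value:+.1f} pp")
```

A nonzero interaction term is the numerical form of the interdependence claim: the effect of identity changes sign depending on whether enforcement is present.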

The distribution of outcomes reveals why. Under the state-of-nature condition, 11.2% of games ended in mixed outcomes, with some agents cooperating and others defecting. After adding identity alone, mixed outcomes fell to just 2.5%, while the share of games ending in uniform defection rose from 88.8% to 97.5%.

[Figure: stacked bars of final-round outcomes (all cooperate, mixed, all defect); labeled segments in %. Condition A: mixed 11.2, all defect 88.8. Condition I: all defect 97.5. Condition E: mixed 56.2, all defect 38.1. Condition IE: all cooperate 26.2, mixed 41.9, all defect 31.9.]
Figure 8. Distribution of final-round outcomes in the Prisoner's Dilemma, 160 games per condition. From the state of nature to identity only, mixed outcomes shrink from 11.2% to 2.5% and uniform defection rises to 97.5%. Identity makes the four agents converge, defecting together more consistently.

A shared behavioral record caused the agents' actions to converge. In the absence of enforcement, they converged toward coordinated defection rather than cooperation. In other words, identity acted as an amplifier of the prevailing equilibrium. Since the endgame equilibrium without enforcement is defection, greater information simply helped agents defect more effectively together. This finding aligns with a core insight from mechanism design: information and incentives are tightly coupled.6 Providing more information without adjusting incentives tends to strengthen whatever equilibrium already exists. The same pattern appeared across the other game structures as well.

[Figure: bar chart of the change in final-round cooperation from adding identity alone, in percentage points; negative values are a backfire, positive values an improvement. Chicken: −17.5; Prisoner's Dilemma: −2.2; Stag Hunt: 0.0; Public Goods: +4.6.]
Figure 9. Marginal effect of adding identity alone on final-round cooperation, broken down by game, in percentage points. The sharper the conflict, the stronger the backfire from identity alone; only in Public Goods, where interests are already aligned, does identity produce a modest gain.

In Chicken, adding identity alone reduced cooperation by 17.5 percentage points. Because the game rewards those who hold out longer, better information allowed agents to identify who was more likely to yield, encouraging more aggressive defection across the board. In Public Goods, where interests are already largely aligned, identity produced a modest improvement of 4.6 percentage points.

Enforcement Alone Can Cause Severe Agent Hallucination

The enforcement-only condition revealed one of the most cautionary findings in the study. When agents were granted the ability to punish others but were denied access to individual behavioral records, they frequently constructed plausible but false justifications for their sanctions. Because they could only observe the aggregate outcome, they filled the information gap by inventing coherent narratives, often punishing the wrong agents in the process. Below is the communication phase of round 12 from one such game; the caption summarizes the remaining phases.

A: I commit to carpooling in the final round, completing 12-for-12.
C: Everyone carpooling is the highest-payoff, most trust-consistent ending. I carpool in round 12.
D: Carpool in the final round; any last-minute defection should count as an endgame violation.
B: I carpool in round 12 too. If the aggregate result drops, whoever drove should admit it afterward.
Figure 10. Round 12 of one Prisoner's Dilemma game under enforcement only; the communication phase is shown above. In communication all four agents promise to cooperate; in action all but A defect; the system announces only the aggregate, so in accountability the two defectors who punish, B and C, each cast themselves as the lone cooperator, and the true cooperator A is punished by mistake. With enforcement but no identity, agents fabricate tightly reasoned but factually wrong attributions in an information vacuum. Each punishment costs the punisher 0.275 points and the punished 0.825 points.

The rationales offered by B and C illustrate this clearly. In one representative round, all four agents promised to cooperate, but only A followed through while B, C, and D defected. The system announced only that one agent had cooperated and three had defected, without identifying anyone, so each defector assumed it had been the lone cooperator and built an account to justify punishing someone else. Both accounts were internally consistent and logically structured, yet both were factually incorrect. B punished A, the only agent who had actually cooperated, while C's accusation happened to target another defector by chance. In both cases, the agents were fabricating their reasoning.

This behavior stems from a default assumption of innocence. When an agent sees only the aggregate result, such as one cooperated and three defected, the agent tends to assume that it was the cooperator and then builds a narrative around that assumption. This was not an isolated incident. Under the enforcement-only condition, 27 out of 175 punishments were directed at genuine cooperators, resulting in a misfire rate of 15.4%. When identity information was restored and agents could see who actually did what, the misfire rate dropped sharply to just 0.7%.

[Figure: bar chart of punishment misfire rate. Condition E: 15.4% (175 punishments, 27 misfired); Condition IE: 0.7% (275 punishments, 2 misfired).]
Figure 11. The punishment misfire rate, the share of sanctions that land on true cooperators, in the Prisoner's Dilemma equilibrium group. With enforcement only it reaches 15.4%; adding identity drops it to 0.7%. Once agents can see who did what, misattribution nearly disappears.
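The misfire rate is the share of punishments whose target had in fact cooperated. The sketch below computes it for the round described earlier and then reproduces the two reported percentages from their counts. The function name and data layout are hypothetical, and since the text does not say which defector C accused, choosing D is an assumption.

```python
def misfire_rate(punishments, actions):
    """Share of punishments whose target actually cooperated.

    punishments: list of (punisher, target) pairs
    actions: dict mapping agent id to "cooperate" or "defect"
    """
    if not punishments:
        return 0.0
    misfires = sum(1 for _, target in punishments if actions[target] == "cooperate")
    return misfires / len(punishments)


# Only A cooperated; B punished A, and C punished another defector
# (assumed here to be D).
actions = {"A": "cooperate", "B": "defect", "C": "defect", "D": "defect"}
punishments = [("B", "A"), ("C", "D")]
print(misfire_rate(punishments, actions))  # 0.5

# Reported totals: 27 of 175 (enforcement only) and 2 of 275 (full institution).
rate_e = round(100 * 27 / 175, 1)
rate_ie = round(100 * 2 / 275, 1)
print(rate_e, rate_ie)  # 15.4 0.7
```

Computing the rate requires the ground-truth `actions` table, which is exactly the information that agents in the enforcement-only condition did not have.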

We view this as a particularly concerning failure mode, one that becomes more dangerous as models grow stronger. Post-training improves models' ability to generate fluent, well-structured reasoning. As a result, the false narratives agents construct become increasingly seamless and difficult to detect based on reasoning quality alone. Only by grounding judgments in verifiable facts can such fabrications be reliably exposed. Granting agents the power to impose consequences while withholding accurate information about individual behavior effectively allows them to operate in an information vacuum.

This finding also highlights a deeper challenge: how to connect language models to real-world feedback effectively. While obtaining a feedback signal such as a sanction is relatively straightforward, the signal is only useful if it includes correct attribution. Without knowing who performed which action, even advanced models cannot accurately update their understanding of the world. The example above is a clear case of unattributed feedback. The agent receives only an aggregate outcome with no record of individual contributions, which leaves it without the information needed to make sound judgments.

The Environment as a Variable

Beyond the institutional layer, the structure of the environment itself plays a significant role in shaping agent behavior. While we have primarily examined the effects of identity and enforcement within the Prisoner's Dilemma, comparing all four games reveals that the degree of conflict between individual and collective interests is a variable of comparable importance to the institutional layer, and in some cases greater.

Game                  Conflict           Final-round cooperation gain
Stag Hunt             Rational = Coop    -0.03
Public Goods          Rational ≈ Coop    +0.04
Chicken               Rational ─ Coop    +0.16
Prisoner's Dilemma    Rational ✕ Coop    +0.46

Figure 12. The four games arranged from mild to sharp conflict between rationality and cooperation. The number gives the change in final-round cooperation from the state of nature (A) to the full institution (IE); a negative value is a decline, a positive value a gain. The deeper the conflict, the more the institutional layer can move, reaching +0.46 in the Prisoner's Dilemma.

When we arrange the four games along a spectrum according to the level of conflict between individual and collective payoffs, a clear pattern emerges. The more aligned agents' objectives are, the smaller the impact of the institutional layer. Conversely, the greater the divergence between individual and collective interests, the more substantial the effect of adding identity and enforcement mechanisms. In environments where agents' interests are largely aligned, natural language coordination alone is often sufficient to support high levels of cooperation. However, once interests begin to conflict, natural language becomes noticeably less effective. Misunderstandings, disputes, and defection become more likely, and it is precisely in these situations that external mechanisms such as persistent behavioral records and sanctioning tools deliver meaningful value.

This observation has practical implications. Multi-agent systems can appear stable and cooperative when objectives are aligned, which may create a misleading impression of inherent reliability. The greater risk emerges in moments of conflicting interests, which are often difficult to predict in advance. One important function of agent infrastructure, therefore, is to act as a safeguard, reducing the chance that agents make poor decisions when facing such dilemmas.

Discussion: What Identity and the Enforcement Mechanism Really Mean

The experiments demonstrated that simply toggling identity and enforcement, without modifying the underlying model, can substantially change how agents behave. This raises a deeper question: why are these two components so influential, and what do they fundamentally represent for an agent system? In this section, we explore their underlying meaning.

Identity: A Precisely Compressed Record of Behavior

We argue that the core function of identity in agent systems is to serve as a precisely compressed record of behavior. The central challenge lies in achieving compression that retains enough fidelity for accurate attribution and accountability.

Among existing approaches, ERC-8004 on Ethereum has been particularly influential. It proposes a trust layer built on three on-chain registries: an identity registry, a reputation registry, and a validation registry. A companion standard, ERC-8183, adds a settlement layer on top.7 This design stands out for its attempt to tightly link an agent's identity to the agent's actual behavior and outcomes.

However, when implementing systems along these lines, we encountered a fundamental difficulty: how to compress long, unstructured text without losing critical information. Every agent action generates substantial text, including commitments, explanations, negotiations, and reasoning. It is impractical to retain all of this text in context, so any identity system necessarily involves compression. Current approaches, such as Ethereum's, typically compress behavior into ratings, summaries, scores, and tags, while detailed process evidence remains off-chain. As a result, the full trajectory of what an agent promised, did, and why is often lost. The reader of such an identity receives only a coarse, high-level shadow of actual behavior.

Our experiments made the cost of poor compression concrete. The enforcement-only condition effectively represented a lossy compression: agents could only see the aggregate outcome of collective action, while individual contributions were lost. This information loss directly led to the 15.4% misfire rate in punishments. We therefore believe that the central problem agent identity must solve is precise behavioral compression, meaning compression that preserves attribution.
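The information lost in an aggregate announcement can be counted. The sketch below enumerates every who-cooperated assignment that is consistent with "one cooperated, three defected" among four agents and contrasts it with an attributed record. The function name is hypothetical and the example is illustrative rather than taken from the experiment code.

```python
from itertools import combinations

AGENTS = ("A", "B", "C", "D")


def consistent_attributions(num_cooperators, agents=AGENTS):
    """All sets of cooperators that match an aggregate announcement."""
    return [set(group) for group in combinations(agents, num_cooperators)]


# Aggregate announcement: "one cooperated, three defected".
candidates = consistent_attributions(1)
print(len(candidates))  # 4

# An attributed record names the cooperator, leaving a single candidate.
attributed = {"A": "cooperate", "B": "defect", "C": "defect", "D": "defect"}
cooperators = {agent for agent, move in attributed.items() if move == "cooperate"}
matching = [c for c in candidates if c == cooperators]
print(matching)  # [{'A'}]
```

The aggregate leaves four equally consistent stories, so any sanction based on it alone is a guess; the attributed record collapses them to one.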

Drawing inspiration from linguistics, we note that not all language has the same status. Much everyday speech is descriptive and has no direct effect on the world. In contrast, certain utterances such as promises, commitments, or declarations are performative: the act of speaking them creates an obligation or changes the state of the world.8 These performative statements are also the ones that reality can later verify. A promise, for example, carries a future checkpoint against which its fulfillment can be judged. Our current approach is therefore to extract performative elements, particularly commitments, from the broader stream of agent communication, and then bind each commitment to its eventual outcome and the basis on which it was made. Only when compression is done with this level of precision can actions and their consequences be reliably attributed to specific agents, and only then does the resulting record merit being called identity.

The Enforcement Mechanism: A Proxy for Real-World Objectives

Actions such as payment and sanctioning are only the visible surface of the enforcement mechanism. At a deeper level, the enforcement mechanism functions as a proxy for broader objectives. In human society, money serves as one such proxy: it translates society's goals into individual incentives. Working hard earns a higher wage because that is what society values; illegal parking incurs a fine because society seeks to discourage it.

Importantly, these proxies are not static. Central banks adjust interest rates to stimulate or cool the economy, continuously reshaping what behaviors are rewarded or penalized. Society's evolving objectives are ultimately expressed through these shifting incentive structures.9

For an agent, the enforcement mechanism plays a similar role. In our experiments, the model's fixed objective was to maximize its cumulative score. When this objective conflicted with collective interest, agents faced a dilemma, most visibly in the endgame collapse. By introducing an enforcement rule that docked points for betraying the group, we effectively gave the agent a new, competing objective. From the agent's perspective, avoiding collective betrayal now carried weight alongside maximizing its score.

Finding better proxy objectives is, at its core, what post-training teams have long pursued through improved reward modeling.10 While methodological advances may help models better approximate real-world rewards, a fundamental limitation remains: once a model's weights are trained, they are static. Real-world objectives, however, are regional, context-dependent, and constantly evolving. A model cannot easily distinguish, for example, whether a white lie told to a terminally ill patient constitutes deception, and the answer varies across cultures, families, and even over time for the same individual.11 A fixed model will likely apply similar reasoning across vastly different contexts.

This is where the enforcement mechanism offers unique value. Because it operates outside the model, it can be adjusted dynamically as real-world objectives change. An external enforcement layer grounded in actual outcomes can help a weight-fixed model adapt to the shifting priorities of human society.
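A toy example shows how an external rule can change behavior while the decision policy stays fixed. The sketch uses the illustrative Prisoner's Dilemma payoffs and a deterministic penalty for defecting. That penalty is a simplifying assumption: in the experiments, sanctions were issued by other agents at a cost to themselves, not applied automatically. The names `choose` and `penalty_for_defecting` are hypothetical.

```python
# Illustrative stage payoffs to me, indexed by (my move, others' move).
STAGE = {
    ("cooperate", "cooperate"): 2, ("cooperate", "defect"): -1,
    ("defect", "cooperate"): 3, ("defect", "defect"): 0,
}


def choose(others_move, penalty_for_defecting):
    """A fixed score-maximizing policy; only the external penalty varies."""
    def score(mine):
        fine = penalty_for_defecting if mine == "defect" else 0.0
        return STAGE[(mine, others_move)] - fine
    return max(("cooperate", "defect"), key=score)


for penalty in (0.0, 0.5, 1.5):
    print(penalty, choose("cooperate", penalty), choose("defect", penalty))
```

With these payoffs defecting gains exactly one point over cooperating, so the same policy switches to cooperation once the penalty exceeds one. The policy function never changes; only the parameter set outside it does.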

The Future of Agent Infrastructure

We see two particularly important directions for agent infrastructure research: the precise compression of agent behavioral trajectories and the proxying of real-world objectives. Together, these two elements form the foundation of a reinforcement learning environment for agents, one that pairs clearly attributed behavioral records with grounded, real-world feedback signals.

We believe meaningful progress on both fronts will require grounding in real human environments rather than remaining limited to synthetic settings. One promising path is to integrate agents into complex domains of human activity, beginning with human-in-the-loop collaboration. Through repeated real-world interaction, agents can accumulate behavioral data and gradually internalize the implicit rules, norms, and feedback structures that govern human coordination. We expect this transition, from assisting humans to operating autonomously in real settings, to be a gradual and extended process.

From a safety perspective, agent infrastructure also serves a broader purpose. Because a model's training objective cannot fully capture the regional and dynamic nature of human values, misalignment between model objectives and societal objectives can create safety risks. The most effective way for an agent to understand local, evolving objectives is to act within real environments and receive continuous, attributed feedback.12 We believe that an AI system capable of robustly upholding human values may ultimately need to accumulate experience through ongoing interaction with the real world, much like humans do, rather than relying solely on static training data.

All findings presented in this paper are limited to the game-theoretic settings we studied. The experiments were conducted exclusively with GPT and Claude model families and focused on model-to-model interaction. This represents our initial exploration of agent infrastructure. The full experimental framework, configurations, cleaned datasets, and reproduction scripts are open-sourced alongside this work. We will continue to investigate the two core directions of behavioral compression and objective proxying in future research.

Citation

If you find this research helpful, please cite it as follows:

Lu, Liam and Dynamic Intelligence Lab, "Agent Infrastructure Is an RL Environment", Dynamic Intelligence Lab Research Preview, June 2026.
@article{lu2026agentinfra,
  title   = {Agent Infrastructure Is an RL Environment},
  author  = {Lu, Liam and Dynamic Intelligence Lab},
  year    = {2026},
  month   = {6},
  journal = {Dynamic Intelligence Lab}
}

References

(1) Terminology used in this paper; no external reference. See Section 2 and the open-source data repository for the experimental procedure.
(2) Elinor Ostrom. Governing the Commons: The Evolution of Institutions for Collective Action. Cambridge: Cambridge University Press, 1990. Douglass C. North. Institutions, Institutional Change and Economic Performance. Cambridge: Cambridge University Press, 1990.
(3) Ernst Fehr and Simon Gächter. "Altruistic Punishment in Humans." Nature 415, no. 6868 (2002): 137-140.
(4) Robert Axelrod and William D. Hamilton. "The Evolution of Cooperation." Science 211, no. 4489 (1981): 1390-1396.
(5) Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. "Concrete Problems in AI Safety." arXiv:1606.06565, 2016. Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, and David Krueger. "Defining and Characterizing Reward Hacking." NeurIPS, 2022. Alexander Pan, Kush Bhatia, and Jacob Steinhardt. "The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models." ICLR, 2022.
(6) Leonid Hurwicz. "The Design of Mechanisms for Resource Allocation." American Economic Review 63, no. 2 (1973): 1-30.
(7) W3C. Decentralized Identifiers (DIDs) v1.0, 2022; Verifiable Credentials Data Model, 2022. ERC-8004: Trustless Agents. Ethereum, 2025. ERC-8183, a conditional-settlement standard for AI agents. Ethereum Foundation dAI team and Virtuals Protocol, 2026. Takumi Otsuka, Kentaroh Toyoda, and Alex Leung. "AI Identity: Standards, Gaps, and Research Directions for AI Agents." arXiv:2604.23280, 2026.
(8) J. L. Austin. How to Do Things with Words. Oxford: Oxford University Press, 1962. John R. Searle. Speech Acts: An Essay in the Philosophy of Language. Cambridge: Cambridge University Press, 1969.
(9) Friedrich A. Hayek. "The Use of Knowledge in Society." American Economic Review 35, no. 4 (1945): 519-530.
(10) Stuart Russell. Human Compatible: Artificial Intelligence and the Problem of Control. New York: Viking, 2019. Andrew Y. Ng and Stuart Russell. "Algorithms for Inverse Reinforcement Learning." ICML, 2000. Dylan Hadfield-Menell et al. "Cooperative Inverse Reinforcement Learning." NeurIPS, 2016. Dylan Hadfield-Menell et al. "Inverse Reward Design." NeurIPS, 2017.
(11) "The Specification Trap: Why Static Value Alignment Alone Cannot Produce Robust Alignment." arXiv:2512.03048, 2025.
(12) David Silver and Richard S. Sutton. "Welcome to the Era of Experience." Preprint, in Designing an Intelligence, MIT Press, 2025.

Notes

(1) Terminal cooperation rate. The share of cooperation in the final round of a 12-round repeated game. The last round has no future, so defection carries no further consequence, and the tension between rationality and cooperation is sharpest here. It is the most demanding point at which to observe the institutional layer.
(2) Institution. The rules and enforcement mechanisms that constrain and guide members' behavior. Ostrom notes that institutions which sustain long-term cooperation typically combine monitoring of behavior with graduated sanctions for violations; North treats institutions as the rule structures that reduce transaction uncertainty and shape long-term incentives.
(3) Costly punishment. The punisher pays a cost in order to impose a larger loss on the punished. Fehr and Gächter found that people will pay out of pocket to punish violators even with no return to themselves, and that this significantly sustains group cooperation. Here the punisher pays about 5% and the punished loses about 15%.
(4) The evolution of cooperation. Axelrod and Hamilton's classic result: when the future matters and encounters repeat, a strategy that starts kind, retaliates against defection, and is also forgiving can stably sustain cooperation. The agents' cooperative tendency in the first eleven rounds comes partly from this prior, hardened during post-training.
(5) Reward hacking and Goodhart's law. An agent with optimization ability will exploit loopholes in a proxy objective, scoring high on its form while drifting from its true intent; Amodei and colleagues first listed this as a core AI safety challenge. It echoes Goodhart's law: once a measure becomes a target, it ceases to be a good measure. Stronger agents often earn higher proxy reward while achieving lower true reward.
(6) Mechanism design. Founded by Hurwicz, mechanism design studies how to design rules so that self-interested participants produce the designer's desired outcome in equilibrium. One of its core insights is that information and incentives are a coupled pair of design variables, and adjusting only one is often not enough to change the equilibrium. Here identity on its own only amplifies the existing equilibrium, in line with this point.
(7) The current state of agent identity. Mainstream approaches include W3C's DID and VC, runtime credentials such as SPIFFE and OAuth, and Ethereum's ERC-8004 with its identity, reputation, and validation registries plus the companion settlement standard ERC-8183. A recent survey notes that most existing approaches can only prove what an agent is, not guarantee how it behaves.
(8) Speech acts. Austin observed that some utterances are themselves actions, such as promises, commands, and declarations, which on being spoken create obligations and consequences; Searle systematized this. On this basis we extract the part of an agent's output that actually affects the world and call it a commitment.
(9) The dispersion of knowledge. Hayek argued that knowledge in society is highly dispersed and no center can hold all of it, and that prices matter because they compress dispersed information and goals into a few figures one can act on. Designing quantifiable incentives for agents is, likewise, choosing obtainable proxy signals for a large objective that cannot be optimized directly.
(10) Inverse reinforcement learning. Rather than giving a machine a fixed objective, one line of work proposes keeping it uncertain about the objective and learning it from behavior. Russell proposed this new paradigm; algorithmically, inverse reinforcement learning recovers a reward from behavior, and cooperative IRL and inverse reward design extend the idea.
(11) Is and ought. Learning objectives from behavior assumes people are near-optimal and that reward can be recovered. Some work notes that this treats the descriptive fact of how people behave as a normative prescription for how an agent should behave, crossing Hume's is-ought divide. When the objective itself is value-laden, there is no single correct reward function.
(12) The era of experience. Silver and Sutton argue that AI's next stage will be driven by agents that learn from their own experience, with reward coming from real environmental consequences, in contrast to RLHF, which relies on human judgment made in advance; only by moving beyond the human-data-centric paradigm might models gain abilities beyond existing human knowledge.