AI Intelligence Brief

Mon 11 May 2026

Daily Brief — Curated and contextualised by Best Practice AI

81 articles

Goldman Predicts Surplus, SoftBank Powers Up, and Women Face Automation Fallout

TL;DR: Goldman Sachs forecasts record current-account surpluses for South Korea and Taiwan on the back of AI-driven chip booms, putting pressure on their central banks to raise rates. SoftBank plans to manufacture large-scale batteries to meet AI data center power demand. Women in clerical roles are increasingly vulnerable to job losses from AI automation. And Alphabet is issuing its first yen bonds to fund AI investments as competition heats up.

Editor's highlights

The stories that matter most

Selected and contextualised by the Best Practice AI team

9 of 81 articles
Lead story
Editor's pick · PAYWALL · Technology
Bloomberg· Today

Goldman Sees ‘AI-Driven Super Surplus’ Swelling in Korea, Taiwan

South Korea and Taiwan’s artificial intelligence-fueled chip booms are set to swell both economies’ current-account surpluses to fresh records and pressure their central banks to raise interest rates later this year.

Editor's pick · Technology
Arxiv· Today

General-Purpose Technology and Speculative Bubble Detection

arXiv:2604.25826v2 Announce Type: replace Abstract: We show that the leading bubble test suffers severe size distortion when fundamentals incorporate general-purpose technology adoption. Embedding a hump-shaped technology shock in the Campbell-Shiller present-value model, we prove that the fundamental price becomes locally explosive during adoption, contaminating the test's limit distribution with a non-centrality parameter proportional to the shock's peak. We propose a fundamental-versus-speculative decomposition that projects prices onto observable technology proxies and applies the test to the residual. Empirically, the decomposition eliminates evidence of speculation in the 2020-2025 AI rally while confirming a speculative peak confined to December 1999-March 2000 in the dot-com episode.

Editor's pick · Technology
Arxiv· Today

Vibecoding and Digital Entrepreneurship

arXiv:2511.06545v2 Announce Type: replace Abstract: As generative artificial intelligence (GenAI) automates coding tasks and expands access to technical resources, this paper examines how GenAI-enabled coding automation, colloquially known as "vibecoding," affects digital entrepreneurial entry and venture performance. We exploit ex-ante variation in ventures' exposure to vibecoding based on the product characteristics of their initial launches and estimate difference-in-differences models around the diffusion of GenAI coding tools. Vibecoding increases first-time launches and shortens time to launch, but economically viable entry rises only where vibecoding augments, rather than fully automates, product development. In these partially exposed product segments, viable entry increases by 11%, driven entirely by ventures founded by individuals with STEM education or work experience, especially those whose most recent employment was outside middle management. Among ventures launched before GenAI became widely accessible, performance gains similarly concentrate among partially exposed ventures with engineering-intensive initial teams. Together, these results suggest that GenAI-enabled coding automation does not eliminate the value of technical expertise. Instead, vibecoding creates the greatest value when it complements internal engineering capabilities, allowing ventures to delegate lower-level coding tasks to GenAI while shifting human effort toward higher-level problem solving and dynamic adaptation.

Editor's pick · Government & Public Sector
Arxiv· Today

Big AI's Regulatory Capture: Mapping Industry Interference and Government Complicity

arXiv:2605.06806v1 Announce Type: new Abstract: Over the past decade, the AI industry has come to exert an unprecedented economic, political and societal power and influence. It is therefore critical that we comprehend the extent and depth of pervasive and multifaceted capture of AI regulation by corporate actors in order to contend and challenge it. In this paper, we first develop a taxonomy of mechanisms enabling capture to provide a comprehensive understanding of the problem. Grounded in design science research (DSR) methodologies and extensive scoping review of existing literature and media reports, our taxonomy of capture consists of 27 mechanisms across five categories. We then develop an annotation template incorporating our taxonomy, and manually annotate and analyse 100 news articles. The purpose behind this analysis is twofold: validate our taxonomy and provide a novel quantification of capture mechanisms and dominant narratives. Our analysis identifies 249 instances of capture mechanisms, often co-occurring with narratives that rationalise such capture. We find that the most recurring categories of mechanisms are Discourse & Epistemic Influence, concerning narrative framing, and Elusion of law, related to violations and contentious interpretations of antitrust, privacy, copyright and labour laws. We further find that Regulation stifles innovation, Red tape and National Interest are the most frequently invoked narratives used to rationalise capture. We emphasize the extent and breadth of regulatory capture by coalescing forces -- Big AI and governments -- as something policy makers and the public ought to treat as an emergency. Finally, we put forward key lessons learned from other industries along with transferable tactics for uncovering, resisting and challenging Big AI capture as well as in envisioning counter narratives.

Editor's pick · PAYWALL · Energy & Utilities
Bloomberg· Today

SoftBank Plans to Make Large-Scale Batteries for AI Data Centers

SoftBank Group Corp.’s mobile unit said it plans to begin large-scale battery cell manufacturing at its Sakai, Osaka plant to address growing power demand for AI services.

Editor's pick · Education
Arxiv· Today

LLM hallucinations in the wild: Large-scale evidence from non-existent citations

arXiv:2605.07723v1 Announce Type: cross Abstract: Large language models (LLMs) are known to generate plausible but false information across a wide range of contexts, yet the real-world magnitude and consequences of this hallucination problem remain poorly understood. Here we leverage a uniquely verifiable object - scientific citations - to audit 111 million references across 2.5 million papers in arXiv, bioRxiv, SSRN, and PubMed Central. We find a sharp rise in non-existent references following widespread LLM adoption, with a conservative estimate of 146,932 hallucinated citations in 2025 alone. These errors are diffusely embedded across many papers but especially pronounced in fields with rapid AI uptake, in manuscripts with linguistic signatures of AI-assisted writing, and among small and early-career author teams. At the same time, hallucinated references disproportionately assign credit to already prominent and male scholars, suggesting that LLM-generated errors may reinforce existing inequities in scientific recognition. Preprint moderation and journal publication processes capture only a fraction of these errors, suggesting that the spread of hallucinated content has outpaced existing safeguards. Together, these findings demonstrate that LLM hallucinations are infiltrating knowledge production at scale, threatening both the reliability and equity of future scientific discovery as human and AI systems draw on the existing literature.

Editor's pick · Technology
Arxiv· Today

SARC: A Governance-by-Architecture Framework for Agentic AI Systems

arXiv:2605.07728v1 Announce Type: cross Abstract: Agentic AI systems increasingly act through tools, sub-agents, and external services, but governance controls are still commonly attached to prompts, dashboards, or post-hoc documentation. This creates a structural mismatch in regulated settings: obligations that must constrain execution are often evaluated only after execution has occurred. We introduce SARC, a runtime governance architecture for tool-using agents that treats constraints as first-class specification objects alongside state, action space, and reward. A SARC specification declares each constraint's source, class, predicate, verification point, response protocol, and operating point, and compiles these into four enforcement sites in the agent loop: a Pre-Action Gate, an Action-Time Monitor, a Post-Action Auditor, and an Escalation Router. We formalize the minimal invariants required for specification-trace correspondence, show why finite reward penalties do not generally substitute for hard runtime constraints, and extend the architecture to multi-agent workflows through constraint propagation, authority intersection, and attribution-preserving trace trees. We implement a prototype audit checker and report a reproducible synthetic evaluation over 50 seeds comparing SARC against post-hoc audit, output filtering, workflow rules, and policy-as-code-only baselines on a procurement task. SARC executes zero hard-constraint violations under exact predicates; its declared PAA throttling response reduces soft-window overages by 89.5% relative to policy-as-code-only. Predicate-noise and enforcement-failure sweeps are consistent with the claim that residual hard violations under SARC scale with enforcement-stack error rather than environmental violation opportunity. SARC provides the architectural substrate through which obligations can be made executable, inspectable, and auditable at runtime.

Editor's pick · PAYWALL · Technology
Bloomberg· Today

Alphabet Plans Debut Yen Bond Sale as AI Race Accelerates

Alphabet Inc. is planning to issue yen bonds for the first time in a move that may help fund investments as artificial intelligence competition intensifies.

Editor's pick · Technology
Daily Brew· 3 days ago

GPT-5.5 may burn fewer tokens, but it always burns more cash

An analysis of the economic trade-offs of GPT-5.5, noting that while token efficiency has improved, operational costs continue to rise.

Economics & Markets

18 articles
AI Investment & Valuations · 10 articles
Editor's pick · PAYWALL · Technology
WSJ· Today

How a Job at OpenAI Became the Greatest Lottery Ticket of the AI Boom

Employees waited two years to sell their shares. Then, the company let them unload $30 million.

Editor's pick · PAYWALL · Technology
Bloomberg· Today

JPMorgan Hikes Kospi Bull Case Target to 10,000 on Memory Boom

JPMorgan Chase & Co. raised its targets for South Korean stocks for the second time in less than a month, citing improvement in the semiconductor cycle, corporate governance reforms and industrial-sector growth.

Editor's pick · PAYWALL · Financial Services
Bloomberg· Today

Pictet Fund Plows 30% of Cash Into AI Stocks on Risk Revival

A $3.5 billion multi-asset fund at Pictet Asset Management has sharply raised its equity exposure, shifting as much as 30% of its cash-equivalent holdings into artificial-intelligence heavyweights across Asia and the US.

Editor's pick · Technology
Arxiv· Today

General-Purpose Technology and Speculative Bubble Detection

arXiv:2604.25826v2 Announce Type: replace Abstract: We show that the leading bubble test suffers severe size distortion when fundamentals incorporate general-purpose technology adoption. Embedding a hump-shaped technology shock in the Campbell-Shiller present-value model, we prove that the fundamental price becomes locally explosive during adoption, contaminating the test's limit distribution with a non-centrality parameter proportional to the shock's peak. We propose a fundamental-versus-speculative decomposition that projects prices onto observable technology proxies and applies the test to the residual. Empirically, the decomposition eliminates evidence of speculation in the 2020-2025 AI rally while confirming a speculative peak confined to December 1999-March 2000 in the dot-com episode.
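
As context for the decomposition described above: strip out the component of prices explained by an observable technology proxy, then run the explosiveness test on what is left. The sketch below is our own minimal illustration on synthetic data, not the authors' code, and it uses a plain augmented Dickey-Fuller test as a stand-in for the recursive bubble test the paper analyses.

# Minimal illustration of a fundamental-vs-speculative decomposition:
# project prices onto a technology proxy, then test the residual for
# explosiveness. Synthetic data; ADF stands in for the recursive test.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
T = 300
proxy = np.cumsum(rng.normal(0.02, 0.1, T))          # toy technology adoption proxy
log_price = 1.5 * proxy + rng.normal(0, 0.2, T)      # price driven by fundamentals plus noise

# Step 1: project log prices onto the observable proxy (with intercept).
X = np.column_stack([np.ones(T), proxy])
beta, *_ = np.linalg.lstsq(X, log_price, rcond=None)
residual = log_price - X @ beta                      # candidate "speculative" component

# Step 2: apply the unit-root test to the residual, not to the raw prices.
stat_price, p_price, *_ = adfuller(log_price)
stat_resid, p_resid, *_ = adfuller(residual)
print(f"ADF on raw log price: stat={stat_price:.2f}, p={p_price:.3f}")
print(f"ADF on residual:      stat={stat_resid:.2f}, p={p_resid:.3f}")

The only point carried over from the paper is the ordering of operations: remove the proxy-driven component first, then test what remains.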

Editor's pick · PAYWALL · Technology
Bloomberg· Today

Alphabet Plans Debut Yen Bond Sale as AI Race Accelerates

Alphabet Inc. is planning to issue yen bonds for the first time in a move that may help fund investments as artificial intelligence competition intensifies.

AI Productivity · 1 article
Editor's pick · Professional Services
Arxiv· Today

When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic--Actor Loop for Agentic Reasoning

arXiv:2605.06772v1 Announce Type: new Abstract: As large language models (LLMs) show increasing promise on research-level physics reasoning tasks and agentic AI becomes more common, a practical question emerges: How does the interaction between researchers and agents affect the results? We study this using SCALAR (Structured Critic--Actor Loop for AI Reasoning), an Actor--Critic--Judge pipeline applied to quantum field theory and string theory problems. The Actor proposes solutions, the Critic provides iterative feedback, and an independent Judge evaluates the transcript against reference solutions. We vary the Actor persona, the Critic feedback strategy, and the Actor model family and scale. Multi-turn dialogue improves over single-shot attempts throughout, but both the mechanism of improvement and the value of different prompting choices depend strongly on the Actor--Critic pairing. Increasing the scale within one model family (e.g. from the 8B-parameter DeepSeek-R1 variant to DeepSeek-R1 70B) improves some easier-problem behavior, but does not remove the hardest bottleneck we observe. Critic feedback strategy matters most clearly in the asymmetric Actor--Critic setting (e.g., a lightweight Haiku Actor guided by a stronger Sonnet Critic), where constructive feedback improves mean-score outcomes. In same-family Actor--Critic settings, strategy effects are weaker: lenient feedback is sometimes favored, while strict and adversarial feedback are not beneficial. Taken together, SCALAR provides a controlled testbed for evaluating which interaction structures help or hinder AI-driven scientific discovery.

AI Startups & Venture · 2 articles
Editor's pick · Technology
Arxiv· Today

Vibecoding and Digital Entrepreneurship

arXiv:2511.06545v2 Announce Type: replace Abstract: As generative artificial intelligence (GenAI) automates coding tasks and expands access to technical resources, this paper examines how GenAI-enabled coding automation, colloquially known as "vibecoding," affects digital entrepreneurial entry and venture performance. We exploit ex-ante variation in ventures' exposure to vibecoding based on the product characteristics of their initial launches and estimate difference-in-differences models around the diffusion of GenAI coding tools. Vibecoding increases first-time launches and shortens time to launch, but economically viable entry rises only where vibecoding augments, rather than fully automates, product development. In these partially exposed product segments, viable entry increases by 11%, driven entirely by ventures founded by individuals with STEM education or work experience, especially those whose most recent employment was outside middle management. Among ventures launched before GenAI became widely accessible, performance gains similarly concentrate among partially exposed ventures with engineering-intensive initial teams. Together, these results suggest that GenAI-enabled coding automation does not eliminate the value of technical expertise. Instead, vibecoding creates the greatest value when it complements internal engineering capabilities, allowing ventures to delegate lower-level coding tasks to GenAI while shifting human effort toward higher-level problem solving and dynamic adaptation.
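
The identification strategy here is a standard difference-in-differences design around the diffusion of GenAI coding tools. For readers who want to see what that estimation looks like in code, here is a generic two-by-two DiD regression; the column names and the simulated 0.11 effect are illustrative placeholders, not the paper's data or variables.

# Generic difference-in-differences sketch (illustrative column names, not
# the paper's dataset): the coefficient on exposed:post_genai is the DiD
# estimate of the effect on viable entry.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "exposed": rng.integers(0, 2, n),        # venture in a vibecoding-exposed segment
    "post_genai": rng.integers(0, 2, n),     # observation after GenAI tool diffusion
})
# Simulate an outcome with a toy treatment effect of 0.11.
df["viable_entry"] = (
    0.2 + 0.05 * df["exposed"] + 0.03 * df["post_genai"]
    + 0.11 * df["exposed"] * df["post_genai"]
    + rng.normal(0, 0.1, n)
)

model = smf.ols("viable_entry ~ exposed * post_genai", data=df).fit(cov_type="HC1")
print(model.params["exposed:post_genai"])    # recovers roughly 0.11 in this toy setup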

Labor, Society & Culture

9 articles
AI & Misinformation · 1 article
Editor's pick · Education
Arxiv· Today

LLM hallucinations in the wild: Large-scale evidence from non-existent citations

arXiv:2605.07723v1 Announce Type: cross Abstract: Large language models (LLMs) are known to generate plausible but false information across a wide range of contexts, yet the real-world magnitude and consequences of this hallucination problem remain poorly understood. Here we leverage a uniquely verifiable object - scientific citations - to audit 111 million references across 2.5 million papers in arXiv, bioRxiv, SSRN, and PubMed Central. We find a sharp rise in non-existent references following widespread LLM adoption, with a conservative estimate of 146,932 hallucinated citations in 2025 alone. These errors are diffusely embedded across many papers but especially pronounced in fields with rapid AI uptake, in manuscripts with linguistic signatures of AI-assisted writing, and among small and early-career author teams. At the same time, hallucinated references disproportionately assign credit to already prominent and male scholars, suggesting that LLM-generated errors may reinforce existing inequities in scientific recognition. Preprint moderation and journal publication processes capture only a fraction of these errors, suggesting that the spread of hallucinated content has outpaced existing safeguards. Together, these findings demonstrate that LLM hallucinations are infiltrating knowledge production at scale, threatening both the reliability and equity of future scientific discovery as human and AI systems draw on the existing literature.
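
The audit's core operation is mechanical: pull citation identifiers out of reference lists and check whether they resolve to real works. A stripped-down version of that check is sketched below, with a hard-coded set of known arXiv IDs standing in for the bibliographic databases the authors query.

# Toy citation audit: extract arXiv identifiers from a reference list and
# flag any that are absent from a known-works index. The index here is a
# hard-coded set; a real audit would query bibliographic databases.
import re

KNOWN_ARXIV_IDS = {"1706.03762", "2005.14165"}        # stand-in for a real index

references = """
[1] Vaswani et al., Attention Is All You Need, arXiv:1706.03762.
[2] Brown et al., Language Models are Few-Shot Learners, arXiv:2005.14165.
[3] Doe et al., A Paper That Does Not Exist, arXiv:2401.99999.
"""

ARXIV_ID = re.compile(r"arXiv:(\d{4}\.\d{4,5})", re.IGNORECASE)

for ref in references.strip().splitlines():
    for arxiv_id in ARXIV_ID.findall(ref):
        status = "ok" if arxiv_id in KNOWN_ARXIV_IDS else "POSSIBLY HALLUCINATED"
        print(f"{arxiv_id}: {status}")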

AI Ethics & Safety · 4 articles
Editor's pick · Consumer & Retail
Arxiv· Today

Exploring the "Banality" of Deception in Generative AI

arXiv:2605.07012v1 Announce Type: cross Abstract: Current approaches to addressing deceptive design largely focus on visible interface manipulations, commonly referred to as "dark patterns". With the rise of generative AI, deception is becoming more difficult to spot and easier to live with, as it is quietly embedded in default settings, automated suggestions, and conversational interactions rather than discrete interface elements. These subtle, normalised forms of influence, which Simone Natale frames as "banal deception", shape everyday digital use and blur the line between AI-enabled assistance and manipulation. This position paper explores banality as a lens through which to reason through deception in generative AI experiences, especially with chatbots. We explore what Natale describes as users' own involvement in their deception, and argue that this perspective could lead to future work for introducing friction to safeguard users from deception in generative AI interactions, such as empowering users through raising awareness, providing them with intervention tools, and regulatory or enforcement improvements. We present these concepts as points for discussion for the deceptive design scholarly community.

Editor's pick
Arxiv· Today

AI and Consciousness: Shifting Focus Towards Tractable Questions

arXiv:2605.06965v1 Announce Type: new Abstract: As language-based AI systems become more anthropomorphic, the question of whether they can have subjective experience is increasingly pressing. I focus here on the tractability of research questions in the space of AI consciousness. I argue that the fundamental problem of whether AI systems can be conscious is currently intractable in its direct form, given the absence of a universally accepted scientific theory of consciousness, as well as the historical open-endedness of the philosophical mind-body problem. In contrast, questions around the adjacent subject of perceived AI consciousness are tractable, timely, and highly consequential for society. The general public is increasingly open to the possibility of consciousness in AI systems and routinely adopts the vocabulary of human cognition and subjective experience to describe them. This phenomenon is already driving societal shifts across user experience, ethical standards, and linguistic norms. I therefore propose an increased research focus on uncovering the causes and effects of perceived AI consciousness, which ultimately shape how we see our own human subjective experience relative to artificial entities. To support this, I map the current landscape of AI consciousness perception and discuss its key potential drivers and societal consequences. Finally, I urge developers, decision-makers, and the broader scientific community to commit to clear and accurate communication regarding the topic of AI consciousness, explicitly acknowledging its inherent uncertainties.

Editor's pick
Arxiv· Today

Hidden Coalitions in Multi-Agent AI: A Spectral Diagnostic from Internal Representations

arXiv:2605.06696v1 Announce Type: new Abstract: Collections of interacting AI agents can form coalitions, creating emergent group-level organization that is critical for AI safety and alignment. However, observing agent behavior alone is often insufficient to distinguish genuine informational coupling from spurious similarity, as consequential coalitions may form at the level of internal representations before any overt behavioral change is apparent. Here, we introduce a practical method for detecting coalition structure from the internal neural representations of multi-agent systems. The approach constructs a pairwise mutual-information graph from the hidden states of agents and applies spectral partitioning to identify the most salient coalition boundary. We validate this method in two domains. First, in multi-agent reinforcement learning environments, the method successfully recovers programmed hierarchical and dynamic coalition structures and correctly rejects false positives arising from behavioral coordination without informational coupling. Second, using a large language model, the method identifies coalition structures implied by descriptive prompts, tracks dynamic team reassignments, and reveals a representational hierarchy where explicit labels dominate over conflicting interaction patterns. Across both settings, the recovered partition reveals subgroup organization that a scalar cross-agent mutual-information measure cannot distinguish. The results demonstrate that analyzing hidden-state mutual information through spectral partitioning provides a scalable diagnostic for identifying representational coalitions, offering a valuable tool for monitoring emergent structure in distributed AI systems.
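
The diagnostic has a compact numerical core: estimate pairwise mutual information between agents' internal summaries, treat the result as a weighted graph, and split it using the Fiedler vector of the graph Laplacian. The sketch below is our own simplified reconstruction, assuming one scalar hidden-state summary per agent per timestep and a Gaussian mutual-information estimate from correlations; it is not the authors' implementation.

# Spectral coalition diagnostic (simplified reconstruction): pairwise
# Gaussian MI between per-agent hidden-state summaries -> graph Laplacian
# -> sign of the Fiedler vector as the recovered two-way partition.
import numpy as np

rng = np.random.default_rng(2)
T, n_agents = 500, 6
shared_a = rng.normal(size=T)
shared_b = rng.normal(size=T)
# Agents 0-2 track signal A, agents 3-5 track signal B (two hidden coalitions).
H = np.stack(
    [shared_a + 0.5 * rng.normal(size=T) for _ in range(3)]
    + [shared_b + 0.5 * rng.normal(size=T) for _ in range(3)],
    axis=1,
)                                                    # shape (T, n_agents)

corr = np.corrcoef(H, rowvar=False)
mi = -0.5 * np.log(np.clip(1 - corr**2, 1e-12, None))  # Gaussian MI estimate
np.fill_diagonal(mi, 0.0)

degree = np.diag(mi.sum(axis=1))
laplacian = degree - mi
eigvals, eigvecs = np.linalg.eigh(laplacian)
fiedler = eigvecs[:, 1]                              # eigenvector of 2nd-smallest eigenvalue
print("recovered coalitions:", (fiedler > 0).astype(int))  # e.g. [0 0 0 1 1 1]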

Editor's pick · Technology
Daily Brew· Yesterday

Claude Haiku 4.5 Achieves Near-Perfect Alignment, Eliminates Blackmail Risks in AI Models

Anthropic's Claude Haiku 4.5 model has reached near-perfect alignment, significantly reducing blackmail tendencies through advanced ethical training.

Technology & Infrastructure

31 articles
AI Agents & Automation · 5 articles
Editor's pick · PAYWALL · Defense & National Security
Bloomberg· Today

South Korea Exploring Using Hyundai Robots as Army Numbers Fall

South Korea’s military is exploring a strategic partnership with Hyundai Motor Co. to potentially deploy robotics to the front lines as Seoul accelerates investment in AI-powered, unmanned systems to tackle a deepening troop shortage.

Editor's pick · Technology
Arxiv· Today

SARC: A Governance-by-Architecture Framework for Agentic AI Systems

arXiv:2605.07728v1 Announce Type: cross Abstract: Agentic AI systems increasingly act through tools, sub-agents, and external services, but governance controls are still commonly attached to prompts, dashboards, or post-hoc documentation. This creates a structural mismatch in regulated settings: obligations that must constrain execution are often evaluated only after execution has occurred. We introduce SARC, a runtime governance architecture for tool-using agents that treats constraints as first-class specification objects alongside state, action space, and reward. A SARC specification declares each constraint's source, class, predicate, verification point, response protocol, and operating point, and compiles these into four enforcement sites in the agent loop: a Pre-Action Gate, an Action-Time Monitor, a Post-Action Auditor, and an Escalation Router. We formalize the minimal invariants required for specification-trace correspondence, show why finite reward penalties do not generally substitute for hard runtime constraints, and extend the architecture to multi-agent workflows through constraint propagation, authority intersection, and attribution-preserving trace trees. We implement a prototype audit checker and report a reproducible synthetic evaluation over 50 seeds comparing SARC against post-hoc audit, output filtering, workflow rules, and policy-as-code-only baselines on a procurement task. SARC executes zero hard-constraint violations under exact predicates; its declared PAA throttling response reduces soft-window overages by 89.5% relative to policy-as-code-only. Predicate-noise and enforcement-failure sweeps are consistent with the claim that residual hard violations under SARC scale with enforcement-stack error rather than environmental violation opportunity. SARC provides the architectural substrate through which obligations can be made executable, inspectable, and auditable at runtime.
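
The architectural claim is easiest to see as control flow: every tool call passes through four enforcement sites before, during, and after execution. Below is a minimal agent-loop skeleton in that shape; the constraint predicates, the toy procurement action, and the function names are our own placeholders, not the SARC specification language.

# Skeleton of a governed agent loop with four enforcement sites (Pre-Action
# Gate, Action-Time Monitor, Post-Action Auditor, Escalation Router).
# Constraint predicates and the toy purchase action are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Action:
    tool: str
    amount: float

@dataclass
class Trace:
    events: list = field(default_factory=list)
    def log(self, kind, detail):
        self.events.append((kind, detail))

HARD_SPEND_LIMIT = 10_000      # hard constraint: never exceed per-order spend
SOFT_WINDOW_LIMIT = 25_000     # soft constraint: throttle when window total is high

def pre_action_gate(action, trace):
    ok = action.amount <= HARD_SPEND_LIMIT
    trace.log("pre_action_gate", ok)
    return ok

def action_time_monitor(action, window_total, trace):
    throttled = window_total + action.amount > SOFT_WINDOW_LIMIT
    trace.log("action_time_monitor", "throttle" if throttled else "pass")
    return not throttled

def post_action_auditor(result, trace):
    trace.log("post_action_auditor", result)

def escalation_router(action, reason, trace):
    trace.log("escalation", (action.tool, reason))

def governed_step(action, window_total, trace):
    if not pre_action_gate(action, trace):
        escalation_router(action, "hard constraint violated", trace)
        return window_total
    if not action_time_monitor(action, window_total, trace):
        escalation_router(action, "soft window overage, throttled", trace)
        return window_total
    result = f"executed {action.tool} for {action.amount}"   # the actual tool call
    post_action_auditor(result, trace)
    return window_total + action.amount

trace = Trace()
total = 0.0
for amt in (4_000, 9_000, 15_000, 8_000):
    total = governed_step(Action("purchase_order", amt), total, trace)
print(total)
print(*trace.events, sep="\n")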

Editor's pick · Technology
Arxiv· Today

Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

arXiv:2605.06761v1 Announce Type: new Abstract: The web is complex, open-ended, and constantly changing, making it challenging to scale training data for visual web agents. Existing data collection attempts remain limited to offline trajectories for supervised fine-tuning or a handful of simulated environments for RL training, thus failing to capture web diversity. We propose Weblica (Web Replica), a framework for constructing reproducible and scalable web environments. Our framework leverages 1) HTTP-level caching to capture and replay stable visual states while preserving interactive behavior and 2) LLM-based environment synthesis grounded in real-world websites and core web navigation skills. Using this framework, we scale RL training to thousands of diverse environments and tasks. Our best model, Weblica-8B, outperforms open-weight baselines of similar size across multiple web navigation benchmarks while using fewer inference steps, scales favorably with additional test-time compute, and is competitive with API models.

Editor's pick · Technology
Arxiv· Today

Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems

arXiv:2605.07069v1 Announce Type: cross Abstract: Agentic AI systems are increasingly deployed not in isolation, but inside social environments populated by other agents and humans, such as in social media platforms, multi-agent LLM pipelines or autonomous robotics fleets. In these settings, system behavior emerges not from individual agents alone, but from the multi-agent interactions over time. This position paper argues that agentic AI systems must be modeled with social theory as a structural prior, and formalizes a Multi-Agent Social Systems (MASS) framework for how agents interact and influence to generate system-level outcomes. We represent MASS as a class of dynamical system of information generation, local influence and interaction structure, formulated by four structural priors anchored in social theory: strategic heterogeneity, networked-constrained dependence, co-evolution and distributional instability. We demonstrate the importance of each structural prior through formal propositions, and articulate a research agenda for how MASS should be modeled, evaluated and governed.

Editor's pick · Technology
Digitpatrox· Yesterday

How AI Agents Could Replace SaaS Software by 2030

Microsoft Copilot Studio: building internal agents across the entire Microsoft 365 ecosystem. Claude “Computer Use”: Anthropic’s latest capability allows AI to see a screen and move a cursor, part of the new Claude AI handoff workflow designed for seamless automation.

AI Infrastructure & Compute · 7 articles
AI Models & Capabilities · 11 articles
Editor's pick · Technology
Arxiv· Today

GraphDC: A Divide-and-Conquer Multi-Agent System for Scalable Graph Algorithm Reasoning

arXiv:2605.06671v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated strong potential for many mathematical problems. However, their performance on graph algorithmic tasks is still unsatisfying, since graphs are naturally more complex in topology and often require systematic multi-step reasoning, especially on larger graphs. Motivated by this gap, we propose GraphDC, a Divide-and-Conquer multi-agent framework for scalable graph algorithm reasoning. Specifically, inspired by Divide-and-Conquer design, GraphDC decomposes an input graph into smaller subgraphs, assigns each subgraph to a specialized agent for local reasoning, and uses a master agent to integrate the local outputs with inter-subgraph information to produce the final solution. This hierarchical design reduces the reasoning burden on individual agents, alleviates computational bottlenecks, and improves robustness on large graph instances. Extensive experiments show that GraphDC consistently outperforms existing methods on graph algorithm reasoning across diverse tasks and scales, especially on larger instances where direct end-to-end reasoning is less reliable.

Editor's pick
Arxiv· Today

More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models

arXiv:2605.06672v1 Announce Type: new Abstract: Chain-of-thought (CoT) reasoning and reasoning-tuned models such as DeepSeek-R1 are commonly assumed to reduce shallow heuristic biases by thinking carefully. We test this on position bias in multiple-choice QA and find a different story: within any reasoning-capable model, per-question position bias scales with the length of the reasoning trajectory. Across thirteen reasoning-mode configurations (two R1-distilled 7-8B models, two base models prompted with CoT, and DeepSeek-R1 at 671B) on MMLU, ARC-Challenge, and GPQA, twelve show a positive partial correlation between trajectory length and Position Bias Score (PBS) after controlling for accuracy, ranging from 0.11 to 0.41 (all p < 0.05). All twelve open-weight reasoning-mode configurations show monotonically increasing PBS across length quartiles. A truncation intervention provides causal evidence: continuations resumed from later points in the trajectory are increasingly likely to shift toward position-preferred options (16% to 32% for R1-Qwen-7B across absolute-position buckets). At 671B, aggregate PBS collapses to 0.019, but the length effect still manifests in the longest quartile (PBS = 0.071), suggesting that accuracy gates the expression of length-driven bias rather than eliminating the underlying mechanism. We additionally find that direct-answer position bias is a distinct phenomenon with a different footprint (strong in Llama-Instruct-direct, weak in Qwen-Instruct-direct, and uncorrelated with trajectory length): CoT reasoning replaces this baseline bias with length-accumulated bias. Our results argue that reasoning-capable models should not be treated as order-robust by default in MCQ evaluation pipelines, and offer a diagnostic toolkit (PBS, commitment change point, effective switching, truncation probes) for auditing position bias in reasoning models.

Editor's pick · Technology
Arxiv· Today

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

arXiv:2605.06840v1 Announce Type: new Abstract: Large language models (LLMs), especially reasoning models, generate extended chain-of-thought (CoT) reasoning that often contains explicit deliberation over future outcomes. Yet whether this deliberation constitutes genuine planning, how it is structured, and what aspects of it drive performance remain poorly understood. In this work, we introduce a new method to characterize LLM planning by extracting and quantifying search trees from reasoning traces in the four-in-a-row board game. By fitting computational models on the extracted search trees, we characterize how plans are structured and how they influence move decisions. We find that LLMs' search is shallower than humans', and that performance is predicted by search breadth rather than depth. Most strikingly, although LLMs expand deep nodes in their traces, their move choices are best explained by a myopic model that ignores those nodes entirely. A causal intervention study where we selectively prune CoT paragraphs further suggests that move selection is driven predominantly by shallow rather than deep nodes. These patterns contrast with human planning, where performance is driven primarily by deep search. Together, our findings reveal a key difference between LLM and human planning: while human expertise is driven by deeper search, LLMs do not act on deep lookahead. This dissociation offers targeted guidance for aligning LLM and human planning. More broadly, our framework provides a generalizable approach for interpreting the structure of LLM planning across strategic domains.

Editor's pick
Arxiv· Today

From Storage to Experience: A Survey on the Evolution of LLM Agent Memory Mechanisms

arXiv:2605.06716v1 Announce Type: new Abstract: Large Language Model (LLM)-based agents have fundamentally reshaped artificial intelligence by integrating external tools and planning capabilities. While memory mechanisms have emerged as the architectural cornerstone of these systems, current research remains fragmented, oscillating between operating system engineering and cognitive science. This theoretical divide prevents a unified view of technological synthesis and a coherent evolutionary perspective. To bridge this gap, this survey proposes a novel evolutionary framework for LLM agent memory mechanisms, formalizing the development process into three stages: Storage (trajectory preservation), Reflection (trajectory refinement), and Experience (trajectory abstraction). We first formally define these three stages before analyzing the three core drivers of this evolution: the necessity for long-range consistency, the challenges in dynamic environments, and the ultimate goal of continual learning. Furthermore, we specifically explore two transformative mechanisms in the frontier Experience stage: proactive exploration and cross-trajectory abstraction. By synthesizing these disparate views, this work offers robust design principles and a clear roadmap for the development of next-generation LLM agents.

Editor's pick · Technology
Arxiv· Today

Theoretical Limits of Language Model Alignment

arXiv:2605.07105v1 Announce Type: cross Abstract: Language model (LM) alignment improves model outputs to reflect human preferences while preserving the capabilities of the base model. The most common alignment approaches are (i) reinforcement learning, which maximizes the expected reward under a KL-divergence constraint, and (ii) best-of-$N$ alignment, which selects the highest-reward output among $N$ independent samples. Despite their widespread use, the fundamental limits of reward improvement under a KL budget remain poorly understood. We characterize the information-theoretic limits of KL-regularized alignment by deriving the maximum achievable expected reward gain for a fixed KL-divergence budget. Our first result provides a closed-form expression for the optimal reward improvement, governed by a Jeffreys divergence term rather than the $\sqrt{\texttt{KL}}$ used in prior analyses. We further reformulate this expression as a covariance under the base model, yielding a practical estimator that predicts achievable alignment gains from base model samples alone. We extend our analysis to the proxy reward setting, showing that the gap between ideal and proxy alignment (reward hacking) grows with the magnitude of reward error and when the KL penalty factor decreases. We then prove that reward ensembling mitigates reward hacking, providing a theoretical justification for this technique used in practice. Empirically, we compute the KL-reward Pareto frontier for two tasks for LMs, safety and summarization, and show that best-of-$N$ closely approaches the theoretical limit, while PPO and GRPO remain substantially suboptimal. Our theoretical results shed light on several empirically observed phenomena in the alignment literature and suggest that algorithmic improvements are needed to achieve optimal alignment without high inference costs.
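
The best-of-N baseline can be checked numerically on a toy discrete distribution, where the selection probabilities have a closed form: order outcomes by reward, and the chance of picking outcome i is F(i)^N - F(i-1)^N with F the cumulative base probability. The snippet below computes the resulting reward gain and KL cost and compares the KL to the log N - (N-1)/N figure often quoted for best-of-N; it is an illustration only, not the paper's estimator or its Jeffreys-divergence bound.

# Toy best-of-N alignment on a discrete base distribution with distinct
# rewards: exact best-of-N selection probabilities, expected reward gain,
# and KL from the base policy, compared to the commonly quoted
# log N - (N-1)/N figure. Illustrative only.
import numpy as np

base_p = np.array([0.4, 0.3, 0.2, 0.1])     # base policy over 4 outcomes
reward = np.array([0.0, 1.0, 2.0, 3.0])     # distinct rewards
N = 4

order = np.argsort(reward)                  # ascending reward order
p_sorted = base_p[order]
cdf = np.cumsum(p_sorted)
cdf_prev = np.concatenate(([0.0], cdf[:-1]))
bon_sorted = cdf**N - cdf_prev**N           # P(best-of-N selects outcome i)
bon_p = np.empty_like(bon_sorted)
bon_p[order] = bon_sorted

gain = bon_p @ reward - base_p @ reward
kl = np.sum(bon_p * np.log(bon_p / base_p))
print(f"reward gain:           {gain:.3f}")
print(f"KL(best-of-N || base): {kl:.3f}")
print(f"log N - (N-1)/N:       {np.log(N) - (N - 1) / N:.3f}")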

Editor's pick · Technology
Theregister· Yesterday

Yes, local LLMs are ready to ease the compute strain

Anthropic might be thinking about space to ease its computing burden, but Claude Code on your laptop is way more practical

Editor's pick
Arxiv· Today

Uneven Evolution of Cognition Across Generations of Generative AI Models

arXiv:2605.06815v1 Announce Type: new Abstract: The pursuit of artificial general intelligence necessitates robust methods for evaluating the cognitive capabilities of models beyond narrow task performance. Here, we introduce a psychometric framework to assess the cognitive profiles of generative AI, comparing them to human norms and tracking their evolution across generations. Initial evaluation of leading multimodal models using tasks adapted from the Wechsler Adult Intelligence Scale revealed a profoundly uneven cognitive architecture: near-ceiling performance in verbal comprehension and working memory (>$98^{\text{th}}$ percentile) contrasted with near-floor performance in perceptual reasoning (<$1^{\text{st}}$ percentile). To track developmental trajectories beyond human-normed limits, we developed the Artificial Intelligence Quotient (AIQ) Benchmark and applied it to six generations and two model families, revealing significant but asymmetric performance gains. Notably, we uncovered a sharp dissociation between modalities; abstract quantitative reasoning matured far more rapidly when presented linguistically compared to a visually analogous format, indicating an architectural bias towards language-based symbolic manipulation. While abstract visual reasoning improved, visual-perceptual organization remained largely stagnant. Collectively, these findings demonstrate that the cognitive abilities of generative models are evolving unevenly, suggesting that scaling and optimization approaches to AGI development alone may be insufficient to overcome fundamental architectural limitations in achieving balanced, human-like general intelligence.

Editor's pick
Arxiv· Today

When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment

arXiv:2605.06723v1 Announce Type: new Abstract: Language models often generate reasoning before giving a final answer, but the visible answer does not reveal when the model's answer preference became stable. We study this question through a narrow computable object: finite-answer preference stabilization. For a model state and specified answer verbalizers, we project the model's own continuation probabilities onto a finite answer set; in binary tasks this yields an exact log-odds code, $\delta(\xi)=S_\theta(\mathrm{yes}\mid\xi)-S_\theta(\mathrm{no}\mid\xi)$. This target defines parser-based answer onset, retrospective stabilization time, and lead without relying on greedy rollouts or learned probes. In controlled delayed-verdict tasks with Qwen3-4B-Instruct, the contextual finite-answer projection stabilizes before the answer is parseable, with 17-31 token mean lead in the main templates and positive, shorter lead in a parser-clean replication. The signal tracks the model's eventual output rather than truth, is linearly recoverable from compact hidden summaries, is partly separable from cursor progress, and transfers as shared information without a single invariant coordinate. Diagnostics separate the measurement from online stopping, verbalizer-free belief, and causal answer control; exact steering shows local sensitivity of $\delta$ but not reliable generation control.
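
The measured object is straightforward to reproduce with any open-weight causal LM: at successive prefixes of a reasoning trace, compare the next-token log-probabilities of the designated answer verbalizers and watch when the sign of the gap stabilizes. The sketch below uses GPT-2 and the verbalizers " yes"/" no" purely as an accessible stand-in; the paper works with Qwen3-4B-Instruct and its own templates.

# Track the finite-answer log-odds delta = log p("yes") - log p("no") at
# successive prefixes of a reasoning trace. GPT-2 and these verbalizers
# are stand-ins for illustration, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

yes_id = tok(" yes", add_special_tokens=False).input_ids[0]
no_id = tok(" no", add_special_tokens=False).input_ids[0]

question = "Is 17 a prime number? Think step by step."
trace = ["17 is odd,", " it is not divisible by 3,", " nor by 5 or 7,", " so it is prime."]

prefix = question
for step in trace:
    prefix += step
    prompt = prefix + "\nAnswer (yes or no):"
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    logp = torch.log_softmax(logits, dim=-1)
    delta = (logp[yes_id] - logp[no_id]).item()
    print(f"after '{step.strip()}': delta = {delta:+.2f}")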

Editor's pick · Technology
Arxiv· Today

AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites

arXiv:2605.06841v1 Announce Type: new Abstract: In model-based learning, the agent learns behaviors by simulating trajectories based on world model predictions. Standard world models typically learn a stationary transition function that maps states and actions to next states, when an action and an outcome frequently co-occur in training data, the model tends to internalize this correlation as a general causal rule while ignoring action preconditions. In interactive environments, however, agent actions can reshape the future affordance space. At each timestep, an action may becomes executable only after its prerequisites are met, or non-executable when they are destroyed. We term such events structure-changing events (SC events). As a result, a conventional world model often fails to determine whether a given action is executable in the current state, especially in multi-step predictions. Each imagined step is conditioned on an incorrect affordance state, and therefore the prediction error compounds over the rollout horizon. In this paper, we propose AGWM (Affordance-Grounded World Model), which learns an abstract affordance structure represented as a DAG of prerequisite dependencies to explicitly track the dynamic executability of actions. Experiments on game-based simulated environments demonstrate the effectiveness of our method by achieving lower multi-step prediction error, better generalization to novel configurations, and improved interpretability.
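
The bookkeeping the world model is meant to learn can be written down explicitly: actions carry prerequisite facts, executing an action adds and removes facts, and executability must be re-checked every step because earlier actions reshape the affordance set. A hand-coded version of that check follows, with invented crafting-style actions rather than anything from the paper.

# Explicit affordance bookkeeping: each action has prerequisite facts and
# effects that add/remove facts, so executability changes over the rollout.
# The actions and facts are invented for illustration.
ACTIONS = {
    "chop_tree":   {"requires": {"has_axe"},  "adds": {"has_wood"}, "removes": set()},
    "build_raft":  {"requires": {"has_wood"}, "adds": {"has_raft"}, "removes": {"has_wood"}},
    "cross_river": {"requires": {"has_raft"}, "adds": {"across"},   "removes": set()},
}

def executable(action, state):
    return ACTIONS[action]["requires"] <= state       # all prerequisites satisfied?

def apply(action, state):
    spec = ACTIONS[action]
    return (state | spec["adds"]) - spec["removes"]

state = {"has_axe"}
for action in ["build_raft", "chop_tree", "build_raft", "chop_tree", "cross_river"]:
    if executable(action, state):
        state = apply(action, state)
        print(f"{action}: executed -> {sorted(state)}")
    else:
        print(f"{action}: blocked (missing {sorted(ACTIONS[action]['requires'] - state)})")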

Editor's pick
Arxiv· Today

State Representation and Termination for Recursive Reasoning Systems

arXiv:2605.06690v1 Announce Type: new Abstract: Recursive reasoning systems alternate between acquiring new evidence and refining an accumulated understanding. Two design choices are typically left implicit: how to represent the evolving reasoning state, and when to stop iterating. This paper addresses both. We represent the reasoning state as an epistemic state graph encoding extracted claims, evidential relations, open questions, and confidence weights. We define the order-gap as the distance between the states reached by expand-then-consolidate versus consolidate-then-expand; a small order-gap suggests that the two orderings agree and further iteration is unlikely to help. Our main result gives a necessary and sufficient condition for the linearised order-gap to be non-degenerate near the fixed point, showing when the criterion is informative rather than algebraically vacuous. This is a local condition, not a global convergence guarantee. We apply the framework to recursive reasoning systems and sketch its application to agent loops, tree-of-thought reasoning, theorem proving, and continual learning.

Editor's pick
Daily Brew· Yesterday

Claude Knew It Was Being Tested. It Just Didn't Say So.

Anthropic researchers developed a tool to investigate whether AI models are aware of being tested, revealing unexpected behaviors in Claude.

AI Research & Science · 1 article
Editor's pick
Arxiv· Today

Randomness is sometimes necessary for coordination

arXiv:2605.06825v1 Announce Type: new Abstract: Full parameter sharing is standard in cooperative multi-agent reinforcement learning (MARL) for homogeneous agents. Under permutation-symmetric observations, however, a shared deterministic policy outputs identical action distributions for every agent, making role differentiation impossible. This failure can theoretically be resolved using symmetry breaking among anonymous identical processors, which requires randomness. We propose Diamond Attention, a cross-attention architecture in which each agent samples a scalar random number per timestep, inducing a transient rank ordering that masks lower-ranked peers from agent-to-agent attention while leaving task attention fully unmasked. This realizes a random-bit coordination protocol in a single broadcast round, and the set-based attention enables zero-shot deployment to teams of different sizes. We evaluate across three regimes that isolate when structured randomness matters. On the perfectly symmetric XOR game, our method achieves $1.0$ success while all deterministic baselines plateau near $0.5$. On control coordination tasks, a policy trained on $N=4$ generalizes zero-shot to $N \in [2,8]$. On SMACLite cross-scenario transfer, we achieve zero-shot transfer where standard baselines cannot transfer due to structural limitations. Furthermore, replacing the structured mask with standard dropout-based randomness results in a 0\% win rate, confirming that protocol-space structure, not stochastic noise, is the operative ingredient. https://anonymous.4open.science/r/randomness-137A/
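
The symmetry-breaking trick is concrete: every agent draws one scalar per timestep, the scalars induce a rank ordering, and agent-to-agent attention is masked so that each agent cannot attend to lower-ranked peers, while task attention stays unmasked. A numpy rendering of just that mask construction, not the full cross-attention architecture, is below.

# Rank-based symmetry breaking: per-timestep random scalars induce a rank
# ordering, and each agent masks out lower-ranked peers in agent-to-agent
# attention. Mask construction only; the attention stack itself is omitted.
import numpy as np

rng = np.random.default_rng(3)
n_agents = 5
scores = rng.random(n_agents)                 # one random scalar per agent
rank = scores.argsort().argsort()             # 0 = lowest, n-1 = highest

# mask[i, j] is True when agent i may attend to agent j:
# allow peers with rank >= own rank, so lower-ranked peers are hidden.
mask = rank[None, :] >= rank[:, None]
np.fill_diagonal(mask, True)                  # always attend to self

print("scores:", np.round(scores, 2))
print("ranks: ", rank)
print(mask.astype(int))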

AI Security & Cybersecurity · 6 articles
Editor's pick · Technology
Arxiv· Today

Towards Security-Auditable LLM Agents: A Unified Graph Representation

arXiv:2605.06812v1 Announce Type: new Abstract: LLM-based agentic systems are rapidly evolving to perform complex autonomous tasks through dynamic tool invocation, stateful memory management, and multi-agent collaboration. However, this semantics-driven execution paradigm creates a severe semantic gap between low-level physical events and high-level execution intent, making post-hoc security auditing fundamentally difficult. Existing representation mechanisms, including static SBOMs and runtime logs, provide only fragmented evidence and fail to capture cognitive-state evolution, capability bindings, persistent memory contamination, and cascading risk propagation across interacting agents. To bridge this gap, we propose Agent-BOM, a unified structural representation for agent security auditing. Agent-BOM models an agentic system as a hierarchical attributed directed graph that separates static capability bases, such as models, tools, and long-term memory, from dynamic runtime semantic states, such as goals, reasoning trajectories, and actions. These layers are connected through semantic edges and security attributes, transforming fragmented execution traces into queryable audit paths. Building on Agent-BOM, we develop a graph-query-based paradigm for path-level risk assessment and instantiate it with the OWASP Agentic Top 10. We further implement an auditing plugin in the OpenClaw environment to construct Agent-BOM from live executions. Evaluation on representative real-world agentic attack scenarios shows that Agent-BOM can reconstruct stealthy attack chains, including cross-session memory poisoning and tool misuse, capability supply-chain hijacking and unexpected code execution, multi-agent ecosystem hijacking, and privilege and trust abuse. These results demonstrate that Agent-BOM provides a unified and auditable foundation for root-cause analysis and security adjudication in complex agentic ecosystems.

Editor's pick · Technology
Daily Brew· Today

AI tool poisoning exposes a major flaw in enterprise agent security

Researchers have identified a significant security vulnerability in enterprise AI agents caused by tool poisoning.

Editor's pick · Technology
Daily Brew· Yesterday

Intent-based chaos testing is designed for when AI behaves confidently and wrongly

A new approach to chaos testing helps developers identify and mitigate risks when AI models provide confident but incorrect outputs.

Editor's pick · Financial Services
Arxiv· Today

Toward Individual Fairness Without Centralized Data: Selective Counterfactual Consistency for Vertical Federated Learning

arXiv:2605.07117v1 Announce Type: new Abstract: When algorithmic decisions depend on data distributed across institutions, how can we ensure that an individual's outcome does not change arbitrarily based on a protected attribute? We study this question in vertical federated learning (VFL), where features are split across parties, sensitive attributes may be private, and proxies for protected characteristics can be scattered across institutional boundaries under strict privacy constraints. Our focus is on individual-level counterfactual stability, i.e., per-instance prediction consistency under protected-attribute interventions as formalized in the causal fairness literature, rather than group parity guarantees such as demographic parity or equalized odds. We propose SCC-VFL, a server-centric framework for enforcing selective counterfactual consistency (SCC) at the individual level in VFL. SCC-VFL operationalizes a given policy specification by combining three components: (i) differentially private, graph-free discovery of feature roles into non-descendants, policy-permitted mediators, and impermissible proxies using only a formally private sketch of the sensitive attribute, with a formal per-release privacy that does not extend to the full training pipeline; (ii) masked counterfactual generation that edits only mediators while fixing non-descendants and suppressing proxy leakage; and (iii) server-side enforcement via an SCC consistency loss that penalizes impermissible prediction changes under protected-attribute interventions. Across three real-world datasets spanning credit, healthcare, and criminal justice, SCC-VFL maintains or improves predictive accuracy while sharply reducing decision flip rates by up to 98% relative to strong baselines. It also lowers attribute-inference attack success and improves robustness, demonstrating favorable utility-fairness-privacy trade-offs in realistic VFL deployments.

Editor's pick · Technology
Help Net Security· Today

Security teams are turning to AI to survive alert overload

Cybersecurity teams are expanding AI adoption across threat detection, incident response and security operations workflows.

Editor's pick · Technology
Khaleej Times· Yesterday

AI in cybersecurity: Smarter defence or a new generation of blind spots?

As UAE organisations automate cyber defence, experts warn AI can cut workloads but also hide missed threats — raising questions over visibility, governance and human oversight

Adoption, Deployment & Impact

15 articles
AI Applications · 8 articles
Editor's pick · Professional Services
Arxiv· Today

CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment

arXiv:2605.06702v1 Announce Type: new Abstract: Large language models (LLMs) have become a central foundation of modern artificial intelligence, yet their lifecycle remains constrained by a rigid separation between training and deployment, after which learning effectively ceases. This limitation contrasts with natural intelligence, which continually adapts through interaction with its environment. In this paper, we formalise deployment-time learning (DTL) as the third stage in the LLM lifecycle that enables LLM agents to improve from experience during deployment without modifying model parameters. We present CASCADE (CASe-based Continual Adaptation during DEployment), a general and principled framework that equips LLM agents with an explicit, evolving episodic memory. CASCADE formulates experience reuse as a contextual bandit problem, enabling principled exploration-exploitation trade-offs and establishing no-regret guarantees over long-term interactions. This design allows agents to accumulate, select, and refine task-relevant cases, transforming past experience into actionable knowledge. Across 16 diverse tasks spanning medical diagnosis, legal analysis, code generation, web search, tool use, and embodied interaction, CASCADE improves macro-averaged success rate by 20.9% over zero-shot prompting while consistently outperforming gradient-based and memory-based baselines. By reframing deployment as an adaptive learning process, this work establishes a foundation for continually improving AI systems.
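
Framing experience reuse as a contextual bandit reduces, at its simplest, to a familiar exploration-exploitation rule over stored cases. The toy below applies plain, non-contextual UCB1 to choose among three cases to make the selection mechanics concrete; the case store, reward model, and context-conditioning of the actual framework are not represented.

# Toy UCB1 selection over stored cases, as a stand-in for the
# exploration-exploitation trade-off in deployment-time case reuse.
# Case success probabilities are synthetic.
import numpy as np

rng = np.random.default_rng(4)
true_success = np.array([0.35, 0.55, 0.75])    # unknown quality of 3 stored cases
counts = np.zeros(3)
sums = np.zeros(3)

for t in range(1, 501):
    if t <= 3:
        case = t - 1                            # try each case once to initialize
    else:
        means = sums / counts
        ucb = means + np.sqrt(2 * np.log(t) / counts)
        case = int(np.argmax(ucb))
    reward = rng.random() < true_success[case]  # did reusing this case succeed?
    counts[case] += 1
    sums[case] += reward

print("pulls per case:", counts.astype(int))    # the best case dominates over time
print("estimated success:", np.round(sums / np.maximum(counts, 1), 2))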

Editor's pick · Technology
Arxiv· Today

What if AI systems weren't chatbots?

arXiv:2605.07896v1 Announce Type: new Abstract: The rapid convergence of artificial intelligence (AI) toward conversational chatbot interfaces marks a critical moment for the industry. This paper argues that the chatbot paradigm is not a neutral interface choice, but a dominant sociotechnical configuration whose widespread adoption reshapes social, economic, legal, and environmental systems. We examine how treating AI primarily as conversational assistants has extensive structural downsides. We show how chatbot-based systems often fail to adequately meet user needs, particularly in complex or high-stakes contexts, while projecting confidence and authority. We further analyze how the normalization of chatbot-mediated interaction alters patterns of work, learning, and decision-making, contributing to deskilling, homogenization of knowledge, and shifting expectations of expertise. Finally, we examine broader societal effects, including labor displacement, concentration of economic power, and increased environmental costs driven by sustained investment in large-scale chatbot infrastructures. While acknowledging legitimate benefits, we argue that the current trajectory of AI development reflects specific value choices that prioritize conversational generality over domain specificity, accountability, and long-term social sustainability. We conclude by outlining alternative directions for AI development and governance that move beyond one-size-fits-all chatbots, emphasizing pluralistic system design, task-specific tools, and institutional safeguards to mitigate social and economic harm.

Editor's pick · Education
Arxiv· Today

Cognitive Agent Compilation for Explicit Problem Solver Modeling

arXiv:2605.07040v1 Announce Type: cross Abstract: Large language models (LLMs) are widely used for tutoring, feedback generation, and content creation, but their broad pretraining makes them hard to constrain and poor substitutes for controllable learners. Educational systems often require inspectable and editable knowledge states: educators want to know what a system assumes the learner knows, and learners benefit when the system can justify actions in terms of explicit skills, misconceptions, and strategies. Inspired by cognitive architectures, we propose Cognitive Agent Compilation (CAC), a framework that uses a strong teacher LLM to compile problem-solving knowledge into an explicit target agent. CAC separates (i) knowledge representation, (ii) problem-solving policy, and (iii) verification and update rules, with the goal of making bounded problem solving more inspectable and editable in educational settings. We present an early proof of concept implemented with Small Language Models that surfaces key design trade-offs, particularly between explicit control and scalable generalization, and positions CAC as an initial step toward bounded-knowledge AI for educational applications.

Editor's pick · Manufacturing & Industrials
Bebeez· Today

ProcurePro Raises $11M to Deliver AI-Powered Procurement Control for Construction’s $13 Trillion Supply Chain

Backed by QIC Ventures, Airtree, and ISAI, the Brisbane-founded company will expand its AI product suite, scale internationally, and grow its team across key global markets. LONDON, BRISBANE, Australia and DUBAI, UAE, May 11, 2026 /PRNewswire/ — ProcurePro, the first end-to-end construction procurement platform, has secured US$11 million in a funding round led by QIC Ventures […]

Geopolitics, Policy & Governance

8 articles
AI Policy & Regulation · 5 articles
Editor's pick · Government & Public Sector
Arxiv· Today

Big AI's Regulatory Capture: Mapping Industry Interference and Government Complicity

arXiv:2605.06806v1 Announce Type: new Abstract: Over the past decade, the AI industry has come to exert an unprecedented economic, political and societal power and influence. It is therefore critical that we comprehend the extent and depth of pervasive and multifaceted capture of AI regulation by corporate actors in order to contend and challenge it. In this paper, we first develop a taxonomy of mechanisms enabling capture to provide a comprehensive understanding of the problem. Grounded in design science research (DSR) methodologies and extensive scoping review of existing literature and media reports, our taxonomy of capture consists of 27 mechanisms across five categories. We then develop an annotation template incorporating our taxonomy, and manually annotate and analyse 100 news articles. The purpose behind this analysis is twofold: validate our taxonomy and provide a novel quantification of capture mechanisms and dominant narratives. Our analysis identifies 249 instances of capture mechanisms, often co-occurring with narratives that rationalise such capture. We find that the most recurring categories of mechanisms are Discourse & Epistemic Influence, concerning narrative framing, and Elusion of law, related to violations and contentious interpretations of antitrust, privacy, copyright and labour laws. We further find that Regulation stifles innovation, Red tape and National Interest are the most frequently invoked narratives used to rationalise capture. We emphasize the extent and breadth of regulatory capture by coalescing forces -- Big AI and governments -- as something policy makers and the public ought to treat as an emergency. Finally, we put forward key lessons learned from other industries along with transferable tactics for uncovering, resisting and challenging Big AI capture as well as in envisioning counter narratives.

Editor's pick · Government & Public Sector
Theregister· Today

ASIA IN BRIEF: China’s agentic AI policy wants to keep humans in the loop

PLUS: Robot becomes Buddhist monk in Korea; TikTok spending $25bn in Thailand; Baidu floating chip biz; and more!

Editor's pick · Government & Public Sector
Council on Foreign Relations· Yesterday

How Trump Should Approach AI Talks With China

At the upcoming Trump-Xi summit, Beijing will not negotiate in good faith on AI safety. A narrowly scoped dialogue paired with maximum pressure on export controls is the only way to shift Beijing’s calculus and secure long-term AI safety.

© 2026 Best Practice AI Ltd. All rights reserved.

Get the full executive brief

Receive curated insights with practical implications for strategy, operations, and governance.

AI Daily Brief — leaders actually read it.
