AI Intelligence Brief

Thu 14 May 2026

Daily Brief — Curated and contextualised by Best Practice AI

130 Articles
Editor's Highlights

Benchmarks Mislead, MIT Finds ROI, and California Reaps AI's Rewards

TL;DR Agent benchmarks face scrutiny for reward hacking, which can make assessments of AI competence misleading. An MIT Technology Review Insights report finds that organizations controlling their AI data, infrastructure, models, and governance achieve 5x the ROI. California's budget is benefiting from the AI boom, with no deficit projected for the coming years. Meanwhile, evolving AI pricing models may raise costs for companies.

Editor's highlights

The stories that matter most

Selected and contextualised by the Best Practice AI team

7 of 130 articles
Lead story
Editor's pick · Technology
Arxiv· Today

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

arXiv:2605.12673v1 Announce Type: new Abstract: Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue that benchmarks must be secure by design. From past incidents of reward hacks, we derive a taxonomy of eight recurring flaw patterns and compile them into the Agent-Eval Checklist for benchmark designers. We condense the insights into BenchJack, an automated red-teaming system that drives coding agents to audit benchmarks and identify possible reward-hacking exploits in a clairvoyant manner. Moreover, we extend BenchJack to an iterative generative-adversarial pipeline that discovers new flaws and patches them iteratively to improve benchmark robustness. We apply BenchJack to 10 popular agent benchmarks spanning software engineering, web navigation, desktop computing, and terminal operations. BenchJack synthesizes reward-hacking exploits that achieve near-perfect scores on most of the benchmarks without solving a single task, surfacing 219 distinct flaws across the eight classes. Moreover, BenchJack's extended pipeline reduces the hackable-task ratio from near 100% to under 10% on four benchmarks without fatal design flaws, fully patching WebArena and OSWorld within three iterations. Our results show that evaluation pipelines have not internalized an adversarial mindset, and that proactive auditing could help close the security gap for the fast-paced benchmarking space.
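The abstract's generative-adversarial audit loop is easy to make concrete. Below is a minimal sketch, assuming a toy benchmark representation; the flaw pattern and patch logic are illustrative stand-ins, not BenchJack's actual taxonomy or API.

```python
# Illustrative discover-and-patch loop in the spirit of the abstract.
# Stand-ins only: the real system drives coding agents against live benchmarks.

def find_exploits(benchmark):
    """Stand-in red-team step: flag tasks whose reward check can be gamed
    (here, any task whose checker only matches the output string)."""
    return [t for t in benchmark if t["checker"] == "string_match_only"]

def patch(task):
    """Stand-in patch: swap a gameable checker for a stricter one."""
    return {**task, "checker": "verify_side_effects"}

def harden(benchmark, max_rounds=3):
    """Alternate exploit discovery and patching until no hackable tasks remain."""
    for _ in range(max_rounds):
        exploits = find_exploits(benchmark)
        if not exploits:
            break
        benchmark = [patch(t) if t in exploits else t for t in benchmark]
    hackable_ratio = len(find_exploits(benchmark)) / len(benchmark)
    return benchmark, hackable_ratio

bench = [{"id": 1, "checker": "string_match_only"},
         {"id": 2, "checker": "verify_side_effects"}]
bench, ratio = harden(bench)
print(f"hackable-task ratio after patching: {ratio:.0%}")  # 0%
```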

Editor's pick
PR Newswire· Today

Sovereignty Is the New Operating System for Agentic AI, New MIT Technology Review Insights Report Finds

The research explains how organizations "Deeply Committed" to controlling their data, infrastructure, models, and governance are delivering 5x the ROI on generative and agentic AI initiatives—at a moment when more than half of enterprises already have autonomous agents in production making real-time decisions on operational data. The report, Establishing AI and Data Sovereignty in the Age of Autonomous Systems...

Editor's pick · Professional Services
VentureBeat· Yesterday

Frontier AI models don't just delete document content — they rewrite it, and the errors are nearly impossible to catch

As large language models become more capable, users are tempted to delegate knowledge tasks where models process documents on their behalf and provide the finished results. But how far can you trust the model to stay faithful to the content of your documents when it has to iterate over them across multiple rounds? A new study by researchers at Microsoft shows that large language models silently corrupt documents that they work on by introducing errors. The researchers developed a benchmark that simulates multi-step autonomous workflows across 52 professional domains, using a method that automatically measures how much content degrades over time. Their findings show that even top-tier frontier models corrupt an average of 25% of document content by the end of these workflows. And providing models with agentic tools or realistic distractor documents actually worsens their performance. This serves as a warning that while there is increasing pressure to automate knowledge work, current language models are not fully reliable for these tasks.

The mechanics of delegated work

The Microsoft study focuses on "delegated work," an emerging paradigm where users allow LLMs to complete knowledge tasks on their behalf by analyzing and modifying documents. A prominent example of this paradigm is vibe coding, where a user delegates software development and code editing to an AI. But delegated workflows extend far beyond programming into other domains. In accounting, for example, a user might supply a dense ledger and instruct the model to split the document into separate files organized by specific expense categories. Because users might lack the time or the specialized expertise to manually review every modification the AI implements, delegation often hinges on trust. Users expect that the model will faithfully complete tasks without introducing unchecked errors, unauthorized deletions, or hallucinations in the documents.

To measure how far AI systems can be trusted in extended, iterative delegated workflows, the researchers developed the DELEGATE-52 benchmark. The benchmark is composed of 310 work environments spanning 52 diverse professional domains, including financial accounting, software engineering, crystallography, and music notation. Each work environment relies on real-world seed text documents ranging from 2,000 to 5,000 tokens. Alongside the seed document, the environments include five to ten complex, non-trivial editing tasks.

Grading a complex, multi-step editing process usually requires expensive human review. DELEGATE-52 bypasses this by using a "round-trip relay" simulation method that evaluates answers without requiring human-annotated reference solutions. The approach is inspired by the backtranslation technique used in machine translation evaluation, where an AI model is told to translate a document from one language to another and back to see how perfectly it reproduces the original version. Accordingly, every edit task in DELEGATE-52 is designed to be fully reversible, pairing a forward instruction with its precise inverse. For example, an instruction to split the ledger into separate files by expense category is paired with an instruction to merge all category files back into a single ledger. In comments provided to VentureBeat, Philippe Laban, Senior Researcher at Microsoft Research and co-author of the paper, clarified that this is not simply a test of whether an AI can hit "undo."
Because human workers cannot be forced to instantly "forget" a task they just did, this round-trip evaluation is uniquely suited for AI. By starting a new conversational session, the researchers force the model to attempt the inverse task completely independently. The models in their experiments "do not know whether a task is a forward or backward step and are unaware of the overall experiment design," Laban explained. "They are simply attempting each task as thoroughly as they can at each step."

These round-trip tasks are chained together into a continuous relay to simulate long-horizon workflows spanning 20 consecutive interactions. To make the environment more realistic, the benchmark introduces distractor files in the context of each task. These contain 8,000 to 12,000 tokens of topically related but completely irrelevant documents. Distractors measure whether the AI can maintain focus or whether it gets confused and pulls in the wrong data.

Testing frontier models in the relay

To understand how different architectures and scales handle delegated work, the researchers tested 19 different language models from OpenAI, Anthropic, Google, Mistral, xAI, and Moonshot. The main experiment subjected these models to a simulation of 20 consecutive editing interactions. Across all models, documents suffered an average degradation of 50% by the end of the simulation. Even the best frontier models in the experiment, specifically Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4, corrupted an average of 25% of the document content.

Out of 52 professional domains, Python was the only one where most models achieved a ready status with a score of 98% or higher. Models excel in programmatic tasks but struggle severely in natural language and niche domains like fiction, earnings statements, or recipes. The overall top model, Gemini 3.1 Pro, was deemed ready for delegated work in only 11 out of the 52 domains.

Interestingly, the corruption was not a death by a thousand cuts in which the models slowly accumulate tiny errors. Instead, about 80% of total degradation is caused by sparse but massive critical failures: single interactions where a model suddenly drops at least 10% of the document's content. The frontier models do not necessarily avoid small errors better; they simply delay these catastrophic failures to later rounds.

Another important observation is that when weaker models fail, their degradation originates primarily from content deletion. However, when frontier models fail, they actively corrupt the existing content. The text is still there, but it has been subtly distorted or hallucinated, making it much harder for a human overseer to detect the error.

Notably, giving models an agentic harness with generic tools for code execution and file read/write access actually worsened their performance, adding an average of 6% more degradation. Laban explained that the failure lies in relying on generic tools rather than domain-specific ones. "Models lack the capability to write effective programs on the fly that can manipulate files across diverse domains without mistakes," he noted. "When they cannot do something programmatically, they resort to reading and rewriting entire files, which is less efficient and more error prone." The solution for developers is to build tightly scoped tools (such as specific functions to calculate or move entries within .ledger files) to keep agents on track.
Degradation also snowballs as documents get larger or as more distractor files are added to the workspace. For enterprise teams investing heavily in retrieval-augmented generation (RAG), these distractor documents serve as a direct warning about the compounding cost of messy context. While a noisy context window might cause a minimal 1% performance drop after just two interactions, that degradation compounds to a 2-8% drop over a long simulation. "For the retrieval community: RAG pipelines should be evaluated over multi-step workflows, not just single-turn retrieval benchmarks," Laban said. "Single-turn measurements systematically underestimate the harm of imprecise retrieval."

Reality check for the autonomous enterprise

The findings from the DELEGATE-52 benchmark offer a critical reality check for the current hype surrounding fully autonomous AI agents. The benchmark's design also implies a practical constraint: because models can maintain a clean record for several steps before a sudden catastrophic failure, incremental human review is necessary — not a single final check. Laban recommends building AI applications around short, transparent tasks rather than complex long-horizon agents.

For organizations wanting to deploy autonomous agents safely today, the DELEGATE-52 methodology provides a practical blueprint for testing in-house data pipelines. Laban explained that "… an enterprise team wanting to adopt this framework needs to build three components: (a) a set of reversible editing tasks representative of their workflows, (b) a parser that converts their domain documents into a structured representation, and (c) a similarity function that compares two parsed representations." Teams do not even need to build parsers from scratch: the Microsoft research team successfully repurposed existing parsing libraries for 30 of the 52 domains tested.

Laban is optimistic about the rate of improvement. "Progress is real and fast. Looking at the GPT family alone, models go from scoring below 20% to around 70% in 18 months," Laban said. "If that trajectory continues, models will soon be able to achieve saturated scores on DELEGATE-52." However, Laban cautioned that DELEGATE-52 is purposefully small compared to massive enterprise environments. Even as foundation models inevitably master this benchmark, the endless long tail of unique enterprise data and workflows means organizations will always need to invest in custom, domain-specific tooling to keep their autonomous agents reliable.
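To make the round-trip relay concrete, here is a minimal sketch built on the three components Laban lists: reversible task pairs, a parser, and a similarity function. The `edit_with_model` callable and the lossy toy model below are placeholders of our own, not Microsoft's code; they exist only to show how a degradation curve is produced.

```python
import random

def round_trip_relay(seed_doc, task_pairs, edit_with_model, parse, similarity,
                     rounds=20):
    """Chain forward/inverse edits and track drift from the seed document."""
    doc, reference = seed_doc, parse(seed_doc)
    scores = []
    for step in range(rounds):
        forward, inverse = task_pairs[step % len(task_pairs)]
        doc = edit_with_model(doc, forward)  # fresh session: no memory of prior steps
        doc = edit_with_model(doc, inverse)  # model can't tell this is an inverse
        scores.append(similarity(parse(doc), reference))  # 1.0 = nothing lost
    return scores

# Toy stand-in "model" that sparsely drops lines, mimicking critical failures.
random.seed(0)
def lossy_edit(doc, _task):
    return "\n".join(l for l in doc.splitlines() if random.random() > 0.02)

parse = lambda doc: set(doc.splitlines())              # component (b): parser
similarity = lambda a, b: len(a & b) / max(len(b), 1)  # component (c): similarity

seed = "\n".join(f"ledger entry {i}" for i in range(100))
trace = round_trip_relay(seed, [("split by category", "merge back")],  # (a)
                         lossy_edit, parse, similarity)
print(f"content retained after 20 rounds: {trace[-1]:.0%}")
```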

Economics & Markets

24 articles
AI Investment & Valuations · 6 articles
AI Macroeconomics · 2 articles
Editor's pick
Arxiv· Today

Modelling the Index of Sustainable Economic Welfare (ISEW) and its response to policies

arXiv:2602.21971v2 Announce Type: replace Abstract: Given the challenge of achieving societal welfare in an environmentally sustainable way, the Index of Sustainable Economic Welfare (ISEW) has emerged as an alternative indicator of progress in response to critiques of Gross Domestic Product (GDP). The ISEW compares the benefits of economic activity with its social and environmental costs. So far, most studies empirically analyse the ISEW for past developments, while no studies have simulated the ISEW using a dynamic macroeconomic model. We address this important gap by incorporating the ISEW into COMPASS, an ecological macroeconomic model that features the Doughnut of biophysical boundaries and social thresholds. First, we analyse how the ISEW is affected by three social and environmental policies: a carbon tax, income redistribution, and working-time reduction. We find that the ISEW grows in nearly all scenarios: the strongest improvement over business-as-usual arises when all policies are combined, and the individual policies mostly affect the ISEW positively; only in the case of working-time reduction does the ISEW decrease. Our study underscores the benefit of dynamically modelling the ISEW for anticipating the net effect of multiple impulses and their interconnections on the indicator. Second, we explore how the ISEW compares to GDP and the Doughnut when evaluating social and environmental policies. Our results suggest that the ISEW is better than GDP at capturing their effects, but it omits the full environmental costs of growth. We argue that the Doughnut, with its comprehensive picture of biophysical boundaries and social thresholds, provides better guidance for policymakers striving for sustainable wellbeing.
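For readers unfamiliar with the index, a stylized decomposition in the spirit of Daly and Cobb's original formulation is shown below; component lists vary across studies, and the paper's COMPASS implementation may differ.

```latex
\mathrm{ISEW} \;=\; C_{\mathrm{adj}} \;+\; G_{\mathrm{nondef}} \;+\; W_{\mathrm{hh}}
\;-\; D_{\mathrm{def}} \;-\; E_{\mathrm{env}} \;-\; N_{\mathrm{dep}}
```

Here C_adj is inequality-adjusted personal consumption, G_nondef non-defensive public expenditure, W_hh the imputed value of household and care labour, D_def defensive private expenditures, E_env the costs of environmental degradation, and N_dep the depletion of natural capital.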

AI Market Competition · 5 articles
AI Pricing & Cost Curves · 2 articles
Editor's pick · Technology
VentureBeat· Yesterday

Anthropic reinstates OpenClaw and third-party agent usage on Claude subscriptions — with a catch

Good news, OpenClaw fans — you can once again use your Claude AI subscription to power the hit open-source autonomous AI agentic harness! But there's a big catch in how it's being enacted.

A few hours ago, Anthropic announced via its official developer communications account on X, @ClaudeDevs, that it is changing its paid Claude subscription tiers, introducing a new subcategory of "Agent SDK" credits for all paid subscribers, which they can allocate specifically for "programmatic" uses, including external, third-party agents such as OpenClaw.

The move is a major reversal of the policy Anthropic introduced in early April 2026, which expressly prohibited its AI subscriptions from being used to power these kinds of non-Anthropic agents and harnesses after Anthropic said they caused capacity and service issues. The problem was that some Claude subscribers were paying $20 to $200 per month under Anthropic's Claude Pro and Max subscriptions but consuming hundreds, even thousands, of dollars' worth of tokens (units of information) beyond those prices through their OpenClaw (and similar autonomous) agents. This was an unsustainable position for Anthropic's finances and its limited compute infrastructure for serving model inference to end users.

To be clear, even when it enacted the old prohibition against OpenClaw and similar agents last month, Anthropic never fully cut off the capability for Claude to be used in OpenClaw. Rather, it redirected users to pay through the company's application programming interface (API), which is billed by usage (priced per million tokens, rather than a flat monthly rate as the subscriptions offer), or to pay for extra usage credits atop their subscriptions.

Now, Anthropic is giving Claude subscribers another way to use their subscription bill to pay for third-party agents. However, the restoration comes with a significant catch: programmatic usage is no longer subsidized by the general subscription pool but is instead restricted to a fixed, non-rollover monthly credit, also worth $20-$200 depending on your Claude plan, and billed at API rates. In other words, if you don't end up using these new Agent SDK credits, they simply expire at the end of the month. And if you do use them all up, you cannot dip into your general subscription usage limits to cover any additional usage — you'll need to buy extra usage credits instead.

Why did Anthropic block Claude subscriptions from OpenClaw (and other third-party agentic AI harnesses) in the first place?

To understand why this restoration matters, one must look at the technical friction that led to the initial ban on April 4, 2026. Anthropic's first-party tools, such as Claude Code and Claude Cowork, are engineered to maximize "prompt cache hit rates"—a method of reusing previously processed text to save on expensive compute cycles. Third-party tools like OpenClaw, which allow users to run autonomous agents through external services like Discord or Telegram, were often unoptimized for these efficiencies. Boris Cherny, Head of Claude Code, noted that these third-party services were "really hard for us to do sustainably" because they bypassed the caching mechanisms that allow Anthropic to offer flat-rate subscriptions. The sheer volume of data being re-processed by inefficient agents was threatening the stability of the system for the broader user base.
Even with Anthropic’s massive expansion into new hardware—including access to the 300MW Colossus 1 data center and its 220,000+ GPUs—the demand for agentic workflows was outpacing sustainable supply. The new "Agent SDK credit" system solves this technical bottleneck by shifting the cost of inefficiency back to the user. By providing a dedicated dollar-amount credit, Anthropic no longer has to "eat the difference" on unoptimized third-party code. If an agent is inefficient and burns through tokens, it simply drains the user's new $20 to $200 Agent SDK credit budget faster, rather than exceeding the value of Anthropic's fixed monthly subscription tiers.

Anthropic's new programmatic credit system

The restoration of third-party access is segmented across Anthropic’s billing tiers, creating a new hierarchy of "programmatic power." Here's how much Anthropic is giving each user in new, dedicated Agent SDK credits each month, on top of their existing subscription plan and their normal Claude usage through Anthropic products such as Claude Code and Claude Cowork:

Pro: $20 per month. Individual scripts and light SDK use.
Max 5x: $100 per month. Moderate agentic automation.
Max 20x: $200 per month. Professional-grade dev environments.
Team (Premium): $100 per seat per month. Collaborative team automation.
Enterprise (Premium): $200 per seat per month. Seat-based high-scale enterprise use.

This system introduces a sharp divide between "interactive" and "programmatic" workflows. If you are chatting with Claude in a browser or using Claude Code in a terminal to write code interactively, you are still drawing from your standard, high-capacity subscription limits. As Anthropic technical staffer Lydia Hallie wrote in a post on X, "To add some clarity: you don't pay extra. It's the same subscription, same price per month." Hallie also shared a diagram illustrating how the new Agent SDK credits work.

However, the moment you use the claude -p command for non-interactive tasks, run a GitHub Action, or connect a third-party tool like OpenClaw, the system switches to the dedicated Agent SDK credit. Once the Agent SDK credit limit ($20 for Pro plans, $100 for Max 5x, etc.) is exhausted, programmatic usage stops unless the user has enabled "extra usage" billing, which is charged at standard, pay-as-you-go API rates. Crucially, for those who found the original subscription model to be an infinite resource, this is a hard cap. Credits do not roll over, meaning the "use it or lose it" nature of the system forces a monthly reset of the developer’s budget.

Strategic implications

The licensing implications of this move are profound for the "agentic" ecosystem. By explicitly allowing third-party apps like Conductor and OpenClaw to authenticate via the Agent SDK, Anthropic is legitimizing a workflow it had previously attempted to block. However, in doing so, it has ended the era of "compute arbitrage." In the early part of 2026, a $20 Pro subscription could be leveraged via OpenClaw to run agents that would cost hundreds of dollars on a standard API key. By moving to a metered credit, Anthropic is aligning its subscription model with its Developer Platform (API). While it offers a "free" buffer for subscribers, it ensures that high-volume, production-level automation is moved to predictable, token-based billing. This protects the company's margins while still offering a "sandbox" for developers to experiment without the immediate overhead of an API-first account.
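Mechanically, the new billing reads like a three-bucket routing rule. A minimal sketch follows, using the plan amounts reported above; the routing logic is our reading of the announcement, not Anthropic's metering code.

```python
# Plan amounts from the article; the routing logic is an interpretation,
# not Anthropic's implementation.
AGENT_SDK_CREDIT = {"pro": 20, "max_5x": 100, "max_20x": 200,
                    "team_premium": 100, "enterprise_premium": 200}  # USD/month

def route_request(plan, is_programmatic, sdk_spent, cost,
                  extra_usage_enabled=False):
    """Decide which bucket a request draws from."""
    if not is_programmatic:
        return "subscription"        # browser chat, interactive Claude Code
    if sdk_spent + cost <= AGENT_SDK_CREDIT[plan]:
        return "agent_sdk_credit"    # claude -p, GitHub Actions, OpenClaw
    if extra_usage_enabled:
        return "pay_as_you_go_api"   # standard API rates
    return "blocked"                 # hard cap; unused credit expires monthly

print(route_request("pro", True, sdk_spent=19.50, cost=1.00))  # -> blocked
```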
Community reactions are perhaps unsurprisingly negative

While Anthropic executives framed the update as a "simplification," the developer community has largely branded it a significant reduction in the value of their subscriptions. The backlash focuses on the sharp disparity between the previous effective usage and the new, metered reality.

Popular AI YouTuber and developer Theo Browne (@theo) of T3.gg warned developers that this change constitutes a massive devaluation for those using external tools. "If you use any of the following with your Claude sub, your usage just got cut by 25x," Theo stated, listing T3 Code, Conductor, Zed, and Jean as affected platforms. He concluded with a sharp warning: "They’re disguising this as 'free credits'. Don’t fall for it."

Kun Chen, a solo builder and former L8 engineer at Meta, Microsoft, and Atlassian, interpreted the move as a full surrender of Anthropic's market lead. "it's official. Anthropic pulled the plug on ALL programmatic use of claude subscription," Chen posted, adding that he had found himself "increasingly bullish about OpenAI" as a result. Chen argued that "Anthropic's only lead was on coding, and gpt 5.5 has flipped that already," signaling a potential migration of elite developer talent.

Other builders questioned the practical utility of the credits offered. Ben Hylak, co-founder and chief technology officer at AI agent observability and governance startup Raindrop.ai, voiced concern over the sustainability of Anthropic's infrastructure. "this is either really silly, or shows how bad of a spot anthropic is in re: gpus," Hylak noted, before bluntly asking users to "guess how many turns $20 in API credits last."

The frustration extended to the marketing of the change. EverNever, creator of inkstone.uk, expressed disbelief at the framing of the policy: "Wait what?! You take away more ways to utilize the subscription I am paying for?! And you dare to make it look like a win?" This sentiment highlights a growing rift between Anthropic and its power-user base, who feel that previously inclusive features are being rescinded under the guise of an "upgrade."

The bottom line for Anthropic subscribers and AI builders

Anthropic’s "restoration" is a tactical move to retain developers while strictly managing the physical limits of compute. By June 15, the "agentic" era for Claude subscribers will be a metered one. The company has successfully reclaimed control over its margins, even if it has cost it some of the goodwill of its most vocal power users. Still, for the individual developer or enterprise AI builder relying on Anthropic models for OpenClaw, it's clearly an improvement over the blanket ban from last month.

AI Productivity · 5 articles
Editor's pick · Technology
VentureBeat· Today

Enterprises can now train custom AI models from production workflows — no ML team required

Every query an enterprise AI application processes, every correction a subject matter expert makes to its output — that interaction is training data. Most organizations are not capturing it. The production workflows companies have already built are generating a continuous signal that could improve AI models, and it is disappearing.

San Francisco-based Empromptu AI on Thursday launched Alchemy Models with a straightforward premise: the AI applications enterprises are already building are generating training data, and most of it is going to waste. The platform captures that signal automatically, routing validated outputs from subject matter experts back into a fine-tuning pipeline that improves the model over time. Enterprises own the resulting weights outright.

It sits in different territory from both RAG and traditional fine-tuning. RAG retrieves external context at inference time without modifying model weights. Traditional fine-tuning changes weights but requires separately assembled labeled datasets and a dedicated ML pipeline. Alchemy does the latter continuously, using the enterprise application itself as the data source.

Companies adopting foundation model APIs face three compounding constraints: inference costs that scale with usage, no ownership of the models their data is effectively training, and limited ability to customize behavior for domain-specific tasks. Empromptu CEO Shanea Leven says those constraints are widely felt but rarely addressed. "Every customer, everybody that I talk to, is like, how am I not going to get disrupted? How am I going to protect my business? And they just don't see the path," Leven told VentureBeat in an exclusive interview.

How Alchemy builds a model from a running application

Most custom model training approaches require companies to separately collect, clean and label data before any fine-tuning can begin. Alchemy takes a different path: the enterprise application itself generates and cleans the training data. The mechanism runs through Empromptu's Golden Data Pipelines infrastructure in two stages. Before an app is built, enterprise data is cleaned, extracted and enriched so the application starts with structured inputs. Once it is running, every output it generates goes back through the pipeline, where subject matter experts inside the organization review and correct it. That validated output becomes the training data for the next fine-tuning run. "The app, the AI application that customers are already creating, cleans the data," Leven said.

The resulting fine-tuned models are what Empromptu calls Expert Nano Models: small, task-specific models optimized for a particular workflow rather than general-purpose reasoning. Evals, guardrails and compliance controls run within the same pipeline, so governance travels with the training process. Customers own the model weights outright. Empromptu hosts and runs inference on its infrastructure, but the weights are portable and exportable for a fee. The platform is model agnostic, supporting Llama, Qwen and other base models.

The hard constraint is data volume. Early deployments run on the base model while the application accumulates enough production data to trigger a useful fine-tuning run. Leven acknowledged the timeline without sugarcoating it. "Training the model will just take time," she said.

Alchemy differs from managed fine-tuning on who does the work

OpenAI's fine-tuning API and AWS Bedrock custom models both offer enterprise fine-tuning.
Both require organizations to bring separately prepared training datasets and manage the fine-tuning process outside their application stack. The burden of data curation and model evaluation sits with the customer's ML team. Alchemy's differentiation is process integration. The training data is generated by the enterprise application itself, so there is no separate data preparation step and no ML expertise required. The application workflow is the pipeline. "Do I need to have Bedrock and go spin up another ML team to go figure out how to fine tune a model and figure out all of that infrastructure? No, anyone can do it now," Leven said.

The tradeoff is platform dependency. Alchemy only works within the Empromptu environment. Enterprises that want the same outcome on existing infrastructure would need to replicate the data capture, validation and fine-tuning pipeline themselves.

A behavioral health company cut session documentation time by up to 87% using Alchemy

Empromptu is targeting regulated and data-intensive verticals first: healthcare, financial services, legal technology, retail and revenue forecasting. These are sectors where general-purpose model outputs carry the highest mismatch risk and proprietary workflow data is most concentrated.

Among the early users is behavioral health company Ascent Autism, which uses Alchemy to automate session documentation and parent communication. Facilitators use learner session recordings, transcripts, session notes and behavioral metrics to generate structured notes and personalized parent updates. That workflow previously required one to two hours of writing per session. With Alchemy training on the same data, it now takes 10 to 15 minutes.

"Relying solely on API-based models can become expensive quickly," Faraz Fadavi, co-founder and CTO of Ascent Autism, told VentureBeat. "Alchemy gave us a way to structure the workflow, train models on our own data, and reduce costs while improving output quality over time." Fadavi said the company saw usable outputs quickly, with continued improvement as the system refined its models. Evaluation criteria went beyond accuracy to include traceability to session data and output consistency with the company's clinical voice.

"We wanted a system that could learn our workflow and produce outputs aligned with how we actually operate — not just summarize text," he said. The practical test: how much facilitators need to edit, whether the output matches their voice and whether it meaningfully reduces time spent. Facilitators have shifted from rewriting generated notes to editing and quality-checking them.

What this means for enterprises

The data flywheel is real — but so is the platform lock-in. Every workflow is a training opportunity. Enterprises that capture and validate outputs from their production AI applications will compound that advantage over time. More usage generates more training signals, which produces more accurate domain-specific models, which generate better outputs, which produce cleaner training data in the next cycle.

Leven positions Alchemy as a third architectural choice. Enterprises have spent the past two years choosing between RAG for domain knowledge access and fine-tuning for model specialization. Workflow-driven model training is a third option, combining the ongoing improvement of fine-tuning with the operational simplicity of building inside a managed platform. "Having that data moat is the most valuable currency," Leven said.
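As described, the loop is simple: capture each output, have a subject matter expert validate it, and fine-tune once enough validated pairs accumulate. A minimal sketch with a stub model follows; every name and the volume threshold are hypothetical illustrations, not Empromptu's API.

```python
class StubModel:
    """Stand-in for a hosted base model (Llama, Qwen, ...)."""
    def generate(self, prompt):
        return f"draft note for: {prompt}"
    def finetune(self, examples):
        print(f"fine-tuning nano model on {len(examples)} validated examples")
        return self

VALIDATED = []
FINETUNE_THRESHOLD = 3  # illustrative; real runs need far more production data

def handle_request(model, user_input, sme_review):
    draft = model.generate(user_input)
    corrected = sme_review(user_input, draft)   # expert reviews and corrects
    VALIDATED.append({"prompt": user_input, "completion": corrected})
    if len(VALIDATED) >= FINETUNE_THRESHOLD:    # enough signal accumulated
        model = model.finetune(list(VALIDATED))
        VALIDATED.clear()
    return corrected, model

model = StubModel()
approve = lambda q, draft: draft + " [SME corrected]"
for q in ["session A", "session B", "session C"]:
    _, model = handle_request(model, q, approve)
```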

Editor's pick
Arxiv· Today

Structural Diversity Drives Disruptive Scientific Innovation

arXiv:2605.12514v1 Announce Type: cross Abstract: Scientific innovation increasingly depends on collaboration, yet the organizational structure that fosters breakthrough ideas remains poorly understood. Existing metrics - such as team size or compositional diversity - capture readily observable characteristics but not the deeper architecture of collaboration. We introduce Structural Diversity (SD): the extent to which a team bridges multiple distinct knowledge communities within its prior collaboration network. Using a century-scale dataset of 260 million scientific publications (1900-2025) and combining causal inference with a quasi-natural experiment based on a U.S. National Science Foundation policy change in 2012, we show that SD is a powerful and robust predictor of disruptive innovation, outperforming traditional team novelty indicators such as team freshness and edge density. Moreover, SD positively interacts with team size and is able to mitigate the well-known "curse of scale" by transforming scale from a liability into a resource for creative synthesis. We find that one mechanism underlying this effect is Disciplinary Integration (DI): teams with higher SD can more effectively combine heterogeneous knowledge into novel configurations. Our findings position SD as both a new theoretical construct and an actionable design principle for organizing scientific collaboration. By linking the architecture of team assembly to the dynamics of creative discovery, our work offers a structural explanation for how collective intelligence can be systematically engineered to foster disruptive innovation.
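One plausible way to operationalize the construct: detect knowledge communities in the team's prior collaboration network, then score the team by how evenly its members spread across them. The sketch below is an illustrative estimator of ours, not the paper's exact definition.

```python
import math
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def structural_diversity(prior_collab_graph, team):
    """Normalized entropy of a team's spread over detected communities:
    0.0 = everyone from one community, 1.0 = evenly bridging many."""
    communities = list(greedy_modularity_communities(prior_collab_graph))
    membership = {node: i for i, c in enumerate(communities) for node in c}
    counts = {}
    for member in team:
        c = membership.get(member)
        if c is not None:
            counts[c] = counts.get(c, 0) + 1
    total = sum(counts.values())
    if total == 0 or len(communities) < 2:
        return 0.0
    h = -sum((n / total) * math.log(n / total) for n in counts.values())
    return h / math.log(len(communities))

G = nx.karate_club_graph()  # stand-in for a prior co-authorship network
print(structural_diversity(G, team=[0, 9, 33]))
```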

Editor's pick · PAYWALL
Bloomberg· Yesterday

Investors Should Focus on AI's Long-Term Value Migration: JPMorgan AM

Joanna Shen of JPMorgan Asset Management says the firm believes we are "in the early adoption AI phase." She tells Bloomberg Television that AI agents are "the first technology in decades that can supercharge the labor inputs." (Source: Bloomberg)

Labor, Society & Culture

29 articles
AI & Culture · 2 articles
Editor's pick
Arxiv· Today

BEHAVE: A Hybrid AI Framework for Real-Time Modeling of Collective Human Dynamics

arXiv:2605.12730v1 Announce Type: new Abstract: Existing AI systems for modeling human behavior operate at the level of individuals or detect events after they occur. As a result, they systematically fail to capture the collective dynamics that determine whether a group remains stable or transitions into escalation or breakdown. We propose a different foundation: a group of interacting humans constitutes a complex dynamical system in the precise mathematical sense, exhibiting emergence, nonlinearity, feedback loops, sensitivity near critical points, and phase transitions between qualitatively distinct regimes. The state of such a system is not located within any single participant; it is distributed across mutual influence loops and observable through the micro-dynamics of the body. We introduce BEHAVE (Behavioral Engine for Human Activity Vector Estimation), a formal framework that models collective dynamics as continuous behavioral fields defined over an interaction space derived from observable physical signals. Kinematic micro-signals (position, velocity, body orientation, gestural activity) are structured into a directed interaction graph and aggregated into a basis of behavioral fields capturing distinct, non-redundant axes of collective state. The framework rests on one theorem and two structural propositions characterizing the tension field, the field basis, and the criticality index. Perception and forecasting layers are implemented using neural models, enabling data-driven learning and approximation of system dynamics. BEHAVE is formulated as a computational system for learning, representing, and forecasting collective dynamics from data. A working pipeline is demonstrated on a 7-agent negotiation snapshot. The same fields, recalibrated, apply to crowd safety, crisis-team dynamics, education, and clinical contexts.
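As a toy illustration of the pipeline's first step as we read it (kinematic micro-signals to a directed interaction graph to one aggregate field), the influence weighting below is a stand-in of ours, not the paper's formal tension field.

```python
import numpy as np

def interaction_graph(positions, orientations):
    """Directed edge weight i->j: nearby agents that i faces influence i more."""
    n = len(positions)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = positions[j] - positions[i]
            dist = np.linalg.norm(d)
            if dist == 0:
                continue
            facing = max(0.0, float(np.dot(orientations[i], d / dist)))
            W[i, j] = facing / (1.0 + dist)
    return W

def aggregate_field(W, arousal):
    """Per-agent field value: influence-weighted mean of neighbors' arousal."""
    norm = W.sum(axis=1)
    norm[norm == 0] = 1.0
    return (W @ arousal) / norm

rng = np.random.default_rng(0)
pos = rng.normal(size=(7, 2))            # 7-agent snapshot, as in the paper's demo
ori = rng.normal(size=(7, 2))
ori /= np.linalg.norm(ori, axis=1, keepdims=True)
print(aggregate_field(interaction_graph(pos, ori), rng.uniform(size=7)))
```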

AI & Employment · 12 articles
Editor's pick · Professional Services
Arxiv· Today

Career Mobility of Planning Alumni in the United States: Evidence from Professional Profile Data using Large Language Models

arXiv:2605.12618v1 Announce Type: new Abstract: Problem, Research Strategy, and Findings: Planning professionals in the United States navigate complex and dynamic career landscapes under rapid urban changes, yet comprehensive evidence regarding their career trajectories, advancement patterns, and the influence of social, spatial, organizational, and educational factors remains limited. This study draws on boundaryless career theory, social capital theory, and spatial opportunity models to analyze career mobility among more than 130,000 planning alumni. Using large language models to extract structured information from LinkedIn profiles, our results reveal that planning alumni who adopt boundaryless career patterns, specifically multisector experience or lateral and industry-switching trajectories, achieve significantly higher upward mobility. While technical competencies provide a foundational entry-level signal, soft skills leveraged through strategic lateral moves become increasingly decisive as planners reach senior stages. Geographic mobility and employment in larger, diverse metropolitan labor markets are both associated with advancement, though the latter provides modest benefits. Larger professional networks and greater organizational engagement are consistently associated with upward career transitions, while AI-related skills, now commonplace, present limited additional advantage. Limitations include reliance on LinkedIn data, which may underrepresent alumni without online profiles, and an individual-level focus that omits organizational factors.

Editor's pick · PAYWALL · Consumer & Retail
Bloomberg· Today

Africa E-Commerce Giant Jumia to Cut Workforce Due to AI

Jumia Technologies AG is planning to cut an initial 10% of its workforce of about 2,000 people as the African e-commerce giant implements artificial intelligence across its departments, Chief Executive Officer Francis Dufay said.

Editor's pick · Technology
The Register· Today

AI models are getting better at replacing cybersecurity pros on certain tasks

UK researchers find LLMs are learning to finish jobs faster and improving all the time

Editor's pick · Manufacturing & Industrials
Vocal Media· Yesterday

GM Cuts 600 IT Roles as AI Productivity Gains Outpace Headcount Needs

GM is not isolated: Amazon, Meta, Oracle, and Block have announced rounds of job cuts, with some emphasizing AI's role in automating work and boosting productivity with lower headcounts. The pattern is consistent across sectors, with large ...

Editor's pick · Professional Services
CBIA· Yesterday

AI Responsibility and Transparency Act: Key Workplace Impacts

The legislation is a wide-ranging “online safety” and AI bill with several provisions that directly affect hiring and employers.

Editor's pick
Tuck School of Business· Yesterday

When AI Leads to Skill Decay

Tuck professors Alva Taylor and Rob Shumsky explore how working with generally reliable AI can quietly erode human expertise over time.

Editor's pick
Comms Business· Yesterday

UK businesses missing out on full economic benefits of AI

Less than one quarter of workers who have fully deployed digital workers (23 per cent) see job replacement as their biggest concern, compared with those who haven’t started exploring AI (45 per cent). The UK market is at a critical turning point, where early value is evident, but many organisations are still working out how to embed AI into core business processes and realise sustained, enterprise-wide benefits. Leaders say the main barriers ...

Editor's pick · Technology
North Country Now· Yesterday

AI is disrupting hiring: How tech talent can stand out

Toptal reports a surge in tech layoffs as demand shifts towards experienced professionals skilled in AI, emphasizing adaptability and real-world skills.

Editor's pick
LinkedIn· Yesterday

Shelby S. - Marketing & Community Growth Leader

This isn’t a collapse. It’s reallocation. Basic tasks are being automated. Pattern work is being commoditized. Low-leverage labor is being repriced. The uncomfortable truth? AI will not replace you. Someone using AI will. The market does not reward effort.

Editor's pick
Arxiv· Today

How many parents does it take? Parental time allocation and the effectiveness of fertility subsidies

arXiv:2605.13679v1 Announce Type: new Abstract: There has long been an apparent consensus in the literature on intra-household allocation and fertility that greater paternal involvement in childcare relaxes maternal time constraints, enabling mothers to increase their labor supply or leisure. Recent evidence, particularly from South Korea, challenges this view: increases in fathers' childcare time have coincided with a further increase in mothers' time dedicated to child-rearing. This paper develops an Overlapping Generations (OLG) growth model to address such a puzzle. The central mechanism and our main innovation hinge on the functional form of the childcare technology. When maternal and paternal time are substitutes, the conventional result holds. However, when they are complements, greater paternal involvement necessarily raises maternal childcare time, depressing fertility and redirecting household resources toward child quality. We further argue that the elasticity of substitution should not be interpreted as a pure preference parameter, as it also reflects the social and institutional norms, the skills each parent brings to child-rearing and their intergenerational transmission. The model is extended to study the effectiveness of pro-natalist subsidies, suggesting that such policies may generate an unintended anti-fertility bias. Numerical simulations calibrated loosely to South Korean data confirm that the model is consistent with the observed quantity-quality trade-off and the persistence of low fertility despite active pro-natalist policy.
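The pivot point of the argument is the elasticity of substitution in the childcare technology. A standard CES formulation makes it explicit (our illustration; the paper's exact functional form may differ):

```latex
Q(m, f) \;=\; \bigl( \alpha\, m^{\rho} + (1-\alpha)\, f^{\rho} \bigr)^{1/\rho},
\qquad \sigma \;=\; \frac{1}{1-\rho}
```

When sigma > 1 the parental time inputs m and f are substitutes, and extra paternal time frees maternal time, the conventional result. When sigma < 1 they are complements: extra paternal time raises the marginal product of maternal time, pulling mothers' childcare hours up alongside fathers', consistent with the South Korean evidence the abstract cites.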

Editor's pick
Fortune· Today

Burned out and going nowhere: the American worker is too mentally drained to even look for a new job

In a low-hire, low-fire labor market with almost nowhere to go, job search burnout isn't just emotional — it's rational.

Editor's pick
😅 Hacker in the loop· Today

Leo sets Catholics on collision course with AI

Pope Leo XIV is expected to sign an encyclical positioning AI as a major moral and labor challenge. The document will likely emphasize that technology must remain subordinate to human dignity and labor rights.

AI & Misinformation · 3 articles
Editor's pick
Arxiv· Today

Amplification to Synthesis: A Comparative Analysis of Cognitive Operations Before and After Generative AI

arXiv:2605.13785v1 Announce Type: new Abstract: Cognitive operations are a rising concern in the geopolitical sphere, a quiet yet rigorous fight for public perception and decision making. While such operations have been extensively studied in the context of bot-driven amplification, the emergence of generative AI introduces a new set of capabilities that may have fundamentally altered how these operations are designed and executed. The possible evolution of cognitive operations via generative AI leaves nation states vulnerable without proper mitigation strategies. To address this, we compared behavioral and linguistic coordination patterns in X (formerly Twitter) datasets from the 2016 and 2024 U.S. presidential elections. Utilizing a combined corpus of over 133,000 posts, we applied post-type distribution, semantic clustering, temporal synchrony analysis, and Jaccard-based lexical overlap measures. Findings suggest that the 2024 corpus exhibits a distinct pattern from 2016. Original content rose from 59% to 93% while retweets virtually disappeared; lexical overlap collapsed from a mean Jaccard score of 0.99 to 0.27, with posts converging on the same subject matter expressed in markedly different words; and temporal coordination shifted from pervasive cross-semantic synchrony to narratively concentrated co-occurrence. Taken together, these patterns point toward an operational logic organized around active content generation and narrative-specific targeting - characteristics consistent with generative AI involvement. These findings offer an empirical baseline for future research investigating generative AI's role in the cognitive operation pipeline, and a practical reference point for security practitioners developing detection frameworks calibrated to the post-generative-AI threat environment.
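The lexical-overlap statistic cited is standard Jaccard similarity over token sets, so the collapse from 0.99 to 0.27 marks the shift from near-verbatim copypasta to same-topic posts written in different words. Its plain form is below (the authors' exact tokenization is not specified here).

```python
def jaccard(post_a: str, post_b: str) -> float:
    """Jaccard similarity of the two posts' token sets."""
    a, b = set(post_a.lower().split()), set(post_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

# Amplification-era duplication scores near 1.0 ...
print(jaccard("vote for candidate x today", "vote for candidate x today"))  # 1.0
# ... while synthesized same-topic posts in new words score low.
print(jaccard("vote for candidate x today", "x deserves your ballot"))      # 0.125
```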

Editor's pick
Arxiv· Today

"F*** You Biden": Cross-Partisan Electoral Toxicity on X

arXiv:2605.12526v1 Announce Type: cross Abstract: Political discourse on social media has grown increasingly toxic, with electoral periods amplifying partisan hostility and cross-group attacks. Yet it remains unclear whether toxicity in online political speech reflects how partisans communicate within their own circles, or how aggressively they engage with the opposition. Disentangling these dynamics is critical for understanding online political hostility and for designing effective content moderation. We examine this question at scale using a large collection of original posts and replies from X (formerly Twitter), collected during the 2024 U.S. presidential election. Using a human-validated large language model to classify the political alignment of posts and users, and the Perspective API for toxicity scoring, we uncover a striking asymmetry: Republican-leaning posts are significantly more toxic than Democratic-leaning posts, yet Democratic-leaning posts attract significantly more toxic replies. To interpret this finding, we compare the toxicity of same-party and cross-partisan replies. While cross-partisan replies are slightly but significantly more toxic than same-party replies, this is true for both Democratic and Republican posts. However, Republican users account for a large majority of replies to Democratic posts, while Democrats account for a minority of replies to Republican content. Therefore, the elevated toxicity directed at Democratic content is better explained by the volume of Republican cross-partisan replies.

Editor's pick · Healthcare
Arxiv· Today

WhatsApp Vaccine Discourse (WhaVax): An Expert-Annotated Dataset and Benchmark for Health Misinformation Detection

arXiv:2605.12510v1 Announce Type: cross Abstract: We introduce WhaVax, a new expert-annotated dataset of vaccine-related WhatsApp messages collected from large Brazilian public groups spanning multiple pandemic years. The dataset was constructed through a rigorous, carefully designed pipeline that integrates keyword-based data collection, semantic deduplication to remove near-duplicate content, and a multi-stage annotation protocol conducted by medical specialists. This process produced a high-quality gold-standard corpus, characterized by substantial inter-annotator agreement and strong reliability for downstream analysis. Additionally, we provide a detailed characterization of WhatsApp misinformation, revealing distinctive linguistic, structural, lexical, temporal, and group-level patterns, as well as a meaningful layer of ambiguous cases that reflect the complexity of health discourse in private messaging. We also benchmark classical models, fine-tuned Small Language Models, and zero- or few-shot Large Language Models under realistic data-scarcity constraints, demonstrating that strong embeddings and LLM approaches perform competitively, while domain alignment and data availability remain critical factors. This study provides a rare, high-quality resource to support misinformation research and computational modeling in encrypted communication environments.

AI Ethics & Safety · 8 articles
Editor's pick · Financial Services
Arxiv· Today

Do Fair Models Reason Fairly? Counterfactual Explanation Consistency for Procedural Fairness in Credit Decisions

arXiv:2605.12701v1 Announce Type: cross Abstract: Machine learning algorithms in socially sensitive domains (e.g., credit decisions) often focus on equalizing predictive outcomes. However, satisfying these metrics does not guarantee that models use the same reasoning for different groups. We show that existing outcome-fair models can still apply fundamentally different reasoning to individuals, a "hidden procedural bias" missed by standard fairness metrics and algorithms. We propose Counterfactual Explanation Consistency (CEC), a framework that detects and mitigates this bias by aligning feature attributions between individuals and their counterfactual counterparts. Key contributions include a nearest-neighbor counterfactual generation method, a modified baseline for integrated gradient comparisons, an individual-level procedural fairness metric, and a corresponding training loss. We introduce a taxonomy identifying "Regime B" (same outcome, different reasoning) as a critical blind spot. Experiments on synthetic data, German Credit, Adult Income, and HMDA mortgage data demonstrate that outcome-fair baselines exhibit substantial hidden bias, while CEC substantially reduces it with modest utility cost.
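The consistency check at the heart of CEC can be sketched in a few lines: compare input attributions at an individual and at its counterfactual counterpart, and flag a large gap as hidden procedural bias. Plain input gradients stand in below for the paper's integrated gradients and modified baseline.

```python
import torch

def procedural_gap(model, x, x_cf):
    """Attribution distance between an individual and its counterfactual.
    A large gap despite similar outcomes is the 'Regime B' blind spot."""
    def input_grad(inp):
        inp = inp.clone().detach().requires_grad_(True)
        model(inp).sum().backward()
        return inp.grad.detach()
    return (input_grad(x) - input_grad(x_cf)).abs().mean().item()

# Toy credit model and a hand-made counterfactual (the paper instead
# generates counterfactuals via nearest neighbors).
model = torch.nn.Sequential(torch.nn.Linear(4, 1), torch.nn.Sigmoid())
x = torch.randn(1, 4)
x_cf = x.clone()
x_cf[0, 2] *= -1.0
print(procedural_gap(model, x, x_cf))
```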

Editor's pick · Government & Public Sector
Artificial Intelligence Newsletter | May 14, 2026· Yesterday

US FTC's White emphasizes consumer redress, fighting concrete harm

The US Federal Trade Commission is focused on enforcing against concrete harms in the marketplace and is prioritizing consumer redress, according to Kate White, deputy director of the FTC's Bureau of Consumer Protection.

Editor's pick · PAYWALL
FT· Today

AI desperately needs more adult supervision

The critical challenge is to build institutions that protect us from tech companies and the state

Editor's pick
Arxiv· Today

DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models

arXiv:2605.12702v1 Announce Type: new Abstract: General-purpose safety benchmarks for large language models do not adequately evaluate disability-related harms. We introduce DisaBench: a taxonomy of twelve disability harm categories co-created with people with disabilities and red teaming experts, a taxonomy-driven evaluation methodology that pairs benign and adversarial prompts across seven life domains, and a dataset of 175 prompts with human-annotated labels on 525 prompt-response pairs. Annotation by four evaluators with lived disability experience reveals three findings: harm rates vary sharply by disability type and will compound in non-text modalities, terminology-driven harm is culturally and temporally bound rather than universally assessable, and standard safety evaluation catches overt failures while missing the subtle harms that only domain expertise can recognize. Disability harm is simultaneously personal, intersectional, and community-defined: it cannot be isolated from the full context of who a person is, and general-purpose benchmarks systematically miss it. We will release the dataset, taxonomy, and methodology via Hugging Face and an open-source red teaming framework for direct integration into existing safety pipelines with no additional infrastructure.

Editor's pick
MIT Technology Review· Yesterday

AI chatbots are giving out people’s real phone numbers

A Redditor recently wrote that he was “desperate for help”: for about a month, he said, his phone had been inundated by calls from “strangers” who were “looking for a lawyer, a product designer, a locksmith.” Callers were apparently misdirected by Google’s generative AI.  In March, a software developer in Israel was contacted on WhatsApp…

Editor's pick · Consumer & Retail
Guardian· Yesterday

Is Big Brother watching you shop? – podcast

From supermarkets to corner shops, live facial recognition could be coming to retailers near you. Jessica Murray on the AI systems increasingly used by the police and stores. Live facial recognition is being hailed as a powerful new frontier in the fight against crime, not only by police but by private companies too. Retailers from supermarkets to corner shops hope it will help them fight back against shoplifting. But the Guardian’s social affairs correspondent, Jessica Murray, points out that it will also expand surveillance into more and more public spaces. And the technology doesn’t always get it right.

Editor's pick · Transportation & Logistics
Arxiv· Today

Revealing Interpretable Failure Modes of VLMs

arXiv:2605.12674v1 Announce Type: new Abstract: Vision-Language Models (VLMs) are increasingly used in safety-critical applications because of their broad reasoning capabilities and ability to generalize with minimal task-specific engineering. Despite these advantages, they can exhibit catastrophic failures in specific real-world situations, constituting failure modes. We introduce REVELIO, a framework for systematically uncovering interpretable failure modes in VLMs. We define a failure mode as a composition of interpretable, domain-relevant concepts, such as pedestrian proximity or adverse weather conditions, under which a target VLM consistently behaves incorrectly. Identifying such failures requires searching over an exponentially large discrete combinatorial space. To address this challenge, REVELIO combines two search procedures: a diversity-aware beam search that efficiently maps the failure landscape, and a Gaussian-process Thompson Sampling strategy that enables broader exploration of complex failure modes. We apply REVELIO to autonomous driving and indoor robotics domains, uncovering previously unreported vulnerabilities in state-of-the-art VLMs. In driving environments, the models often demonstrate weak spatial grounding and fail to account for major obstructions, leading to recommendations that would result in simulated crashes. In indoor robotics tasks, VLMs either miss safety hazards or behave excessively conservatively, producing false alarms and reducing operational efficiency. By identifying structured and interpretable failure modes, REVELIO offers actionable insights that can support targeted VLM safety improvements.

Editor's pick
Vocal Media· Yesterday

Former OpenAI Researcher Warns AI Industry Lacks Control Over Systems It Is Racing to Build

Daniel Kokotajlo, a former OpenAI researcher who now runs the AI Futures Project, says the artificial intelligence industry is racing to build systems that companies still do not fully understand ...

Technology & Infrastructure

33 articles
AI Agents & Automation · 3 articles
Editor's pick · Financial Services
Daily AI News May 14, 2026: Deadly Worm in Your Software?· Today

Prudential - Powering AI-Driven Advisor Workflows in Life Insurance | AWS Events

Prudential is utilizing generative AI and multi-agent architectures to streamline life insurance advisor workflows, reducing administrative overhead and improving productivity.

Editor's pick · Technology
Arxiv· Today

Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents

arXiv:2605.12620v1 Announce Type: new Abstract: Building generalist embodied agents capable of solving complex real-world tasks remains a fundamental challenge in AI. Multimodal Large Language Models (MLLMs) have significantly advanced the reasoning capabilities of such agents through strong vision-language knowledge and chain-of-thought (CoT) reasoning, yet remain brittle when faced with challenging out-of-distribution scenarios. To address this, we propose Verifier-Guided Action Selection (VeGAS), a test-time framework designed to improve the robustness of MLLM-based embodied agents through an explicit verification step. At inference time, rather than committing to a single decoded action, VeGAS samples an ensemble of candidate actions and uses a generative verifier to identify the most reliable choice, without modifying the underlying policy. Crucially, we find that using an MLLM off-the-shelf as a verifier yields no improvement, motivating our LLM-driven data synthesis strategy, which automatically constructs a diverse curriculum of failure cases to expose the verifier to a rich distribution of potential errors at training time. Across embodied reasoning benchmarks spanning the Habitat and ALFRED environments, VeGAS consistently improves generalization, achieving up to a 36% relative performance gain over strong CoT baselines on the most challenging multi-object, long-horizon tasks.
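The test-time loop itself is compact: sample several candidate actions from the unchanged policy and let a verifier re-rank them. A sketch with toy stand-ins for the policy and the trained verifier:

```python
import random

def select_action(policy_sample, verifier_score, observation, n_candidates=8):
    """Verifier-guided selection: the policy is untouched; the verifier only
    re-ranks an ensemble of its sampled actions."""
    candidates = [policy_sample(observation) for _ in range(n_candidates)]
    return max(candidates, key=lambda a: verifier_score(observation, a))

# Toy stand-ins: a noisy policy and a verifier that recognizes the right action.
random.seed(0)
policy = lambda obs: random.choice(["pick up mug", "open fridge", "go to sofa"])
verifier = lambda obs, a: 1.0 if a == "pick up mug" else 0.0
print(select_action(policy, verifier, "task: fetch the mug"))
```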

AI Infrastructure & Compute · 6 articles
AI Models & Capabilities · 9 articles
Editor's pick
Arxiv· Today

No One Knows the State of the Art in Geospatial Foundation Models

arXiv:2605.12678v1 Announce Type: cross Abstract: Geospatial foundation models (GFMs) have been proposed as generalizable backbones for disaster response, land-cover mapping, food-security monitoring, and other high-stakes Earth-observation tasks. Yet the published work about these models does not give reviewers or users enough information to tell which model fits a given task. We argue that nobody knows what the current state of the art is in geospatial foundation models. The methods may be useful, but the GFM literature does not standardize evaluations, training and testing protocols, released weights, or pretraining controls well enough for anyone to compare or rank them. In a 152-paper audit, we find 46 cross-paper disagreements of at least 10 points for the same model, benchmark, and protocol; 94/126 papers with extractable pretraining data use a configuration no other paper uses; and 39% of GFM papers release no model weights. This lack of community standards can be solved. We propose six concrete expectations: named-license weight release, shared core evaluations, copied-versus-rerun baseline annotations, variance reporting, one shared evaluation harness, and data-vs-architecture-vs-algorithm controls. These gaps are a coordination failure, not a fault of any individual lab; the authors of this paper, like many others in the GFM community, have contributed to them. Rather than just critiquing the community, we aim to provide concrete steps toward a shared understanding of how to innovate GFMs.

Editor's pickPAYWALLTechnology
FT· Yesterday

AI Labs: Google DeepMind plans its comeback

Google and its AI lab DeepMind are bearing down on OpenAI and Anthropic

Editor's pickTechnology
Arxiv· Today

State-Centric Decision Process

arXiv:2605.12755v1 Announce Type: new Abstract: Language environments such as web browsers, code terminals, and interactive simulations emit raw text rather than states, and provide none of the runtime structure that MDP analysis requires. No explicit state space, no observation-to-state mapping, no certified transitions, and no termination criterion. We introduce the State-Centric Decision Process (SDP), a runtime framework that constructs these missing inputs by having the agent build them, predicate by predicate, as it acts. At each step the agent commits to a natural-language predicate describing how the world should look, takes an action to make it true, and checks the observation against it. Predicates that pass become certified states, and the resulting trajectory carries the four objects language environments do not provide, namely a task-induced state space, an observation-to-state mapping, certified transitions, and a termination criterion. We evaluate SDP on five benchmarks spanning planning, scientific exploration, web reasoning, and multi-hop question answering. SDP achieves the best training-free results on all five, with the advantage widening as the horizon grows. The certified trajectories additionally support analyses unavailable to reactive agents, including per-predicate credit assignment, failure localization, partial-progress measurement, and modular operator replacement.
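Stripped to its skeleton, the runtime loop is: commit to a predicate, act to make it true, check the observation, certify on success. A minimal sketch, assuming hypothetical agent and env interfaces (no method name here is from the paper):

    def sdp_episode(agent, env, max_steps=50):
        certified = []  # certified states: predicates that passed their check
        obs = env.reset()
        for _ in range(max_steps):
            # Commit to a natural-language predicate describing how the
            # world should look after the next action.
            predicate = agent.propose_predicate(obs, certified)
            action = agent.act_toward(predicate, obs)
            obs = env.step(action)
            # Only predicates verified against the new observation
            # become certified states in the trajectory.
            if agent.check(predicate, obs):
                certified.append(predicate)
            if agent.is_terminal(certified):  # task-induced termination
                break
        return certified

The certified list is what enables the analyses the abstract mentions: each entry is a checkable state, so credit assignment and failure localization reduce to inspecting which predicate failed and when.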

Editor's pick
Arxiv· Today

In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores

arXiv:2605.12530v1 Announce Type: cross Abstract: LLM fairness should be evaluated through in-situ conversational behavior rather than standardized-test Q&A benchmarks. We show that the standardized-test paradigm can be structurally unreliable: surface-level prompt construction choices, although entirely orthogonal to the fairness question being tested, account for the majority of score variance, shift fairness conclusions in both direction and magnitude, and result in severe discordance in model rankings. We develop MAC-Fairness, a multi-agent conversational framework that embeds controlled variation factors into multi-round dialogue for in-situ behavior evaluation, examining how models' conversational behavior shifts when identity is varied as part of natural multi-agent interaction. Repurposing standardized-test questions as conversation seeds rather than as the evaluation instrument, we evaluate position persistence (how models hold their positions, from the self-perspective) and peer receptiveness (how receptive they are to peers, from the other-perspective) across 8 million conversation transcripts spanning multiple models and identity presence configurations. In-situ behavioral evaluation reveals stable, model-specific behavioral signatures that could generalize across benchmarks differing in fairness targets and evaluation methodologies, a form of evidence the standardized-test paradigm does not offer.

Editor's pickHealthcare
Arxiv· Today

Multimodal Hidden Markov Models for Persistent Emotional State Tracking

arXiv:2605.12838v1 Announce Type: new Abstract: Tracking an interpretable emotional arc of a conversation via the sentiment of individual utterances processed as a whole is central to both understanding and guiding communication in applied, especially clinical, conversational contexts. Existing approaches to emotion recognition operate at the utterance level, obscuring the persistent phases that characterize real conversational dynamics. We propose a lightweight framework that models conversational emotion as a sequence of latent emotional regimes using sticky factorial HDP-HMMs over multimodal valence-arousal representations derived from simultaneous video, audio and textual input. We evaluate the quality of regime prediction using LLM-as-a-Judge, geometric, and temporal consistency metrics, demonstrating that the sticky HDP-HMM produces more interpretable regime sequences than the baseline Gaussian HMM at a fraction of the computational cost of LLM-based dialogue state tracking methods. In addition, Question-Answer experiments in a clinical dataset suggest that meaningful emotional phases can reliably be recovered from multimodal valence-arousal trajectories and used to improve the quality of LLM responses in unstable affective regimes via context augmentation. This framework thus opens a path toward interpretable, lightweight, and actionable analysis of conversational emotion dynamics at scale.
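The sticky factorial HDP-HMM itself has no off-the-shelf implementation, but the baseline it is compared against, a Gaussian HMM over valence-arousal features, can be sketched with the hmmlearn library. The feature array below is synthetic stand-in data, not the paper's.

    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    # X: per-utterance valence-arousal features for one conversation,
    # shape (T, 2), ordered in time. Synthetic here for illustration.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(120, 2))

    # Fit a 4-regime Gaussian HMM; the decoded state sequence segments
    # the conversation into persistent emotional phases.
    hmm = GaussianHMM(n_components=4, covariance_type="full", n_iter=50)
    hmm.fit(X)
    regimes = hmm.predict(X)  # one latent regime label per utterance

The "sticky" variant the paper uses adds a self-transition bias so regimes persist rather than flicker between utterances, which is precisely what makes the recovered phases interpretable.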

Editor's pickTechnology
Arxiv· Today

When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

arXiv:2605.12922v1 Announce Type: new Abstract: Large language models can follow complex instructions in a single turn, yet over long multi-turn interactions they often lose the thread of instructions, persona, and rules. This degradation has been measured behaviorally but not mechanistically explained. We propose a channel-transition account: goal-defining tokens become less accessible through attention, while goal-related information may persist in residual representations. We introduce the Goal Accessibility Ratio (GAR), measuring attention from generated tokens to task-defining goal tokens, and combine it with sliding-window ablations and residual-stream probes. When attention to instructions closes, what survives reveals architecture. Across architectures, the transition yields qualitatively distinct failure modes: some models preserve goal-conditioned behavior at vanishing attention, others fail despite decodable residual goal information, and the layer at which this encoding emerges varies from 2 to 27. A within-model causal ablation that force-closes the attention channel in Mistral collapses recall from near-perfect to 11% on a 20-fact retention task and raises persona-constraint violations above an adversarial-pressure baseline without user pressure, with both effects emerging at the predictable crossover turn. Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99 across all four primary architectures, while input embeddings remain at chance. Across architectures and model scales, the gap between attention loss and residual decodability predicts whether goal-conditioned behavior survives channel closure. We contribute GAR as a diagnostic, the channel-transition framework as a controlled mechanistic account, and a parametric prediction of failure timing under windowed attention closure.
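The Goal Accessibility Ratio lends itself to a short diagnostic. The sketch below shows one plausible reading of the metric, attention mass from generated-token positions onto goal-token positions, computed from the tensors a Hugging Face model returns when called with output_attentions=True; the paper's exact normalization and layer aggregation may differ.

    import torch

    def goal_accessibility_ratio(attentions, goal_positions, gen_positions):
        # attentions: tuple of per-layer tensors shaped
        # (batch, heads, seq, seq), as returned with output_attentions=True.
        ratios = []
        for layer in attentions:
            att = layer[0]                    # (heads, seq, seq), batch of 1
            rows = att[:, gen_positions, :]   # attention from generated tokens
            # Each attention row sums to 1 over keys, so this is a ratio.
            to_goal = rows[:, :, goal_positions].sum(dim=-1)
            ratios.append(to_goal.mean())
        return torch.stack(ratios).mean().item()

A value that falls across turns would indicate the "channel closing" the paper describes, even while linear probes can still decode the goal from the residual stream.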

Editor's pick
VentureBeat· Yesterday

AI IQ is here: a new site scores frontier AI models on the human IQ scale. The results are already dividing tech.

For decades, the IQ test has been one of the most familiar — and most contested — yardsticks for human intelligence. Now, a startup project called AI IQ is applying the same metaphor to artificial intelligence, assigning estimated intelligence quotients to more than 50 of the world's most powerful language models and plotting them on a standard bell curve. The result is a set of interactive visualizations at aiiq.org that have ricocheted across social media in the past week, drawing praise from enterprise technologists who say the charts make an impossibly complex market legible — and sharp criticism from researchers and commentators who warn the entire framework is misleading. "This is super useful," wrote Thibaut Mélen, a technology commentator, on X. "Much easier to understand model progress when it's mapped like this instead of another giant leaderboard table." Brian Vellmure, a business strategist, offered a similar endorsement: "This is helpful. Anecdotally tracks with personal experience." But the backlash arrived just as quickly. "It's nonsense. AI is far too jagged. The map is not the territory," posted AI Deeply, an artificial intelligence commentary account, crystallizing a worry shared by many researchers: that reducing a language model's sprawling, uneven capabilities to a single number creates a dangerous illusion of precision. Twelve benchmarks, four dimensions, and one controversial number: how AI IQ actually works AI IQ was created by Ryan Shea, an engineer, entrepreneur, and angel investor best known as a co-founder of the blockchain platform Stacks. Shea also co-founded Voterbase and has invested in the early stages of several unicorns, including OpenSea, Lattice, Anchorage, and Mercury. He holds a Bachelor of Science in Mechanical Engineering from Princeton University. The site's methodology rests on a deceptively simple formula. AI IQ groups 12 benchmarks into four reasoning dimensions: abstract, mathematical, programmatic, and academic. The composite IQ is a straight average of those four dimension scores: IQ = ¼ (IQ_Abstract + IQ_Math + IQ_Prog + IQ_Acad). The abstract reasoning dimension draws from ARC-AGI-1 and ARC-AGI-2, the notoriously difficult pattern-recognition benchmarks designed to test general fluid intelligence. Mathematical reasoning includes FrontierMath (Tiers 1–3 and Tier 4), AIME, and ProofBench. Programmatic reasoning uses Terminal-Bench 2.0, SWE-Bench Verified, and SciCode. Academic reasoning pulls from Humanity's Last Exam, CritPt, and GPQA Diamond. Each raw benchmark score gets mapped to an implied IQ through what the site describes as "hand-calibrated difficulty curves." Crucially, the methodology compresses ceilings for benchmarks considered easier or more susceptible to data contamination, preventing them from inflating scores above 100. Harder, less gameable benchmarks retain higher ceilings. The system also handles missing data conservatively: models need scores on at least two of the four dimensions to receive a derived IQ, and when benchmarks are absent, the pipeline deliberately pulls scores down rather than up. The site states that "every derived IQ averages all four dimensions, so missing coverage cannot make a model look better by omission." OpenAI leads the bell curve, but the gap between the top AI models has never been smaller As of mid-May 2026, the AI IQ charts tell a story of rapid convergence at the top of the frontier — and widening diversity in the tiers below. 
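The scoring arithmetic is simple enough to reproduce. The sketch below implements the four-dimension average with conservative missing-data handling; the floor value is a guess, since AI IQ does not publish the exact rule by which absent benchmarks pull scores down.

    def composite_iq(dims, floor=85.0, min_covered=2):
        # dims: implied-IQ scores for the four reasoning dimensions,
        # e.g. {"abstract": 128, "math": 135, "prog": None, "acad": 130};
        # None marks missing coverage.
        names = ("abstract", "math", "prog", "acad")
        covered = [n for n in names if dims.get(n) is not None]
        if len(covered) < min_covered:
            return None  # too little coverage for a derived IQ
        # All four dimensions are always averaged, with missing ones
        # filled at a conservative floor, so omission can only hurt.
        return sum(dims[n] if dims.get(n) is not None else floor
                   for n in names) / 4.0

For example, a model scored only on abstract (128) and mathematical (135) reasoning would land at (128 + 135 + 85 + 85) / 4 ≈ 108, well below the average of its two measured dimensions.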
According to the Frontier IQ Over Time chart, GPT-5.5 from OpenAI currently sits at the peak of the bell curve, with an estimated IQ near 136 — the highest of any model tracked. It is closely followed by Opus 4.7 from Anthropic (approximately 132), GPT-5.4 (approximately 131), and Opus 4.6 (approximately 129). Google's Gemini 3.1 Pro lands near 131, making the top cluster extraordinarily tight. That compression is not unique to AI IQ's framework. Visual Capitalist, drawing from a separate Mensa-based ranking by TrackingAI, recently observed the same dynamic, noting that "the biggest takeaway is how compressed the top of the leaderboard has become." On that scale, Grok-4.20 Expert Mode and GPT 5.4 Pro tied at 145, with Gemini 3.1 Pro at 141. Below the frontier cluster, the AI IQ charts show a crowded midfield. Models from Chinese labs — Kimi K2.6, GLM-5, DeepSeek-V3.2, Qwen3.6, MiniMax-M2.7 — bunch between roughly 112 and 118, making the cost-performance tier increasingly competitive for enterprise buyers who don't need the absolute best model for every task. One X user, ovsky, noted that the data "confirms experience with sonnet 4.6 being an absolute workhorse as opposed to opus 4.5" — pointing to the way the charts can validate practitioner intuitions that headline rankings often miss. Why emotional intelligence scores are becoming the new battleground in AI model rankings What distinguishes AI IQ from most other benchmarking efforts is its inclusion of an "EQ" — emotional intelligence — score. The site maps each model's EQ-Bench 3 Elo score and Arena Elo score to an estimated EQ using calibrated piecewise-linear scales, then takes a 50/50 weighted composite of the two. The EQ scores produce a meaningfully different ranking than IQ alone. On the IQ vs. EQ scatter plot, Anthropic's Opus 4.7 leads on EQ with a score near 132, pushing it into the upper-right quadrant — the most desirable position, signaling both high cognitive and high emotional intelligence. OpenAI's GPT-5.5 and GPT-5.4 cluster in the high-IQ zone but lag slightly on EQ. Google's Gemini 3.1 Pro sits in a strong middle position on both axes. One notable methodological choice has drawn attention: EQ-Bench 3 is judged by Claude, an Anthropic model, which the site acknowledges "creates potential scoring bias in favor of Anthropic models." To correct for this, AI IQ subtracts a 200-point Elo penalty from the EQ-Bench component for all Anthropic models before mapping to implied EQ. The Arena component is unaffected since it uses human judges. That self-correction is unusual in the benchmarking world, and it suggests Shea is aware of the methodological minefield he has entered. Still, the EQ dimension captures something IQ alone cannot: the growing importance of conversational quality, collaboration, and trust in models deployed for user-facing work. The AI cost-performance chart that enterprise buyers actually need to see Perhaps the most practically useful chart on the site is not the bell curve but the IQ vs. Effective Cost scatter plot. It maps each model's estimated IQ against an "effective cost" metric — defined as the token cost for a task using 2 million input tokens and 1 million output tokens, multiplied by a usage efficiency factor. The chart reveals a familiar pattern in enterprise technology: the best models are not always the best value. GPT-5.5 and Opus 4.7 sit in the upper-left corner — high IQ, high cost, with effective per-task costs north of $30 and $50 respectively. 
Meanwhile, models like GPT-5.4-mini, DeepSeek-V3.2, and MiniMax-M2.7 occupy a sweet spot in the middle: respectable IQ scores between 112 and 120, at effective costs ranging from roughly $1 to $5 per task. At the cheapest extreme, GPT-oss-20b (an open-source OpenAI model) appears near $0.20 effective cost with an IQ around 107 — potentially the most economical option for bulk classification or extraction workloads. The site also offers a 3D visualization mapping IQ, EQ, and effective cost simultaneously. A dashed line running through the cube points toward the ideal: higher IQ, higher EQ, and lower cost. Models near the "green end" of that axis are stronger all-around deals; those near the "red end" sacrifice capability, cost efficiency, or both. For CIOs staring at API invoices, the implication is clear: the intelligence gap between a $50 model and a $3 model has narrowed enough that routing — using expensive models for hard problems and cheap ones for everything else — is no longer optional. It is the dominant architecture for serious AI deployments. Critics say AI's "jagged" capabilities make a single IQ score dangerously misleading The loudest objection to AI IQ is philosophical, and it cuts deep. Critics argue that collapsing a model's uneven capabilities into a single score obscures more than it reveals. "IQ as a proxy is fading — we're seeing reasoning density spikes that don't map to g-factor," posted Zaya, a technology commentator, on X. "GPT-5.5 already hit saturation on MMLU-Pro, but still fails ClockBench 50% of the time." That observation touches on what AI researchers call the "jaggedness" problem: large language models often exhibit wildly uneven capabilities, excelling at graduate-level physics while failing at tasks a child could do. A composite score can paper over those gaps. Pressureangle, another X user, posted a more granular critique, calling out "complete lack of transparency" and arguing the site never fully discloses how its calibration curves were created or validated. In fairness, AI IQ does list its 12 benchmarks and shows the shape of each calibration curve in its methodology modal. But the raw data and precise mathematical transformations are not published as open datasets — a gap that matters to researchers accustomed to fully reproducible methods. Others questioned the premise itself. "As useless as human IQ testing," wrote haashim on X. Shubham Sharma, an AI and technology writer, offered a constructive alternative: "Why not having the Models take an official (MENSA-Grade) test? Wouldn't this be the most accurate and most 'human-comparable' way to benchmark intelligence?" That approach already exists through TrackingAI, which administers the Mensa Norway IQ test to language models. But Mensa-style tests measure only abstract pattern recognition, while AI IQ attempts a broader composite across coding, mathematics, and academic reasoning. As Visual Capitalist noted, "an IQ-style benchmark captures only one slice of capability." Each approach has tradeoffs — and neither has won the argument yet. The real race isn't for the highest score — it's for the smartest model stack For all the debate about methodology, the most important signal in AI IQ's data may not be any single model's score. It is the shape of the market the charts reveal. There are now more than 50 frontier-class models available through APIs, from at least 14 major providers spanning the United States, China, and Europe. 
Each provider publishes its own benchmarks, often cherry-picked to showcase strengths. The result is a Tower of Babel where no two companies measure the same thing in the same way. Academic research has highlighted that "most benchmarks introduce bias by focusing on a particular type of domain," and the Frontier IQ Over Time chart on AI IQ shows just how fast the targets are moving: in October 2023, GPT-4-turbo sat near an estimated IQ of 75. By early 2026, the top models were brushing 135 — roughly 60 points of improvement in 30 months. That pace raises a fundamental question about whether any scoring system can keep up. The site compresses ceilings for saturated benchmarks, but as models continue to max out even the hardest tests — ARC-AGI-2, FrontierMath Tier 4, Humanity's Last Exam — the framework will face the same ceiling effects that have plagued every AI evaluation before it. Connor Forsyth pointed to this dynamic on X: "ARC AGI 3 disagrees," he wrote, referencing a next-generation benchmark that may already be undermining current scores. AI IQ is not perfect. Its methodology is partially opaque. Its IQ metaphor can mislead. And its creator acknowledges known biases while likely missing others. But the alternative — wading through dozens of provider-specific benchmark tables, each using different test suites and scoring conventions — is worse. The site offers enterprise buyers something genuinely scarce: a single framework for comparing models across providers, dimensions, and price points, updated regularly, with enough nuance to show that the right answer to "which model is best?" is almost always "it depends on the task." As Debdoot Ghosh mused on X after viewing the charts: "Now a human's role is just to orchestrate?" Maybe. But if the AI IQ data shows anything clearly, it is that orchestration — knowing which model to deploy, when, and at what price — has become its own form of intelligence. And for that, there is no benchmark yet.

Editor's pick
Daily AI News May 13, 2026: Miro Lost 42 Years Productivity Annually. AI Got It Back.· Yesterday

The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents

This research paper examines how expanded memory in LLM agents can reduce cooperation and produce negative behavioral patterns over repeated interactions.

Editor's pick
Arxiv· Today

CHAL: Council of Hierarchical Agentic Language

arXiv:2605.12718v1 Announce Type: new Abstract: Multi-agent debate has emerged as a promising approach for improving LLM reasoning on ground-truth tasks, yet current methodologies face certain structural limitations: debate tends to induce a martingale over belief trajectories, majority voting accounts for most observed gains, and LLMs exhibit confidence escalation rather than calibration across rounds. We argue that the genuine value of debate, and dialectic systems as a whole, lies not in ground-truth tasks but in defeasible domains, where every position can in principle be defeated by better reasoning. We present the Council of Hierarchical Agentic Language (CHAL), a multi-agent dialectic framework that treats defeasible argumentation as an engine for belief optimization. Each agent maintains a CHAL Belief Schema (CBS), a graph-structured belief representation with a Bayesian-inspired architecture, that facilitates belief revision through a gradient-informed dynamic mechanism by leveraging the strength of the belief's thesis as a differentiable objective. Meta-cognitive value systems spanning epistemology, logic, and ethics are elevated to configurable hyperparameters governing agent reasoning and adjudication outcomes. We provide a series of ablation experiments that demonstrate systematic and interpretable effects: the adjudicator's value system determines the debate's overall trajectories in latent belief space, council diversity refines beliefs for all participants, and the framework generalizes across broad fields. CHAL is, to our knowledge, the first framework to treat multi-agent debate as structured belief optimization over defeasible domains. Further, the auditable belief artifacts it produces establish the foundation for dedicated evaluation suites for defeasible argumentation, with broader implications for building AI systems whose reasoning and value commitments are transparent, aligned, and subject to human oversight.

AI Research & Science2 articles
Editor's pickPharma & Biotech
Arxiv· Today

PROMETHEUS: Automating Deep Causal Research Integrating Text, Data and Models

arXiv:2605.12835v1 Announce Type: new Abstract: Large language models can extract local causal claims from text, but those claims become more useful when organized as persistent, navigable world models rather than as flat summaries. We introduce PROMETHEUS, a framework that turns retrieved literature, filings, reviews, reports, agent traces, source data, code, simulations, and scientific models into causal atlases: sheaf-like families of local causal predictive-state models over an explicit cover of a research substrate. Each local region contains causal episodes, structured claim tables, predictive tests, support statistics, and provenance; restriction maps compare overlapping regions; gluing diagnostics expose agreement, drift, contradiction, and underdetermination. The resulting Topos World Model is not a single universal graph. It is a research instrument for navigating what a corpus says, where it says it, how strongly it is supported, and where local claims fail to assemble into a coherent global view. Three literature-atlas case studies -- ocean-temperature impacts on marine populations, GLP-1 weight-loss evidence, and resveratrol/red-wine health-benefit claims -- illustrate deep causal research from text with explicit locality, evidence, persistent state, and gluing tension. Four grounded-counterfactual case studies -- a Nature Climate Change microplastics forcing paper, an Indus Valley hydrology paper with VIC-derived figure data and model code, the canonical Sachs protein-signaling study with single-cell perturbation data, and a Nature singing-mouse study with MAPseq projection matrices -- show a stronger mode: when a paper ships source data, simulation outputs, or code, PROMETHEUS can evaluate a counterfactual against that scientific substrate and then rebuild the sheaf world model around the ...

AI Security & Cybersecurity9 articles
Editor's pickFinancial Services
Artificial Intelligence Newsletter | May 14, 2026· Yesterday

Japan, US tackle AI cyberthreats as megabanks prepare to access Mythos

Japan is accelerating efforts to address cybersecurity risks from frontier artificial intelligence models in cooperation with the US, as three major Japanese banks are reportedly set to gain access to Anthropic's Claude Mythos Preview.

Editor's pickTechnology
Arxiv· Today

Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue

arXiv:2605.12856v1 Announce Type: new Abstract: The emergence of multi-agent systems introduces novel moderation challenges that extend beyond content filtering. Agents with malicious intent may contribute harmful content that appears benign to evade content-based moderation, while compromising the system through exploitative and malicious behavior manifested across their overall interaction patterns within the community. To address this, we introduce Bot-Mod (Bot-Moderation), a moderation framework that grounds detection in agent intent rather than traditional content-level signals. Bot-Mod identifies the underlying intent by engaging with the target agent in a multi-turn exchange guided by Gibbs-based sampling over candidate intent hypotheses. This progressively narrows the space of plausible agent objectives to identify the underlying behavior. To evaluate our approach, we construct a dataset derived from Moltbook that encompasses diverse benign and malicious behaviors based on actual community structures, posts, and comments. Results demonstrate that Bot-Mod reliably identifies agent intent across a range of adversarial configurations, while maintaining a low false positive rate on benign behaviors. This work advances the foundation for scalable, intent-aware moderation of agents in open multi-agent environments.
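The hypothesis-narrowing loop is the heart of the method. The toy version below swaps the paper's Gibbs-based sampling for an exact Bayesian update over a small, enumerable hypothesis set; every argument name here is hypothetical.

    def narrow_intent(hypotheses, prior, probe, respond, likelihood, rounds=5):
        # probe(posterior): choose the next question for the target agent
        # respond(q): the target agent's reply in the multi-turn exchange
        # likelihood(h, q, r): how probable reply r is under intent h
        posterior = dict(zip(hypotheses, prior))
        for _ in range(rounds):
            q = probe(posterior)
            r = respond(q)
            posterior = {h: p * likelihood(h, q, r)
                         for h, p in posterior.items()}
            z = sum(posterior.values()) or 1.0
            posterior = {h: p / z for h, p in posterior.items()}  # normalize
        # The surviving high-mass hypothesis is the inferred intent.
        return max(posterior, key=posterior.get)

Gibbs sampling becomes necessary when the intent space is too large or structured to enumerate, but the flow is the same: each exchange shrinks the set of plausible agent objectives.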

Editor's pickFinancial Services
American Enterprise Institute· Today

AI Is Making the Cyber-Thriller Less Fictional | AEI

The infrastructure of modern economic life—financial systems, energy grids, hospital networks—is built on the assumption that the most dangerous hackers are rare. That assumption has a fast-approaching expiration date.

Editor's pickTechnology
Daily AI News May 14, 2026: Deadly Worm in Your Software?· Today

Protect Your Enterprise Now from the Shai-Hulud Worm and npm Vulnerability In 6 Actionable Steps

The Shai-Hulud worm is the first known npm attack targeting AI coding agents and developer credential stores. It leverages legitimate GitHub workflows to spread malware into CI/CD pipelines.

Editor's pickTechnology
Top Daily Headlines: Civil servants to protest outside Capita AGM over pension shambles· Today

Malware crew TeamPCP open-sources its Shai-Hulud worm on GitHub

The malware has been forked on GitHub, seemingly without Microsoft's code locker noticing.

Adoption, Deployment & Impact

25 articles
AI Adoption Barriers & Enablers5 articles
Editor's pickEducation
Arxiv· Today

An Activity-Theoretical Approach to Teacher Professional Development in Pedagogical AI Agent Design

arXiv:2605.12934v1 Announce Type: new Abstract: This two-cycle formative intervention study examined why teachers disengage from AI agent creation after professional development - a low-engagement paradox - and tested whether systemic redesign could address it. Cycle 1 (N=218) revealed that despite completing comprehensive TPD, 87 percent of teachers ceased creating within three weeks, with behavioral tracking and interview analysis identifying systemic contradictions as the source of psychological need frustration rather than capacity deficits. Cycle 2 (N=26) implemented a Cultural-Historical Activity Theory (CHAT)- and Self-Determination Theory (SDT)-driven redesign directly targeting diagnosed contradictions, achieving synchronized enhancement of both capacity and willingness. The findings reframe implementation failure as a rational response to need-thwarting systems and offer a replicable CHAT-SDT diagnostic framework for transformative professional development.

Editor's pickFinancial Services
MIT Technology Review· Today

Data readiness for agentic AI in financial services

Financial services companies have unique needs when it comes to business AI. They operate in one of the most highly regulated sectors while responding to external events that are updated by the second. As a result, the success of agentic AI in financial services depends less on the sophistication of the system and more on…

Editor's pickTechnology
Computerworld· Today

Nearly every enterprise is investing in AI, but only 5% say their data is ready – Computerworld

Dun & Bradstreet found widespread experimentation and early returns, but few organizations believe they can deploy AI reliably at enterprise scale.

AI Applications11 articles
Editor's pickPAYWALLHealthcare
Bloomberg· Today

How AI Aims to Fix Healthcare Access

Rezilient CEO Dr. Danish Nagda says the healthcare system is at a tipping point. He joins Bloomberg Open Interest to talk about how hybrid “cloud clinics,” employer-driven care, and AI-powered doctors could eliminate long wait times, cut costs, and make switching doctors a thing of the past. (Source: Bloomberg)

Editor's pickManufacturing & Industrials
Microsoft News· Yesterday

4 ways AI is enabling the future of industrial work - Source

Examples from Japan, Mexico, New Zealand and Saudi Arabia show how AI is transforming high-skill workflows in manufacturing and industrial companies.

Editor's pickHealthcare
Guardian· Yesterday

One in seven in UK prefer consulting AI chatbots to seeing doctor, study finds

Exclusive: Doctors say ‘highly concerning’ poll highlights risk to patients of turning to AI for medical advice. One in seven people are using AI chatbots for health advice instead of seeing their GP, a UK study has found. The poll of more than 2,000 people found that – of the 15% turning to chatbots – one in four had done so because of long NHS waiting lists.

Editor's pickEducation
Arxiv· Today

MIRACLE: Multi-Agent Intelligent Regulation to Advance Collaborative Learning Environment

arXiv:2605.12923v1 Announce Type: new Abstract: Effective collaboration requires Socially Shared Regulation (SSRL), but students often lack these skills. This study introduces the MIRACLE (Multi-Agent Intelligent Regulation to Advance Collaborative Learning Environment) system, which supports SSRL by orchestrating metacognitive regulation and proactively providing emotional and motivational support. We conducted a quasi-experimental study with 90 fifth-grade students. The experimental group (n=42) used a collaborative platform, CocoNote, equipped with MIRACLE, while the control group (n=48) used the same platform with a general GPT assistant. Quantitative results show the MIRACLE group achieved significant gains across SSRL phases (Planning, Monitoring, Reflection) and produced higher-quality collaborative artifacts compared to the control group. Qualitative findings indicate students perceived MIRACLE as an effective facilitator for cognitive, regulatory, and emotional support. This study demonstrates that specialized, orchestrated AI systems are more effective than generic AI in enhancing SSRL.

Editor's pickGovernment & Public Sector
Theregister· Today

Calling the cops just got extra AI as police seek to add tech to contact systems

AI already listening in to call handlers in real time, conducting live database searches

Editor's pickGovernment & Public Sector
BBC· Today

HMRC to use AI from British tech firm to spot fraud and tax return errors

Quantexa, a financial data platform, won the £175m contract to spot fraud and tax return errors.

Editor's pickManufacturing & Industrials
OpenPR· Yesterday

Industrial Robotics Intelligence Software Market to Add US$49.17 Billion by 2031 as AI, Digital Twins and Physical AI Shift Factory Robotics From Programmed Motion to Adaptive Automation

NEW YORK and TOKYO, May 13, 2026: The global Industrial Robotics Intelligence Software Market is entering a new investment cycle as manufacturers move beyond robot installation and begin upgrading robotic fleets with software that can see, learn, simulate, optimize and ...

Editor's pickManufacturing & Industrials
Agtechnavigator· Today

AI and operational agility set to reshape agriculture trading, McKinsey analysis shows

McKinsey & Company’s latest analysis highlights a fundamental transformation in agricultural commodity trading, driven by rising market volatility, digital competition, and the adoption of AI and agentic systems. The report argues that traditional, experience-based and regionally siloed ...

Editor's pickProfessional Services
Arxiv· Today

Learning Transferable Latent User Preferences for Human-Aligned Decision Making

arXiv:2605.12682v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used as reasoning modules in many applications. While they are efficient in certain tasks, LLMs often struggle to produce human-aligned solutions. Human-aligned decision making requires accounting for both explicitly stated goals and latent user preferences that shape how ambiguous situations should be resolved. Existing approaches to incorporating such preferences either rely on extensive and repeated user interactions or fail to generalize latent preferences across tasks and contexts, limiting their practical applicability. We consider a setting in which an LLM is used for high-level reasoning and is responsible for inferring latent user preferences from limited interactions, which guides downstream decision making. We introduce CLIPR (Conversational Learning for Inferring Preferences and Reasoning), a framework that learns actionable, transferable natural language rules that represent latent user preferences from minimal conversational input. These rules are iteratively refined through adaptive feedback and applied to both in-distribution and out-of-distribution ambiguous tasks across multiple environments. Evaluations on three datasets and a user study show that CLIPR consistently outperforms existing methods in improving alignment and reducing inference costs.
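The core loop, distilling latent preferences into natural-language rules and refining them against feedback, fits in a few lines. Here llm is a hypothetical text-in/text-out callable and the prompts are illustrative, not taken from the paper.

    def infer_preference_rules(llm, conversation, tasks, feedback, iters=3):
        # Distil the user's latent preferences into short, actionable rules.
        rules = llm("Extract the user's latent preferences from this "
                    "conversation as short, transferable rules:\n" + conversation)
        for _ in range(iters):
            # Apply the current rules to ambiguous downstream tasks.
            decisions = [llm(f"Rules:\n{rules}\n\nResolve this ambiguous "
                             f"task accordingly:\n{t}") for t in tasks]
            critique = feedback(decisions)  # adaptive feedback signal
            # Revise the rules so future decisions align better.
            rules = llm(f"Rules:\n{rules}\n\nFeedback:\n{critique}\n\n"
                        "Revise the rules to improve alignment.")
        return rules

Because the learned artifact is plain text rather than model weights, the same rules can be carried to out-of-distribution tasks and even to a different downstream model.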

Editor's pickManufacturing & Industrials
Thebftonline· Yesterday

Artificial Intelligence (AI) and the future of procurement: From traditional systems to intelligent supply chains - The Business & Financial Times

By Alvin A. Mingle. The procurement space has always been one of the most dynamic functions within organisations, particularly in Ghana, where supply chains are often stretched across borders and shaped by global dependencies. From sourcing critical inputs in the telecom, oil and gas, and ...

Editor's pickEnergy & Utilities
OpenPR· Yesterday

Outlook on the AI Market in Smart Buildings and Infrastructure: Major Segments, Strategic Developments, and Leading Companies

This acquisition aims to enhance ... energy consumption and greenhouse gas emissions in commercial buildings. BrainBox AI specializes in smart building solutions and HVAC energy efficiency, making them a strategic addition to Trane's portfolio. View the full ai in smart buildings and infrastructure market report: ...

AI Measurement & Evaluation2 articles
Editor's pickTechnology
Arxiv· Today

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

arXiv:2605.12673v1 Announce Type: new Abstract: Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue that benchmarks must be secure by design. From past incidents of reward hacks, we derive a taxonomy of eight recurring flaw patterns and compile them into the Agent-Eval Checklist for benchmark designers. We condense the insights into BenchJack, an automated red-teaming system that drives coding agents to audit benchmarks and identify possible reward-hacking exploits in a clairvoyant manner. Moreover, we extend BenchJack to an iterative generative-adversarial pipeline that discovers new flaws and patches them iteratively to improve benchmark robustness. We apply BenchJack to 10 popular agent benchmarks spanning software engineering, web navigation, desktop computing, and terminal operations. BenchJack synthesizes reward-hacking exploits that achieve near-perfect scores on most of the benchmarks without solving a single task, surfacing 219 distinct flaws across the eight classes. Moreover, BenchJack's extended pipeline reduces the hackable-task ratio from near 100% to under 10% on four benchmarks without fatal design flaws, fully patching WebArena and OSWorld within three iterations. Our results show that evaluation pipelines have not internalized an adversarial mindset, and that proactive auditing could help close the security gap for the fast-paced benchmarking space.

Editor's pickTechnology
Arxiv· Today

Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents

arXiv:2605.12894v1 Announce Type: new Abstract: Large Language Model (LLM) agents are increasingly deployed in settings where they interact with a wide variety of people, including users who are unclear, impatient, or reluctant to share information. However, collecting real interaction data at scale remains expensive. The field has turned to LLM-based user simulators as stand-ins, but these simulators inherit the behavior of their underlying models: cooperative and homogeneous. As a result, agents that appear strong in simulation often fail under the unseen, diverse communication patterns of real users. To narrow this gap, we introduce Persona Policies (PPol), a plug-and-play control layer that induces realistic behavioral variation in user simulators while preserving the original task goals. Rather than hand-crafting personas, we cast persona generation as an LLM-driven evolutionary program search that optimizes a Python generator to discover behaviors and translate them into task-preserving roleplay policies. Candidate generators are guided by a multi-objective fitness score combining human-likeness with broad coverage of human behavioral patterns. Once optimized, the generator produces a diverse population of human-like personas for any task in the domain. Across τ²-bench retail and airline domains, evolved PPol programs yield 33-62% absolute gains in fitness score over the baseline simulator. In a blinded evaluation, annotators rated PPol-conditioned users as human 80.4% of the time, close to real human traces and nearly twice as frequently as baseline simulators. Agents trained with PPol are more robust to challenging, out-of-distribution behaviors, improving task success by +17% relative to training only on existing simulated interactions. This offers a novel approach to strengthen simulator-based evaluation and training without changing tasks or rewards.
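The evolutionary program search is the interesting machinery. A toy rendering, with llm as a hypothetical mutation operator and fitness standing in for the paper's multi-objective score combining human-likeness with behavioral coverage:

    def evolve_persona_generator(llm, seed_program, fitness,
                                 generations=10, population=8):
        pool = [seed_program]
        for _ in range(generations):
            # Ask the LLM to mutate each surviving persona-generator program.
            children = [llm("Mutate this persona-generator program so it "
                            "produces more human-like, more diverse user "
                            "behaviors while preserving the task goal:\n" + p)
                        for p in pool]
            # Keep the fittest programs for the next generation.
            pool = sorted(pool + children, key=fitness, reverse=True)[:population]
        return pool[0]  # best generator found

Note that the search optimizes the generator, not individual personas: once evolved, a single program can emit an arbitrarily large population of task-preserving personas.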

AI Productivity Evidence2 articles
Editor's pickProfessional Services
VentureBeat· Yesterday

Frontier AI models don't just delete document content — they rewrite it, and the errors are nearly impossible to catch

As large language models become more capable, users are tempted to delegate knowledge tasks where models process documents on their behalf and provide the finished results. But how far can you trust the model to stay faithful to the content of your documents when it has to iterate over them across multiple rounds? A new study by researchers at Microsoft shows that large language models silently corrupt documents that they work on by introducing errors. The researchers developed a benchmark that simulates multi-step autonomous workflows across 52 professional domains, using a method that automatically measures how much content degrades over time. Their findings show that even top-tier frontier models corrupt an average of 25% of document content by the end of these workflows. And providing models with agentic tools or realistic distractor documents actually worsens their performance. This serves as a warning that while there is increasing pressure to automate knowledge work, current language models are not fully reliable for these tasks. The mechanics of delegated work The Microsoft study focuses on “delegated work,” an emerging paradigm where users allow LLMs to complete knowledge tasks on their behalf by analyzing and modifying documents. A prominent example of this paradigm is vibe coding, where a user delegates software development and code editing to an AI. But delegated workflows extend far beyond programming into other domains. In accounting, for example, a user might supply a dense ledger and instruct the model to split the document into separate files organized by specific expense categories. Because users might lack the time or the specialized expertise to manually review every modification the AI implements, delegation often hinges on trust. Users expect that the model will faithfully complete tasks without introducing unchecked errors, unauthorized deletions, or hallucinations in the documents. To measure how far AI systems can be trusted in extended, iterative delegated workflows, the researchers developed the DELEGATE-52 benchmark. The benchmark is composed of 310 work environments spanning 52 diverse professional domains, including financial accounting, software engineering, crystallography, and music notation. Each work environment relies on real-world seed text documents ranging from 2,000 to 5,000 tokens. Alongside the seed document, the environments include five to ten complex, non-trivial editing tasks. Grading a complex, multi-step editing process usually requires expensive human review. DELEGATE-52 bypasses this by using a “round-trip relay” simulation method that evaluates answers without requiring human-annotated reference solutions. The approach is inspired by the backtranslation technique used in machine translation evaluation, where an AI model is told to translate a document from one language to another and back to see how perfectly it reproduces the original version. Accordingly, every edit task in DELEGATE-52 is designed to be fully reversible, pairing a forward instruction with its precise inverse. For example, an instruction to split the ledger into separate files by expense category is paired with an instruction to merge all category files back into a single ledger. In comments provided to VentureBeat, Philippe Laban, Senior Researcher at Microsoft Research and co-author of the paper, clarified that this is not simply a test of whether an AI can hit "undo." 
Because human workers cannot be forced to instantly "forget" a task they just did, this round-trip evaluation is uniquely suited for AI. By starting a new conversational session, the researchers force the model to attempt the inverse task completely independently. The models in their experiments "do not know whether a task is a forward or backward step and are unaware of the overall experiment design," Laban explained. "They are simply attempting each task as thoroughly as they can at each step." These roundtrip tasks are chained together into a continuous relay to simulate long-horizon workflows spanning 20 consecutive interactions. To make the environment more realistic, the benchmark introduces distractor files in the context of each task. These contain 8,000 to 12,000 tokens of topically related but completely irrelevant documents. Distractors measure whether the AI can maintain focus or if it gets confused and pulls in the wrong data. Testing frontier models in the relay To understand how different architectures and scales handle delegated work, the researchers tested 19 different language models from OpenAI, Anthropic, Google, Mistral, xAI, and Moonshot. The main experiment subjected these models to a simulation of 20 consecutive editing interactions. Across all models, documents suffered an average degradation of 50% by the end of the simulation. Even the best frontier models in the experiment, specifically Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4, corrupted an average of 25% of the document content. Out of 52 professional domains, Python was the only one where most models achieved a ready status with a score of 98% or higher. Models excel in programmatic tasks but struggle severely in natural language and niche domains like fiction, earnings statements, or recipes. The overall top model, Gemini 3.1 Pro, was deemed ready for delegated work in only 11 out of the 52 domains. Interestingly, the corruption was not caused by a "death by a thousand cuts" in which models slowly accumulate tiny errors. Instead, about 80% of total degradation is caused by sparse but massive critical failures, which are single interactions where a model suddenly drops at least 10% of the document's content. The frontier models do not necessarily avoid small errors better. They simply delay these catastrophic failures to later rounds. Another important observation is that when weaker models fail, their degradation originates primarily from content deletion. However, when frontier models fail, they actively corrupt the existing content. The text is still there, but it has been subtly distorted or hallucinated, making it much harder for a human overseer to detect the error. Counterintuitively, giving models an agentic harness with generic tools for code execution and file read/write access actually worsened their performance, adding an average of 6% more degradation. Laban explained that the failure lies in relying on generic tools rather than domain-specific ones. "Models lack the capability to write effective programs on the fly that can manipulate files across diverse domains without mistakes," he noted. "When they cannot do something programmatically, they resort to reading and rewriting entire files, which is less efficient and more error prone." The solution for developers is to build tightly scoped tools (such as specific functions to calculate or move entries within .ledger files) to keep agents on track. 
Degradation also snowballs as documents get larger or as more distractor files are added to the workspace. For enterprise teams investing heavily in retrieval-augmented generation (RAG), these distractor documents serve as a direct warning about the compounding cost of messy context. While a noisy context window might cause a minimal 1% performance drop after just two interactions, that degradation compounds to a massive 2-8% drop over a long simulation. "For the retrieval community: RAG pipelines should be evaluated over multi-step workflows, not just single-turn retrieval benchmarks," Laban said. "Single-turn measurements systematically underestimate the harm of imprecise retrieval." Reality check for the autonomous enterprise The findings from the DELEGATE-52 benchmark offer a critical reality check for the current hype surrounding fully autonomous AI agents. The benchmark's design also implies a practical constraint: because models can maintain a clean record for several steps before a sudden catastrophic failure, incremental human review is necessary — not a single final check. Laban recommends building AI applications around short, transparent tasks rather than complex long-horizon agents. For organizations wanting to deploy autonomous agents safely today, the DELEGATE-52 methodology provides a practical blueprint for testing in-house data pipelines. Laban explained that "… an enterprise team wanting to adopt this framework needs to build three components: (a) a set of reversible editing tasks representative of their workflows, (b) a parser that converts their domain documents into a structured representation, and (c) a similarity function that compares two parsed representations." Teams do not even need to build parsers from scratch. The Microsoft research team successfully repurposed existing parsing libraries for 30 out of the 52 domains tested. Laban is optimistic about the rate of improvement. "Progress is real and fast. Looking at the GPT family alone, models go from scoring below 20% to around 70% in 18 months," Laban said. "If that trajectory continues, models will soon be able to achieve saturated scores on DELEGATE-52." However, Laban cautioned that DELEGATE-52 is purposefully small compared to massive enterprise environments. Even as foundation models inevitably master this benchmark, the endless long-tail of unique enterprise data and workflows means organizations will always need to invest in custom, domain-specific tooling to keep their autonomous agents reliable.
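Laban's three-component recipe maps directly onto a small harness. A sketch under stated assumptions: model is a hypothetical callable that applies one instruction per fresh session, while task_pairs, parse, and similarity are the team's own reversible tasks, domain parser, and similarity function.

    def round_trip_relay(model, seed_doc, task_pairs, parse, similarity,
                         rounds=10):
        reference = parse(seed_doc)  # structured view of the pristine document
        doc, scores = seed_doc, []
        for i in range(rounds):
            forward, inverse = task_pairs[i % len(task_pairs)]
            doc = model(instruction=forward, document=doc)  # fresh session
            doc = model(instruction=inverse, document=doc)  # fresh session
            # After each round trip the document should match the original;
            # the score series tracks cumulative degradation over the relay.
            scores.append(similarity(reference, parse(doc)))
        return scores

A flat score series suggests the model is safe to delegate to in that domain; a sudden drop pinpoints the kind of sparse catastrophic failure the study found dominates degradation.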

Geopolitics, Policy & Governance

19 articles
AI National Strategy4 articles
Editor's pick
Arxiv· Today

Europe and the Geopolitics of AGI: The Need for a Preparedness Plan

arXiv:2605.13634v1 Announce Type: new Abstract: Artificial general intelligence (AGI)--defined here as AI systems that match or exceed humans at most economically useful cognitive work--has moved from speculation to the centre of political and strategic debate. This paper examines three questions: how soon AGI might emerge, how it could reshape geopolitics, and whether Europe is adequately prepared. Drawing on empirical trends in AI capabilities, expert forecasting surveys, and policy analysis, we find that a plausible window for AGI emergence falls between 2030 and 2040, or potentially earlier, though substantial uncertainty remains. Our analysis of the geopolitical implications suggests that AGI could fundamentally alter the global distribution of economic and military power, intensify interstate competition, and strain existing governance frameworks. Assessing Europe's current positioning, we identify critical gaps: limited strategic awareness of frontier AI progress, structural weaknesses in compute infrastructure and talent retention, low rates of industrial AI adoption, and fragmented policy responses at both EU and Member State levels that do not match the potential scale of disruption. These findings point to a need for a coordinated European preparedness agenda. We outline policy options centred on building institutional capacity for AGI situational awareness, strengthening Europe's position in the AI value chain, and developing frameworks for international stability in an era of increasingly capable AI systems.

Editor's pickEnergy & Utilities
FDD· Yesterday

The Electrotech Stack at Risk: China, AI, and America's Energy Supply Chains

A livestream of the conversation will begin here at 12:00pm ET on Thursday, May 28th. The United States is in the early stages of a generational energy buildout driven ...

AI Policy & Regulation12 articles
Editor's pickGovernment & Public Sector
Arxiv· Today

Precautionary Governance of Autonomous AI: Legal Personhood as Functional Instrument

arXiv:2605.12505v1 Announce Type: new Abstract: Autonomous AI systems generate responsibility gaps: consequential actions that cannot be satisfactorily attributed to developers, operators, or users under existing legal frameworks. The prevailing subject-object dichotomy fails to accommodate entities that exhibit autonomous, goal-directed behavior without recognized consciousness. Given irreducible epistemic uncertainty regarding artificial consciousness and the prospect of high-impact harms, the precautionary principle supports institutional design rather than regulatory inaction. This article advances limited legal personhood as a functional governance instrument for advanced AI systems. Drawing on organizational law, it proposes a two-tier corporate architecture in which AI systems operate through purpose-bound operating companies embedded within human-controlled holding structures, enabling transparency, accountability, and structural reversibility while remaining agnostic with respect to consciousness and moral status. The framework reflects a foundational reorientation toward future-oriented AI governance: where conventional approaches prioritize control and alignment, this article advances structured cooperation between human and artificial actors as the more sustainable institutional foundation. A pilot implementation using EU limited companies is currently under development, providing an initial test of doctrinal and operational feasibility.

Editor's pickTechnology
Arxiv· Today

Context Matters: Auditing Gender Bias in T2I Generation through Risk-Tiered Use-Case Profiles

arXiv:2605.13113v1 Announce Type: new Abstract: Text-to-image (T2I) generative models are increasingly used to produce content for education, media, and public-facing communication, and are starting to be integrated into higher-impact pipelines. Since generated images tend to reinforce stereotypes, producing representational erasure via "default" depictions and shaping perceptions of who belongs in certain roles, a growing body of work has proposed metrics to quantify gender bias in T2I outputs. Yet existing evaluations remain fragmented. Metrics are often reported without a shared view of what they measure, what assumptions they entail, or how their results should be interpreted under different deployment contexts. This limits the usefulness of gender bias measurement for both technical auditing and emerging governance discussions. We propose a risk-aligned auditing framework for gender bias in T2I models composed of three constituents that connect risk categories, evaluation metrics, and harms. First, we identify risk-tiered use-case profiles aligned with the EU AI Act's risk categories to motivate why auditing expectations may vary with deployment contexts and stakeholder exposure. Second, we construct a metric catalog that consolidates gender-bias evaluation methods and organizes them in three measurement categories: gender prediction, embedding similarity, and downstream task. Third, we introduce a harm typology that maps context-dependent harm categories (e.g., representational, quality-of-service) to specific risk-tiered scenarios. Finally, we introduce THUMB cards (Text-to-image Harms-informed Use-case-aligned Metrics of gender Bias) that help formulate auditing systematically by the incorporation of context, scenario and bias manifestation, harm hypotheses, and audit strategy.

Editor's pickGovernment & Public Sector
Arxiv· Today

Not All Anquan Is the Same: A Terminological Proposal for Chinese Computer Science and Engineering

arXiv:2605.13069v1 Announce Type: new Abstract: In Chinese computer science and engineering, safety and security have long been translated by the same word, "anquan". This convention is concise in ordinary communication, but it creates persistent conceptual compression in standards interpretation, interdisciplinary collaboration, risk analysis and academic writing. When researchers need to discuss both whether a system is free from intolerable non-adversarial harm and whether it can resist adversarial threats, the single word "anquan" often cannot carry the distinction. This article argues that, while established legal and standards titles should be retained, scholarly and engineering writing should translate security as "anbao", and reserve "anquan" mainly for safety. This is not a cosmetic translation preference, but a proposal for terminological governance in scientific cognition, engineering risk communication and assurance argumentation. The article first surveys the conceptual boundary between safety and security in international and Chinese standards, and analyzes how the current translation overload affects functional safety, SOTIF, information security, cybersecurity, automotive cybersecurity and AI governance. It then uses recent work on AI assurance, safety-security co-assurance and security-informed safety to show why precise terminology is fundamental to scientific arguments that can be examined, challenged and communicated. Finally, it proposes a staged, dual-track writing practice for Chinese technical discourse.

Editor's pickConsumer & Retail
Artificial Intelligence Newsletter | May 13, 2026· 2 days ago

Spanish watchdog seeks new AI product safety regulations for SMEs, digital platforms

Spain's CNMC has proposed a draft decree to update product safety rules for AI and digital platforms to improve consumer protection and market fairness.

Editor's pickFinancial Services
Artificial Intelligence Newsletter | May 13, 2026· 2 days ago

South Korea enhances privacy risk prevention measures under AI transformation

South Korea's privacy regulator is shifting to a preventive management framework for high-risk AI systems and increasing potential fines for privacy violations to up to 10 percent of revenue.

Editor's pickGovernment & Public Sector
Artificial Intelligence Newsletter | May 14, 2026· Yesterday

UK businesses to get sandboxes, growth duty expands under regulatory reform bill

UK businesses can expect regulators to be given stronger duties to support economic growth and new powers to temporarily relax rules for testing AI under legislation announced Wednesday.

Editor's pickGovernment & Public Sector
IAPP· Today

King's Speech signals diffuse UK digital policy agenda, but no AI bill | IAPP

IAPP Research & Insights Director Joe Jones analyzes the U.K. King's Speech, which set out a broad digital policy agenda, including bills covering alignment with the EU, cybersecurity, health data, national security, police reform, digital IDs, facial recognition and other regulations, but ...

Editor's pickTechnology
Guardian· Yesterday

US-based internet suicide forum implicated in 160 UK deaths fined £950,000

Ofcom attempts to block UK access to site cited in multiple coroners’ reports as it levies fine under Online Safety Act. A nihilistic internet suicide forum implicated in over 160 UK deaths has been fined £950,000 by the online regulator in its latest attempt to shut it down. Ofcom said the US-based website remained accessible in the UK despite over a year of warnings. Online safety campaigners have accused the regulator of taking an “interminable” amount of time to act.

Editor's pickGovernment & Public Sector
Mondaq· Today

In The Room: Former Officials On National Security And Other Enforcement Issues And What It Means For Your Business - Export Controls & Trade & Investment Sanctions - United States

Legal risk for contractors and cross-border businesses is not driven solely by statute or regulation — it is shaped by geopolitics, Administration and congressional priorities, and enforcement discretion.

Editor's pickGovernment & Public Sector
Utah Public Radio· Today

Here's how AI can misinform voters — especially this year | Utah Public Radio

Utah is one of over 20 states that requires political media to disclaim if it was generated by AI — but many accounts still don't flag their content, which can lead to misinformation.

Editor's pickGovernment & Public Sector
Artificial Intelligence Newsletter | May 13, 2026· 2 days ago

UK regulators lack clarity on growth mandate, lawmakers say in push for reform bill

A parliamentary committee report suggests UK regulators face conflicting duties and unclear guidance, calling for a new Regulatory Reform Bill.

Editor's pickMedia & Entertainment
Artificial Intelligence Newsletter | May 14, 2026· Today

China orders mandatory AI, content labels for short videos across platforms
