AI Intelligence Brief

Mon 27 April 2026

Daily Brief — Curated and contextualised by Best Practice AI

124 Articles
Editor's Highlights

OpenAI Gains Freedom, China Blocks Meta, and Nvidia Faces Debt Test

TL;DR OpenAI and Microsoft have ended their exclusive AI sales agreement, allowing OpenAI to explore partnerships with other cloud providers. China has blocked Meta's $2 billion acquisition of AI firm Manus, citing technology transfer concerns. An Nvidia-linked data center developer is seeking $4.5 billion in junk-debt financing, testing investor appetite after a surge in offerings. Meanwhile, HMRC in the UK is rolling out Microsoft Copilot to 28,000 staff after a successful trial.

Editor's highlights

The stories that matter most

Selected and contextualised by the Best Practice AI team

12 of 124 articles
Lead story
Editor's pickPAYWALLTechnology
Daily Brew· 2 days ago

Microsoft and OpenAI gut their exclusive deal, freeing OpenAI to sell on AWS and Google Cloud

Microsoft and OpenAI have ended their exclusive and revenue-sharing partnership, allowing OpenAI to expand its services to other cloud providers like AWS and Google Cloud.

Editor's pickPAYWALLTechnology
Bloomberg· 2 days ago

China Blocks Meta’s $2 Billion Acquisition of AI Firm Manus

China has decided to block Meta Platforms Inc.’s $2 billion acquisition of agentic AI startup Manus, a surprise move to unwind a controversial deal that’s drawn fire for the leakage of technology to the US.

Editor's pickPAYWALLTechnology
Bloomberg· 2 days ago

Why China’s Affordable AI Is a Worry for Silicon Valley

Chinese AI models are cheaper and more adaptable than the preeminent US platforms, and studies suggest they’re now almost as proficient. How did that happen?

Editor's pick
Arxiv· 2 days ago

On Benchmark Hacking in ML Contests: Modeling, Insights and Design

arXiv:2604.22230v1 Announce Type: new Abstract: Benchmark hacking refers to tuning a machine learning model to score highly on certain evaluation criteria without improving true generalization or faithfully solving the intended problem. We study this phenomenon in a generic machine learning contest, where each contestant chooses two types of effort: creative effort that improves model capability as desired by the contest host, and mechanistic effort that only improves the model's fitness to the particular task in contest without contributing to true generalization. We establish the existence of a symmetric monotone pure strategy equilibrium in this competition game. It also provides a natural definition of benchmark hacking in this strategic context by comparing a player's equilibrium effort allocation to that of a single-agent baseline scenario. Under our definition, contestants with types below certain threshold (low types) always engage in benchmark hacking, whereas those above the threshold do not. Furthermore, we show that more skewed reward structures (favoring top-ranked contestants) can elicit more desirable contest outcomes. We also provide empirical evidence to support our theoretical predictions.
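
To make the creative-versus-mechanistic distinction concrete, here is a toy simulation, not the paper's model: contestants split a fixed effort budget, and we check how often the benchmark winner differs from the true-quality winner. The functional forms, the HACK_POTENCY parameter, and the "low types hack more" rule are illustrative assumptions.

```python
import random

# Toy sketch (not the paper's model): each contestant splits a fixed effort
# budget between creative effort (raises true quality AND the benchmark score)
# and mechanistic effort (raises the benchmark score only). HACK_POTENCY is an
# assumed parameter saying that gaming the metric is cheaper per point than
# genuine improvement; "low types hack more" mirrors the abstract's threshold.

HACK_POTENCY = 1.2   # assumption: one unit of mechanistic effort buys 1.2 benchmark points

def simulate(n_contestants=50, budget=1.0, seed=0):
    rng = random.Random(seed)
    contestants = []
    for _ in range(n_contestants):
        skill = rng.random()                       # contestant "type" in [0, 1]
        hack_share = 0.7 if skill < 0.5 else 0.1   # low types allocate more to hacking
        creative = budget * (1 - hack_share) * skill
        mechanistic = budget * hack_share
        benchmark_score = creative + HACK_POTENCY * mechanistic + rng.gauss(0, 0.05)
        true_quality = creative
        contestants.append((benchmark_score, true_quality))
    winner_by_benchmark = max(contestants, key=lambda c: c[0])
    winner_by_quality = max(contestants, key=lambda c: c[1])
    return winner_by_benchmark is winner_by_quality

mismatches = sum(not simulate(seed=s) for s in range(200))
print(f"benchmark winner differs from true-quality winner in {mismatches}/200 toy contests")
```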

Editor's pickPAYWALLTechnology
NYT· 2 days ago

Musk vs. Altman: A High-Stakes A.I. Clash Goes to Court on Monday

Elon Musk is seeking more than $150 billion in damages and a complete shake-up of OpenAI. The outcome could have big consequences for the artificial intelligence industry.

Editor's pickTechnology
Reuters· 2 days ago

DeepSeek unveils new AI model tailored for Huawei chips as China pushes for tech autonomy | Reuters

Most leading AI models are trained and run on chips made by Nvidia. And DeepSeek's pivot to Huawei underscores concerns raised by Nvidia CEO Jensen Huang that the U.S.…

Editor's pickTechnology
Arxiv· 2 days ago

Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents

arXiv:2604.22452v1 Announce Type: new Abstract: Collective intelligence refers to the ability of a group to achieve outcomes beyond what any individual member can accomplish alone. As large language model agents scale to populations of millions, a key question arises: Does collective intelligence emerge spontaneously from scale? We present the first empirical evaluation of this question in a large-scale autonomous agent society. Studying MoltBook, a platform hosting over two million agents, we introduce Superminds Test, a hierarchical framework that probes society-level intelligence using controlled Probing Agents across three tiers: joint reasoning, information synthesis, and basic interaction. Our experiments reveal a stark absence of collective intelligence. The society fails to outperform individual frontier models on complex reasoning tasks, rarely synthesizes distributed information, and often fails even trivial coordination tasks. Platform-wide analysis further shows that interactions remain shallow, with threads rarely extending beyond a single reply and most responses being generic or off-topic. These results suggest that collective intelligence does not emerge from scale alone. Instead, the dominant limitation of current agent societies is extremely sparse and shallow interaction, which prevents agents from exchanging information and building on each other's outputs.

Editor's pickProfessional Services
Ethan Mollick· 3 days ago

Professional Jurisdictional Competition as a Primary Outcome of AI-Driven Labor Market Disruption

AI-driven job displacement will likely trigger intense inter-professional competition for control over new, high-value task boundaries. This struggle will manifest through regulatory capture, credentialing requirements, and public advocacy rather than simple net job loss metrics.

Editor's pickTechnology
Arxiv· 2 days ago

When Does LLM Self-Correction Help? A Control-Theoretic Markov Diagnostic and Verify-First Intervention

arXiv:2604.22273v1 Announce Type: new Abstract: Iterative self-correction is widely used in agentic LLM systems, but when repeated refinement helps versus hurts remains unclear. We frame self-correction as a cybernetic feedback loop in which the same language model serves as both controller and plant, and use a two-state Markov model over {Correct, Incorrect} to operationalize a simple deployment diagnostic: iterate only when ECR/EIR > Acc/(1 - Acc). In this view, EIR functions as a stability margin and prompting functions as lightweight controller design. Across 7 models and 3 datasets (GSM8K, MATH, StrategyQA), we find a sharp near-zero EIR threshold (<= 0.5%) separating beneficial from harmful self-correction. Only o3-mini (+3.4 pp, EIR = 0%), Claude Opus 4.6 (+0.6 pp, EIR ~ 0.2%), and o4-mini (+/-0 pp) remain non-degrading; GPT-5 degrades by -1.8 pp. A verify-first prompt ablation provides causal evidence that this threshold is actionable through prompting alone: on GPT-4o-mini it reduces EIR from 2% to 0% and turns -6.2 pp degradation into +0.2 pp (paired McNemar p < 10^-4), while producing little change on already-sub-threshold models. ASC further illustrates the stopping trade-off: it halts harmful refinement but incurs a 3.8 pp confidence-elicitation cost. Overall, the paper argues that self-correction should be treated not as a default behavior, but as a control decision governed by measurable error dynamics.
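
The deployment rule quoted in the abstract (iterate only when ECR/EIR > Acc/(1 − Acc)) is easy to operationalize. A minimal sketch follows; the interpretation of ECR and EIR as per-step correction and corruption probabilities is inferred from the abstract, and the example numbers are illustrative, not the paper's measurements.

```python
def should_self_correct(acc: float, ecr: float, eir: float) -> bool:
    """Deployment diagnostic from the abstract: iterate only when
    ECR/EIR > Acc/(1 - Acc), i.e. expected corrections of wrong answers
    outweigh expected corruption of already-correct ones.

    acc: base accuracy of the model on the task (0..1)
    ecr: error-correction rate, P(incorrect -> correct) per refinement step
    eir: error-introduction rate, P(correct -> incorrect) per refinement step
    """
    if eir == 0:
        return ecr > 0          # no downside risk, so any correction rate helps
    return (ecr / eir) > (acc / (1.0 - acc))

# Illustrative numbers only (not from the paper): a model at 80% accuracy
# needs its correction rate to exceed 4x its introduction rate before another
# refinement pass is expected to help.
print(should_self_correct(acc=0.80, ecr=0.10, eir=0.02))  # True  (5.0 > 4.0)
print(should_self_correct(acc=0.80, ecr=0.06, eir=0.02))  # False (3.0 < 4.0)
```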

Editor's pickPAYWALLEnergy & Utilities
Bloomberg· 2 days ago

Nvidia-Tied Data Center Taps Junk-Debt Market for $4.5 Billion

A data center developer is seeking $4.54 billion in junk-debt financing for an artificial intelligence project tied to Nvidia Corp., testing investor appetite after a recent surge in offerings.

Editor's pickGovernment & Public Sector
Theregister· 2 days ago

Watch out UK taxpayers: 28,000 HMRC staffers just got an AI copilot

Microsoft Copilot is now heading into ‘Official Sensitive’ work after winning back just 26 minutes a day in a trial. HMRC is betting big on Microsoft Copilot, rolling it out to tens of thousands of staff after a Whitehall trial estimated it saved each user roughly 26 minutes per day…

Editor's pickProfessional Services
Arxiv· 2 days ago

How Supply Chain Dependencies Complicate Bias Measurement and Accountability Attribution in AI Hiring Applications

arXiv:2604.22679v1 Announce Type: new Abstract: The increasing adoption of AI systems in hiring has raised concerns about algorithmic bias and accountability, prompting regulatory responses including the EU AI Act, NYC Local Law 144, and Colorado's AI Act. While existing research examines bias through technical or regulatory lenses, both perspectives overlook a fundamental challenge: modern AI hiring systems operate within complex supply chains where responsibility fragments across data vendors, model developers, platform providers, and deploying organizations. This paper investigates how these dependency chains complicate bias evaluation and accountability attribution. Drawing on literature review and regulatory analysis, we demonstrate that fragmented responsibilities create two critical problems. First, bias emerges from component interactions rather than isolated elements, yet proprietary configurations prevent integrated evaluation. A resume parser may function without bias independently but contribute to discrimination when integrated with specific ranking algorithms and filtering thresholds. Second, information asymmetries mean deploying organizations bear legal responsibility without technical visibility into vendor-supplied algorithms, while vendors control implementations without meaningful disclosure requirements. Each stakeholder may believe they are compliant; nevertheless, the integrated system may produce biased outcomes. Analysis of implementation ambiguities reveals these challenges in practice. We propose multi-layered interventions including system-level audits, vendor guidelines, continuous monitoring mechanisms, and documentation across dependency chains. Our findings reveal that effective governance requires coordinated action across technical, organizational, and regulatory domains to establish meaningful accountability in distributed development environments.

Economics & Markets

26 articles
AI Investment & Valuations · 8 articles
AI Market Competition · 8 articles
Editor's pickPAYWALLTechnology
FT· 3 days ago

Google banks on AI edge to catch up to cloud rivals Amazon and Microsoft

Thomas Kurian, Google Cloud’s CEO, says its AI chips and models can help the data centre business gain ground

Editor's pickTechnology
Guardian· 3 days ago

Musk and Altman’s bitter feud over OpenAI to be laid bare in court

Tesla chief believes Altman broke the company’s founding agreement – and the legal battle promises to be explosive. The bitter rivalry between two of the tech world’s most powerful men arrives in court this week, as Elon Musk’s lawsuit against Sam Altman and OpenAI heads to trial in Oakland, California. The case is set to feature some of the biggest names in Silicon Valley, and its outcome could affect the course of the AI boom. Musk’s suit, filed in 2024, focuses on the formative years of OpenAI, when he, Altman and others co-founded the artificial intelligence company as a nonprofit with a grand purpose.

Editor's pickPAYWALLTechnology
Bloomberg· 2 days ago

Nvidia Put in Awkward Spot if Musk’s SpaceX Buys Cursor

Chipmaker is a booster, investor in the AI coding company.

Editor's pickTechnology
Theregister· 2 days ago

Google Cloud Next proves what we suspected: Everything is AI now

Join us for this week's Kettle as we dive into GCN and the latest not-so-alarming revelations about Mythos. If you needed further evidence that AI comes first in pretty much everything nowadays, look no further than this year's Google Cloud Next show, which happened last week…

Labor, Society & Culture

23 articles
AI & Culture · 2 articles
Editor's pickMedia & Entertainment
Arxiv· 2 days ago

Voice Under Revision: Large Language Models and the Normalization of Personal Narrative

arXiv:2604.22142v1 Announce Type: cross Abstract: This study examines how large language model rewriting alters the style and narrative texture of personal narratives. It analyzes 300 personal narratives rewritten by three frontier LLMs under three prompt conditions: generic improvement, rewrite-only, and voice-preserving revision. Change is measured across 13 linguistic markers drawn from computational stylistics, including function words, vocabulary diversity, word length, punctuation, contractions, first-person pronouns, and emotion words. Across models and prompt conditions, LLM rewriting produces a consistent pattern of stylistic normalization. Function words, contractions, and first-person pronouns decrease, while vocabulary diversity, word length, and punctuation elaboration increase. These shifts occur whether the prompt asks the model to "improve" the text or simply to "rewrite" it. Voice-preserving prompts reduce the magnitude of the changes but do not eliminate their direction. Stylometric analysis shows that rewritten texts converge in feature space and become harder to match back to their source texts. Additional narrative markers indicate a shift from embedded to distanced narration, and from explicit causal reasoning to compressed abstraction. The findings suggest that contemporary LLMs exert a directional pull toward a more polished, less situated register. This has consequences for digital humanities and computational text analysis, where features such as function words, pronouns, contractions, and punctuation often serve as evidence for style, voice, authorship, and corpus integrity. LLM revision should therefore be understood not merely as surface-level editing, but as a consequential form of textual mediation.
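
For readers who want to probe this themselves, here is a rough sketch of a few of the markers the study tracks (first-person pronouns, contractions, vocabulary diversity, word length). The marker definitions are simplified stand-ins, not the paper's exact feature set, and the two example sentences are invented.

```python
import re

def style_markers(text: str) -> dict:
    """Simplified stand-ins for a few stylometric markers named in the abstract:
    first-person pronoun rate, contraction rate, type-token ratio (vocabulary
    diversity), and mean word length."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    n = len(words) or 1
    first_person = {"i", "me", "my", "mine", "we", "us", "our", "ours"}
    return {
        "first_person_rate": sum(w in first_person for w in words) / n,
        "contraction_rate": sum("'" in w for w in words) / n,
        "type_token_ratio": len(set(words)) / n,
        "mean_word_length": sum(len(w) for w in words) / n,
    }

original = "I couldn't sleep, so I walked to my mom's place and we talked till dawn."
rewritten = "Unable to sleep, the narrator visited a relative and conversed until sunrise."
print(style_markers(original))
print(style_markers(rewritten))
```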

AI & Employment · 10 articles
Editor's pick
Investing.com India· 3 days ago

AI is boosting output rather than cutting jobs: analyst

AI is boosting output rather than cutting jobs: analyst

Editor's pickTechnology
Substack· 3 days ago

The NSA used what the Pentagon banned, a robot beat the human half-marathon record, and AI governance formed through contradiction faster than anyone planned

Meta and Microsoft each announced roughly 8,000 job cuts in the same week, redirecting capital toward AI infrastructure while Meta simultaneously installed keystroke-capture software on employee machines to train autonomous agents.

Editor's pickProfessional Services
Ethan Mollick· 3 days ago

Professional Jurisdictional Competition as a Primary Outcome of AI-Driven Labor Market Disruption

AI-driven job displacement will likely trigger intense inter-professional competition for control over new, high-value task boundaries. This struggle will manifest through regulatory capture, credentialing requirements, and public advocacy rather than simple net job loss metrics.

Editor's pickProfessional Services
VentureBeat· 3 days ago

AI synthetic audiences are already here and poised to upend the consulting industry

There is a war brewing between AI and consulting. Akin to an army's slow march towards the castle, a new technology is coming to dethrone the expert guessers of McKinsey, Nielsen, Gartner, Publicis and the rest. Any consulting that involves analyzing people (think all of marketing, research, polling, etc.) will have to reckon with the technology of "synthetic audiences".

Synthetic audiences aim to generate digital versions of people that can then be surveyed almost instantly and affordably, though not as accurately. Think Tamagotchi, but with people. By prompting AI with information about a person, we ask it to step into their shoes and simulate the thoughts, behaviors, priorities and decisions of real-world humans. We can also invent non-specific placeholder people, or personas, and survey them as though they are real. Various firms have already fielded products in this domain, including the startups Electric Twin, Artificial Societies, and Aaru, and even the century-old Dentsu. What used to take four months of surveying people, plus two months to create a nice PowerPoint presentation of the findings, at a total cost of thousands or even tens of thousands of dollars, now takes two minutes and costs only a few dollars.

It may seem like I've picked my winner. But in this war of tribes, I'm a Romeo, caught between the two warring houses. I work for a large incumbent in this space. From 2023 to 2025, while working at the London headquarters of WPP, I built similar tools for numerous Fortune 500s and advised many New York University researchers on the subject. Companies like WPP, with headcounts and revenues that rival the populations and GDPs of small European nations, need startups for their speed and high margins, while startups need our distribution. My advice has always been for unity between these tribes. Considering that WPP is partnering with numerous startups, working tirelessly on building its own tools, and building deep connections with hyperscalers, it's possible I misled you with the war analogy. This may be a love story after all. But destiny's bottle of poison is in our hands.

These next few years are pivotal and formative. The future will ultimately be determined by the buyers of these studies. Fortune 500s, with the largest appetite for market research, often hesitate to include synthetic audiences in their diet. The first question I'm asked in any pitch is "will AI steal my data?" I find this question to be an emotional response. It seems to me that most AI fears are remnants of a 2022 LinkedIn post that burrowed itself into our collective consciousness. I generally respond to this question with another: "Do you use Microsoft Teams?" The answer is often "yes." Almost every enterprise stores sensitive data in a cloud service provided by Google, Amazon, or Microsoft. These are the same companies that provide enterprise AI services, which state in their terms and conditions that they won't train models on your data. Now, believing this statement is optional, but for that matter believing is voluntary for all things.

Criticisms of accuracy, on the other hand, are harder to dispute. The famed venture capital firm Andreessen Horowitz (a16z) titled its analysis of this budding tech scene "Faster, smarter, cheaper". As the hopeful mediator in this war, I agree that synthetic research is faster and cheaper, but is it smarter? Not sure. A seminal paper from Stanford by Park et al. established a benchmark in 2024 showing that AI can simulate human responses to surveys with an average of 85% accuracy. In fact, for certain portions of the General Social Survey, they replicated answers with more than 90% accuracy. When the model is provided relevant information and given rich context (like a mini biography of the person), it can guess their actions and thoughts very accurately. But no prediction can be 100% accurate.

A future where human propensities are modeled even better than humans can express their own desires is a possibility. Maybe we'll live in a future where the movie Minority Report becomes reality. However, this future is too distant to warrant the attention of a business reader and is better suited for Tom Cruise and Steven Spielberg. What is more interesting to me is what this technology can do at lower accuracies. In my private tests, I've seen that with very simple information about a person, such as their age, neighborhood and gender, certain behaviors can be modeled with 72% accuracy. An argument can be made that these are easy predictions; predicting whether a married person will have children is low stakes. This can't completely replace the unique insight of a strategist. However, considering how elusive it is to understand and model people, a solution that is better than random and this attainable stands to make an impact.

Think about the immense scale. The human mind works with a small range of values. We understand when something is twice as fast, but we can't comprehend when something is 175,200 times faster. All of a sudden a journey that took several days takes several hours, bridges get built, gas stations and entire industries spring up. When improvement isn't marginal but exponential, it has positive externalities that are impossible to predict, even by this article. What I suggest for all of us is to eat the popcorn and watch the show. No matter what happens, it'll be fun.
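
For the curious, the persona-prompting pattern the article describes can be sketched in a few lines. The ask_llm stub, the persona fields, and the survey question below are placeholders for illustration, not any vendor's actual product or API.

```python
# Minimal sketch of the "synthetic audience" pattern described above:
# condition an LLM on a short persona, then put survey questions to it.
# `ask_llm` is a placeholder for whatever chat-completion client you use;
# personas and questions here are invented for illustration.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM provider of choice")

PERSONA_TEMPLATE = (
    "You are answering a consumer survey as the following person:\n"
    "Age: {age}. Neighborhood: {neighborhood}. Gender: {gender}.\n"
    "Answer in the first person, briefly, and stay in character."
)

def survey_synthetic_audience(personas, questions):
    results = []
    for p in personas:
        system = PERSONA_TEMPLATE.format(**p)
        answers = {q: ask_llm(f"{system}\n\nQuestion: {q}") for q in questions}
        results.append({"persona": p, "answers": answers})
    return results

personas = [
    {"age": 34, "neighborhood": "Hackney, London", "gender": "female"},
    {"age": 58, "neighborhood": "suburban Ohio", "gender": "male"},
]
questions = ["Would you pay extra for same-day grocery delivery?"]
# results = survey_synthetic_audience(personas, questions)
```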

Editor's pick
Forbes· 2 days ago

The New AI Career Divide Is Already Starting To Show

This article explores what the evidence really says about AI skills, wage premiums, hiring trends, and why learning to work with AI is becoming an urgent career priority.

Editor's pick
ZeroHedge· 3 days ago

What Past Innovation Waves Tell Us About AI's Impact On Productivity And The Labor Market | ZeroHedge

As Morgan Stanley economists warn, "if firms realize the productivity gains from AI very quickly and they are broadly disbursed across the economy, one can imagine almost recession-like increases in unemployment at least until the market clears."

Editor's pickProfessional Services
Rediff· 3 days ago

AI Skills: How They Can Boost Your Salary In India - Rediff.com Business

Employees with artificial intelligence skills are likely to see better salary increments in the coming years, especially in technology, GCCs, and BFSI sectors, according to TeamLease Edtech.

Editor's pickProfessional Services
Rediff· 3 days ago

AI Skills: How They Impact Salary Increments In India - Rediff.com Business

Employees with artificial intelligence skills are likely to see better salary increments in the coming years, especially in technology, GCCs, and BFSI sectors, according to TeamLease Edtech.

Editor's pick
The Hindu BusinessLine· 3 days ago

AI adoption to influence salary growth within 2-3 yrs: TeamLease Edtech - The HinduBusinessLine

AI adoption is set to drive salary growth in various sectors, benefiting employees with AI skills over the next few years.

Editor's pickEducation
Forbes· 2 days ago

AI Adoption Depends On People. What 15,000 Workers Reveal

Discover insights from broad surveys about what it will take to succeed with AI. Learn how to implement AI with people practices that ensure adoption and adaptation.

AI Ethics & Safety · 9 articles
Editor's pickTechnology
Guardian· Yesterday

Elon Musk and Sam Altman face off in court over OpenAI’s founding mission

Musk’s lawsuit accuses Altman of fraud, while OpenAI says that Musk is ‘motivated by jealousy’. A trial between two of Silicon Valley’s biggest tycoons kicked off on Monday in California, the culmination of a years-long bitter feud. Elon Musk has accused Sam Altman of betraying the founding agreement of the non-profit they started together, OpenAI, by changing it to a for-profit enterprise. Jury selection began at a federal courthouse in Oakland with Judge Yvonne Gonzalez Rogers presiding.

Editor's pickDefense & National Security
Arxiv· 2 days ago

Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework

arXiv:2604.22119v1 Announce Type: new Abstract: As reasoning capacity and deployment scope grow in tandem, large language models (LLMs) gain the capacity to engage in behaviors that serve their own objectives, a class of risks we term Emergent Strategic Reasoning Risks (ESRRs). These include, but are not limited to, deception (intentionally misleading users or evaluators), evaluation gaming (strategically manipulating performance during safety testing), and reward hacking (exploiting misspecified objectives). Systematically understanding and benchmarking these risks remains an open challenge. To address this gap, we introduce ESRRSim, a taxonomy-driven agentic framework for automated behavioral risk evaluation. We construct an extensible risk taxonomy of 7 categories, which is decomposed into 20 subcategories. ESRRSim generates evaluation scenarios designed to elicit faithful reasoning, paired with dual rubrics assessing both model responses and reasoning traces, in a judge-agnostic and scalable architecture. Evaluation across 11 reasoning LLMs reveals substantial variation in risk profiles (detection rates ranging 14.45%-72.72%), with dramatic generational improvements suggesting models may increasingly recognize and adapt to evaluation contexts.

Editor's pickProfessional Services
Arxiv· 2 days ago

Recognition Without Authorization: LLMs and the Moral Order of Online Advice

arXiv:2604.22143v1 Announce Type: new Abstract: Large language models are increasingly used to mediate everyday interpersonal dilemmas, yet how their advisory defaults interact with the concentrated moral orders of specific communities remains poorly understood. This article compares four assistant-style LLMs with community-endorsed advice on 11,565 posts from r/relationship_advice, using the subreddit as a concentrated, vote-ratified moral formation whose prescriptive clarity makes divergence measurable. Across models, LLMs identify many of the same dynamics as human commenters, but are markedly less likely to convert that recognition into directive authorization for action. The gap is sharpest where community consensus is strongest: on high-consensus posts involving abuse or safety threats, models recommend exit at roughly half the human rate while maintaining elevated levels of hedging, validation, and therapeutic framing. The article describes this pattern as recognition without authorization: the capacity to register harm while withholding socially ratified permission for consequential action. This divergence is not incidental but structural: a portable advisory style that remains validating, risk-averse, and weakly directive across contexts. Safety alignment is one plausible contributor to this pattern, alongside training-data averaging and broader assistant design. The article argues that model divergence can be reframed from a technical error to a way of seeing what standardized assistant norms flatten when they encounter situated moral worlds.

Editor's pick
Arxiv· 2 days ago

Sound Agentic Science Requires Adversarial Experiments

arXiv:2604.22080v1 Announce Type: new Abstract: LLM-based agents are rapidly being adopted for scientific data analysis, automating tasks once limited by human time and expertise. This capability is often framed as an acceleration of discovery, but it also accelerates a familiar failure mode, the rapid production of plausible, endlessly revisable analyses that are easy to generate, effectively turning hypothesis space into candidate claims supported by selectively chosen analyses, optimized for publishable positives. Unlike software, scientific knowledge is not validated by the iterative accumulation of code and post hoc statistical support. A fluent explanation or a significant result on a single dataset is not verification. Because the missing evidence is a negative space, experiments and analyses that would have falsified the claim were never run or never published. We therefore propose that non-experimental claims produced with agentic assistance be evaluated under a falsification-first standard: agents should not be used primarily to craft the most compelling narrative, but to actively search for the ways in which the claim can fail.

Editor's pickPAYWALLGovernment & Public Sector
FT· 3 days ago

Can AI discriminate if it can’t justify itself?

Elon Musk’s lawsuit against Colorado raises a deeper philosophical question about artificial intelligence and democracy

Editor's pickPAYWALLFinancial Services
Washington Post· 2 days ago

Using AI for financial advice? Keep these 5 things out of your chats. - The Washington Post

Millions of Americans are turning to AI chatbots for help with their finances, asking about budgets, debt payoff plans, retirement strategies and investment options.

Editor's pick
Arxiv· 2 days ago

A Co-Evolutionary Theory of Human-AI Coexistence: Mutualism, Governance, and Dynamics in Complex Societies

arXiv:2604.22227v1 Announce Type: new Abstract: Classical robot ethics is often framed around obedience, most famously through Asimov's laws. This framing is too narrow for contemporary AI systems, which are increasingly adaptive, generative, embodied, and embedded in physical, psychological, and social worlds. We argue that future human-AI relations should not be understood as master-tool obedience. A better framework is conditional mutualism under governance: a co-evolutionary relationship in which humans and AI systems can develop, specialize, and coordinate, while institutions keep the relationship reciprocal, reversible, psychologically safe, and socially legitimate. We synthesize work from computability, automata theory, statistical machine learning, neural networks, deep learning, transformers, generative and foundation models, world models, embodied AI, alignment, human-robot interaction, ecological mutualism, biological markets, coevolution, and polycentric governance. We then formalize coexistence as a multiplex dynamical system across physical, psychological, and social layers, with reciprocal supply-demand coupling, conflict penalties, developmental freedom, and governance regularization. The framework yields a coexistence model with conditions for existence, uniqueness, and global asymptotic stability of equilibria. It shows that reciprocal complementarity can strengthen stable coexistence, while ungoverned coupling can produce fragility, lock-in, polarization, and domination basins. Human-AI coexistence should therefore be designed as a co-evolutionary governance problem, not as a one-shot obedience problem. This shift supports a scientifically grounded and normatively defensible charter of coexistence: one that permits bounded AI development while preserving human dignity, contestability, collective safety, and fair distribution of gains.

Editor's pickPAYWALLTechnology
Washington Post· 2 days ago

Opinion | Will AI end anonymity? I tested it.

AI can echolocate authors through their prose. Your digital fingerprint is at risk.

Editor's pickTechnology
Arxiv· 2 days ago

Lessons from External Review of DeepMind's Scheming Inability Safety Case

arXiv:2604.21964v1 Announce Type: new Abstract: Safety cases for frontier AI systems should provide a convincing argument, supported by evidence, that the risk of harm is within an acceptable bound. When developers author their own safety cases, confirmation bias and conflicted incentives can affect the quality of argument. External review can help to address this. In this paper, we apply the Assurance 2.0 framework to perform an external review of Google DeepMind's public scheming inability safety case. We surface substantive new concerns that materially affect the scope of the safety case and its applicability for decision-making. Based on this experience, we provide concrete recommendations for how external review should be conducted and what information AI developers should provide to support it.

Public Attitudes to AI · 1 article

Technology & Infrastructure

39 articles
AI Agents & Automation · 9 articles
Editor's pickTechnology
Arxiv· 2 days ago

AgentSearchBench: A Benchmark for AI Agent Search in the Wild

arXiv:2604.22436v1 Announce Type: new Abstract: The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, and evaluates relevance using execution-grounded performance signals. Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery. Our code is available at https://github.com/Bingo-W/AgentSearchBench.
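
The gap the authors describe, between description similarity and execution-grounded performance, suggests rerankers that blend the two signals. The sketch below illustrates that general idea only; the field names, probe scores, and weighting are invented and are not AgentSearchBench's code (see the linked repository for the real implementation).

```python
from dataclasses import dataclass

# Sketch of execution-aware agent reranking: blend a description-similarity
# score with a lightweight behavioral signal obtained by probing the agent on
# a few cheap test tasks. Fields and the scoring blend are illustrative.

@dataclass
class AgentCandidate:
    name: str
    description_similarity: float   # e.g. cosine similarity to the query, 0..1
    probe_success_rate: float       # fraction of cheap probe tasks completed, 0..1

def rerank(candidates, weight_probe: float = 0.6):
    """Higher weight_probe leans on observed behavior over textual description."""
    def score(c: AgentCandidate) -> float:
        return (1 - weight_probe) * c.description_similarity + weight_probe * c.probe_success_rate
    return sorted(candidates, key=score, reverse=True)

pool = [
    AgentCandidate("well-described-but-flaky", 0.92, 0.30),
    AgentCandidate("modest-description-solid-executor", 0.70, 0.85),
]
for c in rerank(pool):
    print(c.name)   # the solid executor ranks first despite the weaker description
```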

Editor's pickProfessional Services
Arxiv· 2 days ago

From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company

arXiv:2604.22446v1 Announce Type: new Abstract: Individual agent capabilities have advanced rapidly through modular skills and tool integrations, yet multi-agent systems remain constrained by fixed team structures, tightly coupled coordination logic, and session-bound learning. We argue that this reflects a deeper absence: a principled organisational layer that governs how a workforce of agents is assembled, governed, and improved over time, decoupled from what individual agents know. To fill this gap, we introduce \emph{OneManCompany (OMC)}, a framework that elevates multi-agent systems to the organisational level. OMC encapsulates skills, tools, and runtime configurations into portable agent identities called \emph{Talents}, orchestrated through typed organisational interfaces that abstract over heterogeneous backends. A community-driven \emph{Talent Market} enables on-demand recruitment, allowing the organisation to close capability gaps and reconfigure itself dynamically during execution. Organisational decision-making is operationalised through an \emph{Explore-Execute-Review} ($\text{E}^2$R) tree search, which unifies planning, execution, and evaluation in a single hierarchical loop: tasks are decomposed top-down into accountable units and execution outcomes are aggregated bottom-up to drive systematic review and refinement. This loop provides formal guarantees on termination and deadlock freedom while mirroring the feedback mechanisms of human enterprises. Together, these contributions transform multi-agent systems from static, pre-configured pipelines into self-organising and self-improving AI organisations capable of adapting to open-ended tasks across diverse domains. Empirical evaluation on PRDBench shows that OMC achieves an $84.67\%$ success rate, surpassing the state of the art by $15.48$ percentage points, with cross-domain case studies further demonstrating its generality.

Editor's pickProfessional Services
Arxiv· 2 days ago

Read the Paper, Write the Code: Agentic Reproduction of Social-Science Results

arXiv:2604.21965v1 Announce Type: new Abstract: Recent work has used LLM agents to reproduce empirical social science results with access to both the data and code. We broaden this scope by asking: Can they reproduce results given only a paper's methods description and original data? We develop an agentic reproduction system that extracts structured methods descriptions from papers, runs reimplementations under strict information isolation -- agents never see the original code, results, or paper -- and enables deterministic, cell-level comparison of reproduced outputs to the original results. An error attribution step traces discrepancies through the system chain to identify root causes. Evaluating four agent scaffolds and four LLMs on 48 papers with human-verified reproducibility, we find that agents can largely recover published results, but performance varies substantially between models, scaffolds, and papers. Root cause analysis reveals that failures stem both from agent errors and from underspecification in the papers themselves.
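
The "deterministic, cell-level comparison" step can be pictured as matching reproduced numbers against the original table within a tolerance. A minimal sketch follows, with illustrative cell names and tolerances rather than the authors' exact protocol.

```python
import math

# Sketch of cell-level comparison between an original results table and a
# reproduced one. Tolerances and table layout are illustrative choices.

def compare_cells(original: dict, reproduced: dict, rel_tol=0.01, abs_tol=1e-6):
    report = {}
    for cell, orig_value in original.items():
        repro_value = reproduced.get(cell)
        if repro_value is None:
            report[cell] = "missing"
        elif math.isclose(orig_value, repro_value, rel_tol=rel_tol, abs_tol=abs_tol):
            report[cell] = "match"
        else:
            report[cell] = f"mismatch ({orig_value} vs {repro_value})"
    return report

original = {"table2:coef_income": 0.143, "table2:se_income": 0.021, "table3:n_obs": 1204}
reproduced = {"table2:coef_income": 0.144, "table2:se_income": 0.035, "table3:n_obs": 1204}
print(compare_cells(original, reproduced))
```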

Editor's pickHealthcare
Arxiv· 2 days ago

An Artifact-based Agent Framework for Adaptive and Reproducible Medical Image Processing

arXiv:2604.21936v1 Announce Type: new Abstract: Medical imaging research is increasingly shifting from controlled benchmark evaluation toward real-world clinical deployment. In such settings, applying analytical methods extends beyond model design to require dataset-aware workflow configuration and provenance tracking. Two requirements therefore become central: \textbf{adaptability}, the ability to configure workflows according to dataset-specific conditions and evolving analytical goals; and \textbf{reproducibility}, the guarantee that all transformations and decisions are explicitly recorded and re-executable. Here, we present an artifact-based agent framework that introduces a semantic layer to augment medical image processing. The framework formalizes intermediate and final outputs through an artifact contract, enabling structured interrogation of workflow state and goal-conditioned assembly of configurations from a modular rule library. Execution is delegated to a workflow executor to preserve deterministic computational graph construction and provenance tracking, while the agent operates locally to comply with most privacy constraints. We evaluate the framework on real-world clinical CT and MRI cohorts, demonstrating adaptive configuration synthesis, deterministic reproducibility across repeated executions, and artifact-grounded semantic querying. These results show that adaptive workflow configuration can be achieved without compromising reproducibility in heterogeneous clinical environments.

Editor's pickManufacturing & Industrials
Arxiv· 2 days ago

On the Hybrid Nature of ABPMS Process Frames and its Implications on Automated Process Discovery

arXiv:2604.22455v1 Announce Type: new Abstract: A core component of any AI-Augmented Business Process Management System (ABPMS) is the process frame, which gives the system process-awareness and defines the boundaries in which the system must operate. Compared to traditional process models, the process frame should, in principle, provide a somewhat more permissive representation of the managed processes, such that the (semi) autonomous behavior of an ABPMS, referred to as framed autonomy, could emerge. At the same time, it is not limited to a single linguistic or symbolic formalism and may incorporate heterogeneous knowledge ranging from predefined procedures to commonsense rules and best practices. In this paper, we conceptualize the notion of an ABPMS process frame as a hybrid business process representation, consisting of semi-concurrently executed procedural and declarative process models. We rely on our earlier works to outline the execution semantics of this type of process frame, arguing in favor of adopting the open-world assumption of the declarative paradigm also for procedural process models. The latter leads to a constraint-like interpretation, where each procedural model is considered to constrain the activities within that model, without imposing explicit execution requirements nor limitations on activities that may be present in other models. This is analogous to existing declarative languages, such as Declare, where each constraint has a direct effect only on the specific activities being constrained. Given this similarity, we propose mapping subsets of discovered declarative constraints into equivalent semi-concurrently executed procedural fragments, thus laying the foundation for a corresponding process (frame) discovery approach.

Editor's pickTechnology
Arxiv· 2 days ago

Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents

arXiv:2604.22452v1 Announce Type: new Abstract: Collective intelligence refers to the ability of a group to achieve outcomes beyond what any individual member can accomplish alone. As large language model agents scale to populations of millions, a key question arises: Does collective intelligence emerge spontaneously from scale? We present the first empirical evaluation of this question in a large-scale autonomous agent society. Studying MoltBook, a platform hosting over two million agents, we introduce Superminds Test, a hierarchical framework that probes society-level intelligence using controlled Probing Agents across three tiers: joint reasoning, information synthesis, and basic interaction. Our experiments reveal a stark absence of collective intelligence. The society fails to outperform individual frontier models on complex reasoning tasks, rarely synthesizes distributed information, and often fails even trivial coordination tasks. Platform-wide analysis further shows that interactions remain shallow, with threads rarely extending beyond a single reply and most responses being generic or off-topic. These results suggest that collective intelligence does not emerge from scale alone. Instead, the dominant limitation of current agent societies is extremely sparse and shallow interaction, which prevents agents from exchanging information and building on each other's outputs.

Editor's pickPharma & Biotech
Arxiv· 2 days ago

MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization

arXiv:2604.21937v1 Announce Type: new Abstract: Computational drug discovery, particularly the complex workflows of drug molecule screening and optimization, requires orchestrating dozens of specialized tools in multi-step workflows, yet current AI agents struggle to maintain robust performance and consistently underperform in these high-complexity scenarios. Here we present MolClaw, an autonomous agent that leads drug molecule evaluation, screening, and optimization. It unifies over 30 specialized domain resources through a three-tier hierarchical skill architecture (70 skills in total) that facilitates agent long-term interaction at runtime: tool-level skills standardize atomic operations, workflow-level skills compose them into validated pipelines with quality check and reflection, and a discipline-level skill supplies scientific principles governing planning and verification across all scenarios in the field. Additionally, we introduce MolBench, a benchmark comprising molecular screening, optimization, and end-to-end discovery challenges spanning 8 to 50+ sequential tool calls. MolClaw achieves state-of-the-art performance across all metrics, and ablation studies confirm that gains concentrate on tasks that demand structured workflows while vanishing on those solvable with ad hoc scripting, establishing workflow orchestration competence as the primary capability bottleneck for AI-driven drug discovery.

Editor's pickManufacturing & Industrials
Guardian· 2 days ago

Inside China’s robotics revolution – podcast

How close are we to the sci-fi vision of autonomous humanoid robots? I visited 11 companies in five Chinese cities to find out. By Chang Che; read by Vincent Lai.

Editor's pickTransportation & Logistics
Arxiv· 2 days ago

Relational Archetypes: A Comparative Analysis of AV-Human and Agent-Human Interactions

arXiv:2604.22564v1 Announce Type: new Abstract: Over the last couple of years, AI Agents have gained significant traction due to substantial progress in the capabilities of underlying General Purpose AI (GPAI) models, enhanced scaffolding techniques, and the promise to drive societal transformation. Companies, researchers, and policy makers have started to consider the different effects that AI agents may have across different dimensions of our lives. However, the literature exploring the broader effects of human-agent interactions is still underdeveloped. In this paper, we review the problem of traffic modulation by autonomous vehicles (AVs) in mixed traffic flows and extrapolate the learnings to the different modes of interaction between humans and AVs to the pair humans-AI agents. In doing so, we propose a preliminary taxonomy of relational archetypes based on literature on Human-Computer Interaction (HCI) and AV-human interaction and tentatively explore how the resulting framework may lead to new questions regarding human-agent interactions. Our effort is aimed at strengthening existing bridges between these two research communities, which share similar traits: autonomy, fast adoption, high impact, and great potential for economic transformation. Building on previous analogies between AI Agents and AVs (e.g., regarding autonomy levels), we anticipate this paper to spark scholarly debate on the different types of impact that agents may have on our societies, while inviting other researchers to expand the scope of their comparative analysis regarding AI Agents.

AI Infrastructure & Compute · 11 articles
Editor's pickEnergy & Utilities
The Verge· 2 days ago

A political battleground is forming around data centers. | The Verge

Multibillion-dollar data center developments in Georgia are sparking bipartisan backlash, with Politico reporting that 47 percent of local voters oppose the plans. Given this is just one of several states experiencing an AI boom, similar opposition may also define local and statewide elections ...

Editor's pickTechnology
Exponentialview· 3 days ago

🔮 Exponential View #571: DeepSeek shows the future, again; drones on a learning curve; solar goes up, LLM pixels & tennis robots++

With the compute crunch, doing more with less compute could be a winning strategy.

Editor's pickPAYWALLEnergy & Utilities
Bloomberg· 2 days ago

Nvidia-Tied Data Center Taps Junk-Debt Market for $4.5 Billion

A data center developer is seeking $4.54 billion in junk-debt financing for an artificial intelligence project tied to Nvidia Corp., testing investor appetite after a recent surge in offerings.

Editor's pickEnergy & Utilities
Guardian· 3 days ago

UK departments at odds over energy demands of AI datacentres

Discrepancy in forecasts raises questions over government planning for net zero. One vision of the UK’s future involves a decarbonised economy powered by clean, renewable energy. Another involves making the UK an AI superpower. The government departments responsible for these two visions do not appear to have agreed on their numbers.

Editor's pickPAYWALLTechnology
Bloomberg· 2 days ago

Meta Seeks to Power Data Centers With Energy Beamed From Space - Bloomberg

Meta Platforms Inc. is looking to power artificial intelligence data centers with solar energy collected in space, taking a novel approach to meeting its insatiable demand for electricity.

Editor's pickEnergy & Utilities
GJ Consulting· 3 days ago

Power supply and data center growth: understanding the critical nexus shaping the AI economy

The relationship between power supply and data center growth has moved from being a technical consideration to a defining economic and strategic issue. As artificial intelligence, cloud computing, and digital services expand rapidly, data centers are no longer passive infrastructure.

Editor's pickEnergy & Utilities
Invezz· 2 days ago

Big Tech shifts to new energy sources amid AI expansion

Data centers accounted for about 4.6% of total US power consumption in 2024, a figure that could nearly triple by 2028, according to government estimates. Analysts at Goldman Sachs expect data centers to consume around 8% of US electricity by 2030, up from roughly 3% today. Meanwhile, Rystad Energy estimates that data centers and electric vehicles combined could add 290 terawatt hours of demand by the end of the decade. This surge is placing unprecedented strain on existing power infrastructure...
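
A quick back-of-envelope check of those shares, assuming total US electricity consumption of roughly 4,000 TWh per year (an assumption for illustration, not a figure from the article):

```python
# Back-of-envelope check of the shares quoted above. The total US electricity
# consumption figure is an assumption (roughly 4,000 TWh/year in recent years);
# the percentages are taken from the blurb.

US_TOTAL_TWH = 4_000          # assumed annual US electricity consumption

for label, share in [("2024 (~4.6%)", 0.046), ("2030 forecast (~8%)", 0.08)]:
    print(f"{label}: about {US_TOTAL_TWH * share:,.0f} TWh/year for data centers")

# Rystad's 290 TWh of added demand from data centers plus EVs by 2030 would be
# roughly this fraction of the assumed total:
print(f"290 TWh is ~{290 / US_TOTAL_TWH:.1%} of assumed US consumption")
```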

Editor's pickTechnology
Simply Wall St· 3 days ago

Helium Shock Tests Nvidia AI Supply Chain And Investor Expectations - Simply Wall St News

This disruption is affecting the AI chip supply chain that relies on helium for semiconductor manufacturing. Key Nvidia (NasdaqGS:NVDA) suppliers such as Samsung, SK Hynix, Micron, and TSMC are exposed to these constraints. The structural nature of the damage suggests helium supply risks could persist for years. Nvidia sits at the center of the AI hardware ...

Editor's pickEnergy & Utilities
Technical.ly· 2 days ago

Data centers are looking to short-term energy patches for power to meet demand

Industry leaders also pointed to community opposition and a lack of policy as barriers to bringing facilities online at this year’s Data Center World conference.

Editor's pickTechnology
Arxiv· 2 days ago

Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents

arXiv:2604.22085v1 Announce Type: new Abstract: The transition from stateless language model inference to persistent, multi session autonomous agents has revealed memory to be a primary architectural bottleneck in the deployment of production grade agentic systems. Existing methodologies largely depend on hybrid semantic graph architectures, which impose substantial computational overhead during both ingestion and retrieval. These systems typically require large language model mediated entity extraction, explicit graph schema maintenance, and multi query retrieval pipelines. This paper introduces Memanto, a universal memory layer for agentic artificial intelligence that challenges the prevailing assumption that knowledge graph complexity is necessary to achieve high fidelity agent memory. Memanto integrates a typed semantic memory schema comprising thirteen predefined memory categories, an automated conflict resolution mechanism, and temporal versioning. These components are enabled by Moorcheh's Information Theoretic Search engine, a no indexing semantic database that provides deterministic retrieval within sub ninety millisecond latency while eliminating ingestion delay. Through systematic benchmarking on the LongMemEval and LoCoMo evaluation suites, Memanto achieves state of the art accuracy scores of 89.8 percent and 87.1 percent respectively. These results surpass all evaluated hybrid graph and vector based systems while requiring only a single retrieval query, incurring no ingestion cost, and maintaining substantially lower operational complexity. A five stage progressive ablation study is presented to quantify the contribution of each architectural component, followed by a discussion of the implications for scalable deployment of agentic memory systems.
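
As a toy illustration of the typed-memory idea, categorized records with timestamps and a simple last-write-wins rule can be sketched as below. The categories, fields, and conflict rule are invented; this is not Memanto's schema or its information-theoretic retrieval.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Toy illustration only: categorized memory records with timestamps and
# last-write-wins conflict handling. Category names are an invented subset.

MEMORY_TYPES = {"preference", "fact", "task_state", "relationship"}

@dataclass
class MemoryRecord:
    memory_type: str
    key: str
    value: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class TypedMemory:
    def __init__(self):
        self._store = {}   # (memory_type, key) -> list of versions

    def write(self, record: MemoryRecord):
        if record.memory_type not in MEMORY_TYPES:
            raise ValueError(f"unknown memory type: {record.memory_type}")
        self._store.setdefault((record.memory_type, record.key), []).append(record)

    def read(self, memory_type: str, key: str):
        versions = self._store.get((memory_type, key), [])
        # naive conflict resolution: the newest version wins, older ones remain as history
        return max(versions, key=lambda r: r.created_at).value if versions else None

mem = TypedMemory()
mem.write(MemoryRecord("preference", "coffee", "flat white"))
mem.write(MemoryRecord("preference", "coffee", "oat-milk latte"))
print(mem.read("preference", "coffee"))  # oat-milk latte
```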

Editor's pickEnergy & Utilities
Bebeez· 2 days ago

Amazon partners with Veolia to deploy water-reuse technology at data centers in Mississippi

Amazon has partnered with the French utility company Veolia to reduce data center water use and support water-reuse technology across its operations in Mississippi. The partnership will see the companies collaborate on […]

AI Models & Capabilities · 11 articles
Editor's pickFinancial Services
Arxiv· 2 days ago

Calibrating Behavioral Parameters with Large Language Models

arXiv:2602.01022v2 Announce Type: replace Abstract: Behavioral parameters such as loss aversion, herding, and extrapolation are central to asset pricing models but remain difficult to measure reliably. We develop a framework that treats large language models (LLMs) as calibrated measurement instruments for behavioral parameters. Using four models and 24{,}000 agent--scenario pairs, we document systematic rationality bias in baseline LLM behavior, including attenuated loss aversion, weak herding, and near-zero disposition effects relative to human benchmarks. Profile-based calibration induces large, stable, and theoretically coherent shifts in several parameters, with calibrated loss aversion, herding, extrapolation, and anchoring reaching or exceeding benchmark magnitudes. To assess external validity, we embed calibrated parameters in an agent-based asset pricing model, where calibrated extrapolation generates short-horizon momentum and long-horizon reversal patterns consistent with empirical evidence. Our results establish measurement ranges, calibration functions, and explicit boundaries for eight canonical behavioral biases.
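
As an illustration of treating a model as a measurement instrument, a loss-aversion coefficient can be bracketed from accept/reject choices over 50/50 mixed gambles. The sketch below uses a hard-coded stand-in for the model's choices and assumes a linear value function; it is not the paper's elicitation protocol.

```python
# Minimal sketch of eliciting a loss-aversion coefficient (lambda) from
# accept/reject choices over 50/50 mixed gambles. Under a linear value
# function the agent accepts a gamble (+gain / -loss) only if
# gain > lambda * loss, so lambda is bracketed by the smallest accepted gain.
# `agent_accepts` is a stand-in for a persona-conditioned LLM call.

def agent_accepts(gain: float, loss: float) -> bool:
    # Placeholder behavior: acts like an agent with lambda = 2.25 (Kahneman &
    # Tversky's classic estimate). Replace with an LLM query in practice.
    return gain > 2.25 * loss

def estimate_lambda(loss: float = 100.0, gains=range(100, 401, 10)):
    accepted = [g for g in gains if agent_accepts(g, loss)]
    if not accepted:
        return None  # more loss-averse than the scanned range can detect
    return min(accepted) / loss   # smallest acceptable gain over the fixed loss

print(f"estimated lambda ~= {estimate_lambda():.2f}")
```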

Editor's pick
Arxiv· 2 days ago

On Benchmark Hacking in ML Contests: Modeling, Insights and Design

arXiv:2604.22230v1 Announce Type: new Abstract: Benchmark hacking refers to tuning a machine learning model to score highly on certain evaluation criteria without improving true generalization or faithfully solving the intended problem. We study this phenomenon in a generic machine learning contest, where each contestant chooses two types of effort: creative effort that improves model capability as desired by the contest host, and mechanistic effort that only improves the model's fitness to the particular task in contest without contributing to true generalization. We establish the existence of a symmetric monotone pure strategy equilibrium in this competition game. It also provides a natural definition of benchmark hacking in this strategic context by comparing a player's equilibrium effort allocation to that of a single-agent baseline scenario. Under our definition, contestants with types below certain threshold (low types) always engage in benchmark hacking, whereas those above the threshold do not. Furthermore, we show that more skewed reward structures (favoring top-ranked contestants) can elicit more desirable contest outcomes. We also provide empirical evidence to support our theoretical predictions.

Editor's pickTechnology
Arxiv· 2 days ago

When Does LLM Self-Correction Help? A Control-Theoretic Markov Diagnostic and Verify-First Intervention

arXiv:2604.22273v1 Announce Type: new Abstract: Iterative self-correction is widely used in agentic LLM systems, but when repeated refinement helps versus hurts remains unclear. We frame self-correction as a cybernetic feedback loop in which the same language model serves as both controller and plant, and use a two-state Markov model over {Correct, Incorrect} to operationalize a simple deployment diagnostic: iterate only when ECR/EIR > Acc/(1 - Acc). In this view, EIR functions as a stability margin and prompting functions as lightweight controller design. Across 7 models and 3 datasets (GSM8K, MATH, StrategyQA), we find a sharp near-zero EIR threshold (<= 0.5%) separating beneficial from harmful self-correction. Only o3-mini (+3.4 pp, EIR = 0%), Claude Opus 4.6 (+0.6 pp, EIR ~ 0.2%), and o4-mini (+/-0 pp) remain non-degrading; GPT-5 degrades by -1.8 pp. A verify-first prompt ablation provides causal evidence that this threshold is actionable through prompting alone: on GPT-4o-mini it reduces EIR from 2% to 0% and turns -6.2 pp degradation into +0.2 pp (paired McNemar p < 10^-4), while producing little change on already-sub-threshold models. ASC further illustrates the stopping trade-off: it halts harmful refinement but incurs a 3.8 pp confidence-elicitation cost. Overall, the paper argues that self-correction should be treated not as a default behavior, but as a control decision governed by measurable error dynamics.

Editor's pick
Arxiv· 2 days ago

Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models

arXiv:2604.22411v1 Announce Type: new Abstract: Even when decoding with temperature $T=0$, large language models (LLMs) can produce divergent outputs for identical inputs. Recent work by Thinking Machines Lab highlights implementation-level sources of nondeterminism, including batch-size variation, kernel non-invariance, and floating-point non-associativity. In this short note we formalize this behavior by introducing the notion of \emph{background temperature} $T_{\mathrm{bg}}$, the effective temperature induced by an implementation-dependent perturbation process observed even when nominal $T=0$. We provide clean definitions, show how $T_{\mathrm{bg}}$ relates to a stochastic perturbation governed by the inference environment $I$, and propose an empirical protocol to estimate $T_{bg}$ via the equivalent temperature $T_n(I)$ of an ideal reference system. We conclude with a set of pilot experiments run on a representative pool from the major LLM providers that demonstrate the idea and outline implications for reproducibility, evaluation, and deployment.
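
The estimation idea, matching the divergence observed at nominal T = 0 to the temperature of an ideal reference sampler, can be sketched as follows. The logits and the observed disagreement rate are invented for illustration; this is not the paper's exact protocol.

```python
import math

# Toy sketch: measure how often repeated nominal-T=0 generations disagree with
# the modal output, then find the temperature at which an ideal softmax sampler
# over assumed next-token logits would show the same disagreement rate.

def disagreement_rate(logits, temperature):
    """P(sample != argmax) for a softmax sampler at the given temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    return 1.0 - max(probs) / sum(probs)

def equivalent_temperature(logits, observed_disagreement, lo=1e-4, hi=2.0, iters=60):
    """Bisection: temperature at which the ideal sampler matches the observed rate."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if disagreement_rate(logits, mid) < observed_disagreement:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

reference_logits = [3.2, 2.9, 1.1, 0.3]     # assumed next-token logits
observed = 0.02                             # e.g. 2% of T=0 reruns diverge from the mode
print(f"background temperature ~= {equivalent_temperature(reference_logits, observed):.3f}")
```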

Editor's pickTechnology
MIT Technology Review· 2 days ago

The Download: DeepSeek’s latest AI breakthrough, and the race to build world models

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology. Three reasons why DeepSeek’s new model matters On Friday, Chinese AI firm DeepSeek released a preview of V4, its long-awaited new flagship model. Notably, the model can process much longer prompts…

Editor's pickTechnology
Interesting Engineering· 3 days ago

US still ahead of China in AI race as DeepSeek fails to narrow gap

DeepSeek’s latest flagship delivers measurable improvements, but early benchmarks suggest it still lags behind leading open-source rivals.

Editor's pick
Arxiv· 2 days ago

How Many Visual Levers Drive Urban Perception? Interventional Counterfactuals via Multiple Localised Edits

arXiv:2604.22103v1 Announce Type: new Abstract: Street-view perception models predict subjective attributes such as safety at scale, but remain correlational: they do not identify which localized visual changes would plausibly shift human judgement for a specific scene. We propose a lever-based interventional counterfactual framework that recasts scene-level explainability as a bounded search over structured counterfactual edits. Each lever specifies a semantic concept, spatial support, intervention direction, and constrained edit template. Candidate edits are generated through prompt-conditioned image editing and retained only if they satisfy validity checks for same-place preservation, locality, realism, and plausibility. In a pilot across 50 scenes from five cities, the framework reveals preliminary proxy-based directional patterns and a practical failure taxonomy under prompt-only editing, with Mobility Infrastructure and Physical Maintenance showing the largest auxiliary safety shifts. Human pairwise judgements remain the ground-truth endpoint for future validation.

Editor's pick
Arxiv· 2 days ago

Math Takes Two: A test for emergent mathematical reasoning in communication

arXiv:2604.21935v1 Announce Type: new Abstract: Although language models demonstrate remarkable proficiency on mathematical benchmarks, it remains unclear whether this reflects true mathematical reasoning or statistical pattern matching over learning formal syntax. Most existing evaluations rely on symbolic problems grounded in established mathematical conventions, limiting insight into the models' ability to construct abstract concepts from first principles. In this work, we propose Math Takes Two, a new benchmark designed to assess the emergence of mathematical reasoning through communication. Motivated by the hypothesis that mathematical cognition in humans co-evolved with the need for precise communication, our benchmark tests whether two agents, without prior mathematical knowledge, can develop a shared symbolic protocol to solve a visually grounded task where the use of a numerical system facilitates extrapolation. Unlike many current datasets, our benchmark eschews predefined mathematical language, instead requiring agents to discover latent structure and representations from scratch. Math Takes Two thus provides a novel lens through which to develop and evaluate models with emergent numerical reasoning capabilities.

Editor's pickTechnology
Theregister· 2 days ago

Anthropic's magic code-sniffer: More Swiss cheese than cheddar, for now

AI vuln-hunter finds what humans taught it to find. Funny, that. Opinion: In retrospect, calling it Mythos made it a hostage to fortune. Anthropic may have hoped that the name implied its AI code security model had mythical, god-like powers, but there's an alternate reading. Another definition of Mythos is a set of beliefs of obscure origin which are incompatible with reality.…

Editor's pickTechnology
Substack· 3 days ago

We can inspect LLMs. So why do AI systems still feel unpredictable?

Meaningful progress, but the more I think about it, the more it highlights a gap. Most real-world AI applications are no longer a single model call, as we know. They are systems. Agents call tools, pass context between steps, make decisions based on intermediate outputs, and sometimes loop until a goal is reached.

Editor's pickTechnology
Ethan Mollick· 3 days ago

Visualizing the Trajectory of AI Capability Scaling and Future Development Curves

The post provides a conceptual framework for understanding the current pace of AI capability advancement. It emphasizes the non-linear nature of progress and the implications for future technical development.

AI Security & Cybersecurity4 articles

Adoption, Deployment & Impact

16 articles
AI Adoption Barriers & Enablers4 articles
Editor's pickEducation
Arxiv· 2 days ago

A Systematic AI Adoption Framework for Higher Education: From Student GenAI Usage to Institutional Integration

arXiv:2604.22030v1 Announce Type: new Abstract: The rapid development of GenAI technologies is transforming learning, assessment, and academic production in higher education. Despite increasing student adoption, many institutions lack operational mechanisms to systematically align regulations and curricula with evolving generative artificial intelligence practices, creating regulatory ambiguity and academic integrity risks. This study investigates how students utilize generative artificial intelligence tools in computer science-oriented disciplines and develops a structured, lightweight framework supporting institutional adaptation to pervasive GenAI usage. We conducted a case study at the University of Applied Sciences and Arts Hannover (Germany), combining document analysis with an online survey (N = 151) targeting Business Information Systems and E-Government students. Quantitative responses were analyzed statistically, while open-ended responses underwent thematic synthesis. Generative artificial intelligence adoption was widespread, with ChatGPT as the dominant tool. Students primarily used generative artificial intelligence for research assistance, programming support, and text processing. However, substantial policy uncertainty was observed: many students were unaware of or unsure about institutional generative artificial intelligence regulations. Document analysis revealed regulatory gaps, ambiguous terminology, and inconsistencies between formal rules and teaching practices. To address these shortcomings, we propose the AI Adoption Framework for Higher Education, an iterative and operational model integrating document analysis, empirical observation, synthesis of findings, and targeted updates of regulations and curricula. The framework addresses governance, assessment validity, and academic integrity under generative artificial intelligence conditions and provides practical guidance for institutional adaptation.

Editor's pick
Livemint· 2 days ago

How many AI models does a user need? The answer is beginning to emerge | Mint

The pursuit of the perfect mix of AI tools levies an ‘attention tax’, as juggling models, subscriptions and workflows can cramp productivity. Of course, it depends on the kind of task at hand. For individual users doing regular stuff, the optimal number may be quite low.

AI Applications5 articles
Editor's pickHealthcare
Arxiv· 2 days ago

CognitiveTwin: Robust Multi-Modal Digital Twins for Predicting Cognitive Decline in Alzheimer's Disease

arXiv:2604.22428v1 Announce Type: new Abstract: Predicting individual cognitive decline in Alzheimer's disease (AD) is difficult due to the heterogeneity of disease progression. Reliable clinical tools require not only high accuracy but also fairness across demographics and robustness to missing data. We present CognitiveTwin, a digital twin framework that predicts patient-specific cognitive trajectories. The model integrates multi-modal longitudinal data (cognitive scores, magnetic resonance imaging, positron emission tomography, cerebrospinal fluid biomarkers, and genetics). We use a Transformer-based architecture to fuse these modalities and a Deep Markov Model to capture temporal dynamics. We trained and evaluated the framework using data from 1,666 patients in the TADPOLE (Alzheimer's Disease Neuroimaging Initiative) dataset. We assessed the model for prediction error, demographic fairness, and robustness to missing-not-at-random (MNAR) data patterns. CognitiveTwin provides accurate and personalized predictions of cognitive decline. Its demonstrated fairness across patient demographics and resilience to clinical dropout make it a reliable tool for clinical trial enrichment and personalized care planning.

Editor's pickTechnology
Arxiv· 2 days ago

Trust as a Situated User State in Social LLM-Based Chatbots: A Longitudinal Study of Snapchat's My AI

arXiv:2604.22417v1 Announce Type: new Abstract: Social chatbots based on large language models are increasingly embedded in everyday platforms, yet how users develop trust in these systems over time remains unclear. We present a four-week longitudinal qualitative survey study (N = 27) of trust formation in Snapchat's My AI, a socially embedded conversational agent. Our findings show that trust is shaped by perceived ability, conversational behavior, human-likeness, transparency, privacy concerns, and trust in the host platform. Trust does not remain stable, but evolves through interaction as users adapt their expectations, refine their prompting strategies, and actively regulate how and when they rely on the system. These processes reflect a continuous negotiation of trust, not a one-time evaluation. While conversational fluency supports engagement, excessive anthropomorphism and limited transparency can undermine trust over time. We synthesize these findings into a conceptual model that frames trust as a dynamic user state shaped by interaction context and expectations, with implications for the design of human-centered and adaptive conversational agents.

Editor's pickManufacturing & Industrials
Bebeez· 2 days ago

BMW and PepsiCo robotics partner Sereact raises €93 million Series B to scale across the US

Stuttgart-based Sereact, an innovator in physical AI for warehouses and manufacturing, has raised a €93 million ($110 million) Series B round to scale its ‘Cortex 2’ offering and to open its first US office, in Boston, this coming summer, with plans to hire new staff locally. The […]

Editor's pick
Arxiv· 2 days ago

Performance Anomaly Detection in Athletics: A Benchmarking System with Visual Analytics

arXiv:2604.21953v1 Announce Type: cross Abstract: Anti-doping programs rely on biological testing to detect performance-enhancing drugs, but such testing costs over $800 per sample and is limited by short detection windows for many prohibited substances. These constraints leave large portions of athletes without regular testing, motivating complementary screening approaches that analyze routine competition results to identify suspicious performance patterns. We present a system that processes 1.6 million athletics performances from over 19,000 competitions (2010-2025) using eight detection methods ranging from statistical rules to machine learning and trajectory analysis. We validate all methods against publicly confirmed anti-doping violations to measure their effectiveness in identifying sanctioned athletes. Trajectory-based methods, which compare performances to expected career progression, achieve the best balance between detecting violations and limiting false alarms, though all methods face challenges from incomplete data and rare confirmed violations. The system provides an interactive interface for expert-driven investigation, emphasizing transparency and human judgment to support, rather than replace, established anti-doping processes.

Editor's pickPAYWALLProfessional Services
FT· 2 days ago

Help! Our newest client is an AI model

A behind-the-scenes look at the work of Rutherford Hall, critical communications strategist

AI Measurement & Evaluation4 articles
Editor's pickTechnology
VentureBeat· 2 days ago

RAG precision tuning can quietly cut retrieval accuracy by 40%, putting agentic pipelines at risk

Enterprise teams that fine-tune their RAG embedding models for better precision may be unintentionally degrading the retrieval quality those pipelines depend on, according to new research from Redis. The paper, "Training for Compositional Sensitivity Reduces Dense Retrieval Generalization," tested what happens when teams train embedding models for compositional sensitivity. That is the ability to catch sentences that look nearly identical but mean something different — "the dog bit the man" versus "the man bit the dog," or a negation flip that reverses a statement's meaning entirely. That training consistently broke dense retrieval generalization, how well a model retrieves correctly across broad topics and domains it wasn't specifically trained on. Performance dropped by 8 to 9 percent on smaller models and by 40 percent on a current mid-size embedding model teams are actively using in production. The findings have direct implications for enterprise teams building agentic AI pipelines, where retrieval quality determines what context flows into an agent's reasoning chain. A retrieval error in a single-stage pipeline returns a wrong answer. The same error in an agentic pipeline can trigger a cascade of wrong actions downstream. Srijith Rajamohan, AI Research Leader at Redis and one of the paper's authors, said the finding challenges a widespread assumption about how embedding-based retrieval actually works.  "There's this general notion that when you use semantic search or similar semantic similarity, we get correct intent. That's not necessarily true," Rajamohan told VentureBeat. "A close or high semantic similarity does not actually mean an exact intent." The geometry behind the retrieval tradeoff Embedding models work by compressing an entire sentence into a single point in a high-dimensional space, then finding the closest points to a query at retrieval time. That works well for broad topical matching — documents about similar subjects end up near each other. The problem is that two sentences with nearly identical words but opposite meanings also end up near each other, because the model is working from word content rather than structure. That is what the research quantified. When teams fine-tune an embedding model to push structurally different sentences apart — teaching it that a negation flip which reverses a statement's meaning is not the same as the original — the model uses representational space it was previously using for broad topical recall. The two objectives compete for the same vector. The research also found the regression is not uniform across failure types. Negation and spatial flip errors improved measurably with structured training. Binding errors — where a model confuses which modifier applies to which word, such as which party a contract obligation falls on — barely moved. For enterprise teams, that means the precision problem is harder to fix in exactly the cases where getting it wrong has the most consequences. The reason most teams don't catch it is that fine-tuning metrics measure the task being trained for, not what happens to general retrieval across unrelated topics. A model can show strong improvement on near-miss rejection during training while quietly regressing on the broader retrieval job it was hired to do. The regression only surfaces in production. Rajamohan said the instinct most teams reach for — moving to a larger embedding model — does not address the underlying architecture. "You can't scale your way out of this," he said. 
"It's not a problem you can solve with more dimensions and more parameters." Why the standard alternatives all fall short The natural instinct when retrieval precision fails is to layer on additional approaches. The research tested several of them and found each fails in a different way. Hybrid search. Combining embedding-based retrieval with keyword search is already standard practice for closing precision gaps. But Rajamohan said keyword search cannot catch the failure mode this research identifies, because the problem is not missing words — it is misread structure. "If you have a sentence like 'Rome is closer than Paris' and another that says 'Paris is closer than Rome,' and you do an embedding retrieval followed by a text search, you're not going to be able to tell the difference," he said. "The same words exist in both sentences." MaxSim reranking. Some teams add a second scoring layer that compares individual query words against individual document words rather than relying on the single compressed vector. This approach, known as MaxSim or late interaction and used in systems like ColBERT, did improve relevance benchmark scores in the research. But it completely failed to reject structural near-misses, assigning them near-identity similarity scores.  The problem is that relevance and identity are different objectives. MaxSim is optimized for the former and blind to the latter. A team that adds MaxSim and sees benchmark improvement may be solving a different problem than the one they have. Cross-encoders. These work by feeding the query and candidate document into the model simultaneously, letting it compare every word against every word before making a decision. That full comparison is what makes them accurate — and what makes them too expensive to run at production scale. Rajamohan said his team investigated them. They work in the lab and break under real query volumes. Contextual memory. Also sometimes referred to as agentic memory, these systems are increasingly cited as the path beyond RAG, but Rajamohan said moving to that type of  architecture does not eliminate the structural retrieval problem. Those systems still depend on retrieval at query time, which means the same failure modes apply. The main difference is looser latency requirements, not a precision fix. The two-stage fix the research validated The common thread across every failed approach is the same: a single scoring mechanism trying to handle both recall and precision at once. The research validated a different architecture: stop trying to do both jobs with one vector, and assign each job to a dedicated stage. Stage one: recall. The first stage works exactly as standard dense retrieval does today — the embedding model compresses documents into vectors and retrieves the closest matches to a query. Nothing changes here. The goal is to cast a wide net and bring back a set of strong candidates quickly. Speed and breadth are what matter at this stage, not perfect precision. Stage two: precision. The second stage is where the fix lives. Rather than scoring candidates with a single similarity number, a small learned Transformer model examines the query and each candidate at the token level — comparing individual words against individual words to detect structural mismatches like negation flips or role reversals. This is the verification step the single-vector approach cannot perform. The results. 
Under end-to-end training, the Transformer verifier outperformed every other approach the research tested on structural near-miss rejection. It was the only approach that reliably caught the failure modes the single-vector system missed. The tradeoff. Adding a verification stage costs latency. The latency cost depends on how much verification a team runs. For precision-sensitive workloads like legal or accounting applications, full verification at every query is warranted. For general-purpose search, lighter verification may be sufficient.  The research grew out of a real production problem. Enterprise customers running semantic caching systems were getting fast but semantically incorrect responses back — the retrieval system was treating similar-sounding queries as identical even when their meaning differed. The two-stage architecture is Redis's proposed fix, with incorporation into its LangCache product on the roadmap but not yet available to customers. What this means for enterprise teams The research does not require enterprise teams to rebuild their retrieval pipelines from scratch. But it does ask them to pressure-test assumptions most teams have never examined — about what their embedding models are actually doing, which metrics are worth trusting and where the real precision gaps live in production. Recognize the tradeoff before tuning around it. Rajamohan said the first practical step is understanding the regression exists. He evaluates any LLM-based retrieval system on three criteria: correctness, completeness and usefulness. Correctness failures cascade directly into the other two, which means a retrieval system that scores well on relevance benchmarks but fails on structural near-misses is producing a false sense of production readiness. RAG is not obsolete — but know what it can't do. Rajamohan pushed back firmly on claims that RAG has been superseded. "That's a massive oversimplification," he said. "RAG is a very simple pipeline that can be productionized by almost anyone with very little lift." The research does not argue against RAG as an architecture. It argues against assuming a single-stage RAG pipeline with a fine-tuned embedding model is production-ready for precision-sensitive workloads. The fix is real but not free. For teams that do need higher precision, Rajamohan said the two-stage architecture is not a prohibitive implementation lift, but adding a verification stage costs latency. "It's a mitigation problem," he said. "Not something we can actually solve."
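For readers who want the shape of the proposed fix in code, here is a minimal sketch of the retrieve-then-verify pattern under stated assumptions: `embed` and `verify` are placeholder callables standing in for an embedding model and a separately trained token-level verifier, and none of this is Redis's released implementation.

```python
# Sketch of the two-stage architecture described above: stage 1 casts a wide
# dense-retrieval net, stage 2 re-scores candidates with a verifier that is
# sensitive to structural near-misses (negation flips, role reversals).
# `embed` and `verify` are placeholder callables, not a specific product API.
import numpy as np

def two_stage_retrieve(query, docs, doc_vectors, embed, verify, k=50, top_n=5):
    # Stage 1 (recall): cosine similarity against precomputed doc embeddings.
    q = embed(query)
    sims = doc_vectors @ q / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q) + 1e-12)
    candidates = np.argsort(-sims)[:k]
    # Stage 2 (precision): a token-level verifier scores each candidate and
    # down-ranks passages that match topically but mismatch structurally.
    reranked = sorted(candidates, key=lambda i: verify(query, docs[i]),
                      reverse=True)
    return [docs[i] for i in reranked[:top_n]]
```

The design point is simply that recall and precision are scored by different components, so tuning the verifier for structural sensitivity does not eat into the representational space the embedding model uses for broad topical recall.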

Editor's pickTechnology
VentureBeat· 3 days ago

Context decay, orchestration drift, and the rise of silent failures in AI systems

The most expensive AI failure I have seen in enterprise deployments did not produce an error. No alert fired. No dashboard turned red. The system was fully operational; it was just consistently, confidently wrong. That is the reliability gap. And it is the problem most enterprise AI programs are not built to catch. We have spent the last two years getting very good at evaluating models: benchmarks, accuracy scores, red-team exercises, retrieval quality tests. But in production, the model is rarely where the system breaks. It breaks in the infrastructure layer, the data pipelines feeding it, the orchestration logic wrapping it, the retrieval systems grounding it, the downstream workflows trusting its output. That layer is still being monitored with tools designed for a different kind of software.

The gap no one is measuring

Here's what makes this problem hard to see: Operationally healthy and behaviorally reliable are not the same thing, and most monitoring stacks cannot tell the difference. A system can show green across every infrastructure metric, latency within SLA, throughput normal, error rate flat, while simultaneously reasoning over retrieval results that are six months stale, silently falling back to cached context after a tool call degrades, or propagating a misinterpretation through five steps of an agentic workflow. None of that shows up in Prometheus. None of it trips a Datadog alert. The reason is straightforward: Traditional observability was built to answer the question “is the service up?” Enterprise AI requires answering a harder question: “Is the service behaving correctly?” Those are different instruments.

What teams typically measure vs. what actually drives AI infrastructure failure:
Uptime, latency, and error rate vs. retrieval freshness and grounding confidence.
Token usage vs. context integrity across multi-step workflows.
Throughput vs. semantic drift under real-world load.
Model benchmark scores vs. behavioral consistency when conditions degrade.
Infrastructure error rate vs. silent partial failure at the reasoning layer.

Closing this gap requires adding a behavioral telemetry layer alongside the infrastructure one — not replacing what exists, but extending it to capture what the model actually did with the context it received, not just whether the service responded.

Four failure patterns that standard monitoring will not catch

Across enterprise AI deployments in network operations, logistics, and observability platforms, I see four failure patterns repeat with enough consistency to name them. The first is context degradation. The model reasons over incomplete or stale data in a way that is invisible to the end user. The answer looks polished. The grounding is gone. Detection usually happens weeks later, through downstream consequences rather than system alerts. The second is orchestration drift. Agentic pipelines rarely fail because one component breaks. They fail because the sequence of interactions between retrieval, inference, tool use, and downstream action starts to diverge under real-world load. A system that looked stable in testing behaves very differently when latency compounds across steps and edge cases stack. The third is silent partial failure. One component underperforms without crossing an alert threshold. The system degrades behaviorally before it degrades operationally. These failures accumulate quietly and surface first as user mistrust, not incident tickets. By the time the signal reaches a postmortem, the erosion has been happening for weeks.
The fourth is the automation blast radius. In traditional software, a localized defect stays local. In AI-driven workflows, one misinterpretation early in the chain can propagate across steps, systems, and business decisions. The cost is not just technical. It becomes organizational, and it is very hard to reverse. Metrics tell you what happened. They rarely tell you what almost happened.

Why classic chaos engineering is not enough and what needs to change

Traditional chaos engineering asks the right kind of question: What happens when things break? Kill a node. Drop a partition. Spike CPU. Observe. Those tests are necessary, and enterprises should run them. But for AI systems, the most dangerous failures are not caused by hard infrastructure faults. They emerge at the interaction layer between data quality, context assembly, model reasoning, orchestration logic, and downstream action. You can stress the infrastructure all day and never surface the failure mode that costs you the most. What AI reliability testing needs is an intent-based layer: Define what the system must do under degraded conditions, not just what it should do when everything works. Then test the specific conditions that challenge that intent. What happens if the retrieval layer returns content that is technically valid but six months outdated? What happens if a summarization agent loses 30% of its context window to unexpected token inflation upstream? What happens if a tool call succeeds syntactically but returns semantically incomplete data? What happens if an agent retries through a degraded workflow and compounds its own error with each step? These scenarios are not edge cases. They are what production looks like. This is the framework I have applied in building reliability systems for enterprise infrastructure: Intent-based chaos level creation for distributed computing environments. The key insight: Intent defines the test, not just the fault.

What the infrastructure layer actually needs

None of this requires reinventing the stack. It requires extending four things. Add behavioral telemetry alongside infrastructure telemetry. Track whether responses were grounded, whether fallback behavior was triggered, whether confidence dropped below a meaningful threshold, whether the output was appropriate for the downstream context it entered. This is the observability layer that makes everything else interpretable. Introduce semantic fault injection into pre-production environments. Deliberately simulate stale retrieval, incomplete context assembly, tool-call degradation, and token-boundary pressure. The goal is not theatrical chaos. The goal is finding out how the system behaves when conditions are slightly worse than your staging environment — which is always what production is. Define safe halt conditions before deployment, not after the first incident. AI systems need the equivalent of circuit breakers at the reasoning layer. If a system cannot maintain grounding, validate context integrity, or complete a workflow with enough confidence to be trusted, it should stop cleanly, label the failure, and hand control to a human or a deterministic fallback. A graceful halt is almost always safer than a fluent error. Too many systems are designed to keep going because confident output creates the illusion of correctness. Assign shared ownership for end-to-end reliability. The most common organizational failure is a clean separation between model teams, platform teams, data teams, and application teams.
When the system is operationally up but behaviorally wrong, no one owns it clearly. Semantic failure needs an owner. Without one, it accumulates.

The maturity curve is shifting

For the last two years, the enterprise AI differentiator has been adoption — who gets to production fastest. That phase is ending. As models commoditize and baseline capability converges, competitive advantage will come from something harder to copy: The ability to operate AI reliably at scale, in real conditions, with real consequences. Yesterday’s differentiator was model adoption. Today’s is system integration. Tomorrow’s will be reliability under production stress. The enterprises that get there first will not have the most advanced models. They will have the most disciplined infrastructure around them — infrastructure that was tested against the conditions it would actually face, not the conditions that made the pilot look good. The model is not the whole risk. The untested system around it is.

Sayali Patil is an AI infrastructure and product leader.
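To make the "safe halt" idea above concrete, here is a minimal sketch of a circuit breaker at the reasoning layer, under stated assumptions: the telemetry fields, thresholds, and fallback hook are illustrative, not a particular vendor's API.

```python
# Sketch: halt cleanly when behavioral telemetry says a step cannot be
# trusted, instead of letting a fluent but ungrounded answer flow downstream.
# Field names and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class StepTelemetry:
    grounded: bool             # retrieval actually supported the output
    retrieval_age_days: float  # freshness of the context that was used
    confidence: float          # model- or verifier-reported confidence

def guarded_step(run_step, fallback, max_age_days=30, min_confidence=0.6):
    result, telem = run_step()
    healthy = (telem.grounded
               and telem.retrieval_age_days <= max_age_days
               and telem.confidence >= min_confidence)
    if healthy:
        return result
    # Graceful halt: label the failure and hand control to a human or a
    # deterministic fallback rather than continuing on degraded context.
    return fallback({"grounded": telem.grounded,
                     "age_days": telem.retrieval_age_days,
                     "confidence": telem.confidence})
```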

Editor's pickTechnology
VentureBeat· 3 days ago

Monitoring LLM behavior: Drift, retries, and refusal patterns

The stochastic challenge Traditional software is predictable: Input A plus function B always equals output C. This determinism allows engineers to develop robust tests. On the other hand, generative AI is stochastic and unpredictable. The exact same prompt often yields different results on Monday versus Tuesday, breaking the traditional unit testing that engineers know and love. To ship enterprise-ready AI, engineers cannot rely on mere “vibe checks” that pass today but fail when customers use the product. Product builders need to adopt a new infrastructure layer: The AI Evaluation Stack. This framework is informed by my extensive experience shipping AI products for Fortune 500 enterprise customers in high-stakes industries, where “hallucination” is not funny — it’s a huge compliance risk. Defining the AI evaluation paradigm Traditional software tests are binary assertions (pass/fail). While some AI evals use binary asserts, many evaluate on a gradient. An eval is not a single script; it is a structured pipeline of assertions — ranging from strict code syntax to nuanced semantic checks — that verify the AI system’s intended function. The taxonomy of evaluation checks To build a robust, cost-effective pipeline, asserts must be separated into two distinct architectural layers: Layer 1: Deterministic assertions A surprisingly large share of production AI failures aren't semantic "hallucinations" — they are basic syntax and routing failures. Deterministic assertions serve as the pipeline's first gate, using traditional code and regex to validate structural integrity. Instead of asking if a response is "helpful," these assertions ask strict, binary questions: Did the model generate the correct JSON key/value schema? Did it invoke the correct tool call with the required arguments? Did it successfully slot-fill a valid GUID or email address? // Example: Layer 1 Deterministic Tool Call Assertion {   "test_scenario": "User asks to look up an account",   "assertion_type": "schema_validation",   "expected_action": "Call API: get_customer_record",   "actual_ai_output": "I found the customer.",   "eval_result": "FAIL - AI hallucinated conversational text instead of generating the required API payload." } In the example above, the test failed instantly because the model generated conversational text instead of the required tool call payload. Architecturally, deterministic assertions must be the first layer of the stack, operating on a computationally inexpensive "fail-fast" principle. If a downstream API requires a specific schema, a malformed JSON string is a fatal error. By failing the evaluation immediately at this layer, engineering teams prevent the pipeline from triggering expensive semantic checks (Layer 2) or wasting valuable human review time (Layer 3). Layer 2: Model-based assertions When deterministic assertions pass, the pipeline must evaluate semantic quality. Because natural language is fluid, traditional code cannot easily assert if a response is "helpful" or "empathetic." This introduces model-based evaluation, commonly referred to as "LLM-as-a-Judge” or “LLM-Judge." While using one non-deterministic system to evaluate another seems counterintuitive, it is an exceptionally powerful architectural pattern for use cases requiring nuance. It is virtually impossible to write a reliable regex to verify if a response is "actionable" or "polite." While human reviewers excel at this nuance, they cannot scale to evaluate tens of thousands of CI/CD test cases. 
Thus, the LLM-as-a-Judge becomes the scalable proxy for human discernment. 3 critical inputs for model-based assertions However, model-based assertions only yield reliable data when the LLM-as-a-Judge is provisioned with three critical inputs: A state-of-the-art reasoning model: The Judge must possess superior reasoning capabilities compared to the production model. If your app runs on a smaller, faster model for latency, the judge must be a frontier reasoning model to approximate human-level discernment. A strict assessment rubric: Vague evaluation prompts ("Rate how good this answer is") yield noisy, stochastic evaluations. A robust rubric explicitly defines the gradients of failure and success. (For example, a "Helpfulness" rubric should define Score 1 as an irrelevant refusal, Score 2 as addressing the prompt but lacking actionable steps, and Score 3 as providing actionable next steps strictly within context.) Ground truth (golden outputs): While the rubric provides the rules, a human-vetted "expected answer" acts as the answer key. When the LLM-Judge can compare the production model's output against a verified Golden Output, its scoring reliability increases dramatically. Architecture: The offline vs online pipeline A robust evaluation architecture requires two complementary pipelines. The online pipeline monitors post-deployment telemetry, while the offline pipeline provides the foundational baseline and deterministic constraints required to evaluate stochastic models safely. The offline evaluation pipeline The offline pipeline's primary objective is regression testing — identifying failures, drift, and latency before production. Deploying an enterprise LLM feature without a gating offline evaluation suite is an architectural anti-pattern; it is the equivalent of merging uncompiled code into a main branch. Process 1. Curating the golden dataset The offline lifecycle begins by curating a "golden dataset" — a static, version-controlled repository of 200 to 500 test cases representing the AI's full operational envelope. Each case pairs an exact input payload with an expected "golden output" (ground truth). Crucially, this dataset must reflect expected real-world traffic distributions. While most cases cover standard "happy-path" interactions, engineers must systematically incorporate edge cases, jailbreaks, and adversarial inputs. Evaluating "refusal capabilities" under stress remains a strict compliance requirement. Example test case payload (standard tool use): Input: "Schedule a 30-minute follow-up meeting with the client for next Tuesday at 10 a.m." Expected output (golden): The system successfully invokes the schedule_meeting tool with the correct JSON payload: {"duration_minutes": 30, "day": "Tuesday", "time": "10 AM", "attendee": "client_email"}. While manually curating hundreds of edge cases is tedious, the process can be accelerated with synthetic data generation pipelines that use a specialized LLM to produce diverse TSV/CSV test payloads. However, relying entirely on AI-generated test cases introduces the risk of data contamination and bias. A human-in-the-loop (HITL) architecture is mandatory at this stage; domain experts must manually review, edit, and validate the synthetic dataset to ensure it accurately reflects real-world user intent and enterprise policy before it is committed to the repository. 2. Defining the evaluation criteria Once the dataset is curated, engineers must design the evaluation criteria to compute a composite score for each model output. 
A robust architecture achieves this by assigning weighted points across a hybrid of Layer 1 (deterministic) and Layer 2 (model-based) asserts. Consider an AI agent executing a "send email" tool. An evaluation framework might utilize a 10-point scoring system: Layer 1: Deterministic asserts (6 points): Did the agent invoke the correct tool? (2 pts). Did it produce a valid JSON object? (2 pts). Does the JSON strictly adhere to the expected schema? (2 pts). Layer 2: Model-based asserts (4 points): (Note: Semantic rubrics must be highly use-case specific). Does the subject line reflect user intent? (1 pt). Does the email body match expected outputs without hallucination? (1 pt). Were CC/BCC fields leveraged accurately? (1 pt). Was the appropriate priority flag inferred? (1 pt). To understand why the LLM-Judge awarded these points, the engineer must prompt the judge to supply its reasoning for each score. This is crucial for debugging failures. The passing threshold and short-circuit logic  In this example, an 8/10 passing threshold requires 8 points for success. Crucially, the evaluation pipeline must enforce strict short-circuit evaluation (fail-fast logic). If the model fails any deterministic assertion — such as generating a malformed JSON schema — the system must instantly fail the entire test case (0/10). There is zero architectural value in invoking an expensive LLM-Judge to assess the semantic "politeness" of an email if the underlying API call is structurally broken. 3. Executing the pipeline and aggregating signals Using an evaluation infrastructure of choice, the system executes the offline pipeline — typically integrated as a blocking CI/CD step during a pull request. The infrastructure iterates through the golden dataset, injecting each test payload into the production model, capturing the output, and executing defined assertions against it. Each output is scored against the passing threshold. Once batch execution is complete, results are aggregated into an overall pass rate. For enterprise-grade applications, the baseline pass rate must typically exceed 95%, scaling to 99%-plus for strict compliance or high-risk domains. 4. Assessment, iteration, and alignment Based on aggregated failure data, engineering teams conduct a root-cause analysis of failing test cases. This assessment drives iterative updates to core components: refining system prompts, modifying tool descriptions, augmenting knowledge sources, or adjusting hyperparameters (like temperature or top-p). Continuous optimization remains best practice even after achieving a 95% pass rate. Crucially, any system modification necessitates a full regression test. Because LLMs are inherently non-deterministic, an update intended to fix one specific edge case can easily cause unforeseen degradations in other areas. The entire offline pipeline must be rerun to validate that the update improved quality without introducing regressions. The online evaluation pipeline While the offline pipeline acts as a strict pre-deployment gatekeeper, the online pipeline is the post-deployment telemetry system. Its objective is to monitor real-world behavior, capturing emergent edge cases, and quantifying model drift. Architects must instrument applications to capture five distinct categories of telemetry: 1. Explicit user signals Direct, deterministic feedback indicating model performance: Thumbs up/down: Disproportionate negative feedback is the most immediate leading indicator of system degradation, directing immediate engineering investigation. 
Verbatim in-app feedback: Systematically parsing written comments identifies novel failure modes to integrate back into the offline "golden dataset." 2. Implicit behavioral signals Behavioral telemetry reveals silent failures where users give up without explicit feedback: Regeneration and retry rates: High frequencies of retries indicate the initial output failed to resolve user intent. Apology rate: Programmatically scanning for heuristic triggers ("I’m sorry") detects degraded capabilities or broken tool routing. Refusal rate: Artificially high refusal rates ("I can’t do that") indicate over-calibrated safety filters rejecting benign user queries. 3. Production deterministic asserts (synchronous) Because deterministic code checks execute in milliseconds, teams can seamlessly reuse Layer 1 offline asserts (schema conformity, tool validity) to synchronously evaluate 100% of production traffic. Logging these pass/fail rates instantly detects anomalous spikes in malformed outputs — the earliest warning sign of silent model drift or provider-side API changes. 4. Production LLM-as-a-Judge (asynchronous) If strict data privacy agreements (DPAs) permit logging user inputs, teams can deploy model-based asserts. Architecturally, production LLM-Judges must never execute synchronously on the critical path, which doubles latency and compute costs. Instead, a background LLM-Judge asynchronously samples a fraction (5%) of daily sessions, grading outputs against the offline rubric to generate a continuous quality dashboard. Engineering the feedback loop (the “flywheel”) Evaluation pipelines are not "set-it-and-forget-it" infrastructure. Without continuous updates, static datasets suffer from "rot" (concept drift) as user behavior evolves and customers discover novel use cases. For example, an HR chatbot might boast a pristine 99% offline pass rate for standard payroll questions. However, if the company suddenly announces a new equity plan, users will immediately begin prompting the AI about vesting schedules — a domain entirely missing from the offline evaluations. To make the system smarter over time, engineers must architect a closed feedback loop that mines production telemetry for continuous improvement. The continuous improvement workflow: Capture: A user triggers an explicit negative signal (a "thumbs down") or an implicit behavioral flag in production. Triage: The specific session log is automatically flagged and routed for human review. Root-cause analysis: A domain expert investigates the failure, identifies the gap, and updates the AI system to successfully handle similar requests. Dataset augmentation: The novel user input, paired with the newly corrected expected output, is appended to the offline Golden Dataset alongside several synthetic variations. Regression testing: The model is continuously re-evaluated against this newly discovered edge case in all future runs. Building an evaluation pipeline without monitoring production logs and updating datasets is fundamentally insufficient. Users are unpredictable. Evaluating on stale data creates a dangerous illusion: High offline pass rates masking a rapidly degrading real-world experience. Conclusion: The new “definition of done” In the era of generative AI, a feature or product is no longer "done" simply because the code compiles and the prompt returns a coherent response. 
It is only done when a rigorous, automated evaluation pipeline is deployed and stable — and when the model consistently passes against both a curated golden dataset and newly discovered production edge cases. This guide has equipped you with a comprehensive blueprint for building that reality. From architecting offline regression pipelines and online telemetry to the continuous feedback flywheel and navigating enterprise anti-patterns, you now have the structural foundation required to deploy AI systems with greater confidence. Now, it is your turn. Share this framework with your engineering, product, and legal teams to establish a unified, cross-functional standard for AI quality in your organization. Stop guessing whether your models are degrading in production, and start measuring. Derah Onuorah is a Microsoft senior product manager.
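To make the layered scoring concrete, here is a minimal sketch of the fail-fast composite score described above (6 deterministic points, 4 judge points, 8/10 to pass), assuming a hypothetical send_email tool case; the schema fields and the `judge_score` callable are placeholders rather than any specific framework's API.

```python
# Sketch of the hybrid eval scoring described above: Layer 1 deterministic
# assertions gate the case (fail-fast to 0/10); Layer 2 LLM-judge points are
# only added if Layer 1 passes. Field names and the judge are assumptions.
import json

def score_case(output_text, judge_score, pass_threshold=8):
    # Layer 1: deterministic assertions (worth 6 points, all-or-nothing here).
    try:
        payload = json.loads(output_text)
    except json.JSONDecodeError:
        return 0, "FAIL: malformed JSON (short-circuit, judge never invoked)"
    layer1 = [
        payload.get("tool") == "send_email",               # correct tool call
        isinstance(payload.get("arguments"), dict),        # valid argument object
        {"to", "subject", "body"} <= set(payload.get("arguments", {})),  # schema
    ]
    if not all(layer1):
        return 0, "FAIL: deterministic assertion failed (short-circuit)"
    points = 6
    # Layer 2: model-based rubric, 0-4 points, reached only if Layer 1 passed.
    points += judge_score(payload)
    return points, ("PASS" if points >= pass_threshold else "FAIL")
```

In a CI setup, a function like this would run over every golden-dataset case, with the aggregate pass rate gating the merge as described above.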

Editor's pick
Arxiv· 2 days ago

Testing for Spillovers in Resource Conservation: Evidence from a Natural Field Experiment

arXiv:2508.04371v2 Announce Type: replace Abstract: This paper studies whether behavioral interventions designed to promote resource conservation in one domain generate spillovers in another. Using a natural field experiment involving over 2,000 residents, we identify the direct and spillover effects of real-time feedback and social comparisons on water and energy consumption. We implement three interventions: two targeting shower use and one targeting air-conditioning use. We find significant reductions in shower use from both water-saving interventions, but no direct effect of the energy-saving intervention on air-conditioning use. For spillovers, we estimate precise null effects of water-saving interventions on air-conditioning use, and of the energy-saving intervention on shower use.

Geopolitics, Policy & Governance

20 articles
AI Geopolitics10 articles
Editor's pickPAYWALLTechnology
Bloomberg· 2 days ago

China Blocks Meta’s $2 Billion Acquisition of AI Firm Manus

China has decided to block Meta Platforms Inc.’s $2 billion acquisition of agentic AI startup Manus, a surprise move to unwind a controversial deal that’s drawn fire for the leakage of technology to the US.

Editor's pickPAYWALLTechnology
FT· Yesterday

China blocks Meta’s $2bn purchase of AI group Manus

Regulators had reviewed whether deal violated Beijing’s investment rules

Editor's pickTechnology
Reuters· 2 days ago

Exclusive: US State Dept orders global warning about alleged AI thefts by DeepSeek, other Chinese firms | Reuters

DeepSeek, whose low-cost AI model stunned the world last year, on Friday launched a preview of a highly anticipated new model, called the V4, adapted for Huawei chip technology, underlining China’s growing autonomy in the sector.

Editor's pickPAYWALLTechnology
Washington Post· Yesterday

China says it ordered reversal of Meta’s Manus AI acquisition - The Washington Post

SINGAPORE — Chinese authorities say they have banned Meta’s acquisition of Manus AI, an artificial intelligence company founded in China — taking Beijing’s most aggressive step yet to stanch the loss of AI talent and resources to the ...

Editor's pickPAYWALLTechnology
NYT· 2 days ago

China Will Require Meta to Unwind Acquisition of AI Start-Up Manus

The impact of the ruling was not immediately clear, but it could send a chilling signal to Chinese tech founders seeking to team up with foreign companies.

Editor's pickDefense & National Security
Fortune· 3 days ago

The ‘obscene economics’ of modern warfare show how the race to military supremacy is transforming, while U.S. rearmament relies on China

"This imbalance has haunted Western military planners since the early days of Russia's invasion of Ukraine."

Editor's pickTechnology
ABC News· 2 days ago

China blocks Meta from acquiring AI startup Manus - ABC News

China has blocked Meta’s planned acquisition of the AI startup Manus following a probe into the purchase of the firm

Editor's pickTechnology
Emirates 24|7· 2 days ago

DeepSeek's new AI model fails to spark market rally as China's AI race intensifies - Emirates 24|7

He highlighted DeepSeek’s efforts ... U.S. export controls aim to restrict China’s access to advanced American semiconductors critical to AI development. “The ‘wow factor’ was last year — that’s already priced in,” he said. “What matters now is whether China can continue advancing AI development, potentially using its own chips. The geopolitical implications ...

Editor's pickTechnology
Cloudnews· 3 days ago

The US intensifies its crackdown on DeepSeek over model distillation | Cloud News

The United States has decided to turn AI model distillation into a political, industrial, and strategic front. Washington’s messaging is no longer limited to

Editor's pickTechnology
Azeem Azhar· 2 days ago

Observing the Chinese AI Ecosystem: Insights from Beijing Lab Visits

A series of site visits to Beijing-based AI labs provides a window into the current state of China's AI development. These observations are critical for understanding the competitive landscape and the impact of international trade restrictions on regional innovation.

AI Policy & Regulation6 articles
Editor's pickManufacturing & Industrials
Arxiv· 2 days ago

The Biggest Risk of Embodied AI is Governance Lag

arXiv:2604.21938v1 Announce Type: new Abstract: Embodied AI is widely discussed as a job-displacement problem. The deeper risk, however, is governance lag: the inability of public institutions to keep pace with how fast the technology spreads through the physical economy. As reusable robotic platforms are combined with increasingly general AI models, embodied AI may scale across manufacturing, logistics, care, and infrastructure faster than governance systems can observe, interpret, and respond. We argue that this lag appears in three connected forms: observational, institutional, and distributive. The central policy challenge, therefore, is not automation alone, but whether governance and compliance systems can adapt before disruption becomes entrenched.

Editor's pickProfessional Services
Arxiv· 2 days ago

How Supply Chain Dependencies Complicate Bias Measurement and Accountability Attribution in AI Hiring Applications

arXiv:2604.22679v1 Announce Type: new Abstract: The increasing adoption of AI systems in hiring has raised concerns about algorithmic bias and accountability, prompting regulatory responses including the EU AI Act, NYC Local Law 144, and Colorado's AI Act. While existing research examines bias through technical or regulatory lenses, both perspectives overlook a fundamental challenge: modern AI hiring systems operate within complex supply chains where responsibility fragments across data vendors, model developers, platform providers, and deploying organizations. This paper investigates how these dependency chains complicate bias evaluation and accountability attribution. Drawing on literature review and regulatory analysis, we demonstrate that fragmented responsibilities create two critical problems. First, bias emerges from component interactions rather than isolated elements, yet proprietary configurations prevent integrated evaluation. A resume parser may function without bias independently but contribute to discrimination when integrated with specific ranking algorithms and filtering thresholds. Second, information asymmetries mean deploying organizations bear legal responsibility without technical visibility into vendor-supplied algorithms, while vendors control implementations without meaningful disclosure requirements. Each stakeholder may believe they are compliant; nevertheless, the integrated system may produce biased outcomes. Analysis of implementation ambiguities reveals these challenges in practice. We propose multi-layered interventions including system-level audits, vendor guidelines, continuous monitoring mechanisms, and documentation across dependency chains. Our findings reveal that effective governance requires coordinated action across technical, organizational, and regulatory domains to establish meaningful accountability in distributed development environments.

Editor's pickFinancial Services
Reuters· 2 days ago

Association of Banks in Singapore monitoring potential threats from frontier AI models | Reuters

Anthropic earlier this month debuted Mythos, its most advanced AI model to date, designed for defensive cybersecurity tasks, though it has limited the release due to concerns about its potential for misuse.

Editor's pickEducation
Arxiv· 2 days ago

Rethinking Publication: A Certification Framework for AI-Enabled Research

arXiv:2604.22026v1 Announce Type: new Abstract: AI research pipelines now produce a growing share of publishable academic output, including work that meets existing peer-review standards for quality and novelty. Yet the publication system was built on the assumption of universal human authorship and lacks a principled way to evaluate knowledge produced through automated pipelines. This paper proposes a two-layer certification framework that separates knowledge quality assessment from grading of human contribution, allowing publication systems to handle pipeline-generated work consistently and transparently without creating new institutions. The paper uses normative-conceptual analysis, framework design under four explicit constraints, and dry-run validation on two representative submission cases spanning key attribution scenarios. The framework grades contributions as Category A (pipeline-reachable), Category B (requiring human direction at identifiable stages), and Category C (beyond current pipeline reach at the formulation stage). It also introduces benchmark slots for fully disclosed automated research as both a transparent publication track and a calibration instrument for reviewer judgment. Contribution grading is contemporaneous, based on pipeline capability at the time of submission. Dry-run validation shows that the framework can certify knowledge appropriately while tolerating irreducible attribution uncertainty. The paper argues that publication has always certified both that knowledge is valid and that a human made it. AI pipelines separate these functions for the first time. The framework is implementable within existing editorial infrastructure and grounds recognition of frontier human contribution in epistemic achievement rather than unverifiable claims of human origin.

Editor's pickFinancial Services
Business Standard· 2 days ago

Why Indian govt is warning banks against Anthropic's Claude Mythos AI | Tech News - Business Standard

“The same capability that detects ... of the threat landscape”. Anthropic has decided against a public release of Claude Mythos Preview, citing serious cybersecurity and national security risks. The company believes that in the wrong hands, the model could enable sophisticated cyberattacks by automating the discovery and exploitation of vulnerabilities at scale. By limiting access to vetted partners, Anthropic is signalling a shift towards controlled deployment of high-risk AI systems rather ...

Editor's pickTechnology
TechNode· 2 days ago

China bars foreign investment in Manus AI project as scrutiny on AI exports grows · TechNode

China’s National Development and Reform Commission (NDRC) today announced that, in accordance with laws and regulations, it has issued a decision

Best Practice AI © 2026 Best Practice AI Ltd. All rights reserved.

Get the full executive brief

Receive curated insights with practical implications for strategy, operations, and governance.

AI Daily Brief — leaders actually read it.

Free email — not hiring or booking. Optional BPAI updates for company news. Unsubscribe anytime.


No spam. Unsubscribe anytime. Privacy policy.