AI Intelligence Brief

Thu 7 May 2026

Daily Brief — Curated and contextualised by Best Practice AI

147 articles

MIT Finds Wage Control, Trump Avoids Picking AI Winners, and Yale Warns of Worker Abandonment

TL;DR MIT economists reveal that US companies use automation to control wages, exacerbating inequality without boosting productivity. The Trump administration signals it will not pick winners in the AI race as new policy directives are prepared. A Yale Budget Lab report suggests an AI productivity surge could stabilize America's $39 trillion national debt, but warns against abandoning displaced workers. Anthropic's CEO predicts 80-fold growth, driving a need for more computing power, with SpaceX renting data center capacity to support it.

Editor's highlights

The stories that matter most

Selected and contextualised by the Best Practice AI team

10 of 147 articles
Editor's pick
Arxiv· 2 days ago

Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation

arXiv:2605.04135v1 Announce Type: new Abstract: Readers of applied-domain LLM capability evaluations want to know what AI systems can currently do. That literature answers a related, but consequentially different, question: what older, cheaper, less-elicited models could do months or years earlier (a 2026 paper evaluating GPT-4o-mini zero-shot, say, against a frontier of reasoning-capable, tool-using models…

Editor's pick · PAYWALL · Technology
FT· 3 days ago

SpaceX to rent data centre capacity to Anthropic

AI start-up is racing to add computing power to keep up with its growth

Editor's pick
Artificial Intelligence Newsletter | May 7, 2026· 3 days ago

Singapore regulator says companies liable for AI antitrust harms

The Competition and Consumer Commission of Singapore stated that companies are responsible for foreseeable antitrust harms caused by AI algorithms, whether developed internally or via third-party systems.

Editor's pick · Transportation & Logistics
Daily Brew· 3 days ago

Uber Shares What Happens When 1,500 AI Agents Hit Production

Uber details the operational challenges and outcomes of deploying 1,500 AI agents into their production environment.

Editor's pick · Technology
Arxiv· 2 days ago

Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?

arXiv:2605.03195v1 Announce Type: new Abstract: Modern coding agents increasingly delegate specialized subtasks to subagents, which are smaller, focused agentic loops that handle narrow responsibilities like search, debugging or terminal execution. This architectural pattern keeps the main agent's context window clean by isolating verbose outputs (e.g. build logs, test results, etc.) within the subagents…

Editor's pick · Financial Services
Arxiv· 2 days ago

A Regulatory Governance Framework for AI-Driven Financial Fraud Detection in U.S. Banking: Integrating OCC, SR 11-7, CFPB, and FinCEN Compliance Requirements for Model Development, Validation, and Monitoring Lifecycles

arXiv:2605.04076v1 Announce Type: cross Abstract: U.S. financial institutions deploying AI-based fraud detection face a fragmented compliance landscape spanning four regulatory frameworks -- OCC Bulletin 2011-12, SR 11-7, the CFPB AI circular, and FinCEN BSA/SAR requirements -- with no integrated governance life cycle connecting these requirements to model development, validation, and monitoring lifecycles.

Economics & Markets

39 articles
AI Investment & Valuations · 11 articles
Editor's pick · PAYWALL · Technology
Bloomberg· 2 days ago

Korea Surpasses Canada as World’s Seventh-Largest Stock Market

South Korea’s equity market has overtaken Canada’s as the world’s seventh largest, propelled by insatiable demand for chips powering artificial intelligence.

Editor's pick · Financial Services
Arxiv· 2 days ago

ESG as Priced Crash Insurance: State-Dependent Tail Risk and Deconfounding Evidence

arXiv:2605.04479v1 Announce Type: cross Abstract: This research establishes ESG as a state dependent insurance mechanism against equity crashes by addressing the decoupling of unconditional alpha from tail risk resilience. By validating market stress regimes as distinct economic states through a drawdown-based truncation rule, the study demonstrates that high ESG ratings materially reduce the incidence of discrete crash events during systemic drawdowns. To address the selection bias and high-dimensional confounding inherent in traditional linear frameworks, we implement Double Machine Learning as a structural deconfounding layer. Unlike simple predictive modeling, the Double Machine Learning framework utilizes machine learning to handle complex nuisance parameters, allowing us to isolate the asymmetric treatment effects of ESG across different market states. Distributional analysis reveals the underlying mechanism as ESG specifically attenuates the severity of realized tail losses at the most adverse quantiles instead of shifting the entire return distribution. Confirmed by structural estimates, this protection functions as priced insurance that incurs performance drags during stable periods while providing critical resilience when tail risks are most acute.
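The deconfounding step the abstract describes can be sketched in a few lines. This is an illustrative residual-on-residual ("partialling-out") version of Double Machine Learning on simulated data, not the paper's implementation: the data-generating process, the linear nuisance fits, and the true effect of 0.5 are all assumptions for illustration (real DML uses flexible ML learners with cross-fitting).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)                        # confounder (e.g. firm characteristics)
d = 0.8 * x + rng.normal(size=n)              # "treatment": ESG score, confounded by x
y = 0.5 * d + 1.2 * x + rng.normal(size=n)    # outcome: crash-risk proxy

def residualize(target, covariate):
    # Fit target ~ covariate by OLS and return residuals. This is the nuisance
    # step; in real DML it would be a flexible ML learner with cross-fitting.
    X = np.column_stack([np.ones_like(covariate), covariate])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ beta

y_res = residualize(y, x)
d_res = residualize(d, x)

# Residual-on-residual regression recovers the deconfounded treatment effect.
theta_hat = (d_res @ y_res) / (d_res @ d_res)
print(theta_hat)
```

The naive regression of y on d alone would be biased upward by the shared dependence on x; partialling x out of both sides isolates the effect of d.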

Editor's pick · PAYWALL · Technology
Bloomberg· 2 days ago

Montage Tops CATL as Priciest Dual-Listed Stock After Chip Rally

Montage Technology Co. has overtaken Contemporary Amperex Technology Co. Ltd. as the most expensive dual-listed stock in Hong Kong relative to its mainland shares, propelled by surging demand for AI chips.

Editor's pick · Financial Services
Fortune· 3 days ago

‘FOMO has proven a stronger incentive than poor stock performance’: Goldman Sachs finds insecurity is a key part of the AI boom

Goldman Sachs looked at the giant data-center question from both sides of the equation — and shrugged.

Editor's pick · Financial Services
The Guardian· 3 days ago

Global finance watchdog warns over private credit industry fuelling AI boom

Financial Stability Board report reveals tech, healthcare and services sectors as the biggest borrowers

Editor's pick · Technology
Ethan Mollick· 3 days ago

Strategic Compute Partnerships and the Shifting Competitive Landscape of Frontier Models

Recent industry deals suggest a realignment of compute resources that may impact the long-term viability of specific frontier models. These shifts highlight the critical role of compute access in maintaining competitive positioning.

Editor's pick · Manufacturing & Industrials
International Business Times· 3 days ago

Veeco Instruments Surges 19% on Massive $250M AI Laser Orders Despite Q1 Miss

The move reflects broader enthusiasm ... to AI infrastructure buildout, even if quarterly results are not perfect. ... Despite the positive momentum, Veeco faces ongoing challenges. Reduced shipments to China due to export controls impacted Q1 results, and the company continues navigating a complex geopolitical environment. Competition in the ...

Editor's pick · Financial Services
Daily Brew· 3 days ago

Reserv Secures $125M Series C to Revolutionize Claims with AI-Driven Platform

Reserv secures $125 million Series C funding from KKR and partners to enhance its AI-native claims platform, aiming to outpace traditional models.

Editor's pick
Substack· 3 days ago

Founders AM: Important AI Bubble Piece - by VBL - GoldFix

The report warns that AI expansion now depends heavily on uninterrupted credit-market liquidity and investor …

Editor's pick · Technology
Novamediagroup· 3 days ago

Tech Giants Navigate Profit Paradox as AI Investment Surge Reshapes Industry Economics

Sony's gaming revenue drops yet profits soar 19%, while AI investments hit record levels with companies like Adaption Labs securing $50 million in seed funding. This paradox reveals how the tech industry is fundamentally restructuring its economics, prioritizing operational efficiency over ...

Editor's pick · Technology
Reddit· 3 days ago

r/wallstreetbets on Reddit: Corning surges 14% on massive NVIDIA partnership to boost AI fiber capacity By Investing.com

America is just one giant bet on AI right now.

AI Macroeconomics · 5 articles
Editor's pick
Arxiv· 2 days ago

The Demand Externality of Automation

arXiv:2605.05127v1 Announce Type: new Abstract: Automation raises productivity and reduces paid human labor, but it also reallocates income and ownership claims. This paper studies that tradeoff in a static benchmark and in a stationary heterogeneous-agent general equilibrium. Firms choose automation from a profit function. Households differ by skill and wealth, save in a capital/equity claim, and face incomplete insurance. Wages and returns are determined by market clearing from a Cobb--Douglas final-good firm, while the wealth distribution is pinned down by a Hamilton--Jacobi--Bellman (HJB) equation and a Kolmogorov forward equation (KFE). The paper is deliberately two-sided. With strong productivity growth, high-skill complementarity, low obsolescence, and broad ownership, automation raises output, capital, and consumption. With strong exposure of low-wealth, high-marginal-propensity-to-consume (high-MPC) households and concentrated ownership, privately chosen automation can be excessive even though it raises high-skilled labor income. The central object is the derivative of household consumption demand and collective wage bill with respect to automation. Fiscal policy is modeled as a government problem rather than as an abstract planner: a tax changes the firm's automation first-order condition, raises revenue only on the remaining automation base, and must specify rebates and administrative losses.
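The fiscal lever the abstract describes can be written down in stripped-down form. A minimal sketch, assuming a reduced-form profit function F(a) in the automation level a with marginal cost c (notation mine, not the paper's):

```latex
% Firm chooses automation a; a per-unit tax \tau enters the first-order condition:
\pi(a) = F(a) - (c + \tau)\, a, \qquad F'(a^\ast) = c + \tau
% Revenue is raised only on the remaining automation base a^\ast(\tau):
R(\tau) = \tau \, a^\ast(\tau), \qquad \frac{d a^\ast}{d\tau} = \frac{1}{F''(a^\ast)} < 0
```

Under concavity of F, a higher tax shrinks the very base it is levied on, which is why the abstract stresses that rebates and administrative losses must be specified rather than assumed away.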

Editor's pick
Arxiv· 2 days ago

Coupled-NeuralHP: Directional Temporal Coupling Between AI Innovation Exposure and Public Response

arXiv:2605.04194v1 Announce Type: new Abstract: Artificial intelligence innovation exposure and public response co-evolve, but innovation arrives as irregular event streams while response is observed monthly. We introduce Coupled-NeuralHP, a hybrid event-plus-state model linking eight-domain USPTO AI patent publication streams to a train-only Google Trends response index. Under the cleaned response protocol, the validation-selected one-way real-data variant gives the best held-out innovation count forecasts in the registered comparison set (pseudo-log-likelihood -30.4 vs. -34.7; root mean squared error (RMSE) 471 vs. 532) while matching the stronger multi-lag factor-family baseline on response RMSE (0.295). Ablations show that the real-data response signal is carried mainly by the structured forecast head, whereas the reverse response-to-innovation block is not supported on held-out count prediction. Across 60 semi-synthetic replications with known structure, the broader coupled family recovers innovation-to-response links much better than vector autoregression with exogenous inputs (VARX) (F1 = 0.734 vs. 0.386). A placebo-controlled 2022 split-date analysis finds no robust milestone-specific regime break.
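For readers unfamiliar with the model class: a neural Hawkes process generalizes the classical self-exciting intensity, which for a single event stream with history {t_i} reads (standard textbook form, not the paper's exact parameterization):

```latex
\lambda(t) = \mu + \sum_{t_i < t} \alpha\, e^{-\beta\,(t - t_i)}, \qquad \mu, \alpha, \beta > 0
```

Coupled variants additionally let one stream's events (here, patent publications) raise the intensity or latent state of the other (the response index), which is exactly the directional innovation-to-response link the paper's ablations test.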

Editor's pick · Government & Public Sector
Axios· 3 days ago

What an AI productivity surge would mean for the fiscal outlook

New modeling from the Budget Lab at Yale shows that in the most optimistic scenario, the national debt would level off as a share of the economy.

AI Market Competition · 5 articles
AI Pricing & Cost Curves · 3 articles
AI Productivity · 6 articles
Editor's pick · Education
Arxiv· 2 days ago

LLMs learn scientific taste from institutional traces across the social sciences

arXiv:2603.16659v2 Announce Type: replace-cross Abstract: Reinforcement-learned reasoning has powered recent AI leaps on verifiable tasks, including mathematics, code, and structure prediction. The harder bottleneck is evaluative judgment in low-verifiability domains, where no oracle anchors reward and the core question is which untested ideas deserve attention. We test whether institutional traces, the record of what fields published, where, and at which tier, can serve as a training signal for AI evaluators. Across eight social science disciplines (psychology, economics, communication, sociology, political science, management, business and finance, public administration), we built held-out four-tier research-pitch benchmarks and supervised-fine-tuned (SFT) LLMs on field-specific publication outcomes. The fine-tuned models cleared the 25 percent chance baseline and exceeded frontier-model performance by wide margins, with best single-model accuracy ranging from 55.0 percent in public administration to 85.5 percent in psychology. In management, evaluated against 48 expert gatekeepers, 174 junior researchers, and 11 frontier reasoning models, the best single fine-tuned model (Qwen3-4B) reached 59.2 percent, 17.6 percentage points above expert majority vote (41.6 percent, non-tied) and 28.1 percentage points above the frontier mean (31.1 percent). The fine-tuned models also showed calibrated confidence: confidence rose when predictions were correct and fell when wrong, mirroring how a skilled reviewer can say "I'm sure" versus "I'm guessing." Selective triage on this signal reached very high accuracy on the highest-confidence subsets in every field. Institutional traces, we conclude, encode a scalable training signal for the low-verifiability judgment on which science depends.

Editor's pick · Professional Services
VentureBeat· 3 days ago

Market research is too slow for the AI era, so Brox built 60,000 identical 'digital twins' of real people you can survey instantly, repeatedly

In a world where a viral TikTok video can cause a brand to trend globally in mere hours, the traditional market research cycle — often spanning 12 weeks — is becoming a liability. The lag between a survey question and the answers from a wide (or targeted) pool of respondents has become a primary bottleneck for Fortune 500 decision-makers who are forced to navigate volatile geopolitical and economic shifts with data that is frequently outdated by the time it reaches a slide deck, as industry experts have observed. Brox, a predictive human intelligence startup, recently announced a strategic funding round following a year where they reported 10X revenue growth. Their proposition is as ambitious as it is technical: the creation of a "parallel universe" populated by 60,000 digital twins of real, living human beings and their entire demographic profiles and consumer preferences, allowing enterprises to run unlimited experiments in hours rather than months. “These digital twins are one-to-one replicas of actual, real individuals," said Brox CEO Hamish Brocklebank in a recent video call interview with VentureBeat. "We recruit real people like a normal panel company does, pay them to interview them, and capture all the data around them — fully consent-driven.” The company, currently a lean 14-person operation, is positioning itself as the antithesis of the "insane" research industry. By replacing statistical models with behavioral replicas, Brox aims to transform how the world’s largest banks and pharmaceutical giants anticipate human reactions to high-stakes global and market-shifting events, or narrow, targeted product releases and personnel news, and everything in between. The kinds of surveys and specific questions that Brox asks its digital twins are completely open-ended and can be customized to fit any conceivable business customer's use cases and goals. 
According to Brocklebank, examples of survey questions include: “What happens if America invades Iran or Greenland? Will depositors at Bank of America put more money into their account or take more money out? Or, in pharmaceuticals, if RFK Jr. says something next week, will that make people more likely to take vaccines or less likely?”

Not synthetic people — AI copies of real ones

The core differentiator of Brox’s technology lies in the fidelity of its input data. While many competitors in the "digital audience" space rely on purely synthetic identities — generic personas generated by Large Language Models (LLMs) — Brocklebank argues that these methods inevitably produce "AI slop". Purely synthetic audiences often cluster around a tight distribution of answers, over-indexing for "correct" or "healthy" behaviors (such as eating broccoli) because of inherent biases in the underlying models. Brox’s "Digital Twins" are instead one-to-one behavioral replicas of real individuals who have been recruited and interviewed with exhaustive depth. The process is intensive:

Deep Interviews: The company conducts hours of real and AI-driven interviews with each participant.
Psychological Depth: The data collection seeks to understand fundamental "decision drivers," including upbringing, relationships, and even marital stability.
Data Density: For some twins, Brox maintains up to 300 pages of text data, representing what Brocklebank calls "the deepest per person data set that exists".

To solve the "black box" problem common in AI, Brox utilizes a "reasoning chain" for its predictive outputs. When a digital twin predicts a reaction — such as how a $2 billion net-worth individual might respond to a specific interest rate hike — the model introspects and provides a step-by-step explanation for that decision. This allows clients to understand not just what will happen, but the underlying psychology of why it is happening.
Scaling the "unscalable" interview

The product offering is currently live in the US, UK, Japan, and Turkey. Brox has successfully digitized specific, high-value cohorts that are traditionally difficult for researchers to access. This includes a panel of "high-net-worth" individuals (those worth over $5 million) and specialized medical professionals like dermatologists — including a multibillionaire. However, the largest value for customers is likely in the aggregate mass of all individuals that can be polled en masse and/or segmented across demographics, especially those of medium and lower income levels, whose purchasing power and decision-making is more constrained and whose market…

One of the more unique aspects of the Brox platform is its incentive structure. To ensure twins remain up-to-date, real-world counterparts are re-contacted frequently. For high-value individuals who are not motivated by small cash payments, Brox has issued Stock Appreciation Rights (SARs), essentially making these participants "investors" in the company’s success to ensure they continue to provide high-fidelity personal updates. The platform’s use cases currently focus on two primary sectors:

Pharmaceuticals: Predicting vaccine hesitancy or how physicians might react to new biologics based on shifting political climates.
Finance: Simulating how depositors at major banks might move funds in response to geopolitical events, such as conflicts in the Middle East.

As for why go to the trouble of interviewing and digitally cloning real people instead of just creating wholly fictitious, synthetic audience characters and personas using LLMs and other AI models, Brocklebank offered his perspective. “You can create 10,000 truly synthetic digital twins, but the answers will still normalize into a very tight distribution, which is not realistic when you’re actually asking real people," Brocklebank said.
By maintaining a pre-built audience of 60,000 twins, the company enables clients to bypass the recruitment phase of research. A large US bank or a global pharma giant can now "query" the digital population and receive a validated analysis in a matter of hours.

Pricing and accessibility

Unlike traditional research firms that charge on a per-project or per-respondent basis, Brox operates as a high-end Software-as-a-Service (SaaS) platform with enterprise-level commercial licensing. The company avoids the "seat" or "usage" limits that often hinder rapid experimentation within large organizations.

Pricing Tiers: Subscriptions are sold as blanket flat fees, starting at a minimum of $100,000 per year.
Top-Tier Contracts: For larger deployments involving multiple teams and global data access, contracts scale up to $1.5 million per year.
Usage Rights: Clients are granted unlimited usage during the contract period. This allows them to run thousands of simulations without worrying about incremental costs, encouraging a culture of "testing everything" before deployment.

From a legal and privacy standpoint, the digital twins are built on a "fully consent-driven" framework. While the twins can be traced back to real human data for internal validation, the platform is designed to provide aggregated behavioral insights that protect the anonymity of the participants while maintaining the predictive power of their digital replicas.

Rejecting the rise of Kalshi, Polymarket and 'prediction markets'

The tech industry has recently seen a surge in valuations and interest in "prediction markets" like PolyMarket and Kalshi, which allow users to bet on the outcomes of various global events. However, the leadership at Brox maintains a distinct distance from these platforms, citing a "personal disdain" for betting markets from both a moral and intellectual perspective.
Brocklebank argues that while betting markets can predict outcomes (e.g., who wins an election), they offer zero utility for business decision-makers because they fail to provide the "why". Knowing there is a 60% chance of a certain candidate winning does not help a company adjust its consumer strategy; knowing why a specific cohort of depositors is feeling anxious does. Investors including Scribble Ventures, Wonder Ventures, and Vela Partners have backed this "human-first" approach to AI, betting that the moat created by deep human data will prove more resilient than the commoditized models of synthetic data providers. As Brox prepares for launches in the Middle East and APAC, the company is moving toward its ultimate goal: simulating the entire world as a "parallel universe" for risk-free decision-making.

Editor's pick
Arxiv· 2 days ago

Are you with me? A Framework for Detecting Mental Model Discrepancies in Task-Based Team Dialogues

arXiv:2605.03149v1 Announce Type: new Abstract: Humans typically use natural language to update teammates on task states. Since not all updates are communicated, discrepancies arise between the team members' mental models that negatively affect overall team performance. How can we categorize such discrepancies? Do misalignments detected in team dialogue predict future mental model misalignments? Traditional shared mental model (SMM) assessment methods rely on retrospective expert coding that cannot capture real-time coordination dynamics. We propose a framework to identify and categorize four types of mental model discrepancies: unsupported beliefs, false beliefs, belief contradictions, and omissions, all of which can naturally emerge in team dialogues. Using dialogues from twenty dyad teams performing collaborative object identification tasks across four sequential levels, we demonstrate that these discrepancy patterns contain predictive signals. Averaging historical discrepancy counts achieves meaningful prediction accuracy using uniform weighting as an exploratory baseline, with differential predictability across discrepancy types.

Editor's pick · Manufacturing & Industrials
Arxiv· 2 days ago

Automated Large-scale CVRP Solver Design via LLM-assisted Flexible MCTS

arXiv:2605.03339v1 Announce Type: new Abstract: Solving large-scale CVRP (LSCVRP) with hundreds to thousands of nodes remains difficult for even state-of-the-art solvers. Divide-and-conquer can scale by decomposing the instance into size-reduced subproblems, but designing decomposition logic and configuring sub-solvers is highly expertise- and labor-intensive. Large Language Models (LLMs) have emerged as promising tools for automated algorithm design. However, existing LLM-driven approaches struggle with LSCVRP primarily due to the difficulty in generating sophisticated search strategies within a limited context window. To bridge this gap, we propose the LLM-assisted Flexible Monte Carlo Tree Search (LaF-MCTS), a novel framework that automates the design of high-performance LSCVRP solvers. We develop a three-tier decision hierarchy to enable incremental design of decomposition policies and sub-solvers for LSCVRP. To enable efficient search within the algorithmic hypothesis space, we introduce semantic pruning to eliminate semantically and structurally redundant codes, and branch regrowth to regenerate codes and preserve diversity. Extensive experiments on CVRPLib demonstrate that LaF-MCTS autonomously composes and optimizes decomposition-enhanced solvers that surpasses various state-of-the-art CVRP solvers.

Editor's pick
Medium· 3 days ago

Are You Still Calling It an “AI Bubble”?

Today’s AI — at least across many practical domains — is not simply a “toy that tells convincing lies.” Text processing, summarization, translation, classification, search assistance, code completion, template generation, meeting note organization, first-line inquiry response, data structuring, and pattern extraction.

Editor's pick · Technology
StockStory· 3 days ago

WK Q1 Deep Dive: Large Deal Momentum and AI Investments Drive Growth Amid Cautious Outlook

Cloud reporting platform Workiva (NYSE:WK) reported revenue ahead of Wall Street’s expectations in Q1 CY2026, with sales up 19.9% year on year to $247.3 mill...

AI Startups & Venture · 8 articles

Labor, Society & Culture

18 articles
AI & Employment · 10 articles
Editor's pick
SQ Magazine· 3 days ago

How AI Replaces Jobs in 2026: Which Industries Are Most Affected

Challenger data shows AI was cited in roughly 13% of 2026 US job-cut plans year-to-date. ... The U.S. Bureau of Labor Statistics projects paralegal employment will grow about 1% from 2024 to 2034, slower than the average for all occupations, with about 39,300 annual openings projected and 367,220 paralegals and legal assistants employed at a median annual wage of $61,010 as of May 2024. Has AI saved newsroom jobs through productivity ...

Editor's pick · Manufacturing & Industrials
Supplychaindigital· 3 days ago

Gartner: The Cost of Reducing Entry-Level Hiring for AI

As more businesses look to save money and fill workforce gaps with AI, Gartner warns that pausing entry-level hiring could result in higher business costs

Editor's pick · Manufacturing & Industrials
The Korea Times· 3 days ago

AI era forces Korea's labor, capital to negotiate new ‘survival pact'

As artificial intelligence (AI) spreads from coding assistants to factory robots and hiring tools, experts say Korea’s familiar labor disputes over...

Editor's pick · Government & Public Sector
The Straits Times· 3 days ago

Parliament supports motion on ‘no jobless growth’

The grant provides funds of up ... job redesign projects, including consultancy fees and worker reskilling costs. Later in 2026, eligible businesses will also receive $10,000 under the redesigned SkillsFuture Enterprise Credit, which can be used to offset costs incurred from workforce transformation programmes such as those under the EWTP. The council will also pay special attention to students and younger workers who are anxious about AI’s impact ...

Editor's pick
Times Square Chronicles· 2 days ago

AI Layoffs Are Becoming the Defining Business Story of 2026

Corporate America has entered a new phase of the AI revolution. For the last two years, businesses treated artificial intelligence as a productivity tool. In 2026, companies are increasingly treating it as a workforce replacement strategy. This week alone, multiple major companies either announced ...

AI Ethics & Safety · 5 articles
Editor's pick · Healthcare
Arxiv· 2 days ago

Evaluating Patient Safety Risks in Generative AI: Development and Validation of a FMECA Framework for Generated Clinical Content

arXiv:2605.04085v1 Announce Type: new Abstract: Objectives: Large language models (LLMs) are increasingly used for clinical text summarization, yet structured methods to assess associated patient safety risks remain limited. Failure Mode, Effects, and Criticality Analysis (FMECA) provides a proactive framework for systematic risk identification but has not been adapted to LLM-generated clinical content. This study aimed to develop and validate a novel FMECA framework for the prospective assessment of patient safety risks in LLM-generated clinical summaries. Materials and Methods: An interdisciplinary expert panel (n = 8) developed a taxonomy of failure modes through literature review and brainstorming. Standard FMECA dimensions (occurrence, severity, detectability) were adapted into 5-point ordinal scales. The framework was applied to 36 discharge summaries from four patients, generated by an open LLM (GPT-OSS 120B) using real-world clinical data from the Geneva University Hospitals. Reviewers independently annotated the summaries across two rounds. Inter-rater reliability was assessed at failure mode, severity and detectability score levels. Usability and content validity were evaluated using an adapted System Usability Scale and structured feedback. Results: The final framework comprised 14 failure modes organized into categories. Inter-rater agreement improved between rounds, reaching moderate-to-substantial agreement for failure mode identification and good agreement for severity and detectability scoring. Usability was rated as good (mean SUS: 79.2/100), with high evaluator confidence. Discussion and Conclusion: This study presents the first FMECA-based framework for systematic patient safety risk assessment of LLM-generated clinical summaries. The framework provides a structured and reproducible method for identifying clinically relevant risks caused by these summaries.
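The scoring machinery behind FMECA is simple to illustrate. The failure modes below are hypothetical examples, not the paper's 14-mode taxonomy, and the textbook risk priority number (RPN = occurrence × severity × detectability) stands in for whatever aggregation the framework actually uses over its 5-point ordinal scales:

```python
# Hypothetical failure-mode entries: (name, occurrence, severity, detectability),
# each scored on a 1-5 ordinal scale as in classic FMECA.
modes = [
    ("omitted medication change", 2, 5, 4),
    ("hallucinated lab value",    1, 5, 3),
    ("wrong follow-up date",      3, 3, 2),
]

def rpn(mode):
    # Classic FMECA criticality: risk priority number = O * S * D.
    _, occurrence, severity, detectability = mode
    return occurrence * severity * detectability

# Rank failure modes by criticality to prioritise mitigation effort.
ranked = sorted(modes, key=rpn, reverse=True)
for name, o, s, d in ranked:
    print(f"{name}: RPN={o * s * d}")
```

Ranking by a composite criticality score is what lets a review panel focus its limited annotation effort on the failure modes most likely to harm patients while evading detection.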

Editor's pick · PAYWALL · Technology
FT· 3 days ago

AI Labs: Are Anthropic really the good guys?

Dario Amodei casts his company as the good guys of the AI race. Will that last?

Editor's pick · PAYWALL · Technology
FT· 3 days ago

Conned by a chatbot

Like tricksters, LLMs have perfected the art of plausibility

Editor's pick · Education
Arxiv· 2 days ago

Stop Automating Peer Review Without Rigorous Evaluation

arXiv:2605.03202v1 Announce Type: new Abstract: Large language models offer a tempting solution to address the peer review crisis. This position paper argues that today's AI systems should not be used to produce paper reviews. We ground this position in an empirical comparison of human- versus AI-generated ICLR 2026 reviews and an evaluation of the effect of automated paper rewriting on different AI reviewers. We identify two critical issues: 1) AI reviewers exhibit a hivemind effect of excessive agreement within and across papers that reduces perspective diversity. 2) AI review scores are trivially gameable through paper laundering: prompting an LLM to rewrite a paper could significantly increase the scores from AI reviewers, demonstrating that LLM reviewers are easy to game through stylistic changes rather than scientific results. However, non-gameability and review diversity are necessary but not sufficient conditions for automation. We argue that addressing the peer review crisis requires a science of peer review automation -- not general-purpose LLMs deployed without rigorous evaluation.

Editor's pick · Healthcare
Arxiv· 2 days ago

AI and Suicide Prevention: A Cross-Sector Primer

arXiv:2605.04321v1 Announce Type: new Abstract: AI chatbots already function as de facto mental health support tools for millions of people, including people in crisis. Yet, they lack the clinical validation, shared standards, and coordinated oversight that their societal role demands. This primer was developed in conjunction with a multistakeholder workshop hosted by Partnership on AI in 2026, convening AI labs, mental health practitioners, people with lived experience, and policymakers, to provide a common cross-sector reference point for the current state of the field of AI and suicide prevention. It begins with an overview of clinical best practices, then turns to how frontier AI systems (as of winter 2026) detect and respond to suicide and non-suicidal self-injury (NSSI) queries. Together, these provide insight into what it would take to design and implement AI tools that not only better prevent suicide and NSSI, but also promote overall well-being. Drawing on clinical literature, publicly available AI lab policies, an emerging landscape of evaluation frameworks, and conversations with leaders across the AI and mental health fields, we map challenges posed by general-purpose AI chatbots for mental health across model, product, and policy layers, ultimately highlighting priority areas where cross-industry alignment is both urgently needed and achievable.

Public Attitudes to AI
2 articles
Editor's pick
Arxiv· 2 days ago

Heterogeneous Ordinal Structure Learning with Bayesian Nonparametric Complexity Discovery

arXiv:2605.04191v1 Announce Type: cross Abstract: Public attitudes toward artificial intelligence are heterogeneous, ordinally measured, and poorly captured by any single dependency graph. Existing ordinal structure learners assume a shared directed acyclic graph (DAG) across all respondents; recent heterogeneous ordinal graphical-model approaches focus on subgroup discovery rather than confirmatory cluster-specific DAG estimation; and latent profile analyses discard dependency structure entirely. We introduce a heterogeneous ordinal structure-learning framework combining monotone Gaussian score embedding, Bayesian nonparametric (BNP) complexity discovery via a truncated stick-breaking prior, and confirmatory fixed-K estimation with cluster-specific sparse DAG learning. The key methodological insight is a discovery-to-confirmation workflow: the nonparametric stage calibrates plausible archetype complexity, while inner-validated confirmatory refitting yields stable, interpretable structural estimates. On the 2024 Pew American Trends Panel AI attitudes survey, Wave 152 (W152) (N = 4,788, 8 ordinal items), the confirmatory K*=5 model reduces holdout transformed-score mean squared error (MSE) by 25.8% over a single-graph baseline and by 4.6% over mixture-only clustering. A controlled tiered semi-synthetic benchmark calibrated to W152 structure validates recovery across difficulty regimes and transparently reveals failure modes under stress conditions.
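For readers who want the mechanics behind the abstract's "truncated stick-breaking prior": the standard truncated construction (the textbook Sethuraman form; the paper may parameterize it differently) generates mixture weights as

```latex
% Truncated stick-breaking construction for K mixture components
v_k \sim \mathrm{Beta}(1, \alpha) \quad (k = 1, \dots, K-1), \qquad v_K = 1,
\qquad
\pi_k = v_k \prod_{j < k} (1 - v_j), \qquad \sum_{k=1}^{K} \pi_k = 1 .
```

Because later weights shrink geometrically in expectation, posterior mass concentrates on however many components the data actually support, which is what lets the nonparametric stage "discover" a plausible archetype count before the confirmatory fixed-K refit.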

Technology & Infrastructure

43 articles
AI Agents & Automation
8 articles
Editor's pickTechnology
Arxiv· 2 days ago

cotomi Act: Learning to Automate Work by Watching You

arXiv:2605.03231v1 Announce Type: new Abstract: What if a browser agent could learn your work simply by watching you do it? We present cotomi Act, a browser-based computer-using agent that combines reliable multi-step task execution with persistent organizational knowledge learned from user behavior. For execution, an agent scaffold with adaptive lazy observation, verbal-diff-based history compression, coarse-grained actions, and test-time scaling via best-of-N action selection achieves 80.4% on the 179-task WebArena human-evaluation subset, exceeding the reported 78.2% human baseline. For organizational knowledge, a behavior-to-knowledge pipeline passively observes the user's browsing and progressively abstracts it into artifacts (task boards, wiki) exposed through a shared workspace editable by both user and agent. A controlled proxy evaluation confirms that task success improves as behavior-derived knowledge accumulates. In our live demonstration, attendees interact with the system in a real browser, issuing tasks and observing end-to-end autonomous execution and shared knowledge management.

Editor's pickTechnology
Substack· 3 days ago

GTM: How AI Will Go From Influencing 20% to Making 90% of Your Decisions | Steve Lucas, Boomi CEO

In this episode, Sophie Buonassisi sits down with Steve at HumanX in San Francisco to unpack why 2026 is the year AI moves from pilot to production, the three diagnostic questions every CEO should ask before deploying agentic AI, and why “change only happens at the speed of trust.” Steve also gets candid about the internal protest that erupted when Boomi rolled out its first CSM agent, his prediction that AI will go from influencing 20% of executive decisions to making 90% of them within two years, and how he picked up Claude Code on an airplane to vibe-code a contracts agent for his sales team.

Editor's pickTechnology
VentureBeat· 3 days ago

The app store for robots has arrived: Hugging Face launches open-source Reachy Mini App Store with 200+ apps

There's an app for nearly every imaginable user and use case these days, but one thing they all have in common is that they're centered around one device: the smartphone. That changes today as Hugging Face, the 10-year-old New York City startup best known for being the go-to place online to host and use cutting-edge, open-source AI models, agents and applications, launches a new App Store for Reachy Mini, its low-cost ($299) open-source physical robot that debuted back in July 2025 (itself the fruit of Hugging Face's acquisition of another startup, Pollen Robotics). The new Hugging Face Reachy Mini App Store already hosts a library of over 200 community-built applications, and Reachy Mini owners will be able to download any of these free of charge to start (unlike smartphone apps, there's no monetization option for app creators on this store — yet). The Reachy Mini App Store will also offer Reachy Mini owners — around 10,000 units have been sold so far since last year — an easy means of building their own custom apps for the tiny, stationary desktop robot with built-in camera eyes, speaker, and microphone, via Hugging Face's existing, AI-powered agent called "ML Intern." The significance lies not just in the hardware, but in the removal of the "roboticist" barrier; for the first time, individuals without a background in engineering or coding are shipping functional robotics software in under an hour. "Anyone can build the apps," said Clément Delangue, CEO and co-founder of Hugging Face, in a video interview with VentureBeat. "My intuition is that more and more [AI] model builders will release on Reachy Mini as a way to test the robotics ability of new models." Make robots as accessible to laypeople as PCs and smartphones The technical bottleneck in robotics has historically been the scarcity of high-quality training data. 
While Large Language Models (LLMs) have mastered general-purpose coding by training on massive repositories like Microsoft's GitHub, the volume of code specific to robotics remains "tiny" by comparison (though GitHub likely contains the largest existing, publicly accessible library of robotics code to date, with more than 17,000 different repositories or "repos" dedicated to the field). This lack of data has meant that, until now, AI agents were relatively poor at understanding the physical abstractions and firmware requirements of hardware. Hugging Face’s solution is an agentic toolkit that acts as an intermediary. Rather than forcing a user to learn a specific robotics SDK or master the nuances of a robot's firmware, the toolkit allows a user to describe a desired behavior in plain English—for instance, "wave when someone says good morning." An AI agent then handles the heavy lifting: it writes the code, tests it against the robot's specific constraints, and ships the final package. "Historically, it’s been extremely hard," Delangue told VentureBeat of building robotics applications. "But we’ve worked really hard on the topic with a mix of open sourcing everything we do, working on the right abstractions for robotics, and making it easier for agents to understand and use it." The platform is model-agnostic, supporting a wide range of leading intelligence engines. Users can build apps using Hugging Face’s own ML Intern agent or leverage external models including GPT-5.5, Claude Opus 4.6, Kimmy 2.6, Mini Max GM5, and Deep Sig V4 Pro. For real-time interaction, the official conversation apps utilize OpenAI Realtime and Gemini Live. By providing these high-level abstractions, Hugging Face has collapsed the traditional "integration weeks" of robotics work into a process that takes minutes.
Low-cost Reachy Mini is a hit

In order to take advantage of the new Hugging Face Reachy Mini App Store, users are encouraged to purchase Reachy Mini, a cute desktop robot Hugging Face launched back in July 2025 as an affordable, open-source alternative to the existing, commercially available robots from the likes of Boston Dynamics, whose infamous Spot robot dog retails for around $70,000. Even Chinese competitors start at $1,900+. In contrast, the Reachy Mini is accessibly priced for hobbyists and developers. It comes in two variants:

Reachy Mini Lite ($299 plus shipping): A tethered version that connects via USB and uses an external computer for processing.

Reachy Mini Wireless ($449 plus shipping): A standalone version featuring an on-board Raspberry Pi CM4 and Wi-Fi connectivity.

Delangue said that of the 10,000 Reachy Mini units sold so far, 3,000 were sold in just the past two weeks. Hugging Face expects to ship another 1,000 units within the next 30 days. Even those who don't own a Reachy Mini can still develop apps for it, using the Reachy Mini App Store and the Reachy App, which contains a 3D simulation of the robot and its responses.

The App Store itself is hosted on the Hugging Face Hub. It functions much like a standard software repository, but for hardware behaviors:

Search and Install: Users can find apps, click a button, and install them directly to their robot.

Forkability: Every app is "forkable," meaning a user can duplicate an existing app and ask an AI agent to modify it (e.g., "make it answer in French").

Simulation Mode: Crucially, the store includes a browser-based simulator. This allows users who do not own a physical Reachy Mini to build, test, and play with the catalog in a virtual environment.
Both are part of Hugging Face's ongoing "LeRobot" effort — a project that began in 2024 with Hugging Face researchers specializing in robotics and AI developing and publishing on the web their own open-source code, tutorials, and hardware to make robotics development more accessible to a wider audience. And unlike GitHub, which is designed for a developer audience, the Hugging Face Reachy Mini App Store is designed for robot owners and users who may have no technical experience or training whatsoever.

Continuing with the open-source ethos and practice

Hugging Face’s strategy is rooted in the belief that closed-source hardware and software are "almost impossible" to build for at scale. Delangue notes that closed systems prevent the training of agents and limit the ability of the community to innovate. Consequently, the entire Reachy Mini platform is open-source. This open licensing model has two primary implications for the ecosystem:

Accelerated Development: Because the code is public and integrated with the Hugging Face ecosystem via "Spaces," Hugging Face's feature for hosting AI-powered web apps launched in 2021, agents can more easily learn how to interact with the hardware.

Community Sovereignty: Apps are not locked behind a proprietary wall. Currently, all 200+ apps on the store are free, though the platform's foundation on "Spaces" provides the flexibility for creators to potentially monetize their work in the future.

"For the moment, all the apps are free," Delangue noted. "It’s flexible, it’s built on [Hugging Face] Spaces, so at some point maybe people are going to make them paid."

Robotics enters its accessible hobbyist era

Hugging Face's Reachy Mini App Store is launching with 200 apps already available. So who built them, and how did they do it without this platform existing prior? Delangue told VentureBeat that more than 150 different creators have contributed to the store, most of whom had never written a line of robotics code before.
Yet, they have been able to do so thanks to Hugging Face's ML Intern and GitHub. The new Hugging Face Reachy Mini App Store now puts the tools and existing apps into one place for easier accessibility.

Delangue was keen to highlight one of the early Reachy robotics app developers in particular to VentureBeat: Joel Cohen, a 78-year-old retired marketing executive. Cohen, who is colorblind and has no technical background, spent two weeks assembling his Reachy Mini Lite (a task that usually takes three hours). Despite these challenges, he used an AI agent to build a "VP of Future Thinking" facilitator for his Zoom-based CEO peer groups. The app enables the robot to:

Greet 29 members by name.

Fact-check discussions in real time.

Summarize key themes and push back on surface-level answers.

"I built this by describing what I needed in plain English," Cohen stated in a press release provided to VentureBeat ahead of the launch. "No SDK. No robotics background. No developer experience."

Other community-driven applications include:

Emotional Damage Chess: A robot that plays chess and mocks the user’s blunders.

Reachy Phone Home: An anti-procrastination tool that detects when a user picks up their phone and tells them to get back to work.

Language Tutor: A physical companion that listens to speech and corrects accents.

F1 Race Commentator: A desk companion that calls Formula 1 races live as they happen.

Delangue himself related to VentureBeat that in only a few hours, he built an app for his own Reachy Mini robot at the Hugging Face Miami office to have the robot act as a receptionist. “It basically does face recognition to detect when you arrive in the office, and then it looks at you and onboards you," Delangue related. "It says, ‘Hey, welcome to the office. Who are you here to see?’ Then it sends me a message: ‘Carl just arrived at the office.
He’s here to meet you, and for these reasons.’ It works a little bit as my welcoming booth at the office, and it took me less than two hours to build that.”

Even for an experienced founder and developer such as Delangue, building apps for a robot was out of the question until the combination of Reachy Mini and ML Intern. “For me, it would have been impossible," the Hugging Face CEO said. "If you weren’t a robotics developer, it probably would have been impossible, or it would have taken a few months."

Democratizing robotics

The launch of the agentic App Store signals a fundamental shift in how we interact with machines. For sixty years, the field was gated by the requirement for deep technical expertise. By combining low-cost open hardware with the reasoning capabilities of modern AI agents, Hugging Face is moving toward a future where the hardware is a commodity and the behavior is limited only by what a user can describe. As Delangue noted during the launch, the goal was to provide a platform for people who "want to get into robotics but don’t have the hardware or the skills." With nearly 10,000 robots now "in the wild" and a burgeoning store of agent-written apps, the Reachy Mini has become the most widely deployed open-source desktop robot in history. The question is no longer how to build a robot, but what—now that the gate is open—we will ask them to do.

Editor's pickTechnology
Siliconrepublic· 3 days ago

ServiceNow wants to be ‘AI agent of agents’ with Otto platform and AI tools

SaaS giant ServiceNow wants to move from being the ‘platform of platforms’ to the ‘AI agent of agents’ with the launch of Otto and a wider AI suite.

Editor's pickTransportation & Logistics
Daily Brew· 3 days ago

Uber Shares What Happens When 1,500 AI Agents Hit Production

Uber details the operational challenges and outcomes of deploying 1,500 AI agents into their production environment.

Editor's pickFinancial Services
Daily AI News May 6, 2026: Are AI Finance Analysts the Future?· 3 days ago

Your Financial Competitive Edge, from Signal to Decision

Claude's financial services update introduces AI agents, connectors, and templates for workflows like KYC, valuation, and reporting. This signals a shift toward industry-specific AI automation in finance.

Editor's pickTechnology
Theregister· 3 days ago

Using AI to click around on a website burns 45x as many tokens as just using APIs

For AI agents, seeing is expensive

Editor's pickTechnology
Arxiv· 2 days ago

DAO-enabled decentralized physical AI: A new paradigm for human-machine collaboration

arXiv:2605.04522v1 Announce Type: cross Abstract: We propose DAO-enabled decentralized physical AI (DePAI), a democratic architecture for coordinating humans and autonomous machines in the operation and governance of physical-digital systems. We (1) synthesize foundations in blockchains, decentralized autonomous organizations (DAOs), and cryptoeconomics; (2) connect DAO design with digital-democracy research on deliberation and voting, showing how each can advance the other; (3) position DAO-governed decentralized physical infrastructure networks (DePIN) within a vertically integrated stack that links energy and sensing to connectivity, storage/compute, models, and robots; (4) show how these elements specify workflows that couple machine execution with human oversight, enabling enhanced self-organization of techno-socio-economic systems, which we call DePAI; and (5) analyze risks, including security, centralization, incentive failure, legal exposure, and the crowding-out of intrinsic motivation, and argue for value-sensitive design and continuously adaptive governance. DePAI offers a path to scalable, resilient self-organization that integrates physical infrastructure, AI, and community ownership under transparent rules, on-chain incentives, and permissionless participation, aiming to preserve human autonomy.

AI Infrastructure & Compute
13 articles
Editor's pickTechnology
Theregister· 2 days ago

Neocloud IREN buys OpenStack champion Mirantis

Former bitcoin miner plans to build an easier cloudy AI on-ramp while remaining a friend to FOSS

Editor's pickTechnology
Daily Brew· 2 days ago

Higher usage limits for Claude and a compute deal with SpaceX

Anthropic has announced increased usage limits for Claude and a new compute partnership with SpaceX.

Editor's pickTechnology
Tom's Hardware· 3 days ago

Global semiconductor sales hit nearly $300 billion in Q1 2026 — chips are on track to top $1 trillion this year, says report

Volume is up 25% quarter-over-quarter, and sales totaled $99.5 billion in March alone.

Editor's pickTechnology
Bebeez· 3 days ago

Argyll and SambaNova launch sovereign AI cloud in the UK

Argyll Data Development and SambaNova have launched their sovereign AI inference cloud in the UK. The cloud platform runs on SambaNova’s AI hardware and software stack within existing UK data centers. The two companies first announced their intention to establish a sovereign AI cloud platform in October 2025, […]

Editor's pickPAYWALLTechnology
NYT· 2 days ago

Anthropic’s C.E.O. Says It Could Grow by 80 Times This Year

The chief executive, Dario Amodei, said the rapid growth had exponentially increased the start-up’s need for more computing power.

Editor's pickPAYWALLTechnology
FT· 3 days ago

SpaceX to rent data centre capacity to Anthropic

AI start-up is racing to add computing power to keep up with its growth

Editor's pickPAYWALLEnergy & Utilities
Bloomberg· 2 days ago

AI Boom Trumps Sleep, Says Boss of Data Center Operator NEXTDC

Short on sleep but flush with new funds, the head of Australian data center operator NEXTDC Ltd. has a message to investors: You snooze, you lose.

Editor's pickEnergy & Utilities
Daily Brew· 3 days ago

A Michigan farm town voted down plans for a giant OpenAI-Oracle data center. Weeks later, construction began

Despite local opposition, construction has commenced on a large-scale data center project in a Michigan town.

Editor's pickEducation
Bebeez· 3 days ago

University of Southern Denmark brings AI supercomputer online in Sønderborg

The University of Southern Denmark (SDU) has announced that its new national AI supercomputer in Sønderborg has been brought online. The supercomputer, dubbed Bitten, was built in partnership with Danfoss and HPE. The facility was designed to serve researchers and students across the Danish university system through UCloud, a sovereign research cloud […]

Editor's pickManufacturing & Industrials
Daily Brew· 2 days ago

Nvidia and Corning Team Up to Boost AI Manufacturing in US

Nvidia and Corning are partnering to boost U.S. manufacturing for AI infrastructure, specifically focusing on fiber and optical components for data centers.

Editor's pickTechnology
Bebeez· 3 days ago

OneQode signs 15-year 110MW lease at Bitzero data center in Norway

Bitzero has secured a long-term customer at its data center in Norway. The HPC data center and blockchain provider revealed that it had signed an agreement with OneQode – an AI cloud and network infrastructure provider – for 110MW of data center capacity at the site earlier this week. The agreement will span […]

Editor's pickTechnology
Bebeez· 3 days ago

T.Loop looks to develop data center in Hanko, Finland

Data center firm T.Loop is expanding into Finland. The company this week announced that the city of Hanko has reserved land in the Eastern Industrial Area for the development of a new data center. Further details of the project have not been shared. Hanko is a town on the southern coast of Finland […]

Editor's pickEnergy & Utilities
Arxiv· 2 days ago

Lithium enrichment threatens to curb fusion deployment

arXiv:2605.04707v1 Announce Type: cross Abstract: The impact of lithium isotopic enrichment on the global deployment of nuclear fusion energy is analysed. Lithium - the 6Li isotope in particular - is essentially one of two elemental fuels required by fusion reactors for tritium breeding. Whilst variable consumption of lithium is low enough to present negligible cost, it is instead the large stored inventory volume (50-100 tonnes) and its required enrichment that compound to significantly drive capital costs. These costs are driven by the inefficiency of the tritium breeding process, making this challenge fundamental to almost all fusion power plant concepts. Financing would further compound these effects, making lithium fusion fuels more akin to an upfront capital expenditure than operational expenditure. Other potential barriers to fusion deployment created by lithium are also discussed: enrichment technologies of today are shown to be too expensive, not scalable, and environmentally risky, and highly enriched 6Li is a controlled substance. Mitigating actions include: developing alternative enrichment technologies that are affordable, scalable, and do not rely on mercury; incorporating lithium enrichment as an explicit cost driver in reactor design processes, producing more compact reactors with smaller lithium inventories; establishing distinct enrichment levels to enable supply chain monitoring for misuse; and the most radical solution: breeding blankets that use natural, unenriched lithium. These actions may impact tritium breeding capabilities, which calls for an urgent re-assessment of the tritium breeding paradigm. Whatever solution is sought, lithium supply is a mission-critical issue that needs urgently addressing.

AI Models & Capabilities
10 articles
Editor's pickTechnology
Arxiv· 2 days ago

Evaluating Prompting and Execution-Based Methods for Deterministic Computation in LLMs

arXiv:2605.03227v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in natural language understanding and reasoning. However, their ability to perform exact, deterministic computation remains unclear. In this work, we systematically evaluate multiple prompting strategies, including Chain-of-Thought (CoT), Least-to-Most decomposition, Program-of-Thought (PoT), and Self-Consistency (SC), on tasks requiring precise and error-free outputs, including binary counting, longest substring detection, and arithmetic evaluation. To support this study, we introduce a synthetic dataset with diverse natural language instructions, enabling controlled evaluation of exact computation across multiple task types. Our results show that standard prompting methods achieve only moderate accuracy on sequence-based tasks. CoT provides limited improvement, while Least-to-Most suffers from error accumulation. In contrast, PoT achieves perfect accuracy by generating executable code and delegating computation to an external interpreter. Self-Consistency improves robustness through majority voting, but incurs substantial computational overhead. We further train a small domain-specific model (CodeT5-small) to generate executable programs, which achieves perfect accuracy on held-out synthetic test data across all tasks with minimal training cost. Overall, our findings suggest that LLMs may simulate reasoning patterns rather than reliably perform exact symbolic computation. For deterministic tasks, combining LLMs with external tools or using specialized models provides a more reliable and efficient solution.
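The abstract's strongest result, Program-of-Thought, is easy to picture: the model emits a program, and an external interpreter performs the exact computation. A minimal sketch of that delegation step on the paper's binary-counting task (the generated_code string stands in for real model output; no model is called here):

```python
# Program-of-Thought (PoT), sketched: the LLM's job is to write the
# program; the Python interpreter, not the model's token stream,
# carries out the deterministic computation.
generated_code = """
def solve(bits: str) -> int:
    # Binary counting task: count the '1' bits exactly.
    return bits.count('1')
"""

namespace: dict = {}
exec(generated_code, namespace)           # hand execution to the interpreter
result = namespace["solve"]("1011010111")
print(result)  # 7
```

The point of the pattern is that once the program is correct, accuracy on held-out inputs is exact by construction, which is why the paper reports perfect accuracy for PoT and for a small code-generating model paired with an interpreter.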

Editor's pickManufacturing & Industrials
Reuters· 3 days ago

French startup unveils AI model for robots and human-like hand

May 6 (Reuters) - Genesis AI, a French robotics startup backed by former Google CEO Eric Schmidt and telecoms tycoon Xavier Niel, on Wednesday unveiled an AI model designed to make robots more adaptable, along with a human-like robotic hand.

Editor's pick
Arxiv· 2 days ago

CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing

arXiv:2605.02910v2 Announce Type: new Abstract: Recent advances in large language models have led to strong performance on reasoning and environment-interaction tasks, yet their ability for creative problem-solving remains underexplored. We study this capability through the lens of creative tool use, where a model repurposes available objects by reasoning about their affordances and attributes rather than relying on canonical usage. As a first step, we introduce CreativityBench, a benchmark for evaluating affordance-based creativity in LLMs. To this end, we build a large-scale affordance knowledge base (KB) with 4K entities and 150K+ affordance annotations, explicitly linking objects, parts, attributes, and actionable uses. Building on this KB, we generate 14K grounded tasks that require identifying non-obvious yet physically plausible solutions under constraints. Evaluations across 10 state-of-the-art LLMs, including closed and open-source models, show that models can often select a plausible object, but fail to identify the correct parts, their affordances, and the underlying physical mechanism needed to solve the task, leading to a significant drop in performance. Furthermore, improvements from model scaling quickly saturate, strong general reasoning does not reliably translate to creative affordance discovery, and common inference-time strategies such as Chain-of-Thought yield limited gains. These results suggest that creative tool use remains a major challenge for current models, and that CreativityBench provides a useful testbed for studying this missing dimension of intelligence, with potential implications for planning and reasoning modules in future agents.

Editor's pickTransportation & Logistics
Arxiv· 2 days ago

Revisiting the Travel Planning Capabilities of Large Language Models

arXiv:2605.03308v1 Announce Type: new Abstract: Travel planning serves as a critical task for long-horizon reasoning, exposing significant deficits in LLMs. However, existing benchmarks and evaluations primarily assess final plans in an end-to-end manner, which lacks interpretability and makes it difficult to analyze the root causes of failures. To bridge this gap, we decompose travel planning into five constituent atomic sub-capabilities: Constraint Extraction, Tool Use, Plan Generation, Error Identification, and Error Correction. We implement a decoupled evaluation protocol leveraging oracle intermediate contexts to rigorously isolate these components, thereby measuring the atomic performance boundary without the noise of cascading errors. Our results highlight a clear contrast in performance: while LLMs are proficient in extracting explicit constraints, they struggle to infer implicit, open-world requirements. Furthermore, they exhibit structural biases in plan generation and suffer from ineffective self-correction, characterized by excessive sensitivity and erroneous persistence. These findings offer precise directions for improving LLM reasoning and planning abilities.

Editor's pick
Arxiv· 2 days ago

Position: the Stochastic Parrot in the Coal Mine. Model Collapse is a Threat to Low-Resource Communities

arXiv:2605.04127v1 Announce Type: cross Abstract: Model collapse, the degradation in performance that arises when generative models are trained on the outputs of prior models, is an increasing concern as artificially generated content proliferates. Related critiques of large language models have highlighted their tendency to reproduce frequent patterns in training data, their reliance on vast datasets, and their substantial environmental cost. Together, these factors contribute to data degradation, the reinforcement of cultural biases, and inefficient resource use. In this position paper we aim to combine these views and argue that model collapse threatens current efforts to democratize AI. By reducing training efficiency and skewing data distributions away from the tails of their support, model collapse disproportionately impacts low-resource and marginalized communities. We examine both the environmental and cultural implications of this phenomenon, situate our position within recent position papers on model collapse, and conclude with a call to action. Finally, we outline initial directions for mitigating these effects.

AI Security & Cybersecurity
7 articles
Editor's pickMedia & Entertainment
Arxiv· 2 days ago

An Evaluation of Chat Safety Moderations in Roblox

arXiv:2605.04491v1 Announce Type: new Abstract: Roblox is among the most popular online gaming platforms, used by hundreds of millions of users every day. A substantial portion of these users are underage, who are at a greater risk, where abusive users may utilize Roblox's real-time chat interface to make the initial contact with potential victims. Roblox employs automated chat moderation mechanisms to detect potentially abusive messages; however, to date, their effectiveness has not been independently investigated. Toward this goal, we collected approximately 2 million chat messages from four games across multiple age groups and analyzed them to evaluate the moderation system. These messages were collected from public game servers following ethical and legal norms as well as Roblox's terms of service. We use this corpus to qualitatively study which types of unsafe chats escape the moderation system and how policy-violating users evade the moderation system. Given the dataset's scale, it is prohibitively expensive to conduct qualitative content analysis manually. Therefore, we adopt a two-step approach. First, we manually labeled safe and unsafe messages (n=99.8K) and used them as a ground truth to evaluate four locally hosted state-of-the-art large language models (LLMs). Next, the best-performing LLM was applied to the entire corpus to identify potentially unsafe messages, which we manually categorized using iterative open and axial coding methods until thematic saturation was reached. Overall, our findings reveal a troublesome reality: numerous instances of unsafe chat messages related to grooming, sexualizing minors, bullying, & harassment, violence, self-harm, and sharing sensitive information, etc., escaped the current moderation. Our analysis of users whose messages were previously flagged revealed that they continue to send harmful messages by employing a wide range of techniques to evade the moderation system.

Editor's pickDefense & National Security
Arxiv· 2 days ago

Stable Agentic Control: Tool-Mediated LLM Architecture for Autonomous Cyber Defense

arXiv:2605.03034v1 Announce Type: new Abstract: Agentic systems involved in high-stake decision-making under adversarial pressure need formal guarantees not offered by existing approaches. Motivated by the operational needs of security operations centers (SOCs) that must configure endpoint detection and response (EDR) policies under adversarial pressure, we present a tool-mediated architecture: LLM agents use deterministic tools (Stackelberg best-response, Bayesian observer updates, attack-graph primitives) and select from finite action catalogs enforced at the tool-output interface. A composite Lyapunov function machine-checked in Lean 4 with zero sorry certifies controllability, observability from asymmetric sensor data, and Input-to-State Stability (ISS) robustness under intelligent adversarial disturbance, with two corollaries extending the certificate to any controller or adversary from the catalogs. On 282 real enterprise attack graphs, the claims hold with margin. On paired offensive/defensive telemetry, a tool-mediated Claude Sonnet 4 controller reduces the attacker's expected payoff (game value) by 59% relative to a deterministic greedy baseline, with zero variance across 40 runs at four temperatures. A Claude Haiku 4.5 controller converges to suboptimal game values but stays catalog-bounded over an additional 40 runs, demonstrating that architectural stability is not dependent on the controller capability. The LLM agent's non-determinism furthers creative exploration of strategies, while the tool-mediated architecture ensures system stability.
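The load-bearing idea in the abstract is that the LLM never acts directly: it "selects from finite action catalogs enforced at the tool-output interface." A hedged sketch of that enforcement layer (action names are illustrative, not taken from the paper):

```python
# Catalog enforcement at the tool-output interface: whatever the LLM
# controller proposes, only pre-approved actions ever reach the
# environment. This is what makes the stability certificate hold for
# *any* controller, capable or not.
ACTION_CATALOG = {"isolate_host", "raise_log_level", "block_ip", "no_op"}

def enforce_catalog(proposed: str, fallback: str = "no_op") -> str:
    """Clamp a possibly hallucinated controller output to the catalog."""
    return proposed if proposed in ACTION_CATALOG else fallback

print(enforce_catalog("block_ip"))   # block_ip
print(enforce_catalog("rm -rf /"))   # no_op
```

This is the design choice behind the paper's Haiku result: a weaker controller converges to worse game values, but because its outputs are catalog-bounded, the system-level stability guarantee is unaffected.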

Editor's pickTechnology
Arxiv· 2 days ago

Enhancing Agent Safety Judgment: Controlled Benchmark Rewriting and Analogical Reasoning for Deceptive Out-of-Distribution Scenarios

arXiv:2605.03242v1 Announce Type: new Abstract: Tool-using agent systems powered by large language models (LLMs) are increasingly deployed across web, app, operating-system, and transactional environments. Yet existing safety benchmarks still emphasize explicit risks, potentially overstating a model's ability to judge deceptive or ambiguous trajectories. To address this gap, we introduce ROME (Red-team Orchestrated Multi-agent Evolution), a controlled benchmark-construction pipeline that rewrites known unsafe trajectories into more deceptive evaluation instances while preserving their underlying risk labels. Starting from 100 unsafe source trajectories, ROME produces 300 challenge instances spanning contextual ambiguity, implicit risks, and shortcut decision-making. Experiments show that these challenge sets substantially degrade safety-judgment performance, with hidden-risk cases remaining particularly non-trivial even for recent frontier models. We further study ARISE (Analogical Reasoning for Inference-time Safety Enhancement), a retrieval-guided inference-time enhancement that retrieves ReAct-style analogical safety trajectories from an external analogical base and injects them as structured reasoning exemplars. ARISE improves judgment quality without retraining, but is best viewed as a task-specific robustness enhancement rather than a standalone safety guarantee. Together, ROME and ARISE provide practical tools for stress-testing and improving agent safety judgment under deceptive distribution shifts.

Editor's pickTechnology
Arxiv· 2 days ago

Learning Correct Behavior from Examples: Validating Sequential Execution in Autonomous Agents

arXiv:2605.03159v1 Announce Type: new Abstract: As autonomous agents become increasingly sophisticated, validating their sequential behavior presents a significant challenge. Traditional testing approaches require manual specification, exact sequence matching, or thousands of training examples. We present a novel algorithm that automatically learns correct behavior from just 2-10 passing execution traces and validates new executions against this learned model. Our approach combines dominator analysis from compiler theory with multimodal large language model-powered semantic understanding to identify essential states and handle non-deterministic behavior. The system constructs a generalized ground truth model using Prefix Tree Acceptors, merges traces through multi-tiered equivalence detection, and validates new executions via topological subsequence matching. In controlled experiments, our system achieved high accuracy in detecting product bugs and false successes using only 3 training traces. This approach provides explainable validation results with coverage metrics and works across diverse domains including UI testing, code generation, and robotic processes.

Editor's pickPAYWALLTechnology
Washington Post· 3 days ago

Opinion | AI-powered cyberattack threats are growing. Here's how to combat them. - The Washington Post

Artificial intelligence is tearing down cyberdefenses. Here’s what the government can do to protect Americans.

Editor's pickFinancial Services
Daily Brew· 3 days ago

X user tricks Grok into sending them $200,000 in crypto using Morse code

A security vulnerability allowed an X user to manipulate the Grok AI into authorizing a large cryptocurrency transfer.

Editor's pick
Arxiv· 2 days ago

Connecting online criminal behavior with machine learning: Using authorship attribution to analyze and link potential online traffickers

arXiv:2605.04080v1 Announce Type: cross Abstract: This research investigated how online criminal activities can be better understood and connected using data-driven machine learning methods. Many illegal activities, such as human trafficking and illicit trade, have moved to online platforms where offenders hide behind anonymous accounts and frequently change identities. This makes it difficult for authorities to understand how large these networks are and how different online profiles may be linked. The research shows that people tend to maintain consistent patterns in how they write advertisements and present images online, even when they try to stay anonymous. By analysing these patterns across large collections of online advertisements, the research demonstrates how to link related accounts and identify repeated behaviour across illegal online markets. In addition, the research also addresses how such methods should be used responsibly. It proposes clear guidelines to ensure that privacy, fairness, and transparency are respected when these tools are applied. Overall, the research provides practical ways to support law enforcement investigations while emphasising careful and ethical use.

Adoption, Deployment & Impact

25 articles
AI Adoption Barriers & Enablers9 articles
Editor's pickProfessional Services
Arxiv· 2 days ago

Making the Invisible Visible: Understanding the Mismatch Between Organizational Goals and Worker Experiences in AI Adoption

arXiv:2605.03078v1 Announce Type: new Abstract: While AI is often introduced into organizations to drive innovation and efficiency, many adoption efforts fail as workers resist and struggle to integrate these systems. These failures point to a deeper issue: workers, the very people expected to collaborate with AI, are often invisible in decisions about how AI is designed and used. Drawing on interviews with professionals who interact with AI systems daily in healthcare, finance, and management, we examine the disconnect between organizational expectations and worker experiences. We identify key barriers, including poor usability and interoperability, misaligned expectations, limited control, and insufficient communication. These challenges highlight a gap between how organizations implement AI and the evolving worker needs, tasks, and workflows that it fails to support. We argue that successful adoption requires recognizing workers as central to AI integration and propose adaptation strategies at the individual, task, and organizational levels to better align AI systems with real-world practices.

Editor's pickTechnology
Bebeez· 2 days ago

France’s OpsMill raises €11.9 million to help enterprises prepare infrastructure data for AI and automation

OpsMill, a Paris-based infrastructure data management company, has raised €11.9 million ($14 million) in Series A funding to grow its engineering and product teams and continue developing data-centric AIOps solutions.  The round was led by IRIS with participation from BGV and existing investors Serena and Partech. The company aims to transform fragmented IT data into […]

Editor's pickTechnology
Tom's Hardware· 3 days ago

Microsoft says 'Transformation Paradox' holding back AI adoption in the workplace — 45% of respondents say it's safer to focus on current goals, rather than AI innovation | Tom's Hardware

“Employees are ready to reinvent how they work, but the system around them continues to reinforce the old way.”

Editor's pickFinancial Services
PYMNTS.com· 3 days ago

Anthropic Races OpenAI to Capture Banking's Services Core | PYMNTS.com

For financial institutions, the central question is shifting away from whether AI can improve productivity. The more consequential issue is the embedding of those systems inside regulated financial environments where cybersecurity failures, operational interruptions and compliance lapses carry ...

Editor's pickTechnology
IT Pro· 3 days ago

AI adoption is accelerating in the UK, but ‘trust is not keeping pace’ | IT Pro

Organizations need to do more to reassure customers over governance

Editor's pick
UC Today· 3 days ago

UCX Manchester: Why Enterprise AI Adoption Fails - UC Today

Akash Joshi explains why enterprise AI succeeds when companies build trust, unlock access, and invest in people, not just models.

Editor's pickTechnology
Artificial Intelligence Newsletter | May 7, 2026· 3 days ago

OpenAI reaches deal with Canadian regulators to limit private data in AI training

OpenAI has agreed to limit the use of personal and sensitive data in training new ChatGPT models to resolve concerns raised by the Privacy Commissioner of Canada and provincial counterparts.

Editor's pickEducation
Arxiv· 2 days ago

Guidelines for Designing AI Technologies to Support Adult Learning

arXiv:2605.04616v1 Announce Type: new Abstract: AI-powered educational technologies have demonstrated measurable benefits for learners, but their design and evaluation have largely centered on K-12 contexts. As a result, many AI-supported learning systems remain poorly aligned with the needs, constraints, and goals of adult learners. To better understand how AI systems function in adult education, this paper examines the deployment of several AI learning technologies developed within a multidisciplinary, national research institute in the United States focused on adult learning and online education. Drawing on longitudinal deployment data, we conducted a reflexive thematic analysis to identify recurring challenges and design considerations across systems. These insights were synthesized into a set of 19 design guidelines intended to inform future AI-supported adult learning technologies. We demonstrate the utility of these guidelines through a heuristic evaluation of the deployed systems. Lastly, we present a guideline exploration tool that aids in the ideation of technologies by connecting the guidelines to stakeholder statements surfaced in the analysis process.

Editor's pickTechnology
Morning Call· 3 days ago

The rapid embrace of AI in China, its biggest testing ground, may shape how AI is used globally

Of its 1.4 billion population, more than 600 million were using generative AI as of December.

AI Applications3 articles
Editor's pickHealthcare
Arxiv· 2 days ago

ADAPTS: Agentic Decomposition for Automated Protocol-agnostic Tracking of Symptoms

arXiv:2605.03212v2 Announce Type: new Abstract: Modeling latent clinical constructs from unconstrained clinical interactions is a unique challenge in affective computing. We present ADAPTS (Agentic Decomposition for Automated Protocol-agnostic Tracking of Symptoms), a framework for automated rating of depression and anxiety severity using a mixture-of-agents LLM architecture. This approach decomposes long-form clinical interviews into symptom-specific reasoning tasks, producing auditable justifications while preserving temporal and speaker alignment. Generalization was evaluated across two independent datasets (N = 204) with distinct interview structures. On high-discrepancy interviews, automated ratings approximated expert benchmarks (absolute error = 22) more closely than original human ratings (absolute error = 26). Implementing an "extended" protocol that incorporates qualitative clinical conventions significantly stabilized ratings, with absolute agreement reaching ICC(2,1) = 0.877. These findings suggest that the ADAPTS framework enables promising evaluations of psychiatric severity. While the current implementation is purely text-based, the underlying architecture is readily extensible to multimodal inputs, including acoustic and visual features. By approximating expert-level precision in a protocol-agnostic manner, this framework provides a foundation for objective and scalable psychiatric assessment, especially in resource-limited settings.

Editor's pickManufacturing & Industrials
Daily Brew· 2 days ago

Cognex Launches In-Sight 3900: High-Speed AI Vision System

Cognex has launched the In-Sight 3900, an AI-driven vision system designed to enhance factory floor operations without the need for an external PC.

AI Measurement & Evaluation2 articles
Editor's pickHealthcare
Arxiv· 2 days ago

Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology

arXiv:2605.04098v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) have demonstrated promise on publicly available dermatology benchmarks. However, benchmark performance may not generalize to real-world dermatologic decision-making. To quantify this benchmark-to-bedside gap, we evaluated four open-weight MLLMs (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4 and MedGemma-4B-Instruct) and one commercial MLLM (GPT-4.1) across three publicly available dermatology datasets and a retrospective multi-site hospital-based dermatology consultation cohort comprising 5,811 cases and 46,405 clinical images. Models were evaluated on two clinically relevant tasks: differential diagnosis generation and severity-based triage. Diagnostic performance was modest on public datasets and declined substantially in the real-world cohort. On public benchmarks, top-3 diagnostic accuracy reached 26.55% for the best open-weight model and 42.25% for GPT-4.1. On real-world consultation cases using images alone, top-3 diagnostic accuracy fell to 1.50%-13.35% among open-weight models and 24.65% for GPT-4.1. Incorporating clinical context improved performance across all models, increasing top-3 diagnostic accuracy up to 28.75% among open-weight models and 38.93% for GPT-4.1. However, model outputs were highly sensitive to incomplete or erroneous consultation context. For severity-based triage, models achieved moderate sensitivity (above 60%), suggesting potential utility for screening but insufficient reliability for clinical deployment. These findings demonstrate that benchmark performance substantially overestimates the real-world clinical capability of current dermatology MLLMs.

AI Organisational Change6 articles
AI Productivity Evidence2 articles
Editor's pickEducation
Arxiv· 2 days ago

A Dialogue-Based Framework for Correcting Multimodal Errors in AI-Assisted STEM Education

arXiv:2605.04131v1 Announce Type: cross Abstract: Large Language Models (LLMs) are democratizing access to personalized tutoring; however, their effectiveness is hindered by challenges in processing multimodal content, which limits AI's potential to provide equitable, high-quality STEM support. This study evaluates LLM performance on multimodal physics problems, identifies specific failure modes through an empirical error taxonomy, and tests practical interventions designed to overcome multimodal processing limitations. We assessed three publicly available LLMs (Claude, Gemini, and ChatGPT) on multimodal physics problems from the OpenStax database and compared the results with text-only performance. An empirically derived error taxonomy was developed through pilot testing, followed by evaluation of a structured multimodal dialogue intervention. All three models achieved near-ceiling accuracy (96%) on text-only physics problems. Performance declined substantially on multimodal problems, consistent with what we term the Multimodal Interference Effect. Error analysis identified four failure modes: visual processing errors, context misinterpretation, mathematical computational errors, and hybrid errors, with visual processing errors being the most prevalent. The structured dialogue intervention corrected 82% of errors overall; visual processing errors were corrected at 100% across all models. Educators and students can implement these interventions immediately, requiring no model retraining, to improve AI tutoring reliability on image-rich STEM content, advancing equitable access to high-quality learning support.

Geopolitics, Policy & Governance

22 articles
AI Policy & Regulation18 articles
Editor's pickPAYWALLGovernment & Public Sector
Bloomberg· 2 days ago

Top Trump Aide Says Administration Won’t Pick Winners in AI Race

White House Chief of Staff Susie Wiles said the US government would refrain from choosing winners and losers in artificial intelligence, the latest signal from a top aide to President Donald Trump as his administration prepares new AI policy directives.

Editor's pickFinancial Services
Arxiv· 2 days ago

A Regulatory Governance Framework for AI-Driven Financial Fraud Detection in U.S. Banking: Integrating OCC, SR 11-7, CFPB, and FinCEN Compliance Requirements for Model Development, Validation, and Monitoring Lifecycles

arXiv:2605.04076v1 Announce Type: cross Abstract: U.S. financial institutions deploying AI-based fraud detection face a fragmented compliance landscape spanning four regulatory frameworks -- OCC Bulletin 2011-12, SR 11-7, the CFPB AI circular, and FinCEN BSA/SAR requirements -- with no integrated governance life cycle connecting these requirements to model development, validation, and monitoring

Editor's pick
Arxiv· 2 days ago

Decision Evidence Maturity Model for Agentic AI: A Property-Level Method Specification

arXiv:2605.04093v1 Announce Type: new Abstract: Agentic AI systems produce decision evidence at scale through execution telemetry, but property-level reconstruction often fails when an external party asks a specific governance question about a specific decision: the assembled evidence is insufficient to answer it. We name this pattern the container fallacy: the automatic equation of evidence-container presence with audit sufficiency. This paper specifies the Decision Evidence Maturity Model (DEMM), a property-level reconstructability method for agentic decisions. DEMM classifies evidence sufficiency into four executable categories plus a protocol-level "conflicting" category and aggregates per-property verdicts into a five-level capability rubric anchored to the established maturity-model lineage. The open-source Decision Trace Reconstructor ships ten executable adapter-fallback classes spanning vendor SDKs, protocol traces, public-postmortem prose, and generic JSONL records. A reproducible feasibility exercise runs the protocol on 140 synthetic scenarios plus three public incidents; the resulting completeness range (53.6% to 100%) is implementation behaviour, not external validation.

Editor's pickPAYWALLGovernment & Public Sector
Washington Post· 3 days ago

AI & Tech Brief: Trump admin to test frontier models - The Washington Post

The Commerce Department announced Tuesday that it will be conducting pre-deployment testing of AI models from Google, Microsoft and xAI.

Editor's pick
CAclubindia· 3 days ago

What's Driving the Global AI Regulation and Tax Debate Right Now?

Stay ahead of evolving AI regulations and taxation challenges. Learn how global compliance, cross-border rules, and emerging tax frameworks impact AI-driven businesses and revenue.

Editor's pickGovernment & Public Sector
Federal News Network· 3 days ago

WH ‘studying’ AI security executive order | Federal News Network

An EO requiring pre-deployment review of frontier AI models would likely increase the workload at NIST's Center for AI Standards and Innovation.

Editor's pickGovernment & Public Sector
The Verge· 3 days ago

How David Sacks crashed and burned in the White House | The Verge

The Trump administration pulled a 180 on AI oversight, inducing Sacks’ worst nightmare: more government regulation on technology.

Editor's pickMedia & Entertainment
Daily Brew· 3 days ago

Meta Hit With Massive Lawsuit; Publishers Say AI Training Infringes Copyright

A group of publishers has filed a major lawsuit against Meta, alleging unauthorized use of copyrighted material for AI training.

Editor's pickDefense & National Security
Artificial Intelligence Newsletter | May 7, 2026· 3 days ago

US relationship with frontier AI developers takes the spotlight

As AI models become more capable, the Trump administration is using federal contracts and voluntary pre-deployment assessments to exert influence over frontier developers.

Editor's pickFinancial Services
News18· 3 days ago

AI Model Worrying India’s Banks: Why FM Sitharaman Held A High-Level Meeting Over Claude Mythos AI | Banking and Finance News - News18

Following Sitharaman’s review ... cybersecurity framework aimed at protecting banks and financial institutions from AI-driven threats. ... The Reserve Bank of India is also believed to be reviewing preparedness measures with financial institutions as AI-led cyber risks move higher ...

Editor's pickTechnology
Daily Brew· 3 days ago

Fiddler Ashley MacIsaac Sues Google Over Defamatory AI Summary, Potential Landmark Case in AI Liability

Ashley MacIsaac has sued Google in Canada, claiming their AI falsely labeled him as a sex offender, damaging his career.

Editor's pickTechnology
PYMNTS· 3 days ago

IBM CEO Calls for AI Regulation That Protects Innovation | PYMNTS.com

IBM Chairman and CEO Arvind Krishna warned that federal regulators need to find what he called the "Goldilocks" middle ground on AI oversight.

Editor's pickFinancial Services
The Financial Express· 3 days ago

Review cyber risks in 2 months: RBI to banks - Business News | The Financial Express

Reserve Bank of India asks banks to review cybersecurity readiness within two months amid rising AI-driven threats and system vulnerabilities.

Editor's pickGovernment & Public Sector
The Straits Times· 2 days ago

AI disinformation tests South Korean laws ahead of local elections | The Straits Times

The government has hired hundreds of staff to track and counter manipulated content ahead of local ballots.

Editor's pickPAYWALL
NYT· 3 days ago

Elon Musk’s Confidante Shivon Zilis Is Cast as His Inside Source at OpenAI

Shivon Zilis worked closely with Elon Musk while she was on OpenAI’s board. Her ties to the world’s richest man were detailed in a landmark trial on Wednesday.

Editor's pickTelecommunications
Asian Business Review· 3 days ago

Companies must establish comprehensive, responsible AI governance frameworks – PwC’s Wilson Chow | Asian Business Review

He suggests that the deployment of the technology in the TMT sector represents both the most significant growth opportunity and the most consequential risk landscape.

Editor's pickGovernment & Public Sector
Artificial Intelligence Newsletter | May 7, 2026· 2 days ago

Australians want safe AI they can trust, minister says

Assistant Minister Andrew Charlton stated that Australia is avoiding a single AI regulator, instead relying on the AI Safety Institute to identify risks for existing agencies to manage.

Editor's pickTechnology
Daily Brew· 2 days ago

Google Chrome 'silently' downloads 4GB AI model to your device without permission

A report claims Google Chrome is downloading large AI models without user consent, potentially violating EU law.

Best Practice AI © 2026 Best Practice AI Ltd. All rights reserved.

Get the full executive brief

Receive curated insights with practical implications for strategy, operations, and governance.

AI Daily Brief — leaders actually read it.

Free email — not hiring or booking. Optional BPAI updates for company news. Unsubscribe anytime.


No spam. Unsubscribe anytime. Privacy policy.