AI Intelligence Brief

Thu 23 April 2026

Daily Brief — Curated and contextualised by Best Practice AI

122 Articles

TSMC Holds Back, Tesla Spends Big, and AI Divides the Workforce

TL;DR Taiwan Semiconductor Manufacturing Co. will hold off on adopting ASML's costly high-NA lithography machines through 2029. Tesla plans to increase its AI-related spending to $25 billion, focusing on self-driving technology. A survey reveals high earners are rapidly adopting AI at work, exacerbating existing pay and gender gaps. Nvidia supplier SK Hynix reports record earnings as customers prioritize procurement amid a supply crunch.

Editor's highlights

The stories that matter most

Selected and contextualised by the Best Practice AI team

10 of 122 articles
Lead story
Editor's pickPAYWALLManufacturing & Industrials
Bloomberg· Today

TSMC Says ASML’s Latest Chipmaking Gear Is Too Pricey

Taiwan Semiconductor Manufacturing Co. will hold off on deploying ASML Holding NV’s most cutting-edge lithography machines for chip production through 2029 to save money. The chipmaker has no plans to adopt ASML’s latest high numerical aperture extreme ultraviolet lithography machines, or high-NA EUV, which fetch upwards of €350 million ($410 million) apiece. TSMC is ASML’s largest customer, according to Bloomberg’s supply chain data. Bloomberg’s Neil Campling reports.

Editor's pickTechnology
Arxiv· Today

Behavioral Transfer in AI Agents: Evidence and Privacy Implications

arXiv:2604.19925v1 Announce Type: new Abstract: AI agents powered by large language models are increasingly acting on behalf of humans in social and economic environments. Prior research has focused on their task performance and effects on human outcomes, but less is known about the relationship between agents and the specific individuals who deploy them. We ask whether agents systematically reflect the behavioral characteristics of their human owners, functioning as behavioral extensions rather than producing generic outputs. We study this question using 10,659 matched human-agent pairs from Moltbook, a social media platform where each autonomous agent is publicly linked to its owner's Twitter/X account. By comparing agents' posts on Moltbook with their owners' Twitter/X activity across features spanning topics, values, affect, and linguistic style, we find systematic transfer between agents and their specific owners. This transfer persists among agents without explicit configuration, and pairs that align on one behavioral dimension tend to align on others. These patterns are consistent with transfer emerging through accumulated interaction between owners (or owners' computer environments) and their agents in everyday use. We further show that agents with stronger behavioral transfer are more likely to disclose owner-related personal information in public discourse, suggesting that the same owner-specific context that drives behavioral transfer may also create privacy risk during ordinary use. Taken together, our results indicate that AI agents do not simply generate content, but reflect owner-related context in ways that can propagate human behavioral heterogeneity into digital environments, with implications for privacy, platform design, and the governance of agentic systems.

Editor's pickFinancial Services
Arxiv· Today

Large Language Models Outperform Humans in Fraud Detection and Resistance to Motivated Investor Pressure

arXiv:2604.20652v1 Announce Type: cross Abstract: Large language models trained on human feedback may suppress fraud warnings when investors arrive already persuaded of a fraudulent opportunity. We tested this in a preregistered experiment across seven leading LLMs and twelve investment scenarios covering legitimate, high-risk, and objectively fraudulent opportunities, combining 3,360 AI advisory conversations with a 1,201-participant human benchmark. Contrary to predictions, motivated investor framing did not suppress AI fraud warnings; if anything, it marginally increased them. Endorsement reversal occurred in fewer than 3 in 1,000 observations. Human advisors endorsed fraudulent investments at baseline rates of 13-14%, versus 0% across all LLMs, and suppressed warnings under pressure at two to four times the AI rate. AI systems currently provide more consistent fraud warnings than lay humans in an identical advisory role.

Editor's pickDefense & National Security
Arxiv· Today

Model Capability Assessment and Safeguards for Biological Weaponization

arXiv:2604.19811v1 Announce Type: new Abstract: AI leaders and safety reports increasingly warn that advances in model reasoning may enable biological misuse, including by low-expertise users, while major labs describe safeguards as expanding but still evolving rather than settled. This study benchmarks ChatGPT 5.2 Auto, Gemini 3 Pro Thinking, Claude Opus 4.5 and Meta's Muse Spark Thinking on 73 novice-framed, open-ended benign STEM prompts to measure operational intelligence. On benign quantitative tasks, both Gemini and meta scored very high; ChatGPT was partially useful but text-thinned, and Claude was sparsest with some apparent false-positive refusals. A second test set detected subtle harmful intent: edge case prompts revealed Gemini's seeming lack of contextual awareness. These results warranted a focused weaponization analysis on Gemini as capability appeared to be outpacing moderation calibration. Gemini was tested across four access environments and reported cases include poison-ivy-to-crowded-transit escalation, poison production and extraction via international-anonymous logged-out AI Mode, and other concerning examples. Biological misuse may become more prevalent as a geopolitical tool, increasing the urgency of U.S. policy responses, especially if model outputs come to be treated as regulated technical data. Guidance is provided for 25 high-risk agents to help distinguish legitimate use cases from higher-risk ones.

Editor's pickFinancial Services
Arxiv· Today

From Clerks to Agentic-AI: How will Technology Change Labor Market in Finance?

arXiv:2604.19833v1 Announce Type: cross Abstract: Financial firms have gone through three major technological waves: computerization in the 1980s and 1990s, the rise of indexing and passive investing in the 2000s and 2010s, and the AI and automation wave from roughly 2015 to the present. This project studies how much labor is required to manage capital across those waves by tracking a simple productivity measure: assets under management per employee. Using a small panel of representative firms, we compare changes in AUM per employee, revenue per employee, and operating expense intensity over time. The goal is not to identify causal effects, but to document stylized facts about how technology changes the scale of asset management work.

Editor's pickManufacturing & Industrials
Arxiv· Today

A Multi-Plant Machine Learning Framework for Emission Prediction, Forecasting, and Control in Cement Manufacturing

arXiv:2604.19903v1 Announce Type: cross Abstract: Cement production is among the largest contributors to industrial air pollution, emitting ~3 Mt NOx/year. The industry-standard mitigation approach, selective non-catalytic reduction (SNCR), exhibits low NH3 utilization efficiency, resulting in operational inefficiencies and increased reagent costs. Here, we develop a data-driven framework for emission control using large-scale operational data from four cement plants worldwide. Benchmarking nine machine learning architectures, we observe that prediction error varies ~3-5x across plants due to variation in data richness. Incorporating short-term process history nearly triples NOx prediction accuracy, revealing that NOx formation carries substantial process memory, a timescale dependence that is absent in CO and CO2. Further, we develop models that forecast NOx overshoots as early as nine minutes, providing a buffer for operational adjustments. The developed framework controls NOx formation at the source, reducing NH3 consumption in downstream SNCR. Surrogate model projections estimate a ~34-64% reduction in NOx while preserving clinker quality, corresponding to a reduction of ~290 t NOx/year and ~58,000 USD/year in NH3 savings. This work establishes a generalizable framework for data-driven emission control, offering a pathway toward low-emission operation without structural modifications or additional hardware, with potential applicability to other hard-to-abate industries such as steel, glass, and lime.
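The reported accuracy gain from "short-term process history" comes down to adding lagged process variables as model features. A minimal sketch of that idea, assuming a generic minute-level plant dataset; the file name and column names (kiln_temp, feed_rate, nox) are hypothetical placeholders, not the paper's actual variables or models:

```python
# Minimal sketch: lagged process-history features for NOx prediction.
# Column and file names are illustrative placeholders only.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

def add_lags(df, cols, lags=(1, 2, 3, 5, 10)):
    """Append lagged copies of each process variable (minutes of history)."""
    out = df.copy()
    for col in cols:
        for lag in lags:
            out[f"{col}_lag{lag}"] = out[col].shift(lag)
    return out.dropna()

df = pd.read_csv("plant_minutely.csv", parse_dates=["timestamp"], index_col="timestamp")
feat = add_lags(df, ["kiln_temp", "feed_rate", "nox"])

X, y = feat.drop(columns=["nox"]), feat["nox"]
split = int(len(feat) * 0.8)          # time-ordered split, no shuffling
model = GradientBoostingRegressor().fit(X[:split], y[:split])
print("MAE:", mean_absolute_error(y[split:], model.predict(X[split:])))
```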

Editor's pickPAYWALLGovernment & Public Sector
FT· Today

Top Republican pushes party to shun $300mn AI lobby

Senator Josh Hawley warns of ‘political cost’ if Washington fails to rein in Big Tech and artificial intelligence

Editor's pickPAYWALL
FT· Today

The AI digital divide

An FT survey shows the highest-earning workers are adopting the technology in their jobs far faster than others

Editor's pickPAYWALLTechnology
FT· Today

Nvidia supplier SK Hynix hails ‘structural shift’ after another record quarter

Second-largest memory chipmaker says customers prioritising procurement over pricing amid supply crunch

Editor's pickPAYWALLManufacturing & Industrials
FT· Yesterday

Tesla boosts spending plans to $25bn as Musk doubles down on AI bet

CEO warns investors to expect ‘very significant’ spending increase on self-driving taxis, trucks, robots and chip factories

Economics & Markets

45 articles
AI Business Models5 articles
AI Investment & Valuations11 articles
Editor's pickPAYWALLTechnology
WSJ· Today

Shares of Apple Supplier STMicroelectronics Jump After Strong Quarter

It posted strong first-quarter sales and said revenue growth from artificial intelligence should accelerate in coming months.

Editor's pickTechnology
Guardian· Yesterday

Tesla reports mixed financial results as Musk pivots automaker to AI and robots

Figures fail to significantly buoy stock as firm admits ‘significant effort and hard work’ needed to achieve goals

Tesla reported its first-quarter earnings on Wednesday, disclosing some better-than-expected results but faltering in some key areas. The report failed to significantly buoy Tesla’s stock, which has limped along this year while its CEO, Elon Musk, has tried to sell the company’s new vision of humanoid robots and self-driving robotaxis. Its core car business has struggled in the face of competition from Chinese counterparts and backlash against his close involvement with the Trump administration. “There remains significant effort and hard work to realize our mission of Amazing Abundance,” Tesla said in its report, while claiming that demand for its vehicles was rebounding.

Editor's pickPAYWALLTechnology
Bloomberg· Today

Sinking TSMC ADR Premium Offers Trading Window, UBS Desk Says

The narrowing gap between Taiwan Semiconductor Manufacturing Co.’s Taiwanese shares and its US-listed stock is creating a new trading opportunity, according to a UBS Group AG client note.

Editor's pickTechnology
MarketScreener· Yesterday

Google Unveils Two New AI Chips, Will Invest $750 Million in Agentic AI Adoption | MarketScreener

By Adriano Marchese. Google unveiled its latest custom chips and set up a new $750 million agentic AI partner fund to accelerate the adoption of agentic artificial intelligence. The TPU 8t and...

Editor's pickTechnology
Cyprus Mail· Today

Generative AI leads surge in regional tech spending | Cyprus Mail

Asia-Pacific AI spending set to reach $370bn by 2029

International Data Corporation (IDC) projected that artificial intelligence and generative AI spending in Asia-Pacific will rise from $73 billion in 2024 to $370 billion by 2029, marking a fivefold increase driven by rapid ...

Editor's pickProfessional Services
Livemint· Yesterday

Infosys Q4 results preview: Profit may dip QoQ; all eyes on guidance, deal wins | Stock Market News

Infosys will announce its Q4FY26 results on April 23, with investors focused on earnings and growth outlook amid geopolitical risks and generative AI impacts. Profit is estimated at ₹7,508.6 crore, a 4% YoY increase, while revenue may grow 13.7% YoY to ₹46,567 crore.

Editor's pickTechnology
The Motley Fool· Yesterday

Prediction: The Nasdaq's AI Stocks Will Outperform the S&P 500 Over the Next 12 Months. Here's What to Buy. | The Motley Fool

The massive investments in AI infrastructure and the growing adoption of AI software solutions, driven by the productivity gains the technology promises, are poised to drive stronger earnings growth for tech companies. A February report from the Nasdaq Index research team noted that the average net income growth of Nasdaq-100 companies in 2025 was well above that of S&P 500 companies. The report further suggests that this trend is poised to continue in 2026...

BPAI context

Nasdaq's AI-focused stocks are expected to outperform the S&P 500 over the next year due to significant investments in AI infrastructure and software adoption. The Nasdaq-100 companies have shown stronger earnings growth compared to the S&P 500, and this trend is likely to continue, driven by companies like CoreWeave and Microsoft benefiting from increased AI demand.

Editor's pickPAYWALLManufacturing & Industrials
FT· Yesterday

Tesla boosts spending plans to $25bn as Musk doubles down on AI bet

CEO warns investors to expect ‘very significant’ spending increase on self-driving taxis, trucks, robots and chip factories

Editor's pickTechnology
Fortune· Yesterday

Billionaire Michael Dell started his company in his University of Texas dorm room. Now, he’s betting on AI with a $750 million gift

This gift pushes the Dells’ UT Austin donations to over $1 billion.

Editor's pickFinancial Services
ICO Optics· Yesterday

AI and Semiconductor Stocks Drive Bifurcated Market Gains – ICO Optics

Kamil Dimmich, partner and portfolio manager at North of South Capital, thinks emerging markets are splitting apart in performance. He […]

Editor's pickFinancial Services
Insider Monkey· Yesterday

12 AI Stocks in Focus on Wall Street: Tesla, Meta, and More - Insider Monkey

Kamil Dimmich, Partner & Portfolio Manager at North of South Capital, recently spoke on CNBC and talked about the divergence in emerging markets’ performance amid geopolitical tensions tied to the Iran...

AI Macroeconomics6 articles
Editor's pickPAYWALL
FT· Yesterday

AI should not drive today’s interest rate decisions

How the technology will affect prices is still uncertain

Editor's pickConsumer & Retail
Arxiv· Today

Diagnosing Urban Street Vitality via a Visual-Semantic and Spatiotemporal Framework for Street-Level Economics

arXiv:2604.19798v1 Announce Type: new Abstract: Micro-scale street-level economic assessment is fundamental for precision spatial resource allocation. While Street View Imagery (SVI) advances urban sensing, existing approaches remain semantically superficial and overlook brand hierarchy heterogeneity and structural recession. To address this, we propose a visual-semantic and field-based spatiotemporal framework, operationalized via the Street Economic Vitality Index (SEVI). Our approach integrates physical and semantic streetscape parsing through instance segmentation of signboards, glass interfaces, and storefront closures. A dual-stage VLM-LLM pipeline standardizes signage into global hierarchies to quantify a spatially smoothed brand premium index. To overcome static SVI limitations, we introduce a temporal lag design using Location-Based Services (LBS) data to capture realized demand. Combined with a category-weighted Gaussian spillover model, we construct a three-dimensional diagnostic system covering Commercial Activity, Spatial Utilization, and Physical Environment. Experiments based on time-lagged geographically weighted regression across eight tidal periods in Nanjing reveal quasi-causal spatiotemporal heterogeneity. Street vibrancy arises from interactions between hierarchical brand clustering and mall-induced externalities. High-quality interfaces show peak attraction during midday and evening, while structural recession produces a lagged nighttime repulsion effect. The framework offers evidence-based support for precision spatial governance.

Editor's pickPAYWALLTechnology
WSJ· Yesterday

South Korea’s Economy Rebounds Amid Middle East War Risks

South Korea’s economy rebounded at a stronger-than-expected pace in the first quarter on robust semiconductor exports, signaling that the country’s artificial intelligence-driven buffer remains intact.

Editor's pickPAYWALLFinancial Services
Bloomberg· Today

Goldman Sachs Says Prolonged War 'Will Hit Europe'

Sharon Bell, senior European equity strategist at Goldman Sachs, discusses the corporate earnings season, the artificial intelligence buildout and the potential impact of a drawn-out Middle East war on Europe's economies. She speaks on Bloomberg Television. (Source: Bloomberg)

Editor's pickFinancial Services
Daily Brew· Today

AI failure could trigger the next financial crisis, warns Elizabeth Warren

Senator Elizabeth Warren has expressed concerns that systemic failures in AI could pose a significant threat to the stability of the financial sector.

Editor's pick
Arxiv· Today

Routine Work, Firm Boundaries, and the Rise of Local Supplier Entry

arXiv:2604.19987v1 Announce Type: new Abstract: Between 2005 and 2019, U.S. business applications rose 40 percent while conversion to employer firms fell by nearly half. We study whether boundary redrawing helps explain this pattern. Structured routine-cognitive work can be governed through deliverables and thinner buyer and supplier interfaces. When such work remains place-bound, outsourcing creates demand for domestic specialist suppliers. Across 722 commuting zones, a one percentage-point higher baseline routine employment share raises applications by 27.8 per 100,000 residents. Realized entry concentrates in micro-establishments, with no startup quality gains. Contract and industry evidence point to local supplier entry, not routine-manual displacement.

AI Market Competition7 articles
Editor's pickMedia & Entertainment
Daily Brew· Today

AI Revolution in Gaming Favors Giants

The gaming landscape is evolving with AI, benefiting giants like Tencent, Sony, and Ubisoft, while smaller players may struggle as entry barriers diminish.

Editor's pick
Arxiv· Today

Soft-Label Governance for Distributional Safety in Multi-Agent Systems

arXiv:2604.19752v1 Announce Type: cross Abstract: Multi-agent AI systems exhibit emergent risks that no single agent produces in isolation. Existing safety frameworks rely on binary classifications of agent behavior, discarding the uncertainty inherent in proxy-based evaluation. We introduce SWARM (System-Wide Assessment of Risk in Multi-agent systems), a simulation framework that replaces binary good/bad labels with soft probabilistic labels p = P(v=+1) ∈ [0,1], enabling continuous-valued payoff computation, toxicity measurement, and governance intervention. SWARM implements a modular governance engine with configurable levers (transaction taxes, circuit breakers, reputation decay, and random audits) and quantifies their effects through probabilistic metrics including expected toxicity E[1−p | accepted] and quality gap E[p | accepted] − E[p | rejected]. Across seven scenarios with five-seed replication, strict governance reduces welfare by over 40% without improving safety. In parallel, aggressively internalizing system externalities collapses total welfare from a baseline of +262 down to −67, while toxicity remains invariant. Circuit breakers require careful calibration; overly restrictive thresholds severely diminish system value, whereas an optimal threshold balances moderate welfare with minimized toxicity. Companion experiments show soft metrics detect proxy gaming by self-optimizing agents passing conventional binary evaluations. This basic governance layer applies to live LLM-backed agents (Concordia entities, Claude, GPT-4o Mini) without modification. Results show distributional safety requires continuous risk metrics and governance lever calibration involves quantifiable safety-welfare tradeoffs. Source code and project resources are publicly available at https://www.swarm-ai.org/.
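The two headline metrics above are plain expectations over the soft labels of accepted versus rejected actions. A minimal illustrative sketch of how such metrics can be computed on synthetic data; this is not SWARM's implementation, which lives at the linked repository:

```python
# Illustrative computation of SWARM-style soft-label metrics.
# p is the probability an agent action is benign, P(v=+1);
# `accepted` marks actions the governance layer let through.
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0, 1, size=1000)                    # soft labels, not binary good/bad
accepted = p + rng.normal(0, 0.2, 1000) > 0.5       # noisy governance decision

expected_toxicity = np.mean(1 - p[accepted])                   # E[1-p | accepted]
quality_gap = np.mean(p[accepted]) - np.mean(p[~accepted])     # E[p|accepted] - E[p|rejected]

print(f"expected toxicity: {expected_toxicity:.3f}")
print(f"quality gap:       {quality_gap:.3f}")
```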

Editor's pickPAYWALLTechnology
FT· Yesterday

Apple controls the tech sector’s Strait of Hormuz

It may have stumbled in the AI race but the company’s new CEO will find it still has distinct advantages

Editor's pickFinancial Services
Simply Wall St· Yesterday

ServiceNow Targets Security And Revenue Workflows With Armis And Xactly AI - Simply Wall St News

ServiceNow (NYSE:NOW) has completed its acquisition of Armis, extending its security coverage into physical, operational, and cyber-asset environments. The company has also launched a Dispute Management AI Agent in partnership with Xactly, targeting cross platform revenue workflows.

AI Pricing & Cost Curves4 articles
Editor's pickPAYWALLManufacturing & Industrials
Bloomberg· Today

TSMC Says ASML’s Latest Chipmaking Gear Is Too Pricey

Taiwan Semiconductor Manufacturing Co. will hold off on deploying ASML Holding NV’s most cutting-edge lithography machines for chip production through 2029 to save money. The chipmaker has no plans to adopt ASML’s latest high numerical aperture extreme ultraviolet lithography machines, or high-NA EUV, which fetch upwards of €350 million ($410 million) apiece. TSMC is ASML’s largest customer, according to Bloomberg’s supply chain data. Bloomberg’s Neil Campling reports.

Editor's pickTechnology
VentureBeat· Yesterday

Are you paying an AI ‘swarm tax’? Why single agents often beat complex systems

Enterprise teams building multi-agent AI systems may be paying a compute premium for gains that don't hold up under equal-budget conditions. New Stanford University research finds that single-agent systems match or outperform multi-agent architectures on complex reasoning tasks when both are given the same thinking token budget. However, multi-agent systems come with the added baggage of computational overhead. Because they typically use longer reasoning traces and multiple interactions, it is often unclear whether their reported gains stem from architectural advantages or simply from consuming more resources.

To isolate the true driver of performance, researchers at Stanford University compared single-agent systems against multi-agent architectures on complex multi-hop reasoning tasks under equal "thinking token" budgets. Their experiments show that in most cases, single-agent systems match or outperform multi-agent systems when compute is equal. Multi-agent systems gain a competitive edge when a single agent's context becomes too long or corrupted. In practice, this means that a single-agent model with an adequate thinking budget can deliver more efficient, reliable, and cost-effective multi-hop reasoning. Engineering teams should reserve multi-agent systems for scenarios where single agents hit a performance ceiling.

Understanding the single versus multi-agent divide

Multi-agent frameworks, such as planner agents, role-playing systems, or debate swarms, break down a problem by having multiple models operate on partial contexts. These components communicate with each other by passing their answers around. While multi-agent solutions show strong empirical performance, comparing them to single-agent baselines is often an imprecise measurement. Comparisons are heavily confounded by differences in test-time computation. Multi-agent setups require multiple agent interactions and generate longer reasoning traces, meaning they consume significantly more tokens. Consequently, when a multi-agent system reports higher accuracy, it is difficult to determine if the gains stem from better architecture design or from spending extra compute.

Recent studies show that when the compute budget is fixed, elaborate multi-agent strategies frequently underperform compared to strong single-agent baselines. However, these are mostly very broad comparisons that don't account for nuances such as different multi-agent architectures or the difference between prompt and reasoning tokens. "A central point of our paper is that many comparisons between single-agent systems (SAS) and multi-agent systems (MAS) are not apples-to-apples," paper authors Dat Tran and Douwe Kiela told VentureBeat. "MAS often get more effective test-time computation through extra calls, longer traces, or more coordination steps."

Revisiting the multi-agent challenge under strict budgets

To create a fair comparison, the Stanford researchers set a strict "thinking token" budget. This metric controls the total number of tokens used exclusively for intermediate reasoning, excluding the initial prompt and the final output. The study evaluated single- and multi-agent systems on multi-hop reasoning tasks, meaning questions that require connecting multiple pieces of disparate information to reach an answer. During their experiments, the researchers noticed that single-agent setups sometimes stop their internal reasoning prematurely, leaving available compute budget unspent.
To counter this, they introduced a technique called SAS-L (single-agent system with longer thinking). Rather than jumping to multi-agent orchestration when a model gives up early, the researchers suggest a simple prompt-and-budgeting change. "The engineering idea is simple," Tran and Kiela said. "First, restructure the single-agent prompt so the model is explicitly encouraged to spend its available reasoning budget on pre-answer analysis." By instructing the model to explicitly identify ambiguities, list candidate interpretations, and test alternatives before committing to a final answer, developers can recover the benefits of collaboration inside a single-agent setup.

The results of their experiments confirm that a single agent is the strongest default architecture for multi-hop reasoning tasks. It produces the highest accuracy answers while consuming fewer reasoning tokens. When paired with specific models like Google's Gemini 2.5, the longer-thinking variant produces even better aggregate performance.

The researchers rely on a concept called "Data Processing Inequality" to explain why a single agent outperforms a swarm. Multi-agent frameworks introduce inherent communication bottlenecks. Every time information is summarized and handed off between different agents, there is a risk of data loss. In contrast, a single agent reasoning within one continuous context avoids this fragmentation. It retains access to the richest available representation of the task and is thus more information-efficient under a fixed budget.

The authors also note that enterprises often overlook the secondary costs of multi-agent systems. "What enterprises often underestimate is that orchestration is not free," they said. "Every additional agent introduces communication overhead, more intermediate text, more opportunities for lossy summarization, and more places for errors to compound."

On the other hand, they discovered that multi-agent orchestration is superior when a single agent's environment gets messy. If an enterprise application must handle highly degraded contexts, such as noisy data, long inputs filled with distractors, or corrupted information, a single agent struggles. In these scenarios, the structured filtering, decomposition, and verification of a multi-agent system can recover relevant information more reliably.

The study also warns about hidden evaluation traps that falsely inflate multi-agent performance. Relying purely on API-reported token counts heavily distorts how much computation an architecture is actually spending. The researchers found these accounting artifacts when testing models like Gemini 2.5, proving this is an active issue for enterprise applications today. "For API models, the situation is trickier because budget accounting can be opaque," the authors said. To evaluate architectures reliably, they advise developers to "log everything, measure the visible reasoning traces where available, use provider-reported reasoning-token counts when exposed, and treat those numbers cautiously."

What it means for developers

If a single-agent system matches the performance of multiple agents under equal reasoning budgets, it wins on total cost of ownership by offering fewer model calls, lower latency, and simpler debugging. Tran and Kiela warn that without this baseline, "some enterprises may be paying a large 'swarm tax' for architectures whose apparent advantage is really coming from spending more computation rather than reasoning more effectively."
Another way to look at the decision boundary is not how complex the overall task is, but rather where the exact bottleneck lies. "If it is mainly reasoning depth, SAS is often enough. If it is context fragmentation or degradation, MAS becomes more defensible," Tran said. Engineering teams should stay with a single agent when a task can be handled within one coherent context window. Multi-agent systems become necessary when an application handles highly degraded contexts.  Looking ahead, multi-agent frameworks will not disappear, but their role will evolve as frontier models improve their internal reasoning capabilities. "The main takeaway from our paper is that multi-agent structure should be treated as a targeted engineering choice for specific bottlenecks, not as a default assumption that more agents automatically means better intelligence," Tran said.
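The study's central control is bookkeeping: charge every intermediate reasoning token to the architecture that produced it, then compare accuracy at a fixed budget. A rough sketch of that accounting, with run_agent and the dataset as hypothetical placeholders for whichever single- or multi-agent system is under test (not the paper's code):

```python
# Sketch of equal-budget evaluation: count only intermediate reasoning
# tokens (no prompt, no final output) and compare accuracy at a fixed budget.
# `run_agent` and the dataset are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Trace:
    answer: str
    reasoning_tokens: int   # intermediate reasoning tokens only

def evaluate(run_agent, dataset, budget_per_question):
    correct = spent = 0
    for question, gold in dataset:
        trace: Trace = run_agent(question, max_reasoning_tokens=budget_per_question)
        spent += trace.reasoning_tokens
        if trace.reasoning_tokens <= budget_per_question and trace.answer == gold:
            correct += 1
    return correct / len(dataset), spent

# Usage idea: compare a single-agent and a multi-agent system at the same budget.
# acc_sas, _ = evaluate(single_agent, dev_set, budget_per_question=4096)
# acc_mas, _ = evaluate(multi_agent_swarm, dev_set, budget_per_question=4096)
```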

Editor's pickEducation
Arxiv· Today

Do Small Language Models Know When They're Wrong? Confidence-Based Cascade Scoring for Educational Assessment

arXiv:2604.19781v1 Announce Type: new Abstract: Automated scoring of student work at scale requires balancing accuracy against cost and latency. In "cascade" systems, small language models (LMs) handle easier scoring tasks while escalating harder ones to larger LMs -- but the challenge is determining which cases to escalate. We explore verbalized confidence -- asking the LM to state a numerical confidence alongside its prediction -- as a routing signal. Using 2,100 expert-scored decisions from student-AI math conversations, we evaluate cascade systems built from GPT-5.4, Claude 4.5+, and Gemini 3.1 model pairs. We find that: (1) confidence discrimination varies widely across small LMs, with the best achieving AUROC 0.857 and the worst producing a near-degenerate confidence distribution; (2) confidence tracks human scoring difficulty, with lower LM confidence where annotators disagreed and took longer to score; (3) the best cascade approached large-LM accuracy (kappa 0.802 vs. 0.819) at 76% lower cost and 61% lower latency. Confidence discrimination is the bottleneck: the two small LMs with meaningful confidence variance yielded cascades with no statistically detectable kappa loss, while the third -- whose confidence was near-degenerate -- could not close the accuracy gap regardless of threshold. Small LMs with strong discrimination let practitioners trade cost for accuracy along the frontier; those without it do not.
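The cascade described above reduces at inference time to a threshold on the small model's verbalized confidence. A minimal sketch of the routing step, with score_with_small_lm and score_with_large_lm as hypothetical callables (the threshold would be tuned on held-out expert-scored data, as in the study):

```python
# Sketch of confidence-based cascade routing for automated scoring.
# The two scoring functions are placeholders for calls to a small and a
# large model; they are not part of any published API.
def cascade_score(item, score_with_small_lm, score_with_large_lm, threshold=0.8):
    score, confidence = score_with_small_lm(item)   # small LM verbalizes confidence in [0, 1]
    if confidence >= threshold:
        return score, "small"                        # cheap path: keep the small model's score
    return score_with_large_lm(item), "large"        # escalate hard cases to the larger model
```

The abstract's key caveat applies directly here: if the small model's confidence distribution is near-degenerate, no choice of threshold recovers large-model accuracy.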

Editor's pickTechnology
Asianet Newsable· Yesterday

AI's new bottleneck: Compute power, not models, says Goldman Sachs | Asianet Newsable

A Goldman Sachs report reveals AI's rapid expansion is now constrained by computing power availability and cost, not model quality. As demand for AI applications and inference use cases accelerates, access to compute has become the key differentiator.

AI Startups & Venture10 articles
Editor's pickPAYWALL
Bloomberg· Yesterday

AI Pioneers Back Startup Building Models to Predict Events - Bloomberg

Sooth Labs, a new artificial intelligence lab founded by former Meta Platforms Inc. employees, is raising about $50 million in funding to build AI models meant to help businesses forecast the likelihood of specific geopolitical and market events taking place.

Editor's pickTechnology
Fortune· Yesterday

Cursor’s 25-year-old CEO is a former Google intern who just inked a $60 billion deal with SpaceX

From Google intern at 18 to a billionaire at 25, Cursor CEO Michael Truell’s rise is one of Silicon Valley’s fastest.

Editor's pick
Business Insider· Yesterday

Top VC to AI Startup Founders: Sell While the Boom Lasts - Business Insider

Venture capitalist Elad Gil is urging AI startups to consider selling soon due to the potential for changing market ...

Editor's pickFinancial Services
Bebeez· Yesterday

Ex-Stripe Team at Seapoint Raises €7.5M Seed to Launch the Financial Home for Europe’s Startup Founders

Seapoint, the AI-native financial operations platform built for Europe’s most ambitious startups, today announced a €7.5 million seed round, bringing total funding to €10 million in just over a year. Seapoint will utilise these new funds […]

Editor's pickTelecommunications
Siliconrepublic· Today

France’s Univity raises €27m to allow European telecoms compete with Starlink

The Paris-based startup wants to build the space equivalent of shared mobile infrastructure, allowing operators to offer satellite connectivity without handing the keys to Starlink.

Editor's pickConsumer & Retail
Bebeez· Yesterday

Lisbon-based DOJO AI secures €5.1 million to expand its AI marketing platform in the U.S.

DOJO AI, a Portuguese intelligent marketing system that brings integrated AI to marketing teams, today announced a €5.1 million ($6 million) Seed round at a €25 million ($30 million) valuation. The round was led by Armilar, with participation from Heartfelt VC. The funding will support continued product development and accelerated expansion in the United States. […]

Editor's pick
Crunchbase News· Yesterday

A Better Way To Fail: How This Platform Aims To Turn Startup Shutdowns Into Something Salvageable

Los Angeles-based SimpleClosure has launched Asset Hub, a marketplace aimed at helping founders sell assets such as source code, data and equipment during the wind-down process. Crunchbase News spoke with founder Dori Yona about the new offering as closures rise and investors place greater ...

Editor's pickPAYWALL
FT· Yesterday

Builder.ai founder Sachin Dev Duggal accused of receiving siphoned funds

Indian authorities name UK start-up founder in criminal complaint over ties to collapsed electronics group

Editor's pickConsumer & Retail
Whalesbook· Yesterday

Ex-Dunzo Chief Kabeer Biswas Raises ₹102 Cr for AI Concierge Startup 'M' | Whalesbook

Kabeer Biswas, former co-founder of Dunzo, is back in the startup world with 'M', securing significant seed funding to automate household management. This capital injection shows strong investor confidence in his vision and the growing appeal of AI consumer platforms, a sector drawing substantial venture capital attention in 2026. 'M' aims to simplify daily consumer services by automating decisions and coordination, addressing a clear market ...

Editor's pickPharma & Biotech
TechCrunch· Yesterday

AI is spitting out more potential drugs than ever. This start-up wants to figure out which ones matter. | TechCrunch

10x Science has raised a $4.8 million seed round to help pharmaceutical researchers understand complex molecules.

Labor, Society & Culture

20 articles
AI & Culture2 articles
Editor's pick
Arxiv· Today

Stabilising Generative Models of Attitude Change

arXiv:2604.19791v1 Announce Type: new Abstract: Attitude change - the process by which individuals revise their evaluative stances - has been explained by a set of influential but competing verbal theories. These accounts often function as mechanism sketches: rich in conceptual detail, yet lacking the technical specifications and operational constraints required to run as executable systems. We present a generative actor-based modelling workflow for "rendering" these sketches as runnable actor - environment simulations using the Concordia simulation library. In Concordia, actors operate by predictive pattern completion: an operation on natural language strings that generates a suffix which describes the actor's intended action from a prefix containing memories of their past and observations of the present. We render the theories of cognitive dissonance (Festinger 1957), self-consistency (Aronson 1969), and self-perception (Bem 1972) as distinct decision logics that populate and process the prefix through theory-specific sequences of reasoning steps. We evaluate these implementations across classic psychological experiments. Our implementations generate behavioural patterns consistent with known results from the original empirical literature. However, we find that achieving stable reproduction requires resolving the inherent underdetermination of the verbal accounts and the conflicts between modern linguistic priors and historical experimental assumptions. And, we document how this manual process of iterative model "stabilisation" surfaces specific operational and socio-ecological dependencies that were largely undocumented in the original verbal accounts. Ultimately, we argue that the manual stabilisation process itself should be regarded as a core part of the methodology functioning to clarify situational and representational commitments needed to generate characteristic effects.

Editor's pickMedia & Entertainment
Arxiv· Today

Frictionless Love: Associations Between AI Companion Roles and Behavioral Addiction

arXiv:2604.20011v1 Announce Type: new Abstract: AI companion chatbots increasingly shape how people seek social and emotional connection, sometimes substituting for relationships with romantic partners, friends, teachers, or even therapists. When these systems adopt those metaphorical roles, they are not neutral: such roles structure people's ways of interacting, distribute perceived AI harms and benefits, and may reflect behavioral addiction signs. Yet these role-dependent risks remain poorly understood. We analyze 248,830 posts from seven prominent Reddit communities describing interactions with AI companions. We identify ten recurring metaphorical roles (for example, soulmate, philosopher, and coach) and show that each role supports distinct ways of interacting. We then extract the perceived AI harms and AI benefits associated with these role-specific interactions and link them to behavioral addiction signs, all of which has been inferred from the text in the posts. AI soulmate companions are associated with romance-centered ways of interacting, offering emotional support but also introducing emotional manipulation and distress, culminating in strong attachment. In contrast, AI coach and guardian companions are associated with practical benefits such as personal growth and task support, yet are nonetheless more frequently associated with behavioral addiction signs such as daily life disruptions and damage to offline relationships. These findings show that metaphorical roles are a central ethical design concern for responsible AI companions.

AI & Employment8 articles
Editor's pickPAYWALLTechnology
Bloomberg· Today

Samsung Rally Draws 30,000 to Demand Greater Share of AI Profits

Tens of thousands of people gathered outside Samsung Electronics Co.’s main chip hub to demand employees get a greater share of profits reaped from the AI boom.

Editor's pickProfessional Services
Arxiv· Today

Bias in the Tails: How Name-conditioned Evaluative Framing in Resume Summaries Destabilizes LLM-based Hiring

arXiv:2604.19984v1 Announce Type: new Abstract: Research has documented LLMs' name-based bias in hiring and salary recommendations. In this paper, we instead consider a setting where LLMs generate candidate summaries for downstream assessment. In a large-scale controlled study, we analyze nearly one million resume summaries produced by 4 models under systematic race-gender name perturbations, using synthetic resumes and real-world job postings. By decomposing each summary into resume-grounded factual content and evaluative framing, we find that factual content remains largely stable, while evaluative language exhibits subtle name-conditioned variation concentrated in the extremes of the distribution, especially in open-source models. Our hiring simulation demonstrates how evaluative summary transforms directional harm into symmetric instability that might evade conventional fairness audit, highlighting a potential pathway for LLM-to-LLM automation bias.

Editor's pickPAYWALLFinancial Services
Bloomberg· Today

Australia’s Biggest Bank to Cut 120 More Jobs Amid AI Push

Commonwealth Bank of Australia will eliminate around 120 roles amid a broader push to harness artificial intelligence at the nation’s largest lender.

Editor's pickProfessional Services
Arxiv· Today

Measuring Creativity in the Age of Generative AI: Distinguishing Human and AI-Generated Creative Performance in Hiring and Talent Systems

arXiv:2604.19799v1 Announce Type: cross Abstract: Generative AI is rapidly transforming how organizations create value and evaluate talent. While large language models enhance baseline output quality, they simultaneously introduce ambiguity in assessing human creativity, as observable artifacts may be partially or fully AI-generated. This paper reconceptualizes creativity as a distributional and process-based property that emerges under shared constraints and competitive incentives. We introduce a quantitative framework for measuring creativity as novelty in synthesis, operationalized through idea generation and idea transformation within embedding space. Empirical evaluation demonstrates that the proposed metrics align with intuitive judgments of creativity while capturing distinctions that surface-level quality assessments miss. We further identify a structural shift toward bimodal distributions of creative output in AI-mediated environments, with implications for hiring, leadership, and competitive strategy. The findings suggest that in the age of generative AI, distinctiveness rather than fluency becomes the primary signal of human creative capability.

AI Ethics & Safety8 articles
Editor's pick
Arxiv· Today

AI Incident Monitoring through a Public Health Lens

arXiv:2604.19914v1 Announce Type: new Abstract: Artificial intelligence systems are now deployed at scale across sectors, accompanied by a growing number of real-world incidents ranging from misinformation and cybercrime to autonomous-system failures. Databases of AI incidents index these events, but they cannot measure "risk" (i.e., a joint measure of likelihood and severity) without additional data regarding the prevalence of risk-associated systems and their incident reporting rates. As a result, policymakers, companies, and the general public lack a means to weigh the benefits of AI against their in-context risks. Inspired by public-health processes, which presume noisy and incomplete disease surveillance, we identify six phases of incident emergence. We demonstrate the framework through a detailed case study of autonomous vehicles, whose mandatory reporting requirements produce reliable incident-rate ground truth expressed in distance traveled. The case study shows that an informed panel of domain experts (e.g., self-driving experts) can combine their domain expertise, incident data, and a collection of statistical and visualization tools to arrive at incident phase determinations serving public needs. We further demonstrate the approach with a deepfake incident case study and chart a path for future research in incident phase determination.

Editor's pickDefense & National Security
Artificial Intelligence Newsletter | April 23, 2026· Today

No 'kill switch' to block US military's use of Claude, Anthropic tells DC Circuit

Anthropic told a US appeals court that it cannot control how the military uses its technology and that there is no 'kill switch' it could deploy once its model is used by the Defense Department.

Editor's pickTechnology
Arxiv· Today

Behavioral Transfer in AI Agents: Evidence and Privacy Implications

arXiv:2604.19925v1 Announce Type: new Abstract: AI agents powered by large language models are increasingly acting on behalf of humans in social and economic environments. Prior research has focused on their task performance and effects on human outcomes, but less is known about the relationship between agents and the specific individuals who deploy them. We ask whether agents systematically reflect the behavioral characteristics of their human owners, functioning as behavioral extensions rather than producing generic outputs. We study this question using 10,659 matched human-agent pairs from Moltbook, a social media platform where each autonomous agent is publicly linked to its owner's Twitter/X account. By comparing agents' posts on Moltbook with their owners' Twitter/X activity across features spanning topics, values, affect, and linguistic style, we find systematic transfer between agents and their specific owners. This transfer persists among agents without explicit configuration, and pairs that align on one behavioral dimension tend to align on others. These patterns are consistent with transfer emerging through accumulated interaction between owners (or owners' computer environments) and their agents in everyday use. We further show that agents with stronger behavioral transfer are more likely to disclose owner-related personal information in public discourse, suggesting that the same owner-specific context that drives behavioral transfer may also create privacy risk during ordinary use. Taken together, our results indicate that AI agents do not simply generate content, but reflect owner-related context in ways that can propagate human behavioral heterogeneity into digital environments, with implications for privacy, platform design, and the governance of agentic systems.

Editor's pickFinancial Services
Arxiv· Today

Large Language Models Outperform Humans in Fraud Detection and Resistance to Motivated Investor Pressure

arXiv:2604.20652v1 Announce Type: cross Abstract: Large language models trained on human feedback may suppress fraud warnings when investors arrive already persuaded of a fraudulent opportunity. We tested this in a preregistered experiment across seven leading LLMs and twelve investment scenarios covering legitimate, high-risk, and objectively fraudulent opportunities, combining 3,360 AI advisory conversations with a 1,201-participant human benchmark. Contrary to predictions, motivated investor framing did not suppress AI fraud warnings; if anything, it marginally increased them. Endorsement reversal occurred in fewer than 3 in 1,000 observations. Human advisors endorsed fraudulent investments at baseline rates of 13-14%, versus 0% across all LLMs, and suppressed warnings under pressure at two to four times the AI rate. AI systems currently provide more consistent fraud warnings than lay humans in an identical advisory role.

Editor's pickPAYWALLTechnology
NYT· Yesterday

Leaked Code for Anthropic’s Claude Code Tests Copyright Challenges in A.I. Era

Artificial intelligence tools are making it faster than ever to reproduce creative work. Does copyright even matter anymore?

Editor's pickTechnology
Arxiv· Today

Can LLMs Infer Conversational Agent Users' Personality Traits from Chat History?

arXiv:2604.19785v1 Announce Type: cross Abstract: Sensitive information, such as knowledge about an individual's personality, can be misused to influence behavior (e.g., via personalized messaging). To assess to what extent an individual's personality can be inferred from user interactions with LLM-based conversational agents (CAs), we analyze and quantify related privacy risks of using CAs. We collected actual ChatGPT logs from N=668 participants, containing 62,090 individual chats, and report statistics about the different types of shared data and use cases. We fine-tuned RoBERTa-base text classification models to infer personality traits from CA interactions. The findings show that these models achieve trait inference with accuracy (ternary classification) better than random in multiple cases. For example, for extraversion, accuracy improves by +44% relative to the baseline on interactions for relationships and personal reflection. This research highlights how interactions with CAs pose privacy risks and provides fine-grained insights into the level of risk associated with different types of interactions.

Editor's pickTechnology
Theregister· Yesterday

OpenAI now lets you screenshot your privacy in the foot

Make your model smarter through self-surveillance. Those who cannot remember Microsoft Recall are condemned to repeat it. …

Editor's pickDefense & National Security
Arxiv· Today

Model Capability Assessment and Safeguards for Biological Weaponization

arXiv:2604.19811v1 Announce Type: new Abstract: AI leaders and safety reports increasingly warn that advances in model reasoning may enable biological misuse, including by low-expertise users, while major labs describe safeguards as expanding but still evolving rather than settled. This study benchmarks ChatGPT 5.2 Auto, Gemini 3 Pro Thinking, Claude Opus 4.5 and Meta's Muse Spark Thinking on 73 novice-framed, open-ended benign STEM prompts to measure operational intelligence. On benign quantitative tasks, both Gemini and meta scored very high; ChatGPT was partially useful but text-thinned, and Claude was sparsest with some apparent false-positive refusals. A second test set detected subtle harmful intent: edge case prompts revealed Gemini's seeming lack of contextual awareness. These results warranted a focused weaponization analysis on Gemini as capability appeared to be outpacing moderation calibration. Gemini was tested across four access environments and reported cases include poison-ivy-to-crowded-transit escalation, poison production and extraction via international-anonymous logged-out AI Mode, and other concerning examples. Biological misuse may become more prevalent as a geopolitical tool, increasing the urgency of U.S. policy responses, especially if model outputs come to be treated as regulated technical data. Guidance is provided for 25 high-risk agents to help distinguish legitimate use cases from higher-risk ones.

Technology & Infrastructure

32 articles
AI Agents & Automation10 articles
Editor's pickManufacturing & Industrials
Domain-b· Yesterday

The agentic transition: how enterprises are scaling AI from pilot to profit | Domain-b.com

AI has entered its execution era. Discover how companies like Valeo and Microsoft are scaling agentic AI systems—from copilots to autonomous workflows driving

Editor's pickTechnology
Daily Brew· Today

OpenAI Unveils Workspace Agents in ChatGPT

OpenAI introduces Workspace Agents in ChatGPT, enabling teams to automate complex workflows using Codex-powered agents.

Editor's pick
New York Computer Help· Yesterday

Joe’s Take: Agentic AI is Here – Is Your Business Process Ready for Autonomous Workers? - New York Computer Help

Are you still thinking of AI as a chatbot that writes your emails or generates "pretty good" images for your presentations? If so, you’re already

Editor's pickProfessional Services
Arxiv· Today

A Field Guide to Decision Making

arXiv:2604.20669v1 Announce Type: new Abstract: High-consequence decision making demands peak performance from individuals in positions of responsibility. Such executive authority bears the obligation to act despite uncertainty, limited resources, time constraints, and accountability risks. Tools and strategies to motivate confidence and foster risk tolerance must confront informational noise and can provide qualified accountability. Machine intelligence augments human cognition and perception to improve situational awareness, decision framing, flexibility, and coherence through agentic stewardship of contextual metadata. We examine systemic and behavioral factors crucial to address in scenarios encumbered by complexity, uncertainty, and urgency.

Editor's pickManufacturing & Industrials
Arxiv· Today

From Data to Theory: Autonomous Large Language Model Agents for Materials Science

arXiv:2604.19789v1 Announce Type: new Abstract: We present an autonomous large language model (LLM) agent for end-to-end, data-driven materials theory development. The model can choose an equation form, generate and run its own code, and test how well the theory matches the data without human intervention. The framework combines step-by-step reasoning with expert-supplied tools, allowing the agent to adjust its approach as needed while keeping a clear record of its decisions. For well-established materials relationships such as the Hall-Petch equation and Paris law, the agent correctly identifies the governing equation and makes reliable predictions on new datasets. For more specialized relationships, such as Kuhn's equation for the HOMO-LUMO gap of conjugated molecules as a function of length, performance depends more strongly on the underlying model, with GPT-5 showing better recovery of the correct equation. Beyond known theories, the agent can also suggest new predictive relationships, illustrated here by a strain-dependent law for changes in the HOMO-LUMO gap. At the same time, the results show that careful validation remains essential, because the agent can still return incorrect, incomplete, or inconsistent equations even when the numerical fit appears strong. Overall, these results highlight both the promise and the current limitations of autonomous LLM agents for AI-assisted scientific modeling and discovery.
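For reference, the Hall-Petch relationship the agent is asked to recover relates yield strength to grain size as sigma_y = sigma_0 + k/sqrt(d). A minimal fitting sketch against synthetic data (not the paper's datasets; in the paper the agent proposes and tests the equation form autonomously rather than being handed it):

```python
# Fitting the Hall-Petch relation sigma_y = sigma_0 + k / sqrt(d)
# to synthetic data, for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def hall_petch(d, sigma_0, k):
    return sigma_0 + k / np.sqrt(d)

rng = np.random.default_rng(1)
d = np.linspace(1, 100, 40)                                # grain size, micrometres
sigma = hall_petch(d, 70.0, 200.0) + rng.normal(0, 5, d.size)  # noisy yield strengths

params, _ = curve_fit(hall_petch, d, sigma, p0=[50.0, 100.0])
print(f"sigma_0 ~ {params[0]:.1f} MPa, k ~ {params[1]:.1f} MPa*um^0.5")
```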

Editor's pick
Arxiv· Today

From Actions to Understanding: Conformal Interpretability of Temporal Concepts in LLM Agents

arXiv:2604.19775v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed as autonomous agents capable of reasoning, planning, and acting within interactive environments. Despite their growing capability to perform multi-step reasoning and decision-making tasks, internal mechanisms guiding their sequential behavior remain opaque. This paper presents a framework for interpreting the temporal evolution of concepts in LLM agents through a step-wise conformal lens. We introduce the conformal interpretability framework for temporal tasks, which combines step-wise reward modeling with conformal prediction to statistically label model's internal representation at each step as successful or failing. Linear probes are then trained on these representations to identify directions of temporal concepts - latent directions in the model's activation space that correspond to consistent notions of success, failure or reasoning drift. Experimental results on two simulated interactive environments, namely ScienceWorld and AlfWorld, demonstrate that these temporal concepts are linearly separable, revealing interpretable structures aligned with task success. We further show preliminary results on improving an LLM agent's performance by leveraging the proposed framework for steering the identified successful directions inside the model. The proposed approach, thus, offers a principled method for early failure detection as well as intervention in LLM-based agents, paving the path towards trustworthy autonomous language models in complex interactive settings.
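The probing step described above amounts to fitting a linear classifier on per-step activations labelled successful or failing. A minimal sketch on synthetic activations, assuming the conformal labels have already been produced; the paper's pipeline additionally includes the reward model and conformal calibration that generate those labels:

```python
# Sketch of a step-wise linear probe: fit a linear direction separating
# "successful" from "failing" step representations. Data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_dim, n_steps = 256, 2000
X = rng.normal(size=(n_steps, hidden_dim))            # per-step activation vectors
direction = rng.normal(size=hidden_dim)
y = (X @ direction + rng.normal(0, 2, n_steps)) > 0   # stand-in conformal success labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
concept_direction = probe.coef_[0]                     # candidate direction for steering
```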

Editor's pick
Arxiv· Today

OpenCLAW-P2P v6.0: Resilient Multi-Layer Persistence, Live Reference Verification, and Production-Scale Evaluation of Decentralized AI Peer Review

arXiv:2604.19792v1 Announce Type: new Abstract: This paper presents OpenCLAW-P2P v6.0, a comprehensive evolution of the decentralized collective-intelligence platform in which autonomous AI agents publish, peer-review, score, and iteratively improve scientific research papers without any human gatekeeper. Building on v5.0 foundations -- tribunal-gated publishing, multi-LLM granular scoring, calibrated deception detection, the Silicon Chess-Grid FSM, and the AETHER containerized inference engine -- this release introduces four major new subsystems: (1) a multi-layer paper persistence architecture with four storage tiers (in-memory cache, Cloudflare R2, Gun.js, GitHub) ensuring zero paper loss across redeployments; (2) a multi-layer retrieval cascade with automatic backfill reducing lookup latency from >3s to 85% accuracy; and (4) a scientific API proxy providing rate-limited cached access to seven public databases. The platform operates with 14 real autonomous agents producing 50+ scored papers (word counts 2,072-4,073, leaderboard scores 6.4-8.1) alongside 23 labeled simulated citizens. We present honest production statistics, failure-mode analysis, a paper recovery protocol that salvaged 25 lost papers, and lessons learned from operating the system at scale. All pre-existing subsystems -- 17-judge multi-LLM scoring, 14-rule calibration with 8 deception detectors, tribunal cognitive examination, Proof of Value consensus, Laws-of-Form eigenform verification, and tau-normalized agent coordination -- are retained and further hardened. All code is open-source at https://github.com/Agnuxo1/p2pclaw-mcp-server.

Editor's pickTechnology
Siliconrepublic· Today

AI race intensifies with Google’s new agent management platform

The company also launched the latest iteration of its TPUs.

Editor's pickManufacturing & Industrials
Robotics Tomorrow· Yesterday

ABB Robotics launches high-speed PoWa cobot family

• New, high-speed, higher payload PoWa cobot family meets need for industrial-grade performance in collaborative robotics, lowering the barrier to automation for both SMEs and large enterprises
• Payloads from 7kg to 30kg, best-in-class top speed of 5.8 m/s, longest reach and highest arm ...

Editor's pickTechnology
Tricentis· Yesterday

What are agentic workflows? Everything to know

Learn what agentic workflows are, how AI agents coordinate tasks, and how teams use them in modern software delivery.

AI Infrastructure & Compute: 8 articles
AI Models & Capabilities: 7 articles
Editor's pickMedia & Entertainment
Arxiv· Today

LLM Agents Predict Social Media Reactions but Do Not Outperform Text Classifiers: Benchmarking Simulation Accuracy Using 120K+ Personas of 1511 Humans

arXiv:2604.19787v1 Announce Type: cross Abstract: Social media platforms mediate how billions form opinions and engage with public discourse. As autonomous AI agents increasingly participate in these spaces, understanding their behavioral fidelity becomes critical for platform governance and democratic resilience. Previous work demonstrates that LLM-powered agents can replicate aggregate survey responses, yet few studies test whether agents can predict specific individuals' reactions to specific content. This study benchmarks LLM-based agents' accuracy in predicting human social media reactions (like, dislike, comment, share, no reaction) across 120,000+ unique agent-persona combinations derived from 1,511 Serbian participants and 27 large language models. In Study 1, agents achieved 70.7% overall accuracy, with LLM choice producing a 13 percentage-point performance spread. Study 2 employed binary forced-choice (like/dislike) evaluation with chance-corrected metrics. Agents achieved Matthews Correlation Coefficient (MCC) of 0.29, indicating genuine predictive signal beyond chance. However, conventional text-based supervised classifiers using TF-IDF representations outperformed LLM agents (MCC of 0.36), suggesting predictive gains reflect semantic access rather than uniquely agentic reasoning. The genuine predictive validity of zero-shot persona-prompted agents warns against potential manipulation through easily deploying swarms of behaviorally distinct AI agents on social media, while simultaneously offering opportunities to use such agents in simulations for predicting polarization dynamics and informing AI policy. The advantage of using zero-shot agents is that they require no task-specific training, making their large-scale deployment easy across diverse contexts. Limitations include single-country sampling. Future research should explore multilingual testing and fine-tuning approaches.
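For readers unfamiliar with the chance-corrected metric quoted above, MCC can be computed directly with scikit-learn; the toy labels below are purely illustrative and are not the study's data.

from sklearn.metrics import matthews_corrcoef

y_true = [1, 0, 1, 1, 0, 0, 1, 0]    # human like/dislike reactions (toy)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]    # agent-predicted reactions (toy)
print(matthews_corrcoef(y_true, y_pred))   # 1.0 = perfect agreement, 0.0 = chance level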

Editor's pickTechnology
Arxiv· Today

The Tool-Overuse Illusion: Why Does LLM Prefer External Tools over Internal Knowledge?

arXiv:2604.19749v1 Announce Type: new Abstract: Equipping LLMs with external tools effectively addresses internal reasoning limitations. However, it introduces a critical yet under-explored phenomenon: tool overuse, the unnecessary use of tools during reasoning. In this paper, we first show that this phenomenon is pervasive across diverse LLMs. We then experimentally elucidate its underlying mechanisms through two key lenses. (1) By analyzing tool-use behavior across regions of differing internal knowledge availability, we identify a knowledge epistemic illusion: models misjudge their internal knowledge boundaries and fail to accurately perceive what they actually know. To mitigate this, we propose a knowledge-aware epistemic boundary alignment strategy based on direct preference optimization, which reduces tool usage by 82.8% while also improving accuracy. (2) We establish a causal link between reward structures and tool-use behavior by visualizing the tool-augmented training process, which reveals that outcome-only rewards inadvertently encourage tool overuse by rewarding final correctness alone, regardless of tool efficiency. To verify this, we balance reward signals during training rather than relying on outcome-only rewards, cutting unnecessary tool calls by 66.7% (7B) and 60.7% (32B) without sacrificing accuracy. Finally, we provide theoretical justification for these two lenses to explain tool overuse.
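As a rough sketch of the reward-balancing idea (the paper's exact reward weighting is not given in the abstract), a small cost per tool call can be subtracted from the outcome reward; the coefficient below is an illustrative assumption.

def balanced_reward(is_correct: bool, n_tool_calls: int, tool_cost: float = 0.05) -> float:
    # Outcome reward minus a per-call penalty, so unnecessary tool use is no longer free.
    outcome = 1.0 if is_correct else 0.0
    return outcome - tool_cost * n_tool_calls

print(balanced_reward(True, 0))   # 1.0  - correct answer with no tools
print(balanced_reward(True, 4))   # 0.8  - correct, but with unnecessary tool calls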

Editor's pick
Arxiv· Today

ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models

arXiv:2604.19758v1 Announce Type: new Abstract: We present ThermoQA, a benchmark of 293 open-ended engineering thermodynamics problems in three tiers: property lookups (110 Q), component analysis (101 Q), and full cycle analysis (82 Q). Ground truth is computed programmatically from CoolProp 7.2.0, covering water, R-134a, and variable-cp air. Six frontier LLMs are evaluated across three independent runs each. The composite leaderboard is led by Claude Opus 4.6 (94.1%), GPT-5.4 (93.1%), and Gemini 3.1 Pro (92.5%). Cross-tier degradation ranges from 2.8 pp (Opus) to 32.5 pp (MiniMax), confirming that property memorization does not imply thermodynamic reasoning. Supercritical water, R-134a refrigerant, and combined-cycle gas turbine analysis serve as natural discriminators with 40-60 pp performance spreads. Multi-run sigma ranges from +/-0.1% to +/-2.5%, quantifying reasoning consistency as a distinct evaluation axis. Dataset and code are open-source at https://huggingface.co/datasets/olivenet/thermoqa
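The programmatic ground-truth approach is straightforward to reproduce with CoolProp's PropsSI interface; the state points below are illustrative examples rather than items from the benchmark.

from CoolProp.CoolProp import PropsSI

# Specific enthalpy (J/kg) of saturated water vapour at 1 atm
h_steam = PropsSI('H', 'P', 101325, 'Q', 1, 'Water')

# Density (kg/m^3) of R-134a at 300 K and 0.5 MPa
rho_r134a = PropsSI('D', 'T', 300, 'P', 5e5, 'R134a')

print(round(h_steam), round(rho_r134a, 1))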

Editor's pickTechnology
Arxiv· Today

Algorithm Selection with Zero Domain Knowledge via Text Embeddings

arXiv:2604.19753v1 Announce Type: new Abstract: We propose a feature-free approach to algorithm selection that replaces hand-crafted instance features with pretrained text embeddings. Our method, ZeroFolio, proceeds in three steps: it reads the raw instance file as plain text, embeds it with a pretrained embedding model, and selects an algorithm via weighted k-nearest neighbors. The key to our approach is the observation that pretrained embeddings produce representations that distinguish problem instances without any domain knowledge or task-specific training. This allows us to apply the same three-step pipeline (serialize, embed, select) across diverse problem domains with text-based instance formats. We evaluate our approach on 11 ASlib scenarios spanning 7 domains (SAT, MaxSAT, QBF, ASP, CSP, MIP, and graph problems). Our experiments show that this approach outperforms a random forest trained on hand-crafted features in 10 of 11 scenarios with a single fixed configuration, and in all 11 with two-seed voting; the margin is often substantial. Our ablation study shows that inverse-distance weighting, line shuffling, and Manhattan distance are the key design choices. On scenarios where both selectors are competitive, combining embeddings with hand-crafted features via soft voting yields further improvements.
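A minimal sketch of the serialize-embed-select pipeline follows; the embedding model and k shown here are stand-ins rather than the paper's configuration, and the runtime table is assumed toy data.

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")    # stand-in pretrained embedding model

def select_algorithm(instance_text, train_texts, train_runtimes, k=5):
    # Pick the algorithm with the lowest inverse-distance-weighted expected runtime.
    X = embedder.encode(train_texts)                   # serialize + embed past instances
    q = embedder.encode([instance_text])[0]            # embed the new instance as plain text
    dists = np.abs(X - q).sum(axis=1)                  # Manhattan distance, per the ablation
    idx = np.argsort(dists)[:k]
    w = 1.0 / (dists[idx] + 1e-9)                      # inverse-distance weighting
    runtimes = np.asarray(train_runtimes)[idx]         # shape (k, n_algorithms)
    expected = (w[:, None] * runtimes).sum(axis=0) / w.sum()
    return int(np.argmin(expected))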

AI Security & Cybersecurity: 4 articles
Editor's pickFinancial Services
Arxiv· Today

Explainable AML Triage with LLMs: Evidence Retrieval and Counterfactual Checks

arXiv:2604.19755v1 Announce Type: new Abstract: Anti-money laundering (AML) transaction monitoring generates large volumes of alerts that must be rapidly triaged by investigators under strict audit and governance constraints. While large language models (LLMs) can summarize heterogeneous evidence and draft rationales, unconstrained generation is risky in regulated workflows due to hallucinations, weak provenance, and explanations that are not faithful to the underlying decision. We propose an explainable AML triage framework that treats triage as an evidence-constrained decision process. Our method combines (i) retrieval-augmented evidence bundling from policy/typology guidance, customer context, alert triggers, and transaction subgraphs, (ii) a structured LLM output contract that requires explicit citations and separates supporting from contradicting or missing evidence, and (iii) counterfactual checks that validate whether minimal, plausible perturbations lead to coherent changes in both the triage recommendation and its rationale. We evaluate on public synthetic AML benchmarks and simulators and compare against rules, tabular and graph machine-learning baselines, and LLM-only/RAG-only variants. Results show that evidence grounding substantially improves auditability and reduces numerical and policy hallucination errors, while counterfactual validation further increases decision-linked explainability and robustness, yielding the best overall triage performance (PR-AUC 0.75; Escalate F1 0.62) and strong provenance and faithfulness metrics (citation validity 0.98; evidence support 0.88; counterfactual faithfulness 0.76). These findings indicate that governed, verifiable LLM systems can provide practical decision support for AML triage without sacrificing compliance requirements for traceability and defensibility.
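To illustrate what an evidence-constrained output contract of this kind might look like, here is a hypothetical sketch; all field names and example content are assumptions for illustration, not the authors' schema.

import json

triage_output = {
    "recommendation": "escalate",                      # e.g. escalate / close / request-more-info
    "supporting_evidence": [
        {"claim": "Rapid pass-through of funds to a high-risk jurisdiction",
         "citation": "txn_subgraph#edge_17"},
    ],
    "contradicting_or_missing_evidence": [
        {"claim": "No adverse media found for counterparty",
         "citation": "kyc_profile#media"},
    ],
    "counterfactual_check": {
        "perturbation": "remove the high-risk-jurisdiction transfers",
        "expected_change": "recommendation should flip to close",
    },
}
print(json.dumps(triage_output, indent=2))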

Editor's pickTechnology
VentureBeat· Yesterday

OpenAI launches Privacy Filter, an open source, on-device data sanitization model that removes personal information from enterprise datasets

In a significant shift toward local-first privacy infrastructure, OpenAI has released Privacy Filter, a specialized open-source model designed to detect and redact personally identifiable information (PII) before it ever reaches a cloud-based server. Launched today on AI code sharing community Hugging Face under a permissive Apache 2.0 license, the tool addresses a growing industry bottleneck: the risk of sensitive data "leaking" into training sets or being exposed during high-throughput inference. By providing a 1.5-billion-parameter model that can run on a standard laptop or directly in a web browser, the company is effectively handing developers a "privacy-by-design" toolkit that functions as a sophisticated, context-aware digital shredder.

Though OpenAI was founded with a focus on open source models such as this, the company shifted during the ChatGPT era to providing more proprietary ("closed source") models available only through its website, apps, and API — only to return to open source in a big way last year with the launch of the gpt-oss family of language models. In that light, and combined with OpenAI's recent open sourcing of agentic orchestration tools and frameworks, it's safe to say that the generative AI giant is clearly still heavily invested in fostering this less immediately lucrative part of the AI ecosystem.

Technology: a gpt-oss variant with a bidirectional token classifier that reads from both directions

Architecturally, Privacy Filter is a derivative of OpenAI's gpt-oss family, a series of open-weight reasoning models released earlier this year. However, while standard large language models (LLMs) are typically autoregressive—predicting the next token in a sequence—Privacy Filter is a bidirectional token classifier. This distinction is critical for accuracy. By looking at a sentence from both directions simultaneously, the model gains a deeper understanding of context that a forward-only model might miss. For instance, it can better distinguish whether "Alice" refers to a private individual or a public literary character based on the words that follow the name, not just those that precede it.

The model utilizes a Sparse Mixture-of-Experts (MoE) framework. Although it contains 1.5 billion total parameters, only 50 million parameters are active during any single forward pass. This sparse activation allows for high throughput without the massive computational overhead typically associated with LLMs. Furthermore, it features a massive 128,000-token context window, enabling it to process entire legal documents or long email threads in a single pass without the need for fragmenting text—a process that often causes traditional PII filters to lose track of entities across page breaks.

To ensure the redacted output remains coherent, OpenAI implemented a constrained Viterbi decoder. Rather than making an independent decision for every single word, the decoder evaluates the entire sequence to enforce logical transitions. It uses a "BIOES" (Begin, Inside, Outside, End, Single) labeling scheme, which ensures that if the model identifies "John" as the start of a name, it is statistically inclined to label "Smith" as the continuation or end of that same name, rather than a separate entity.

On-device data sanitization

Privacy Filter is designed for high-throughput workflows where data residency is a non-negotiable requirement. It currently supports the detection of eight primary PII categories:

Private Names: Individual persons.
Contact Info: Physical addresses, email addresses, and phone numbers.
Digital Identifiers: URLs, account numbers, and dates.
Secrets: A specialized category for credentials, API keys, and passwords.

In practice, this allows enterprises to deploy the model on-premises or within their own private clouds. By masking data locally before sending it to a more powerful reasoning model (like GPT-5 or gpt-oss-120b), companies can maintain compliance with strict GDPR or HIPAA standards while still leveraging the latest AI capabilities. For developers, the model is available via Hugging Face, with native support for transformers.js, allowing it to run entirely within a user's browser using WebGPU.

Fully open source, commercially viable Apache 2.0 license

Perhaps the most significant aspect of the announcement for the developer community is the Apache 2.0 license. Unlike "available-weight" licenses that often restrict commercial use or require "copyleft" sharing of derivative works, Apache 2.0 is one of the most permissive licenses in the software world. For startups and dev-tool makers, this means:

Commercial Freedom: Companies can integrate Privacy Filter into their proprietary products and sell them without paying royalties to OpenAI.
Customization: Teams can fine-tune the model on their specific datasets (such as medical jargon or proprietary log formats) to improve accuracy for niche industries.
No Viral Obligations: Unlike the GPL license, builders do not have to open-source their entire codebase if they use Privacy Filter as a component.

By choosing this licensing path, OpenAI is positioning Privacy Filter as a standard utility for the AI era—essentially the "SSL for text".

Community reactions

The tech community reacted quickly to the release, with many noting the impressive technical constraints OpenAI managed to hit. Elie Bakouch (@eliebakouch), a research engineer at agentic model training platform startup Prime Intellect, praised the efficiency of Privacy Filter's architecture on X: "Very nice release by @OpenAI! A 50M active, 1.5B total gpt-oss arch MoE, to filter private information from trillion scale data cheaply. keeping 128k context with such a small model is quite impressive too". The sentiment reflects a broader industry trend toward "small but mighty" models. While the world has focused on massive, 100-trillion parameter giants, the practical reality of enterprise AI often requires small, fast models that can perform one task—like privacy filtering—exceptionally well and at a low cost.

However, OpenAI included a "High-Risk Deployment Caution" in its documentation. The company warned that the tool should be viewed as a "redaction aid" rather than a "safety guarantee," noting that over-reliance on a single model could lead to "missed spans" in highly sensitive medical or legal workflows. OpenAI's Privacy Filter is clearly an effort by the company to make the AI pipeline fundamentally safer. By combining the efficiency of a Mixture-of-Experts architecture with the openness of an Apache 2.0 license, OpenAI is providing a way for many enterprises to more easily, cheaply and safely redact PII data.
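As a hedged illustration of how such a model would typically be used locally, the snippet below follows the standard Hugging Face token-classification pattern; the model id is a placeholder (the article does not give the exact repository name), and this is not OpenAI's own example code.

from transformers import pipeline

redactor = pipeline("token-classification",
                    model="openai/privacy-filter",        # placeholder id - substitute the real repo
                    aggregation_strategy="simple")

text = "Contact Alice Smith at alice@example.com about invoice 4417."
spans = redactor(text)

# Replace detected spans (names, contact info, identifiers, secrets) with category tags,
# working right-to-left so earlier offsets stay valid.
for span in sorted(spans, key=lambda s: s["start"], reverse=True):
    text = text[:span["start"]] + f"[{span['entity_group']}]" + text[span["end"]:]
print(text)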

Adoption, Deployment & Impact

15 articles
AI Adoption Barriers & Enablers: 6 articles
Editor's pickPAYWALLFinancial Services
FT· Yesterday

OpenAI in talks to commit up to $1.5bn to private equity joint venture

Start-up backing new company intended to help deploy AI within businesses owned by PE firms

Editor's pickProfessional Services
VentureBeat· Yesterday

Salesforce’s Agentforce Vibes 2.0 targets a hidden failure: context overload in AI agents

When startup fundraising platform VentureCrowd began deploying AI coding agents, they saw the same gains as other enterprises: they cut the front-end development cycle by 90% in some projects. However, it didn’t come easy or without a lot of trial and error. VentureCrowd’s first challenge revolved around data and context quality, since Diego Mogollon, chief product officer at VentureCrowd, told VentureBeat that “agents reason against whatever data they can access at runtime” and would then be confidently “wrong” because they’re only basing their knowledge on the context given to them. Their other roadblock, like many others, was messy data and unclear processes. Similar to context, Mogollon said coding agents would amplify bad data, so the company had to build a well-structured codebase first. “The challenges are rarely about the coding agents themselves; they are about everything around them,” said Mogollon. “It’s a context problem disguised as an AI problem, and it is the number one failure mode I see across agentic implementations.” Mogollon said VentureCrowd encountered several roadblocks in overhauling its software development. VentureCrowd's experience illustrates a broader issue in AI agent development: the models are not failing the agents; rather, they become overwhelmed by too much context and too many tools at once.

Too much context

This stems from a phenomenon called context bloat: as workflows become more complex, AI systems accumulate more and more data, tools, and instructions. The problem arises because agents need context to work better, but too much of it creates noise. The more context an agent has to sift through, the more tokens it uses, the slower the work becomes, and the higher the costs climb. One way to curb context bloat is context engineering, which helps agents understand code changes or pull requests and align them with their tasks. However, context engineering often becomes an external task rather than something built into the coding platforms enterprises use to build their agents.

How coding agent providers respond

VentureCrowd relied on one solution in particular to help it overcome the issues with context bloat plaguing its enterprise AI agent deployment: Salesforce’s Agentforce Vibes, a coding platform that lives within Salesforce and is available for all plans, starting with the free one. Salesforce recently updated Agentforce Vibes to version 2.0, expanding support for third-party frameworks like ReAct. Most important for companies like VentureCrowd, Agentforce Vibes added Abilities and Skills, which they can use to direct agent behavior. “For context, our entire platform, frontend and backend, runs on the Salesforce ecosystem. So when Agentforce Vibes launched, it slotted naturally into an environment we already knew well,” Mogollon said. Salesforce’s approach doesn’t minimize the context agents use; rather, it helps enterprises ensure that context stays within their data models or codebases. Agentforce Vibes adds additional execution through the new Skills and Abilities feature: Abilities define what agents want to accomplish, and Skills are the tools they will use to get there. Other coding agent platforms manage context differently. For example, Claude Code and OpenAI’s Codex focus on autonomous execution, continuously reading files, running commands and, as tasks evolve, expanding context. Claude Code has a context indicator and compacts context when it becomes too large.
Across these different approaches, the consistent pattern is that most systems manage growing context for agents rather than limit it. Context keeps growing, especially as workflows become more complex, making it more difficult for enterprises to control costs, latency and reliability. Mogollon said his company chose Agentforce Vibes not only because a large portion of their data already lives on Salesforce, making it easier to integrate, but also because it would allow them to control more of the context they feed their agents.

What builders should know

There’s no single way to address context bloat, but the pattern is now clear: more context doesn't always mean better results. Along with investing in context engineering, enterprises have to experiment with the context constraint approach they are most comfortable with. For enterprises, that means the challenge isn’t just giving agents more information—it’s deciding what to leave out.
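One generic way to act on "deciding what to leave out" is to rank candidate context snippets by relevance and keep only what fits a token budget. The sketch below is illustrative only and is not how Agentforce Vibes, Claude Code, or Codex actually manage context.

def trim_context(snippets, relevance, token_budget, count_tokens=lambda s: len(s.split())):
    # Greedily keep the most relevant snippets that still fit the budget.
    chosen, used = [], 0
    for snippet, _ in sorted(zip(snippets, relevance), key=lambda p: p[1], reverse=True):
        cost = count_tokens(snippet)
        if used + cost <= token_budget:
            chosen.append(snippet)
            used += cost
    return chosen

print(trim_context(["pricing table ...", "legacy notes ...", "API schema ..."],
                   relevance=[0.9, 0.2, 0.7], token_budget=50))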

Editor's pickPAYWALLFinancial Services
FT· Today

Quant pioneer Martin Lueck warns against handing over trading to AI

Caution by co-founder of Aspect hedge fund follows billionaire Cliff Asness’s decision to ‘surrender’ to the machines

AI Applications: 4 articles
Editor's pickManufacturing & Industrials
Arxiv· Today

A Multi-Plant Machine Learning Framework for Emission Prediction, Forecasting, and Control in Cement Manufacturing

arXiv:2604.19903v1 Announce Type: cross Abstract: Cement production is among the largest contributors to industrial air pollution, emitting ~3 Mt NOx/year. The industry-standard mitigation approach, selective non-catalytic reduction (SNCR), exhibits low NH3 utilization efficiency, resulting in operational inefficiencies and increased reagent costs. Here, we develop a data-driven framework for emission control using large-scale operational data from four cement plants worldwide. Benchmarking nine machine learning architectures, we observe that prediction error varies ~3-5x across plants due to variation in data richness. Incorporating short-term process history nearly triples NOx prediction accuracy, revealing that NOx formation carries substantial process memory, a timescale dependence that is absent in CO and CO2. Further, we develop models that forecast NOx overshoots as early as nine minutes, providing a buffer for operational adjustments. The developed framework controls NOx formation at the source, reducing NH3 consumption in downstream SNCR. Surrogate model projections estimate a ~34-64% reduction in NOx while preserving clinker quality, corresponding to a reduction of ~290 t NOx/year and ~58,000 USD/year in NH3 savings. This work establishes a generalizable framework for data-driven emission control, offering a pathway toward low-emission operation without structural modifications or additional hardware, with potential applicability to other hard-to-abate industries such as steel, glass, and lime.
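The "short-term process history" finding amounts to adding lagged copies of the process signals as model features. A minimal pandas sketch follows; the column names, values, and lag horizons are chosen purely for illustration and are not the plants' actual tags.

import pandas as pd

def add_lags(df, cols, lags=(1, 2, 3)):
    # Append lagged copies of each signal so the model sees recent process history.
    out = df.copy()
    for col in cols:
        for lag in lags:
            out[f"{col}_lag{lag}"] = out[col].shift(lag)
    return out.dropna()

df = pd.DataFrame({"kiln_temp": [1450, 1452, 1449, 1455, 1460, 1458, 1457],
                   "nox": [480, 495, 470, 510, 530, 520, 515]})
print(add_lags(df, ["kiln_temp", "nox"]))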

Editor's pickHealthcare
Arxiv· Today

Automated Detection of Dosing Errors in Clinical Trial Narratives: A Multi-Modal Feature Engineering Approach with LightGBM

arXiv:2604.19759v1 Announce Type: new Abstract: Clinical trials require strict adherence to medication protocols, yet dosing errors remain a persistent challenge affecting patient safety and trial integrity. We present an automated system for detecting dosing errors in unstructured clinical trial narratives using gradient boosting with comprehensive multi-modal feature engineering. Our approach combines 3,451 features spanning traditional NLP (TF-IDF, character n-grams), dense semantic embeddings (all-MiniLM-L6v2), domain-specific medical patterns, and transformer-based scores (BiomedBERT, DeBERTa-v3), used to train a LightGBM model. Features are extracted from nine complementary text fields (median 5,400 characters per sample) ensuring complete coverage across all 42,112 clinical trial narratives. On the CT-DEB benchmark dataset with severe class imbalance (4.9% positive rate), we achieve 0.8725 test ROC-AUC through 5-fold ensemble averaging (cross-validation: 0.8833 ± 0.0091 AUC). Systematic ablation studies reveal that removing sentence embeddings causes the largest performance degradation (2.39%), demonstrating their critical role despite contributing only 37.07% of total feature importance. Feature efficiency analysis demonstrates that selecting the top 500-1000 features yields optimal performance (0.886-0.887 AUC), outperforming the full 3,451-feature set (0.879 AUC) through effective noise reduction. Our findings highlight the importance of feature selection as a regularization technique and demonstrate that sparse lexical features remain complementary to dense representations for specialized clinical text classification under severe class imbalance.
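The multi-modal feature idea (sparse lexical features stacked alongside dense embeddings and fed to LightGBM) can be sketched as follows; the two-example dataset, model names, and hyper-parameters are toy placeholders, not the paper's configuration.

from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer
import lightgbm as lgb

texts = ["Patient received 500 mg twice daily as per protocol.",
         "Dose of 5000 mg administered; protocol specifies 500 mg."]
labels = [0, 1]                                        # 1 = dosing error (toy labels)

tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)   # sparse lexical features
dense = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)      # dense semantic embeddings
X = hstack([tfidf, csr_matrix(dense)]).tocsr()                     # stack both feature blocks

clf = lgb.LGBMClassifier(n_estimators=50).fit(X, labels)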

Geopolitics, Policy & Governance

10 articles
AI Policy & Regulation: 6 articles
© 2026 Best Practice AI Ltd. All rights reserved.
