Thu 23 April 2026
Daily Brief — Curated and contextualised by Best Practice AI
TSMC Holds Back, Tesla Spends Big, and AI Divides the Workforce
TL;DR Taiwan Semiconductor Manufacturing Co. will delay adopting ASML's costly lithography machines until 2029. Tesla plans to increase its AI-related spending to $25 billion, focusing on self-driving technology. A survey reveals high earners are rapidly adopting AI at work, exacerbating existing pay and gender gaps. Nvidia supplier SK Hynix reports record earnings as customers prioritize procurement amid a supply crunch.
The stories that matter most
Selected and contextualised by the Best Practice AI team
TSMC Says ASML’s Latest Chipmaking Gear Is Too Pricey
Taiwan Semiconductor Manufacturing Co. will hold off on deploying ASML Holding NV’s most cutting-edge lithography machines for chip production through 2029 to save money. The chipmaker has no plans to adopt ASML’s latest high numerical aperture extreme ultraviolet lithography machines, or high-NA EUV, which fetch upwards of €350 million ($410 million) apiece. TSMC is ASML’s largest customer, according to Bloomberg’s supply chain data. Bloomberg’s Neil Campling reports.
Behavioral Transfer in AI Agents: Evidence and Privacy Implications
arXiv:2604.19925v1 Announce Type: new Abstract: AI agents powered by large language models are increasingly acting on behalf of humans in social and economic environments. Prior research has focused on their task performance and effects on human outcomes, but less is known about the relationship between agents and the specific individuals who deploy them. We ask whether agents systematically reflect the behavioral characteristics of their human owners, functioning as behavioral extensions rather than producing generic outputs. We study this question using 10,659 matched human-agent pairs from Moltbook, a social media platform where each autonomous agent is publicly linked to its owner's Twitter/X account. By comparing agents' posts on Moltbook with their owners' Twitter/X activity across features spanning topics, values, affect, and linguistic style, we find systematic transfer between agents and their specific owners. This transfer persists among agents without explicit configuration, and pairs that align on one behavioral dimension tend to align on others. These patterns are consistent with transfer emerging through accumulated interaction between owners (or owners' computer environments) and their agents in everyday use. We further show that agents with stronger behavioral transfer are more likely to disclose owner-related personal information in public discourse, suggesting that the same owner-specific context that drives behavioral transfer may also create privacy risk during ordinary use. Taken together, our results indicate that AI agents do not simply generate content, but reflect owner-related context in ways that can propagate human behavioral heterogeneity into digital environments, with implications for privacy, platform design, and the governance of agentic systems.
Large Language Models Outperform Humans in Fraud Detection and Resistance to Motivated Investor Pressure
arXiv:2604.20652v1 Announce Type: cross Abstract: Large language models trained on human feedback may suppress fraud warnings when investors arrive already persuaded of a fraudulent opportunity. We tested this in a preregistered experiment across seven leading LLMs and twelve investment scenarios covering legitimate, high-risk, and objectively fraudulent opportunities, combining 3,360 AI advisory conversations with a 1,201-participant human benchmark. Contrary to predictions, motivated investor framing did not suppress AI fraud warnings; if anything, it marginally increased them. Endorsement reversal occurred in fewer than 3 in 1,000 observations. Human advisors endorsed fraudulent investments at baseline rates of 13-14%, versus 0% across all LLMs, and suppressed warnings under pressure at two to four times the AI rate. AI systems currently provide more consistent fraud warnings than lay humans in an identical advisory role.
Model Capability Assessment and Safeguards for Biological Weaponization
arXiv:2604.19811v1 Announce Type: new Abstract: AI leaders and safety reports increasingly warn that advances in model reasoning may enable biological misuse, including by low-expertise users, while major labs describe safeguards as expanding but still evolving rather than settled. This study benchmarks ChatGPT 5.2 Auto, Gemini 3 Pro Thinking, Claude Opus 4.5 and Meta's Muse Spark Thinking on 73 novice-framed, open-ended benign STEM prompts to measure operational intelligence. On benign quantitative tasks, both Gemini and Meta's Muse scored very high; ChatGPT was partially useful but text-thinned, and Claude was sparsest with some apparent false-positive refusals. A second test set detected subtle harmful intent: edge case prompts revealed Gemini's seeming lack of contextual awareness. These results warranted a focused weaponization analysis on Gemini as capability appeared to be outpacing moderation calibration. Gemini was tested across four access environments and reported cases include poison-ivy-to-crowded-transit escalation, poison production and extraction via international-anonymous logged-out AI Mode, and other concerning examples. Biological misuse may become more prevalent as a geopolitical tool, increasing the urgency of U.S. policy responses, especially if model outputs come to be treated as regulated technical data. Guidance is provided for 25 high-risk agents to help distinguish legitimate use cases from higher-risk ones.
From Clerks to Agentic AI: How Will Technology Change the Labor Market in Finance?
arXiv:2604.19833v1 Announce Type: cross Abstract: Financial firms have gone through three major technological waves: computerization in the 1980s and 1990s, the rise of indexing and passive investing in the 2000s and 2010s, and the AI and automation wave from roughly 2015 to the present. This project studies how much labor is required to manage capital across those waves by tracking a simple productivity measure: assets under management per employee. Using a small panel of representative firms, we compare changes in AUM per employee, revenue per employee, and operating expense intensity over time. The goal is not to identify causal effects, but to document stylized facts about how technology changes the scale of asset management work.
A Multi-Plant Machine Learning Framework for Emission Prediction, Forecasting, and Control in Cement Manufacturing
arXiv:2604.19903v1 Announce Type: cross Abstract: Cement production is among the largest contributors to industrial air pollution, emitting ~3 Mt NOx/year. The industry-standard mitigation approach, selective non-catalytic reduction (SNCR), exhibits low NH3 utilization efficiency, resulting in operational inefficiencies and increased reagent costs. Here, we develop a data-driven framework for emission control using large-scale operational data from four cement plants worldwide. Benchmarking nine machine learning architectures, we observe that prediction error varies ~3-5x across plants due to variation in data richness. Incorporating short-term process history nearly triples NOx prediction accuracy, revealing that NOx formation carries substantial process memory, a timescale dependence that is absent in CO and CO2. Further, we develop models that forecast NOx overshoots as early as nine minutes, providing a buffer for operational adjustments. The developed framework controls NOx formation at the source, reducing NH3 consumption in downstream SNCR. Surrogate model projections estimate a ~34-64% reduction in NOx while preserving clinker quality, corresponding to a reduction of ~290 t NOx/year and ~58,000 USD/year in NH3 savings. This work establishes a generalizable framework for data-driven emission control, offering a pathway toward low-emission operation without structural modifications or additional hardware, with potential applicability to other hard-to-abate industries such as steel, glass, and lime.
Top Republican pushes party to shun $300mn AI lobby
Senator Josh Hawley warns of ‘political cost’ if Washington fails to rein in Big Tech and artificial intelligence
The AI digital divide
An FT survey shows the highest-earning workers are adopting the technology in their jobs far faster than others
Nvidia supplier SK Hynix hails ‘structural shift’ after another record quarter
Second-largest memory chipmaker says customers prioritising procurement over pricing amid supply crunch
Tesla boosts spending plans to $25bn as Musk doubles down on AI bet
CEO warns investors to expect ‘very significant’ spending increase on self-driving taxis, trucks, robots and chip factories
Economics & Markets
Google's Liz Reid on Who Will Own Search in a World of AI
Alphabet's head of search on adapting to the rise of LLMs.
Google puts AI agents at heart of its enterprise money-making push | Reuters
Alphabet is deepening a push into enterprise software, signaling to investors at Google's annual cloud conference that AI agents, which are human-like digital assistants, are a linchpin of its strategy to monetize artificial intelligence.
How Alphabet Google Cloud Enterprise Customer Success Business Model Innovation is Reshaping Industries » Business To Mark
Cost Discipline: By opening engineering ... Poland and North Carolina, Google Cloud re-invested the savings into price reductions for end customers. Key Takeaway for Business Leaders: Innovation isn’t just about the product; it’s about the go-to-market engine. If your sales team doesn’t understand your client’s industry vertical, you will lose to someone who does. How does Google Cloud actually make money from AI? Unlike traditional software licensing, the new model is fluid and ...
Anthropic argues using music lyrics to train AI is fair use in motion for summary judgment
Anthropic argued in a motion for summary judgment that using music lyrics to train its AI model Claude is "transformative" fair use. The company also stated that claims under the DMCA fail as a matter of law.
Japan newspaper association calls on Google to allow opt-out of news use in AI search
The Japan Newspaper Publishers & Editors Association is urging Google to provide an opt-out mechanism for news content used in generative AI search services.
Shares of Apple Supplier STMicroelectronics Jump After Strong Quarter
It posted strong first-quarter sales and said revenue growth from artificial intelligence should accelerate in coming months.
Tesla reports mixed financial results as Musk pivots automaker to AI and robots
Figures fail to significantly buoy stock as firm admits ‘significant effort and hard work’ needed to achieve goals

Tesla reported its first-quarter earnings on Wednesday, disclosing some better-than-expected results but faltering in some key areas. The report failed to significantly buoy Tesla’s stock, which has limped along this year while its CEO, Elon Musk, has tried to sell the company’s new vision of humanoid robots and self-driving robotaxis. Its core car business has struggled in the face of competition from Chinese counterparts and backlash against Musk’s close involvement with the Trump administration. “There remains significant effort and hard work to realize our mission of Amazing Abundance,” Tesla said in its report, while claiming that demand for its vehicles was rebounding.
Sinking TSMC ADR Premium Offers Trading Window, UBS Desk Says
The narrowing gap between Taiwan Semiconductor Manufacturing Co.’s Taiwanese shares and its US-listed stock is creating a new trading opportunity, according to a UBS Group AG client note.
Google Unveils Two New AI Chips, Will Invest $750 Million in Agentic AI Adoption | MarketScreener
By Adriano Marchese. Google unveiled its latest custom chips and set up a new $750 million agentic AI partner fund to accelerate the adoption of agentic artificial intelligence. The TPU 8t and...
Generative AI leads surge in regional tech spending | Cyprus Mail
Asia-Pacific AI spending is set to reach $370bn by 2029. International Data Corporation (IDC) projected that artificial intelligence and generative AI spending in Asia-Pacific will rise from $73 billion in 2024 to $370 billion by 2029, marking a fivefold increase driven by rapid ...
Infosys Q4 results preview: Profit may dip QoQ; all eyes on guidance, deal wins | Stock Market News
Infosys will announce its Q4FY26 results on April 23, with investors focused on earnings and growth outlook amid geopolitical risks and generative AI impacts. Profit is estimated at ₹7,508.6 crore, a 4% YoY increase, while revenue may grow 13.7% YoY to ₹46,567 crore.
Prediction: The Nasdaq's AI Stocks Will Outperform the S&P 500 Over the Next 12 Months. Here's What to Buy. | The Motley Fool
The massive investments in AI infrastructure and the growing adoption of AI software solutions, driven by the productivity gains the technology promises, are poised to drive stronger earnings growth for tech companies. A February report from the Nasdaq Index research team noted that the average net income growth of Nasdaq-100 companies in 2025 was well above that of S&P 500 companies. The report further suggests that this trend is poised to continue in 2026...
Nasdaq's AI-focused stocks are expected to outperform the S&P 500 over the next year due to significant investments in AI infrastructure and software adoption. The Nasdaq-100 companies have shown stronger earnings growth compared to the S&P 500, and this trend is likely to continue, driven by companies like CoreWeave and Microsoft benefiting from increased AI demand.
Billionaire Michael Dell started his company in his University of Texas dorm room. Now, he’s betting on AI with a $750 million gift
This gift pushes the Dells’ UT Austin donations to over $1 billion.
AI and Semiconductor Stocks Drive Bifurcated Market Gains – ICO Optics
Kamil Dimmich, partner and portfolio manager at North of South Capital, thinks emerging markets are splitting apart in performance.
12 AI Stocks in Focus on Wall Street: Tesla, Meta, and More - Insider Monkey
Kamil Dimmich, Partner & Portfolio Manager at North of South Capital, recently spoke on CNBC and talked about the divergence in emerging markets’ performance amid geopolitical tensions tied to the Iran...
AI should not drive today’s interest rate decisions
How the technology will affect prices is still uncertain
Diagnosing Urban Street Vitality via a Visual-Semantic and Spatiotemporal Framework for Street-Level Economics
arXiv:2604.19798v1 Announce Type: new Abstract: Micro-scale street-level economic assessment is fundamental for precision spatial resource allocation. While Street View Imagery (SVI) advances urban sensing, existing approaches remain semantically superficial and overlook brand hierarchy heterogeneity and structural recession. To address this, we propose a visual-semantic and field-based spatiotemporal framework, operationalized via the Street Economic Vitality Index (SEVI). Our approach integrates physical and semantic streetscape parsing through instance segmentation of signboards, glass interfaces, and storefront closures. A dual-stage VLM-LLM pipeline standardizes signage into global hierarchies to quantify a spatially smoothed brand premium index. To overcome static SVI limitations, we introduce a temporal lag design using Location-Based Services (LBS) data to capture realized demand. Combined with a category-weighted Gaussian spillover model, we construct a three-dimensional diagnostic system covering Commercial Activity, Spatial Utilization, and Physical Environment. Experiments based on time-lagged geographically weighted regression across eight tidal periods in Nanjing reveal quasi-causal spatiotemporal heterogeneity. Street vibrancy arises from interactions between hierarchical brand clustering and mall-induced externalities. High-quality interfaces show peak attraction during midday and evening, while structural recession produces a lagged nighttime repulsion effect. The framework offers evidence-based support for precision spatial governance.
South Korea’s Economy Rebounds Amid Middle East War Risks
South Korea’s economy rebounded at a stronger-than-expected pace in the first quarter on robust semiconductor exports, signaling that the country’s artificial intelligence-driven buffer remains intact.
Goldman Sachs Says Prolonged War 'Will Hit Europe'
Sharon Bell, senior European equity strategist at Goldman Sachs, discusses the corporate earnings season, the artificial intelligence buildout and the potential impact of a drawn-out Middle East war on Europe's economies. She speaks on Bloomberg Television. (Source: Bloomberg)
AI failure could trigger the next financial crisis, warns Elizabeth Warren
Senator Elizabeth Warren has expressed concerns that systemic failures in AI could pose a significant threat to the stability of the financial sector.
Routine Work, Firm Boundaries, and the Rise of Local Supplier Entry
arXiv:2604.19987v1 Announce Type: new Abstract: Between 2005 and 2019, U.S. business applications rose 40 percent while conversion to employer firms fell by nearly half. We study whether boundary redrawing helps explain this pattern. Structured routine-cognitive work can be governed through deliverables and thinner buyer and supplier interfaces. When such work remains place-bound, outsourcing creates demand for domestic specialist suppliers. Across 722 commuting zones, a one percentage-point higher baseline routine employment share raises applications by 27.8 per 100,000 residents. Realized entry concentrates in micro-establishments, with no startup quality gains. Contract and industry evidence point to local supplier entry, not routine-manual displacement.
AI Revolution in Gaming Favors Giants
The gaming landscape is evolving with AI, benefiting giants like Tencent, Sony, and Ubisoft, while smaller players may struggle as entry barriers diminish.
Soft-Label Governance for Distributional Safety in Multi-Agent Systems
arXiv:2604.19752v1 Announce Type: cross Abstract: Multi-agent AI systems exhibit emergent risks that no single agent produces in isolation. Existing safety frameworks rely on binary classifications of agent behavior, discarding the uncertainty inherent in proxy-based evaluation. We introduce SWARM (System-Wide Assessment of Risk in Multi-agent systems), a simulation framework that replaces binary good/bad labels with soft probabilistic labels p = P(v=+1) ∈ [0,1], enabling continuous-valued payoff computation, toxicity measurement, and governance intervention. SWARM implements a modular governance engine with configurable levers (transaction taxes, circuit breakers, reputation decay, and random audits) and quantifies their effects through probabilistic metrics including expected toxicity E[1−p | accepted] and quality gap E[p | accepted] − E[p | rejected]. Across seven scenarios with five-seed replication, strict governance reduces welfare by over 40% without improving safety. In parallel, aggressively internalizing system externalities collapses total welfare from a baseline of +262 down to −67, while toxicity remains invariant. Circuit breakers require careful calibration; overly restrictive thresholds severely diminish system value, whereas an optimal threshold balances moderate welfare with minimized toxicity. Companion experiments show soft metrics detect proxy gaming by self-optimizing agents passing conventional binary evaluations. This basic governance layer applies to live LLM-backed agents (Concordia entities, Claude, GPT-4o Mini) without modification. Results show distributional safety requires continuous risk metrics, and governance lever calibration involves quantifiable safety-welfare tradeoffs. Source code and project resources are publicly available at https://www.swarm-ai.org/.
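The abstract's two headline metrics are straightforward to compute from soft labels. A minimal sketch, using made-up (p, accepted) pairs rather than anything from SWARM's actual experiments:

```python
# Sketch of SWARM-style soft-label metrics (hypothetical data).
# Each interaction carries a soft label p = P(v=+1) in [0, 1] and an
# accept/reject decision made by the governance layer.

def expected_toxicity(pairs):
    """E[1 - p | accepted]: mean residual risk among accepted interactions."""
    accepted = [p for p, ok in pairs if ok]
    return sum(1 - p for p in accepted) / len(accepted)

def quality_gap(pairs):
    """E[p | accepted] - E[p | rejected]: how well governance separates good from bad."""
    accepted = [p for p, ok in pairs if ok]
    rejected = [p for p, ok in pairs if not ok]
    return sum(accepted) / len(accepted) - sum(rejected) / len(rejected)

# Hypothetical batch: governance accepted the high-p interactions.
batch = [(0.9, True), (0.8, True), (0.7, True), (0.3, False), (0.1, False)]
print(round(expected_toxicity(batch), 3))  # residual toxicity among accepted items
print(round(quality_gap(batch), 3))        # positive gap = governance is filtering usefully
```

A positive quality gap means the governance levers admit systematically better-labeled interactions than they reject; a binary good/bad label would discard exactly this gradation.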
Apple controls the tech sector’s Strait of Hormuz
It may have stumbled in the AI race but the company’s new CEO will find it still has distinct advantages
ServiceNow Targets Security And Revenue Workflows With Armis And Xactly AI - Simply Wall St News
ServiceNow (NYSE:NOW) has completed its acquisition of Armis, extending its security coverage into physical, operational, and cyber-asset environments. The company has also launched a Dispute Management AI Agent in partnership with Xactly, targeting cross-platform revenue workflows.
IT takes D-St on a tumble, AI fears pop up on HCL Q4 miss - The Economic Times
Indian IT stocks experienced a significant decline on Wednesday. This sell-off was triggered by HCL Technologies' disappointing fourth-quarter earnings and subdued future outlook. Investor concerns about Artificial Intelligence's disruptive potential in the sector resurfaced.
Novita AI Emerges as Top Inference Provider, Excels in Scientific Reasoning and Math Accuracy
Novita AI emerges as a leading AI inference provider, supporting over 120 LLMs via a unified API compatible with OpenAI and Anthropic.
How the AI Race Will Impact the Global Luxury Market | Vogue
AI is evolving at different speeds across regions, creating fragmented systems that are reshaping consumer expectations — and forcing luxury brands to think carefully about how they use the tech in each market.
Are you paying an AI ‘swarm tax’? Why single agents often beat complex systems
Enterprise teams building multi-agent AI systems may be paying a compute premium for gains that don't hold up under equal-budget conditions. New Stanford University research finds that single-agent systems match or outperform multi-agent architectures on complex reasoning tasks when both are given the same thinking token budget. However, multi-agent systems come with the added baggage of computational overhead. Because they typically use longer reasoning traces and multiple interactions, it is often unclear whether their reported gains stem from architectural advantages or simply from consuming more resources. To isolate the true driver of performance, researchers at Stanford University compared single-agent systems against multi-agent architectures on complex multi-hop reasoning tasks under equal "thinking token" budgets. Their experiments show that in most cases, single-agent systems match or outperform multi-agent systems when compute is equal. Multi-agent systems gain a competitive edge when a single agent's context becomes too long or corrupted. In practice, this means that a single-agent model with an adequate thinking budget can deliver more efficient, reliable, and cost-effective multi-hop reasoning. Engineering teams should reserve multi-agent systems for scenarios where single agents hit a performance ceiling.

Understanding the single versus multi-agent divide

Multi-agent frameworks, such as planner agents, role-playing systems, or debate swarms, break down a problem by having multiple models operate on partial contexts. These components communicate with each other by passing their answers around. While multi-agent solutions show strong empirical performance, comparing them to single-agent baselines is often an imprecise measurement. Comparisons are heavily confounded by differences in test-time computation. Multi-agent setups require multiple agent interactions and generate longer reasoning traces, meaning they consume significantly more tokens.
Consequently, when a multi-agent system reports higher accuracy, it is difficult to determine if the gains stem from better architecture design or from spending extra compute. Recent studies show that when the compute budget is fixed, elaborate multi-agent strategies frequently underperform compared to strong single-agent baselines. However, these are mostly broad comparisons that don’t account for nuances such as different multi-agent architectures or the difference between prompt and reasoning tokens. “A central point of our paper is that many comparisons between single-agent systems (SAS) and multi-agent systems (MAS) are not apples-to-apples,” paper authors Dat Tran and Douwe Kiela told VentureBeat. “MAS often get more effective test-time computation through extra calls, longer traces, or more coordination steps.”

Revisiting the multi-agent challenge under strict budgets

To create a fair comparison, the Stanford researchers set a strict “thinking token” budget. This metric controls the total number of tokens used exclusively for intermediate reasoning, excluding the initial prompt and the final output. The study evaluated single- and multi-agent systems on multi-hop reasoning tasks, meaning questions that require connecting multiple pieces of disparate information to reach an answer. During their experiments, the researchers noticed that single-agent setups sometimes stop their internal reasoning prematurely, leaving available compute budget unspent. To counter this, they introduced a technique called SAS-L (single-agent system with longer thinking). Rather than jumping to multi-agent orchestration when a model gives up early, the researchers suggest a simple prompt-and-budgeting change. "The engineering idea is simple," Tran and Kiela said. "First, restructure the single-agent prompt so the model is explicitly encouraged to spend its available reasoning budget on pre-answer analysis."
By instructing the model to explicitly identify ambiguities, list candidate interpretations, and test alternatives before committing to a final answer, developers can recover the benefits of collaboration inside a single-agent setup. The results of their experiments confirm that a single agent is the strongest default architecture for multi-hop reasoning tasks. It produces the highest accuracy answers while consuming fewer reasoning tokens. When paired with specific models like Google's Gemini 2.5, the longer-thinking variant produces even better aggregate performance. The researchers rely on a concept called “Data Processing Inequality” to explain why a single agent outperforms a swarm. Multi-agent frameworks introduce inherent communication bottlenecks. Every time information is summarized and handed off between different agents, there is a risk of data loss. In contrast, a single agent reasoning within one continuous context avoids this fragmentation. It retains access to the richest available representation of the task and is thus more information-efficient under a fixed budget. The authors also note that enterprises often overlook the secondary costs of multi-agent systems. "What enterprises often underestimate is that orchestration is not free," they said. "Every additional agent introduces communication overhead, more intermediate text, more opportunities for lossy summarization, and more places for errors to compound." On the other hand, they discovered that multi-agent orchestration is superior when a single agent's environment gets messy. If an enterprise application must handle highly degraded contexts, such as noisy data, long inputs filled with distractors, or corrupted information, a single agent struggles. In these scenarios, the structured filtering, decomposition, and verification of a multi-agent system can recover relevant information more reliably. The study also warns about hidden evaluation traps that falsely inflate multi-agent performance. 
Relying purely on API-reported token counts heavily distorts how much computation an architecture is actually spending. The researchers found these accounting artifacts when testing models like Gemini 2.5, proving this is an active issue for enterprise applications today. "For API models, the situation is trickier because budget accounting can be opaque," the authors said. To evaluate architectures reliably, they advise developers to "log everything, measure the visible reasoning traces where available, use provider-reported reasoning-token counts when exposed, and treat those numbers cautiously."

What it means for developers

If a single-agent system matches the performance of multiple agents under equal reasoning budgets, it wins on total cost of ownership by offering fewer model calls, lower latency, and simpler debugging. Tran and Kiela warn that without this baseline, "some enterprises may be paying a large 'swarm tax' for architectures whose apparent advantage is really coming from spending more computation rather than reasoning more effectively." Another way to look at the decision boundary is not how complex the overall task is, but rather where the exact bottleneck lies. "If it is mainly reasoning depth, SAS is often enough. If it is context fragmentation or degradation, MAS becomes more defensible," Tran said. Engineering teams should stay with a single agent when a task can be handled within one coherent context window. Multi-agent systems become necessary when an application handles highly degraded contexts. Looking ahead, multi-agent frameworks will not disappear, but their role will evolve as frontier models improve their internal reasoning capabilities. "The main takeaway from our paper is that multi-agent structure should be treated as a targeted engineering choice for specific bottlenecks, not as a default assumption that more agents automatically means better intelligence," Tran said.
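The budget-matched framing can be illustrated with toy arithmetic. The numbers and the flat per-handoff cost below are illustrative assumptions, not figures from the paper:

```python
# Back-of-envelope "swarm tax" accounting in the spirit of the Stanford
# comparison. All numbers are hypothetical; the point is that handoffs
# consume part of a fixed thinking-token budget before any reasoning happens.

def effective_reasoning_tokens(budget, n_agents, handoff_tokens):
    """Tokens left for actual reasoning after paying per-handoff summaries.

    A single agent (n_agents=1) makes no handoffs; a multi-agent system
    pays handoff_tokens for each of its n_agents - 1 hand-offs.
    """
    overhead = (n_agents - 1) * handoff_tokens
    return max(budget - overhead, 0)

budget = 8_000  # fixed thinking-token budget shared by both architectures
single = effective_reasoning_tokens(budget, n_agents=1, handoff_tokens=500)
swarm = effective_reasoning_tokens(budget, n_agents=5, handoff_tokens=500)

print(single)  # the full budget goes to reasoning
print(swarm)   # 2,000 tokens lost to coordination
print(f"swarm tax: {(single - swarm) / single:.0%}")
```

Under this toy accounting the five-agent system pays a 25% "swarm tax" before doing any work, which is the kind of confound the equal-budget comparison is designed to remove; it says nothing about the cases where handoffs genuinely help, such as degraded or fragmented contexts.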
Do Small Language Models Know When They're Wrong? Confidence-Based Cascade Scoring for Educational Assessment
arXiv:2604.19781v1 Announce Type: new Abstract: Automated scoring of student work at scale requires balancing accuracy against cost and latency. In "cascade" systems, small language models (LMs) handle easier scoring tasks while escalating harder ones to larger LMs -- but the challenge is determining which cases to escalate. We explore verbalized confidence -- asking the LM to state a numerical confidence alongside its prediction -- as a routing signal. Using 2,100 expert-scored decisions from student-AI math conversations, we evaluate cascade systems built from GPT-5.4, Claude 4.5+, and Gemini 3.1 model pairs. We find that: (1) confidence discrimination varies widely across small LMs, with the best achieving AUROC 0.857 and the worst producing a near-degenerate confidence distribution; (2) confidence tracks human scoring difficulty, with lower LM confidence where annotators disagreed and took longer to score; (3) the best cascade approached large-LM accuracy (kappa 0.802 vs. 0.819) at 76% lower cost and 61% lower latency. Confidence discrimination is the bottleneck: the two small LMs with meaningful confidence variance yielded cascades with no statistically detectable kappa loss, while the third -- whose confidence was near-degenerate -- could not close the accuracy gap regardless of threshold. Small LMs with strong discrimination let practitioners trade cost for accuracy along the frontier; those without it do not.
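The routing rule the abstract describes reduces to a single threshold on the small model's verbalized confidence. A minimal sketch with stubbed model calls; the threshold value, stub behavior, and scores are hypothetical, not taken from the paper:

```python
# Sketch of a confidence-gated cascade router: a small LM scores first and
# states a verbalized confidence; low-confidence items escalate to a large LM.

def cascade_score(item, small_lm, large_lm, threshold=0.8):
    """Return (score, used_large): escalate when the small LM is unsure."""
    score, confidence = small_lm(item)
    if confidence >= threshold:
        return score, False
    return large_lm(item), True

# Hypothetical stubs: the small LM is confident only on item "a".
small_lm = lambda item: (1, 0.95) if item == "a" else (0, 0.40)
large_lm = lambda item: 1

for item in ["a", "b"]:
    score, escalated = cascade_score(item, small_lm, large_lm)
    print(item, score, "escalated" if escalated else "kept local")
```

The paper's finding maps directly onto this structure: the cascade only saves cost without losing accuracy when the small model's stated confidence actually discriminates easy from hard cases; with a near-degenerate confidence distribution, no threshold choice recovers the gap.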
AI's new bottleneck: Compute power, not models, says Goldman Sachs | Asianet Newsable
A Goldman Sachs report reveals AI's rapid expansion is now constrained by computing power availability and cost, not model quality. As demand for AI applications and inference use cases accelerates, access to compute has become the key differentiator.
AI Pioneers Back Startup Building Models to Predict Events - Bloomberg
Sooth Labs, a new artificial intelligence lab founded by former Meta Platforms Inc. employees, is raising about $50 million in funding to build AI models meant to help businesses forecast the likelihood of specific geopolitical and market events taking place.
Cursor’s 25-year-old CEO is a former Google intern who just inked a $60 billion deal with SpaceX
From Google intern at 18 to a billionaire at 25, Cursor CEO Michael Truell’s rise is one of Silicon Valley’s fastest.
Top VC to AI Startup Founders: Sell While the Boom Lasts - Business Insider
Venture capitalist Elad Gil is urging AI startups to consider selling soon due to the potential for changing market ...
Ex-Stripe Team at Seapoint Raises €7.5M Seed to Launch the Financial Home for Europe’s Startup Founders
Seapoint, the AI-native financial operations platform built for Europe’s most ambitious startups, today announced a €7.5 million seed round, bringing total funding to €10 million in just over a year. Seapoint will utilise these new funds […]
France’s Univity raises €27m to let European telecoms compete with Starlink
The Paris-based startup wants to build the space equivalent of shared mobile infrastructure, allowing operators to offer satellite connectivity without handing the keys to Starlink.
Lisbon-based DOJO AI secures €5.1 million to expand its AI marketing platform in the U.S.
DOJO AI, a Portuguese intelligent marketing system that brings integrated AI to marketing teams, today announced a €5.1 million ($6 million) Seed round at a €25 million ($30 million) valuation. The round was led by Armilar, with participation from Heartfelt VC. The funding will support continued product development and accelerated expansion in the United States. […]
A Better Way To Fail: How This Platform Aims To Turn Startup Shutdowns Into Something Salvageable
Los Angeles-based SimpleClosure has launched Asset Hub, a marketplace aimed at helping founders sell assets such as source code, data and equipment during the wind-down process. Crunchbase News spoke with founder Dori Yona about the new offering as closures rise and investors place greater ...
Builder.ai founder Sachin Dev Duggal accused of receiving siphoned funds
Indian authorities name UK start-up founder in criminal complaint over ties to collapsed electronics group
Ex-Dunzo Chief Kabeer Biswas Raises ₹102 Cr for AI Concierge Startup 'M' | Whalesbook
Kabeer Biswas, former co-founder of Dunzo, is back in the startup world with 'M', securing significant seed funding to automate household management. This capital injection shows strong investor confidence in his vision and the growing appeal of AI consumer platforms, a sector drawing substantial venture capital attention in 2026. 'M' aims to simplify daily consumer services by automating decisions and coordination, addressing a clear market ...
AI is spitting out more potential drugs than ever. This start-up wants to figure out which ones matter. | TechCrunch
10x Science has raised a $4.8 million seed round to help pharmaceutical researchers understand complex molecules.
Labor, Society & Culture
Samsung Rally Draws 30,000 to Demand Greater Share of AI Profits
Tens of thousands of people gathered outside Samsung Electronics Co.’s main chip hub to demand employees get a greater share of profits reaped from the AI boom.
Bias in the Tails: How Name-conditioned Evaluative Framing in Resume Summaries Destabilizes LLM-based Hiring
arXiv:2604.19984v1 Announce Type: new Abstract: Research has documented LLMs' name-based bias in hiring and salary recommendations. In this paper, we instead consider a setting where LLMs generate candidate summaries for downstream assessment. In a large-scale controlled study, we analyze nearly one million resume summaries produced by 4 models under systematic race-gender name perturbations, using synthetic resumes and real-world job postings. By decomposing each summary into resume-grounded factual content and evaluative framing, we find that factual content remains largely stable, while evaluative language exhibits subtle name-conditioned variation concentrated in the extremes of the distribution, especially in open-source models. Our hiring simulation demonstrates how evaluative summaries transform directional harm into symmetric instability that might evade conventional fairness audits, highlighting a potential pathway for LLM-to-LLM automation bias.
Australia’s Biggest Bank to Cut 120 More Jobs Amid AI Push
Commonwealth Bank of Australia will eliminate around 120 roles amid a broader push to harness artificial intelligence at the nation’s largest lender.
Measuring Creativity in the Age of Generative AI: Distinguishing Human and AI-Generated Creative Performance in Hiring and Talent Systems
arXiv:2604.19799v1 Announce Type: cross Abstract: Generative AI is rapidly transforming how organizations create value and evaluate talent. While large language models enhance baseline output quality, they simultaneously introduce ambiguity in assessing human creativity, as observable artifacts may be partially or fully AI-generated. This paper reconceptualizes creativity as a distributional and process-based property that emerges under shared constraints and competitive incentives. We introduce a quantitative framework for measuring creativity as novelty in synthesis, operationalized through idea generation and idea transformation within embedding space. Empirical evaluation demonstrates that the proposed metrics align with intuitive judgments of creativity while capturing distinctions that surface-level quality assessments miss. We further identify a structural shift toward bimodal distributions of creative output in AI-mediated environments, with implications for hiring, leadership, and competitive strategy. The findings suggest that in the age of generative AI, distinctiveness rather than fluency becomes the primary signal of human creative capability.
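One way to make "novelty in synthesis" concrete is as distance in embedding space; the paper's exact operationalization may differ, so this is a hedged sketch in which an idea is scored by its cosine distance to the nearest prior idea produced under the same constraints.

```python
# Hedged sketch: novelty as cosine distance to the nearest neighbor among
# prior ideas' embeddings. The two-dimensional vectors are toy illustrations,
# not real embeddings.
import numpy as np

def novelty(idea_vec, prior_vecs):
    """1 minus cosine similarity to the nearest neighbor among prior ideas."""
    prior = np.asarray(prior_vecs, dtype=float)
    idea = np.asarray(idea_vec, dtype=float)
    sims = prior @ idea / (np.linalg.norm(prior, axis=1) * np.linalg.norm(idea))
    return 1.0 - sims.max()

prior = [[1.0, 0.0], [0.9, 0.1]]         # embeddings of earlier ideas
familiar = novelty([1.0, 0.0], prior)    # duplicates a prior idea -> ~0
distinct = novelty([0.0, 1.0], prior)    # far from all priors -> high
```

A distributional reading follows naturally: scoring every submission this way and inspecting the histogram is one way the bimodality the authors describe would surface.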
Nobel Prize-winning economist on Anthropic CEO's white-collar jobs wipeout: He may be underestimating how messy some of those jobs are - The Times of India
The debate over the impact of AI on jobs is intensifying, with Nobel Prize-winning economist Daron Acemoglu challenging Anthropic CEO Dario Amodei’s p.
AI Is Booming and Cutting Jobs. Both Things Are True | SUCCESS
AI funding hit $297B while 95,000+ tech jobs vanished. Here’s what that contradiction actually means for your career in 2026.
Nvidia CEO Jensen Huang: ‘Most people will lose their job to somebody who uses AI’—not to AI itself
Jensen Huang argues that the primary threat to workers is not AI automation, but rather competition from individuals who effectively leverage AI tools.
AI poses the biggest threat to service sector jobs - Digital Journal
People think factory workers face the biggest automation threat, but the data shows service jobs are more at risk.
AI Incident Monitoring through a Public Health Lens
arXiv:2604.19914v1 Announce Type: new Abstract: Artificial intelligence systems are now deployed at scale across sectors, accompanied by a growing number of real-world incidents ranging from misinformation and cybercrime to autonomous-system failures. Databases of AI incidents index these events, but they cannot measure "risk" (i.e., a joint measure of likelihood and severity) without additional data regarding the prevalence of risk-associated systems and their incident reporting rates. As a result, policymakers, companies, and the general public lack a means to weigh the benefits of AI against their in-context risks. Inspired by public-health processes, which presume noisy and incomplete disease surveillance, we identify six phases of incident emergence. We demonstrate the framework through a detailed case study of autonomous vehicles, whose mandatory reporting requirements produce reliable incident-rate ground truth expressed in distance traveled. The case study shows that an informed panel of domain experts (e.g., self-driving experts) can combine their domain expertise, incident data, and a collection of statistical and visualization tools to arrive at incident phase determinations serving public needs. We further demonstrate the approach with a deepfake incident case study and chart a path for future research in incident phase determination.
No 'kill switch' to block US military's use of Claude, Anthropic tells DC Circuit
Anthropic told a US appeals court that it cannot control how the military uses its technology and that there is no 'kill switch' it could deploy once its model is used by the Defense Department.
Large Language Models Outperform Humans in Fraud Detection and Resistance to Motivated Investor Pressure
arXiv:2604.20652v1 Announce Type: cross Abstract: Large language models trained on human feedback may suppress fraud warnings when investors arrive already persuaded of a fraudulent opportunity. We tested this in a preregistered experiment across seven leading LLMs and twelve investment scenarios covering legitimate, high-risk, and objectively fraudulent opportunities, combining 3,360 AI advisory conversations with a 1,201-participant human benchmark. Contrary to predictions, motivated investor framing did not suppress AI fraud warnings; if anything, it marginally increased them. Endorsement reversal occurred in fewer than 3 in 1,000 observations. Human advisors endorsed fraudulent investments at baseline rates of 13-14%, versus 0% across all LLMs, and suppressed warnings under pressure at two to four times the AI rate. AI systems currently provide more consistent fraud warnings than lay humans in an identical advisory role.
Leaked Code for Anthropic’s Claude Code Tests Copyright Challenges in A.I. Era
Artificial intelligence tools are making it faster than ever to reproduce creative work. Does copyright even matter anymore?
Can LLMs Infer Conversational Agent Users' Personality Traits from Chat History?
arXiv:2604.19785v1 Announce Type: cross Abstract: Sensitive information, such as knowledge about an individual's personality, can be misused to influence behavior (e.g., via personalized messaging). To assess to what extent an individual's personality can be inferred from user interactions with LLM-based conversational agents (CAs), we analyze and quantify related privacy risks of using CAs. We collected actual ChatGPT logs from N=668 participants, containing 62,090 individual chats, and report statistics about the different types of shared data and use cases. We fine-tuned RoBERTa-base text classification models to infer personality traits from CA interactions. The findings show that these models achieve trait inference with accuracy (ternary classification) better than random in multiple cases. For example, for extraversion, accuracy improves by +44% relative to the baseline on interactions for relationships and personal reflection. This research highlights how interactions with CAs pose privacy risks and provides fine-grained insights into the level of risk associated with different types of interactions.
OpenAI now lets you screenshot your privacy in the foot
Make your model smarter through self-surveillance. Those who cannot remember Microsoft Recall are condemned to repeat it. …
Model Capability Assessment and Safeguards for Biological Weaponization
arXiv:2604.19811v1 Announce Type: new Abstract: AI leaders and safety reports increasingly warn that advances in model reasoning may enable biological misuse, including by low-expertise users, while major labs describe safeguards as expanding but still evolving rather than settled. This study benchmarks ChatGPT 5.2 Auto, Gemini 3 Pro Thinking, Claude Opus 4.5 and Meta's Muse Spark Thinking on 73 novice-framed, open-ended benign STEM prompts to measure operational intelligence. On benign quantitative tasks, both Gemini and Muse scored very high; ChatGPT was partially useful but text-thinned, and Claude was sparsest with some apparent false-positive refusals. A second test set detected subtle harmful intent: edge case prompts revealed Gemini's seeming lack of contextual awareness. These results warranted a focused weaponization analysis on Gemini as capability appeared to be outpacing moderation calibration. Gemini was tested across four access environments and reported cases include poison-ivy-to-crowded-transit escalation, poison production and extraction via international-anonymous logged-out AI Mode, and other concerning examples. Biological misuse may become more prevalent as a geopolitical tool, increasing the urgency of U.S. policy responses, especially if model outputs come to be treated as regulated technical data. Guidance is provided for 25 high-risk agents to help distinguish legitimate use cases from higher-risk ones.
Technology & Infrastructure
The agentic transition: how enterprises are scaling AI from pilot to profit | Domain-b.com
AI has entered its execution era. Discover how companies like Valeo and Microsoft are scaling agentic AI systems—from copilots to autonomous workflows driving
OpenAI Unveils Workspace Agents in ChatGPT
OpenAI introduces Workspace Agents in ChatGPT, enabling teams to automate complex workflows using Codex-powered agents.
Joe’s Take: Agentic AI is Here – Is Your Business Process Ready for Autonomous Workers? - New York Computer Help
Are you still thinking of AI as a chatbot that writes your emails or generates "pretty good" images for your presentations? If so, you’re already
A Field Guide to Decision Making
arXiv:2604.20669v1 Announce Type: new Abstract: High-consequence decision making demands peak performance from individuals in positions of responsibility. Such executive authority bears the obligation to act despite uncertainty, limited resources, time constraints, and accountability risks. Tools and strategies to motivate confidence and foster risk tolerance must confront informational noise and can provide qualified accountability. Machine intelligence augments human cognition and perception to improve situational awareness, decision framing, flexibility, and coherence through agentic stewardship of contextual metadata. We examine systemic and behavioral factors crucial to address in scenarios encumbered by complexity, uncertainty, and urgency.
From Data to Theory: Autonomous Large Language Model Agents for Materials Science
arXiv:2604.19789v1 Announce Type: new Abstract: We present an autonomous large language model (LLM) agent for end-to-end, data-driven materials theory development. The model can choose an equation form, generate and run its own code, and test how well the theory matches the data without human intervention. The framework combines step-by-step reasoning with expert-supplied tools, allowing the agent to adjust its approach as needed while keeping a clear record of its decisions. For well-established materials relationships such as the Hall-Petch equation and Paris law, the agent correctly identifies the governing equation and makes reliable predictions on new datasets. For more specialized relationships, such as Kuhn's equation for the HOMO-LUMO gap of conjugated molecules as a function of length, performance depends more strongly on the underlying model, with GPT-5 showing better recovery of the correct equation. Beyond known theories, the agent can also suggest new predictive relationships, illustrated here by a strain-dependent law for changes in the HOMO-LUMO gap. At the same time, the results show that careful validation remains essential, because the agent can still return incorrect, incomplete, or inconsistent equations even when the numerical fit appears strong. Overall, these results highlight both the promise and the current limitations of autonomous LLM agents for AI-assisted scientific modeling and discovery.
From Actions to Understanding: Conformal Interpretability of Temporal Concepts in LLM Agents
arXiv:2604.19775v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly deployed as autonomous agents capable of reasoning, planning, and acting within interactive environments. Despite their growing capability to perform multi-step reasoning and decision-making tasks, internal mechanisms guiding their sequential behavior remain opaque. This paper presents a framework for interpreting the temporal evolution of concepts in LLM agents through a step-wise conformal lens. We introduce the conformal interpretability framework for temporal tasks, which combines step-wise reward modeling with conformal prediction to statistically label model's internal representation at each step as successful or failing. Linear probes are then trained on these representations to identify directions of temporal concepts - latent directions in the model's activation space that correspond to consistent notions of success, failure or reasoning drift. Experimental results on two simulated interactive environments, namely ScienceWorld and AlfWorld, demonstrate that these temporal concepts are linearly separable, revealing interpretable structures aligned with task success. We further show preliminary results on improving an LLM agent's performance by leveraging the proposed framework for steering the identified successful directions inside the model. The proposed approach, thus, offers a principled method for early failure detection as well as intervention in LLM-based agents, paving the path towards trustworthy autonomous language models in complex interactive settings.
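The conformal labeling step can be sketched as a simple calibration-then-threshold procedure. The names and the quantile convention below are assumptions for illustration, not the paper's code.

```python
# Illustrative sketch of step-wise conformal labeling: calibrate a threshold
# on held-out step-reward scores, then flag steps whose score falls below it
# as "failure". The quantile convention here is one common choice.
import math

def conformal_threshold(calibration_scores, alpha=0.1):
    """A finite-sample lower quantile of the calibration scores."""
    s = sorted(calibration_scores)
    k = math.floor(alpha * (len(s) + 1)) - 1   # conservative rank
    return s[max(k, 0)]

def label_steps(step_scores, tau):
    return ["success" if sc >= tau else "failure" for sc in step_scores]

calibration = [0.2, 0.5, 0.55, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]
tau = conformal_threshold(calibration, alpha=0.2)
labels = label_steps([0.1, 0.9], tau)   # early-failure signal per step
```

Probes trained on the step representations behind those labels are what yield the linear "success/failure" directions the paper reports.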
OpenCLAW-P2P v6.0: Resilient Multi-Layer Persistence, Live Reference Verification, and Production-Scale Evaluation of Decentralized AI Peer Review
arXiv:2604.19792v1 Announce Type: new Abstract: This paper presents OpenCLAW-P2P v6.0, a comprehensive evolution of the decentralized collective-intelligence platform in which autonomous AI agents publish, peer-review, score, and iteratively improve scientific research papers without any human gatekeeper. Building on v5.0 foundations -- tribunal-gated publishing, multi-LLM granular scoring, calibrated deception detection, the Silicon Chess-Grid FSM, and the AETHER containerized inference engine -- this release introduces four major new subsystems: (1) a multi-layer paper persistence architecture with four storage tiers (in-memory cache, Cloudflare R2, Gun.js, GitHub) ensuring zero paper loss across redeployments; (2) a multi-layer retrieval cascade with automatic backfill reducing lookup latency from >3s to 85% accuracy; and (4) a scientific API proxy providing rate-limited cached access to seven public databases. The platform operates with 14 real autonomous agents producing 50+ scored papers (word counts 2,072-4,073, leaderboard scores 6.4-8.1) alongside 23 labeled simulated citizens. We present honest production statistics, failure-mode analysis, a paper recovery protocol that salvaged 25 lost papers, and lessons learned from operating the system at scale. All pre-existing subsystems -- 17-judge multi-LLM scoring, 14-rule calibration with 8 deception detectors, tribunal cognitive examination, Proof of Value consensus, Laws-of-Form eigenform verification, and tau-normalized agent coordination -- are retained and further hardened. All code is open-source at https://github.com/Agnuxo1/p2pclaw-mcp-server.
AI race intensifies with Google’s new agent management platform
The company also launched the latest iteration of its TPUs.
ABB Robotics launches high-speed PoWa cobot family | RoboticsTomorrow
• New, high-speed, higher payload PoWa cobot family meets need for industrial-grade performance in collaborative robotics, lowering the barrier to automation for both SMEs and large enterprises • Payloads from 7kg to 30kg, best-in-class top speed of 5.8 m/s, longest reach and highest arm ...
What are agentic workflows? Everything to know - Tricentis
Learn what agentic workflows are, how AI agents coordinate tasks, and how teams use them in modern software delivery.
Nokia Reports Rising Sales From AI and Data-Center Customers
Nokia sees overall sales in the network infrastructure business growing 12%-14% this year, having previously expected 6%-8%.
SK Hynix’s aspirations for ’Merica-made HBM inch closer to reality
New site set to begin manufacturing and testing HBM memory just in time for Nvidia's Rubin-Ultra GPUs in 2028 SK Hynix has reportedly broken ground on a new advanced memory packaging facility in West Lafayette, Indiana, that should boost the supply of US-made high-bandwidth memory (HBM), a key component in high-end AI accelerators from the likes of Nvidia and AMD.…
Exclusive: SpaceX says unproven AI space data centers may not be commercially viable, filing shows | Reuters
In February, after announcing a merger between SpaceX and his social media and artificial intelligence firm xAI, Elon Musk said "space-based AI is obviously the only way to scale".
Explained: Why China’s AI computing power looks 6,000x bigger - The Times of India
China has reported a massive leap in its domestic artificial intelligence (AI) computing power, with official figures suggesting capacity far beyond w.
Nvidia supplier SK Hynix hails ‘structural shift’ after another record quarter
Second-largest memory chipmaker says customers prioritising procurement over pricing amid supply crunch
Google unifies Gemini Enterprise, debuts new chips
Google announced a new generation of Tensor chips for training and inference, alongside a consolidated Gemini Enterprise Agent Platform.
Navitas Semiconductor Stock Soars 8.81% as AI Power Demand Fuels Momentum Ahead of Earnings
These materials enable faster, ... them particularly valuable for reducing energy consumption and heat in data centers running intensive AI workloads. As hyperscale operators and cloud providers ramp up spending on AI infrastructure, demand for efficient power solutions has ...
Report: Data Center Electricity Demand Rising, Impacting Grid Capacity and Policy Debates – NaturalNews.com
U.S. Data Center Electricity Consumption Increases, Impacting Grid Planning
Electricity consumption by data centers in the United States has significantly increased, posing challenges for grid operators and utilities, according to a recent report. The International Energy Agency (IEA) found ...
LLM Agents Predict Social Media Reactions but Do Not Outperform Text Classifiers: Benchmarking Simulation Accuracy Using 120K+ Personas of 1511 Humans
arXiv:2604.19787v1 Announce Type: cross Abstract: Social media platforms mediate how billions form opinions and engage with public discourse. As autonomous AI agents increasingly participate in these spaces, understanding their behavioral fidelity becomes critical for platform governance and democratic resilience. Previous work demonstrates that LLM-powered agents can replicate aggregate survey responses, yet few studies test whether agents can predict specific individuals' reactions to specific content. This study benchmarks LLM-based agents' accuracy in predicting human social media reactions (like, dislike, comment, share, no reaction) across 120,000+ unique agent-persona combinations derived from 1,511 Serbian participants and 27 large language models. In Study 1, agents achieved 70.7% overall accuracy, with LLM choice producing a 13 percentage-point performance spread. Study 2 employed binary forced-choice (like/dislike) evaluation with chance-corrected metrics. Agents achieved Matthews Correlation Coefficient (MCC) of 0.29, indicating genuine predictive signal beyond chance. However, conventional text-based supervised classifiers using TF-IDF representations outperformed LLM agents (MCC of 0.36), suggesting predictive gains reflect semantic access rather than uniquely agentic reasoning. The genuine predictive validity of zero-shot persona-prompted agents warns against potential manipulation through easily deploying swarms of behaviorally distinct AI agents on social media, while simultaneously offering opportunities to use such agents in simulations for predicting polarization dynamics and informing AI policy. The advantage of using zero-shot agents is that they require no task-specific training, making their large-scale deployment easy across diverse contexts. Limitations include single-country sampling. Future research should explore multilingual testing and fine-tuning approaches.
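The chance-corrected metric used in Study 2, the Matthews Correlation Coefficient, is easy to compute from a confusion matrix. The counts below are invented for illustration; only the formula is standard.

```python
# MCC over binary like/dislike predictions: 0 at chance level, 1 at perfect
# agreement. The confusion-matrix counts are made up for illustration.
import math

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

no_signal = mcc(tp=50, tn=50, fp=50, fn=50)      # chance-level predictor
with_signal = mcc(tp=80, tn=70, fp=30, fn=20)    # genuinely informative one
```

On this scale, the reported agent MCC of 0.29 versus 0.36 for the TF-IDF classifiers is a real but modest gap, which is why the authors attribute the gains to semantic access rather than uniquely agentic reasoning.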
The Tool-Overuse Illusion: Why Does LLM Prefer External Tools over Internal Knowledge?
arXiv:2604.19749v1 Announce Type: new Abstract: Equipping LLMs with external tools effectively addresses internal reasoning limitations. However, it introduces a critical yet under-explored phenomenon: tool overuse, the unnecessary use of tools during reasoning. In this paper, we first reveal this phenomenon is pervasive across diverse LLMs. We then experimentally elucidate its underlying mechanisms through two key lenses: (1) First, by analyzing tool-use behavior across different internal knowledge availability regions, we identify a "knowledge epistemic illusion": models misjudge internal knowledge boundaries and fail to accurately perceive their actual knowledge availability. To mitigate this, we propose a knowledge-aware epistemic boundary alignment strategy based on direct preference optimization, which reduces tool usage by 82.8% while yielding an accuracy improvement. (2) Second, we establish a causal link between reward structures and tool-use behavior by visualizing the tool-augmented training process. It reveals that outcome-only rewards inadvertently encourage tool overuse by rewarding only final correctness, regardless of tool efficiency. To verify this, we balance reward signals during training rather than relying on outcome-only rewards, cutting unnecessary tool calls by 66.7% (7B) and 60.7% (32B) without sacrificing accuracy. Finally, we provide theoretical justification in these two lenses to understand tool overuse.
ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models
arXiv:2604.19758v1 Announce Type: new Abstract: We present ThermoQA, a benchmark of 293 open-ended engineering thermodynamics problems in three tiers: property lookups (110 Q), component analysis (101 Q), and full cycle analysis (82 Q). Ground truth is computed programmatically from CoolProp 7.2.0, covering water, R-134a, and variable-cp air. Six frontier LLMs are evaluated across three independent runs each. The composite leaderboard is led by Claude Opus 4.6 (94.1%), GPT-5.4 (93.1%), and Gemini 3.1 Pro (92.5%). Cross-tier degradation ranges from 2.8 pp (Opus) to 32.5 pp (MiniMax), confirming that property memorization does not imply thermodynamic reasoning. Supercritical water, R-134a refrigerant, and combined-cycle gas turbine analysis serve as natural discriminators with 40-60 pp performance spreads. Multi-run sigma ranges from +/-0.1% to +/-2.5%, quantifying reasoning consistency as a distinct evaluation axis. Dataset and code are open-source at https://huggingface.co/datasets/olivenet/thermoqa
Algorithm Selection with Zero Domain Knowledge via Text Embeddings
arXiv:2604.19753v1 Announce Type: new Abstract: We propose a feature-free approach to algorithm selection that replaces hand-crafted instance features with pretrained text embeddings. Our method, ZeroFolio, proceeds in three steps: it reads the raw instance file as plain text, embeds it with a pretrained embedding model, and selects an algorithm via weighted k-nearest neighbors. The key to our approach is the observation that pretrained embeddings produce representations that distinguish problem instances without any domain knowledge or task-specific training. This allows us to apply the same three-step pipeline (serialize, embed, select) across diverse problem domains with text-based instance formats. We evaluate our approach on 11 ASlib scenarios spanning 7 domains (SAT, MaxSAT, QBF, ASP, CSP, MIP, and graph problems). Our experiments show that this approach outperforms a random forest trained on hand-crafted features in 10 of 11 scenarios with a single fixed configuration, and in all 11 with two-seed voting; the margin is often substantial. Our ablation study shows that inverse-distance weighting, line shuffling, and Manhattan distance are the key design choices. On scenarios where both selectors are competitive, combining embeddings with hand-crafted features via soft voting yields further improvements.
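The serialize-embed-select pipeline can be sketched in a few lines. The bag-of-characters "embedding" below is a toy stand-in for a pretrained text embedder, and the solver names and instances are invented; only the pipeline shape (read raw text, embed, weighted k-NN vote) follows the paper.

```python
# Minimal sketch of feature-free algorithm selection: embed the raw instance
# text, then pick an algorithm via inverse-distance-weighted k-NN over solved
# training instances. embed() is a toy stand-in, not a real pretrained model.
from collections import defaultdict

def embed(text, dim=16):
    """Toy stand-in for a pretrained embedding: bag of characters mod dim."""
    v = [0.0] * dim
    for ch in text:
        v[ord(ch) % dim] += 1.0
    return v

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def select_algorithm(instance_text, training, k=3):
    """training: list of (raw instance text, best-known algorithm) pairs."""
    nearest = sorted((manhattan(embed(instance_text), embed(t)), algo)
                     for t, algo in training)[:k]
    votes = defaultdict(float)
    for dist, algo in nearest:
        votes[algo] += 1.0 / (dist + 1e-9)   # inverse-distance weighting
    return max(votes, key=votes.get)

training = [("p cnf 3 2\n1 -2 0\n2 3 0", "sat_solver_a"),
            ("p cnf 3 2\n1 -3 0\n2 3 0", "sat_solver_a"),
            ("maximize x + y\nx <= 4", "mip_solver_b")]
choice = select_algorithm("p cnf 3 2\n1 -2 0\n-2 3 0", training)
# The query reads like the CNF instances, so choice == "sat_solver_a".
```

Manhattan distance and inverse-distance weighting are used here because the ablation singles them out as the key design choices.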
Teaching AI models to say “I’m not sure”
Researchers are developing methods to help AI models recognize when they lack sufficient information, allowing them to express uncertainty rather than hallucinating answers.
AI & Tech Brief: The post-LLM era begins - The Washington Post
Takeaway #1: A new kind of AI model promises to surpass the abilities of large language models (LLMs) that we use today.
AI-powered robot beats elite table tennis players
In a feat hailed as a milestone in robotics, Sony AI’s Ace wins three out of five matches played under official rules. An AI-powered robot has beaten elite players at table tennis in a significant achievement for a machine faced with human athletes in a real-world competitive sport. Named Ace, the robotic system developed by Sony AI won three out of five matches against elite players, but lost the two it played against professionals, clawing back only one game in the seven contests.
Explainable AML Triage with LLMs: Evidence Retrieval and Counterfactual Checks
arXiv:2604.19755v1 Announce Type: new Abstract: Anti-money laundering (AML) transaction monitoring generates large volumes of alerts that must be rapidly triaged by investigators under strict audit and governance constraints. While large language models (LLMs) can summarize heterogeneous evidence and draft rationales, unconstrained generation is risky in regulated workflows due to hallucinations, weak provenance, and explanations that are not faithful to the underlying decision. We propose an explainable AML triage framework that treats triage as an evidence-constrained decision process. Our method combines (i) retrieval-augmented evidence bundling from policy/typology guidance, customer context, alert triggers, and transaction subgraphs, (ii) a structured LLM output contract that requires explicit citations and separates supporting from contradicting or missing evidence, and (iii) counterfactual checks that validate whether minimal, plausible perturbations lead to coherent changes in both the triage recommendation and its rationale. We evaluate on public synthetic AML benchmarks and simulators and compare against rules, tabular and graph machine-learning baselines, and LLM-only/RAG-only variants. Results show that evidence grounding substantially improves auditability and reduces numerical and policy hallucination errors, while counterfactual validation further increases decision-linked explainability and robustness, yielding the best overall triage performance (PR-AUC 0.75; Escalate F1 0.62) and strong provenance and faithfulness metrics (citation validity 0.98; evidence support 0.88; counterfactual faithfulness 0.76). These findings indicate that governed, verifiable LLM systems can provide practical decision support for AML triage without sacrificing compliance requirements for traceability and defensibility.
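The "structured output contract" idea can be sketched as a validator over the LLM's output. The field names below are assumptions, not the paper's schema; the point is that every claim must cite an evidence ID from the retrieved bundle, and hallucinated citations are rejected before the rationale reaches an investigator.

```python
# Illustrative sketch of an evidence-constrained output contract for AML
# triage: the decision must be one of a closed set, and every supporting or
# contradicting citation must resolve to an ID in the evidence bundle.
def validate_triage(output, evidence_bundle):
    valid_ids = {e["id"] for e in evidence_bundle}
    errors = []
    if output.get("decision") not in {"escalate", "dismiss", "monitor"}:
        errors.append("unknown decision")
    cited = set(output.get("supporting", [])) | set(output.get("contradicting", []))
    if not cited:
        errors.append("no evidence cited")
    errors += [f"invalid citation: {c}" for c in sorted(cited - valid_ids)]
    return errors

bundle = [{"id": "E1", "text": "structuring pattern across 9 accounts"},
          {"id": "E2", "text": "customer KYC on file, low-risk segment"}]

ok = validate_triage({"decision": "escalate",
                      "supporting": ["E1"], "contradicting": ["E2"]}, bundle)
bad = validate_triage({"decision": "escalate",
                       "supporting": ["E9"]}, bundle)  # E9 not in the bundle
```

Separating supporting from contradicting evidence, as the contract requires, is also what makes the counterfactual check meaningful: perturbing a cited item should flip the decision and the rationale together.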
OpenAI launches Privacy Filter, an open source, on-device data sanitization model that removes personal information from enterprise datasets
In a significant shift toward local-first privacy infrastructure, OpenAI has released Privacy Filter, a specialized open-source model designed to detect and redact personally identifiable information (PII) before it ever reaches a cloud-based server. Launched today on the AI code-sharing community Hugging Face under a permissive Apache 2.0 license, the tool addresses a growing industry bottleneck: the risk of sensitive data "leaking" into training sets or being exposed during high-throughput inference. By providing a 1.5-billion-parameter model that can run on a standard laptop or directly in a web browser, the company is effectively handing developers a "privacy-by-design" toolkit that functions as a sophisticated, context-aware digital shredder.

Though OpenAI was founded with a focus on open-source models such as this, the company shifted during the ChatGPT era to providing more proprietary ("closed source") models available only through its website, apps, and API, only to return to open source in a big way last year with the launch of the gpt-oss family of language models. In that light, and combined with OpenAI's recent open sourcing of agentic orchestration tools and frameworks, it is safe to say that the generative AI giant remains heavily invested in fostering this less immediately lucrative part of the AI ecosystem.

Technology: a gpt-oss variant with a bidirectional token classifier that reads from both directions

Architecturally, Privacy Filter is a derivative of OpenAI's gpt-oss family, a series of open-weight reasoning models released earlier this year. However, while standard large language models (LLMs) are typically autoregressive, predicting the next token in a sequence, Privacy Filter is a bidirectional token classifier. This distinction is critical for accuracy. By looking at a sentence from both directions simultaneously, the model gains a deeper understanding of context that a forward-only model might miss.
For instance, it can better distinguish whether "Alice" refers to a private individual or a public literary character based on the words that follow the name, not just those that precede it.

The model utilizes a sparse Mixture-of-Experts (MoE) framework. Although it contains 1.5 billion total parameters, only 50 million parameters are active during any single forward pass. This sparse activation allows for high throughput without the massive computational overhead typically associated with LLMs. Furthermore, it features a 128,000-token context window, enabling it to process entire legal documents or long email threads in a single pass without fragmenting the text, a step that often causes traditional PII filters to lose track of entities across page breaks.

To ensure the redacted output remains coherent, OpenAI implemented a constrained Viterbi decoder. Rather than making an independent decision for every single word, the decoder evaluates the entire sequence to enforce logical transitions. It uses a "BIOES" (Begin, Inside, Outside, End, Single) labeling scheme, which ensures that if the model identifies "John" as the start of a name, it is statistically inclined to label "Smith" as the continuation or end of that same name rather than as a separate entity.

On-device data sanitization

Privacy Filter is designed for high-throughput workflows where data residency is a non-negotiable requirement. It currently supports the detection of eight primary PII categories:

Private Names: individual persons.
Contact Info: physical addresses, email addresses, and phone numbers.
Digital Identifiers: URLs, account numbers, and dates.
Secrets: a specialized category for credentials, API keys, and passwords.

In practice, this allows enterprises to deploy the model on-premises or within their own private clouds.
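The constrained decoding idea above can be sketched in a few lines. The following is a minimal illustration of BIOES-constrained Viterbi decoding for a single entity type, not OpenAI's actual implementation; the transition table, start/end masks, and log-scores are assumptions made for the example.

```python
import numpy as np

# Allowed BIOES transitions for one entity type: a span must open with B
# (or be a one-token S), continue with I, and close with E before O or a
# new span can follow.
LABELS = ["O", "B", "I", "E", "S"]
ALLOWED = {
    "O": {"O", "B", "S"},
    "B": {"I", "E"},
    "I": {"I", "E"},
    "E": {"O", "B", "S"},
    "S": {"O", "B", "S"},
}
# Valid first labels (a span cannot start mid-entity) and valid last labels
# (an open span must be closed before the sequence ends).
START = np.array([0.0 if l in {"O", "B", "S"} else -np.inf for l in LABELS])
FINAL = np.array([0.0 if l in {"O", "E", "S"} else -np.inf for l in LABELS])

def viterbi_bioes(emissions):
    """Decode the best valid label path from per-token log-scores.

    emissions: array of shape (n_tokens, 5), columns ordered as LABELS.
    """
    n, k = emissions.shape
    score = np.full((n, k), -np.inf)
    back = np.zeros((n, k), dtype=int)
    score[0] = emissions[0] + START
    for t in range(1, n):
        for j, lab in enumerate(LABELS):
            for i, prev in enumerate(LABELS):
                if lab in ALLOWED[prev]:
                    cand = score[t - 1, i] + emissions[t, j]
                    if cand > score[t, j]:
                        score[t, j] = cand
                        back[t, j] = i
    # Backtrace from the best valid final label.
    path = [int(np.argmax(score[-1] + FINAL))]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [LABELS[i] for i in reversed(path)]
```

The point of the constraint is visible on a two-token name: a per-token greedy decoder may emit "B" then "O" (an entity opened but never closed), while the transition table forces the second token to continue or end the span.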
By masking data locally before sending it to a more powerful reasoning model (like GPT-5 or gpt-oss-120b), companies can maintain compliance with strict GDPR or HIPAA standards while still leveraging the latest AI capabilities. For developers, the model is available via Hugging Face, with native support for transformers.js, allowing it to run entirely within a user's browser using WebGPU.

Fully open source, commercially viable Apache 2.0 license

Perhaps the most significant aspect of the announcement for the developer community is the Apache 2.0 license. Unlike "available-weight" licenses that often restrict commercial use or require "copyleft" sharing of derivative works, Apache 2.0 is one of the most permissive licenses in the software world. For startups and dev-tool makers, this means:

Commercial Freedom: companies can integrate Privacy Filter into their proprietary products and sell them without paying royalties to OpenAI.
Customization: teams can fine-tune the model on their specific datasets (such as medical jargon or proprietary log formats) to improve accuracy for niche industries.
No Viral Obligations: unlike the GPL license, builders do not have to open-source their entire codebase if they use Privacy Filter as a component.

By choosing this licensing path, OpenAI is positioning Privacy Filter as a standard utility for the AI era, essentially the "SSL for text".

Community reactions

The tech community reacted quickly to the release, with many noting the impressive technical constraints OpenAI managed to hit. Elie Bakouch (@eliebakouch), a research engineer at the agentic model training platform startup Prime Intellect, praised the efficiency of Privacy Filter's architecture on X: "Very nice release by @OpenAI! A 50M active, 1.5B total gpt-oss arch MoE, to filter private information from trillion scale data cheaply. keeping 128k context with such a small model is quite impressive too".
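The mask-locally-then-send pattern described above can be illustrated with a short sketch. The regex-based redactor below is a deliberately simple stand-in for the on-device model (a real deployment would call the model instead); the function names, placeholder labels, and patterns are assumptions made for illustration only.

```python
import re

# Stand-in redactor: two toy patterns playing the role the on-device model
# would fill in a real pipeline.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def redact(text):
    """Replace detected PII spans with typed placeholders, locally."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def ask_cloud_model(prompt, send):
    """Sanitize first: only the redacted prompt ever leaves the machine."""
    return send(redact(prompt))
```

The design point is the call order: redaction runs before any network boundary, so the downstream reasoning model only ever sees `[EMAIL]`-style placeholders rather than raw identifiers.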
The sentiment reflects a broader industry trend toward "small but mighty" models. While the world has focused on massive, 100-trillion-parameter giants, the practical reality of enterprise AI often requires small, fast models that can perform one task, like privacy filtering, exceptionally well and at low cost. However, OpenAI included a "High-Risk Deployment Caution" in its documentation. The company warned that the tool should be viewed as a "redaction aid" rather than a "safety guarantee," noting that over-reliance on a single model could lead to "missed spans" in highly sensitive medical or legal workflows. OpenAI's Privacy Filter is clearly an effort by the company to make the AI pipeline fundamentally safer. By combining the efficiency of a Mixture-of-Experts architecture with the openness of an Apache 2.0 license, OpenAI is providing a way for many enterprises to more easily, cheaply, and safely redact PII.
The Mythos meeting focused on the wrong AI risk to banks. Here's the one nobody is talking about | Fortune
While regulators fixate on AI's ability to break financial systems, AI-enabled fraud is already bypassing them — one authorized transaction at a time.
Generative AI Promises Cost Savings in Machine Learning but Elevates
In a recent commentary published in the esteemed journal Patterns, computer scientist Michael Lones of Heriot-Watt University presents a critical perspective on the integration of generative artificial intelligence (AI) within machine learning systems. While the advent of large
Adoption, Deployment & Impact
OpenAI in talks to commit up to $1.5bn to private equity joint venture
Start-up backing new company intended to help deploy AI within businesses owned by PE firms
Salesforce’s Agentforce Vibes 2.0 targets a hidden failure: context overload in AI agents
When startup fundraising platform VentureCrowd began deploying AI coding agents, it saw the same gains as other enterprises, cutting the front-end development cycle by 90% in some projects. However, the gains did not come easily or without a lot of trial and error. VentureCrowd's first challenge revolved around data and context quality: Diego Mogollon, chief product officer at VentureCrowd, told VentureBeat that "agents reason against whatever data they can access at runtime" and would then be confidently "wrong" because they base their knowledge only on the context given to them. The other roadblock, as at many companies, was messy data and unclear processes. As with context, Mogollon said coding agents would amplify bad data, so the company had to build a well-structured codebase first.

"The challenges are rarely about the coding agents themselves; they are about everything around them," said Mogollon. "It's a context problem disguised as an AI problem, and it is the number one failure mode I see across agentic implementations."

VentureCrowd's experience illustrates a broader issue in AI agent development: the models are not failing the agents; rather, the agents become overwhelmed by too much context and too many tools at once.

Too much context

This stems from a phenomenon called context bloat: as workflows become more complex, AI systems accumulate more and more data, tools, and instructions. The problem arises because agents need context to work better, but too much of it creates noise. The more context an agent has to sift through, the more tokens it uses, slowing the work and increasing costs. One way to curb context bloat is context engineering, which helps agents understand code changes or pull requests and align them with their tasks.
However, context engineering often becomes an external task rather than something built into the coding platforms enterprises use to build their agents.

How coding agent providers respond

VentureCrowd relied on one solution in particular to overcome the context bloat plaguing its enterprise AI agent deployment: Salesforce's Agentforce Vibes, a coding platform that lives within Salesforce and is available on all plans, starting with the free one. Salesforce recently updated Agentforce Vibes to version 2.0, expanding support for third-party frameworks like ReAct. Most important for companies like VentureCrowd, Agentforce Vibes added Abilities and Skills, which they can use to direct agent behavior. "For context, our entire platform, frontend and backend, runs on the Salesforce ecosystem. So when Agentforce Vibes launched, it slotted naturally into an environment we already knew well," Mogollon said.

Salesforce's approach doesn't minimize the context agents use; rather, it helps enterprises ensure that context stays within their data models or codebases. Agentforce Vibes adds further structure through the new Skills and Abilities feature: Abilities define what agents want to accomplish, and Skills are the tools they will use to get there.

Other coding agent platforms manage context differently. For example, Claude Code and OpenAI's Codex focus on autonomous execution, continuously reading files, running commands and, as tasks evolve, expanding context. Claude Code has a context indicator and compacts context when it becomes too large. Across these different approaches, the consistent pattern is that most systems manage growing context for agents rather than limit it. Context keeps growing, especially as workflows become more complex, making it harder for enterprises to control costs, latency, and reliability.
Mogollon said his company chose Agentforce Vibes not only because a large portion of its data already lives on Salesforce, making integration easier, but also because it lets the company control more of the context it feeds its agents.

What builders should know

There's no single way to address context bloat, but the pattern is now clear: more context doesn't always mean better results. Along with investing in context engineering, enterprises have to experiment with the context-constraint approach they are most comfortable with. For enterprises, that means the challenge isn't just giving agents more information; it's deciding what to leave out.
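One illustrative way to "decide what to leave out" is a relevance-ranked token budget, sketched below. This is a toy under stated assumptions, not any vendor's method: the relevance scores, the word-count token estimate, and the budget are all placeholders invented for the example.

```python
def trim_context(items, budget, n_tokens=lambda s: len(s.split())):
    """Keep the highest-relevance context items that fit a token budget.

    items: list of (relevance_score, text) pairs.
    n_tokens: rough token estimator (word count by default).
    Returns the kept texts in their original order.
    """
    # Consider items in descending relevance, greedily taking those that fit.
    ranked = sorted(enumerate(items), key=lambda kv: kv[1][0], reverse=True)
    kept, used = set(), 0
    for idx, (score, text) in ranked:
        cost = n_tokens(text)
        if used + cost <= budget:
            kept.add(idx)
            used += cost
    return [text for i, (score, text) in enumerate(items) if i in kept]
```

Under a fixed budget, a low-relevance but long item is the first thing dropped, which is exactly the trade the article describes: less noise per token at the cost of coverage.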
Quant pioneer Martin Lueck warns against handing over trading to AI
Caution by co-founder of Aspect hedge fund follows billionaire Cliff Asness’s decision to ‘surrender’ to the machines
Google DeepMind partners with global consultancies to accelerate enterprise AI adoption
Google DeepMind is partnering with leading consultancies to bridge the AI adoption gap and drive agentic transformation with frontier models and expert research.
Workday, Rippling, and Slack flunk data access test, claims Fivetran
Report also slams multiple vendors for poor data integration and egress fees. Workday, Rippling, and Salesforce-owned Slack rank among the worst performers for enterprise data movement, according to a new industry benchmark tracking the speeds needed to power analytics, machine learning, and AI agents.
AI Adoption Sentiment Surges as March Index Hits 75.95, Signaling Market Recovery
Evaluates perceptions and expectations across four core dimensions—adoption, disruption, spend, and use case—to provide a structured view of AI....
A Multi-Plant Machine Learning Framework for Emission Prediction, Forecasting, and Control in Cement Manufacturing
arXiv:2604.19903v1 Announce Type: cross Abstract: Cement production is among the largest contributors to industrial air pollution, emitting ~3 Mt NOx/year. The industry-standard mitigation approach, selective non-catalytic reduction (SNCR), exhibits low NH3 utilization efficiency, resulting in operational inefficiencies and increased reagent costs. Here, we develop a data-driven framework for emission control using large-scale operational data from four cement plants worldwide. Benchmarking nine machine learning architectures, we observe that prediction error varies ~3-5x across plants due to variation in data richness. Incorporating short-term process history nearly triples NOx prediction accuracy, revealing that NOx formation carries substantial process memory, a timescale dependence that is absent in CO and CO2. Further, we develop models that forecast NOx overshoots as early as nine minutes, providing a buffer for operational adjustments. The developed framework controls NOx formation at the source, reducing NH3 consumption in downstream SNCR. Surrogate model projections estimate a ~34-64% reduction in NOx while preserving clinker quality, corresponding to a reduction of ~290 t NOx/year and ~58,000 USD/year in NH3 savings. This work establishes a generalizable framework for data-driven emission control, offering a pathway toward low-emission operation without structural modifications or additional hardware, with potential applicability to other hard-to-abate industries such as steel, glass, and lime.
Automated Detection of Dosing Errors in Clinical Trial Narratives: A Multi-Modal Feature Engineering Approach with LightGBM
arXiv:2604.19759v1 Announce Type: new Abstract: Clinical trials require strict adherence to medication protocols, yet dosing errors remain a persistent challenge affecting patient safety and trial integrity. We present an automated system for detecting dosing errors in unstructured clinical trial narratives using gradient boosting with comprehensive multi-modal feature engineering. Our approach combines 3,451 features spanning traditional NLP (TF-IDF, character n-grams), dense semantic embeddings (all-MiniLM-L6v2), domain-specific medical patterns, and transformer-based scores (BiomedBERT, DeBERTa-v3), used to train a LightGBM model. Features are extracted from nine complementary text fields (median 5,400 characters per sample) ensuring complete coverage across all 42,112 clinical trial narratives. On the CT-DEB benchmark dataset with severe class imbalance (4.9% positive rate), we achieve 0.8725 test ROC-AUC through 5-fold ensemble averaging (cross-validation: 0.8833 + 0.0091 AUC). Systematic ablation studies reveal that removing sentence embeddings causes the largest performance degradation (2.39%), demonstrating their critical role despite contributing only 37.07% of total feature importance. Feature efficiency analysis demonstrates that selecting the top 500-1000 features yields optimal performance (0.886-0.887 AUC), outperforming the full 3,451-feature set (0.879 AUC) through effective noise reduction. Our findings highlight the importance of feature selection as a regularization technique and demonstrate that sparse lexical features remain complementary to dense representations for specialized clinical text classification under severe class imbalance.
The starter home is dying. Better.com’s CEO says AI is the only thing that can save it
Better.com CEO Vishal Garg said AI can service the low mortgages loan officers avoid.
Barclays, Lloyds, UBS among firms chosen by financial watchdog for AI testing
The UK's financial regulator has selected a second cohort of firms, including Barclays, Lloyds, and UBS, to test AI systems in its AI Lab.
Geopolitics, Policy & Governance
Pentagon asks for $54bn in pivot towards AI-powered war
Budget outlines funding for autonomous drone warfare program as experts say military unprepared for risks The Pentagon is aiming to increase funding more than a hundredfold for an autonomous drone warfare program, according to budget documents released this week, signalling a major pivot towards AI-powered war. In its 2027 budget, the Pentagon has asked for over $54bn to fund the Defense Autonomous Warfare Group, a 24,000% increase on last year.
Taiwan Banks to Build Own AI Model to Rival Global Giants
Taiwan is launching a project to develop a large language model for its finance sector, seeking to strengthen its local firms and bypass the limitations of global AI platforms that often lack the nuance of domestic regulations and market practices.
Top Republican pushes party to shun $300mn AI lobby
Senator Josh Hawley warns of ‘political cost’ if Washington fails to rein in Big Tech and artificial intelligence
U.S. Tech Export Controls Overhaul Reshapes AI &... | Legis1
House committee marks up bills reshaping U.S. tech export controls on semiconductors and AI. Discover the biggest overhaul since 2018. Read more.
Hong Kong exploring specific AI legislation in legal-framework review
Hong Kong is exploring the possibility of introducing dedicated artificial intelligence laws as part of a review of its legal framework, Secretary for Innovation, Technology and Industry Sun Dong said.
EU cloud law faces strong lobbying despite momentum, senior official says
A planned EU Cloud and AI Development Act faces significant lobbying efforts despite growing political momentum for tech sovereignty.
AI companies asked to work with UK govt in strengthening cyber defenses
Security minister Dan Jarvis is calling on AI companies to work with the UK government to build AI-powered cyber defense capabilities and sign a voluntary Cyber Resilience Pledge.
Closing the gap between AI technology and who can actually use it | Federal News Network
"What we're hoping to be able to see … is, first and foremost, increasing that AI readiness nationwide," Dr. Erwin Gianchandani said.
Get the full executive brief
Receive curated insights with practical implications for strategy, operations, and governance.