Tue 28 April 2026
Daily Brief — Curated and contextualised by Best Practice AI
Chip Startup Innovates, OpenAI Stumbles, and Software Stocks Suffer
TL;DR: A new chip startup aims to overcome AI's memory limitations, potentially transforming server efficiency. OpenAI-linked stocks declined after reports of missed sales and user targets, though OpenAI claims robust growth. Investors are increasingly favoring chip stocks over software, reflecting shifting market dynamics. The EU AI Act continues to challenge public sector AI deployment, highlighting regulatory complexities.
The stories that matter most
Selected and contextualised by the Best Practice AI team
OpenAI Hits Back at Growth Fears, Says ‘Firing on All Cylinders’
OpenAI pushed back against concerns over its sales growth on Tuesday, saying its consumer and enterprise businesses are “firing on all cylinders” despite a report about the AI startup missing internal targets.
OpenAI-Linked Stocks Slump on Report It Missed Key Targets
A constellation of artificial-intelligence stocks dropped after OpenAI reportedly failed to meet its sales and user targets, rekindling doubts that the hundreds of billions of dollars that big companies are plowing into the technology will deliver sufficient profits anytime soon.
Tech’s ‘New Normal’ Trade Pair: Long Chip Stock, Short Software
In a choppy year for tech investors, one trade has stood out as a success: buy chip stocks, sell software shares. And the divide between winners and losers is getting bigger as 2026 moves along.
The Security Cost of Intelligence: AI Capability, Cyber Risk, and Deployment Paradox
arXiv:2604.23058v1 Announce Type: new Abstract: Firms are deploying more capable AI systems, but organizational controls often have not kept pace. These systems can generate greater productivity gains, but high-value uses require broader authority exposure -- data access, workflow integration, and delegated authority -- when governance controls have not yet decoupled capability from authority exposure. We develop an analytical model in which a firm jointly chooses AI deployment and cybersecurity investment under this governance-capability gap. The central result shows a deployment paradox: in high-loss environments, better AI can lead a firm to deploy less when capability is deployed through broader authority exposure under weak governance. Optimal deployment also falls below the no-risk benchmark, and this shortfall widens with breach-loss magnitude and with the authority exposure attached to more capable systems. Governance investment that reduces breach-loss magnitude shrinks the paradox region itself, while breach externalities expand the range of environments in which deployment is socially constrained. Governance maturity is therefore not merely a constraint on AI adoption. It is a condition that shapes whether capability improvements translate into productive deployment.
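The deployment paradox above can be illustrated with a deliberately simplified one-line model (our construction, not the paper's): let expected profit be g·d − L·e·d², where d is deployment intensity, g the productivity gain per unit deployed, e the authority exposure attached to the system, and L the breach-loss magnitude. The interior optimum is d* = g/(2Le), so a more capable system that raises exposure faster than it raises gains is optimally deployed less in high-loss environments.

```python
def optimal_deployment(g: float, e: float, L: float) -> float:
    """Maximiser of the toy profit g*d - L*e*d**2 (illustrative, not the paper's model)."""
    return g / (2 * L * e)

# High-loss environment (large L): a more capable system that doubles the
# gain g but quintuples authority exposure e is optimally deployed *less*.
L = 10.0
weak_ai = optimal_deployment(g=1.0, e=1.0, L=L)    # 0.05
strong_ai = optimal_deployment(g=2.0, e=5.0, L=L)  # 0.02
print(f"weak AI d* = {weak_ai:.3f}, strong AI d* = {strong_ai:.3f}")
```

Governance investment that reduces L (or decouples capability from exposure e) raises d* directly, which is the abstract's point that governance maturity conditions whether capability gains translate into deployment.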
Buying the Right to Monitor: Editorial Design in AI-Assisted Peer Review
arXiv:2604.23645v1 Announce Type: new Abstract: Generative AI acts as a disruptive technological shock to evaluative organizations. In academic peer review, it enters both sides of the market: authors use AI to polish submissions, and reviewers use it to generate plausible reports without exerting evaluative effort. We develop a three-sided equilibrium model to analyze this dual adoption and derive a counterintuitive managerial implication for journal policy. We show that when AI capability crosses a critical threshold, reviewer effort collapses discontinuously. This transition creates a welfare misalignment: authors benefit from a weakened ``rat race,'' while editors suffer from degraded signal informativeness. Characterizing the editor's optimal constrained response, we identify a strict policy reversal. Before the AI transition, editors should tighten acceptance standards to curb rent-dissipating author polishing. After the transition, conventional intuition fails: editors must loosen acceptance standards while investing in AI detection, because further tightening only amplifies dissipative polishing without improving sorting. We prove analytically that this sign reversal is a structural consequence of the reviewer effort collapse under log-concave quality distributions. Ultimately, addressing AI in evaluative systems requires treating monitoring and loosened selectivity as complementary design instruments.
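The discontinuous effort collapse has a simple intuition: a reviewer who can substitute a free AI-generated report exerts costly effort only while that effort's net payoff still beats the AI's quality. The sketch below is our toy best-response function (the threshold, cost, and quality values are invented for illustration), not the paper's equilibrium model.

```python
def reviewer_effort(ai_quality: float,
                    effort_cost: float = 0.3,
                    own_quality: float = 1.0) -> float:
    """All-or-nothing best response: exert effort only while the net payoff of
    a genuine review (own_quality - effort_cost) beats the free AI report."""
    return 1.0 if own_quality - effort_cost > ai_quality else 0.0

# Effort is flat at 1.0, then collapses to 0.0 once AI capability crosses
# the threshold own_quality - effort_cost (= 0.7 in this toy).
profile = {q: reviewer_effort(q) for q in (0.5, 0.6, 0.65, 0.75, 0.9)}
print(profile)
```

Because the switch is all-or-nothing rather than gradual, small improvements in AI capability near the threshold produce a discrete jump in reviewer behavior, which is the mechanism behind the abstract's policy reversal.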
Secure On-Premise Deployment of Open-Weights Large Language Models in Radiology: An Isolation-First Architecture with Prospective Pilot Evaluation
arXiv:2604.22768v1 Announce Type: new Abstract: Purpose: To design, implement, evaluate, and report on the regulatory requirements of a self-hosted LLM infrastructure for radiology adhering to the principle of least privilege, emphasizing technical feasibility, network isolation, and clinical utility. Materials and Methods: The isolation-first, containerized LLM inference stack relies on strict network segmentation, host-enforced egress filtering, and active isolation monitoring to prevent unauthorized external connectivity. An accompanying deployment package provides automated isolation and hardening tests. The system served the open-weights DeepSeek-R1 model via vLLM. In a one-week pilot phase, 22 residents and radiologists were free to use 10 predefined prompt templates whenever they considered them useful in daily work. Afterward, they rated clinical utility and system stability on a 0-10 Likert scale and reported observed critical errors in model output. Results: The applied institutional governance pathway achieved approval from clinic management, compliance, data protection, and information security officers for processing unanonymized PHI. The system was rated stable and user-friendly during the pilot. Source-text-anchored tasks, such as report corrections or simplifications, and radiology guideline recommendations received the highest utility ratings, whereas open-ended conclusion generation based on findings produced the highest frequency of critical errors, such as clinically relevant hallucinations or omissions. Conclusion: The proposed isolation-first on-premise architecture overcame regulatory hurdles, showed promising clinical utility in text-anchored tasks, and now serves as the basis for offering open-weights LLMs as an official service of a German University Hospital with over 10,000 employees. The deployment package was made publicly available (https://github.com/ukbonn/ukb-gpt).
Institutions for the Post-Scarcity of Judgment
arXiv:2604.22966v1 Announce Type: new Abstract: Each major technological revolution inverts a particular scarcity and rebuilds institutions around the shift. The near-consensus diagnosis of the AI revolution holds that AI collapses the cost of prediction while judgment remains scarce. This Opinion argues the inversion has now flipped: competent-looking judgment (selecting, ranking, attributing, certifying) is produced at scale and at marginal cost approaching zero, and four complements become scarce: verified signal, legitimacy, authentic provenance, and integration capacity (the community's tolerance for delegated cognition). Because judgment is the substance of institutions, the institutions built to manufacture legitimate judgment (courts, journals, licensing bodies, legislatures) now compete with the technology for the same functional role. The piece traces the pattern across scientific institutions, professional licensing, intellectual property, democratic legitimacy, and foundation-model concentration, and closes with a three-move agenda: reframe AI policy as institutional redesign, build provenance and verification as commons, and develop the formal apparatus for institutional composition under strategic agents.
Economics & Markets
OpenAI Hits Back at Growth Fears, Says ‘Firing on All Cylinders’
OpenAI pushed back against concerns over its sales growth on Tuesday, saying its consumer and enterprise businesses are “firing on all cylinders” despite a report about the AI startup missing internal targets.
How to price your AI product in a post-SaaS world? | This is what raising $1.5M really looked like: 250 meetings & 171 rejections.
What do people really want from AI? - Anthropic Survey.
Disney’s $60 billion bet on the one thing AI can’t replace
Disney’s CEO is facing an existential crisis brought about by an emerging technology. The year was 1955, the CEO was Walt Disney and the tech was television.
GoTo Posts First-Ever Net Income After Cost Cuts Bear Fruit
GoTo Group reported its first-ever net income, a major milestone in the Indonesian ride-hailing and food delivery company’s turnaround effort as it seeks to gain fresh momentum in its long-standing rivalry with Grab Holdings Ltd.
OpenAI-Linked Stocks Slump on Report It Missed Key Targets
A constellation of artificial-intelligence stocks dropped after OpenAI reportedly failed to meet its sales and user targets, rekindling doubts that the hundreds of billions of dollars that big companies are plowing into the technology will deliver sufficient profits anytime soon.
Tech’s ‘New Normal’ Trade Pair: Long Chip Stock, Short Software
In a choppy year for tech investors, one trade has stood out as a success: buy chip stocks, sell software shares. And the divide between winners and losers is getting bigger as 2026 moves along.
Is OpenAI Falling Further Behind in the A.I. Race?
The artificial intelligence giant has reportedly fallen behind on its own user and revenue targets, raising questions about its data center and I.P.O. plans
Citigroup lifts AI market view to over $4 trillion on enterprise adoption | Reuters
Citigroup raised its global artificial intelligence market forecast, citing faster-than-expected enterprise adoption of artificial intelligence tools for coding and automation, with companies such as Anthropic showing strong revenue growth.
Data Center-Linked Bonds Slide as OpenAI Report Fuels Worries
US corporate bonds linked to data-center firms fell Tuesday after the Wall Street Journal reported that OpenAI recently failed to meet its own goals for new user acquisition and sales, fueling internal concerns that the company may struggle to support its spending on artificial intelligence infrastructure.
CrowdStrike to rally as Anthropic spotlights AI cybersecurity threats, Mizuho says
CrowdStrike should see its shares rise as Project Glasswing calls attention to cybersecurity threats posed by the growing use of AI, per Mizuho.
STMicroelectronics Q1 FY 2026 Earnings Show Early AI and Satellite Upside
Brendan Burke, Research Director, analyzes STMicroelectronics’ Q1 FY 2026 earnings, focusing on AI data center and LEO satellite momentum and photonics ramp timing.
AI Infrastructure Stocks Drop Amid OpenAI Growth Concerns
On April 28, 2026, shares of Super Micro Computer Inc (SMCI) experienced a notable decline in pre-market trading, alongside other companies linked to artificial intelligence.
Institutions for the Post-Scarcity of Judgment
arXiv:2604.22966v1 Announce Type: new Abstract: Each major technological revolution inverts a particular scarcity and rebuilds institutions around the shift. The near-consensus diagnosis of the AI revolution holds that AI collapses the cost of prediction while judgment remains scarce. This Opinion argues the inversion has now flipped: competent-looking judgment (selecting, ranking, attributing, certifying) is produced at scale and at marginal cost approaching zero, and four complements become scarce: verified signal, legitimacy, authentic provenance, and integration capacity (the community's tolerance for delegated cognition). Because judgment is the substance of institutions, the institutions built to manufacture legitimate judgment (courts, journals, licensing bodies, legislatures) now compete with the technology for the same functional role. The piece traces the pattern across scientific institutions, professional licensing, intellectual property, democratic legitimacy, and foundation-model concentration, and closes with a three-move agenda: reframe AI policy as institutional redesign, build provenance and verification as commons, and develop the formal apparatus for institutional composition under strategic agents.
The Economic Imperative of Predicting AI Capability and Scaling Velocity
All downstream economic impacts of AI, including labor displacement and productivity shifts, depend on the ultimate capability and scaling speed of models. Focusing on the S-curve of AI development is essential for strategic planning.
AI-Driven Energy Demand and the Emerging Impact on Credit Risk - Business Information
AI-driven energy demand is reshaping credit risk. Learn how rising energy costs may impact small business performance and lending strategies.
The Future Is Shrouded in an AI Fog
AI’s rapid advance is creating new limits on leaders’ visibility into the short-term future and challenging the criteria they use to commit to forward-looking investments. Staring into this fog, leaders will be tempted to trade the potential future gains from skyscrapers and railways for ...
Locked, stocked, and losing budget: AI vendor lock-in bites back
Execs in the C-suite thought they could swap models in a week. They were hallucinating. The days when you could jump from one frontier AI model to another at the drop of a hat are going away as vendor lock-in starts to kick in and prices increase…
Anthropic’s Little Brother
OpenAI is racing to catch up to its greatest rival.
CIOs move to reclaim value as AI shakes up outsourcing contracts
As AI reshapes outsourcing models, CIOs are renegotiating contracts, reworking governance, and rethinking risk management.
The Impact of Dodd-Frank and the Huawei Shock on DRC Tin Exports
arXiv:2512.21645v2 Announce Type: replace Abstract: This paper investigates the structural transformation of the Democratic Republic of the Congo (DRC) tin market induced by the U.S. Dodd-Frank Act. Focusing on the breakdown of the pricing mechanism, we estimate the price elasticity of export demand from 2010 to October 2022 using a structural identification strategy that overcomes the lack of reliable unit value data. Our analysis reveals that the regulation effectively destroyed the price mechanism, with demand elasticity dropping to zero. This indicates the formation of a ``captive market'' driven by certification requirements rather than price competitiveness. Also, we find strong hysteresis; deregulation alone failed to restore market flexibility. The structural rigidity was finally broken not by policy suspension, but by the 2019 ``Huawei shock,'' an external demand surge that forced supply chain diversification.
End of the road for the ‘Mad Men’ as AI moves into advertising
Sprawling marketing groups are struggling to respond to new technology
Advertisers seek to capitalise on the promise of AI
Marketers need to balance the efficiency offered by automation with the authenticity that consumers demand
Should your board appoint a bot?
New AI tools help chairs and directors with prep and research but are unlikely to be granted a vote
r/automation on Reddit: What are some automations that actually got 10x better due to advancements in AI?
Data-Driven Content Creation & Repurposing: Our team has used AI tools like Frizerly to automate the process of coming up with a content strategy using our Google search data. It also spies on our competitors to identify keyword gaps. This is then used to automatically publish a blog post on our website every day using our internal data like customer testimonials, case studies etc.
‘AI deflation’ comes to India’s tech services giants and puts downward pressure on revenue
Headcounts, however, are mostly holding up. AI is beginning to make a dent in the business models of India’s big four technology services giants…
Activist Starboard Value Takes Stake in AI Software Maker Dynatrace
Dynatrace shares have underperformed its peers, and Starboard is pushing for changes to turn things around.
AI & Tech Brief: The Pentagon goes VC - The Washington Post
Plus, a sit-down with Evan Smith, the CEO of Altana, on global AI supply chains
Meta, Google, OpenAI among Big Tech firms seeing top staff leaving to launch AI startups
Former employees at AI giants are raising hundreds of millions of dollars from investors months on from launching.
Stockholm’s Redpine raises €6.8 million to unlock licensed premium data for AI agents
Redpine, a Stockholm-based AI startup, has raised €6.8M in Seed funding to power AI companies and agents with access to licensed, high-quality and multimodal data, securely and at scale. The round was led by NordicNinja, with participation from fellow Nordic firms Luminar Ventures and node.vc. Alongside the Seed funding, Redpine has received investment from strategic […]
Copenhagen’s Performativ raises €11.9 million Series A to scale its AI-native wealth management operating system
Performativ, a Copenhagen-based startup building the next-generation operating system for wealth management, has raised €11.96 million ($14 million) in its Series A funding round. The round was led by Deutsche Börse Group, with participation from Rabo Investments, the investment arm of Rabobank, Jacob Dahl, former Senior Partner and Co-leader of Global Banking Sector, McKinsey & […]
Ban of $2bn deal puts firms and founders falling within Beijing’s AI ambitions on notice | fDi Intelligence
Here’s the contentious history behind OpenAI.
Amsterdam’s QDNL Participations rebrands as Ground State Ventures, raises over €75.2 million for quantum tech fund
Amsterdam-based quantum technology-focused VC firm QDNL Participations today announced its rebranding to Ground State Ventures. It is also preparing for the final close of its early-stage quantum technology fund, having already raised over €75.2 million ($88 million), far exceeding the original target of €59.8 million ($70 million). As per the VC firm, the rebrand reflects […]
Top 10 AI Companies Leading New Zealand's Tech Boom in 2026
AUCKLAND — New Zealand's artificial intelligence sector is experiencing explosive growth in 2026, with a wave of innovative startups and established players driving advancements in agritech, healthcare, customer experience, and maritime intelligence, positioning the country as a rising force ...
Labor, Society & Culture
Buying the Right to Monitor: Editorial Design in AI-Assisted Peer Review
arXiv:2604.23645v1 Announce Type: new Abstract: Generative AI acts as a disruptive technological shock to evaluative organizations. In academic peer review, it enters both sides of the market: authors use AI to polish submissions, and reviewers use it to generate plausible reports without exerting evaluative effort. We develop a three-sided equilibrium model to analyze this dual adoption and derive a counterintuitive managerial implication for journal policy. We show that when AI capability crosses a critical threshold, reviewer effort collapses discontinuously. This transition creates a welfare misalignment: authors benefit from a weakened ``rat race,'' while editors suffer from degraded signal informativeness. Characterizing the editor's optimal constrained response, we identify a strict policy reversal. Before the AI transition, editors should tighten acceptance standards to curb rent-dissipating author polishing. After the transition, conventional intuition fails: editors must loosen acceptance standards while investing in AI detection, because further tightening only amplifies dissipative polishing without improving sorting. We prove analytically that this sign reversal is a structural consequence of the reviewer effort collapse under log-concave quality distributions. Ultimately, addressing AI in evaluative systems requires treating monitoring and loosened selectivity as complementary design instruments.
Microsoft researchers have revealed the 40 jobs most exposed to AI—and even teachers make the list | Fortune
Sorry, Gen Z: AI is expected to soon reshape dozens of popular professions—and possibly make some tasks obsolete.
The 85/5 Enterprise AI Paradox: Why Almost... | Metaintro
85% of enterprises run AI agents, only 5% ship them. Here's what the trust gap means for AI hiring, job security, and the roles employers actually need in 2026.
Despite the hype, AI is not replacing the customer service workforce | HR Dive
The hype says “agentless” service is imminent, but data shows most teams are still staffing up while trying to make artificial intelligence actually function in real workflows.
Nearly half of London jobs at risk of AI disruption and women will be hardest hit, new report finds | Euronews
According to a new report by the Mayor of London's office, nearly half of the UK capital's workers could see their jobs transformed by generative AI.
AI job losses could now exceed 4pc of workforce, says skills tsar
New government-backed forum to examine rules to protect workers is to meet for the first time on Wednesday.
9 Cash and Retraining Lifelines for... | Metaintro
AI-driven layoffs in 2026? Tap nine cash stipends, retraining grants, and employer-funded programs for displaced workers. Get the help you have earned.
PhySE: A Psychological Framework for Real-Time AR-LLM Social Engineering Attacks
arXiv:2604.23148v1 Announce Type: new Abstract: The emerging threat of AR-LLM-based Social Engineering (AR-LLM-SE) attacks (e.g. SEAR) poses a significant risk to real-world social interactions. In such an attack, a malicious actor uses Augmented Reality (AR) glasses to capture a target's visual and vocal data. A Large Language Model (LLM) then analyzes this data to identify the individual and generate a detailed social profile. Subsequently, LLM-powered agents employ social engineering strategies, providing real-time conversation suggestions, to gain the target's trust and ultimately execute phishing or other malicious acts. Despite its potential, the practical application of AR-LLM-SE faces two major bottlenecks: (1) Cold-start personalization: current Retrieval-Augmented Generation (RAG) methods introduce critical delays in the earliest turns, slowing initial profile formation and disrupting real-time interaction; (2) Static attack strategies: existing approaches rely on fixed-stage, handcrafted social engineering tactics that lack foundation in established psychological theory. To address these limitations, we propose PhySE, a novel framework with two core innovations: (1) VLM-based social-context training: to eliminate profiling delays, we efficiently pre-train a Visual Language Model (VLM) with social-context data, enabling rapid, on-the-fly profile generation; (2) Adaptive psychological agent: we introduce a psychological LLM that dynamically deploys distinct classes of psychological strategies based on the target's responses, moving beyond static, handcrafted scripts. We evaluated PhySE through an IRB-approved user study with 60 participants, collecting a novel dataset of 360 annotated conversations across diverse social scenarios.
Employees Petition Google CEO To Block Classified Military Use of AI Technology - The Media Line
Google employees have signed a petition opposing the […]
AI Attack on Black: Tech Lynchings & Cyber Discrimination 101 | EURweb | Black News, Culture, Entertainment & More
AI shows measurable bias against African American English. Tech lynching is real. The evidence is clear and urgent.
What fighter pilots can teach us about enterprise AI decisions | TechRadar
Decision traceability and human judgment in enterprise AI
Peer Identity Bias in Multi-Agent LLM Evaluation: An Empirical Study Using the TRUST Democratic Discourse Analysis Pipeline
arXiv:2604.22971v1 Announce Type: new Abstract: The TRUST democratic discourse analysis pipeline exposes its large language model (LLM) components to peer model identity through multiple structural channels -- a design feature whose bias implications have not previously been empirically tested. We provide the first systematic measurement of identity-dependent scoring bias across all active identity exposure channels in TRUST, crossing four model families with two anonymization scopes across 30 political statements. The central finding is that single-channel anonymization produces near-zero bias effects, because individual channels act in opposite directions and cancel each other out -- a result that would lead an evaluator to conclude that identity bias is absent when it is not. Only full-pipeline anonymization reveals the true pattern: homogeneous ensembles amplify identity-driven sycophancy when model identity is fully visible, while the heterogeneous production configuration shows the reverse. Model choice matters independently: one tested model exhibits baseline sycophancy two to three times higher than the others and near-zero deliberative conflict on ideological topics, making it structurally unsuitable for pipelines where genuine inter-role disagreement is the intended quality mechanism. Three practical conclusions follow. First, heterogeneous model ensembles are structurally more robust than homogeneous ones, achieving higher consensus rates and lower identity amplification. Second, full-pipeline anonymization is required for valid bias measurement -- partial anonymization is insufficient and actively misleading. Third, these findings have direct implications for the validation of multi-agent LLM systems in quality-critical applications: a system validated under partial anonymization or with a homogeneous ensemble may pass validation while retaining structural identity bias invisible to single-channel measurement.
Early Academic Capital as the Causal Origin of Dropout in Constrained Educational Systems -- Evidence from Longitudinal Data and Structural Causal Models
arXiv:2604.22772v1 Announce Type: new Abstract: Dropout in higher education is commonly analysed through observable academic events such as course failure or repetition. However, these event-based perspectives may obscure the underlying structural dynamics that shape student trajectories. In this study, we adopt a causal computational social science approach to identify the origins of dropout in a constrained engineering curriculum. Using longitudinal administrative data from 16,868 students who survived to their second active term, and a leakage-free panel design, we estimate the causal effect of early academic capital accumulation on three-year dropout. Treatment is defined as low early progress (passing at most 1 subject by the end of the second term). We employ G-estimation of structural nested mean models, complemented by marginal structural models with inverse probability weighting. We find a large and robust causal effect: low early academic capital increases dropout probability by 25.3 percentage points (G-estimation), closely matched by a 27.4 pp estimate from IPTW models. This effect is approximately twice as large as the estimated direct impact of later academic events such as first-time gateway-course repetition (12.7 pp). These findings suggest that dropout does not originate in isolated academic failures, but in early trajectory misalignment between academic progress and system-imposed temporal constraints. This perspective shifts the focus of intervention from downstream events to early-stage trajectory formation.
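The study's 27.4 pp IPTW estimate comes from a standard inverse-probability-of-treatment-weighting recipe: model the probability of treatment (low early academic capital) given confounders, then compare weighted outcome means. Below is a minimal sketch on synthetic data with a known +25 pp effect; the variable names and data-generating process are ours for illustration, not the study's.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Confounder: prior academic preparation (standardised).
x = rng.normal(size=n)

# Treatment: low early academic capital, likelier for weaker preparation.
p_treat = 1 / (1 + np.exp(0.5 + 1.0 * x))
t = rng.binomial(1, p_treat)

# Outcome: dropout within three years, true treatment effect = +25 pp.
p_drop = np.clip(0.15 + 0.25 * t - 0.05 * x, 0.01, 0.99)
y = rng.binomial(1, p_drop)

# Inverse probability of treatment weights. We use the true propensity score
# here; in practice you would fit e.g. a logistic regression for it.
ps = np.where(t == 1, p_treat, 1 - p_treat)
w = 1.0 / ps

# Weighted means mimic a pseudo-population in which x no longer confounds t.
mu1 = np.sum(w * y * (t == 1)) / np.sum(w * (t == 1))
mu0 = np.sum(w * y * (t == 0)) / np.sum(w * (t == 0))
print(f"IPTW effect estimate: {mu1 - mu0:+.3f} (true +0.250)")
```

A naive comparison of raw dropout rates would overstate the effect here, because weaker preparation raises both the chance of low early capital and the chance of dropout; the weighting removes exactly that confounding.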
Training for the Wrong Job | American Enterprise Institute - AEI
Workforce development is not being displaced by AI. It is being asked to solve one of the defining problems of the next decade: how to grow judgment when the experiences that produce it are increasingly pressured by automation.
Technology & Infrastructure
Towards Automated Ontology Generation from Unstructured Text: A Multi-Agent LLM Approach
arXiv:2604.23090v1 Announce Type: new Abstract: Automatically generating formal ontologies from unstructured natural language remains a central challenge in knowledge engineering. While large language models (LLMs) show promise, it remains unclear which architectural design choices drive generation quality and why current approaches fail. We present a controlled experimental study using domain-specific insurance contracts to investigate these questions. We first establish a single-agent LLM baseline, identifying key failure modes such as poor Ontology Design Pattern compliance, structural redundancy, and ineffective iterative repair. We then introduce a multi-agent architecture that decomposes ontology construction into four artifact-driven roles: Domain Expert, Manager, Coder, and Quality Assurer. We evaluate performance across architectural quality (via a panel of heterogeneous LLM judges) and functional usability (via competency-question-driven SPARQL evaluation with complementary retrieval-augmented-generation-based assessment). Results show that the multi-agent approach significantly improves structural quality and modestly enhances queryability, with gains driven primarily by front-loaded planning. These findings highlight planning-first, artifact-driven generation as a promising and more auditable path toward scalable automated ontology engineering.
Humanoid robots to become baggage handlers in Japan airport experiment
Japan Airlines will introduce the robots for a trial run at a Tokyo airport amid the country’s surge in inbound tourism and worsening labour shortages. Japan’s famously conscientious but overburdened baggage handlers will soon be joined by extra staff at Tokyo’s Haneda airport – although their new colleagues will need to take regular recharging breaks. Japan Airlines will introduce humanoid robots on a trial basis from the beginning of May, with a view to deploying them permanently as a solution to the country’s chronic labour shortage.
Competitive Analysis of Enterprise AI Agent Integration in Productivity Software
A critique of Microsoft's Outlook agent implementation suggests that current enterprise AI integrations often suffer from poor UX and limited cross-platform visibility. The analysis compares these shortcomings against more agile, third-party agentic alternatives.
Council Post: How AI Agents Can Help Small Businesses Compete
AI agents are more than an emerging trend to monitor; they offer operational advantages that can allow small teams to perform like large ones.
I used Claude’s new Dispatch feature for a month. Here’s everything I was able to do
The new AI feature is less “chatbot on your phone” and more a way to send your computer errands while you’re away.
From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents
arXiv:2604.23194v1 Announce Type: new Abstract: Large language model-based agents have recently emerged as powerful approaches for solving dynamic and multi-step tasks. Most existing agents employ planning mechanisms to guide long-term actions in dynamic environments. However, current planning approaches face a fundamental limitation: they operate at a fixed granularity level. Specifically, they either provide excessive detail for simple tasks or insufficient detail for complex ones, failing to achieve an optimal balance between simplicity and complexity. Drawing inspiration from the principle of progressive refinement in cognitive science, we propose AdaPlan-H, a self-adaptive hierarchical planning mechanism that mimics human planning strategies. Our method initiates with a coarse-grained macro plan and progressively refines it based on task complexity. It generates self-adaptive hierarchical plans tailored to the varying difficulty levels of different tasks, which can be optimized by imitation learning and capability enhancement. Experimental results demonstrate that our method significantly improves task execution success rates while mitigating overplanning at the planning level, providing a flexible and efficient solution for multi-step complex decision-making tasks. To contribute to the community, our code and data will be made publicly available at https://github.com/import-myself/AHP.
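The coarse-to-fine idea is easy to sketch without any model in the loop: start from a macro plan and recursively decompose only the steps judged too complex to execute directly. Everything below (the Step type, the complexity scores, and toy_expand, which stands in for an LLM decomposition call) is illustrative, not the AdaPlan-H implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    goal: str
    complexity: float                 # 0..1; an LLM would estimate this
    substeps: list["Step"] = field(default_factory=list)

def refine(step: Step, expand, threshold: float = 0.5, max_depth: int = 3) -> Step:
    """Coarse-to-fine: decompose a step only if it is judged too complex,
    so simple tasks keep short plans and hard tasks get deeper ones."""
    if step.complexity > threshold and max_depth > 0:
        step.substeps = [refine(s, expand, threshold, max_depth - 1)
                         for s in expand(step)]
    return step

def toy_expand(step: Step) -> list[Step]:
    """Stand-in for an LLM call: split a hard step into two easier halves."""
    return [Step(f"{step.goal} / part {i}", step.complexity / 2) for i in (1, 2)]

def leaves(step: Step) -> list[Step]:
    """Executable (unrefined) steps of the hierarchical plan."""
    if not step.substeps:
        return [step]
    return [leaf for s in step.substeps for leaf in leaves(s)]

plan = refine(Step("book a multi-city research trip", complexity=0.9), toy_expand)
print([s.goal for s in leaves(plan)])
```

The refinement terminates either when every leaf falls below the complexity threshold or when the depth budget runs out, which is what keeps the plan from over-expanding on easy tasks.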
Open source Xiaomi MiMo-V2.5 and V2.5-Pro are among the most efficient (and affordable) at agentic 'claw' tasks
Xiaomi's new open-source MiMo-V2.5 models are setting new benchmarks for efficiency and affordability in agentic robotic tasks.
Cadence lifts annual revenue forecast on sustained AI chip-design boom | Reuters
Cadence is partnering with Nvidia to integrate its physics engines, which predict how real-world materials interact, with AI models designed to train robots inside computer simulations.
Nvidia-Supplier Victory Giant’s Sales Surge on Solid AI Demand
Victory Giant Technology Huizhou Co. reported a 28% increase in quarterly sales on stronger demand for printed circuit boards critical for development of AI servers.
AI-Driven CPU Shortage Saves Intel’s Financial Cookies
If you have a few pallets of datacenter CPUs sitting in a barn somewhere, and they have a reasonable ...
Chip Startup Aims to Shatter AI’s Dreaded Memory Wall
Huge AI models are overwhelming servers and leaving high-powered chips idle. Google and Meta veterans say they have the solution.
Tenstorrent’s Galaxy Blackhole AI servers escape the event horizon
RISC-V-based systems pack 32 Blackhole accelerators into a 6U, $110K chassis. Tenstorrent on Tuesday announced the general availability of its Galaxy Blackhole AI compute platform.…
Suppliers to AI companies are big winners of spending surge | CFO.com
Many of the most successful players in the AI arena, such as CoreWeave, Nvidia and AMD, are making the chips and memory modules that make the technology possible.
AI-Driven Power Demand Reshapes US Energy Consulting Market - Business Insider
Consulting firms like Boston Consulting and McKinsey see AI-driven electricity demand surge, complicating energy transition efforts.
NVIDIA's Rubin Lands Inside Google's Virtual Machine, Stretching Multi Site Clusters to Nearly 1 Million GPUs
Google and NVIDIA have teamed up to provide users with access to as much as one million NVIDIA GPUs to power up the freshly launched A5X instances. The announcement is part of the pair's latest collaboration to reduce inference costs and improve token throughput.
Intellectia
Capital Expenditure Expectations: ... for AI infrastructure alongside shifting global dynamics. Practical Limitations Emerging: Despite a strong willingness to deploy more capital, Terry highlighted that practical limitations such as power capacity shortages, skilled labor deficits, and regulatory approvals are beginning to surface, potentially capping how quickly firms can expand amid intensifying competitive pressure. Shifting Market Focus: Investor ...
Navigating the Connectivity Bottleneck in AI Infrastructure
Artificial General Intelligence Forecasting and Scenario Analysis: State of the Field, Methodological Gaps, and Strategic Implications
arXiv:2604.22766v1 Announce Type: new Abstract: In this report, we review the current state of methodologies to forecast the arrival of artificial general intelligence, assess their reliability, and analyze the implications for strategy and policy. We synthesize diverse forecasting approaches, document significant limitations in existing methods, and propose a research agenda for developing more-robust forecasting infrastructure. The report does not endorse a specific forecast or scenario but rather provides a framework for interpreting forecasts under conditions of deep uncertainty. We experimented with an iterative approach to human and artificial intelligence collaboration for this report. The primary drafting of the text was performed by large language models (GPT 5.1, Gemini 3 Pro, and Claude 4.5 Opus), with human researchers providing direction, peer review, fact-checking, and revision.
When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR
arXiv:2604.22774v1 Announce Type: new Abstract: Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior studies focus on single-line expressions and rely on lexical metrics such as BLEU, which fail to assess the semantic reasoning across multi-line student solutions. In this paper, we present the first systematic study of multi-line handwritten math Optical Character Recognition (OCR), revealing a critical failure mode of Vision-Language Models (VLMs): over-correction. Instead of faithfully transcribing a student's work, these models often "fix" errors, thereby hiding the very mistakes an educational assessment aims to detect. To address this, we propose PINK (Penalized INK-based score), a semantic evaluation metric that leverages a Large Language Model (LLM) for rubric-based grading and explicitly penalizes over-correction. Our comprehensive evaluation of 15 state-of-the-art VLMs on the FERMAT dataset reveals substantial ranking reversals compared to BLEU: models like GPT-4o are heavily penalized for aggressive over-correction, whereas Gemini 2.5 Flash emerges as the most faithful transcriber. Furthermore, human expert studies show that PINK aligns significantly better with human judgment (55.0% preference over BLEU's 39.5%), providing a more reliable evaluation framework for handwritten math OCR in educational settings.
‘World models’ are AI’s latest sensation: what are they and what can they do?
Training AI world models on data about physical environments could improve their real-world capabilities in technologies such as robotics.
New AI framework autonomously optimizes training data, architectures and algorithms — outperforming human baselines
A new AI framework has been developed that autonomously optimizes its own training data and architecture, surpassing human-designed baselines.
FormalScience: Scalable Human-in-the-Loop Autoformalisation of Science with Agentic Code Generation in Lean
arXiv:2604.23002v1 Announce Type: new Abstract: Formalising informal mathematical reasoning into formally verifiable code is a significant challenge for large language models. In scientific fields such as physics, domain-specific machinery (e.g., Dirac notation, vector calculus) imposes additional formalisation challenges that modern LLMs and agentic approaches have yet to tackle. To aid autoformalisation in scientific domains, we present FormalScience; a domain-agnostic human-in-the-loop agentic pipeline that enables a single domain expert (without deep formal language experience) to produce syntactically correct and semantically aligned formal proofs of informal reasoning at low economic cost. Applying FormalScience to physics, we construct FormalPhysics, a dataset of 200 university-level (LaTeX) physics problems and solutions (primarily quantum mechanics and electromagnetism), along with their Lean4 formal representations. Compared to existing formal math benchmarks, FormalPhysics achieves perfect formal validity and exhibits greater statement complexity. We evaluate open-source models and proprietary systems on a statement autoformalisation task on our dataset via zero-shot prompting, self-refinement with error feedback, and a novel multi-stage agentic approach, and explore autoformalisation limitations in modern LLM-based approaches. We provide the first systematic characterisation of semantic drift in physics autoformalisation in terms of concepts such as notational collapse and abstraction elevation, which reveals what formal language verifies when full semantic preservation is unattainable. We release the codebase together with an interactive UI-based FormalScience system which facilitates autoformalisation and theorem proving in scientific domains beyond physics. https://github.com/jmeadows17/formal-science
Analytica: Soft Propositional Reasoning for Robust and Scalable LLM-Driven Analysis
arXiv:2604.23072v1 Announce Type: new Abstract: Large language model (LLM) agents are increasingly tasked with complex real-world analysis (e.g., in financial forecasting, scientific discovery), yet their reasoning suffers from stochastic instability and lacks a verifiable, compositional structure. To address this, we introduce Analytica, a novel agent architecture built on the principle of Soft Propositional Reasoning (SPR). SPR reframes complex analysis as a structured process of estimating the soft truth values of different outcome propositions, allowing us to formally model and minimize the estimation error in terms of its bias and variance. Analytica operationalizes this through a parallel, divide-and-conquer framework that systematically reduces both sources of error. To reduce bias, problems are first decomposed into a tree of subpropositions, and tool-equipped LLM grounder agents, including a novel Jupyter Notebook agent for data-driven analysis, are employed to validate and score facts. To reduce variance, Analytica recursively synthesizes these grounded leaves using robust linear models that average out stochastic noise with superior efficiency and scalability, and enable interactive "what-if" scenario analysis. Our theoretical and empirical results on economic, financial, and political forecasting tasks show that Analytica improves accuracy by 15.84% on average over diverse base models, achieving 71.06% accuracy with the lowest variance of 6.02% when working with a Deep Research grounder. Our Jupyter Notebook grounder shows strong cost-effectiveness, achieving a close 70.11% accuracy with 90.35% less cost and 52.85% less time. Analytica also exhibits highly noise-resilient and stable performance growth as the analysis depth increases, with near-linear time complexity, as well as good adaptivity to open-weight LLMs and scientific domains.
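The variance-reduction idea at the heart of SPR can be illustrated with a toy sketch (not the paper's code): averaging several independent noisy estimates of a proposition's soft truth value leaves the mean unchanged while shrinking the spread.

```python
import random
import statistics

random.seed(0)

def noisy_truth(p=0.7, noise=0.2):
    # One stochastic "grounder" estimate of a proposition's soft truth value.
    return p + random.uniform(-noise, noise)

# Single-shot estimates vs. averages of 8 independent estimates.
single = [noisy_truth() for _ in range(1000)]
averaged = [statistics.mean(noisy_truth() for _ in range(8)) for _ in range(1000)]

# Both are unbiased, but averaging cuts the standard deviation by ~sqrt(8).
assert statistics.stdev(averaged) < statistics.stdev(single)
```

This mirrors, in miniature, why synthesizing many grounded leaves with linear averaging stabilizes the final estimate.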
Don't Make the LLM Read the Graph: Make the Graph Think
arXiv:2604.23057v1 Announce Type: new Abstract: We investigate whether explicit belief graphs improve LLM performance in cooperative multi-agent reasoning. Through 3,000+ controlled trials across four LLM families in the cooperative card game Hanabi, we establish four findings. First, integration architecture determines whether belief graphs provide value: as prompt context, graphs are decorative for strong models and beneficial only for weak models on 2nd-order Theory of Mind (80% vs 10%, p<0.0001, OR=36.0); when graphs gate action selection through ranked shortlists, they become structurally essential even for strong models (100% vs 20% on 2nd-order ToM, p<0.001). Second, we identify "Planner Defiance," a model-family-specific failure where LLMs override correct planner recommendations at partial competence (90% override, replicated N=20); Gemini models show near-zero defiance while Llama 70B shows 90%, and models distinguish factual context (deferred to) from advisory recommendations (overridden). Third, full-game evidence confirms inter-agent conventions (+128% over baseline, p=0.003) outperform all single-agent interventions, and individual belief-graph components must be combined to produce gains. Fourth, preliminary scaling analysis (N=10/cell, exploratory) suggests graph depth has diminishing returns: shallow graphs provide the best cost-benefit ratio, while deeper ToM graphs appear harmful at larger player counts (-1.5 pts at 5-player, p=0.029).
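The paper's distinction between belief graphs as prompt context and belief graphs that gate action selection can be sketched in a few lines (hypothetical code, not the paper's implementation): the graph produces a ranked shortlist, and the agent may only choose from it.

```python
def gate_actions(belief_scores: dict, candidate_actions: list, k: int = 3) -> list:
    # Rank candidates by belief-graph score; the agent must pick from the top k.
    ranked = sorted(candidate_actions,
                    key=lambda a: belief_scores.get(a, 0.0), reverse=True)
    return ranked[:k]

# Hypothetical Hanabi-style action set and belief scores.
actions = ["hint_color", "hint_number", "play_card", "discard"]
scores = {"play_card": 0.9, "hint_color": 0.6, "discard": 0.1}
shortlist = gate_actions(scores, actions)
# shortlist == ["play_card", "hint_color", "discard"]; "hint_number" is gated out
```

Gating makes the graph structurally essential: the agent cannot bypass it, which is the regime where even strong models benefit.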
StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning
arXiv:2604.23198v1 Announce Type: new Abstract: Current video moment retrieval excels at action-centric tasks but struggles with narrative content. Models can see \textit{what is happening} but fail to reason \textit{why it matters}. This semantic gap stems from the lack of \textbf{Theory of Mind (ToM)}: the cognitive ability to infer implicit intentions, mental states, and narrative causality from surface-level observations. We introduce \textbf{StoryTR}, the first video moment retrieval benchmark requiring ToM reasoning, comprising 8.1k samples from narrative short-form videos (shorts/reels). These videos present an ideal testbed. Their high information density encodes meaning through subtle multimodal cues. For instance, a glance paired with a sigh carries entirely different semantics than the glance alone. Yet multimodal perception alone is insufficient; ToM is required to decode that a character ``smiling'' may actually be ``concealing hostility.'' To teach models this reasoning capability, we propose an \textbf{Agentic Data Pipeline} that generates training data with explicit three-tier ToM chains (intent decoding, narrative reasoning, boundary localization). Experiments reveal the severity of the reasoning gap: Gemini-3.0-Pro achieves only 0.53 Avg IoU on StoryTR. However, our 7B \textbf{Shorts-Moment} model, trained on ToM-guided data, improves +15.1\% relative IoU over baselines, demonstrating that \textit{narrative reasoning capability matters more than parameter scale}.
An Intelligent Fault Diagnosis Method for General Aviation Aircraft Based on Multi-Fidelity Digital Twin and FMEA Knowledge Enhancement
arXiv:2604.22777v1 Announce Type: new Abstract: Fault diagnosis of general aviation aircraft faces challenges including scarce real fault data, diverse fault types, and weak fault signatures. This paper proposes an intelligent fault diagnosis framework based on multi-fidelity digital twin, integrating four modules: high-fidelity flight dynamics simulation, FMEA-driven fault injection, multi-fidelity residual feature extraction, and large language model (LLM)-enhanced interpretable report generation. A digital twin is constructed using the JSBSim six-degree-of-freedom (6-DoF) flight dynamics engine, generating 23-channel engine health monitoring data via semi-empirical sensor synthesis equations. A three-layer fault injection engine based on failure mode and effects analysis (FMEA) models the physical causal propagation of 19 engine fault types. A multi-fidelity residual computation framework comprising paired-mirror residuals and GRU surrogate prediction residuals is proposed: the high-fidelity path obtains clean fault deviation signals using nominal mirror trajectories with identical initial conditions, while the low-fidelity path achieves online real-time residual computation through a multi-step prediction GRU surrogate model. A 1D-CNN classifier performs end-to-end diagnosis of 20 fault classes. An LLM diagnostic report engine enhanced with FMEA knowledge fuses classification results, residual evidence, and domain causal knowledge to generate interpretable natural language reports. Experiments show the paired-mirror residual scheme achieves a Macro-F1 of 96.2% on the 20-class task, while the GRU surrogate scheme achieves 4.3x inference acceleration at only 0.6% performance cost. Comparison across 24 schemes reveals that residual feature quality contributes approximately 5x more to diagnostic performance than classifier architecture, establishing the "residual quality first" design principle.
PExA: Parallel Exploration Agent for Complex Text-to-SQL
arXiv:2604.22934v1 Announce Type: new Abstract: LLM-based agents for text-to-SQL often struggle with latency-performance trade-off, where performance improvements come at the cost of latency or vice versa. We reformulate text-to-SQL generation within the lens of software test coverage where the original query is prepared with a suite of test cases with simpler, atomic SQLs that are executed in parallel and together ensure semantic coverage of the original query. After iterating on test case coverage, the final SQL is generated only when enough information is gathered, leveraging the explored test case SQLs to ground the final generation. We validated our framework on a state-of-the-art benchmark for text-to-SQL, Spider 2.0, achieving a new state-of-the-art with 70.2% execution accuracy.
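The "test coverage" framing can be sketched with SQLite (an illustrative toy, not the paper's pipeline; the table and probe names are invented): simple atomic probe queries confirm schema assumptions before the final SQL is emitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "EU", 10.0), (2, "EU", 30.0), (3, "US", 5.0)])

# Atomic probes: each checks one assumption the final query depends on.
probes = {
    "region_values": "SELECT DISTINCT region FROM orders",
    "row_count": "SELECT COUNT(*) FROM orders",
}
evidence = {name: conn.execute(sql).fetchall() for name, sql in probes.items()}

# The final SQL is generated only after the probes have grounded the schema.
final_sql = "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
result = conn.execute(final_sql).fetchall()
# result == [("EU", 40.0), ("US", 5.0)]
```

In the paper's setting the probes run in parallel; a sequential loop keeps the sketch short.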
DeepSeek V4 is Here. What Does it Mean for Enterprise Productivity? - UC Today
UC Today delivers insights for IT leaders and buyers covering Agentic AI, Agentic AI in the Workplace, AI Agents, Generative AI, Workflow Automation, collaboration, employee experience and workspace tech.
Epicure: Multidimensional Flavor Structure in Food Ingredient Embeddings
arXiv:2604.22776v1 Announce Type: new Abstract: A chef's intuition about flavor, texture, and cultural identity represents tacit knowledge that is difficult to articulate yet central to culinary practice. We show that this knowledge is already encoded in FlavorGraph's 300-dimensional ingredient embeddings, trained on recipe cooccurrence and food chemistry, and that it can be systematically recovered. An LLM-augmented curation pipeline consolidates 6,653 raw FlavorGraph ingredients into 1,032 canonical entries, substantially strengthening the recoverable structure. We identify at least fifteen independently classifiable dimensions spanning taste, texture, geography, food processing, and culture.
Evaluating Historical Data Constraints on Modern Large Language Model Reasoning Capabilities
A small-scale language model trained exclusively on pre-1931 text explores the limits of historical data in modern reasoning tasks. This experiment tests whether models can derive contemporary inventions or coding skills from archaic datasets.
Assessing On-Device Efficiency and Utility of Small-Scale Historical Language Models
A small language model trained on pre-1931 text demonstrates the feasibility of running specialized AI on local hardware. While technically efficient, the model's limited reasoning capabilities highlight the trade-offs between model size and utility.
System Prompt Anomalies in OpenAI's Codex Model Highlight AI Development Unpredictability
The discovery of unusual instructions in the system prompt for OpenAI's Codex model illustrates the opaque nature of AI development. These anomalies underscore the challenges in controlling and interpreting large-scale model behavior.
The Power of Power Law: Asymmetry Enables Compositional Reasoning
arXiv:2604.22951v1 Announce Type: new Abstract: Natural language data follows a power-law distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data towards a uniform distribution may help models better learn these long-tail skills, we find a counterintuitive result: across a wide range of compositional reasoning tasks, such as state tracking and multi-step arithmetic, training under power-law distributions consistently outperforms training under uniform distributions. To understand this advantage, we introduce a minimalist skill-composition task and show that learning under a power-law distribution provably requires significantly less training data. Our theoretical analysis reveals that power law sampling induces a beneficial asymmetry that improves the pathological loss landscape, which enables models to first acquire high-frequency skill compositions with low data complexity, which in turn serves as a stepping stone to efficiently learn rare long-tailed skills. Our results offer an alternative perspective on what constitutes an effective data distribution for training models.
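The power-law training distribution the paper studies is easy to simulate (a minimal sketch, not the paper's setup): sampling skill indices with probability proportional to 1/k concentrates data on a few head skills while still exposing the long tail.

```python
import collections
import random

random.seed(0)

K = 10  # number of distinct "skills"
weights = [1.0 / (k + 1) for k in range(K)]  # Zipf-like: P(k) proportional to 1/(k+1)

draws = random.choices(range(K), weights=weights, k=10_000)
counts = collections.Counter(draws)

# Head skills dominate the data, but every tail skill still appears.
assert counts[0] > counts[K - 1]
assert all(counts[k] > 0 for k in range(K))
```

On the paper's account, this asymmetry is the point: frequent compositions are learned first and then serve as stepping stones to the rare ones.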
Towards Causally Interpretable Wi-Fi CSI-Based Human Activity Recognition with Discrete Latent Compression and LTL Rule Extraction
arXiv:2604.22979v1 Announce Type: new Abstract: We address Human Activity Recognition (HAR) utilizing Wi-Fi Channel State Information (CSI) under the joint requirements of causal interpretability, symbolic controllability, and direct operation on high-dimensional raw signals. Deep neural models achieve strong predictive performance on CSI-based HAR (CHAR), yet rely on continuous latent representations that are opaque and difficult to modify; purely symbolic approaches, in contrast, cannot process raw CSI streams. We propose a fully automatic and strictly decoupled pipeline in which CSI magnitude windows are compressed by a categorical variational autoencoder with Gumbel-Softmax latent variables under a capacity-controlled objective, yielding a compact discrete representation. The encoder is then frozen and used as a deterministic mapping to one-hot latent trajectories. Causal discovery is performed on these trajectories to estimate class-conditional temporal dependency graphs. Statistically supported lagged dependencies are translated into Linear Temporal Logic (LTL) rules, producing a fully symbolic and deterministic classifier based solely on rule evaluation and aggregation, without any learned discriminative head. Because rules are defined over discrete latent variables, antenna-specific rule sets can in principle be combined at the symbolic level, enabling structured multi-antenna fusion without retraining the encoder. Results from CHAR Latent Temporal Rule Extraction (CHARL-TRE) indicate competitive performance while preserving explicit temporal and causal structure, showing that deterministic symbolic classification grounded in unsupervised discrete latent representations constitutes a viable alternative to end-to-end black-box models for wireless HAR.
The Security Cost of Intelligence: AI Capability, Cyber Risk, and Deployment Paradox
arXiv:2604.23058v1 Announce Type: new Abstract: Firms are deploying more capable AI systems, but organizational controls often have not kept pace. These systems can generate greater productivity gains, but high-value uses require broader authority exposure -- data access, workflow integration, and delegated authority -- when governance controls have not yet decoupled capability from authority exposure. We develop an analytical model in which a firm jointly chooses AI deployment and cybersecurity investment under this governance-capability gap. The central result shows a deployment paradox: in high-loss environments, better AI can lead a firm to deploy less when capability is deployed through broader authority exposure under weak governance. Optimal deployment also falls below the no-risk benchmark, and this shortfall widens with breach-loss magnitude and with the authority exposure attached to more capable systems. Governance investment that reduces breach-loss magnitude shrinks the paradox region itself, while breach externalities expand the range of environments in which deployment is socially constrained. Governance maturity is therefore not merely a constraint on AI adoption. It is a condition that shapes whether capability improvements translate into productive deployment.
Opinion | After Mythos, Nobody Is Safe From Cybersecurity Threats - The New York Times
Nobody can afford to be relaxed about their digital security anymore.
Adoption, Deployment & Impact
Secure On-Premise Deployment of Open-Weights Large Language Models in Radiology: An Isolation-First Architecture with Prospective Pilot Evaluation
arXiv:2604.22768v1 Announce Type: new Abstract: Purpose: To design, implement, evaluate, and report on the regulatory requirements of a self-hosted LLM infrastructure for radiology adhering to the principle of least privilege, emphasizing technical feasibility, network isolation, and clinical utility. Materials and Methods: The isolation-first, containerized LLM inference stack relies on strict network segmentation, host-enforced egress filtering, and active isolation monitoring preventing unauthorized external connectivity. An accompanying deployment package provides automated isolation and hardening tests. The system served the open-weights DeepSeek-R1 model via vLLM. In a one-week pilot phase, 22 residents and radiologists were free to use 10 predefined prompt templates whenever they considered them useful in daily work. Afterward, they rated clinical utility and system stability on a 0-10 Likert scale and reported observed critical errors in model output. Results: The applied institutional governance pathway achieved approval from clinic management, compliance, data protection, and information security officers for processing unanonymized PHI. The system was rated stable and user friendly during the pilot. Source text-anchored tasks, such as report corrections or simplifications, and radiology guideline recommendations received the highest utility ratings, whereas open-ended conclusion generation based on findings resulted in the highest frequency of critical errors, such as clinically relevant hallucinations or omissions. Conclusion: The proposed isolation-first on-premise architecture enabled overcoming regulatory hurdles, showed promising clinical utility in text-anchored tasks, and is the current base to serve open-weights LLMs as an official service of a German University Hospital with over 10,000 employees. The deployment package was made publicly available (https://github.com/ukbonn/ukb-gpt).
A Decoupled Human-in-the-Loop System for Controlled Autonomy in Agentic Workflows
arXiv:2604.23049v1 Announce Type: new Abstract: AI agents are increasingly deployed to execute tasks and make decisions within agentic workflows, introducing new requirements for safe and controlled autonomy. Prior work has established the importance of human oversight for ensuring transparency, accountability, and trustworthiness in such systems. However, existing implementations of Human-in-the-Loop (HITL) mechanisms are typically embedded within application logic, limiting reuse, consistency, and scalability across multi-agent environments. This paper presents a decoupled HITL system architecture that treats human oversight as an independent system component within the agent operating environment. The proposed design separates human interaction management from application workflows through explicit interfaces and a structured execution model. In addition, a design framework is introduced to formalize HITL integration along four dimensions: intervention conditions, role resolution, interaction semantics, and communication channel. This framework enables selective and context-aware human involvement while maintaining system-level consistency. The approach supports alignment with emerging agent communication protocols, allowing HITL to be implemented as a protocol-level concern. By externalizing HITL and structuring its integration, the system provides a foundation for scalable governance and progressive autonomy in agentic workflows.
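The four design dimensions the paper names (intervention conditions, role resolution, interaction semantics, communication channel) can be sketched as a policy object that lives outside the workflow code (hypothetical names; a sketch, not the paper's system):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class HITLPolicy:
    should_intervene: Callable[[dict], bool]  # intervention condition
    resolve_role: Callable[[dict], str]       # role resolution
    channel: str                              # communication channel

def run_step(action: dict, policy: HITLPolicy,
             ask_human: Callable[[str, dict], bool]) -> str:
    # Interaction semantics: an approve/reject decision from the resolved role.
    if policy.should_intervene(action):
        reviewer = policy.resolve_role(action)
        return "approved" if ask_human(reviewer, action) else "rejected"
    return "auto"

# The workflow only calls run_step; oversight rules change without touching it.
policy = HITLPolicy(
    should_intervene=lambda a: a["risk"] > 0.5,
    resolve_role=lambda a: "compliance" if a["domain"] == "finance" else "ops",
    channel="email",
)
outcome = run_step({"risk": 0.9, "domain": "finance"}, policy, lambda r, a: True)
# outcome == "approved"
```

Because the policy is decoupled from the agent logic, the same oversight component can be reused across multi-agent workflows, which is the paper's central architectural claim.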
Mistral AI launches Workflows, a Temporal-powered orchestration engine already running millions of daily executions
Mistral AI, the Paris-based artificial intelligence company valued at €11.7 billion ($13.8 billion), today released Workflows in public preview — a production-grade orchestration layer designed to move enterprise AI systems out of proofs of concept and into the business processes that generate revenue. The product, which launches as part of Mistral's Studio platform, is the company's clearest articulation yet of a thesis that is quietly reshaping the enterprise AI market: that the bottleneck for organizations adopting AI is no longer the model itself, but the infrastructure required to run it reliably at scale. "What we're seeing today is that organizations are struggling to go beyond isolated proofs of concept," Elisa Salamanca, head of product at Mistral AI, told VentureBeat in an exclusive interview ahead of the launch. "The gap is operational. Workflows is the infrastructure to run AI systems reliably across business-critical processes." The release arrives at a pivotal moment for both Mistral and the broader AI industry. The dedicated agentic AI market has been valued at approximately $10.9 billion in 2026 and is projected to reach $199 billion by 2034. Yet despite that staggering growth trajectory, industry research points to a stark reality: over 40% of agentic AI projects will be aborted by 2027 due to high costs, unclear value, and complexity. Mistral is betting that Workflows can help its enterprise customers avoid becoming one of those statistics. Mistral's new orchestration layer separates execution from control to keep enterprise data private At its core, Workflows provides a structured system for defining, executing, and monitoring multi-step AI processes — from simple sequential tasks to complex, stateful operations that blend deterministic business rules with the probabilistic outputs of large language models. Salamanca described Workflows as containing several key components. 
The first is a development kit that allows engineers to build orchestration logic in just a few lines of Python code. "We have also been able to expose MCP servers," she explained, referring to the Model Context Protocol standard for connecting AI systems to external tools, "so that they can actually do this with agent authoring." The second — and arguably more technically significant — component is an architecture that separates orchestration from execution. "We're decorrelating the orchestration from the execution," Salamanca said. "Execution can happen close to the customer's data — their critical systems — and orchestration can happen on the cloud or wherever they want to run it." This means the data never has to leave the customer's perimeter, a design decision with enormous implications for regulated industries where data sovereignty is non-negotiable. "Enterprises do not have to worry about us having access to the data," she added. The third pillar is observability. According to Mistral's blog post announcing the release, every branch, retry, and state change within a workflow is recorded in Studio with native support for OpenTelemetry. Salamanca noted that this is not an afterthought: "You can easily see what decisions have been taken by the workflow, by the agent, and you can deep dive into where problems are happening." Workflows is fully customizable across models — engineers can select which model handles which step and can inject arbitrary code, allowing them to blend deterministic pipelines with agentic sections. The system also supports connectors that integrate directly with CRMs, ticketing systems, support platforms, and other enterprise tools, with built-in authentication and secrets management. Why Mistral chose a code-first approach over low-code drag-and-drop builders Unlike some competitors offering drag-and-drop workflow builders, Mistral has deliberately targeted developers and engineers rather than business users. 
"There are a couple of solutions out there that have click-and-drag, drag-and-drop solutions for workflows," Salamanca acknowledged. "This is not the approach that we've been taking. We've been really focused towards developers and critical systems that will not scale if you're doing these drag-and-drop workflows." The decision is part of a broader philosophy at Mistral: that enterprise AI systems handling mission-critical operations — cargo releases, compliance reviews, financial transactions — require the precision and version control that only code can provide. Business users are not excluded from the picture, but their role is downstream. Once engineers write a workflow in Python, it can be published to Le Chat, Mistral's chatbot platform, so anyone in the organization can trigger it. Every step remains tracked and auditable in Studio. Under the hood, Workflows runs on Temporal's durable execution engine — a platform whose $5 billion valuation reflects how its durable execution capabilities, originally built for cloud workflow orchestration, have become essential infrastructure for AI agents requiring reliable, long-running, stateful processes. Temporal's customers include OpenAI, Snap, Netflix, and JPMorgan Chase, and its technology powers orchestration at companies like Stripe and Salesforce. Mistral extended Temporal's core engine for AI-specific workloads by adding streaming, payload handling, multi-tenancy, and observability that the base engine does not provide out of the box. "Workflows is built on top of Temporal," Salamanca confirmed. "We added all the AI requirements to make these AI workflows reliable. It provides out of the box durability, retries, state management. Whenever there's a failure, it starts again wherever it stopped." Originally spun out of Uber's Cadence project, Temporal transparently handles retries, state persistence, and timeouts, providing durable execution across failures. 
In late 2025, Temporal joined the newly formed Agentic AI Foundation as a Gold Member and announced an official OpenAI Agents SDK integration. By building on this infrastructure rather than creating a proprietary alternative, Mistral inherits battle-tested reliability while focusing its own engineering efforts on the AI-specific layer that sits above it.
From cargo ships to KYC reviews, customers are already running millions of daily executions
Mistral is not launching Workflows as a concept — the company says customers are already running the product in production, processing millions of executions daily across three primary use cases. The first is cargo release automation in the logistics sector. Global shipping still runs on paperwork, and a single cargo release can involve customs declarations, dangerous goods classifications, safety inspections, and regulatory checks spanning multiple jurisdictions. Salamanca described the scope of the problem: "Their global shipping today runs on paperwork. They have to involve customs declaration, Dangerous Goods classification, safety inspections, regulatory checks, and Workflows is now powering that with our models and business rules inside." Critically, the system keeps humans in the loop at the right moments. According to Mistral's blog, the human approval step in a workflow is a single line of code — wait_for_input() — that pauses the workflow indefinitely with no compute consumption, notifies the reviewer, and resumes exactly where it left off once approval is given. "Humans are still in the loop, but they're in the loop at the right time," Salamanca said. "They just get the validation — I don't have to go into multiple tools — and the shipment gets released." The second production use case is document compliance checking for financial institutions, specifically Know Your Customer reviews. These reviews are manual, repetitive, and traditionally require hours of analyst time per case.
Salamanca said Workflows now processes these reviews in minutes and provides outputs in an auditable manner — a requirement for meeting regulatory obligations. The third example involves customer support in the banking sector. "You'd have millions of users actually asking to have credit cards blocked, or feedbacks on their account situation, on their credit feedbacks," Salamanca said. With Workflows, incoming support tickets are analyzed, categorized by intent and urgency, and routed automatically. Each routing decision is visible and traceable in Studio, and when the system gets a categorization wrong, the team can correct it at the workflow level without retraining the model.
How Workflows fits into Mistral's three-layer enterprise AI platform strategy
Workflows does not exist in isolation. It is the middle layer of a three-part enterprise platform that Mistral has been assembling at a rapid clip throughout 2026. At the bottom sits Forge, the custom model training platform Mistral launched in March at Nvidia’s GTC conference. Forge allows organizations to build, customize, and continuously improve AI models using their own proprietary data. At the top sits Vibe, Mistral's coding agent platform that provides the user-facing interaction layer — available on web, mobile, or desktop. Salamanca connected the three explicitly: "We just released Forge. It enables you to create your own models. But the question is, how do you put these models to do valuable work for your enterprise? That's where Workflows comes in, because this is the orchestration piece — how you blend in deterministic rules and agentic capabilities. And then if you really want to have your end users interact with these AI patterns, it's where Vibe comes into play." Forge is already seeing strong traction, Salamanca said, across two distinct patterns of enterprise demand.
"First, they wanted to really build completely dedicated models to solve unique problems — transformers-based architecture for time series in the financial sector, adding new types of modalities to the LLMs," she explained. "And the second motion was about customers with really specific tasks they want to solve. Reinforcement learning really caught their attention as to how they can use Forge and Forge RL to actually have models do these tasks very well." This layered architecture — model customization, workflow orchestration, and end-user interfaces — positions Mistral as something more ambitious than a model provider. It is building a full-stack enterprise AI platform, a strategy that pits it directly against not just other AI labs like OpenAI and Anthropic, but also against the hyperscale cloud providers. The company's product portfolio now ranges, as Salamanca put it, "from compute to end-user interfaces," including data centers in Europe, document processing with its OCR model, and audio capabilities through its Voxtral models.
Mistral's aggressive scaling campaign and the $14 billion valuation powering it
The Workflows launch comes as Mistral executes one of the most aggressive scaling campaigns in the history of the European technology industry. The French AI startup has increased its revenue twentyfold within a year, with co-founder and CEO Arthur Mensch putting the company's annualized revenue run rate at over $400 million, compared to just $20 million the previous year. The Paris-based company aims to achieve recurring annual revenue of more than $1 billion by year-end. The company's fundraising trajectory has been equally dramatic. Mistral announced a €1.7 billion ($1.9 billion) Series C round at an €11.7 billion ($12.8 billion) valuation in September 2025. Bloomberg had reported in September 2025 that the company was finalizing a €2 billion investment valuing it at €12 billion ($14 billion).
ASML led the round and contributed €1.3 billion, a landmark investment that aligned chip manufacturing expertise with frontier AI development and underscored European industrial capital's commitment to building a sovereign AI ecosystem. Mistral then secured $830 million in debt in March 2026 to buy 13,800 Nvidia chips for a new data center near Paris. The financial picture illustrates why Workflows matters strategically. Mistral's revenue growth is being driven primarily by enterprise adoption, with approximately 60% of revenue coming from Europe, according to CEO Mensch's public statements. Those enterprise customers are not buying Mistral's models for casual chatbot applications — they are deploying them in regulated, mission-critical environments where reliability and data sovereignty are table stakes. Workflows gives those customers the production infrastructure they need to actually deploy AI systems that matter. In May 2025, Mistral released Mistral Medium 3, which was priced at $0.40 per million input tokens and $2 per million output tokens. The company said clients in financial services, energy, and healthcare had been beta testing it for customer service, workflow automation, and analyzing complex datasets. That model now becomes one of many that can be plugged into Workflows, creating a flywheel where better models drive more workflow adoption, which in turn drives more inference revenue.
Where Mistral's orchestration play fits in an increasingly crowded competitive landscape
Mistral's entry into workflow orchestration arrives in a crowded field. AI orchestration platforms are quickly becoming the backbone of enterprise AI systems in 2026, and as businesses deploy multiple AI agents, tools, and LLMs, the need for unified control, oversight, and efficiency has never been greater.
Major cloud providers — Amazon with Bedrock AgentCore, Microsoft with Copilot Studio, Google with Vertex AI's agent tools, and IBM with WatsonX — all offer some form of workflow or agent orchestration. Open-source frameworks like LangChain, LlamaIndex, and Microsoft AutoGen provide developer-level building blocks. And dedicated orchestration startups are proliferating. Mistral's differentiation rests on three pillars. First, vertical integration: because Workflows is native to Studio, the orchestration layer and the components it orchestrates — models, agents, connectors, observability — are built to work together, eliminating the integration tax that enterprises pay when stitching together disparate tools. Second, deployment flexibility: the split control-plane/data-plane architecture means customers in regulated industries can run execution workers in their own environments while still benefiting from managed orchestration. Third, data sovereignty: Mistral's European roots and infrastructure investments give it a natural advantage with organizations wary of routing sensitive data through U.S.-headquartered cloud providers — a concern that has intensified amid ongoing geopolitical tensions and growing European anxiety about relying on foreign providers for over 80% of digital services and infrastructure. Still, the challenges are real. OpenAI and Anthropic both have significantly larger model ecosystems and developer communities. The hyperscalers control the cloud infrastructure where most enterprise workloads actually run. And the enterprise sales cycles for production-grade AI deployments remain long and complex, requiring deep technical integration work that even well-funded startups can struggle to staff.
What comes next for Workflows — and why Mistral thinks orchestration is the real AI battleground
Salamanca outlined three areas of near-term development.
First, Mistral plans to release a more managed version of Workflows that abstracts deployment logic for developers who don't need granular control over worker placement. "Whenever you want to have this flexibility, you can, but if you want to be able to have this on a managed infrastructure, even if it's running in your own VPC, this is something that we're adding," she said. Second, the company intends to make Workflows accessible to business users, not just engineers. "With Vibe code, you can actually author a workflow. This can be executed at scale, and any end user, in the end, can actually do that with Workflows," Salamanca explained. The third area is enterprise guardrails and safety controls for agentic applications — ensuring agents use the correct tools, run with appropriate permissions, and that administrators can enforce policies at scale. "Making sure that we have all these enterprise controls to be able to scale the authoring and the building of these workflows is something we're actively working on," she said. The Python SDK for Workflows (v3.0) is now publicly available. Developers can try the product in Studio and access documentation and demo templates immediately. Mistral will be hosting its inaugural AI Now Summit in Paris on May 27–28, where the company is expected to provide additional details on its platform roadmap. For three years, the AI industry has been captivated by a single question: who can build the most powerful model? Mistral's Workflows launch suggests the company has moved on to a different question entirely — one that may prove far more consequential for the enterprises writing the checks. It's not about which model is smartest. It's about which one can actually show up for work.
Digital Adoption and Cyber Security: An Analysis of Canadian Businesses
arXiv:2504.12413v2 Announce Type: replace Abstract: This paper examines how Canadian firms balance the benefits of technology adoption against the rising risk of cyber security breaches. We merge data from the 2021 Canadian Survey of Digital Technology and Internet Use and the 2021 Canadian Survey of Cyber Security and Cybercrime to investigate the trade-off firms face when pursuing digitalization to enhance productivity and efficiency, balanced against the potential increase in cyber security risk. The analysis explores the extent of digital technology adoption, differences across industries, the subsequent associations with efficiency, and associated cyber security vulnerabilities. We build aggregate variables, such as the Business Digital Usage Score and a cyber security incidence variable to quantify each firm's digital engagement and cyber security risk. A survey-weight-adjusted Lasso estimator is employed, and a debiasing method for high-dimensional logit models is introduced to identify the predictors of technological efficiency and cyber risk. The analysis reveals a digital divide linked to firm size, industry, and workforce composition. While rapid expansion of tools such as cloud services or artificial intelligence can raise efficiency, it simultaneously heightens exposure to cyber threats, particularly among larger enterprises.
A Systematic Approach for Large Language Models Debugging
arXiv:2604.23027v1 Announce Type: new Abstract: Large language models (LLMs) have become central to modern AI workflows, powering applications from open-ended text generation to complex agent-based reasoning. However, debugging these models remains a persistent challenge due to their opaque and probabilistic nature and the difficulty of diagnosing errors across diverse tasks and settings. This paper introduces a systematic approach for LLM debugging that treats models as observable systems, providing structured, model-agnostic methods from issue detection to model refinement. By unifying evaluation, interpretability, and error-analysis practices, our approach enables practitioners to iteratively diagnose model weaknesses, refine prompts and model parameters, and adapt data for fine-tuning or assessment, while remaining effective in contexts where standardized benchmarks and evaluation criteria are lacking. We argue that such a structured methodology not only accelerates troubleshooting but also fosters reproducibility, transparency, and scalability in the deployment of LLM-based systems.
Large UK companies in the dark about how their data is used overseas by AI
Survey of senior technology and data executives finds lack of understanding about how information is handled abroad
Why supply chains are the proving ground for automation-led iPaaS
Supply chains are increasingly becoming the primary testing ground for automation-driven integration platforms.
70% of Enterprise AI is Uncontrolled, Driving Hidden Risk, Cost and Slower ROI
MORRISVILLE, N.C., April 27, 2026--AI is already being used across your organization, whether it has been formally approved or not. Employees are using AI with or without IT involvement, fueling the rise of ‘shadow AI’ across the enterprise, creating gaps in governance and control.
Research funders ‘flooded with AI-assisted applications’
A 142 per cent rise in bids for Marie Curie fellowships shows peer review must adapt to ChatGPT era, says Nature study
"Capital is flowing into the market, but it is more selective, more money going to fewer companies" | CTech
Delia Pekelman, SVP Corporate and Growth at LeumiTech, spoke at Calcalist's Tech Independence 2026 event about how startups are dealing with the AI revolution: "More mature growth companies must integrate AI in a meaningful way. That often requires rethinking and rebuilding core technology ...
Unlocking AI: IBM’s ‘Client Zero’ Strategy and Key Insights for Indian Manufacturers, ETEnterpriseai
Discover how IBM's 'client zero' approach enhances AI deployment for manufacturers in India. Learn valuable lessons on data management, AI ownership, talent constraints, and the future of agentic AI in industry.
ISG Europe ServiceNow report highlights AI, sovereignty | III Stock News
Audit-ready AI and location-bound data handling are shaping deployments in Europe. ISG assessed 40 providers across three ServiceNow categories.
AI adoption roadmap: How organizations scale AI across departments
AI adoption moves organizations from isolated experiments to enterprise-wide operations. Learn the five stages, common roadblocks, and department-by-department strategies for scaling AI agents.
Steelmaker Cliffs Taps Palantir Technologies for AI Overhaul
Cleveland-Cliffs Inc. struck a three-year agreement with Palantir Technologies Inc. to deploy artificial intelligence tools across its operations, as the US steelmaker steps up efforts to modernize its manufacturing footprint.
Learning in Blocks: A Multi Agent Debate Assisted Personalized Adaptive Learning Framework for Language Learning
arXiv:2604.22770v1 Announce Type: new Abstract: Most digital language learning curricula rely on discrete-item quizzes that test recall rather than applied conversational proficiency. When progression is driven by quiz performance, learners can advance despite persistent gaps in using grammar and vocabulary during interaction. Recent work on LLM-based judging suggests a path toward scoring open-ended conversations, but using interaction evidence to drive progression and review requires scoring protocols that are reliable and validated. We introduce Learning in Blocks, a framework that grounds progression in demonstrated conversational competence evaluated using CEFR-aligned rubrics. The framework employs heterogeneous multi-agent debate (HeteroMAD) in two stages: a scoring stage where role-specialized agents independently evaluate Grammar, Vocabulary, and Interactive Communication, engage in debate to address conflicting judgments, and a judge synthesizes consensus scores; and a recommendation stage that identifies specific grammar skills and vocabulary topics for targeted review. Progression requires demonstrating 70% mastery, and spaced review targets identified weaknesses to counter skill decay. We benchmark four scoring and recommendation methods on CEFR A2 conversations annotated by ESL experts. HeteroMAD achieves a superior score agreement with a 0.23 degree of variation and recommendation acceptability of 90.91%. An 8-week study with 180 CEFR A2 learners demonstrates that combining rubric-aligned scoring and recommendation with spaced review and mastery-based progression produces better learning outcomes than feedback alone.
Siemens Energy, TCS partner on AI for energy operations and data center demand
Siemens Energy and TCS have expanded their partnership to deploy AI-driven solutions across energy operations, industrial systems and data center infrastructure to improve efficiency and reliability.
People Using AI to Self-Represent Are Clogging The Courts
Basically, AI helps simplify complex legal pathways and explain them in a way regular people can understand.
AI is transforming innovation in Unilever Beauty & Wellbeing | Unilever
Discover how Unilever's Beauty & Wellbeing scientists are using AI to decode consumer trends and mine decades of research data.
Roadmap to Autonomous Manufacturing: An AI driven Approach Based on Engineering Foundations - Edge AI and Vision Alliance
This article was originally published at HCLTech’s website. It is reprinted here with the permission of HCL Tech. Autonomous manufacturing can be achieved through a structured journey built on foundational engineering, converged data and human-led AI Key takeaways Autonomous manufacturing ...
AI-enabled smart grid to accelerate power sector’s shift to clean energy - Power Technology
AI-enabled smart grids use real-time predictive analytics and machine learning to match supply with demand, improving efficiency.
RAG precision tuning can quietly cut retrieval accuracy by 40%, putting agentic pipelines at risk
Enterprise teams that fine-tune their RAG embedding models for better precision may be unintentionally degrading the retrieval quality those pipelines depend on, according to new research from Redis. The paper, "Training for Compositional Sensitivity Reduces Dense Retrieval Generalization," tested what happens when teams train embedding models for compositional sensitivity. That is the ability to catch sentences that look nearly identical but mean something different — "the dog bit the man" versus "the man bit the dog," or a negation flip that reverses a statement's meaning entirely. That training consistently broke dense retrieval generalization: how well a model retrieves correctly across broad topics and domains it wasn't specifically trained on. Performance dropped by 8 to 9 percent on smaller models and by 40 percent on a current mid-size embedding model teams are actively using in production. The findings have direct implications for enterprise teams building agentic AI pipelines, where retrieval quality determines what context flows into an agent's reasoning chain. A retrieval error in a single-stage pipeline returns a wrong answer. The same error in an agentic pipeline can trigger a cascade of wrong actions downstream. Srijith Rajamohan, AI Research Leader at Redis and one of the paper's authors, said the finding challenges a widespread assumption about how embedding-based retrieval actually works. "There's this general notion that when you use semantic search or similar semantic similarity, we get correct intent. That's not necessarily true," Rajamohan told VentureBeat. "A close or high semantic similarity does not actually mean an exact intent."
The geometry behind the retrieval tradeoff
Embedding models work by compressing an entire sentence into a single point in a high-dimensional space, then finding the closest points to a query at retrieval time. That works well for broad topical matching — documents about similar subjects end up near each other.
The problem is that two sentences with nearly identical words but opposite meanings also end up near each other, because the model is working from word content rather than structure. That is what the research quantified. When teams fine-tune an embedding model to push structurally different sentences apart — teaching it that a negation flip which reverses a statement's meaning is not the same as the original — the model repurposes representational space it was previously using for broad topical recall. The two objectives compete for the same vector. The research also found the regression is not uniform across failure types. Negation and spatial flip errors improved measurably with structured training. Binding errors — where a model confuses which modifier applies to which word, such as which party a contract obligation falls on — barely moved. For enterprise teams, that means the precision problem is harder to fix in exactly the cases where getting it wrong has the most consequences. The reason most teams don't catch it is that fine-tuning metrics measure the task being trained for, not what happens to general retrieval across unrelated topics. A model can show strong improvement on near-miss rejection during training while quietly regressing on the broader retrieval job it was hired to do. The regression only surfaces in production. Rajamohan said the instinct most teams reach for — moving to a larger embedding model — does not address the underlying architecture. "You can't scale your way out of this," he said. "It's not a problem you can solve with more dimensions and more parameters."
Why the standard alternatives all fall short
The natural instinct when retrieval precision fails is to layer on additional approaches. The research tested several of them and found each fails in a different way. Hybrid search. Combining embedding-based retrieval with keyword search is already standard practice for closing precision gaps.
But Rajamohan said keyword search cannot catch the failure mode this research identifies, because the problem is not missing words — it is misread structure. "If you have a sentence like 'Rome is closer than Paris' and another that says 'Paris is closer than Rome,' and you do an embedding retrieval followed by a text search, you're not going to be able to tell the difference," he said. "The same words exist in both sentences." MaxSim reranking. Some teams add a second scoring layer that compares individual query words against individual document words rather than relying on the single compressed vector. This approach, known as MaxSim or late interaction and used in systems like ColBERT, did improve relevance benchmark scores in the research. But it completely failed to reject structural near-misses, assigning them near-identity similarity scores. The problem is that relevance and identity are different objectives. MaxSim is optimized for the former and blind to the latter. A team that adds MaxSim and sees benchmark improvement may be solving a different problem than the one they have. Cross-encoders. These work by feeding the query and candidate document into the model simultaneously, letting it compare every word against every word before making a decision. That full comparison is what makes them accurate — and what makes them too expensive to run at production scale. Rajamohan said his team investigated them. They work in the lab and break under real query volumes. Contextual memory. Also sometimes referred to as agentic memory, these systems are increasingly cited as the path beyond RAG, but Rajamohan said moving to that type of architecture does not eliminate the structural retrieval problem. Those systems still depend on retrieval at query time, which means the same failure modes apply. The main difference is looser latency requirements, not a precision fix. 
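The word-content blindness these alternatives share is easy to demonstrate: any order-insensitive score, keyword overlap included, rates the Rome/Paris pair as identical because both sentences contain exactly the same words. A minimal, self-contained illustration (not Redis's code):

```python
# Bag-of-words cosine similarity ignores word order entirely, so two
# sentences with the same words but opposite meanings score as identical.
from collections import Counter
import math

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over word counts (order-insensitive)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

same_words = bow_cosine("rome is closer than paris",
                        "paris is closer than rome")
different_topic = bow_cosine("rome is closer than paris",
                             "the dog bit the man")
print(round(same_words, 6), different_topic)  # 1.0 0.0
```

Learned embeddings are far richer than word counts, but the research's point is that they inherit a version of the same limitation: similarity driven by content overlap rather than structure.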
The two-stage fix the research validated
The common thread across every failed approach is the same: a single scoring mechanism trying to handle both recall and precision at once. The research validated a different architecture: stop trying to do both jobs with one vector, and assign each job to a dedicated stage. Stage one: recall. The first stage works exactly as standard dense retrieval does today — the embedding model compresses documents into vectors and retrieves the closest matches to a query. Nothing changes here. The goal is to cast a wide net and bring back a set of strong candidates quickly. Speed and breadth are what matter at this stage, not perfect precision. Stage two: precision. The second stage is where the fix lives. Rather than scoring candidates with a single similarity number, a small learned Transformer model examines the query and each candidate at the token level — comparing individual words against individual words to detect structural mismatches like negation flips or role reversals. This is the verification step the single-vector approach cannot perform. The results. Under end-to-end training, the Transformer verifier outperformed every other approach the research tested on structural near-miss rejection. It was the only approach that reliably caught the failure modes the single-vector system missed. The tradeoff. Adding a verification stage costs latency. The latency cost depends on how much verification a team runs. For precision-sensitive workloads like legal or accounting applications, full verification at every query is warranted. For general-purpose search, lighter verification may be sufficient. The research grew out of a real production problem. Enterprise customers running semantic caching systems were getting fast but semantically incorrect responses back — the retrieval system was treating similar-sounding queries as identical even when their meaning differed.
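As a rough sketch of that retrieve-then-verify shape (illustrative only: a trivial word-order check stands in for the paper's learned Transformer verifier, and this is not Redis's implementation):

```python
# Stage one casts a wide, order-blind net; stage two inspects token order
# and rejects structural near-misses the first stage cannot see.

def recall_stage(query, docs, k=3):
    """Stage one: rank candidates by word overlap (fast, order-insensitive)."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:k]

def verify_stage(query, candidates):
    """Stage two: drop candidates with the same words in a different order."""
    q_tokens = query.lower().split()
    kept = []
    for d in candidates:
        d_tokens = d.lower().split()
        near_miss = set(d_tokens) == set(q_tokens) and d_tokens != q_tokens
        if not near_miss:
            kept.append(d)
    return kept

docs = [
    "rome is closer than paris",
    "paris is closer than rome",    # same words, flipped meaning
    "florence is a city in italy",
]
query = "rome is closer than paris"
candidates = recall_stage(query, docs)     # the flipped sentence survives recall
precise = verify_stage(query, candidates)  # and is rejected at verification
```

The design point is the division of labor: the first stage stays cheap and broad, and only the short candidate list pays the per-token verification cost.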
The two-stage architecture is Redis's proposed fix, with incorporation into its LangCache product on the roadmap but not yet available to customers.
What this means for enterprise teams
The research does not require enterprise teams to rebuild their retrieval pipelines from scratch. But it does ask them to pressure-test assumptions most teams have never examined — about what their embedding models are actually doing, which metrics are worth trusting and where the real precision gaps live in production. Recognize the tradeoff before tuning around it. Rajamohan said the first practical step is understanding the regression exists. He evaluates any LLM-based retrieval system on three criteria: correctness, completeness and usefulness. Correctness failures cascade directly into the other two, which means a retrieval system that scores well on relevance benchmarks but fails on structural near-misses is producing a false sense of production readiness. RAG is not obsolete — but know what it can't do. Rajamohan pushed back firmly on claims that RAG has been superseded. "That's a massive oversimplification," he said. "RAG is a very simple pipeline that can be productionized by almost anyone with very little lift." The research does not argue against RAG as an architecture. It argues against assuming a single-stage RAG pipeline with a fine-tuned embedding model is production-ready for precision-sensitive workloads. The fix is real but not free. For teams that do need higher precision, Rajamohan said the two-stage architecture is not a prohibitive implementation lift, but adding a verification stage costs latency. "It's a mitigation problem," he said. "Not something we can actually solve."
Cross-Course Generalizability of SRL-Aligned Predictive Models Using Digital Learning Traces
arXiv:2604.22812v1 Announce Type: new Abstract: STEM dropout rates remain high at universities, particularly in computer science programs with theory-intensive courses. Digital learning environments now capture rich behavioral data that could help identify struggling students early, yet the generalizability of data-driven prediction models across courses and institutions remains uncertain. Guided by self-regulated learning (SRL) theory, this study analyzed multimodal digital-trace data from three undergraduate theoretical computer science courses (N1 = 137, N2 = 104, N3 = 148) at two universities. Weekly SRL-aligned digital-trace indicators were modeled using Elastic Net, Random Forest, and XGBoost to evaluate predictive performance over time and across settings, and model calibration both within and across courses. Early prediction of at-risk students was feasible, with SRL-related behaviors such as time management, effort regulation, and sustained engagement emerging as key predictors. While Random Forest achieved the highest in-sample accuracy, Elastic Net generalized more robustly across contexts. Out-of-sample accuracy and calibration declined between institutions with different base rates, underscoring the contextual nature of predictive analytics in higher education. These findings suggest that digital learning traces enable early identification of at-risk students within courses, but generalizing predictive models beyond their original context requires caution, particularly if the at-risk rates differ between contexts.
Summary of the AI Index Report 2026, Part II
What if the reality and the hype are bifurcating society on AI? Hello AI related infographics. 🗺️
Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines
arXiv:2604.23178v1 Announce Type: new Abstract: LLM-as-a-Judge has become the dominant paradigm for evaluating language model outputs, yet LLM judges exhibit systematic biases that compromise evaluation reliability. We present a comprehensive empirical study comparing nine debiasing strategies across five judge models from four provider families (Google, Anthropic, OpenAI, Meta), three benchmarks (MT-Bench n=400, LLMBar n=200, custom n=225), and four bias types. Our key findings: (1) Style bias is the dominant bias (0.76-0.92 across all models), far exceeding position bias (<= 0.04), yet has received minimal research attention. (2) All models show a conciseness preference on expansion pairs, but truncation controls confirm they correctly distinguish quality from length (0.92-1.00 accuracy), suggesting quality-sensitive evaluation rather than a simple length bias. (3) Debiasing is beneficial but model-dependent: the combined budget strategy significantly improves Claude Sonnet 4 by +11.2 pp (p < 0.0001), with directionally positive trends for other models. Only 2 of 20 non-baseline configurations show decreased agreement. We release our evaluation framework, controlled dataset, and all experimental artifacts at https://github.com/sksoumik/llm-as-judge.
The missing step between hype and profit
This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here. In February, I picked up a flyer at an anti-AI march in London. I can’t say for sure whether or not its writers meant to riff on South Park’s underpants gnomes. But…
Enterprise AI is missing the business core | InfoWorld
Most enterprise AI sits at the edge of the business rather than at its operational center. Until enterprises apply AI to core systems, much of the promised value will remain out of reach.
Geopolitics, Policy & Governance
China blocks Meta’s $2bn purchase of AI group Manus
Regulators had reviewed whether deal violated Beijing’s investment rules
Google signs classified AI deal with Pentagon, The Information reports | Reuters
Reuters had earlier reported that the Pentagon had been pushing top AI companies such as OpenAI and Anthropic to make their tools available on classified networks without the standard restrictions they apply to users.
Can China Really Block Meta’s Manus AI Acquisition?
Artificial intelligence “agents” that can carry out complex tasks with minimal human intervention are attracting heavy investment as giant technology companies look beyond chatbots to develop more autonomous systems.
Meta’s Chinese stumble suggests declining tolerance for shades of grey
Tech-related capital flows have benefited from decades of ambiguity, but AI changes the calculus
Geopolitical Barriers to Globalization
arXiv:2509.12084v4 Announce Type: replace Abstract: We show that since the mid-1990s, the trade-promoting effects of tariff liberalization have been increasingly offset by deteriorating geopolitical alignment, slowing trade globalization after 2007. To quantify this barrier, we use large language models to compile 833,485 geopolitical events across 193 countries, 1950--2024, and construct a bilateral geopolitical alignment score. Using local projections, we estimate that a one-standard-deviation permanent improvement in alignment raises bilateral trade by 22 percent in the long run. In an Armington framework, tariff reductions raised 2021 global trade by about 7.5 percent, while geopolitical deterioration reduced it by about 5.3 percent, with uneven welfare effects.
Observing the Chinese AI Ecosystem: Insights from Beijing Lab Visits
A series of site visits to Beijing-based AI labs provides a window into the current state of China's AI development. These observations are critical for understanding the competitive landscape and the impact of international trade restrictions on regional innovation.
What Global Turmoil Means for Company Structure
The international order is undergoing structural transformation. War in the Middle East, the prolonged conflict in Ukraine, and major shifts in U.S. trade and foreign policy that have altered the country’s traditional alliances are manifestations of a broader reconfiguration of power. Tariffs, export controls, sanctions, and the vulnerability of strategic choke points as […]
Ukraine-linked voices weigh in on the EU’s €160 million DefenceTech gamble
The recently announced EU-Ukraine defence innovation programme is not just another Brussels funding announcement. For Ukraine-linked founders, investors, and DefenceTech operators, the roughly €160 million initiative could become a test of whether Europe can move from statements of support to practical, battlefield-relevant industrial backing. Launched during the EU–Ukraine business summit in Brussels, the programme is […]
Opinion | The profound way America and China are diverging on AI
Public skepticism could become a strategic liability.
India eyes global supply chain pivot as $15 billion Google AI hub breaks ground - The Times of India
New Delhi/Visakhapatnam: India is poised to emerge as a “trusted value chain and supply chain partner” for the world in electronics manufacturing.
The sovereign shift as Asia accelerates industrial realignment amid global tensions | Domain-b.com
Asia is accelerating investments in defense, sovereign AI, and energy transition as geopolitical risks reshape global supply chains and industrial strategy.
China expands economic pressure tools in US rivalry ahead of Trump–Xi summit - The Economic Times
China is building new economic pressure tools against the United States. Beijing has enacted laws to punish foreign firms shifting supply chains. It has also tightened rare earth licensing and banned foreign AI chips. These actions signal a strategic move to counter U.S. measures.
UK backs company building breakthrough AI that can discover new knowledge - GOV.UK
The UK government’s Sovereign AI programme is backing Ineffable Intelligence, co-investing with the British Business Bank to scale a UK-built, self-learning AI that can generate new knowledge and drive breakthroughs.
What Should Frontier AI Developers Disclose About Internal Deployments?
arXiv:2604.23065v1 Announce Type: new Abstract: Frontier AI developers are increasingly deploying highly capable models internally to automate AI R&D, but these deployments currently face limited external oversight. It is essential, therefore, that developers provide evidence that internally deployed models are safe. While recent work has highlighted the risks of internal deployments and proposed broad approaches to transparency and governance, there remains little guidance on the specific information developers should disclose about them. We address this gap by identifying key information that companies should disclose about internally deployed models across four categories: capabilities, usage, safety mitigations, and governance. For each category, we analyse the key benefits and limitations of disclosure and consider how disclosure-related risks can be mitigated. Our framework could be used by developers to inform both public transparency documents, such as model system cards, and private periodic reports required under emerging frontier AI regulation.
UGAF-ITS: A Standards Harmonization Framework and Validation Tool for Multi-Framework AI Governance in Distributed Intelligent Transportation Systems
arXiv:2604.22789v1 Announce Type: new Abstract: Organizations deploying AI-enabled Intelligent Transportation Systems face fragmented governance: ISO/IEC 42001 demands a certifiable management system, the EU AI Act imposes binding high-risk obligations from August 2026, and the NIST AI Risk Management Framework structures voluntary practice. Each instrument is internally coherent, yet they drive different control vocabularies, evidence expectations, and audit rhythms. In distributed ITS deployments where vehicle manufacturers, roadside integrators, and cloud operators each hold partial evidence and partial accountability, this fragmentation multiplies compliance effort and obscures incident traceability. This paper introduces UGAF-ITS, a standards harmonization framework that consolidates 154 source obligations from the three instruments into 12 unified controls across eight governance domains through a reproducible five-phase crosswalk methodology. A three-tier operating model allocates each control to the vehicle, edge, or cloud tier where enforcement and defensible evidence production are feasible. An evidence backbone of 20 versioned artifacts supports a single audit package across all three frameworks without duplicating content. We validate UGAF-ITS through an open-source governance engine evaluated across four architecturally distinct ITS deployment scenarios. The engine encodes the complete crosswalk catalog and executes eight compliance computations. Three-tier deployments achieve 91.7% average framework coverage with 45.9% evidence reduction, complete bidirectional traceability, and 80% of artifacts serving all three frameworks simultaneously. Partial deployments degrade gracefully: coverage and reduction scale with architectural complexity. The tool, scenarios, and all reported results are publicly available for independent replication.
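The paper's coverage figures come from a crosswalk that maps each unified control onto the obligations it satisfies in each framework. A minimal sketch of that computation (hypothetical control and obligation names for illustration, not the actual UGAF-ITS catalog of 154 obligations and 12 controls):

```python
# Minimal sketch of framework-coverage computation from a governance
# crosswalk. Control and obligation identifiers below are invented
# placeholders, not the real UGAF-ITS catalog.

# Each unified control satisfies obligations in one or more frameworks.
CROSSWALK = {
    "UC-01 Risk register":    {"ISO42001": ["6.1"], "EU_AI_Act": ["Art9"],  "NIST_RMF": ["MAP-1"]},
    "UC-02 Data governance":  {"ISO42001": ["7.5"], "EU_AI_Act": ["Art10"]},
    "UC-03 Incident logging": {"EU_AI_Act": ["Art12"], "NIST_RMF": ["MEASURE-2"]},
}

ALL_OBLIGATIONS = {
    "ISO42001":  ["6.1", "7.5", "9.2"],
    "EU_AI_Act": ["Art9", "Art10", "Art12"],
    "NIST_RMF":  ["MAP-1", "MEASURE-2", "GOVERN-3"],
}

def framework_coverage(crosswalk, all_obligations):
    """Fraction of each framework's obligations satisfied by some unified control."""
    covered = {fw: set() for fw in all_obligations}
    for mapping in crosswalk.values():
        for fw, obls in mapping.items():
            covered[fw].update(obls)
    return {fw: len(covered[fw]) / len(obls)
            for fw, obls in all_obligations.items()}

print(framework_coverage(CROSSWALK, ALL_OBLIGATIONS))
```

Because one control can discharge obligations in several frameworks at once, a single evidence artifact behind it serves multiple audits, which is the mechanism behind the reported 45.9% evidence reduction.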
AI regulation set to become US midterm battleground | Biometric Update
AI regulation is becoming a proxy fight over democracy, federalism, religious nationalism, surveillance capitalism, and executive power.
Chinese groups push for neutral global AI governance
Chinese scientists urge fair, open AI development free from politics, launching a global initiative to promote inclusive AI governance worldwide.
China bars foreign investment in Manus AI project as scrutiny on AI exports grows · TechNode
China’s National Development and Reform Commission (NDRC) today announced that, in accordance with laws and regulations, it has issued a decision
The Doublespeak in OpenAI’s ‘Industrial Policy for the Intelligence Age’ | TechPolicy.Press
The companies that stand to profit most from the AI transition are the same ones being asked to help design the rules that govern it, writes Paul Nemitz.
Mobility Monitor: The White House Legislative Plan for AI Falls Short on Education While Sanders and Khanna Move to Tax the Rich » NCRC
This month's edition of Mobility Monitor discusses the limitations of the Trump administration’s legislative priorities on AI for the education system, the workforce and communities facing data center development before diving into a new bill in Congress proposing a 5% billionaire tax.
India’s AI Policy Panel: Startup Impact
India’s new AI policy panel aims to resolve fragmented regulation. Discover what AIGEG means for startups, innovation, and liability in India’s AI economy.
Cybersecurity Threats Demand Global Collaboration: Key Strategies – Archyde
Not just zero-days: autonomous, AI-driven threats now adapt to defenses in real time. The response from the tech industry? A reluctant but necessary embrace of collaboration, where competitors share threat intelligence, open-source communities harden AI models, and even governments play a role, albeit cautiously.
Get the full executive brief
Receive curated insights with practical implications for strategy, operations, and governance.