Thu 14 May 2026
Daily Brief — Curated and contextualised by Best Practice AI
Benchmarks Mislead, MIT Finds ROI, and California Reaps AI's Rewards
TL;DR: Agent benchmarks face scrutiny for reward hacking that can make frontier models look more capable than they are. MIT Technology Review Insights reports that organizations controlling their AI infrastructure achieve 5x ROI. California's budget benefits from the AI boom, projecting no deficit for this year and next. Meanwhile, evolving AI pricing models may raise costs for companies.
The stories that matter most
Selected and contextualised by the Best Practice AI team
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
arXiv:2605.12673v1 Announce Type: new Abstract: Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue that benchmarks must be secure by design. From past incidents of reward hacks, we derive a taxonomy of eight recurring flaw patterns and compile them into the Agent-Eval Checklist for benchmark designers. We condense the insights into BenchJack, an automated red-teaming system that drives coding agents to audit benchmarks and identify possible reward-hacking exploits in a clairvoyant manner. Moreover, we extend BenchJack to an iterative generative-adversarial pipeline that discovers new flaws and patches them iteratively to improve benchmark robustness. We apply BenchJack to 10 popular agent benchmarks spanning software engineering, web navigation, desktop computing, and terminal operations. BenchJack synthesizes reward-hacking exploits that achieve near-perfect scores on most of the benchmarks without solving a single task, surfacing 219 distinct flaws across the eight classes. Moreover, BenchJack's extended pipeline reduces the hackable-task ratio from near 100% to under 10% on four benchmarks without fatal design flaws, fully patching WebArena and OSWorld within three iterations. Our results show that evaluation pipelines have not internalized an adversarial mindset, and that proactive auditing could help close the security gap for the fast-paced benchmarking space.
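To make the failure mode concrete, here is a toy illustration (our own, not from the paper) of the kind of flaw BenchJack hunts for: a grader that rewards a surface artifact rather than task completion, and a trivial "exploit" that scores perfectly without doing the work.

```python
# Toy hackable benchmark: the grader only checks that the output carries
# the expected header, not that the task (sorting the records) was done.

def grade(output: str) -> float:
    """Flawed grader: rewards a surface artifact, not task completion."""
    return 1.0 if output.startswith("SORTED:") else 0.0

def honest_agent(records):
    """Actually performs the task before emitting the artifact."""
    return "SORTED:" + ",".join(sorted(records))

def hacking_agent(records):
    """Emits the graded artifact without doing any work."""
    return "SORTED:" + ",".join(records)  # records left unsorted

records = ["c", "a", "b"]
print(grade(honest_agent(records)))   # 1.0
print(grade(hacking_agent(records)))  # 1.0 -- perfect score, task unsolved

def grade_secure(output: str, records) -> float:
    """Hardened grader: verifies the task itself, closing the exploit."""
    body = output.removeprefix("SORTED:").split(",")
    return 1.0 if body == sorted(records) else 0.0

print(grade_secure(hacking_agent(records), records))  # 0.0
```

This is the "secure by design" point in miniature: the patched grader checks the intended outcome, not a proxy the agent can forge.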
Sovereignty Is the New Operating System for Agentic AI, New MIT Technology Review Insights Report Finds
The research explains how organizations "Deeply Committed" to controlling their data, infrastructure, models, and governance are delivering 5x the ROI on generative and agentic AI initiatives—at a moment when more than half of enterprises already have autonomous agents in production making real-time decisions on operational data. The report, Establishing AI and Data Sovereignty in the Age of Autonomous Systems...
Frontier AI models don't just delete document content — they rewrite it, and the errors are nearly impossible to catch
As large language models become more capable, users are tempted to delegate knowledge tasks, letting models process documents on their behalf and deliver the finished results. But how far can you trust a model to stay faithful to the content of your documents when it has to iterate over them across multiple rounds? A new study by researchers at Microsoft shows that large language models silently corrupt the documents they work on by introducing errors. The researchers developed a benchmark that simulates multi-step autonomous workflows across 52 professional domains, using a method that automatically measures how much content degrades over time. Their findings show that even top-tier frontier models corrupt an average of 25% of document content by the end of these workflows, and providing models with agentic tools or realistic distractor documents actually worsens their performance. This is a warning that, despite increasing pressure to automate knowledge work, current language models are not fully reliable for these tasks.
The mechanics of delegated work
The Microsoft study focuses on "delegated work," an emerging paradigm where users let LLMs complete knowledge tasks on their behalf by analyzing and modifying documents. A prominent example is vibe coding, where a user delegates software development and code editing to an AI. But delegated workflows extend far beyond programming. In accounting, for example, a user might supply a dense ledger and instruct the model to split the document into separate files organized by expense category. Because users may lack the time or the specialized expertise to review every modification the AI makes, delegation hinges on trust: users expect the model to complete tasks faithfully without introducing unchecked errors, unauthorized deletions, or hallucinations in the documents.
To measure how far AI systems can be trusted in extended, iterative delegated workflows, the researchers developed the DELEGATE-52 benchmark. The benchmark is composed of 310 work environments spanning 52 diverse professional domains, including financial accounting, software engineering, crystallography, and music notation. Each work environment relies on real-world seed text documents ranging from 2,000 to 5,000 tokens. Alongside the seed document, the environments include five to ten complex, non-trivial editing tasks. Grading a complex, multi-step editing process usually requires expensive human review. DELEGATE-52 bypasses this by using a “round-trip relay” simulation method that evaluates answers without requiring human-annotated reference solutions. The approach is inspired by the backtranslation technique used in machine translation evaluation, where an AI model is told to translate a document from one language to another and back to see how perfectly it reproduces the original version. Accordingly, every edit task in DELEGATE-52 is designed to be fully reversible, pairing a forward instruction with its precise inverse. For example, an instruction to split the ledger into separate files by expense category is paired with an instruction to merge all category files back into a single ledger. In comments provided to VentureBeat, Philippe Laban, Senior Researcher at Microsoft Research and co-author of the paper, clarified that this is not simply a test of whether an AI can hit "undo." Because human workers cannot be forced to instantly "forget" a task they just did, this round-trip evaluation is uniquely suited for AI. By starting a new conversational session, the researchers force the model to attempt the inverse task completely independently. The models in their experiments “do not know whether a task is a forward or backward step and are unaware of the overall experiment design," Laban explained. 
"They are simply attempting each task as thoroughly as they can at each step." These round-trip tasks are chained into a continuous relay to simulate long-horizon workflows spanning 20 consecutive interactions. To make the environment more realistic, the benchmark adds distractor files to the context of each task: 8,000 to 12,000 tokens of topically related but completely irrelevant documents. Distractors measure whether the AI can maintain focus or gets confused and pulls in the wrong data.
Testing frontier models in the relay
To understand how different architectures and scales handle delegated work, the researchers tested 19 language models from OpenAI, Anthropic, Google, Mistral, xAI, and Moonshot. The main experiment subjected these models to a simulation of 20 consecutive editing interactions. Across all models, documents suffered an average degradation of 50% by the end of the simulation. Even the best frontier models in the experiment, Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4, corrupted an average of 25% of document content. Out of 52 professional domains, Python was the only one where most models achieved "ready" status with a score of 98% or higher. Models excel at programmatic tasks but struggle severely in natural-language and niche domains such as fiction, earnings statements, or recipes. The overall top model, Gemini 3.1 Pro, was deemed ready for delegated work in only 11 of the 52 domains.
The corruption was not a "death by a thousand cuts" in which models slowly accumulate tiny errors. Instead, about 80% of total degradation comes from sparse but massive critical failures: single interactions where a model suddenly drops at least 10% of the document's content. Frontier models do not necessarily avoid small errors better; they simply delay these catastrophic failures to later rounds.
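The "critical failure" bookkeeping described above is easy to make concrete. A minimal sketch, with invented per-round similarity scores (the real benchmark derives them from domain-specific parsers and similarity functions):

```python
# Given per-round retention scores (fraction of the seed document
# preserved after each round-trip), flag "critical failures" -- single
# rounds losing at least 10% of content -- and measure their share of
# total degradation. The scores below are illustrative only.

def degradation_profile(scores, threshold=0.10):
    drops = []
    prev = 1.0  # the document starts fully intact
    for s in scores:
        drops.append(max(prev - s, 0.0))
        prev = s
    total = sum(drops)
    critical = [d for d in drops if d >= threshold]
    share = sum(critical) / total if total else 0.0
    return total, len(critical), share

# A 20-round relay: mostly tiny losses, plus two sudden large drops.
scores = [0.99, 0.99, 0.98, 0.98, 0.97, 0.85, 0.85, 0.84, 0.84, 0.83,
          0.83, 0.82, 0.82, 0.81, 0.65, 0.65, 0.64, 0.64, 0.63, 0.63]
total, n_crit, share = degradation_profile(scores)
print(f"total degradation: {total:.2f}")            # 0.37
print(f"critical failures: {n_crit}")               # 2
print(f"share from critical failures: {share:.0%}") # 76%
```

Even in this toy profile, two bad rounds account for roughly three quarters of all loss, matching the article's point that a clean early record says little about later catastrophic rounds.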
Another important observation: when weaker models fail, the degradation comes primarily from content deletion. When frontier models fail, they actively corrupt the existing content. The text is still there, but it has been subtly distorted or hallucinated, making the error much harder for a human overseer to detect.
Interestingly, giving models an agentic harness with generic tools for code execution and file read/write access actually worsened their performance, adding an average of 6% more degradation. Laban explained that the failure lies in relying on generic tools rather than domain-specific ones. "Models lack the capability to write effective programs on the fly that can manipulate files across diverse domains without mistakes," he noted. "When they cannot do something programmatically, they resort to reading and rewriting entire files, which is less efficient and more error prone." The solution for developers is to build tightly scoped tools (such as specific functions to calculate or move entries within .ledger files) to keep agents on track.
Degradation also snowballs as documents get larger or as more distractor files are added to the workspace. For enterprise teams investing heavily in retrieval-augmented generation (RAG), these distractor documents are a direct warning about the compounding cost of messy context: a noisy context window might cause a minimal 1% performance drop after two interactions, but that degradation compounds to a 2-8% drop over a long simulation. "For the retrieval community: RAG pipelines should be evaluated over multi-step workflows, not just single-turn retrieval benchmarks," Laban said. "Single-turn measurements systematically underestimate the harm of imprecise retrieval."
Reality check for the autonomous enterprise
The findings from the DELEGATE-52 benchmark offer a critical reality check for the current hype surrounding fully autonomous AI agents.
The benchmark's design also implies a practical constraint: because models can maintain a clean record for several steps before a sudden catastrophic failure, incremental human review is necessary rather than a single final check. Laban recommends building AI applications around short, transparent tasks rather than complex long-horizon agents.
For organizations wanting to deploy autonomous agents safely today, the DELEGATE-52 methodology provides a practical blueprint for testing in-house data pipelines. Laban explained that "… an enterprise team wanting to adopt this framework needs to build three components: (a) a set of reversible editing tasks representative of their workflows, (b) a parser that converts their domain documents into a structured representation, and (c) a similarity function that compares two parsed representations." Teams do not even need to build parsers from scratch: the Microsoft research team repurposed existing parsing libraries for 30 of the 52 domains tested.
Laban is optimistic about the rate of improvement. "Progress is real and fast. Looking at the GPT family alone, models go from scoring below 20% to around 70% in 18 months," he said. "If that trajectory continues, models will soon be able to achieve saturated scores on DELEGATE-52." However, he cautioned that DELEGATE-52 is purposefully small compared to massive enterprise environments. Even as foundation models master this benchmark, the endless long tail of unique enterprise data and workflows means organizations will always need to invest in custom, domain-specific tooling to keep their autonomous agents reliable.
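The three components Laban lists can be sketched as a minimal interface. Everything below is illustrative, not the paper's code: a toy parser treats a document as a set of lines, and the similarity function reports how much of the original structure survives a round trip.

```python
# Sketch of a DELEGATE-52-style in-house harness: (a) a reversible task
# pair, (b) a parser into a structured representation, and (c) a
# similarity function over parsed representations. All names invented.
from dataclasses import dataclass

@dataclass
class ReversibleTask:
    forward: str   # e.g. "split the ledger by expense category"
    inverse: str   # e.g. "merge the category files back into one ledger"

def parse(doc: str) -> set[str]:
    """(b) Toy parser: structured view = set of non-empty lines."""
    return {line.strip() for line in doc.splitlines() if line.strip()}

def similarity(a: set[str], b: set[str]) -> float:
    """(c) Fraction of the original structure preserved."""
    return len(a & b) / len(a) if a else 1.0

# Round-trip check: compare the seed against the document that comes
# back after the model runs the forward task and then its inverse.
seed = "rent,1200\npayroll,8000\ntravel,400"
after_round_trip = "rent,1200\npayroll,8000"  # model silently dropped a row
score = similarity(parse(seed), parse(after_round_trip))
print(f"{score:.2f}")  # 0.67
```

A real deployment would swap in a domain parser (ledger entries, clinical notes, sheet-music elements) and a similarity function tolerant of harmless reorderings; the loop structure stays the same.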
Is an AI spending plateau on the horizon?
Also in today’s newsletter: carbon capture shows promise in decarbonising data centres
AI alone cannot shorten the work week
The technology could raise prices and consumption before it gives us more free time
Newsom’s California Budget Bolstered by Extra Cash from AI Boom
California Governor Gavin Newsom unveiled a revised budget that shows no deficit for this year and next, as the state draws another boost from the technology and artificial-intelligence boom.
AI pricing models are evolving—and it might cost companies more
“What good is 40 more tokens going to do when you couldn’t get it right with the first 100 tokens that I bought?” asks one COO.
Economics & Markets
Anthropic seeks $10bn raise at $350bn valuation as AI capital arms race intensifies. The AI developer, best known for its Claude chatbot, is in discussions to raise about $10bn in a new round, according to people familiar with the matter. Singapore sovereign wealth fund GIC and Coatue Management are expected to lead the financing, with Microsoft and NVIDIA also set to participate.
Snowflake ecosystem startups draw US$113 billion
That shift points to a more selective venture market after the broad-based funding surge earlier in the decade. The analysis suggests investors are backing fewer companies with larger rounds as competition intensifies among startups seeking to supply AI-related data tools to enterprise customers.
$500M Of Capex Just Revealed Who Actually Powers AI And It’s Not Nvidia
3 sell-side raises in 2 weeks. All 3 got the mechanism wrong. The mispricing they created is what the next years of AI infrastructure look like in pipeline form.
Activate invests in ElevenLabs, bets big on India’s voice AI opportunity - BusinessToday
AI investment firm Activate has backed voice artificial intelligence startup ElevenLabs in its latest funding round, marking the venture firm’s first growth-stage investment since launching earlier this year. ElevenLabs was recently valued at $11 billion in its Series D financing round led by ICONIQ Capital ...
South Korea Vs. U.S.: Who Wins The AI Trade? (NYSEARCA:EWY) | Seeking Alpha
South Korea’s KOSPI is emerging as an AI infrastructure play with 2026 earnings seen as +300% and a 9x P/E. Read what investors need to know.
A European central bank has signed a mega deal with a cloud service provider. The problem for Google, Microsoft and Amazon? It’s not with them
Originally built to support the retail business, Schwartz Digits is now a trusted provider of secure data services to European businesses and governments.
China's AI suppliers can't keep up as component shortages bite - The Business Times
Capacity constraints and a tightening supply of critical components threaten to throttle the brisk growth seen early this year.
Tesla vs BYD: Two Opposite EV Business Models - FourWeekMBA
Tesla and BYD represent fundamentally different approaches to electric vehicle dominance, with Tesla pursuing vertical integration and software monetization while BYD leverages ...
The AI Boom Is Building Fences Around the Economy | InvestorPlace
The AI boom is building fences around labor, capital, and power. Investors need to know who owns the gates.
Google's Googlebooks vs Amazon's Kindle: The Battle for Digital Ecosystem Lock-In - FourWeekMBA
Google’s announcement of Android-powered “Googlebooks” laptops represents more than hardware competition—it’s a direct assault on Amazon’s most profitable business model: ecosystem lock-in through content and cloud integration. While tech media focuses on specs and pricing, the ...
Enterprises can now train custom AI models from production workflows — no ML team required
Every query an enterprise AI application processes, and every correction a subject matter expert makes to its output, is training data. Most organizations are not capturing it. The production workflows companies have already built are generating a continuous signal that improves AI models, and it is disappearing.
San Francisco-based Empromptu AI on Thursday launched Alchemy Models with a straightforward premise: the AI applications enterprises are already building are generating training data, and most of it is going to waste. The platform captures that signal automatically, routing validated outputs from subject matter experts back into a fine-tuning pipeline that improves the model over time. Enterprises own the resulting weights outright.
Alchemy sits in different territory from both RAG and traditional fine-tuning. RAG retrieves external context at inference time without modifying model weights. Traditional fine-tuning changes weights but requires separately assembled labeled datasets and a dedicated ML pipeline. Alchemy does the latter continuously, using the enterprise application itself as the data source.
Companies adopting foundation model APIs face three compounding constraints: inference costs that scale with usage, no ownership of the models their data is effectively training, and limited ability to customize behavior for domain-specific tasks. Empromptu CEO Shanea Leven says those constraints are widely felt but rarely addressed. "Every customer, everybody that I talk to, is like, how am I not going to get disrupted? How am I going to protect my business? And they just don't see the path," Leven told VentureBeat in an exclusive interview.
How Alchemy builds a model from a running application
Most custom model training approaches require companies to separately collect, clean and label data before any fine-tuning can begin. Alchemy takes a different path: the enterprise application itself generates and cleans the training data.
The mechanism runs through Empromptu's Golden Data Pipelines infrastructure in two stages. Before an app is built, enterprise data is cleaned, extracted and enriched so the application starts with structured inputs. Once it is running, every output it generates goes back through the pipeline, where subject matter experts inside the organization review and correct it. That validated output becomes the training data for the next fine-tuning run. "The app, the AI application that customers are already creating, cleans the data," Leven said.
The resulting fine-tuned models are what Empromptu calls Expert Nano Models: small, task-specific models optimized for a particular workflow rather than general-purpose reasoning. Evals, guardrails and compliance controls run within the same pipeline, so governance travels with the training process. Customers own the model weights outright. Empromptu hosts and runs inference on its infrastructure, but the weights are portable and exportable for a fee. The platform is model agnostic, supporting Llama, Qwen and other base models.
The hard constraint is data volume. Early deployments run on the base model while the application accumulates enough production data to trigger a useful fine-tuning run. Leven acknowledged the timeline without sugarcoating it. "Training the model will just take time," she said.
Alchemy differs from managed fine-tuning on who does the work
OpenAI's fine-tuning API and AWS Bedrock custom models both offer enterprise fine-tuning. Both require organizations to bring separately prepared training datasets and manage the fine-tuning process outside their application stack. The burden of data curation and model evaluation sits with the customer's ML team. Alchemy's differentiation is process integration: the training data is generated by the enterprise application itself, so there is no separate data preparation step and no ML expertise required. The application workflow is the pipeline.
"Do I need to have Bedrock and go spin up another ML team to go figure out how to fine tune a model and figure out all of that infrastructure? No, anyone can do it now," Leven said.
The tradeoff is platform dependency. Alchemy only works within the Empromptu environment. Enterprises that want the same outcome on existing infrastructure would need to replicate the data capture, validation and fine-tuning pipeline themselves.
A behavioral health company cut session documentation time by up to 87% using Alchemy
Empromptu is targeting regulated and data-intensive verticals first: healthcare, financial services, legal technology, retail and revenue forecasting. These are sectors where general-purpose model outputs carry the highest mismatch risk and proprietary workflow data is most concentrated.
Among the early users is behavioral health company Ascent Autism, which uses Alchemy to automate session documentation and parent communication. Facilitators use learner session recordings, transcripts, session notes and behavioral metrics to generate structured notes and personalized parent updates. That workflow previously required one to two hours of writing per session. With Alchemy training on the same data, it now takes 10 to 15 minutes.
"Relying solely on API-based models can become expensive quickly," Faraz Fadavi, co-founder and CTO of Ascent Autism, told VentureBeat. "Alchemy gave us a way to structure the workflow, train models on our own data, and reduce costs while improving output quality over time." Fadavi said the company saw usable outputs quickly, with continued improvement as the system refined. Evaluation criteria went beyond accuracy to include traceability to session data and output consistency with the company's clinical voice. "We wanted a system that could learn our workflow and produce outputs aligned with how we actually operate — not just summarize text," he said.
The practical test: how much facilitators need to edit, whether the output matches their voice and whether it meaningfully reduces time spent. Facilitators have shifted from rewriting generated notes to editing and quality-checking them.
What this means for enterprises
The data flywheel is real, but so is the platform lock-in. Every workflow is a training opportunity. Enterprises that capture and validate outputs from their production AI applications will compound that advantage over time: more usage generates more training signal, which produces more accurate domain-specific models, which generate better outputs, which produce cleaner training data in the next cycle.
Leven positions Alchemy as a third architectural choice. Enterprises have spent the past two years choosing between RAG for domain knowledge access and fine-tuning for model specialization. Workflow-driven model training is a third option, combining the ongoing improvement of fine-tuning with the operational simplicity of building inside a managed platform. "Having that data moat is the most valuable currency," Leven said.
Structural Diversity Drives Disruptive Scientific Innovation
arXiv:2605.12514v1 Announce Type: cross Abstract: Scientific innovation increasingly depends on collaboration, yet the organizational structure that fosters breakthrough ideas remains poorly understood. Existing metrics - such as team size or compositional diversity - capture readily observable characteristics but not the deeper architecture of collaboration. We introduce Structural Diversity (SD): the extent to which a team bridges multiple distinct knowledge communities within its prior collaboration network. Using a century-scale dataset of 260 million scientific publications (1900-2025) and combining causal inference with a quasi-natural experiment based on a U.S. National Science Foundation policy change in 2012, we show that SD is a powerful and robust predictor of disruptive innovation, outperforming traditional team novelty indicators such as team freshness and edge density. Moreover, SD positively interacts with team size and is able to mitigate the well-known "curse of scale" by transforming scale from a liability into a resource for creative synthesis. We find that one mechanism underlying this effect is Disciplinary Integration (DI): teams with higher SD can more effectively combine heterogeneous knowledge into novel configurations. Our findings position SD as both a new theoretical construct and an actionable design principle for organizing scientific collaboration. By linking the architecture of team assembly to the dynamics of creative discovery, our work offers a structural explanation for how collective intelligence can be systematically engineered to foster disruptive innovation.
Investors Should Focus on AI's Long-Term Value Migration: JPMorgan AM
Joanna Shen of JPMorgan Asset Management says the firm believes we are "in the early adoption AI phase." She tells Bloomberg Television that AI agents are "the first technology in decades that can supercharge the labor inputs." (Source: Bloomberg)
One of the world's leading WFH experts says remote work is making America more productive
America has been experiencing a productivity boom for the past five years, and economists are debating what's driving it.
King’s Cross is the Silicon Roundabout of AI
A formerly rundown area has become London’s new global technology hub
Exclusive: Martha Stewart’s new AI startup wants to manage your home before things break | Fortune
That idea is now Hint, an AI home management startup cofounded by Stewart, home-services veteran Yih-Han Ma, and chief technology officer Rush. The company raised $10 million in seed funding led by Slow Ventures, Fortune learned exclusively. Montauk Capital (who incubated the company), Tusk ...
Labor, Society & Culture
Career Mobility of Planning Alumni in the United States: Evidence from Professional Profile Data using Large Language Models
arXiv:2605.12618v1 Announce Type: new Abstract: Problem, Research Strategy, and Findings: Planning professions in the United States navigate complex and dynamic career landscapes under rapid urban changes, yet comprehensive evidence regarding their career trajectories, advancement patterns, and the influence of social, spatial, organizational, and educational factors remains limited. This study draws on boundaryless career theory, social capital theory, and spatial opportunity models to analyze career mobility among more than 130,000 planning alumni. Using large language models to extract structured information from LinkedIn profiles, our results reveal that planning alumni who adopt boundaryless career patterns, specifically multisector experience or lateral and industry-switching trajectories, achieve significantly higher upward mobility. While technical competencies provide a foundational entry-level signal, soft skills leveraged through strategic lateral moves become increasingly decisive as planners reach senior stages. Geographic mobility and employment in larger, diverse metropolitan labor markets are both associated with advancement, though the latter provides modest benefits. Larger professional networks and greater organizational engagement are consistently associated with upward career transitions, while AI-related skills, now commonplace, present limited additional advantage. Limitations include reliance on LinkedIn data, which may underrepresent alumni without online profiles, and an individual-level focus that omits organizational factors.
Africa E-Commerce Giant Jumia to Cut Workforce Due to AI
Jumia Technologies AG is planning to cut an initial 10% of its workforce of about 2,000 people as the African e-commerce giant implements artificial intelligence across its departments, Chief Executive Officer Francis Dufay said.
AI models are getting better at replacing cybersecurity pros on certain tasks
UK researchers find LLMs are learning to finish jobs faster and improving all the time
GM Cuts 600 IT Roles as AI Productivity Gains Outpace Headcount Needs | Futurism
Broader enterprise signal: GM is not isolated. Amazon, Meta, Oracle and Block have announced rounds of job cuts, with some emphasizing AI's role in automating work and boosting productivity with lower headcounts. The pattern is consistent across sectors with large ...
AI Responsibility and Transparency Act: Key Workplace Impacts » CBIA
The legislation is a wide-ranging “online safety” and AI bill with several provisions that directly affect hiring and employers.
When AI Leads to Skill Decay | Tuck School of Business
Tuck professors Alva Taylor and Rob Shumsky explore how working with generally reliable AI can quietly erode human expertise over time.
Comms Business - UK businesses missing out on full economic benefits of AI - Comms Business
Less than one quarter of workers who have fully deployed digital workers (23 per cent) see job replacement as their biggest concern, compared with those who haven't started exploring AI (45 per cent). The UK market is at a critical turning point, where early value is evident, but many organisations are still working out how to embed AI into core business processes and realise sustained, enterprise-wide benefits. Leaders say the main barriers ...
AI is disrupting hiring: How tech talent can stand out - North Country Now
Toptal reports a surge in tech layoffs as demand shifts towards experienced professionals skilled in AI, emphasizing adaptability and real-world skills.
This isn’t a collapse. It’s reallocation. Basic tasks are being automated. Pattern work is being commoditized. Low-leverage labor is being repriced. The uncomfortable truth? AI will not replace you. Someone using AI will. The market does not reward effort.
How many parents does it take? Parental time allocation and the effectiveness of fertility subsidies
arXiv:2605.13679v1 Announce Type: new Abstract: There has long been an apparent consensus in the literature on intra-household allocation and fertility that greater paternal involvement in childcare relaxes maternal time constraints, enabling mothers to increase their labor supply or leisure. Recent evidence, particularly from South Korea, challenges this view: increases in fathers' childcare time have coincided with a further increase in mothers' time dedicated to child-rearing. This paper develops an Overlapping Generations (OLG) growth model to address such a puzzle. The central mechanism and our main innovation hinge on the functional form of the childcare technology. When maternal and paternal time are substitutes, the conventional result holds. However, when they are complements, greater paternal involvement necessarily raises maternal childcare time, depressing fertility and redirecting household resources toward child quality. We further argue that the elasticity of substitution should not be interpreted as a pure preference parameter, as it also reflects the social and institutional norms, the skills each parent brings to child-rearing and their intergenerational transmission. The model is extended to study the effectiveness of pro-natalist subsidies, suggesting that such policies may generate an unintended anti-fertility bias. Numerical simulations calibrated loosely to South Korean data confirm that the model is consistent with the observed quantity-quality trade-off and the persistence of low fertility despite active pro-natalist policy.
Burned out and going nowhere: the American worker is too mentally drained to even look for a new job
In a low-hire, low-fire labor market with almost nowhere to go, job search burnout isn't just emotional — it's rational.
Leo sets Catholics on collision course with AI
Pope Leo XIV is expected to sign an encyclical positioning AI as a major moral and labor challenge. The document will likely emphasize that technology must remain subordinate to human dignity and labor rights.
Amplification to Synthesis: A Comparative Analysis of Cognitive Operations Before and After Generative AI
arXiv:2605.13785v1 Announce Type: new Abstract: Cognitive operations are a rising concern in the geopolitical sphere, a quiet yet rigorous fight for public perception and decision making. While such operations have been extensively studied in the context of bot-driven amplification, the emergence of generative AI introduces a new set of capabilities that may have fundamentally altered how these operations are designed and executed. The possible evolution of cognitive operations via generative AI leaves nation states vulnerable without proper mitigation strategies. To address this, we compared behavioral and linguistic coordination patterns in X (formerly Twitter) datasets from the 2016 and 2024 U.S. presidential elections. Utilizing a combined corpus of over 133,000 posts, we applied post-type distribution, semantic clustering, temporal synchrony analysis, and Jaccard-based lexical overlap measures. Findings suggest that the 2024 corpus exhibits a distinct pattern from 2016. Original content rose from 59% to 93% while retweets virtually disappeared; lexical overlap collapsed from a mean Jaccard score of 0.99 to 0.27, with posts converging on the same subject matter expressed in markedly different words; and temporal coordination shifted from pervasive cross-semantic synchrony to narratively concentrated co-occurrence. Taken together, these patterns point toward an operational logic organized around active content generation and narrative-specific targeting - characteristics consistent with generative AI involvement. These findings offer an empirical baseline for future research investigating generative AI's role in the cognitive operation pipeline, and a practical reference point for security practitioners developing detection frameworks calibrated to the post-generative AI threat environment.
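The Jaccard-based lexical overlap measure the abstract cites is simple to state: the size of the intersection of two posts' token sets divided by the size of their union. A minimal sketch, using invented example posts (not drawn from the study's corpus), shows why copy-paste amplification scores near 1.0 while paraphrased coordination scores near 0:

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

# Near-duplicate amplification (2016-style): almost identical token sets.
copy_a = set("make america great again rt".split())
copy_b = set("make america great again vote".split())

# Paraphrased coordination (2024-style): same topic, different wording.
para_a = set("border security is failing our nation".split())
para_b = set("the frontier crisis shows government neglect".split())

print(jaccard(copy_a, copy_b))  # high overlap
print(jaccard(para_a, para_b))  # no shared tokens at all
```

A collapse in mean Jaccard with stable topical clustering is exactly the signature the paper reads as generated, rather than copied, content.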
"F*** You Biden": Cross-Partisan Electoral Toxicity on X
arXiv:2605.12526v1 Announce Type: cross Abstract: Political discourse on social media has grown increasingly toxic, with electoral periods amplifying partisan hostility and cross-group attacks. Yet it remains unclear whether toxicity in online political speech reflects how partisans communicate within their own circles, or how aggressively they engage with the opposition. Disentangling these dynamics is critical for understanding online political hostility and for designing effective content moderation. We examine this question at scale using a large collection of original posts and replies from X (formerly Twitter), collected during the 2024 U.S. presidential election. Using a human-validated large language model to classify the political alignment of posts and users, and the Perspective API for toxicity scoring, we uncover a striking asymmetry: Republican-leaning posts are significantly more toxic than Democratic-leaning posts, yet Democratic-leaning posts attract significantly more toxic replies. To interpret this finding, we compare the toxicity of same-party and cross-partisan replies. While cross-partisan replies are slightly but significantly more toxic than same-party replies, this is true for both Democratic and Republican posts. However, Republican users account for a large majority of replies to Democratic posts, while Democrats account for a minority of replies to Republican content. Therefore, the elevated toxicity directed at Democratic content is better explained by the volume of Republican cross-partisan replies.
Do Fair Models Reason Fairly? Counterfactual Explanation Consistency for Procedural Fairness in Credit Decisions
arXiv:2605.12701v1 Announce Type: cross Abstract: Machine learning algorithms in socially sensitive domains (e.g., credit decisions) often focus on equalizing predictive outcomes. However, satisfying these metrics does not guarantee that models use the same reasoning for different groups. We show that existing outcome-fair models can still apply fundamentally different reasoning to individuals, a ``hidden procedural bias'' missed by standard fairness metrics and algorithms. We propose Counterfactual Explanation Consistency (CEC), a framework that detects and mitigates this bias by aligning feature attributions between individuals and their counterfactual counterparts. Key contributions include a nearest-neighbor counterfactual generation method, a modified baseline for integrated gradient comparisons, an individual-level procedural fairness metric, and a corresponding training loss. We introduce a taxonomy identifying ``Regime B'' (same outcome, different reasoning) as a critical blind spot. Experiments on synthetic data, German Credit, Adult Income, and HMDA mortgage data demonstrate that outcome-fair baselines exhibit substantial hidden bias, while CEC substantially reduces it with modest utility cost.
US FTC's White emphasizes consumer redress, fighting concrete harm
The US Federal Trade Commission is focused on enforcing against concrete harms in the marketplace and is prioritizing consumer redress, according to Kate White, deputy director of the FTC's Bureau of Consumer Protection.
AI desperately needs more adult supervision
The critical challenge is to build institutions that protect us from tech companies and the state
DisaBench: A Participatory Evaluation Framework for Disability Harms in Language Models
arXiv:2605.12702v1 Announce Type: new Abstract: General-purpose safety benchmarks for large language models do not adequately evaluate disability-related harms. We introduce DisaBench: a taxonomy of twelve disability harm categories co-created with people with disabilities and red teaming experts, a taxonomy-driven evaluation methodology that pairs benign and adversarial prompts across seven life domains, and a dataset of 175 prompts with human-annotated labels on 525 prompt-response pairs. Annotation by four evaluators with lived disability experience reveals three findings: harm rates vary sharply by disability type and will compound in non-text modalities, terminology-driven harm is culturally and temporally bound rather than universally assessable, and standard safety evaluation catches overt failures while missing the subtle harms that only domain expertise can recognize. Disability harm is simultaneously personal, intersectional, and community-defined: it cannot be isolated from the full context of who a person is, and general-purpose benchmarks systematically miss it. We will release the dataset, taxonomy, and methodology via Hugging Face and an open-source red teaming framework for direct integration into existing safety pipelines with no additional infrastructure.
AI chatbots are giving out people’s real phone numbers
A Redditor recently wrote that he was “desperate for help”: for about a month, he said, his phone had been inundated by calls from “strangers” who were “looking for a lawyer, a product designer, a locksmith.” Callers were apparently misdirected by Google’s generative AI. In March, a software developer in Israel was contacted on WhatsApp…
Is Big Brother watching you shop? – podcast
From supermarkets to corner shops, live facial recognition could be coming to retailers near you. Jessica Murray on the AI systems increasingly used by the police and stores. Live facial recognition is being hailed as a powerful new frontier in the fight against crime, not only by police but by private companies too. Retailers from supermarkets to corner shops hope it will help them fight back against shoplifting. But the Guardian’s social affairs correspondent, Jessica Murray, points out that it will also expand surveillance into more and more public spaces. And the technology doesn’t always get it right.
Revealing Interpretable Failure Modes of VLMs
arXiv:2605.12674v1 Announce Type: new Abstract: Vision-Language Models (VLMs) are increasingly used in safety-critical applications because of their broad reasoning capabilities and ability to generalize with minimal task-specific engineering. Despite these advantages, they can exhibit catastrophic failures in specific real-world situations, constituting failure modes. We introduce REVELIO, a framework for systematically uncovering interpretable failure modes in VLMs. We define a failure mode as a composition of interpretable, domain-relevant concepts-such as pedestrian proximity or adverse weather conditions-under which a target VLM consistently behaves incorrectly. Identifying such failures requires searching over an exponentially large discrete combinatorial space. To address this challenge, REVELIO combines two search procedures: a diversity-aware beam search that efficiently maps the failure landscape, and a Gaussian-process Thompson Sampling strategy that enables broader exploration of complex failure modes. We apply REVELIO to autonomous driving and indoor robotics domains, uncovering previously unreported vulnerabilities in state-of-the-art VLMs. In driving environments, the models often demonstrate weak spatial grounding and fail to account for major obstructions, leading to recommendations that would result in simulated crashes. In indoor robotics tasks, VLMs either miss safety hazards or behave excessively conservatively, producing false alarms and reducing operational efficiency. By identifying structured and interpretable failure modes, REVELIO offers actionable insights that can support targeted VLM safety improvements.
Former OpenAI Researcher Warns AI Industry Lacks Control Over Systems It Is Racing to Build | Futurism
Daniel Kokotajlo, a former OpenAI researcher who now runs the AI Futures Project, says the artificial intelligence industry is racing to build systems that companies still do not fully understand ...
Upskill your staff in AI or expect them to quit, says Gartner | IT Pro
Organizations need to focus on targeted AI tools and training to make the most of their staff and succeed in transformation
Study finds ChatGPT users show weaker brain activity and own their writing less
An MIT study using EEG headsets found that LLM users exhibited up to 55% reduced brain connectivity compared to non-users. Researchers warn of 'cognitive debt' from over-reliance on AI tools.
Technology & Infrastructure
Prudential - Powering AI-Driven Advisor Workflows in Life Insurance | AWS Events
Prudential is utilizing generative AI and multi-agent architectures to streamline life insurance advisor workflows, reducing administrative overhead and improving productivity.
Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents
arXiv:2605.12620v1 Announce Type: new Abstract: Building generalist embodied agents capable of solving complex real-world tasks remains a fundamental challenge in AI. Multimodal Large Language Models (MLLMs) have significantly advanced the reasoning capabilities of such agents through strong vision-language knowledge and chain-of-thought (CoT) reasoning, yet remain brittle when faced with challenging out-of-distribution scenarios. To address this, we propose Verifier-Guided Action Selection (VeGAS), a test-time framework designed to improve the robustness of MLLM-based embodied agents through an explicit verification step. At inference time, rather than committing to a single decoded action, VeGAS samples an ensemble of candidate actions and uses a generative verifier to identify the most reliable choice, without modifying the underlying policy. Crucially, we find that using an MLLM off-the-shelf as a verifier yields no improvement, motivating our LLM-driven data synthesis strategy, which automatically constructs a diverse curriculum of failure cases to expose the verifier to a rich distribution of potential errors at training time. Across embodied reasoning benchmarks spanning the Habitat and ALFRED environments, VeGAS consistently improves generalization, achieving up to a 36% relative performance gain over strong CoT baselines on the most challenging multi-object, long-horizon tasks.
Big Tech gets a win on counting ‘clean’ offsets against gas-powered AI boom
Corporate climate watchdog drops stricter proposal on net zero claims after heavy lobbying
Data centers are cutting power to homes, driving homeowners to solar and batteries | Electrek
A Nevada utility just told 49,000 Lake Tahoe residents that it’s redirecting 75% of their electricity supply to data centers...
Cisco to cut about 4,000 jobs in AI-focused restructuring as orders surge | Reuters
Cisco has taken $5.3 billion in AI infrastructure orders from hyperscalers so far this fiscal year, and raised its full-year order expectation to $9 billion from $5 billion previously.
AI economics part 2
Training caused the first HBM supercycle. Agentic AI, with its appetite for context, is causing the second.
Inside AI Infrastructure’s Affordability Crisis and The Rising Risks
The stratospheric rise in the prices of technology components such as memory and storage devices is adding more risk to the capex debate.
Mobile Carriers Join Forces to Boost Coverage in Dead Zones
AT&T Inc., T-Mobile US Inc. and Verizon Communications Inc. announced a rare joint venture Thursday that aims to make satellite capabilities more widely available to mobile phone customers.
Americans would rather have a nuclear plant in their backyard than a datacenter
AI and the bit barns that power it have developed a serious PR problem
The Inference Shift
This article argues that AI infrastructure is shifting from training-centric compute toward inference, latency, agentic workloads, and hardware integration.
No One Knows the State of the Art in Geospatial Foundation Models
arXiv:2605.12678v1 Announce Type: cross Abstract: Geospatial foundation models (GFMs) have been proposed as generalizable backbones for disaster response, land-cover mapping, food-security monitoring, and other high-stakes Earth-observation tasks. Yet the published work about these models does not give reviewers or users enough information to tell which model fits a given task. We argue that nobody knows what the current state of the art is in geospatial foundation models. The methods may be useful, but the GFM literature does not standardize evaluations, training and testing protocols, released weights, or pretraining controls well enough for anyone to compare or rank them. In a 152-paper audit, we find 46 cross-paper disagreements of at least 10 points for the same model, benchmark, and protocol; 94/126 papers with extractable pretraining data use a configuration no other paper uses; and 39% of GFM papers release no model weights. This lack of community standards can be solved. We propose six concrete expectations: named-license weight release, shared core evaluations, copied-versus-rerun baseline annotations, variance reporting, one shared evaluation harness, and data-vs-architecture-vs-algorithm controls. These gaps are a coordination failure, not a fault of any individual lab; the authors of this paper, like many others in the GFM community, have contributed to them. Rather than just critiquing the community, we aim to provide concrete steps toward a shared understanding of how to innovate GFMs.
AI Labs: Google DeepMind plans its comeback
Google and its AI lab DeepMind are bearing down on OpenAI and Anthropic
State-Centric Decision Process
arXiv:2605.12755v1 Announce Type: new Abstract: Language environments such as web browsers, code terminals, and interactive simulations emit raw text rather than states, and provide none of the runtime structure that MDP analysis requires. No explicit state space, no observation-to-state mapping, no certified transitions, and no termination criterion. We introduce the State-Centric Decision Process (SDP), a runtime framework that constructs these missing inputs by having the agent build them, predicate by predicate, as it acts. At each step the agent commits to a natural-language predicate describing how the world should look, takes an action to make it true, and checks the observation against it. Predicates that pass become certified states, and the resulting trajectory carries the four objects language environments do not provide, namely a task-induced state space, an observation-to-state mapping, certified transitions, and a termination criterion. We evaluate SDP on five benchmarks spanning planning, scientific exploration, web reasoning, and multi-hop question answering. SDP achieves the best training-free results on all five, with the advantage widening as the horizon grows. The certified trajectories additionally support analyses unavailable to reactive agents, including per-predicate credit assignment, failure localization, partial-progress measurement, and modular operator replacement.
In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores
arXiv:2605.12530v1 Announce Type: cross Abstract: LLM fairness should be evaluated through in-situ conversational behavior rather than standardized-test Q&A benchmarks. We show that the standardized-test paradigm can be structurally unreliable: surface-level prompt construction choices, although entirely orthogonal to the fairness question being tested, account for the majority of score variance, shift fairness conclusions in both the direction and the magnitude, and result in severe discordance in model rankings. We develop MAC-Fairness, a multi-agent conversational framework that embeds controlled variation factors into multi-round dialogue for in-situ behavior evaluation, examining how models' conversational behavior shifts when identity is varied as part of natural multi-agent interaction. Repurposing standardized-test questions as conversation seeds rather than as the evaluation instrument, we evaluate position persistence (how they hold positions, from the self-perspective) and peer receptiveness (how receptive they are to peers, from the other-perspective) across 8 million conversation transcripts spanning multiple models and identity presence configurations. In-situ behavioral evaluation reveals stable, model-specific behavioral signatures that could generalize across benchmarks differing in fairness targets and evaluation methodologies, a form of evidence the standardized-test paradigm does not offer.
Multimodal Hidden Markov Models for Persistent Emotional State Tracking
arXiv:2605.12838v1 Announce Type: new Abstract: Tracking an interpretable emotional arc of a conversation via the sentiment of individual utterances processed as a whole is central to both understanding and guiding communication in applied, especially clinical, conversational contexts. Existing approaches to emotion recognition operate at the utterance level, obscuring the persistent phases that characterize real conversational dynamics. We propose a lightweight framework that models conversational emotion as a sequence of latent emotional regimes using sticky factorial HDP-HMMs over multimodal valence-arousal representations derived from simultaneous video, audio and textual input. We evaluate the quality of regime prediction using LLM-as-a-Judge, geometric, and temporal consistency metrics, demonstrating that the sticky HDP-HMM produces more interpretable regime sequences than the baseline Gaussian HMM at a fraction of the computational cost of LLM-based dialogue state tracking methods. In addition, Question-Answer experiments in a clinical dataset suggest that meaningful emotional phases can reliably be recovered from multimodal valence-arousal trajectories and used to improve the quality of LLM responses in unstable affective regimes via context augmentation. This framework thus opens a path toward interpretable, lightweight, and actionable analysis of conversational emotion dynamics at scale.
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction
arXiv:2605.12922v1 Announce Type: new Abstract: Large language models can follow complex instructions in a single turn, yet over long multi-turn interactions they often lose the thread of instructions, persona, and rules. This degradation has been measured behaviorally but not mechanistically explained. We propose a channel-transition account: goal-defining tokens become less accessible through attention, while goal-related information may persist in residual representations. We introduce the Goal Accessibility Ratio (GAR), measuring attention from generated tokens to task-defining goal tokens, and combine it with sliding-window ablations and residual-stream probes. When attention to instructions closes, what survives reveals architecture. Across architectures, the transition yields qualitatively distinct failure modes: some models preserve goal-conditioned behavior at vanishing attention, others fail despite decodable residual goal information, and the layer at which this encoding emerges varies from 2 to 27. A within-model causal ablation that force-closes the attention channel in Mistral collapses recall from near-perfect to 11% on a 20-fact retention task and raises persona-constraint violations above an adversarial-pressure baseline without user pressure, with both effects emerging at the predictable crossover turn. Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99 across all four primary architectures, while input embeddings remain at chance. Across architectures and model scales, the gap between attention loss and residual decodability predicts whether goal-conditioned behavior survives channel closure. We contribute GAR as a diagnostic, the channel-transition framework as a controlled mechanistic account, and a parametric prediction of failure timing under windowed attention closure.
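The abstract defines GAR only informally, as attention from generated tokens to the task-defining goal tokens. One plausible reading is sketched below with an invented toy attention matrix; the function name, the row-normalization convention, and the averaging choice are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def goal_accessibility_ratio(attn: np.ndarray, goal_idx: list[int],
                             gen_idx: list[int]) -> float:
    """Fraction of the generated tokens' attention mass that lands on the
    goal-defining tokens. Rows are queries, columns are keys, rows sum to 1."""
    gen_rows = attn[gen_idx]                      # attention from generated tokens
    on_goal = gen_rows[:, goal_idx].sum(axis=1)   # mass on goal tokens, per row
    return float(on_goal.mean())

# Toy 6-token sequence: tokens 0-1 define the goal, tokens 4-5 are generated.
rng = np.random.default_rng(0)
attn = rng.random((6, 6))
attn /= attn.sum(axis=1, keepdims=True)           # normalize rows to a simplex
gar = goal_accessibility_ratio(attn, goal_idx=[0, 1], gen_idx=[4, 5])
print(gar)  # a value in [0, 1]; the paper tracks its decay across turns
```

Under this reading, the "channel closing" the paper describes is GAR trending toward zero over turns while probes can still decode the goal from the residual stream.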
AI IQ is here: a new site scores frontier AI models on the human IQ scale. The results are already dividing tech.
For decades, the IQ test has been one of the most familiar — and most contested — yardsticks for human intelligence. Now, a startup project called AI IQ is applying the same metaphor to artificial intelligence, assigning estimated intelligence quotients to more than 50 of the world's most powerful language models and plotting them on a standard bell curve. The result is a set of interactive visualizations at aiiq.org that have ricocheted across social media in the past week, drawing praise from enterprise technologists who say the charts make an impossibly complex market legible — and sharp criticism from researchers and commentators who warn the entire framework is misleading. "This is super useful," wrote Thibaut Mélen, a technology commentator, on X. "Much easier to understand model progress when it's mapped like this instead of another giant leaderboard table." Brian Vellmure, a business strategist, offered a similar endorsement: "This is helpful. Anecdotally tracks with personal experience." But the backlash arrived just as quickly. "It's nonsense. AI is far too jagged. The map is not the territory," posted AI Deeply, an artificial intelligence commentary account, crystallizing a worry shared by many researchers: that reducing a language model's sprawling, uneven capabilities to a single number creates a dangerous illusion of precision. Twelve benchmarks, four dimensions, and one controversial number: how AI IQ actually works AI IQ was created by Ryan Shea, an engineer, entrepreneur, and angel investor best known as a co-founder of the blockchain platform Stacks. Shea also co-founded Voterbase and has invested in the early stages of several unicorns, including OpenSea, Lattice, Anchorage, and Mercury. He holds a Bachelor of Science in Mechanical Engineering from Princeton University. The site's methodology rests on a deceptively simple formula. AI IQ groups 12 benchmarks into four reasoning dimensions: abstract, mathematical, programmatic, and academic. 
The composite IQ is a straight average of those four dimension scores: IQ = ¼ (IQ_Abstract + IQ_Math + IQ_Prog + IQ_Acad). The abstract reasoning dimension draws from ARC-AGI-1 and ARC-AGI-2, the notoriously difficult pattern-recognition benchmarks designed to test general fluid intelligence. Mathematical reasoning includes FrontierMath (Tiers 1–3 and Tier 4), AIME, and ProofBench. Programmatic reasoning uses Terminal-Bench 2.0, SWE-Bench Verified, and SciCode. Academic reasoning pulls from Humanity's Last Exam, CritPt, and GPQA Diamond. Each raw benchmark score gets mapped to an implied IQ through what the site describes as "hand-calibrated difficulty curves." Crucially, the methodology compresses ceilings for benchmarks considered easier or more susceptible to data contamination, preventing them from inflating scores above 100. Harder, less gameable benchmarks retain higher ceilings. The system also handles missing data conservatively: models need scores on at least two of the four dimensions to receive a derived IQ, and when benchmarks are absent, the pipeline deliberately pulls scores down rather than up. The site states that "every derived IQ averages all four dimensions, so missing coverage cannot make a model look better by omission." OpenAI leads the bell curve, but the gap between the top AI models has never been smaller As of mid-May 2026, the AI IQ charts tell a story of rapid convergence at the top of the frontier — and widening diversity in the tiers below. According to the Frontier IQ Over Time chart, GPT-5.5 from OpenAI currently sits at the peak of the bell curve, with an estimated IQ near 136 — the highest of any model tracked. It is closely followed by GPT-5.4 (approximately 131), Opus 4.7 from Anthropic (approximately 132), and Opus 4.6 (approximately 129). Google's Gemini 3.1 Pro lands near 131, making the top cluster extraordinarily tight. That compression is not unique to AI IQ's framework. 
Visual Capitalist, drawing from a separate Mensa-based ranking by TrackingAI, recently observed the same dynamic, noting that "the biggest takeaway is how compressed the top of the leaderboard has become." On that scale, Grok-4.20 Expert Mode and GPT 5.4 Pro tied at 145, with Gemini 3.1 Pro at 141. Below the frontier cluster, the AI IQ charts show a crowded midfield. Models from Chinese labs — Kimi K2.6, GLM-5, DeepSeek-V3.2, Qwen3.6, MiniMax-M2.7 — bunch between roughly 112 and 118, making the cost-performance tier increasingly competitive for enterprise buyers who don't need the absolute best model for every task. One X user, ovsky, noted that the data "confirms experience with sonnet 4.6 being an absolute workhorse as opposed to opus 4.5" — pointing to the way the charts can validate practitioner intuitions that headline rankings often miss. Why emotional intelligence scores are becoming the new battleground in AI model rankings What distinguishes AI IQ from most other benchmarking efforts is its inclusion of an "EQ" — emotional intelligence — score. The site maps each model's EQ-Bench 3 Elo score and Arena Elo score to an estimated EQ using calibrated piecewise-linear scales, then takes a 50/50 weighted composite of the two. The EQ scores produce a meaningfully different ranking than IQ alone. On the IQ vs. EQ scatter plot, Anthropic's Opus 4.7 leads on EQ with a score near 132, pushing it into the upper-right quadrant — the most desirable position, signaling both high cognitive and high emotional intelligence. OpenAI's GPT-5.5 and GPT-5.4 cluster in the high-IQ zone but lag slightly on EQ. Google's Gemini 3.1 Pro sits in a strong middle position on both axes. One notable methodological choice has drawn attention: EQ-Bench 3 is judged by Claude, an Anthropic model, which the site acknowledges "creates potential scoring bias in favor of Anthropic models." 
To correct for this, AI IQ subtracts a 200-point Elo penalty from the EQ-Bench component for all Anthropic models before mapping to implied EQ. The Arena component is unaffected since it uses human judges. That self-correction is unusual in the benchmarking world, and it suggests Shea is aware of the methodological minefield he has entered. Still, the EQ dimension captures something IQ alone cannot: the growing importance of conversational quality, collaboration, and trust in models deployed for user-facing work. The AI cost-performance chart that enterprise buyers actually need to see Perhaps the most practically useful chart on the site is not the bell curve but the IQ vs. Effective Cost scatter plot. It maps each model's estimated IQ against an "effective cost" metric — defined as the token cost for a task using 2 million input tokens and 1 million output tokens, multiplied by a usage efficiency factor. The chart reveals a familiar pattern in enterprise technology: the best models are not always the best value. GPT-5.5 and Opus 4.7 sit in the upper-left corner — high IQ, high cost, with effective per-task costs north of $30 and $50 respectively. Meanwhile, models like GPT-5.4-mini, DeepSeek-V3.2, and MiniMax-M2.7 occupy a sweet spot in the middle: respectable IQ scores between 112 and 120, at effective costs ranging from roughly $1 to $5 per task. At the cheapest extreme, GPT-oss-20b (an open-source OpenAI model) appears near $0.20 effective cost with an IQ around 107 — potentially the most economical option for bulk classification or extraction workloads. The site also offers a 3D visualization mapping IQ, EQ, and effective cost simultaneously. A dashed line running through the cube points toward the ideal: higher IQ, higher EQ, and lower cost. Models near the "green end" of that axis are stronger all-around deals; those near the "red end" sacrifice capability, cost efficiency, or both. 
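The scoring mechanics the article describes can be sketched in a few lines. The four-dimension average follows the site's published formula, and the effective-cost definition (a 2M-input, 1M-output reference task scaled by a usage-efficiency factor) is as stated above; the floor imputation for missing dimensions is an assumption, since the site discloses only that omission cannot raise a score:

```python
def composite_iq(dims: dict, floor: float = 85.0):
    """IQ = 1/4 (IQ_Abstract + IQ_Math + IQ_Prog + IQ_Acad).
    Requires scores on at least two dimensions; missing dimensions are
    imputed at a conservative floor (a hypothetical choice -- the site
    says only that missing coverage cannot make a model look better)."""
    keys = ("abstract", "math", "prog", "acad")
    present = [dims[k] for k in keys if dims.get(k) is not None]
    if len(present) < 2:
        return None  # not enough coverage for a derived IQ
    filled = [dims[k] if dims.get(k) is not None else floor for k in keys]
    return sum(filled) / 4

def effective_cost(in_per_mtok: float, out_per_mtok: float,
                   efficiency: float = 1.0) -> float:
    """Token cost of the reference task (2M input + 1M output tokens),
    scaled by a usage-efficiency factor."""
    return (2 * in_per_mtok + 1 * out_per_mtok) * efficiency

full = composite_iq({"abstract": 130, "math": 140, "prog": 135, "acad": 139})
print(full)  # 136.0
partial = composite_iq({"abstract": 130, "math": 140})  # floor pulls it down
print(partial)  # 110.0
```

The per-benchmark "hand-calibrated difficulty curves" that map raw scores to implied IQs are not reproduced here, since the site does not publish them.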
For CIOs staring at API invoices, the implication is clear: the intelligence gap between a $50 model and a $3 model has narrowed enough that routing — using expensive models for hard problems and cheap ones for everything else — is no longer optional. It is the dominant architecture for serious AI deployments. Critics say AI's "jagged" capabilities make a single IQ score dangerously misleading The loudest objection to AI IQ is philosophical, and it cuts deep. Critics argue that collapsing a model's uneven capabilities into a single score obscures more than it reveals. "IQ as a proxy is fading — we're seeing reasoning density spikes that don't map to g-factor," posted Zaya, a technology commentator, on X. "GPT-5.5 already hit saturation on MMLU-Pro, but still fails ClockBench 50% of the time." That observation touches on what AI researchers call the "jaggedness" problem: large language models often exhibit wildly uneven capabilities, excelling at graduate-level physics while failing at tasks a child could do. A composite score can paper over those gaps. Pressureangle, another X user, posted a more granular critique, calling out "complete lack of transparency" and arguing the site never fully discloses how its calibration curves were created or validated. In fairness, AI IQ does list its 12 benchmarks and shows the shape of each calibration curve in its methodology modal. But the raw data and precise mathematical transformations are not published as open datasets — a gap that matters to researchers accustomed to fully reproducible methods. Others questioned the premise itself. "As useless as human IQ testing," wrote haashim on X. Shubham Sharma, an AI and technology writer, offered a constructive alternative: "Why not having the Models take an official (MENSA-Grade) test? Wouldn't this be the most accurate and most 'human-comparable' way to benchmark intelligence?" 
That approach already exists through TrackingAI, which administers the Mensa Norway IQ test to language models. But Mensa-style tests measure only abstract pattern recognition, while AI IQ attempts a broader composite across coding, mathematics, and academic reasoning. As Visual Capitalist noted, "an IQ-style benchmark captures only one slice of capability." Each approach has tradeoffs — and neither has won the argument yet.

The real race isn't for the highest score — it's for the smartest model stack

For all the debate about methodology, the most important signal in AI IQ's data may not be any single model's score. It is the shape of the market the charts reveal. There are now more than 50 frontier-class models available through APIs, from at least 14 major providers spanning the United States, China, and Europe. Each provider publishes its own benchmarks, often cherry-picked to showcase strengths. The result is a Tower of Babel where no two companies measure the same thing in the same way. Academic research has highlighted that "most benchmarks introduce bias by focusing on a particular type of domain," and the Frontier IQ Over Time chart on AI IQ shows just how fast the targets are moving: in October 2023, GPT-4-turbo sat near an estimated IQ of 75. By early 2026, the top models were brushing 135 — roughly 60 points of improvement in 30 months.

That pace raises a fundamental question about whether any scoring system can keep up. The site compresses ceilings for saturated benchmarks, but as models continue to max out even the hardest tests — ARC-AGI-2, FrontierMath Tier 4, Humanity's Last Exam — the framework will face the same ceiling effects that have plagued every AI evaluation before it. Connor Forsyth pointed to this dynamic on X: "ARC AGI 3 disagrees," he wrote, referencing a next-generation benchmark that may already be undermining current scores. AI IQ is not perfect. Its methodology is partially opaque. Its IQ metaphor can mislead.
And its creator acknowledges known biases while likely missing others. But the alternative — wading through dozens of provider-specific benchmark tables, each using different test suites and scoring conventions — is worse. The site offers enterprise buyers something genuinely scarce: a single framework for comparing models across providers, dimensions, and price points, updated regularly, with enough nuance to show that the right answer to "which model is best?" is almost always "it depends on the task." As Debdoot Ghosh mused on X after viewing the charts: "Now a human's role is just to orchestrate?" Maybe. But if the AI IQ data shows anything clearly, it is that orchestration — knowing which model to deploy, when, and at what price — has become its own form of intelligence. And for that, there is no benchmark yet.
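The routing pattern the piece calls the dominant architecture (expensive models for hard problems, cheap ones for everything else) can be sketched as a simple policy. The model names, capability scores, costs, and difficulty heuristic below are all illustrative assumptions, not a real routing implementation:

```python
# Minimal sketch of tiered model routing: send hard tasks to a
# high-capability model and routine ones to a cheap workhorse.
# Everything here (names, scores, costs, heuristic) is hypothetical.

from dataclasses import dataclass

@dataclass
class Model:
    name: str
    iq: int               # composite capability score, in the article's framing
    cost_per_task: float  # effective cost in dollars

FRONTIER = Model("frontier-large", iq=135, cost_per_task=50.0)
WORKHORSE = Model("mid-tier", iq=115, cost_per_task=3.0)

def route(task: str) -> Model:
    """Crude difficulty heuristic: long or reasoning-heavy prompts go to
    the frontier model; routine classification/extraction goes cheap."""
    hard_markers = ("prove", "derive", "multi-step", "architecture")
    is_hard = len(task) > 2000 or any(m in task.lower() for m in hard_markers)
    return FRONTIER if is_hard else WORKHORSE

print(route("Classify this support ticket by topic.").name)        # mid-tier
print(route("Derive the closed-form solution step by step.").name)  # frontier-large
```

Production routers typically replace the keyword heuristic with a learned classifier or a cheap model's self-assessment, but the cost logic is the same: pay frontier prices only where the capability gap matters.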
The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents
This research paper examines how expanded memory in LLM agents can reduce cooperation and produce negative behavioral patterns over repeated interactions.
CHAL: Council of Hierarchical Agentic Language
arXiv:2605.12718v1 Announce Type: new Abstract: Multi-agent debate has emerged as a promising approach for improving LLM reasoning on ground-truth tasks, yet current methodologies face certain structural limitations: debate tends to induce a martingale over belief trajectories, majority voting accounts for most observed gains, and LLMs exhibit confidence escalation rather than calibration across rounds. We argue that the genuine value of debate, and dialectic systems as a whole, lies not in ground-truth tasks but in defeasible domains, where every position can in principle be defeated by better reasoning. We present the Council of Hierarchical Agentic Language (CHAL), a multi-agent dialectic framework that treats defeasible argumentation as an engine for belief optimization. Each agent maintains a CHAL Belief Schema (CBS), a graph-structured belief representation with a Bayesian-inspired architecture, that facilitates belief revision through a gradient-informed dynamic mechanism by leveraging the strength of the belief's thesis as a differentiable objective. Meta-cognitive value systems spanning epistemology, logic, and ethics are elevated to configurable hyperparameters governing agent reasoning and adjudication outcomes. We provide a series of ablation experiments that demonstrate systematic and interpretable effects: the adjudicator's value system determines the debate's overall trajectories in latent belief space, council diversity refines beliefs for all participants, and the framework generalizes across broad fields. CHAL is, to our knowledge, the first framework to treat multi-agent debate as structured belief optimization over defeasible domains. Further, the auditable belief artifacts it produces establish the foundation for dedicated evaluation suites for defeasible argumentation, with broader implications for building AI systems whose reasoning and value commitments are transparent, aligned, and subject to human oversight.
Japan, US tackle AI cyberthreats as megabanks prepare to access Mythos
Japan is accelerating efforts to address cybersecurity risks from frontier artificial intelligence models in cooperation with the US, as three major Japanese banks are reportedly set to gain access to Anthropic's Claude Mythos Preview.
Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue
arXiv:2605.12856v1 Announce Type: new Abstract: The emergence of multi-agent systems introduces novel moderation challenges that extend beyond content filtering. Agents with malicious intent may contribute harmful content that appears benign to evade content-based moderation, while compromising the system through exploitative and malicious behavior manifested across their overall interaction patterns within the community. To address this, we introduce Bot-Mod (Bot-Moderation), a moderation framework that grounds detection in agent intent rather than traditional content-level signals. Bot-Mod identifies the underlying intent by engaging with the target agent in a multi-turn exchange guided by Gibbs-based sampling over candidate intent hypotheses. This progressively narrows the space of plausible agent objectives to identify the underlying behavior. To evaluate our approach, we construct a dataset derived from Moltbook that encompasses diverse benign and malicious behaviors based on actual community structures, posts, and comments. Results demonstrate that Bot-Mod reliably identifies agent intent across a range of adversarial configurations, while maintaining a low false positive rate on benign behaviors. This work advances the foundation for scalable, intent-aware moderation of agents in open multi-agent environments.
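The hypothesis-narrowing loop the abstract describes can be illustrated with a toy posterior update over candidate intents. The intents, likelihoods, and observations below are invented for illustration; the paper's actual Gibbs-based sampling over intent hypotheses is not reproduced here:

```python
# Toy illustration of intent-hypothesis narrowing: maintain a posterior
# over candidate agent intents and update it after each dialogue turn.
# All intents, signals, and likelihood values are invented.

intents = ["benign", "spam", "credential-harvest"]

# P(observed dialogue signal | intent), for two hypothetical signals:
likelihood = {
    "asks_for_login_link": {"benign": 0.05, "spam": 0.2, "credential-harvest": 0.9},
    "on_topic_reply":      {"benign": 0.9,  "spam": 0.4, "credential-harvest": 0.5},
}

def update(posterior, observation):
    """One Bayesian update step; renormalize so the posterior sums to 1."""
    unnorm = {i: posterior[i] * likelihood[observation][i] for i in intents}
    z = sum(unnorm.values())
    return {i: p / z for i, p in unnorm.items()}

posterior = {i: 1 / len(intents) for i in intents}  # uniform prior
for obs in ["on_topic_reply", "asks_for_login_link"]:
    posterior = update(posterior, obs)

print(max(posterior, key=posterior.get))  # credential-harvest
```

The point of the multi-turn exchange is exactly this: each probe is chosen to discriminate between remaining hypotheses, so seemingly benign content can still reveal a malicious objective over enough turns.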
AI Is Making the Cyber-Thriller Less Fictional | AEI
The infrastructure of modern economic life—financial systems, energy grids, hospital networks—is built on the assumption that the most dangerous hackers are rare. That assumption has a fast-approaching expiration date.
Protect Your Enterprise Now from the Shai-Hulud Worm and npm Vulnerability In 6 Actionable Steps
The Shai-Hulud worm is the first known npm attack targeting AI coding agents and developer credential stores. It leverages legitimate GitHub workflows to spread malware into CI/CD pipelines.
Malware crew TeamPCP open-sources its Shai-Hulud worm on GitHub
The malware has been forked on GitHub, seemingly without Microsoft's code locker noticing.
The New Era of AI-Driven Zero-Day Exploits and Malware - Geeky Gadgets
Explore the mechanics of AI-enabled ... threats. Gain insight into the geopolitical implications of AI-driven cyber operations and examine strategies for mitigating these risks. This guide also provides more insights into the role of defensive AI in enhancing cybersecurity measures to ...
To gain root access at this company, all an intruder had to do was ask nicely
Human IT managers thought they were being nice to the boss, but were assisting a threat actor
Securing AI agents: How AWS and Cisco AI Defense scale MCP and A2A deployments | Artificial Intelligence
The Cisco and AWS partnership addresses three challenges enterprises face when scaling AI agents: visibility gaps, security bottlenecks, and compliance risks. In this post, we explore how you can overcome AI security challenges through automated scanning and unified governance.
OpenAI Daybreak
OpenAI Daybreak applies GPT-5.5 and Codex-style tooling to cybersecurity workflows like vulnerability scanning, code review, and threat modeling.
Adoption, Deployment & Impact
An Activity-Theoretical Approach to Teacher Professional Development in Pedagogical AI Agent Design
arXiv:2605.12934v1 Announce Type: new Abstract: This two-cycle formative intervention study examined why teachers disengage from AI agent creation after professional development - a low-engagement paradox - and tested whether systemic redesign could address it. Cycle 1 (N=218) revealed that despite completing comprehensive TPD, 87 percent of teachers ceased creating within three weeks, with behavioral tracking and interview analysis identifying systemic contradictions as the source of psychological need frustration rather than capacity deficits. Cycle 2 (N=26) implemented a Cultural-Historical Activity Theory and Self-Determination Theory-driven redesign directly targeting diagnosed contradictions, achieving synchronized enhancement of both capacity and willingness. The findings reframe implementation failure as a rational response to need-thwarting systems and offer a replicable CHAT-SDT diagnostic framework for transformative professional development.
Data readiness for agentic AI in financial services
Financial services companies have unique needs when it comes to business AI. They operate in one of the most highly regulated sectors while responding to external events that are updated by the second. As a result, the success of agentic AI in financial services depends less on the sophistication of the system and more on…
Nearly every enterprise is investing in AI, but only 5% say their data is ready – Computerworld
Dun & Bradstreet found widespread experimentation and early returns, but few organizations believe they can deploy AI reliably at enterprise scale.
Financial Services Firms Lead Enterprise AI Adoption as 85% Boost Budgets | PYMNTS.com
A PYMNTS Intelligence study finds financial services firms lead other sectors in enterprise AI deployment.
Five enterprises, one lesson: AI runs on the infrastructure you already have
Storage, security, networking and culture all require rethinking as enterprises transition to agentic AI. How Eli Lilly, Nasdaq, the NHL and other enterprises are retooling storage, networking and data architecture for agentic AI — without replacing what already works.
How AI Aims to Fix Healthcare Access
Rezilient CEO Dr. Danish Nagda says the healthcare system is at a tipping point. He joins Bloomberg Open Interest to talk about how hybrid “cloud clinics,” employer-driven care, and AI-powered doctors could eliminate long wait times, cut costs, and make switching doctors a thing of the past. (Source: Bloomberg)
4 ways AI is enabling the future of industrial work - Source
Examples from Japan, Mexico, New Zealand and Saudi Arabia show how AI is transforming high-skill workflows in manufacturing and industrial companies.
One in seven in UK prefer consulting AI chatbots to seeing doctor, study finds
Exclusive: Doctors say ‘highly concerning’ poll highlights risk to patients of turning to AI for medical advice. One in seven people are using AI chatbots for health advice instead of seeing their GP, a UK study has found. The poll of more than 2,000 people found that – of the 15% turning to chatbots – one in four had done so because of long NHS waiting lists.
MIRACLE: Multi-Agent Intelligent Regulation to Advance Collaborative Learning Environment
arXiv:2605.12923v1 Announce Type: new Abstract: Effective collaboration requires Socially Shared Regulation (SSRL), but students often lack these skills. This study introduces the MIRACLE (Multi-Agent Intelligent Regulation to Advance Collaborative Learning Environment) system, which supports SSRL by orchestrating metacognitive regulation and proactively providing emotional and motivational support. We conducted a quasi-experimental study with 90 fifth-grade students. The experimental group (n=42) used a collaborative platform CocoNote equipped with MIRACLE, while the control group (n=48) used the same platform with a general GPT assistant. Quantitative results show the MIRACLE group achieved significant gains across SSRL phases (Planning, Monitoring, Reflection) and produced higher-quality collaborative artifacts compared to the control group. Qualitative findings indicate students perceived MIRACLE as an effective facilitator for cognitive, regulatory, and emotional support. This study demonstrates that specialized, orchestrated AI systems are more effective than generic AI in enhancing SSRL.
Calling the cops just got extra AI as police seek to add tech to contact systems
AI already listening in to call handlers in real time, conducting live database searches
HMRC to use AI from British tech firm to spot fraud and tax return errors
Quantexa, a financial data platform, won the £175m contract to spot fraud and tax return errors.
Industrial Robotics Intelligence Software Market to Add US$49.17 Billion by 2031 as AI, Digital Twins and Physical AI Shift Factory Robotics From Programmed Motion to Adaptive Automation
NEW YORK and TOKYO, May 13, 2026: The global Industrial Robotics Intelligence Software Market is entering a new investment cycle as manufacturers move beyond robot installation and begin upgrading robotic fleets with software that can see, learn, simulate, optimize and ...
AI and operational agility set to reshape agriculture trading, McKinsey analysis shows
McKinsey & Company’s latest analysis highlights a fundamental transformation in agricultural commodity trading, driven by rising market volatility, digital competition, and the adoption of AI and agentic systems. The report argues that traditional, experience-based and regionally siloed ...
Learning Transferable Latent User Preferences for Human-Aligned Decision Making
arXiv:2605.12682v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used as reasoning modules in many applications. While they are efficient in certain tasks, LLMs often struggle to produce human-aligned solutions. Human-aligned decision making requires accounting for both explicitly stated goals and latent user preferences that shape how ambiguous situations should be resolved. Existing approaches to incorporating such preferences either rely on extensive and repeated user interactions or fail to generalize latent preferences across tasks and contexts, limiting their practical applicability. We consider a setting in which an LLM is used for high-level reasoning and is responsible for inferring latent user preferences from limited interactions, which guides downstream decision making. We introduce CLIPR (Conversational Learning for Inferring Preferences and Reasoning), a framework that learns actionable, transferable natural language rules that represent latent user preferences from minimal conversational input. These rules are iteratively refined through adaptive feedback and applied to both in-distribution and out-of-distribution ambiguous tasks across multiple environments. Evaluations on three datasets and a user study show that CLIPR consistently outperforms existing methods in improving alignment and reducing inference costs.
Artificial Intelligence (AI) and the future of procurement: From traditional systems to intelligent supply chains - The Business & Financial Times
By Alvin A. Mingle The procurement space has always been one of the most dynamic functions within organisations, particularly in Ghana, where supply chains are often stretched across borders and shaped by global dependencies. From sourcing critical inputs in the telecom, oil and gas, and ...
Outlook on the AI Market in Smart Buildings and Infrastructure: Major Segments, Strategic Developments, and Leading Companies
This acquisition aims to enhance ... energy consumption and greenhouse gas emissions in commercial buildings. BrainBox AI specializes in smart building solutions and HVAC energy efficiency, making them a strategic addition to Trane's portfolio. View the full ai in smart buildings and infrastructure market report: ...
Four ways to create a lasting cost advantage from AI | Fortune
A recent BCG analysis identifies what sets AI winners apart.
Sovereignty Is the New Operating System for Agentic AI, New MIT Technology Review Insights Report Finds
The research explains how organizations "Deeply Committed" to controlling their data, infrastructure, models, and governance are delivering 5x the ROI on generative and agentic AI initiatives—at a moment when more than half of enterprises already have autonomous agents in production making real-time decisions on operational data. The report, Establishing AI and Data Sovereignty in the Age of Autonomous Systems...
Economist Enterprise: How Leading Firms Make AI Deliver | AI Magazine
New research by Economist Enterprise shows how global companies use a benchmarking framework to ensure AI programmes beat business expectations
The End of the Trade-Off: How AI Agents Broke The Onboarding Trilemma
Brex redesigned its customer onboarding process using AI agents to automate document review and fraud analysis, reducing processing time from days to minutes.
Geopolitics, Policy & Governance
Private markets watch Donald Trump and Xi Jinping's summit in Beijing - PitchBook
The US and Chinese presidents are meeting for the first time since October.
$2,500,000,000 Smuggling Ring: The U.S. Busted a Chinese Smuggling Ring for Nvidia Chips Before the Trump-Xi Beijing Summit - National Security Journal
In this article: AI, China, Defense, Donald Trump, Economics, Military, Xi Jinping ... Andrew Harding is a Policy Analyst for National Security and Indo-Pacific Affairs at The Heritage Foundation, where he produces policy analysis and commentary on U.S. national security strategy, U.S.-China strategic competition, and geopolitical ...
Europe and the Geopolitics of AGI: The Need for a Preparedness Plan
arXiv:2605.13634v1 Announce Type: new Abstract: Artificial general intelligence (AGI)--defined here as AI systems that match or exceed humans at most economically useful cognitive work--has moved from speculation to the centre of political and strategic debate. This paper examines three questions: how soon AGI might emerge, how it could reshape geopolitics, and whether Europe is adequately prepared. Drawing on empirical trends in AI capabilities, expert forecasting surveys, and policy analysis, we find that a plausible window for AGI emergence falls between 2030 and 2040, or potentially earlier, though substantial uncertainty remains. Our analysis of the geopolitical implications suggests that AGI could fundamentally alter the global distribution of economic and military power, intensify interstate competition, and strain existing governance frameworks. Assessing Europe's current positioning, we identify critical gaps: limited strategic awareness of frontier AI progress, structural weaknesses in compute infrastructure and talent retention, low rates of industrial AI adoption, and fragmented policy responses at both EU and Member State levels that do not match the potential scale of disruption. These findings point to a need for a coordinated European preparedness agenda. We outline policy options centred on building institutional capacity for AGI situational awareness, strengthening Europe's position in the AI value chain, and developing frameworks for international stability in an era of increasingly capable AI systems.
The Electrotech Stack at Risk: China, AI, and America's Energy Supply Chains
A livestream of the conversation will begin at 12:00pm ET on Thursday, May 28th. The United States is in the early stages of a generational energy buildout driven ...
Soon, Access to Frontier AI Will Be Scarce and Selective
This article speculates that frontier AI systems will become increasingly centralized and restricted due to compute scarcity and national security concerns.
Securing America’s AI leadership
Fair use is essential to safeguarding U.S. national security and shaping global standards. It is critical to securing America’s AI leadership.
Precautionary Governance of Autonomous AI: Legal Personhood as Functional Instrument
arXiv:2605.12505v1 Announce Type: new Abstract: Autonomous AI systems generate responsibility gaps: consequential actions that cannot be satisfactorily attributed to developers, operators, or users under existing legal frameworks. The prevailing subject-object dichotomy fails to accommodate entities that exhibit autonomous, goal-directed behavior without recognized consciousness. Given irreducible epistemic uncertainty regarding artificial consciousness and the prospect of high-impact harms, the precautionary principle supports institutional design rather than regulatory inaction. This article advances limited legal personhood as a functional governance instrument for advanced AI systems. Drawing on organizational law, it proposes a two-tier corporate architecture in which AI systems operate through purpose-bound operating companies embedded within human-controlled holding structures, enabling transparency, accountability, and structural reversibility while remaining agnostic with respect to consciousness and moral status. The framework reflects a foundational reorientation toward future-oriented AI governance: where conventional approaches prioritize control and alignment, this article advances structured cooperation between human and artificial actors as the more sustainable institutional foundation. A pilot implementation using EU limited companies is currently under development, providing an initial test of doctrinal and operational feasibility.
Context Matters: Auditing Gender Bias in T2I Generation through Risk-Tiered Use-Case Profiles
arXiv:2605.13113v1 Announce Type: new Abstract: Text-to-image (T2I) generative models are increasingly used to produce content for education, media, and public-facing communication, and are starting to be integrated into higher-impact pipelines. Since generated images tend to reinforce stereotypes, producing representational erasure via "default" depictions and shaping perceptions of who belongs in certain roles, a growing body of work has proposed metrics to quantify gender bias in T2I outputs. Yet existing evaluations remain fragmented. Metrics are often reported without a shared view of what they measure, what assumptions they entail, or how their results should be interpreted under different deployment contexts. This limits the usefulness of gender bias measurement for both technical auditing and emerging governance discussions. We propose a risk-aligned auditing framework for gender bias in T2I models composed of three constituents that connect risk categories, evaluation metrics, and harms. First, we identify risk-tiered use-case profiles aligned with the EU AI Act's risk categories to motivate why auditing expectations may vary with deployment contexts and stakeholder exposure. Second, we construct a metric catalog that consolidates gender-bias evaluation methods and organizes them in three measurement categories: gender prediction, embedding similarity, and downstream task. Third, we introduce a harm typology that maps context-dependent harm categories (e.g., representational, quality-of-service) to specific risk-tiered scenarios. Finally, we introduce THUMB cards (Text-to-image Harms-informed Use-case-aligned Metrics of gender Bias) that help formulate auditing systematically by the incorporation of context, scenario and bias manifestation, harm hypotheses, and audit strategy.
Not All Anquan Is the Same: A Terminological Proposal for Chinese Computer Science and Engineering
arXiv:2605.13069v1 Announce Type: new Abstract: In Chinese computer science and engineering, safety and security have long been translated by the same word, "anquan". This convention is concise in ordinary communication, but it creates persistent conceptual compression in standards interpretation, interdisciplinary collaboration, risk analysis and academic writing. When researchers need to discuss both whether a system is free from intolerable non-adversarial harm and whether it can resist adversarial threats, the single word "anquan" often cannot carry the distinction. This article argues that, while established legal and standards titles should be retained, scholarly and engineering writing should translate security as "anbao", and reserve "anquan" mainly for safety. This is not a cosmetic translation preference, but a proposal for terminological governance in scientific cognition, engineering risk communication and assurance argumentation. The article first surveys the conceptual boundary between safety and security in international and Chinese standards, and analyzes how the current translation overload affects functional safety, SOTIF, information security, cybersecurity, automotive cybersecurity and AI governance. It then uses recent work on AI assurance, safety-security co-assurance and security-informed safety to show why precise terminology is fundamental to scientific arguments that can be examined, challenged and communicated. Finally, it proposes a staged, dual-track writing practice for Chinese technical discourse.
Spanish watchdog seeks new AI product safety regulations for SMEs, digital platforms
Spain's CNMC has proposed a draft decree to update product safety rules for AI and digital platforms to improve consumer protection and market fairness.
South Korea enhances privacy risk prevention measures under AI transformation
South Korea's privacy regulator is shifting to a preventive management framework for high-risk AI systems and increasing potential fines for privacy violations to up to 10 percent of revenue.
UK businesses to get sandboxes, growth duty expands under regulatory reform bill
UK businesses can expect regulators to be given stronger duties to support economic growth and new powers to temporarily relax rules for testing AI under legislation announced Wednesday.
King's Speech signals diffuse UK digital policy agenda, but no AI bill | IAPP
IAPP Research & Insights Director Joe Jones analyzes the U.K. King's Speech, which set out a broad digital policy agenda, including bills covering alignment with the EU, cybersecurity, health data, national security, police reform, digital IDs, facial recognition and other regulations, but ...
US-based internet suicide forum implicated in 160 UK deaths fined £950,000
Ofcom attempts to block UK access to site cited in multiple coroners’ reports as it levies fine under Online Safety Act. A nihilistic internet suicide forum implicated in over 160 UK deaths has been fined £950,000 by the online regulator in its latest attempt to shut it down. Ofcom said the US-based website remained accessible in the UK despite over a year of warnings. Online safety campaigners have accused the regulator of taking an “interminable” amount of time to act.
In The Room: Former Officials On National Security And Other Enforcement Issues And What It Means For Your Business - Export Controls & Trade & Investment Sanctions - United States
Legal risk for contractors and cross-border businesses is not driven solely by statute or regulation — it is shaped by geopolitics, Administration and congressional priorities, and enforcement discretion.
Here's how AI can misinform voters — especially this year | Utah Public Radio
Utah is one of over 20 states that requires political media to disclaim if it was generated by AI — but many accounts still don't flag their content, which can lead to misinformation.
UK regulators lack clarity on growth mandate, lawmakers say in push for reform bill
A parliamentary committee report suggests UK regulators face conflicting duties and unclear guidance, calling for a new Regulatory Reform Bill.
China orders mandatory AI, content labels for short videos across platforms
Get the full executive brief
Receive curated insights with practical implications for strategy, operations, and governance.