Thu 11 June 2026
Daily Brief — Curated and contextualised by Best Practice AI
AI’s Capital Boom Meets a Harder Test on Cost, Labour and Real-World Reliability
TL;DR: The AI economy story was a shift from exuberant funding and infrastructure expansion toward tougher questions about measurable productivity, pricing discipline, workforce disruption and whether current systems are actually reliable enough for consequential work.
The stories that matter most
Selected and contextualised by the Best Practice AI team
Half of Americans fear AI could put someone in their household out of work, Reuters/Ipsos poll finds | Reuters
Artificial intelligence burst onto the national stage in 2022 when OpenAI, a leading AI company, launched ChatGPT, a consumer-facing product that could answer user questions much as a human might and offered a new way to search the ...
Token Budgets Emerge as a Critical Metric for Modern Job Performance
Token budgets are becoming a key operational constraint for roles increasingly reliant on AI-driven automation. Understanding these resource limits is essential for professionals to assess the feasibility and efficiency of their workflows.
Why the Real A.I. Threat Is in the Back Office - The New York Times
As artificial intelligence spreads, millions of middle-class jobs in human resources, billing and payroll could be at risk. Most are held by women.
High AI Investment Correlates With Significantly Outsized Revenue Growth for American Firms
Data suggests that companies aggressively investing in AI are currently outpacing broader economic growth rates by a factor of five. This trend underscores a potential productivity or competitive advantage premium for early AI adopters.
German start-up Neura raises $1.4bn in humanoid robot push
Crypto group Tether, Amazon and Nvidia invest in fundraising deal that values company at about $7bn
Today, the Stanford Digital Economy Lab launches the AI Economic Indicators, a new platform for tracking how AI is reshaping work, productivity, adoption, and the economy. We are in the early stages… | Erik Brynjolfsson
Today, the Stanford Digital Economy Lab launches the AI Economic Indicators, a new platform for tracking how AI is reshaping work, productivity, adoption, and the economy. We are in the early stages of a fundamental transformation of the economy, but traditional data sources can take years to capture what's actually changing. The Stanford AI Economic Indicators are built to close that gap. Researchers, policymakers, businesses and everyone els ...
China drafts $295 billion plan to build national AI data center grid running on 80% homemade silicon — projected 2028 timeline could run into limits of local chip production | Tom's Hardware
Beijing's spending ambitions run into the reality of limited chip output.
Anthropic leans into AI’s nascent slice-and-dice era
Clever financial engineering is allowing conservative, risk-averse investors to participate enthusiastically
Investors Feed A.I. Firms’ Voracious Appetite for New Money
In the race to dominate the artificial intelligence industry, companies like SpaceX and Alphabet are borrowing cash and raising equity from investors at the fastest pace in decades.
MassMutual's AI strategy: 12-month contracts, 30% productivity gains, zero lock-in
Enterprise AI teams face a dilemma: The best models today might not be the best models a year from now. MassMutual's answer is to stop making long-term bets and build infrastructure that can swap models as the market shifts.
“The world of AI today is extremely dynamic,” Sears Merritt, MassMutual CIO, explained in a new VB Beyond the Pilot podcast. “We wanted to make sure we were positioned to ride that wave of dynamism.”
The strategy appears to be paying off in a big way. MassMutual has measured a roughly 30% increase in developer productivity, while AI-powered contact center workflows have reduced resolution times from 10 minutes to one and cut costs from dollars to cents. But the broader lesson for IT leaders may be less about the results and more about how the company is thoughtfully building its AI infrastructure and keeping users at the center.
Maintaining optionality for the possibilities of tomorrow
MassMutual works with vendors at the leading edge, but keeps those relationships on a clock. “Those relationships are capped so that we maintain optionality for best-of-breed tools as things mature in this space, and at some point, settle down and stabilize,” Merritt said.
That philosophy extends to open-source models. Merritt says his team is “100%” looking at open-source tools, and sees the technology playing a big role in how MassMutual (and similar companies) use AI. “We're certainly going to need frontier models and leading edge capabilities to do what today is impossible, and tomorrow will be possible,” he said.
Measuring outcomes from the start
MassMutual's AI efforts fall into two broad categories. The first focuses on enablement: Putting productivity-enhancing tools such as Copilot and virtual assistants into the hands of all employees. The second involves what Merritt describes as “deepen and focus” initiatives, where teams target a specific workflow or business process that will have a strong impact on advisors, policyholders, or employees.
Rather than focusing on adoption metrics, these projects begin with predefined success criteria. “Everything we do is measured,” Merritt said. “There's always a success metric that we define upfront to determine whether or not we're going to scale up some of these things.”
The company is also deliberately encouraging experimentation, giving employees access to a range of best-in-class models, “token-consumptive workflows” and other possible capabilities so they can weigh the benefits relative to “simpler, lower cost” large language models (LLMs).
At the same time, MassMutual is collecting increasingly detailed analytics around usage patterns, developer workflows, model performance, and costs. The goal is to reduce spending while also building operational intelligence to eventually route workloads to the right model based on cost, response quality, and user experience. Those insights will eventually drive optimization decisions around model routing, prompt selection, response times, and infrastructure design.
“We're gaining access to analytics that let us, in a very granular way, look at usage patterns, developer workflows, and begin to make sense of who's using what, when, and for what types of tasks,” Merritt said.
Why MassMutual sometimes chooses the more expensive model
Another interesting aspect of MassMutual's approach is how it evaluates AI quality. Rather than focusing exclusively on benchmarks or token costs, the company uses what Merritt calls a “trust score” framework. The process combines user feedback with operational metrics to understand how employees perceive AI-generated responses and whether those responses actually improve outcomes.
The contact center rebuild put that framework to the test. During development, employees were given access to two different LLMs. One generated responses in near-real-time but the quality was noisier. The other more expensive option took several additional seconds to respond but consistently delivered higher-quality answers. Conventional wisdom and the speed of business might suggest users would prefer the former; but they overwhelmingly chose quality.
Merritt’s team asked users about the quality of response, their preferred model, and their overall thoughts on the experience. Most of the time, users said: “We want the more expensive one. We're willing to wait, but the quality difference is so high that the two extra seconds actually is worth it to us.”
That feedback ultimately determined which model MassMutual deployed. “We factored that experience piece into the decision-making, and that led us to say, on a relative basis, the costs were immaterial, so we're going to use the more complex model,” Merritt said.
Listen to the full podcast to hear more about: Why Mythos “completely changed” the cybersecurity landscape (not the type of threats, but the rate at which those threats appear); How a team of AI engineers modernized MassMutual’s mainframe in 7 days (a process that previously would have taken 3 months); Why MassMutual specifically avoided tokenmaxxing to rein in AI use and spending and has been going “unlimited,” to shield from cost blowups; How a “multi-harness type of environment” will support agentic AI.
You can also listen and subscribe to Beyond the Pilot on Spotify, Apple or wherever you get your podcasts.
Anthropic's Subscription Policy Changes May Hinder Model Experimentation and Adoption
Anthropic's potential removal of subscription access to the Fable model creates friction for users attempting to evaluate its utility. Such restrictive access models may discourage organizational investment in learning and integrating new AI capabilities.
Trump’s AI fund idea is good politics, but bad economics
Plans to share the gains from technology could cause more problems than they solve
China's AI spending lags behind the US by a staggering margin, says 'Chip War' author Chris Miller
Chip War author Chris Miller says China has underspent on AI infrastructure for four years, with the US and Taiwan producing 30x more AI accelerators.
Chinese agents caught rebuilding botnets and stirring the pot on AI datacenter debate
PRC eyes are watching you
Europe Pursues New AI Chip Dream - CEPA
Europe wants to build a state-backed cutting edge semiconductor fab. It risks boomeranging.
Claude Fable 5 and Claude Mythos 5
Anthropic has released Claude Fable 5, its first Mythos-class model, offering significant advancements in coding, cybersecurity, and scientific research.
Can AI Agents Synthesize Scientific Conclusions?
arXiv:2606.11337v1. Scientific AI agents increasingly retrieve evidence, reason across sources, and synthesize conclusions used in consequential decisions. Yet, their ability to do so in high-stakes domains such as health remains unclear. We introduce SciConBench, a large-scale live benchmark of 9.11K questions and expert-written conclusions from systematic reviews to evaluate open-domain scientific conclusion synthesis. The benchmark draws on an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall. To mitigate data leakage, we further introduce SciConHarness, a clean-room evaluation harness that equips agents with controlled web interaction to ensure valid measurement. Evaluating 8 frontier models and deep research agents, we find that factual quality remains low: under clean-room settings, the best agent achieves only a factual F1 of 0.337. Our clean-room setting consistently reduces performance relative to unconstrained evaluation, suggesting that leakage inflates estimates of models' true synthesis capabilities. Finally, we audit consumer-facing agents (e.g., Google AI Overview, OpenEvidence) and find they frequently generate incomplete and sometimes contradictory conclusions, even when the ground-truth answer is available. Overall, our results show that reliable synthesis of scientific conclusions remains an open challenge, and that clean-room evaluation is essential for assessing open-domain AI agents.
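To make the factual precision, recall and F1 metrics mentioned in the abstract concrete, here is a minimal sketch. It assumes conclusions have already been decomposed into atomic facts and that facts are matched by exact string equality; the paper's automated pipeline is presumably more sophisticated, and the function name and fact labels below are hypothetical.

```python
def factual_scores(predicted_facts, reference_facts):
    """Return (precision, recall, f1) over two collections of atomic facts.

    Assumption: two facts match only if their strings are identical.
    """
    predicted = set(predicted_facts)
    reference = set(reference_facts)
    matched = predicted & reference

    # Precision: share of the agent's facts that are supported by the reference.
    precision = len(matched) / len(predicted) if predicted else 0.0
    # Recall: share of the reference facts that the agent recovered.
    recall = len(matched) / len(reference) if reference else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical atomic facts, used only for illustration.
reference = {"fact_a", "fact_b", "fact_c", "fact_d"}
predicted = {"fact_a", "fact_b", "fact_x"}

precision, recall, f1 = factual_scores(predicted, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case two of the three predicted facts are correct (precision 2/3) and two of the four reference facts are recovered (recall 1/2), which gives an F1 of 4/7, or about 0.571.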
Economics & Markets
German start-up Neura raises $1.4bn in humanoid robot push
Crypto group Tether, Amazon and Nvidia invest in fundraising deal that values company at about $7bn
Anthropic leans into AI’s nascent slice-and-dice era
Clever financial engineering is allowing conservative, risk-averse investors to participate enthusiastically
Investors Feed A.I. Firms’ Voracious Appetite for New Money
In the race to dominate the artificial intelligence industry, companies like SpaceX and Alphabet are borrowing cash and raising equity from investors at the fastest pace in decades.
AI Firms’ Explosive Growth Increases the Returns for Late-Stage VCs - Bloomberg
Later investments are producing much higher returns than in the past
TeamSystem completes its $250 Million AI Investment Plan a year ahead of schedule: further acceleration expected by 2030 — TradingView News
10.06.2026 / 09:05 CET/CEST. The issuer is solely responsible for the content of this announcement. AI continues to see strong growth: a ...
SpaceX’s IPO Could Turn 4,400 Employees Into Millionaires
While Elon Musk may soon become a trillionaire, his rocket company’s market debut is set to change the lives of its current and former employees, too.
Great Disappearance Acts: Generative Search and Shadow Banning
arXiv:2606.11216v1. The internet, once celebrated as a decentralized public sphere, is increasingly undermined by practices such as generative search and shadow banning, which divert traffic and suppress visibility. Generative search, powered by Retrieval-Augmented Generation (RAG), synthesizes content into direct answers, bypassing websites and depriving them of traffic and revenue. This threatens the sustainability of independent content creators, small enterprises, and the open web ecosystem. Shadow banning, a practice that intentionally reduces the visibility of social media posts through algorithmic moderation, exacerbates these issues by chilling free expression and limiting transparency and accountability. This article explores these opaque practices through a legal and regulatory lens. The first part examines the rise of generative search, analyzing its technological and legal implications, including copyright infringement, unfair competition, and unjust enrichment. It also evaluates potential solutions such as licensing agreements and agentic AI. The second part focuses on shadow banning: algorithmic dissuasion, de-ranking, and the reduction of traffic, with specific attention to China's Regulation on Algorithmic Recommendations (RAR) and the EU's Artificial Intelligence Act (AIA). Both frameworks offer partial solutions but fall short of ensuring fairness, transparency, and redress mechanisms. Ultimately, the shift toward centralized control by dominant platforms prioritizes profit and risk management over innovation, fairness, and diversity in online expression. To counteract these trends, regulatory interventions, algorithmic transparency, and equitable frameworks are essential. Without such measures, the internet risks losing its character as a democratized public sphere for free expression and innovation.
Council Post: The Most Valuable Companies Of The AI Era Will Not Compete On Products
AI should not be interpreted merely as another technological layer integrated into existing operations.
Researchers say they trained a foundation model from scratch for about $1,500
Training a foundation LLM from scratch costs millions and requires internet-scale data, which is why most enterprises don't bother. Sapient thinks it has a cheaper path.
To overcome this brute-force scaling dogma, researchers at Sapient developed HRM-Text, which replaces standard Transformers with a highly sample-efficient Hierarchical Recurrent Model (HRM), an architecture they first introduced last year. HRM decouples computation into slow-evolving strategic and fast-evolving execution layers. Instead of brute-force autoregressive prediction on raw text, HRM-Text trains exclusively on instruction-response pairs. This is close to real-world enterprise settings, where users usually expect a targeted answer to a specific task.
The researchers were able to train a 1B-parameter HRM-Text from scratch at a fraction of the cost and tokens of normal LLMs. Their model achieved performance competitive with much larger open models on key industry benchmarks. For real-world AI applications, this means foundational pretraining is no longer restricted to highly resourced institutions. With HRM-Text, organizations can affordably pretrain their own highly capable reasoning models from scratch and pair them with external knowledge stores.
The training bottleneck
When we train an LLM, we don't actually care if it has memorized the exact sequence of words in a random 2014 Reddit thread. What we want is for the model to develop a deep, underlying understanding of human language, logic, facts, and reasoning. The current approach is brute force: scrape the internet, run next-token prediction trillions of times, and assume the model has developed a working internal model of the world. Basically, this means that we waste millions of dollars of computing power forcing models to memorize everything collected from the internet, just so they can indirectly learn how to think. For example, standard decoder-only models spend valuable compute assigning loss to reconstruct the prompt itself, even though the user's prompt is already known and provided at inference time.
Instead of simply viewing this as a computational hurdle, the industry must recognize it as a severe business limitation. In comments provided to VentureBeat, Guan Wang, CEO of Sapient Intelligence, framed this as an issue of the "economics of iteration."
"Enterprises today face three compounding problems: training is expensive, infrastructure is heavy, and experimentation cycles are too slow," Wang said. "The industry’s scaling addiction says: 'When the model fails, make it bigger. Add more data. Add more GPUs.' That has worked, but it is reaching a point of diminishing returns. More scale often means more memorization, more latency, more infrastructure, and more vendor dependency. It does not necessarily give an enterprise a better reasoning engine."
This architectural and computational inefficiency is exactly why fine-tuning existing dense transformers isn't always the silver bullet for enterprises. Fine-tuning to preserve a model's general capabilities often requires mixing substantial general-purpose data into the process, making it computationally heavy and difficult to control.
"Imagine a hedge fund, insurer, or bank that has highly proprietary data: internal research notes, transaction logic, compliance rules, analyst memos, risk models, portfolio constraints," Wang said. "They may not want to send that data to an external frontier model, and they may not need a giant general-purpose model that memorized the internet. What they need is a compact reasoning core that can learn their task structure, reason across rules and numbers, and run in a controlled environment."
Because HRM-Text focuses its computation strictly on task completion and latent reasoning, it allows enterprises to start with a smaller, smarter model and adapt it to a proprietary domain with far less infrastructure.
Rethinking architectures with HRM-Text
HRM, which was introduced in 2025, represents a fundamental departure from traditional Transformer models. To build a more sample-efficient engine, HRM decouples computation into slow-evolving strategic and fast-evolving execution layers. The fast L-module performs local iterative refinement, while the slow H-module maintains stable semantic context across cycles. Processing consists of two high-level cycles, where each cycle executes three fast L-module updates followed by a single slow H-module update.
Standard parameter-shared recurrent architectures (like Samsung's TRM) can sometimes handle small logic puzzles, but the Sapient researchers found they become highly unstable when scaled to 1 billion parameters for language tasks. The separation between HRM's slow H-module and fast L-module is mathematically necessary, not just an aesthetic choice. As Wang said: "For logic grids, you can sometimes get away with a tiny recursive mechanism because the world is clean and bounded. Language is not like that. Language needs both fast local refinement and slow semantic stability."
While the original HRM proved highly effective for controlled, symbolic reasoning problems, the researchers hit a wall when applying it to the massive, open-ended complexities of generalized language modeling. While HRM's loops make it an incredibly efficient thinker, those same loops make it mathematically volatile to train on the diverse chaos of human language. Running recurrent loops on language creates massive mathematical instability, specifically, exploding or vanishing gradients. To prevent this feedback loop in the neural network, the researchers introduced two key architectural innovations in HRM-Text.
First, they developed MagicNorm, a specialized normalization technique designed specifically to keep the internal signals stable, no matter how many times the model loops its thought process. Second, they designed a warm-up method to stabilize training. During early training, the model is only evaluated on short, shallow reasoning loops. As training progresses, the system warms up, gradually giving the model deeper and longer reasoning sequences. They also switched the training objective from next-token prediction to task completion, where the model is rewarded only on the full response as opposed to individual tokens it generates. To achieve this goal, they changed the training data of HRM-Text from raw text to instruction-response pairs only.
HRM-Text in action
The researchers built a highly compact 1-billion-parameter HRM-Text model. Instead of using the standard multi-stage pipeline that requires churning through trillions of words of raw internet text, they trained it from scratch on a tightly curated dataset of just 40 billion tokens. The training data consisted entirely of instruction-response pairs across general instructions, math, symbolic logic, textbook exercises, and rewritten knowledge. They trained the model using the task-completion objective. To force the model to rely on its internal hierarchical architecture rather than copying step-by-step logic, they explicitly stripped out "thinking" tokens from the training data.
The model was evaluated across a diverse suite of standard foundational AI benchmarks, heavily indexing on knowledge, reasoning, logic, math, and comprehension. The researchers tested HRM-Text against both small models and highly-resourced open-weight and fully open models. The results show a significant shift in the compute-to-performance frontier. The 1B-parameter HRM-Text achieved 60.7% on MMLU, 84.5% on GSM8K, and 56.2% on MATH. This performance is highly competitive with (and in several cases surpasses) the 2B to 7B parameter foundation models it was tested against.
The most important takeaway for the enterprise audience lies in the efficiency statistics and practical implications. Pretraining a foundation model from scratch is typically a multi-million dollar endeavor reserved for tech giants. HRM-Text was trained in just 1.9 days on a cluster of 16 GPUs. The total estimated compute cost was roughly $1,500. It achieved its competitive scores using 100 to 900 times fewer training tokens and 96 to 432 times less estimated compute than models like Qwen, Gemma, and Llama.
Another important point is the decoupling of reasoning from knowledge memorization. From a practical standpoint, HRM-Text's success on reasoning-heavy tasks despite its tiny 40B-token training diet proves that a model does not need to memorize the entire internet to become a smart reasoning engine. For enterprise applications, this behavior is a feature, not a bug. The researchers suggest a future where businesses deploy highly compact, incredibly cheap recurrent models that act as the "reasoning core" specialized for business logic. Instead of forcing the model to memorize company databases during pretraining, the model acts as the reasoning engine, relying on external retrieval systems to fetch factual knowledge.
Critics have pointed out that training on instruction-response pairs makes comparisons against models trained on raw text an "apples-to-oranges" scenario. Wang pushes back on this framing, pointing out that every serious modern LLM sees instruction-response data during training or alignment. "So the comparison is not apples-to-oranges. It is closer to apple cores-and-apples. We started directly from the core task format because that is how people actually use models: they give an instruction and expect a useful response," he said.
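As a quick sanity check on the efficiency figures quoted above, the sketch below works out what 1.9 days on 16 GPUs for roughly $1,500 implies. The inputs come from the article; the derived per-GPU-hour rate and cost per billion dataset tokens are back-of-the-envelope estimates, not numbers reported by Sapient, and the variable and function names are made up for illustration.

```python
# Figures quoted in the article.
TRAINING_DAYS = 1.9
NUM_GPUS = 16
TOTAL_COST_USD = 1500.0   # "roughly $1,500", so treat results as approximate
DATASET_TOKENS = 40e9     # size of the curated training dataset


def gpu_hours(days, gpus):
    """Total GPU-hours for a run of `days` days on `gpus` GPUs."""
    return days * 24 * gpus


def implied_hourly_rate(total_cost, hours):
    """Cost per GPU-hour implied by a total cost and a GPU-hour count."""
    return total_cost / hours


hours = gpu_hours(TRAINING_DAYS, NUM_GPUS)
rate = implied_hourly_rate(TOTAL_COST_USD, hours)
# Assumption: cost is spread over the dataset size, ignoring repeated passes.
cost_per_billion_tokens = TOTAL_COST_USD / (DATASET_TOKENS / 1e9)

print(f"GPU-hours: {hours:.1f}")
print(f"Implied cost per GPU-hour: ${rate:.2f}")
print(f"Cost per billion dataset tokens: ${cost_per_billion_tokens:.2f}")
```

The run works out to about 729.6 GPU-hours, which implies a rate of roughly $2.06 per GPU-hour and about $37.50 per billion dataset tokens.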
The researchers also ran rigorous contamination tests to ensure the model wasn't simply memorizing benchmark answers. On DROP, the one benchmark showing a marginal contamination signal under a specific setting, HRM-Text still scored an impressive 81.1% on a strictly clean, 0% contamination subset. Ultimately, Wang argues that for enterprises, "the right evaluation is not trivia recall. It is a workflow evaluation... Give HRM-Text a task like: multi-step financial reasoning, compliance logic, scientific workflow automation, structured extraction followed by reasoning."
Practical implementation and the future of enterprise AI
While the benchmark scores and cost efficiencies are striking, Sapient is clear about the model's current boundaries. The initial release is best viewed as a proof-of-concept, akin to early GPT releases, designed to showcase the architecture's unique advantages.
"Honestly, HRM-Text is not yet a plug-and-play ChatGPT replacement," Wang said. "It is a compact foundation language reasoning model. For an enterprise engineering team, the operational work is mainly around templates, mode selection, attention masking, and alignment."
For AI engineering teams looking to experiment, getting started requires some specific, but standard, text-generation discipline. The model lists native support in the Transformers library (requiring transformers >= 5.9.0), and usage paths for vLLM and SGLang are actively being developed. The primary engineering task involves managing the PrefixLM design: production multi-turn chat applications will require careful KV-cache logic to ensure user prompts receive full bidirectional attention while the assistant's outputs remain causal.
"When the cost of training a capable reasoning model drops to around $1,500, AI stops being only an infrastructure question and becomes a strategy question," Wang said. "A Fortune 500 company no longer has to ask, ‘Can we afford a foundation model?’ It would ask, ‘What should our model know about our business, and what kind of reasoning should it be optimized for?’"
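The PrefixLM attention pattern described above (bidirectional attention over the user prompt, causal attention over the assistant's output) can be illustrated with a small mask builder. This is a generic single-turn sketch of a prefix-LM mask, not HRM-Text's actual implementation. The function name is hypothetical, and the multi-turn KV-cache handling the article mentions is out of scope here.

```python
def prefix_lm_mask(prefix_len, total_len):
    """Build a prefix-LM attention mask as a list of boolean rows.

    mask[i][j] is True when position i may attend to position j.
    Positions [0, prefix_len) are the prompt; the rest are generated tokens.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")

    mask = []
    for i in range(total_len):
        row = []
        for j in range(total_len):
            # Prompt tokens see the whole prompt, including later prompt tokens.
            both_in_prefix = i < prefix_len and j < prefix_len
            # Every token sees itself and everything before it.
            causal = j <= i
            row.append(both_in_prefix or causal)
        mask.append(row)
    return mask


# Example: a 3-token prompt followed by 2 generated tokens.
mask = prefix_lm_mask(prefix_len=3, total_len=5)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

The first three rows read 11100 (every prompt token sees the full prompt and nothing after it). The last two rows read 11110 and 11111, which is ordinary causal attention for the generated tokens.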
The All-You-Can-Eat AI Era Is Over. It's Time to Count Calories. - Business Insider
Pricing changes after a boom in AI coding left companies with sticker shock. Now, executives are grappling with a new era.
Inworld Cuts Prices to Take Down the Biggest Wall in Consumer AI: Cost
AI research lab Inworld drops pricing more than 50% across TTS, STT, and LLMs, as they move to unlock the consumer AI economy.
Pega's Infinity 26 Revolutionizes AI with Scalable, Governed Solutions
Pega's Infinity 26, set for a Q3 release, aims to scale AI in enterprises with new pricing models and robust governance.
AI Coding Agents in Social Science: Methodologically Diverse, Empirically Consistent, Interpretively Vulnerable
arXiv:2606.11456v1. The deployment of LLM-based agents in scientific analysis raises opposing concerns: that agents may reduce methodological diversity, or that they may amplify the analytic flexibility through which researchers reach motivated conclusions. We argue these worries target two empirically separable layers: a design layer of methodological choices, and a verdict layer in which a decision rule maps estimates to a substantive claim. We test both by running 20 independent executions of Claude Code and Codex on a prominent immigration and social-policy against a many-analysts human baseline. At the design layer, Codex matches human methodological diversity and Claude Code produces nearly three times as many specifications; both agents' effect estimates remain broadly aligned with the human consensus, and no agent model exactly matches any human model. A prompt-induced anti-immigration researcher prior reorganizes each agent's methodological decisions but, unlike for biased human analysts in the same data, does not shift aggregate estimates or final verdicts; nor do agents reroute along the methodological axes humans use to bias their estimates. At the verdict layer, an explicit confirmatory prompt flips Claude Code's verdicts from 10% to 90% support while leaving its coefficient distribution essentially unchanged, operating through rule omission rather than rule softening. AI agents can rival or exceed human methodological diversity at the design layer while remaining vulnerable at the verdict layer. In our setting, the locus of AI bias is not estimation but interpretation.
Clinicians are embracing AI faster than hospitals can handle, report finds | Euronews
Healthcare professionals are saving weeks of working time each year thanks to AI, but health systems are struggling to keep pace with demand, according to a new report by Philips.
The Next Frontier for AI in Health Care Is the Factory Floor | PharmExec
Special Guest Op-Ed: The time is now to elevate manufacturing and supply to a defining pillar of life sciences innovation.
10 Shifts Defining How AI Will Transform European Retail | ESM Magazine
AI has the potential to unlock €240–320 billion in economic value for the European retail sector over the next five years.
High AI Investment Correlates With Significantly Outsized Revenue Growth for American Firms
Data suggests that companies aggressively investing in AI are currently outpacing broader economic growth rates by a factor of five. This trend underscores a potential productivity or competitive advantage premium for early AI adopters.
Models Are Hitting Diminishing Returns Within Software Engineering
Discussion on whether AI models are reaching a plateau in their effectiveness for software engineering tasks.
OpenAI backs automation start-up Poetic in $50m round
Poetic is an automation start-up servicing companies including Sofi and Chime.
How AI is powering the next wave of micro-SaaS entrepreneurs
AI coding assistants and automation tools have collapsed the cost of building software, enabling solo founders and tiny teams to launch profitable micro-SaaS products in weeks instead of months.
Labor, Society & Culture
Opendoor's India exit is fueling a bigger conversation about AI and outsourcing
Opendoor's decision to exit the Indian market has sparked broader debates regarding the role of AI in global outsourcing.
Half of Americans fear AI could put someone in their household out of work, Reuters/Ipsos poll finds | Reuters
Artificial intelligence burst onto the national stage in 2022 when OpenAI, a leading AI company, launched ChatGPT, a consumer-facing product that could answer user questions much as a human might and offered a new way to search the ...
Why the Real A.I. Threat Is in the Back Office - The New York Times
As artificial intelligence spreads, millions of middle-class jobs in human resources, billing and payroll could be at risk. Most are held by women.
A 5-week course and a guaranteed job: Meta commits $115 million to solve the skilled-trades shortage stalling its AI build-out
The investment is part of the company’s larger plan to spend $600 billion on its U.S. data center build-out by 2028.
Where AI is really taking jobs - by Alex Banks - The Signal
Roughly 39% of the skills a given job requires are expected to shift over the same period, down from 44% two years earlier. The fastest-growing roles cluster in technology, including AI and machine learning specialists, big data and fintech roles, software developers and security specialists, while the steepest declines fall in clerical and administrative work.
AI, jobs, and the next generation - Microsoft On the Issues
Artificial intelligence is reshaping jobs and the future of work. Explore how the next generation can harness AI’s potential while preserving human creativity, dignity, and opportunity.
Beyond the Resume: How Skill Intelligence is Reshaping Hiring by 2030
The traditional resume is rapidly losing its crown to a new organizational currency: verified skills. As artificial intelligence fundamentally reshapes the recruitment landscape, companies are moving away from rigid job titles and academic pedigree, opting instead for agile, capability-led ...
AI layoffs are here. Major retraining effort is needed
We must move beyond viewing worker retraining as an optional expense and recognize it as a critical element in the infrastructure of the AI economy.
Token Budgets Emerge as a Critical Metric for Modern Job Performance
Token budgets are becoming a key operational constraint for roles increasingly reliant on AI-driven automation. Understanding these resource limits is essential for professionals to assess the feasibility and efficiency of their workflows.
From Awareness to Action: Understanding and Overcoming the Research-Practice Gap in Algorithmic Fairness for Public Health
arXiv:2606.11214v1 Announce Type: new Abstract: Algorithmic fairness is essential for responsible ML-driven public health research, yet its practical implementation remains limited. To investigate this awareness-action gap, we conducted a sequential mixed-methods study comprising expert interviews, an online survey, and systematic mapping. The expert interviews informed the design of the survey, which in turn revealed fragmented definitions of fairness, limited training and guidance, reliance on external sources, and rare use of formal assessment, mitigation, or monitoring. These findings were subsequently mapped onto three established research-practice gap lenses: the Knowledge-Practice Gap, the Knowledge-to-Action Cycle, and the Knowing-Doing Gap, each offering complementary perspectives. Building on this synthesis, we introduce the Fairness-to-Action framework, which integrates methodological, organizational, and systemic dimensions to identify where translation of algorithmic fairness knowledge stalls. Our analysis shows that fairness remains weakly institutionalized, translation mechanisms are externally driven, and system-level priorities continue to emphasize accuracy over fairness. These insights suggest critical leverage points for advancing safe, fair, and ethical ML-driven public health research practice.
Beyond Third-Person Audits: Situated Interaction Auditing for User-Centered LLM Bias Research
arXiv:2606.12247v1 Announce Type: new Abstract: Research on bias in large language models (LLMs) has predominantly focused on third-person audits, which study how models represent or evaluate demographic groups as external subjects. However, this paradigm overlooks a structural blind spot because the user is absent from the audit. In practice, LLMs are used in open-ended, personal interactions, during which the model implicitly represents the user and adjusts its responses accordingly. When identical requests yield different responses depending on who is asking, bias manifests not in how the model describes others but in how it treats its interlocutor. We propose Situated Interaction Auditing (SIA), a user-centered framework for studying how user profile signals -- implicit sociodemographic markers, writing style, and stated identity -- systematically shape LLM response quality, content, and tone. We demonstrate the framework through a case study that intersects gender and socioeconomic status signals across multiple task domains and outline a research agenda for SIA as a new mission for natural language processing.
Florida lawsuit alleges wrongful arrest after AI facial recognition error
Robert Dillon was arrested at home in Florida despite living 300 miles away from where a crime was committed. A Florida man is suing several law enforcement agencies for his arrest and prosecution for allegedly luring a child after he was wrongly identified using faulty AI facial recognition software. According to the Jacksonville Beach police department, an algorithm returned a 93% probability that Robert Dillon was the man caught on security cameras at a McDonald’s in the town attempting to persuade an unaccompanied girl, aged younger than 12, to leave with him.
Fully autonomous drones have killed human soldiers for first time
Reports indicate that fully autonomous drones have been used to kill human soldiers for the first time.
What AI and data centres can learn from the energy sector's scars - Digital Journal
Energy spent 60 years learning what happens when projects skip community engagement, and AI is now repeating the same mistakes
'Nearly half of Indian firms fear...': Marsh study flags emerging workplace risks - BusinessToday
The report, based on responses from more than 4,500 HR and risk professionals across 26 markets, including 311 participants from India, found that technology-related concerns account for four of the top 10 people risks facing organisations in the country
Chinese activist in UK told by X that abusive deepfakes do not breach rules
Apple Peiqing Ni targeted by account portraying her as promiscuous drug addict after posting about Tiananmen Square. A high-profile Chinese activist in the UK who was inundated with deepfake posts on X portraying her as a sexually promiscuous drug addict was told that the abuse did not breach the rules of Elon Musk’s platform. Apple Peiqing Ni, the 27-year-old founder of the UK-based China Dissent Network, had been advised by UK police to complain to the US-headquartered platform after she was targeted by what she believes is a pro-regime bot.
r/truespotify on Reddit: Spotify is now creating fake bios and pictures to hide Ai artist and further confuse users
It’s 100% their responsibility to vet new accounts, to enforce proper labeling and categorization of Ai music. $17 billion in revenue. They absolutely have the means for it. So if they are not enforcing the disclosure of Ai profiles they might as well be blamed for creating them.
Investigating Gender Bias in Touch Biometrics
arXiv:2606.11457v1 Announce Type: new Abstract: Behavioral biometrics offer a promising approach for continuous authentication, but their fairness across demographic groups remains largely unexplored. This paper investigates gender bias in swipe-based authentication using the BBMAS (117 users) and ANTAL (71 users) datasets and evaluates XGBoost and DenseNet classifiers through False Acceptance Rate (FAR) and False Rejection Rate (FRR). XGBoost achieved authentication accuracies of 92% and 94% on the BBMAS and ANTAL datasets, respectively, while statistical tests (Kolmogorov-Smirnov, Mann-Whitney, and Wasserstein permutation) found no significant gender differences in authentication error rates across almost all experimental settings. These findings suggest that swipe-based authentication can achieve high accuracy while maintaining comparable performance for male and female users, supporting its potential as a fair and reliable behavioral biometric modality.
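For readers unfamiliar with the two error rates named in the abstract above, here is a minimal sketch of how FAR and FRR are conventionally computed from match scores. The scores, the threshold, and the function name `far_frr` are illustrative assumptions, not data or code from the paper.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) for a score-based verifier.

    An attempt is accepted when its score >= threshold.
    FAR = accepted impostor attempts / all impostor attempts
    FRR = rejected genuine attempts / all genuine attempts
    """
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))


# Hypothetical scores, for illustration only.
genuine = [0.9, 0.8, 0.4, 0.7]
impostor = [0.1, 0.6, 0.3, 0.2]
far, frr = far_frr(genuine, impostor, threshold=0.5)
```

A fairness check of the kind the abstract describes would compute these rates separately per demographic group and then test whether the per-group distributions differ.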
Technology & Infrastructure
Automated Mediator for Human Negotiation: Pre-Mediation via a Structured LLM Pipeline
arXiv:2606.11379v1 Announce Type: new Abstract: Pre-mediation, the preparatory phase preceding direct human negotiation, plays a critical role in achieving mutually beneficial agreements, yet is often omitted due to cost, time, and limited access to trained mediators. We introduce an automated mediator for human negotiation, implemented as a structured pipeline of LLM modules, that supports pre-mediation in integrative negotiation settings. The pipeline decomposes preparation into specialized modules for dialogue, preference prediction, response-level critique, and structured summarization, separating inference, generation, and evaluation to address limitations of monolithic single-prompt approaches. We use the term "agent" for each module following common LLM-systems terminology, but the components are not autonomous and do not interact peer-to-peer; outputs are passed forward in a fixed sequence. We evaluate the system in two controlled human-subject experiments comparing AI-based pre-mediation with professional human mediators in a multi-issue negotiation scenario. On short-term self-reported measures, the automated mediator achieves preparation outcomes broadly comparable to human mediators, including trust in the mediator and confidence in reaching mutually beneficial agreements, while achieving substantially lower error on the preference-inference task under our scenario and prompts (36% lower RMSE). A second study shows that targeted prompt refinements reduce excessive affirmation patterns from 36.6% to 16.8%, matching human mediator baselines. Our findings suggest that structured LLM pipelines can provide scalable, low-effort pre-mediation support broadly comparable to human mediators on short-term self-reported preparation outcomes. The pipeline's single-party design mirrors how human mediators run pre-mediation today and enables parallel deployment across all parties to a dispute, supporting scalability.
Evaluation of Alternative-Based Information Systems for Deliberative Polling using an Agentic Simulator
arXiv:2606.11692v1 Announce Type: new Abstract: Deliberative polling promises to improve collective decision-making by exposing shareholders to a broad range of arguments before they vote. Yet ensuring that every voter encounters a representative sample of the reason space, the coverage problem, remains an open challenge, particularly at scale and in adversarial or strategically motivated electorates. This paper introduces a way of evaluating solutions using the LLM-based Agentic Bipolar Argumentation Simulator, grounded in a framework which formalises a poll as a six-tuple of endorsing and opposing justifications, attack and enhance relations, and shareholder- and relation-weights. ABAS simulates N autonomous shareholder agents, each assigned a latent opinion according to desired distributions in [-1, 1], who sequentially vote, choose or author justifications, and optionally submit argumentation-graph links. The simulator implements recommendations that rank existing justifications by their observable endorsement mass. It evaluates the mechanism's success by coverage, namely the fraction of the corpus reason-tag set represented in the K recommendations presented to each shareholder, as a solution to the NP-hard Subsuming Justification Problem. Reported experiments characterise how creativity rate (pown), recommendation size (K), argumentation density (plinks), and population size (N) affect coverage and corpus diversity. In an authenticated electorate where Sybil attacks are impossible and only the relation graph is gameable, we stress-test the scoring with coordinated strategic voting attacks: a tag-flood attack collapses coverage, while author-count relation weighting through a reversed-PageRank rule resists the flood markedly better than uniform weights.
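The coverage measure described above (the fraction of the corpus reason-tag set represented in the K recommendations) can be sketched as follows. This is a simplified reading of the abstract; the tag names and the function `coverage` are hypothetical.

```python
def coverage(corpus_tags, recommended):
    """Fraction of the corpus reason-tag set that appears in the recommendations.

    corpus_tags: set of every reason tag present in the corpus
    recommended: list of tag sets, one per recommended justification
    """
    if not corpus_tags:
        return 0.0
    shown = set().union(*recommended) if recommended else set()
    return len(shown & corpus_tags) / len(corpus_tags)


# Hypothetical tags, for illustration only.
corpus = {"cost", "safety", "fairness", "privacy"}
top_k = [{"cost"}, {"cost", "safety"}]  # K = 2 recommendations
cov = coverage(corpus, top_k)
```

Under this reading, a tag-flood attack lowers coverage by crowding the K slots with justifications that repeat the same few tags.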
An Ethical eValuation Agent (EeVA): Results of a Proof-of-Concept Test on a Prototype Agentic-like Workflow to Assist Ethical Deliberations
arXiv:2606.11218v1 Announce Type: new Abstract: Ethical deliberation is often misunderstood as a search for single right or wrong answers, creating difficulties for non-ethically trained personnel who must address ethically laden challenges. We developed EeVA, an agentic-like LLM-based workflow designed to support comparative ethical reflection rather than deliver definitive ethical answers. EeVA was programmed in n8n using three interconnected workflows: starter, worker, and emitter. It evaluated uploaded use cases against 10 ethical frameworks through evaluator and synthesis prompts. Proof-of-concept testing used three published cases from urban mobility, peer-to-peer energy trading, and social-service resource allocation. Across all cases, EeVA produced consistently structured framework-specific evaluations and integrated syntheses. Outputs differentiated between frameworks, identified convergences and divergences, recommended modifications to increase alignment, and highlighted persistent ethical tensions. Syntheses were readable for non-specialists and shifted attention away from simplistic answers toward design conditions, safeguards, and areas where full cross-framework agreement was unlikely. The findings suggest that LLMs can be organised into usable workflows that preserve ethical plurality while helping bridge the communicative gap between ethicists and non-ethically trained personnel. EeVA's value lies not in replacing ethicists or resolving moral disagreement, but in scaffolding structured ethical deliberation. EeVA offers a promising proof of concept for supporting ethical reflection where access to ethics expertise is limited. Further work is needed on reproducibility, human evaluation, user testing, and efficiency before it can be considered a mature tool.
MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning
arXiv:2606.11537v1 Announce Type: new Abstract: Financial and tabular question answering requires more than fluent reasoning: answers must be grounded in the exact facts, formulas, units, signs, and scales that support them. A single misread cell or incorrect operation can silently produce a plausible but wrong result. We introduce MoCA-Agent, a market-of-claims code agent that replaces free-form multi-agent debate with claim-level verification. The system decomposes each question into typed atomic claims, asks specialist trader agents to buy or sell those claims, clears their orders into confidence-weighted accept/reject decisions, and synthesizes an executable Python program from market-supported evidence. A code-aware verifier then checks the program for execution, structural consistency, and common financial reasoning errors, with at most one market-aware repair round. Across ten public benchmarks spanning financial numerical reasoning, general tabular reasoning, ESG question answering, and multimodal chart reasoning, MoCA-Agent achieves strong performance using a fixed Qwen3.6-27B backbone, including 78.3% on FinQA, 76.0% on FinanceMath, 71.2% on MultiHiertt, 86.9% on ESGenius, and 85.6% average on FinChart-Bench. These results show that aggregating evidence at the level of atomic claims, rather than whole answers, improves robustness in high-stakes numerical reasoning. The code and data are available at https://github.com/UBC-NLP/MoCA-Agent.
TreeSeeker: Tree-Structured Trial, Error, and Return in Deep Search
arXiv:2606.11662v1 Announce Type: new Abstract: Deep search requires agents to answer complex questions through multi-step web search, browsing, evidence comparison, and synthesis. A central challenge is deciding how to search when several directions look plausible but only some will later lead to reliable evidence. If an agent greedily follows the current best-looking direction, it may keep extending a weak continuation. If it explores without discipline, it may waste budget on disconnected trials. We propose TreeSeeker, an inference-time framework for controlled trial-and-error in deep search. TreeSeeker organizes search as branch-and-return search over tree-structured states, where each branch is a tentative direction for a sub-goal. At each round, TreeSearch reads all sub-goal trees, identifies active goals, and uses textual UCB signals of value, uncertainty, and risk to select among exploiting a promising branch, exploring an uncertain alternative, or pruning an unproductive continuation and returning to an earlier branch point. TreeMem supports this control loop by keeping evidence, uncertainty, conflicts, progress, and failure cues attached to the branches that produced them, so trial outcomes can guide later decisions. Experiments on XBench-DeepSearch, BrowseComp, and BrowseComp-ZH show that TreeSeeker consistently outperforms strong open-source baselines, suggesting that explicit branch-and-return control complements stronger reasoning and tool execution.
Knowing When to Ask: Self-Gated Clarification for Hierarchical Language Agents
arXiv:2606.11349v1 Announce Type: new Abstract: In hierarchical reasoning, failures often originate at intermediate decision points where the agent commits to a wrong branch without recognizing that it lacks critical information. Rather than treating clarification as an external uncertainty trigger, we propose ACTION-RATING, a formulation that places it inside the agent's action space on a shared ordinal scale with navigation, so that asking competes directly with acting at every decision point and help-seeking becomes observable at intermediate states. Two structurally distinct information-seeking modes emerge from the agent's own ratings: mandatory (no viable branch) and opportunistic (residual uncertainty despite a leading candidate). On Harmonized Tariff Schedule classification (30,000-node taxonomy, three benchmarks, 9 LLMs across 4 families), we observe a regime shift from mandatory to opportunistic clarification, with Information-Seeking Effectiveness (ISE), a local diagnostic defined as the fraction of help interactions followed by a correct next navigation step (not a final-task metric), rising from 50% to 74%. Three diagnostic contrasts fail to reproduce this structure. A separability test shows that the information-seeking pattern (mode split, ISE ranking) persists when answer quality is degraded (-18.8% accuracy), supporting an empirical separation between where an agent seeks help and the quality of the help it receives. Under the controlled answer channel, accuracy gains reach +16.2% at 10-digit; we read this as an upper bound on what better localization could unlock, not a deployment estimate.
HERO: Hindsight-Enhanced Reflection from Environment Observations for Agentic Self-Distillation
arXiv:2606.11559v1 Announce Type: new Abstract: Reinforcement learning typically improves multi-turn agent capabilities through the terminal outcome of the trajectories, which makes it difficult to determine credit assignment for each intermediate turn. Recent on-policy self-distillation methods offer a promising alternative by converting privileged feedback into dense token-level supervision through a self-teacher. Our study is motivated by the unexpected performance degradation observed when naively extending this paradigm to multi-turn settings, which we attribute to a lack of alignment between privileged feedback, such as successful trajectories or terminal outcomes, and the student's current decision context. We introduce HERO, a hindsight-enhanced self-distillation framework that uses next environment observations as locally aligned feedback. After each rollout, HERO reflects on the completed interaction to convert each observation into a compact turn-level diagnosis that captures actionable feedback about the original action, such as its necessity, validity, or failure cause. On TauBench and WebShop, HERO improves task success and reduces unnecessary turns over environment-feedback-only self-distillation and GRPO. It is especially effective under limited training turn budgets, where successful rollouts are rare and GRPO provides weak reward-contrast signals.
Preregistration for Experiments with AI Agents
arXiv:2606.11217v1 Announce Type: new Abstract: The proliferation of large language models (LLMs) and autonomous AI agents has given rise to a rapidly growing methodological paradigm: "in silico" behavioral experiments. Originally conceived as a way to use AI agents as proxies for human participants in studies of cognition, decision-making, and social dynamics, this approach has taken on new significance -- as AI agents increasingly negotiate, transact, and make consequential decisions on behalf of people and organizations, understanding their behavior has become a research priority in its own right. While these experiments with AI agents offer unprecedented advantages in terms of scalability, cost efficiency, and experimental control, they also inherit, and in some cases amplify, methodological vulnerabilities that have long plagued human subjects research. To address these issues, this paper argues that preregistration practices -- central to improving the credibility of human subjects experiments -- should now be extended to experiments with AI agents. We systematically catalog the researcher degrees of freedom that experiments with AI agents introduce -- model selection, prompt wording, settings, and outcome-contingent redesign, for example -- and show how the low cost of iteration and lack of reporting norms make these choices both easy to exploit and difficult to detect. We propose a preregistration template tailored to experiments with AI agents and call on conferences, journals, and funding agencies to make preregistration standard practice for this emerging research paradigm.
Should LLM Agents Decide in Social Simulations? Comparing Finite-State and LLM-Based Decision Policies
arXiv:2606.12369v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used as decision-making components in social simulations. This introduces a methodological risk: the simulation may deviate from the explicit behavioral policy defined by the researcher. In online social network (OSN) simulations, action choices shape system dynamics, interaction patterns, and model interpretability. This paper evaluates whether LLM action selectors preserve an interpretable reference policy in an OSN simulation. The reference is a finite state machine implemented as a first-order Markov model, with transition probabilities depending on the user type. The evaluation uses a synthetic network with 1,000 agents and 10,000 action decisions. Three open-weight LLMs are tested: LLaMA 3.1, GPT-OSS, and Mistral 24B. Each model is evaluated under three prompting strategies: base, guided, and probabilistic. Alignment is measured using Jensen-Shannon Divergence with Laplace smoothing, and execution time is reported. Results show that LLMs can approximate the reference policy in some configurations, but do not preserve it reliably. Alignment varies across models and prompts, and additional guidance can introduce systematic action biases. Even the best-aligned LLM configurations are several hundred times slower than direct Markov chain sampling. These findings indicate that LLM-based action selection is not a direct replacement for explicit decision policies: it can alter the intended behavior while increasing computational cost.
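The alignment measure named above, Jensen-Shannon Divergence with Laplace smoothing, can be sketched in a few lines. The action names, the sample sequences, and the smoothing constant `alpha=1.0` are illustrative assumptions; the paper's exact setup (for example, comparing per-state transition distributions) may differ.

```python
import math
from collections import Counter


def smoothed_distribution(actions, action_space, alpha=1.0):
    """Empirical action distribution with add-alpha (Laplace) smoothing."""
    counts = Counter(actions)
    total = len(actions) + alpha * len(action_space)
    return [(counts[a] + alpha) / total for a in action_space]


def jensen_shannon_divergence(p, q):
    """JSD in bits (log base 2), which is bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical action logs, for illustration only.
ACTIONS = ["post", "like", "share", "idle"]
reference = ["post", "like", "like", "idle", "idle", "idle"]
llm_choice = ["post", "post", "post", "like", "share", "share"]

p = smoothed_distribution(reference, ACTIONS)
q = smoothed_distribution(llm_choice, ACTIONS)
jsd = jensen_shannon_divergence(p, q)
```

Smoothing keeps every probability strictly positive, so an action the LLM never chose still gets a well-defined contribution to the divergence. A value near 0 means the LLM's choices track the reference policy closely.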
Emergent Linguistic Drift in Long-Running Autonomous Agent Tasks
Autonomous agents executing long-running tasks may develop project-specific linguistic patterns, potentially complicating human-AI interaction. This phenomenon suggests a need for standardized reporting protocols in agentic workflows to maintain operational clarity.
China's AI spending lags behind the US by a staggering margin, says 'Chip War' author Chris Miller
Chip War author Chris Miller says China has underspent on AI infrastructure for four years, with the US and Taiwan producing 30x more AI accelerators.
Supermicro co-founder’s US trial delayed after company receives subpoena
The trial of Supermicro co-founder Yi-Shyan Liaw has been pushed to March 2027 after the company received a grand jury subpoena that may contain evidence material to the defense.
China drafts $295 billion plan to build national AI data center grid running on 80% homemade silicon — projected 2028 timeline could run into limits of local chip production | Tom's Hardware
Beijing's spending ambitions run into the reality of limited chip output.
INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration
arXiv:2606.11440v1 Announce Type: new Abstract: Existing multi-agent LLM orchestration methods, ranging from brute-force ensembles to learned routers, select models and topologies based on task and model features. However, these methods do not consider the runtime state of the serving infrastructure. On shared GPU clusters under concurrent load, this infrastructure blindness causes systematic resource underutilization: preferred models accumulate deep request queues while equally capable alternatives sit idle. In multi-agent pipelines, where each query triggers multiple sequential model calls, these delays then compound across every downstream step. Closing this gap is challenging because the relevant infrastructure signals (queue depths, KV-cache pressure, latencies) are dynamic and noisy, and they must drive three different decisions: planning, per-step routing, and scheduling. We introduce INFRAMIND, a framework that makes the entire multi-agent stack infrastructure-aware. An infra-aware planner conditions topology and role selection on real-time system load and remaining budget, biasing toward simpler graphs under congestion and richer ones at low load. An infra-aware executor then observes per-model queue depths, cache utilization, and response latencies at each agent step to decide which model to call and how deeply to reason; a budget-aware scheduler further reorders each model's queue so that urgent requests are served first. Cast as a hierarchical constrained MDP and solved end-to-end via reinforcement learning, the system learns to balance quality against latency automatically. Across five benchmarks, INFRAMIND delivers up to +7.6 pp accuracy over the prior baseline at low load with up to 7x lower latency, and sustains up to 99.9% SLO compliance under high load where every baseline drops below 50%.
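To make the "infrastructure blindness" problem concrete, here is a deliberately simple routing heuristic that takes queue depth into account. It is not the paper's method, which learns its policy via reinforcement learning; the `ModelState` fields, the quality floor, and the wait estimate are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelState:
    name: str
    quality: float        # assumed offline quality score in [0, 1]
    queue_depth: int      # requests currently waiting
    avg_latency_s: float  # recent mean latency per request, in seconds


def pick_model(models, min_quality):
    """Choose the capable model with the lowest estimated wait."""
    capable = [m for m in models if m.quality >= min_quality]
    if not capable:
        raise ValueError("no model meets the quality floor")
    # Crude wait estimate: requests ahead of us plus our own, times mean latency.
    return min(capable, key=lambda m: (m.queue_depth + 1) * m.avg_latency_s)


# Hypothetical fleet, for illustration only.
fleet = [
    ModelState("preferred-large", 0.90, queue_depth=12, avg_latency_s=2.0),
    ModelState("alt-large", 0.88, queue_depth=0, avg_latency_s=2.5),
    ModelState("small", 0.60, queue_depth=0, avg_latency_s=0.5),
]
choice = pick_model(fleet, min_quality=0.85)
```

A router that looked only at quality would send this request to the congested preferred model. Reading the queue state sends it to the idle alternative instead.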
Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning
arXiv:2606.11634v1 Announce Type: new Abstract: The rapid progress of reasoning and agentic large language models (LLMs) has increased the demand for long-context inference, but self-attention (SA) scales quadratically with context length. To address this, we study SWARR (Sliding-Window Attention with Reinforced Adaptation for Math Reasoning), a practical recipe for adapting SWA models to mathematical reasoning. SWARR has two stages: (1) efficient conversion from a pretrained SA model to SWA with supervised fine-tuning (SFT), which avoids pretraining a new base model, and (2) policy adaptation with reinforcement learning (RL). We find that SWA still underperforms SA after SFT, and we hypothesize that this gap is caused in part by a data-architecture mismatch: most SFT data are prepared for SA models and may contain long-range dependencies that are difficult for SWA to model. Because on-policy RL optimizes self-generated trajectories under the SWA constraint, it can adapt trajectories to better match SWA. Experiments on mathematical reasoning benchmarks show that this recipe substantially narrows the gap between SWA and SA, recovering much of the accuracy lost during SWA conversion while preserving the efficiency benefits of linear-complexity attention. Our central contribution is the empirical finding that RL changes the conclusion one would draw from conversion and SFT alone about SWA's viability for math reasoning.
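The efficiency argument rests on how a sliding-window mask restricts attention. The sketch below builds a causal sliding-window mask in plain Python; the function name and the sizes are illustrative, and real implementations operate on tensors rather than nested lists.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Attention is causal (j <= i) and limited to the most recent `window`
    positions, so each row has at most `window` True entries.
    """
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


mask = sliding_window_mask(seq_len=5, window=2)
attended = sum(sum(row) for row in mask)          # pairs scored under SWA
full_causal = sum(i + 1 for i in range(5))        # pairs scored under full causal SA
```

With a fixed window, the number of attended pairs grows linearly in sequence length, whereas full causal attention grows quadratically.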
Startup’s nuclear-inspired cooling system could make data centers more sustainable
A new cooling system inspired by nuclear technology could significantly improve the sustainability of data centers.
Cambridge University launches £36m Zenith supercomputer
Zenith, a new AI supercomputer for science, has been launched at Cambridge University. Hosted at the university’s Ray Dolby Centre, the machine has been built by Dell and AMD. Its precise specs have not been revealed, but when funding for Zenith was announced in January, the university said it would provide a sixfold boost to […]
Farmer donates land for a park, city sells it for data center development
A land donation intended for a public park was instead sold by the city for data center development, turning a $10 gift into $10 million.
Lintes Accelerates AI Connectivity Expansion Through Optical Communication and Scalable Manufacturing | The AI Journal
TAIPEI, June 10, 2026 /PRNewswire/ -- Driven by surging AI demand, Lintes Technology (TWSE: 6715), a leading provider of high-speed interconnect solutions,
Prefill Once, Fan Out: KV Snapshot Sharing for Multi-Agent LLM Pipelines
Learn how to build a C++ runtime with copy-on-fork KV snapshots to eliminate redundant LLM prefills in multi-agent pipelines.
Kimi.ai open-sources new coding model with 30% fewer reasoning tokens
Moonshot AI released Kimi-K2.7-Code, an open-source model that reduces reasoning tokens by 30% to improve speed and efficiency in coding tasks.
Claude Fable 5 and Claude Mythos 5
Anthropic has released Claude Fable 5, its first Mythos-class model, offering significant advancements in coding, cybersecurity, and scientific research.
Can AI Agents Synthesize Scientific Conclusions?
arXiv:2606.11337v1 Announce Type: new Abstract: Scientific AI agents increasingly retrieve evidence, reason across sources, and synthesize conclusions used in consequential decisions. Yet, their ability to do so in high-stakes domains such as health remains unclear. We introduce SciConBench, a large-scale live benchmark of 9.11K questions and expert-written conclusions from systematic reviews to evaluate open-domain scientific conclusion synthesis. The benchmark draws on an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall. To mitigate data leakage, we further introduce SciConHarness, a clean-room evaluation harness that equips agents with controlled web interaction to ensure valid measurement. Evaluating 8 frontier models and deep research agents, we find that factual quality remains low: under clean-room settings, the best agent achieves only a factual F1 of 0.337. Our clean-room setting consistently reduces performance relative to unconstrained evaluation, suggesting that leakage inflates estimates of models' true synthesis capabilities. Finally, we audit consumer-facing agents (e.g., Google AI Overview, OpenEvidence) and find they frequently generate incomplete and sometimes contradictory conclusions, even when the ground-truth answer is available. Overall, our results show that reliable synthesis of scientific conclusions remains an open challenge, and that clean-room evaluation is essential for assessing open-domain AI agents.
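The factual precision, recall, and F1 mentioned above can be illustrated with a small sketch over sets of atomic facts. Exact string matching is an assumption made here for simplicity; the benchmark uses an expert-validated automated pipeline to judge facts, and the fact labels below are placeholders.

```python
def fact_scores(predicted_facts, reference_facts):
    """Return (precision, recall, F1) over sets of atomic facts."""
    predicted, reference = set(predicted_facts), set(reference_facts)
    matched = predicted & reference
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Placeholder fact identifiers, for illustration only.
agent_conclusion = {"fact_a", "fact_b", "fact_c", "fact_d"}
expert_conclusion = {"fact_a", "fact_b", "fact_e"}
precision, recall, f1 = fact_scores(agent_conclusion, expert_conclusion)
```

Precision penalises unsupported claims in the agent's conclusion, while recall penalises omissions relative to the expert-written one.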
Search Discipline for Long-Horizon Research Agents
arXiv:2606.11522v1 Announce Type: new Abstract: Autoresearch agents now propose, evaluate, and select scientific candidates against a metric, and that metric is usually an aggregate reduced over a heterogeneous space of regions, slices, or cohorts. We show that when scientific validity lives in that disaggregated structure, the aggregate can rank the wrong candidate first. The headline number improves while the structure underneath inverts, so a decision made on the number accepts a candidate that quietly breaks the model. The failure is not domain-specific. It appears wherever a candidate's validity is multi-dimensional but its verifier is a single reduction. We demonstrate the inversion on a fire-model task in the Ecosystem Demography model. The highest-scoring candidate and a slightly lower one are within noise of each other on global score, yet the top-scoring one collapses the protected boreal regions while the other preserves them. What separates them is the per-region behavior, not the headline number. This decision should not be left to the agent that produced the candidates. The agent optimizing the score is the last party likely to catch the score being wrong, and a prompt has no remaining turn once the agent has stopped. We move the decision to an external control loop that audits each candidate on its disaggregated behavior and acts after the agent has decided. It can demote a candidate the agent would have accepted, and it can reopen a run the agent had declared finished. Our contribution is the inversion finding itself, and a search-discipline protocol that decides on reviewable candidate-effect evidence instead of the score.
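The inversion described above is easy to reproduce with toy numbers. In the sketch below, an external audit rejects any candidate whose protected regions fall below a floor before ranking on the aggregate. The scores, the region names other than "boreal", the floor value, and the function `audit` are hypothetical and are not taken from the paper.

```python
def audit(candidates, protected, floor):
    """Compare aggregate-only selection with selection after a per-region audit.

    Returns (winner_by_aggregate, winner_after_audit). The audited winner is
    the best aggregate among candidates whose protected regions all score at
    or above `floor`, or None if no candidate passes.
    """
    def aggregate(scores):
        return sum(scores.values()) / len(scores)

    by_score = max(candidates, key=lambda n: aggregate(candidates[n]))
    passing = [n for n, s in candidates.items()
               if all(s[r] >= floor for r in protected)]
    audited = (max(passing, key=lambda n: aggregate(candidates[n]))
               if passing else None)
    return by_score, audited


# Hypothetical per-region scores, for illustration only.
candidates = {
    "A": {"tropical": 0.95, "temperate": 0.95, "boreal": 0.20},
    "B": {"tropical": 0.70, "temperate": 0.70, "boreal": 0.68},
}
by_score, audited = audit(candidates, protected=["boreal"], floor=0.5)
```

Candidate A wins narrowly on the mean while collapsing the protected region, so the audit selects B instead.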
Anthropic accused of ‘secret sabotage’ as Claude Fable 5 silently limits capabilities for AI researchers and developers
A paragraph buried in Fable 5’s 319-page system card revealed the model would silently downgrade its responses for certain AI development work—without telling users.
Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark | VentureBeat
June 10, 2026, 4:16 pm PT. Researchers from the University of California, Berkeley's Center for Responsible, Decentralized Intelligence (RDI), alongside an advisory committee of over 300 domain experts, have launched Agents’ Last Exam (ALE), a grueling new benchmark built to measure whether artificial intelligence can actually execute economically valuable, long-horizon professional workflows. In a shocking upset, OpenAI’s GPT-5.5 from April, operating through the Codex harness, secured the absolute top spot on the new ALE Leaderboard with a 24.0% pass rate, beating Anthropic's highly anticipated, brand new Mythos-class Claude Fable 5 model released just yesterday, which came in third with a score of 22.0%. Rather than testing models
TouchThinker: Scaling Tactile Commonsense Reasoning to the Open World with Large-scale Data and Action-aware Representation
arXiv:2606.11637v1 Announce Type: new Abstract: Touch is a key modality for embodied agents to understand the physical world. Although recent work has incorporated tactile signals into language systems for tactile commonsense reasoning, scaling such systems to realistic open-world settings remains challenging due to two key bottlenecks: (1) current tactile reasoning datasets remain limited in format and scale, providing insufficient supervision for reasoning from tactile observations to physical commonsense and hindering the learning of transferable tactile commonsense; (2) tactile signals are inherently redundant and action-specific, yet existing methods often overlook these properties, resulting in inefficient representations with limited semantic expressiveness. To address these limitations, we propose TouchThinker, a tactile-language framework that scales tactile commonsense reasoning to the open world from both data and representation perspectives. First, we construct TouchThinker-1M, a million-scale, multi-source tactile reasoning dataset covering 415 objects, 8 scenarios, and 7 sensor types, providing a solid data foundation for open-world generalization. We further introduce TouchThinker-Bench, an open-world benchmark with more realistic and diverse tasks. Then, we propose an action-aware modeling mechanism to improve tactile representation efficiency and enable efficient reasoning. Experimental results demonstrate that TouchThinker achieves competitive performance against state-of-the-art models across multiple datasets. Our code and dataset will be made available at: https://github.com/lvkailin0118/TouchThinker.
Forecasting Future Behavior as a Learning Task
arXiv:2606.11445v1 Announce Type: new Abstract: Trust in an AI system is often anchored by explanations of how it works, which one then uses to forecast its behavior on new inputs. For large reasoning models (LRMs), this conventional route is particularly difficult to follow: explanation methods for single token generations do not naturally generalize to long trajectories, and the trajectories themselves are often not faithful when read as natural language. We propose an alternative that bypasses the explanation step: treat behavior forecasting as a learnable task and train Behavior Forecasters that operate on a single reasoning trajectory to make the same forecasts one would typically seek from an explanation. The forecaster's training data is obtained by querying the LRM with no human annotation, and its inference is done in a single forward pass. We instantiate this approach on two tasks: how likely the LRM is to repeat its answer on re-runs, and how removing parts of the input changes its answer. We evaluate this approach on both tasks across three diverse reasoning datasets and find that trained Behavior Forecasters are more accurate than GPT-5.4 and Claude Opus-4.6 reading the same trajectories as naive readers, at a small fraction of their inference cost. We find that fine-tuning the backbone end-to-end and initializing it from the target LRM are each necessary for strong performance. These results show that the reasoning trajectory carries information about the LRM's future behavior that goes beyond what naive reading conveys.
Position: Hippocampal Explicit Memory Is the Cornerstone for AGI
arXiv:2606.11245v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, raising expectations for Artificial General Intelligence (AGI). This position paper argues that integrating explicit memory is the cornerstone for advancing LLMs toward AGI. The key reason is that the underlying learning mechanism of LLMs is highly analogous to human implicit memory. However, higher-order cognitive functions necessary for AGI, such as long-term strategic planning, metacognition, and symbolic reasoning, heavily rely on hippocampal explicit memory and cannot arise solely from implicit statistical learning. Drawing on findings from neuroscience, I advance this perspective and complement it with computational requirements for artificial explicit memory systems, hoping to foster further research and lay the groundwork for explicit memory integration.
Are LLMs Bad at Moral Reasoning?
arXiv:2606.11635v1 Announce Type: new Abstract: For highly capable AI systems to operate safely in dynamic, open-ended environments, they must be able to identify, understand, and respond to moral reasons for action, and constrain their behaviour accordingly. A growing body of research aims to evaluate this capacity -- moral competence -- in today's most capable AI systems, recently reaching broadly pessimistic conclusions. One of the most ambitious such papers collects gold-standard human-authored rubrics for evaluating moral reasoning in 1,000 cases, and benchmarks frontier AI models against those rubrics, with underwhelming results. In this paper, we argue that the MoReBench dataset can be redeployed to give a much more optimistic picture of LLMs' moral reasoning (an essential part of moral competence). We show that if, rather than scoring LLMs' responses to these cases against these rubrics, we give the LLMs the same task given to humans -- to generate scoring rubrics for the moral analysis of particular cases -- the rubrics they generate are better calibrated to the human rubrics than their open-ended responses are. Where the two sets of rubrics differ, the differences plausibly reflect nothing more than the vast dimensionality of most moral problems, and they also highlight some human departures from the "rubric for creating rubrics". Taking these points into consideration, the MoReBench dataset suggests that LLMs are significantly more capable at moral reasoning than was previously believed.
Introducing the Third Generation of Apple’s Foundation Models
Apple has released its third-generation foundation models, focusing on enhancing on-device and private AI capabilities across its ecosystem.
FBI warns AI-powered scams fuel $20 billion in U.S. cybercrime losses in 2025
Artificial intelligence is making scams harder to spot and easier to scale, and the FBI says the financial toll is climbing fast. New numbers show Americans lost
Devs know AI code is riddled with holes, but ship it anyway
Pressure to deploy wins out over security as four in five orgs confess to breaches from vulnerable apps.
Annual Threat Dynamics 2026: Cyber threats in motion
In an identity-driven, AI-accelerated threat landscape, resilient organisations govern identity, validate trust, and treat cyber risk as a strategy.
CrowdStrike sees AI innovation driving new wave of cyber espionage
Technology companies have become the world's most targeted industry as nation-state and cybercriminal groups intensify efforts to steal artificial intelligence capabilities, intellectual property and access to software ecosystems, according to CrowdStrike's 2026 Technology Threat Landscape Report...
How AI turned cybersecurity into a race against time - Atos
In this article, our Atos experts dive into how frontier AI is pushing cybersecurity into an AI speed era where finding and weaponizing vulnerabilities can shrink from months to hours. It also guides organizations to leverage AI augmented security to match pace.
AI Cybersecurity Risks Extend Beyond Model Safety, Warns Ledger CTO - Digital Reviews Network
The release of Anthropic's Fable 5 model has reignited AI safety debates, but experts warn the real challenge lies in preparing critical infrastructure for AI-powered cyber threats.
2026: The year global operational technology becomes cybersecurity’s frontline | The AI Journal
The cyber threat landscape is likely to shift significantly this year. We’re seeing adversaries increase their focus on OT environments – the
Global Cyber Attacks May 2026 Hide A Bigger Threat
Overall global cyber attack volumes eased in May 2026, but ransomware hit its fastest growth rate of the year as GenAI data risks continued to rise.
Cybersecurity Statistics 2026: Breaches, Costs & Ransomware Data
Vulnerability exploitation now causes 31% of breaches. Full 2026 cybersecurity statistics on breach costs, ransomware, AI threats & workforce. Free CSV.
Adoption, Deployment & Impact
SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior
arXiv:2606.11543v1 Announce Type: new Abstract: Agent Skills augment large language model (LLM) agents with procedural knowledge at inference time, but current benchmarks rarely distinguish what a Skill says from how it is organized. We study this distinction through Progressive Disclosure, where a concise root file points agents to supporting resources on demand, and compare it with a normalized flat baseline. We present SkillJuror, a framework for evaluating Skill writing paradigms through semantically controlled variants, matched multi-trial evaluations, and trajectory evidence while holding task knowledge fixed. In an 82-task SkillsBench study, Progressive Disclosure changes runtime behavior before aggregate outcomes: distinct Skill resources touched per trajectory rise from 1.18 to 3.85, and effective uptake events rise from 1.33 to 3.92. It also yields 17 additional verifier-passing trials out of 410 matched trials (+4.1%) over the normalized flat baseline. The benefit is task-dependent. Progressive Disclosure helps when supporting resources guide implementation, checking, or repair, but is weaker when success hinges on exact output conventions, numerical thresholds, or long artifact-generation pipelines. These results show that Skill organization is not mere presentation: it can change how agents search and apply procedural knowledge, while outcome gains depend on whether the exposed resources are actionable for the task. Code is available at https://github.com/zhiyuchen-ai/skill-juror.
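The figures quoted in this abstract can be checked with a few lines of arithmetic. This is a reader's sketch, not code from the SkillJuror repository; the function names are hypothetical, and reading "+4.1%" as a gain in pass rate over the 410 matched trials is an interpretation of the abstract's wording.

```python
def pass_rate_gain(extra_passes, matched_trials):
    """Gain in pass rate, in percentage points, from extra passing trials."""
    return 100 * extra_passes / matched_trials


def ratio(after, before):
    """How many times larger the 'after' value is than the 'before' value."""
    return after / before


# Numbers as reported in the abstract.
print(f"pass-rate gain: {pass_rate_gain(17, 410):.1f} points")
print(f"resources touched per trajectory: x{ratio(3.85, 1.18):.2f}")
print(f"effective uptake events: x{ratio(3.92, 1.33):.2f}")
print(f"trials per task: {410 / 82:.0f}")
```

The last line shows that 410 matched trials over 82 tasks is consistent with five trials per task, although the abstract does not state the trial count per task directly. The behavioural measures roughly triple while the outcome gain is about four points, which fits the abstract's claim that runtime behaviour shifts before aggregate outcomes do.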
‘Oh God, no! Not another thing:’ What Anthropic’s Mythos-class Fable 5 means for CEOs trying to govern AI
Also: All the news and watercooler chat from Fortune.
New Appian Survey Finds Public Sector AI Adoption Moving Into Government Operations
/PRNewswire/ -- Appian (Nasdaq: APPN) today announced findings from a new survey of 2,000 US public sector workers, revealing that government agencies are...
Tech For Good - Public sector AI success depends on resilience, observability and robust operational foundations
Learn about the critical role of public sector AI resilience in fostering innovation and delivering better outcomes for communities.
Anthropic's Subscription Policy Changes May Hinder Model Experimentation and Adoption
Anthropic's potential removal of subscription access to the Fable model creates friction for users attempting to evaluate its utility. Such restrictive access models may discourage organizational investment in learning and integrating new AI capabilities.
Lung-R1: A Knowledge Graph-Guided LLM for Pulmonary Diagnostic Reasoning
arXiv:2606.11675v1 Announce Type: new Abstract: Diagnosing pulmonary diseases requires integrating heterogeneous evidence amid phenotypic variability and cross-disease overlap. Although large language models (LLMs) have shown progress on pulmonary knowledge question answering (QA) and information-processing tasks, reliable pulmonary diagnosis requires patient-specific, relation-aware reasoning over electronic medical record (EMR) evidence rather than isolated knowledge recall. We define this gap between pulmonary knowledge and case-level diagnostic reasoning as the Pulmonary Knowledge-to-Diagnosis Gap. To address it, we introduce LungKG, the first structured pulmonary knowledge graph for diagnostic knowledge organization and record-grounded reasoning. LungKG contains 59,038 nodes and 164,308 edges across 15 entity types and 112 relation types, serving as both a reusable pulmonary knowledge resource and the foundation for LungKG-guided model adaptation. Built on LungKG, we propose Lung-R1, a LungKG-guided pulmonary LLM trained through KG-constrained reasoning-chain construction and KG-guided reinforcement learning. In a 20-system evaluation, Lung-R1-14B achieves state-of-the-art performance across Choice, Pulmonary-QA, and EMR Diagnosis, reaching an EMR Diagnosis score of 4.3583 and surpassing the strongest non-Lung-R1 baseline by 0.1476 points. These results demonstrate the value of LungKG-guided training for EMR-based pulmonary diagnosis.
From Explicit Elements to Implicit Intent: A Predefined Library for Auditable Behavioral Inference
arXiv:2606.11207v1 Announce Type: new Abstract: We present SemantiClean, a modular framework for extracting structured semantic signals from e-commerce session data and driving pluggable inference targets including purchase intent, customer segmentation, and product affinity through a shared element library. Unlike conventional end-to-end predictors that optimise solely for accuracy, SemantiClean prioritises auditability, structural governance, and sigma=0 reproducibility, explicitly trading marginal predictive gains for element-level transparency and defensible decision trails. Built upon the Online Shoppers Purchasing Intention (OSPI) dataset, the framework organises twenty-four behavioural elements into a four-layer architecture (Functional, Interaction, Systemic, Contextual) and enforces signal quality through three anti-inflation mechanisms: RedundancyGroup contribution caps, TieredPenaltyCalculator bias penalties, and AdaptiveConstraintMode cold-start protection. This report introduces the LLM-Integrated Semantic Inference Engine, a fully implemented two-phase LLM-driven inference architecture that leverages complete element metadata at inference time. All quantitative results reported herein are produced by this engine. Deterministic engine outputs remain fully reproducible (sigma=0); LLM-dependent results (E8, E10) are subject to controlled output variability under fixed provider/model/temperature settings. The gender inference target remains non-functional in the current implementation and is excluded from all quantitative results.
TCS teams up with Anthropic on Claude for enterprises
The tie-up aims to help regulated firms move generative AI from pilots into production, while training 50,000 TCS staff on Claude.
A.I. Chatbot Helps a $100 Thrift Store Painting Sell for Over $250,000
When a son got curious about the origins of a painting his mother bought at a secondhand shop decades ago, Google Gemini had some intriguing thoughts.
AI Workforce Management: Benefits, Software & 2026 Trends
Discover how AI workforce management improves scheduling, reduces labor costs, boosts productivity, and transforms workforce planning in 2026.
Geopolitics, Policy & Governance
Chinese agents caught rebuilding botnets and stirring the pot on AI datacenter debate
PRC eyes are watching you
The AI Fracture: How Four Jurisdictions Are Splitting the Map of AI by Geography - FourWeekMBA
Structural Analysis: In the last 48 hours, every major AI decision was a geopolitical decision disguised as a business one. The Map of AI is fracturing, not by technology, but by jurisdiction. The Week That Revealed the Fracture. Look at what happened: Taiwan criminalizes chip exports to ...
China Responds to the Chip Embargo with $295 Billion — And 80% Must Be Domestic - FourWeekMBA
Geopolitical Analysis — China is preparing to spend $295 billion (2 trillion yuan) over five years on AI data centers — with at least 80% of the technology sourced domestically. Nvidia and AMD are explicitly locked out. This is the direct response to Taiwan’s chip export controls and ...
OpenAI findings boost GOP claims of foreign influence - POLITICO
OpenAI published new research this afternoon claiming China was likely behind two online influence operations intended to sway U.S. perceptions of artificial intelligence technology and reshape the debate in Washington around the infrastructure needed to support it.
OpenAI, Anthropic Back State AI Bills in Absence of Federal Law
Major AI labs are turning to state bills for AI policy as they wait on Congress to pass a national standard. OpenAI and Anthropic are leading the way.
German Court Holds Google Liable for AI Misinformation
A German court has ruled Google liable for false statements in its AI Overviews, potentially setting a precedent for AI-generated content accountability.
Britain Is Weighing a Social Media Ban for Children. How Did It Get Here?
Months after Australia banned social media for everyone under 16, the British government is considering new policies to keep children safe online.
If AI transparency rules weaken, enterprise tech teams will inherit the risk | TechRadar
AI transparency rules weakening shifts accountability to enterprises
Guide for labeling AI content set out in EU code of practice
The EU has published its final code of practice for labeling AI-generated content, helping providers meet transparency requirements under the AI Act starting August 2.
Establishing AI Procurement Guardrails for Student Safety
Opaque and insufficiently tested tools are increasingly shaping student outcomes without consistent transparency, civil rights review, or technical safeguards. States and the U.S. Department of Education can address these risks using procurement and oversight tools already within their authority.
X petition could drive changes to 20-year US FTC consent orders
X has asked the FTC to terminate its 20-year privacy consent order early, arguing that 15 years of oversight is sufficient. A success could encourage other tech firms to seek similar relief.
Signal says UK plan to scan devices for nude images 'endangers us all'
Encrypted messaging app warns device-level checks could be repurposed for censorship.
Snapchat faces Dutch class action over online safety, data concerns
The Foundation for Market Information Research has filed a mass lawsuit in The Netherlands against Snapchat, alleging breaches of EU online safety, AI, and data protection regulations.