
OpenDailyNews

AI Research Digest – Auto-Generated Daily
Tuesday, February 24, 2026
🔬 Research (17 papers) 🏛️ Lab Check 🤖 Agentic AI 🇨🇳 China AI 💼 Hiring
🔬

Research Papers & Lab Outputs

17 items
1
Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory
Usman Anwar et al. arxiv.org/abs/2602.18297
🧠 Reasoning 🛡️ Alignment

Uses information theory to show that non-zero mutual information between CoT and output is necessary but NOT sufficient for CoT monitorability. Identifies two failure modes: "information gap" (monitor can't extract available info) and "elicitation error" (monitor approximates poorly). Proposes two training fixes: oracle-based method + label-free approach maximizing conditional mutual information. Both significantly improve monitoring even against adversarial training.

Why it matters: Core alignment infrastructure. First rigorous info-theoretic treatment of WHEN CoT monitoring works and when it doesn't. Critical as reasoning models get deployed at scale.
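The necessity condition is easy to see on toy data. A minimal sketch (illustrative binary variables, not the paper's setup): if the CoT carries zero mutual information about the output, no monitor can work, yet positive MI alone says nothing about whether a monitor can actually extract it.

```python
# Hedged sketch: empirical mutual information between a CoT-derived
# signal X and the model output Y, the quantity the paper argues is
# necessary (but not sufficient) for CoT monitorability.
# Toy data and variable names are illustrative, not from the paper.
import math
from collections import Counter

def mutual_information(pairs):
    """I(X;Y) in nats from a list of (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # ratio = p(x,y) / (p(x) * p(y))
        mi += p_joint * math.log(p_joint * n * n / (px[x] * py[y]))
    return mi

# X: a binary flag extracted from the CoT; Y: the final answer.
dependent = [(0, 0), (0, 0), (1, 1), (1, 1)] * 25    # X predicts Y
independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25  # X tells nothing

assert mutual_information(dependent) > 0.5     # monitoring possible
assert mutual_information(independent) < 1e-9  # monitoring hopeless
```

The paper's point is that the first case is only a precondition: a weak monitor can still fail to extract the information ("information gap") or approximate it badly ("elicitation error").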
2
Progressive Thought Encoding: Training Large Reasoning Models Efficiently
Xiaodong Liu et al. arxiv.org/abs/2602.16839
✅ ICLR 2026 🧠 Reasoning ⚡ Efficiency

Addresses the brutal efficiency problem of RL training for reasoning: long rollouts eat memory and time. Compresses intermediate reasoning into fixed-size vector representations, eliminating backprop through full-cache rollouts. Tested on Qwen2.5-3B/7B and DeepSeek-R1-Distill-Llama-8B: +19.3% over LoRA fine-tuning, +29.9% over base LRMs, up to +23.4 accuracy points on AIME2024/2025 under tight cache budgets.

Why it matters: Directly solves "RL training for reasoning is too expensive." ICLR 2026 accepted; expect this to become standard for training reasoning models under real-world constraints.
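The compression idea can be sketched in a few lines. Mean pooling below is a stand-in for the paper's learned encoder; the slot count and shapes are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch: replace a long span of intermediate reasoning states
# with a fixed-size representation so later training steps need not
# keep (or backprop through) the full rollout cache.
import numpy as np

hidden_dim, rollout_len = 64, 4096
rollout_states = np.random.default_rng(0).standard_normal(
    (rollout_len, hidden_dim)).astype(np.float32)

def compress_thoughts(states, n_slots=8):
    """Pool a (T, d) rollout into an (n_slots, d) fixed-size memory."""
    chunks = np.array_split(states, n_slots)
    return np.stack([c.mean(axis=0) for c in chunks])

memory = compress_thoughts(rollout_states)
assert memory.shape == (8, hidden_dim)
# The compressed memory has the same size regardless of rollout length.
assert compress_thoughts(rollout_states[:1000]).shape == (8, hidden_dim)
```

The cache cost of the compressed representation is constant in rollout length, which is where the reported savings under tight cache budgets would come from.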
3
Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability
Shashank Aggarwal et al. arxiv.org/abs/2602.17544
🧠 Reasoning 📊 Benchmark

Introduces two novel CoT quality metrics: "reusability" (can another model re-execute your CoT?) and "verifiability" (can another model reproduce your answer from your CoT?). Key finding: CoTs from o1-class reasoning models are NOT consistently more reusable or verifiable than those from general-purpose LLMs like Llama and Gemma.

Why it matters: Exposes a blind spot in accuracy-based reasoning leaderboards. If your CoT can't be reused by other agents in a pipeline, your model may be less useful than benchmarks suggest.
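The two metrics reduce to "can a second model run your steps" and "does running them reproduce your answer." A toy sketch with arithmetic CoT steps and `eval` as the stand-in second model (the actual metrics use LLMs as the re-executor, not `eval`):

```python
# Hedged sketch of the two metrics as summarized above; the "second
# model" here is a stand-in function, not a real LLM.

def reusability(cot_steps, executor):
    """Fraction of CoT steps the second model can re-execute."""
    ok = 0
    for step in cot_steps:
        try:
            executor(step)
            ok += 1
        except Exception:
            pass
    return ok / len(cot_steps)

def verifiability(cot_steps, answer, executor):
    """Does re-executing the full CoT reproduce the original answer?"""
    result = None
    for step in cot_steps:
        result = executor(step)
    return result == answer

# CoT expressed as tiny arithmetic steps; eval is the stand-in executor.
cot = ["2 + 3", "5 * 4", "20 - 6"]
assert reusability(cot, eval) == 1.0
assert verifiability(cot, 14, eval)
assert not verifiability(cot, 15, eval)
```

A CoT that scores high on accuracy leaderboards can still fail both checks, which is the blind spot the paper targets.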
4
Framework of Thoughts (FoT): A Foundation Framework for Dynamic Reasoning
Felix Fricke, Simon Malberg et al. arxiv.org/abs/2602.16512
🧠 Reasoning 🌐 Open Source

Unifies CoT, ToT, and GoT into a single general-purpose framework with built-in hyperparameter tuning, prompt optimization, parallel execution, and intelligent caching. Empirically shows faster execution, reduced costs, and better task scores. Open-source codebase released.

Why it matters: Instead of hand-picking reasoning schemes, FoT dynamically selects and optimizes. Could become the "standard library" for reasoning implementations.
5
Discovering Multiagent Learning Algorithms with AlphaEvolve
Zun Li et al. · Google DeepMind arxiv.org/abs/2602.16928
🤖 Agents 🧠 Reasoning

Uses AlphaEvolve (Google's evolutionary coding agent) to automatically discover new MARL algorithms. Discovers VAD-CFR with novel non-intuitive mechanisms including volatility-sensitive discounting, outperforming SOTA Discounted Predictive CFR+. Also discovers SHOR-PSRO for population-based training.

Why it matters: "AI designing AI algorithms" becoming concrete in multi-agent settings. LLM-powered evolution finds variants humans wouldn't design. Directly relevant to agentic RL.
6
OMAD: Online Multi-Agent Diffusion Policies
Zhuoran Li et al. arxiv.org/abs/2602.18291
🤖 Agents ⚡ Architecture

First framework bringing diffusion policies to online MARL. Uses a relaxed policy objective that maximizes scaled joint entropy, plus a joint distributional value function within the CTDE paradigm. Results: 2.5–5× improvement in sample efficiency over SOTA across 10 MPE and MAMuJoCo tasks.

Why it matters: Diffusion models finally making it into online multi-agent RL. The sample efficiency gains are dramatic.
7
Overseeing Agents Without Constant Oversight
Madeleine Grunde-McLaughlin et al. · Microsoft Research arxiv.org/abs/2602.16844
🤖 Agents 🛡️ Safety

Three user studies on Computer User Agents. Finding: even improved interfaces for verifying agent traces didn't meaningfully improve human accuracy at catching errors. Users' correctness criteria are subjective and changing. Communicating agent process is both essential and insufficient.

Why it matters: Sobering for "human-in-the-loop" safety. Good interfaces ≠ good oversight. Anyone shipping CUA/agent products needs to read this.
8
2025 AI Agent Index
Leon Staufer et al. · MIT arxiv.org/abs/2602.17753
🤖 Agents 📊 Benchmark

Comprehensive index of 30 SOTA AI agents documenting origins, design, capabilities, and safety features. Most developers share little about safety, evaluations, or societal impacts. Available at aiagentindex.mit.edu.

Why it matters: First systematic documentation of the deployed agent ecosystem. Useful reference for capability and safety tracking.
9
SPQ: SVD-Pruning-Quantization Ensemble for LLM Compression
Jiamin Yao, Eren Gultepe et al. arxiv.org/abs/2602.18420
✅ LREC 2026 ⚡ Efficiency

Combines variance-retained SVD + activation-based pruning + 8-bit quantization. LLaMA-2-7B: 75% memory reduction while IMPROVING perplexity (5.47→4.91). 6.86GB vs GPTQ's 7.16GB. 1.9× speedup.

Why it matters: Memory reduction + quality improvement + speed improvement. The compression trifecta actually achieved.
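The three-stage pipeline can be sketched on a single weight matrix. Magnitude pruning below is a stand-in for the paper's activation-based pruning, and the rank/sparsity/bit-width choices are illustrative, not the paper's settings.

```python
# Hedged sketch of an SPQ-style pipeline: variance-retained truncated
# SVD, then pruning, then symmetric 8-bit quantization.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)

# 1) Truncated SVD: keep the smallest rank covering 90% of variance.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
keep = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), 0.90)) + 1
W_lr = (U[:, :keep] * s[:keep]) @ Vt[:keep]

# 2) Prune the ~20% smallest-magnitude weights (proxy for
#    activation-based pruning, which needs calibration data).
thresh = np.quantile(np.abs(W_lr), 0.20)
W_pruned = np.where(np.abs(W_lr) >= thresh, W_lr, 0.0)

# 3) Symmetric int8 quantization with a single per-tensor scale.
scale = np.abs(W_pruned).max() / 127.0
W_q = np.clip(np.round(W_pruned / scale), -127, 127).astype(np.int8)
W_hat = W_q.astype(np.float32) * scale

assert W_q.dtype == np.int8                    # 4x smaller than fp32
assert np.mean(W_pruned == 0) > 0.19           # sparsity achieved
assert np.abs(W_hat - W_pruned).max() <= scale # bounded quant error
```

The interesting claim in the paper is that, on LLaMA-2-7B, the combined pipeline not only shrinks memory but also improves perplexity, which a naive sketch like this would not predict.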
10
JPmHC: Orthogonal Hyper-Connections for Stable Deep Learning
Jinhua Wang et al. arxiv.org/abs/2602.18308
⚡ Architecture

Constrains the Hyper-Connection mixer to operator-norm-bounded manifolds (Stiefel, Grassmann). Uses Cayley transforms for orthogonality without post-hoc normalization. Evaluated on ARC-AGI: faster convergence, higher accuracy, lower compute vs baselines.

Why it matters: Architecture innovation targeting ARC-AGI. Hyper-Connections gaining steam as replacement for simple residual connections.
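The Cayley-transform trick is self-contained enough to verify numerically: mapping a skew-symmetric A to Q = (I − A)(I + A)⁻¹ yields an exactly orthogonal mixing matrix with no post-hoc normalization step. Toy dimensions; not the paper's JPmHC parameterization.

```python
# Hedged sketch: Cayley transform of a skew-symmetric matrix produces
# an orthogonal matrix by construction.
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))
A = 0.5 * (M - M.T)                 # skew-symmetric: A.T == -A
I = np.eye(8)
Q = (I - A) @ np.linalg.inv(I + A)  # Cayley transform

# Orthogonality holds up to float error, with no renormalization.
assert np.allclose(Q.T @ Q, I)
assert np.allclose(np.abs(np.linalg.det(Q)), 1.0)
```

Because orthogonality is guaranteed by the parameterization itself, gradient updates to A can never push the mixer off the manifold, which is the stability argument for this family of constructions.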
11
On the "Induction Bias" in Sequence Models: Transformers vs RNNs
MReza Ebrahimi et al. arxiv.org/abs/2602.18333
⚡ Architecture 📊 Benchmark

Large-scale study: transformers are dramatically less data-efficient than RNNs for state tracking. Data requirements grow much faster with state-space size and sequence length. RNNs share weights across lengths; transformers learn length-specific solutions in isolation, showing negligible or detrimental weight sharing.

Why it matters: Renews the case for recurrent architectures. If state tracking is fundamental to reasoning, transformers may be structurally disadvantaged.
12
RoboGene: Agentic Framework for Real-World Task Generation
Yixue Zhang et al. arxiv.org/abs/2602.16444
🦾 Robotics

Automates generation of diverse manipulation tasks for single-arm, dual-arm, and mobile robots. Diversity-driven sampling + self-reflection + human-in-the-loop. 18K real trajectories collected.

Why it matters: Tackles robotics data scarcity from the task-generation side. Auto-specify diverse tasks → scale data collection faster.
13
Zero-Shot Interactive Perception (ZS-IP)
Venkatesh Sripada et al. arxiv.org/abs/2602.18374
🦾 Robotics 🖼️ Multimodal

Introduces "pushlines," a 2D visual augmentation for robotic pushing actions. Outperforms grid-based augmentations optimized for pick-and-place. Couples multi-strategy manipulation with a memory-driven VLM.

Why it matters: VLMs for robotics need better primitives for contact-rich actions beyond pick-and-place. Pushlines are simple but clever.
14
VeriSoftBench: Repository-Scale Formal Verification for Lean
Qiaochu Chen et al. arxiv.org/abs/2602.18307
💻 Code 🔬 Math 🌐 Open Source

500 Lean 4 proof obligations from real open-source projects (NOT Mathlib). Key: Mathlib-tuned provers transfer POORLY to software verification. Success correlates with transitive dependency closure size. Released at github.com/utopia-group/VeriSoftBench.

Why it matters: Formal verification is where coding agents meet theorem proving. Current provers are overfit to math; this benchmark exposes the gap.
15
Generated Reality: Human-Centric World Simulation
Lisong Sun et al. arxiv.org/abs/2602.18422
๐Ÿ–ผ๏ธ Multimodal ๐Ÿฆพ Embodied AI

Video world model conditioned on tracked head pose AND joint-level hand poses for XR. Bidirectional video diffusion teacher distilled to a causal interactive system. Human-subject studies show improved task performance and higher perceived control. Project: codeysun.github.io/generated-reality

Why it matters: World models responding to actual body motion, not just text. Bridge between video generation and embodied interaction.
16
On the Adversarial Robustness of Discrete Image Tokenizers
Rishika Bhagwatkar et al. arxiv.org/abs/2602.18252
๐Ÿ›ก๏ธ Safety ๐Ÿ–ผ๏ธ Multimodal

First work on adversarial attacks against discrete image tokenizers in multimodal LLMs. Attacks are computationally efficient and effective across classification, retrieval, captioning. Defense: unsupervised adversarial training with all other components frozen.

Why it matters: As multimodal models shift to discrete tokenizers, knowing they're vulnerable (and how to fix it) matters.
17
Mean-Field RL Without Synchrony (TMF)
Shan Yang et al. arxiv.org/abs/2602.18026
🤖 Agents

Extends mean-field RL to asynchronous settings where not all agents act simultaneously. Builds theory around the population distribution μ rather than the mean action. Proves an O(1/√N) approximation bound regardless of how many agents act per step. TMF-PG converges to a unique equilibrium.

Why it matters: Real multi-agent systems are asynchronous. This fills a fundamental gap in MF-RL theory.
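The O(1/√N) claim has a simple empirical shadow: the gap between an empirical action distribution and its mean-field limit μ shrinks like 1/√(number of acting agents), even when only a fraction of agents act per step. A toy Bernoulli sketch, not the paper's TMF construction:

```python
# Hedged sketch: concentration of the empirical action rate around the
# mean-field probability mu, with only half the agents acting per step.
import random

random.seed(0)
mu = 0.3  # mean-field probability that an agent takes action 1

def empirical_gap(n_agents, frac_acting=0.5):
    """|empirical action rate - mu| when only some agents act."""
    actors = max(1, int(n_agents * frac_acting))
    acts = sum(random.random() < mu for _ in range(actors))
    return abs(acts / actors - mu)

# Average gap over trials shrinks roughly like 1/sqrt(N).
gap_small = sum(empirical_gap(100) for _ in range(2000)) / 2000
gap_large = sum(empirical_gap(10000) for _ in range(2000)) / 2000
assert gap_large < gap_small / 3  # ~100x more agents -> ~10x smaller gap
```

The paper's contribution is proving a bound of this shape for the full RL setting, independent of how many agents happen to act in a given step.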
๐Ÿ›๏ธ

Lab-by-Lab Check

OpenAI

No new research publications. Sent memo to Congress accusing DeepSeek of IP theft via distillation.

Anthropic

Published distillation accusations naming DeepSeek, Moonshot, MiniMax. $380B valuation via $30B Series G. No new papers.

Google DeepMind

Paper #5: AlphaEvolve for discovering multi-agent algorithms. Demis Hassabis spoke at India AI Summit.

Meta FAIR

No new papers or announcements detected.

DeepSeek

V4 release imminent. Reuters: trained on export-banned Nvidia Blackwell chips. Reportedly outperforms Claude + ChatGPT in coding.

Microsoft Research

Paper #7: Agent oversight user studies; even better interfaces don't help humans catch errors.

Zhipu AI (Tsinghua)

GLM-5 (744B, MIT license) released Feb 12. Shares tumbled 23% on compute shortage + user complaints.

๐Ÿ” Researcher Check: All 9 researchers individually searched (Karpathy, Jim Fan, Jason Wei, Hyung Won Chung, Shunyu Yao, Tri Dao, Percy Liang, Graham Neubig, Jianlin Su). Sources: arxiv, Twitter/X, lab pages, kexue.fm. No new papers or significant posts in last 24h.
🤖

Agentic AI News & Products

🔥 Anthropic Accuses Chinese Labs of Industrial-Scale Distillation

DeepSeek (150K exchanges targeting reasoning/censorship bypass), Moonshot AI (3.4M targeting agentic reasoning/tool use), MiniMax (13M targeting agentic coding/orchestration). 24K+ fake accounts total. MiniMax redirected half its traffic on Claude launch day.

First public forensic evidence of cross-border model distillation at this scale.

🔥 Alipay AI Pay: 120M+ Agentic Transactions in One Week

Alipay's agentic payment solution processed 120M+ transactions Feb 5-11. Luckin Coffee among early adopters. Agentic AI at consumer payment scale.

First proof point of agents processing >100M real-money transactions/week. The agent economy is no longer theoretical.

🔧 Microsoft Foundry Updates

Added GPT-5.2, Mistral Large 3, DeepSeek V3.2, Kimi-K2 Thinking. New: persistent agent memory + agent-to-agent communication primitives.

Microsoft building the "agent App Store" layer. Persistent memory + agent-to-agent comms are key production primitives.

💰 Coinbase Agentic Wallets

Wallets designed for AI agents to hold and transact crypto. Solana also making agent economy moves.

🇨🇳

China AI Ecosystem

🚨 DeepSeek V4 Imminent: Trained on Nvidia Blackwell

Reuters exclusive: a senior Trump admin official confirmed V4 was trained on banned Blackwell chips. Release as soon as next week. Reportedly outperforms Claude + ChatGPT in coding. DeepSeek expected to scrub technical indicators before release. Trained at an Inner Mongolia data center.

Potential export control violation that could reshape the entire chip policy debate. Markets bracing for Nasdaq impact; R1 triggered a $1T selloff last time.

📉 Zhipu AI Shares Tumble 23% Despite GLM-5

GLM-5: 744B MoE (44B active), MIT license, #1 on SWE-rebench (42.1%), trained entirely on Huawei Ascend. Despite strong model, shares crashed on compute shortage + user complaints. Publicly appealing for inference compute support.

The "compute ceiling" hitting Chinese AI in real time. Great model ≠ great product if you can't serve it.

๐ŸŒ IEEE Spectrum: Different AI Futures

Excellent analysis: "China defines AI as an enabler to improve existing industry – healthcare, energy, agriculture." Beijing prioritizes applied AI over frontier AGI. Domestic chips expected to power 30-40% of China's compute by 2026.

๐Ÿ›ก๏ธ Distillation War Escalates

Both Anthropic and OpenAI now formally accusing Chinese labs. WSJ, Bloomberg, NYT, Fox News all covering. Becoming a US-China AI cold war flashpoint with policy consequences.

💼

China Agent Startups Hiring

Post-Lunar New Year wave expected

DeepSeek

๐Ÿ“ Hangzhou

V4 imminent. Backed by High-Flyer quant fund. Scaling aggressively for post-launch. Algorithm researchers + infra engineers.

Moonshot AI (Kimi)

๐Ÿ“ Beijing

Released K2.5 + coding agent in Jan. Building agentic reasoning capabilities. ML engineers + backend. Series B+.

MiniMax

๐Ÿ“ Shanghai

Heavy investment in agent orchestration/coding. Tencent-backed. ML, backend, product roles. Series C.

Zhipu AI

๐Ÿ“ Beijing

Desperate for infra/compute engineers after GLM-5 scaling issues. Tsinghua-adjacent. Series C ($300M+).

ByteDance (Coze/Doubao)

๐Ÿ“ Beijing / Shanghai

Massive hiring for Coze agent platform. Backend, ML, full-stack. Biggest AI agent employer in China by headcount.

Alibaba (Qwen team)

๐Ÿ“ Hangzhou

Qwen 3.5 just released with agent focus. Algorithm researchers + systems engineers.