Uses information theory to show that non-zero mutual information between CoT and output is necessary but NOT sufficient for CoT monitorability. Identifies two failure modes: "information gap" (monitor can't extract available info) and "elicitation error" (monitor approximates poorly). Proposes two training fixes: oracle-based method + label-free approach maximizing conditional mutual information. Both significantly improve monitoring even against adversarial training.
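The necessity side of the claim (monitorability needs nonzero I(CoT; output)) can be illustrated with a toy plug-in mutual-information estimate over (CoT-feature, output) pairs. This is purely illustrative, not the paper's code; `mutual_information` and the toy data are assumptions:

```python
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

# A CoT feature that determines the output carries maximal information;
# an independent one carries none. Nonzero MI is necessary -- but, per
# the paper, says nothing about whether a monitor can extract it.
deterministic = [("safe", 0), ("unsafe", 1)] * 50
independent = [("safe", 0), ("safe", 1), ("unsafe", 0), ("unsafe", 1)] * 25
print(mutual_information(deterministic))  # 1.0
print(mutual_information(independent))    # 0.0
```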
Addresses the brutal efficiency problem of RL training for reasoning: long rollouts eat memory and time. Compresses intermediate reasoning into fixed-size vector representations, eliminating backprop through full-cache rollouts. Tested on Qwen2.5-3B/7B and DeepSeek-R1-Distill-Llama-8B: +19.3% over LoRA fine-tuning, +29.9% over base LRMs, up to +23.4 accuracy points on AIME2024/2025 under tight cache budgets.
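To see the shape of the idea, here is a toy chunked mean-pooling stand-in for the compression step. The paper's actual compressor is learned; `compress_rollout`, the chunk count, and the shapes are all assumptions for illustration:

```python
import numpy as np

def compress_rollout(hidden_states, k=4):
    """Collapse a (T, d) rollout of hidden states into k fixed-size
    vectors via chunked mean pooling. Stand-in only: the point is that
    gradients would flow through k vectors instead of T cached steps."""
    chunks = np.array_split(hidden_states, k)
    return np.stack([c.mean(axis=0) for c in chunks])  # shape (k, d)

h = np.random.default_rng(0).standard_normal((1000, 16))  # long rollout
z = compress_rollout(h)
print(z.shape)  # (4, 16)
```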
Introduces two novel CoT quality metrics: "reusability" (can another model re-execute your CoT?) and "verifiability" (can another model reproduce your answer from your CoT?). Key finding: CoTs from o1-class reasoning models are NOT consistently more reusable or verifiable than those from general-purpose LLMs like Llama and Gemma.
Unifies CoT, ToT, and GoT into a single general-purpose framework with built-in hyperparameter tuning, prompt optimization, parallel execution, and intelligent caching. Empirically shows faster execution, reduced costs, and better task scores. Open-source codebase released.
Uses AlphaEvolve (Google's evolutionary coding agent) to automatically discover new MARL algorithms. Discovers VAD-CFR with novel non-intuitive mechanisms including volatility-sensitive discounting, outperforming SOTA Discounted Predictive CFR+. Also discovers SHOR-PSRO for population-based training.
First framework bringing diffusion policies to online MARL. Uses relaxed policy objective maximizing scaled joint entropy. Joint distributional value function within CTDE paradigm. Results: 2.5–5× improvement in sample efficiency over SOTA across 10 MPE and MAMuJoCo tasks.
Three user studies on computer-use agents. Finding: even improved interfaces for verifying agent traces didn't meaningfully improve human accuracy at catching errors. Users' correctness criteria are subjective and shift over time. Communicating the agent's process is essential but not sufficient.
Comprehensive index of 30 SOTA AI agents documenting origins, design, capabilities, and safety features. Most developers share little about safety, evaluations, or societal impacts. Available at aiagentindex.mit.edu.
Combines variance-retained SVD + activation-based pruning + 8-bit quantization. LLaMA-2-7B: 75% memory reduction while IMPROVING perplexity (5.47 → 4.91). 6.86GB vs GPTQ's 7.16GB. 1.9× speedup.
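The variance-retained SVD step can be sketched as: keep the smallest rank whose singular values retain a target fraction of spectral energy, then store the two low-rank factors. A minimal NumPy sketch under that reading; the function name and 99% threshold are illustrative, and the pruning/quantization stages are omitted:

```python
import numpy as np

def variance_retained_svd(W, retain=0.95):
    """Truncate the SVD of a weight matrix at the smallest rank whose
    singular values retain `retain` of the total spectral energy."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    energy = np.cumsum(S**2) / np.sum(S**2)
    r = int(np.searchsorted(energy, retain)) + 1
    return U[:, :r] * S[:r], Vt[:r]  # factors A, B with A @ B ~= W

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 64)) @ rng.standard_normal((64, 256))
A, B = variance_retained_svd(W, retain=0.99)
rel_err = np.linalg.norm(A @ B - W) / np.linalg.norm(W)
# Retaining 99% of energy bounds the relative Frobenius error by ~0.1.
print(A.shape[1], rel_err)
```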
Constrains Hyper-Connection mixer on operator-norm-bounded manifolds (Stiefel, Grassmann). Uses Cayley transforms for orthogonality without post-hoc normalization. Evaluated on ARC-AGI: faster convergence, higher accuracy, lower compute vs baselines.
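The Cayley transform referenced above maps any skew-symmetric matrix to an orthogonal one, so orthogonality holds by construction rather than via post-hoc normalization. A minimal NumPy check (illustrative, not the paper's code):

```python
import numpy as np

def cayley(A):
    """Cayley transform: for skew-symmetric A, returns the orthogonal
    matrix Q = (I - A)^{-1} (I + A). I - A is always invertible here
    because skew-symmetric matrices have purely imaginary eigenvalues."""
    I = np.eye(A.shape[0])
    return np.linalg.solve(I - A, I + A)

rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))
A = (M - M.T) / 2            # skew-symmetric parameterization
Q = cayley(A)
print(np.allclose(Q.T @ Q, np.eye(8)))  # True: orthogonal by construction
```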
Large-scale study: transformers are dramatically less data-efficient than RNNs for state tracking. Data requirements grow much faster with state-space size and sequence length. RNNs reuse the same solution across lengths; transformers learn length-specific solutions in isolation, with negligible or even detrimental transfer between lengths.
Automates generation of diverse manipulation tasks for single-arm, dual-arm, and mobile robots. Diversity-driven sampling + self-reflection + human-in-the-loop. 18K real trajectories collected.
Novel "pushlines": 2D visual augmentation for robotic pushing actions. Outperforms grid-based augmentations optimized for pick-and-place. Couples multi-strategy manipulation with memory-driven VLM.
500 Lean 4 proof obligations from real open-source projects (NOT Mathlib). Key: Mathlib-tuned provers transfer POORLY to software verification. Success correlates with transitive dependency closure size. Released at github.com/utopia-group/VeriSoftBench.
Video world model conditioned on tracked head pose AND joint-level hand poses for XR. Bidirectional video diffusion teacher distilled to causal interactive system. Human subjects show improved task performance and higher perceived control. Project: codeysun.github.io/generated-reality
First work on adversarial attacks against discrete image tokenizers in multimodal LLMs. Attacks are computationally efficient and effective across classification, retrieval, captioning. Defense: unsupervised adversarial training with all other components frozen.
Extends mean-field RL to asynchronous settings where not all agents act simultaneously. Builds theory around population distribution μ rather than the mean action. Proves O(1/√N) approximation bound regardless of how many agents act per step. TMF-PG converges to unique equilibrium.
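Stated schematically (my paraphrase of the summary's claim, not the paper's exact notation: N agents, population distribution μ, value under the finite-agent vs. mean-field dynamics):

```latex
\left| V^{N}(\mu) - V^{\mathrm{MF}}(\mu) \right| \le \frac{C}{\sqrt{N}}
```

with the constant C independent of how many of the N agents act at each step.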
No new research publications. Sent memo to Congress accusing DeepSeek of IP theft via distillation.
Published distillation accusations naming DeepSeek, Moonshot, MiniMax. $380B valuation via $30B Series G. No new papers.
Paper #5: AlphaEvolve for discovering multi-agent algorithms. Demis Hassabis spoke at India AI Summit.
No new papers or announcements detected.
V4 release imminent. Reuters: trained on banned Nvidia Blackwell. Reportedly outperforms Claude + ChatGPT in coding.
Paper #7: Agent oversight user studies – even better interfaces don't help humans catch errors.
GLM-5 (744B, MIT license) released Feb 12. Shares tumbled 23% on compute shortage + user complaints.
DeepSeek (150K exchanges targeting reasoning/censorship bypass), Moonshot AI (3.4M targeting agentic reasoning/tool use), MiniMax (13M targeting agentic coding/orchestration). 24K+ fake accounts total. MiniMax redirected half its traffic on Claude launch day.
Alipay's agentic payment solution processed 120M+ transactions Feb 5-11. Luckin Coffee among early adopters. Agentic AI at consumer payment scale.
Added GPT-5.2, Mistral Large 3, DeepSeek V3.2, Kimi-K2 Thinking. New: persistent agent memory + agent-to-agent communication primitives.
Wallets designed for AI agents to hold and transact crypto. Solana also making agent economy moves.
Reuters exclusive: Senior Trump admin official confirmed V4 trained on banned Blackwell chips. Release as soon as next week. Reportedly outperforms Claude + ChatGPT in coding. DeepSeek expected to scrub technical indicators before release. Inner Mongolia data center.
GLM-5: 744B MoE (44B active), MIT license, #1 on SWE-rebench (42.1%), trained entirely on Huawei Ascend. Despite strong model, shares crashed on compute shortage + user complaints. Publicly appealing for inference compute support.
Excellent analysis: "China defines AI as an enabler to improve existing industry โ healthcare, energy, agriculture." Beijing prioritizes applied AI over frontier AGI. Domestic chips expected to power 30-40% of China's compute by 2026.
Both Anthropic and OpenAI now formally accusing Chinese labs. WSJ, Bloomberg, NYT, Fox News all covering. Becoming a US-China AI cold war flashpoint with policy consequences.
V4 imminent. Backed by High-Flyer quant fund. Scaling aggressively for post-launch. Algorithm researchers + infra engineers.
Released K2.5 + coding agent in Jan. Building agentic reasoning capabilities. ML engineers + backend. Series B+.
Heavy investment in agent orchestration/coding. Tencent-backed. ML, backend, product roles. Series C.
Desperate for infra/compute engineers after GLM-5 scaling issues. Tsinghua-adjacent. Series C ($300M+).
Massive hiring for Coze agent platform. Backend, ML, full-stack. Biggest AI agent employer in China by headcount.
Qwen 3.5 just released with agent focus. Algorithm researchers + systems engineers.