ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration
ScarfBench tests whether AI agents can reliably support enterprise-scale Java framework migrations using realistic tasks and evaluation. The benchmark is designed to stress real-world constraints where “it works in demos” isn’t enough, pushing for measurable migration correctness instead of generic coding ability.
DysLexLens: A Low-Resource LLM Framework for Analysing Dyslexic Learners Insights from Online Forums
DysLexLens runs messy dyslexia-related forum posts through an evidence-traceable pipeline that combines dictionary-driven filtering, knowledge-graph reasoning, and verifiable response evaluation. It’s built for low-resource data and explicitly targets hallucination and evidence alignment, using quantitative and human-grounded checks across 30 questions.
Introducing GeneBench-Pro
GeneBench-Pro expands the benchmark frontier for genomics and scientific research by evaluating models on complex, real-world datasets rather than toy tasks. The goal is to better measure whether AI systems can handle the data richness and scientific constraints that real lab and research workflows require.
Grounded Iterative Language Planning: How Parameterized World Models Reduce Hallucination Propagation in LLM Agents
This work dissects why agentic world models can “hallucinate” state changes that are hard to score, then introduces Grounded Iterative Language Planning (GILP) to curb those errors. By training a compact parameterized backbone and gating consistency between it and API-based reasoning, the authors report a major drop in the hallucinated-state rate on real GPT-4o-mini calls.
Internalizing the Future: A Unified Agentic Training Paradigm for World Model Planning
The paper targets a core limitation of many LLM agents – they plan reactively instead of simulating “what if” futures before committing. It proposes a capability-first, three-stage training pipeline that teaches a model to generate plan-conditioned success estimates, outperforming baselines on search and math reasoning tasks.
MER-R1: Multimodal Emotion Reasoning via Slow-Fast Thinking Synergy
MER-R1 reframes multimodal emotion recognition as a recall-versus-precision trade-off that can be optimized rather than accepted. Using reinforcement learning to explicitly combine fast-thinking intuition with slow-thinking selectivity, it improves emotion recognition performance while making the reasoning behavior more robust.
AI-Model Network: Concept, Current State and Future
AI-Model Network (AI-ModelNet) is a systems vision for connecting specialized models so they can share capabilities and collaborate on reasoning – modeled after how the internet interconnects computers. The paper outlines an architecture for pathways between models, aiming to reduce today’s bottleneck where multi-model interaction is still ad hoc.
When Does Personality Composition Matter for Multi-Agent LLM Teams?
This study tests whether changing LLM personality settings actually improves group performance in multi-agent settings. The results are domain-dependent – personality shifts barely affect coding milestone completion but can significantly degrade open-ended collaboration and competitive bargaining performance.
Towards Reliable and Robust LLM Planning: Symbolic Feedback-Driven Iterative Self-Refinement Framework
The framework improves LLM planning reliability by introducing symbolic feedback that maps logical symbols into natural language constraints. A symbolic verifier catches errors, issues corrective instructions, and uses goal reachability recognition to guide iterative self-refinement toward feasible, correct long-horizon plans.
Netflix is using an AI-generated Gene Wilder voice in its Willy Wonka reality show
Netflix’s upcoming Wonka reality show will feature an AI-generated Gene Wilder voice, created with the consent of Wilder’s family and in collaboration with ElevenLabs. The move highlights how mainstream media is normalizing high-profile voice synthesis while trying to keep rights and permissions in the loop.
Odyssey: Constructing Verifiable Local Truth-Preserving Foundation Models
Odyssey proposes a categorical framework for building foundation models as compositions of “foundries” that preserve local truths and support verifiable integration. The approach formalizes how model components can be restricted, glued, diagnosed, and admitted via a certification mechanism to keep claims grounded across heterogeneous sources.
ToE: A Hierarchical and Explainable Claim Verification Framework with Dynamic Multi-source Evidence Retrieval and Aggregation
Tree of Evidence (ToE) tackles AI misinformation amplified through retrieval poisoning by building an explainable argument tree for each claim. It combines a retrieval agent, evidence evaluation, and hierarchical aggregation, reporting sizable gains on adversarially poisoned inputs.
Google’s NotebookLM can sum up your research in a TikTok-style clip
NotebookLM is adding short vertical video overviews that convert uploaded sources into ~60-second AI clips, aiming to make research skimmable in a social format. The feature joins existing NotebookLM output styles like podcasts, cinematic video summaries, and visual explainers.
Core dump epidemiology: fixing an 18-year-old bug
OpenAI engineers used large-scale core dump analysis to track down rare infrastructure crashes, uncovering a hardware fault and an 18-year-old software bug in the process. It’s a reminder that “AI progress” often depends on boring, high-stakes reliability work happening underneath the models.
Why Specialization Is Inevitable
This analysis argues that the field is moving toward domain-specific specialization because the cost and complexity of universal models become hard to justify in practice. It frames specialization as a necessary outcome of scaling pressures, deployment realities, and the need for private, lightweight systems.