SpeechDx: A Multi-Task Benchmark for Clinical Speech AI
SpeechDx introduces a large-scale clinical speech benchmark spanning 12 datasets and 27 tasks, organizing evaluations by the specific stage of speech production that each condition disrupts (conceptualization, formulation, articulation). Testing across limited-label settings and cross-dataset transfers, the authors find large-scale speech models are the strongest baselines overall, but no current approach reliably generalizes across the full clinical speech landscape—highlighting a clear gap toward truly general-purpose clinical speech representations.
Introducing LifeSciBench
LifeSciBench is positioned as an expert-authored, expert-reviewed benchmark for evaluating whether AI systems can handle real-world life science research tasks and decisions. The focus is less on toy question answering and more on performance in workflows that resemble what scientists actually do, so that benchmark results reflect something closer to operational scientific competence.
When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval
This work tackles legal case retrieval by using an LLM-driven, self-evolving loop that automatically generates rule sets for query rewriting—then validates and prunes ineffective rules without parameter training. On LeCaRD-v2, the framework improves over non-evolutionary BM25 baselines and human-designed rules, showing that “rule evolution” can systematically upgrade sparse retrieval when the right validation environment is in place.
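The validate-and-prune idea can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the term-overlap scorer is a toy stand-in for BM25, the rules are hand-written rather than LLM-generated, and the documents, queries, and hit@1 metric are hypothetical rather than taken from LeCaRD-v2 or the paper.

```python
from typing import Callable, Dict, List, Tuple

Rule = Callable[[str], str]  # a rule rewrites a query string

def overlap_score(query: str, doc: str) -> int:
    """Toy stand-in for a sparse retriever score: number of shared terms."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def hit_at_1(rules: List[Rule], queries: List[Tuple[str, str]],
             docs: Dict[str, str]) -> float:
    """Fraction of queries whose top-scored document is the relevant one."""
    hits = 0
    for query, relevant_id in queries:
        for rule in rules:
            query = rule(query)
        best = max(docs, key=lambda d: overlap_score(query, docs[d]))
        hits += best == relevant_id
    return hits / len(queries)

def prune_rules(candidates: List[Rule], queries: List[Tuple[str, str]],
                docs: Dict[str, str]) -> List[Rule]:
    """Keep a candidate rule only if it improves the validation metric."""
    kept: List[Rule] = []
    baseline = hit_at_1(kept, queries, docs)
    for rule in candidates:
        score = hit_at_1(kept + [rule], queries, docs)
        if score > baseline:
            kept.append(rule)
            baseline = score
    return kept

# Hypothetical toy corpus and validation queries (query, relevant doc id).
docs = {
    "d0": "traffic violation fine",
    "d1": "theft of property larceny conviction",
    "d2": "contract breach damages award",
}
queries = [("stealing case", "d1"), ("broken agreement payout", "d2")]

def expand_theft(q: str) -> str:
    return q.replace("stealing", "theft larceny")

def append_noise(q: str) -> str:
    return q + " fine"  # an unhelpful rule that should be pruned

def expand_contract(q: str) -> str:
    return q.replace("broken agreement", "contract breach")

kept = prune_rules([expand_theft, append_noise, expand_contract], queries, docs)
```

In this toy setup no model parameters are trained: the only thing that changes is which rewriting rules survive validation, which is the property the summary attributes to the framework.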
Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty
Instead of judging reliability by answer variation alone, the paper introduces structural uncertainty: a consistency-aware signal built from how an LLM ranks its own reasoning candidates via pairwise preferences and ranking stability. Across multiple benchmarks, structural signals help identify unreliable logical and mathematical reasoning—while collapsing toward uniformity for factual retrieval—suggesting the method is regime-sensitive rather than a universal confidence score.
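As a rough illustration of a stability signal built from pairwise preferences, the sketch below measures how often repeated pairwise judgments disagree with their own majority. This is not the paper's formula: the function name `pairwise_instability`, the input layout, and the choice of majority-share disagreement are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

def pairwise_instability(trials: List[Dict[Pair, str]]) -> float:
    """Average, over candidate pairs, of the share of trials that
    disagree with the majority preference for that pair.

    Each trial maps a pair of candidate ids to the preferred candidate.
    Returns 0.0 when every trial agrees and 0.5 when every pair is split evenly.
    """
    pairs = list(trials[0].keys())
    total = 0.0
    for pair in pairs:
        votes = Counter(trial[pair] for trial in trials)
        majority_share = votes.most_common(1)[0][1] / len(trials)
        total += 1.0 - majority_share
    return total / len(pairs)

# Hypothetical preference judgments over three reasoning candidates A, B, C.
consistent = {("A", "B"): "A", ("A", "C"): "A", ("B", "C"): "B"}
flipped = {("A", "B"): "B", ("A", "C"): "C", ("B", "C"): "C"}

stable_score = pairwise_instability([consistent] * 4)
unstable_score = pairwise_instability([consistent, flipped, consistent, flipped])
```

A higher score in this sketch means the model's own ordering of its reasoning candidates is less stable across repeated comparisons, which is the kind of signal the summary describes as useful for flagging unreliable reasoning.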
SkillChain-Gym: A Benchmark for Reskilling-Aware Production-Inventory Control under Disruptions
SkillChain-Gym proposes a benchmark where workforce skills aren’t just background constraints—they’re actively trained, decay over time, and compete for the same limited hours as production. The authors show policy performance is regime-dependent: maintenance training can be necessary even without disruptions, and forecast-aware adaptive training helps when bottlenecks are visible, while lean insurance-style cross-training can dominate under surprise shocks and absenteeism.
Distributed General-Purpose Agent Networks: Architecture, Key Mechanisms, and Prototypes
This paper lays out an architecture for open, peer-to-peer networks of heterogeneous LLM agents that discover collaborators, negotiate trust, and execute open-ended tasks. The authors argue that standard multi-agent frameworks and P2P overlays don’t cover the core requirement: semantic propagation of intentions, capabilities, and cooperation constraints. They then propose mechanism designs for discovery, identity/reputation, and scalable execution, backed by prototype overhead measurements and attack-aware simulations.
The next humanoid robot might not look human at all
Genesis AI’s Eno is a bet that “humanoid” can be functional rather than visual—designed around human capability instead of human appearance, potentially with no head, no legs, or even a folding, wheeled form factor. The story emphasizes a general-purpose direction for the platform, while keeping hands as the closest-to-human element for manipulation and dexterity.
Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation
This work presents an online adaptive clinical decision support framework that combines treatment-effect estimation, patient digital twins for trajectory simulation, and reinforcement learning for sequential recommendations. Safety is enforced with a rule-based contraindication monitor, and cases with high internal disagreement are flagged for clinician review—showing improved effectiveness and stability in both synthetic simulation and a real ovarian cancer dataset.
Anthropic got hit by export rules nobody understands
Anthropic says a Trump administration export-controls order forced it to cut access for foreign nationals—blocking model access broadly, including inside the US for certain users and employees—resulting in Fable 5 and Mythos 5 being unavailable to everyone. The article spotlights the unusual way export rules are being used to restrict AI access and the confusion around the legal basis.
Google’s first smart speaker in six years arrives next week
Google’s Home Speaker, shipping June 25 with preorders opening June 17, is the company’s first new smart speaker in six years, with a familiar design language and Gemini-for-Home positioning. The reporting notes that the hardware is consistent with the announcement timeline, and that the product is clearly aimed at tightening Google’s voice and smart-home integration.
MemTrace: Probing What Final Accuracy Misses in Long-Term Memory
MemTrace argues that long-term memory evaluation shouldn’t just average correctness over question episodes, because pooled accuracy can mask failures in tracking how facts change over time. By treating “knowledge points” as the unit of measurement and varying memory age, question type, and evidence conditions, the study finds that evidence use is the dominant bottleneck: systems often retrieve the evidence needed to resolve a contradiction, yet still fail to use it effectively.
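The masking effect of pooled accuracy is easy to see in a small sketch. The records below are hypothetical toy data, and the field names `kp` and `correct` are assumptions for illustration, not MemTrace's actual schema.

```python
from collections import defaultdict
from typing import Dict, List

def pooled_accuracy(records: List[dict]) -> float:
    """Average correctness over all question episodes."""
    return sum(r["correct"] for r in records) / len(records)

def accuracy_by_knowledge_point(records: List[dict]) -> Dict[str, float]:
    """Average correctness grouped by the knowledge point each question probes."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["kp"]].append(r["correct"])
    return {kp: sum(vals) / len(vals) for kp, vals in grouped.items()}

# Hypothetical episodes: a static fact answered correctly eight times,
# and an updated fact answered incorrectly twice.
records = (
    [{"kp": "user_name", "correct": 1} for _ in range(8)]
    + [{"kp": "user_address_after_move", "correct": 0} for _ in range(2)]
)

pooled = pooled_accuracy(records)
per_kp = accuracy_by_knowledge_point(records)
```

Here the pooled score looks healthy while the one knowledge point that changed over time fails every time, which is the kind of failure an episode-level average hides.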
AI search grounded in Facebook posts? What could go wrong?
Meta’s new AI Mode in Facebook search aims to handle complex queries by pulling from public posts across Meta apps, including Facebook Groups and Instagram Reels. The piece raises the obvious risk: grounding can help contextual relevance, but it also creates new failure modes when the system gets details wrong or over-assumes what public content implies.
Beyond Parallel Sampling: Diverse Query Initialization for Agentic Search
The paper shows why “breadth scaling” in agentic search often stalls: parallel rollouts tend to issue redundant first queries, causing overlapping retrieval that limits downstream gains. With DivInit, a training-free intervention that selects diverse first-turn seeds from a single candidate generation, the authors report consistent improvements over standard parallel sampling on multi-hop QA at matched compute.
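One way to picture diverse first-query selection is a greedy farthest-point pick over a pool of candidate queries. The summary does not say how DivInit measures diversity, so the Jaccard distance over word sets, the greedy max-min strategy, and the function names below are all assumptions chosen for a self-contained sketch.

```python
from typing import List

def jaccard_distance(a: str, b: str) -> float:
    """1 minus the Jaccard similarity of the two queries' word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)

def select_diverse(candidates: List[str], k: int) -> List[str]:
    """Greedily pick k queries, each maximizing its minimum distance
    to the queries already selected (starting from the first candidate)."""
    if not candidates or k <= 0:
        return []
    selected = [candidates[0]]
    remaining = list(candidates[1:])
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda c: min(jaccard_distance(c, s) for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical candidate first-turn queries from one generation pass.
pool = [
    "who directed the film Inception",
    "who directed the movie Inception",
    "Inception release year",
    "Christopher Nolan birthplace",
]
seeds = select_diverse(pool, k=3)
```

In this example the near-duplicate second query is skipped, so three parallel rollouts would start from three distinct retrieval directions instead of two overlapping ones.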
A near-autonomous AI chemist improves a challenging reaction in medicinal chemistry
A near-autonomous AI chemist workflow used GPT-5.4 to improve a key medicinal chemistry reaction, moving beyond brainstorming into iterative, experiment-driven improvement. The takeaway is practical: with the right level of autonomy, LLM-powered systems can contribute to real lab progress on difficult transformations rather than stopping at literature suggestions.