AI News Daily Digest (26-07-04)

Midjourney Medical’s Ultrasound Scanner: The Promising Tech, the Missing Proof

Midjourney Medical shows a behind-the-scenes build of its dunk-tank ultrasound scanner setup, describing a hacked-together array of ultrasound probes plus off-the-shelf compute. The video is a rare look at the hardware pipeline, but it leaves open the big question: where are the results that prove reliability, accuracy, and clinical readiness at scale?

Diffusion Language Models for Interactive Radiology Report Drafting

DiffusionGemma-26B adapts diffusion-based text generation to medical visual question answering, and the authors find that diffusion can match or beat an autoregressive sibling trained with the same LoRA recipe. The bigger “radiology-friendly” leap is any-order infill: clinicians can pin the report fragments they want to keep and the model fills in the missing text between them, a capability that comes more naturally to diffusion than to next-token generation.

RLVR Proof of Concept for Tool-Use Agents on Atlassian Workflows

This work tests Reinforcement Learning with Verifiable Rewards (RLVR) directly against tool-call traces, using synthetic Jira REST v3 and Confluence v2 environments with schema-faithful checks. On scenarios where the reward signal is well-behaved, RL-trained policies jump from baseline rewards of roughly 0.35-0.92 to 0.95-1.00, highlighting a path to outcome-optimized small models without needing live APIs or human labeling.
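
As a rough illustration of what a verifiable reward over a tool-call trace could look like, the Python sketch below scores a trace against one expected call. The schema, the field names, the partial-credit value of 0.5, and the `score_trace` helper are all hypothetical; none of them are taken from the paper or from the real Jira REST API.

```python
from typing import Any

# Hypothetical, simplified schema for one tool: required argument names and types.
CREATE_ISSUE_SCHEMA = {"project_key": str, "summary": str, "issue_type": str}


def call_is_valid(call: dict[str, Any], schema: dict[str, type]) -> bool:
    """Return True if the call has exactly the schema's arguments, with matching types."""
    args = call.get("args", {})
    if set(args) != set(schema):
        return False
    return all(isinstance(args[name], typ) for name, typ in schema.items())


def score_trace(trace: list[dict[str, Any]], expected: dict[str, Any]) -> float:
    """Toy verifiable reward.

    1.0 if some schema-valid call matches the expected tool and arguments,
    0.5 if a schema-valid call hits the right tool with different arguments,
    0.0 otherwise (including an empty trace).
    """
    best = 0.0
    for call in trace:
        if call.get("tool") != expected["tool"]:
            continue
        if not call_is_valid(call, CREATE_ISSUE_SCHEMA):
            continue
        best = max(best, 1.0 if call["args"] == expected["args"] else 0.5)
    return best
```

Because the check is a pure function of the recorded trace, it can run without a live API or a human grader, which is the property the blurb highlights.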

Auto-FL-Research: Agentic Search for Federated Learning Algorithms

Auto-FL-Research (AFR) uses constrained coding agents to search over federated learning “recipes,” including aggregation rules, client update schedules, objectives, and model variants, within a fixed task profile. Experiments across healthcare FL benchmarks show gains on many tasks, but also reveal seed-sensitive failures and improvements that sometimes come from repeated, isolated mechanisms rather than general algorithmic breakthroughs.
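
To make "aggregation rule" concrete, here is a minimal Python sketch of sample-weighted averaging in the style of FedAvg, a common baseline in federated learning. It is not one of the recipes AFR discovers; the function name and the plain-list representation of model parameters are assumptions chosen for brevity.

```python
def fedavg(client_weights: list[list[float]], client_sizes: list[int]) -> list[float]:
    """Average client parameter vectors, weighted by each client's sample count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need at least one client and one size per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_weights[0])
    if any(len(w) != dim for w in client_weights):
        raise ValueError("all clients must send vectors of the same length")
    # Each coordinate is the size-weighted mean over clients.
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]
```

A search over recipes would treat a function like this as one swappable component, alongside the client update schedule and the training objective.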

Anthropic Wants to Develop Its Own Drugs

Anthropic’s “Claude Science” positions its AI workbench as a unified environment for scientific tools, datasets, and figure generation, aiming to accelerate discovery and the development of healthcare interventions. The company also signals an ambition to move beyond assistance into drug development itself, setting up a closer relationship between foundation models and biotech execution.

Wiola: A Fully Novel Small Language Model Architecture

Wiola claims a ground-up SLM design with no structural lineage to major families like GPT, LLaMA, or Mistral, combining five novel components for positioning, attention, and stability. Spiral Rotary Positional Encoding and gated cross-layer attention target long-range coherence, while adaptive token merging reduces attention cost and a modified RMSNorm helps prevent representation collapse.

World Feedback for Clinical Agents: Diagnosing RL in FHIR Environments

This paper audits MedAgentBench and finds a high “silent-finish” ceiling where the RL agent learns inaction because the evaluation fails to sufficiently penalize it. It introduces MedAgentBench-v3 with better task coverage and shows RL training is held back by capability and format-knowledge barriers, suggesting the fix is targeted SFT for code/format knowledge plus RL for conditional decision logic.
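
One way to read the "silent-finish" ceiling is as the score an agent earns by always finishing without acting. The Python sketch below computes that score for a toy task set, with and without a penalty for missed actions. The `Task` type, the `requires_action` flag, and the penalty scheme are hypothetical and are not MedAgentBench's actual scoring.

```python
from dataclasses import dataclass


@dataclass
class Task:
    requires_action: bool  # True if the correct behaviour includes acting


def silent_policy_score(tasks: list[Task], missed_action_penalty: float = 0.0) -> float:
    """Mean score of a policy that always finishes without acting.

    The policy gets 1 point for each task where doing nothing is correct and
    loses `missed_action_penalty` points for each task that needed an action.
    """
    if not tasks:
        raise ValueError("tasks must be non-empty")
    correct = sum(1 for t in tasks if not t.requires_action)
    missed = len(tasks) - correct
    return (correct - missed_action_penalty * missed) / len(tasks)
```

Under this toy scheme, a benchmark where 7 of 10 tasks need no action hands the silent policy a score of about 0.7 when missed actions cost nothing, and about 0.4 when each missed action costs a full point, which shows why weak penalties can make inaction an attractive policy to learn.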

When Service Agents Reconsider: Difficulty-Routed Control in Customer-Service Operations

The work addresses a core risk in autonomous customer-service agents: acting too confidently when instructions, policy constraints, and backend writes interact. It proposes a difficulty-routed control system that escalates only operationally coupled or conflict-heavy requests, improving reliability while keeping routine sessions fast and low-friction through targeted reconsideration before consequential writes.
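
A difficulty router of this kind might be sketched in Python as follows. The request fields, the thresholds, and the two route names are hypothetical stand-ins for illustration, not the paper's actual features or policy.

```python
from dataclasses import dataclass


@dataclass
class Request:
    writes_backend: bool   # would handling this request modify backend state?
    policy_conflicts: int  # number of policy constraints in tension
    coupled_systems: int   # number of backend systems the write touches


def route(req: Request, conflict_threshold: int = 1, coupling_threshold: int = 2) -> str:
    """Return "reconsider" for risky writes and "fast" for everything else."""
    # Read-only sessions never trigger the slower path.
    if not req.writes_backend:
        return "fast"
    # Escalate only when a consequential write is conflict-heavy or coupled.
    if req.policy_conflicts >= conflict_threshold or req.coupled_systems >= coupling_threshold:
        return "reconsider"
    return "fast"
```

The design choice worth noting is that the check runs before the write, so routine sessions pay no extra latency and only the risky minority gets a second pass.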

Agent4cs: A Multi-agent System for Code Summarization in Large Hierarchical Repositories

Agent4cs breaks code summarization into a bottom-up multi-agent pipeline that extracts keywords from subfolders, produces summaries from them, and then quality-assures the output for consistency across the repository hierarchy. In evaluations, it improves semantic consistency between folder levels and boosts normalized keyword coverage, tackling the common failure mode where single-model summarizers treat codebases as flat text.

PACE: A Neuro-Symbolic Framework for Feasibility-Aware Counterfactual Explanations

PACE separates prediction from reasoning so counterfactual recommendations must pass symbolic feasibility constraints rather than only flipping a classifier’s decision. Using a neural model plus Answer Set Programming (ASP) rules, it generates explanations that are more realistic and actionable under domain-specific intervention limits, improving the “plausibility vs. validity” trade-off common in explainable AI.
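
The idea of gating counterfactuals on feasibility can be sketched in plain Python, with hand-written rules standing in for the ASP program. The two constraint types (immutable and non-decreasing features), the L1 cost, and the helper names are hypothetical choices for illustration, not PACE's actual rule set.

```python
from typing import Callable, Optional

Instance = dict[str, float]


def feasible(original: Instance, candidate: Instance,
             immutable: set[str], non_decreasing: set[str]) -> bool:
    """Symbolic check: immutable features stay fixed, monotone features never drop."""
    if any(candidate[name] != original[name] for name in immutable):
        return False
    return all(candidate[name] >= original[name] for name in non_decreasing)


def pick_counterfactual(original: Instance, candidates: list[Instance],
                        predict: Callable[[Instance], int],
                        immutable: set[str], non_decreasing: set[str]) -> Optional[Instance]:
    """Return the cheapest candidate that flips a binary prediction AND is feasible."""
    target = 1 - predict(original)
    valid = [c for c in candidates
             if predict(c) == target and feasible(original, c, immutable, non_decreasing)]
    if not valid:
        return None
    # Cost is the total absolute change across features (L1 distance).
    return min(valid, key=lambda c: sum(abs(c[k] - original[k]) for k in original))
```

A candidate that flips the classifier by lowering someone's age would be valid in the classifier's sense but rejected here, which is the gap between validity and plausibility that the blurb describes.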

CreativityNeuro: Steering Model Weights to Improve Divergent Thinking

CreativityNeuro boosts divergent thinking in LLMs using data-free contrastive weight steering rather than retraining or gradient updates. Experiments report improvements on creativity benchmarks and show reduced mode collapse, with evidence that weight-space steering can transfer across tasks where activation steering doesn’t.

Procedural Memory Distillation: Online Reflection for Self-Improving Language Models

Procedural Memory Distillation (PMD) addresses a limitation of RLVR-style training: the model’s rich rollout experience often isn’t reused effectively across episodes. PMD distills cross-episode procedural signals into a hierarchical memory during training and then absorbs that knowledge into the policy weights, improving performance on scientific and coding benchmarks versus prior self-distillation approaches.
