Skill-Constrained Model Predictive Control for Resilient Manufacturing Supply Chains
This work tackles production planning when human certifications decay over time and training competes with production capacity, framing workforce capability as a real operational bottleneck. The authors benchmark a closed-loop skill-constrained model predictive controller against production-only, maintenance-only, and static “skill insurance” strategies in disruption-heavy SkillChain-Gym scenarios, finding predictive control wins when skill shortages are forecastable early—but no policy dominates under surprise shocks near demand-capacity boundaries.
Midjourney goes from generating cat images to full-body ultrasound scans
Midjourney CEO David Holz unveiled the company’s first medical hardware product: an ultrasound-based full-body scanner that captures internal “vertical slices” of anatomy using a ring of sensors. The company positions it as a route to high-quality body-composition and organ imaging, potentially at an annual or even daily cadence, and aims for MRI-comparable image quality in many cases.
Photoshop and Premiere now have AI assistants
Adobe is rolling out a public beta that embeds bespoke AI assistants into major Creative Cloud apps like Photoshop and Premiere, with each assistant designed to operate “as a specialist” inside its specific workflow. The assistants are powered by Adobe’s conversational creative-agent stack and are aimed at organizing work and automating tasks rather than just offering generic chat.
Who decides when AI is too dangerous?
An on-the-ground account of how AI safety, regulation, and geopolitics collide, centered on the US export controls that pulled Anthropic’s Claude Mythos/Fable family after a sudden national security scramble. The discussion frames the breakdown as a question of both technical risk (jailbreak claims) and governance mechanics (timelines, who gets consulted, and why compliance approaches ended up so blunt).
Is it agentic enough? Benchmarking open models on your own tooling
The Hugging Face team argues that “agentic” performance needs testing against the real tooling an agent will use—not just benchmarks that measure pure text quality. Their focus is practical: how to evaluate open models in setups that reflect permissions, environment constraints, and the actual operational loops agents must run to be considered truly agentic.
Distributed General-Purpose Agent Networks: Architecture, Key Mechanisms, and Prototypes
This paper proposes a blueprint for open peer-to-peer networks of heterogeneous agents that can discover each other, establish trust, and negotiate cooperation rules to execute long-horizon tasks. The authors highlight why agent networks aren’t “just P2P + multi-agent,” pushing a layered architecture with semantic declaration propagation, verifiable identities and multi-topic reputation, and mechanism design for open-ended execution.
SkillChain-Gym: A Benchmark for Reskilling-Aware Production-Inventory Control under Disruptions
SkillChain-Gym introduces a standardized testbed where workforce learning and forgetting are first-class citizens in production-inventory control, including certification thresholds and training actions that consume the same worker-hours as production. The benchmark’s seeded disruption framework and resilience metrics make it possible to compare policies that can train, maintain, or insure skills—and to study when adaptive training helps versus when lean static cross-training remains the safer bet.
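As a rough illustration of the trade-off this summary describes, the toy model below lets one worker split a fixed pool of hours between production and training. The multiplicative forgetting rate, the linear learning gain, the threshold value, and every name in the code are assumptions made for illustration; none of them are taken from SkillChain-Gym itself.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    skill: float              # current skill level in [0, 1] (hypothetical scale)
    threshold: float = 0.6    # hypothetical certification threshold

    @property
    def certified(self) -> bool:
        return self.skill >= self.threshold


def step(worker: Worker, hours: float, train_hours: float,
         forget_rate: float = 0.05, learn_gain: float = 0.02) -> float:
    """Advance one period and return the hours left for production.

    Training and production draw on the same pool of worker-hours, so every
    hour spent training is an hour not spent producing.
    """
    if not 0.0 <= train_hours <= hours:
        raise ValueError("train_hours must lie between 0 and hours")
    # Assumed dynamics: skill decays multiplicatively, then training adds back.
    decayed = worker.skill * (1.0 - forget_rate)
    worker.skill = min(1.0, decayed + learn_gain * train_hours)
    # In this toy model only a certified worker can produce.
    return (hours - train_hours) if worker.certified else 0.0


def simulate(train_hours: float, periods: int = 10) -> float:
    """Total production hours over several periods for a fixed training plan."""
    worker = Worker(skill=0.65)
    return sum(step(worker, hours=8.0, train_hours=train_hours)
               for _ in range(periods))
```

Under these assumed numbers, a plan with no training loses certification after the first period, while two hours of training per period keeps the worker certified throughout, so the plan that gives up hours to training ends up producing more.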
When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval
This method upgrades legal retrieval by letting an LLM agent iteratively generate and prune BM25-style query rewriting rules—without parameter training—based on experiment feedback. Tested on LeCaRD-v2, the self-evolving framework outperforms fixed baselines like human-designed rules and greedy selection, with analyses showing the agent learns which rule combinations to discard using prior validation results.
Nothing from Something: Can a Language Model Discover 0?
The research tests whether language-model generalization can go beyond training data into genuinely new mathematical structure—specifically, whether models can independently “discover” the concept of zero. Results show GPT-2-sized models can’t generalize to zero at test time under most conditions, but improve dramatically after training on a limited number of examples, with language pretraining effectively scaffolding discovery and halving sample requirements.
Beyond LoRA: Can you beat the most popular fine-tuning technique?
This post explores PEFT alternatives aimed at improving on LoRA’s fine-tuning efficiency while maintaining or boosting performance. The takeaway for practitioners: depending on model architecture and task, “beyond LoRA” strategies can deliver better adaptation quality under constrained compute and memory budgets.
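For context on the baseline being challenged: LoRA freezes a weight matrix W of shape d × k and learns a low-rank update BA, scaled by alpha / r, so only r(d + k) parameters are trained instead of dk. The sketch below counts those parameters and applies the update using plain Python lists; the function names and example dimensions are illustrative and do not come from the post.

```python
def full_params(d: int, k: int) -> int:
    """Parameters in a dense d x k weight matrix."""
    return d * k


def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Trainable parameters of a rank-r adapter: B is d x r, A is r x k."""
    return r * (d + k)


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def lora_effective_weight(W, B, A, alpha: float, r: int):
    """Return W + (alpha / r) * B @ A, the weight the adapted layer applies."""
    scale = alpha / r
    update = matmul(B, A)
    return [[w + scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W, update)]
```

For a 4096 × 4096 layer at rank 8, `lora_trainable_params(4096, 4096, 8)` is 65,536 against 16,777,216 for the dense matrix, which is the efficiency bar any alternative has to clear.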
Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty
Instead of judging reliability by output diversity, this work measures whether an LLM consistently ranks its own reasoning candidates—capturing instability and ambiguity directly in the reasoning process. The authors’ structural uncertainty framework uses self-preference-induced ranking distributions and shows a regime boundary: it helps identify unreliable logical/mathematical reasoning, while collapsing toward uniformity for factual retrieval tasks.
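One way to picture the "collapse toward uniformity" is to treat repeated self-rankings as samples from a distribution and measure its normalized entropy. The sketch below does this only for which candidate is ranked first; that choice of statistic, the function name, and the input format are assumptions for illustration and not the authors' exact framework.

```python
import math
from collections import Counter


def top_choice_entropy(rankings, num_candidates: int) -> float:
    """Normalized Shannon entropy (0 to 1) of the top-ranked candidate.

    rankings: list of rankings, each a list of candidate ids, best first.
    0 means the model always prefers the same candidate; 1 means its
    top choice is spread uniformly over all candidates.
    """
    if num_candidates < 2:
        raise ValueError("need at least two candidates")
    if not rankings:
        raise ValueError("need at least one ranking")
    counts = Counter(ranking[0] for ranking in rankings)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_candidates)
```

A stable ranker that always returns `[0, 1, 2]` scores 0, while one whose first choice rotates evenly across three candidates scores 1, which is the uniform regime the summary associates with factual retrieval tasks.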
Improving health intelligence in ChatGPT
OpenAI highlights improvements to how ChatGPT handles health and wellness questions, emphasizing clearer reasoning, better context use, and physician-informed evaluations. The update frames “health intelligence” as more than answers—pushing toward safer, more clinically grounded response behavior.
SpeechDx: A Multi-Task Benchmark for Clinical Speech AI
SpeechDx builds a broad, cross-dataset benchmark for clinical speech AI by organizing tasks around the stage of speech production affected (conceptualization, formulation, articulation) rather than isolated disease-only evaluations. The results are sobering: even strong large-scale baselines don’t generalize reliably across the clinical speech landscape, suggesting the field still lacks representations that transfer across conditions and datasets.
Amazon employees say they’re facing termination for backing data center limits
In Seattle City Council testimony, Amazon software engineers alleged that the company retaliated against them after they supported a municipal move to restrict data center growth. The dispute has escalated to claims of HR investigations and disciplinary action, pulling the politics of AI-adjacent infrastructure into workplace-risk and employment-law territory.
MemTrace: Probing What Final Accuracy Misses in Long-Term Memory
This benchmark targets a blind spot in long-term memory evaluation: scoring question accuracy independently can hide whether a system tracks how a user fact changes over time. MemTrace measures memory by knowledge point (the fact itself) and tests it across controlled dimensions like memory age and evidence conditions, concluding that failures often stem from evidence use—not just retrieval—so “more memory” isn’t automatically “better memory.”
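The gap between per-question and per-fact scoring can be shown with a small sketch. Here a knowledge point counts as tracked only if every question about it is answered correctly; that strict aggregation rule, the function names, and the sample data are assumptions for illustration and not MemTrace's actual metric.

```python
from collections import defaultdict


def question_accuracy(results) -> float:
    """Fraction of questions answered correctly, each scored independently."""
    return sum(ok for _, ok in results) / len(results)


def knowledge_point_accuracy(results) -> float:
    """Fraction of knowledge points whose questions were all answered correctly."""
    by_point = defaultdict(list)
    for point, ok in results:
        by_point[point].append(ok)
    return sum(all(answers) for answers in by_point.values()) / len(by_point)


# Hypothetical results: (knowledge point, answered correctly?)
results = [
    ("home_city", True), ("home_city", True), ("home_city", False),
    ("employer", True), ("employer", True), ("employer", True),
]
```

With this sample, five of six questions are correct, yet only one of the two facts is tracked consistently, which is the kind of failure that independent question scoring can hide.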