Oct 9, 2025
Dear Readers,
Welcome to the 19th edition of Fine-Tuned by Genloop! The past two weeks brought major developer-tool releases and efficiency breakthroughs. OpenAI launched AgentKit with visual workflow builders, Anthropic shipped Claude Sonnet 4.5 achieving 77.2% on SWE-bench Verified, and IBM released Granite 4.0, cutting memory usage by 70% with a hybrid Mamba architecture.
On the research front, we explore how reinforcement learning in pretraining builds reasoning foundations earlier, and why effective AI reasoning is more about structure than length.
Let's dive in!
🌟 AI Industry Highlights
OpenAI DevDay 2025: AgentKit Launch and Codex Updates
OpenAI released AgentKit and related developer tools at DevDay 2025, introducing visual workflow builders and new Codex features aimed at simplifying agent development for enterprises and developers.
AgentKit
Agent Builder: Visual canvas with drag-and-drop nodes for multi-agent workflows, supporting versioning, preview runs, and inline eval configuration—Ramp reduced iteration cycles by 70%
Connector Registry: Centralized admin panel for managing data sources (Dropbox, Google Drive, SharePoint, Microsoft Teams) across ChatGPT and API, with support for third-party MCPs
ChatKit & Enhanced Evals: Embeddable chat UIs for agents plus new evaluation features including datasets, trace grading, automated prompt optimization, and third-party model support
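For intuition on what trace grading involves, here's a minimal, hypothetical sketch of an eval loop. The names (run_agent, grade_trace) and the grading logic are our placeholders, not the actual AgentKit Evals API:

```python
# Hypothetical sketch of a trace-grading eval loop; run_agent and the
# toy grader are placeholders, not the actual AgentKit Evals API.
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    expected: str

def run_agent(prompt: str) -> dict:
    """Placeholder for an agent call that returns an answer plus its trace."""
    return {"answer": "42", "trace": ["plan", "call_tool", "answer"]}

def grade_trace(case: EvalCase, result: dict) -> float:
    """Toy grader: exact-match on the answer, small penalty for bloated traces."""
    score = 1.0 if result["answer"] == case.expected else 0.0
    if len(result["trace"]) > 10:  # discourage meandering agent runs
        score *= 0.5
    return score

dataset = [EvalCase("What is 6 * 7?", "42"), EvalCase("Capital of France?", "Paris")]
scores = [grade_trace(c, run_agent(c.prompt)) for c in dataset]
print(f"mean score: {sum(scores) / len(scores):.2f}")  # 0.50 with this stub agent
```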

Codex Updates
Slack Integration: Tag @Codex in channels to delegate tasks; it automatically gathers context and posts links to completed tasks in Codex cloud
Codex SDK: Embed the Codex agent in custom workflows with the TypeScript SDK (more languages coming); includes structured outputs and built-in context management for session resumption
Admin Tools: Environment controls, monitoring dashboards, and analytics for tracking usage across CLI, IDE, and web—GPT-5-Codex served 40 trillion tokens in the three weeks since launch
AgentKit's ChatKit and Evals are generally available; Agent Builder is in beta, with the Connector Registry rolling out to Enterprise customers with the Global Admin Console.

Anthropic Launches Claude Sonnet 4.5: 77.2% on SWE-bench with 30+ Hour Focus
Anthropic released Claude Sonnet 4.5, achieving the highest score yet recorded on SWE-bench Verified while maintaining focus on complex tasks for more than 30 hours straight.
Key highlights:
Record Coding Performance: 77.2% on SWE-bench Verified (highest ever achieved), 61.4% on OSWorld computer use benchmark (up from 42.2% four months ago), and 76.5% on Cybench cybersecurity tasks
30+ Hour Autonomous Work: Can maintain focus on multi-step tasks for more than 30 hours without losing context, enabled by a new checkpoints feature, context editing that reduces token use by 84%, and a memory tool for long-term knowledge storage
Most Aligned Model Yet: 60% improvement in alignment metrics with a 99.29% harmless response rate, a 30x reduction in scheming behavior, and dramatically reduced sycophancy—while maintaining the same $3/$15 per-million-token pricing as Sonnet 4
Major product updates include Claude Agent SDK (open-sourcing the infrastructure behind Claude Code), native VS Code extension, file creation for Excel/Word/PowerPoint directly in chat, and Chrome extension for Max users.
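For readers who want to try it, a minimal call through Anthropic's Python SDK looks like the sketch below. The model ID is our assumption (verify against Anthropic's docs), and the checkpoints, context-editing, and memory features are separate beta APIs not shown here:

```python
import anthropic

# Minimal sketch of calling Claude Sonnet 4.5 via Anthropic's Python SDK.
# The model ID below is our assumption; verify it against Anthropic's docs.
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Review this diff for regressions: ..."}],
)
print(message.content[0].text)
```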

IBM Granite 4.0: Open-Weight Models from 3B to 32B Parameters
IBM launched Granite 4.0 with a hybrid Mamba-transformer architecture that cuts memory requirements while maintaining competitive performance on cheaper GPUs.
Key highlights:
70% Memory Reduction: Uses over 70% less RAM than conventional transformers when handling long contexts and concurrent sessions, enabling deployment on cheaper hardware
Hybrid Architecture: Combines Mamba-2 layers with transformer blocks in a 9:1 ratio—compute scales linearly instead of quadratically with sequence length while memory stays constant
Strong Performance: Granite-4.0-H-Small tops the IFEval benchmark among open models (except Llama 4 Maverick, which is 12x larger) and is available under Apache 2.0 with ISO 42001 certification
Four model variants from 3B to 32B parameters have been released, with reasoning variants and Nano models coming by the end of 2025.
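Since the weights are open, the quickest way to poke at them is Hugging Face transformers. A minimal sketch, assuming the ibm-granite/granite-4.0-h-small checkpoint ID (check the model card for exact names):

```python
# Minimal sketch of loading Granite 4.0 with Hugging Face transformers.
# The checkpoint ID is an assumption based on IBM's naming; check the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.0-h-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "In one paragraph, explain why hybrid Mamba-transformer models use less memory."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```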

🔬 Research Corner
Check out the top papers of the week on LLM Research Hub. Each week, our AI agents scour the internet for the best research papers, evaluate their relevance, and our experts carefully curate the top selections.
Don't forget to follow us to stay up to date with our weekly research curation!
Now, let's deep dive into the top research from the last two weeks:
RLP: Reinforcement Learning Pretraining - Teaching Models to Think Earlier
NVIDIA Research introduces RLP (Reinforcement Learning Pretraining), moving reinforcement learning from post-training into pretraining so models learn step-by-step reasoning much earlier in their development.
Key findings:
Information-Gain Pretraining: Treats chain-of-thought as an internal action at each token prediction: the model samples a short thought and is rewarded by how much extra information that thought adds to next-token prediction—dense feedback at every step, with no need for human judges or verifiers (sketched below)
Stable Training Mechanics: Only adjusts the special "thought tokens" (not all text predictions), uses clipping to limit update sizes, samples multiple thoughts per step for balanced feedback, and employs a slower-moving baseline model as a steady reference to prevent reward exploitation
Scalability and Efficiency: Lifts Qwen3-1.7B-Base by 19% over the base model on math/science benchmarks; on Nemotron-Nano-12B-v2, accuracy jumps from 42.81% to 61.32% (a 35% relative improvement) using just 0.125% of the baseline's data—200B fewer tokens
This work demonstrates that reinforcement principles integrated early can establish reasoning foundations without requiring curated datasets or massive computational overhead.
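The core reward is easy to state: score the true next token with and without the sampled thought in context, and reward the difference. A toy sketch (the stub scorer below is ours, standing in for the LM's actual log-probabilities):

```python
import math

# Toy paraphrase of RLP's information-gain reward, not NVIDIA's code.
# log_prob is a stub scorer standing in for the LM's own log-probabilities.

def log_prob(next_token: str, context: str) -> float:
    """Stub: pretend the model finds next_token likelier if the context mentions it."""
    return math.log(0.40 if next_token in context else 0.05)

def information_gain_reward(context: str, thought: str, next_token: str) -> float:
    # Reward = extra log-likelihood the sampled thought buys for predicting
    # the true next token, versus predicting it with no thought at all.
    return log_prob(next_token, context + " " + thought) - log_prob(next_token, context)

ctx = "Q: What is 12 * 12?  A:"
thought = "12 squared is 144, so the answer is 144."
print(information_gain_reward(ctx, thought, "144"))  # positive: the thought helped
```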
Read Our TuesdayPaperThoughts analysis

What Characterizes Effective Reasoning? Structure Over Length
Meta Superintelligence Labs and NYU challenge the assumption that longer, more detailed reasoning chains produce better answers, finding that structure matters more than length in AI reasoning.
Key findings:
More Thinking Isn't Always Smarter: A systematic analysis of 10 models on math and scientific reasoning shows that longer reasoning and added review steps often reduce accuracy—within the same question, shorter CoTs and lower review ratios consistently correlate with higher correctness across difficulty levels
Failed-Step Fraction (FSF): Mapping reasoning as branching graphs, the researchers introduce FSF—the proportion of abandoned reasoning branches—which emerges as the strongest predictor of success, with consistent correlations across all models and datasets (see the sketch below)
Causal Evidence Through Intervention: Ranking 64 candidate solutions by FSF improves accuracy by 5-13% on AIME 2025; surgically removing failed branches from incorrect CoTs boosts continuation accuracy by 8-14%, revealing that failed attempts bias subsequent reasoning even after backtracking
The takeaway: effective reasoning isn't about being long-winded, it's about failing less and staying on track.
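To make the metric concrete, here's a toy sketch of computing FSF over a labeled reasoning trace and using it to rank candidates. The flat step encoding is our assumption; the paper maps CoTs into richer branching graphs:

```python
# Toy sketch of Failed-Step Fraction (FSF): the share of reasoning steps that
# sit on abandoned branches. The flat step encoding is our assumption; the
# paper maps chains of thought into richer branching graphs.

def failed_step_fraction(steps: list[dict]) -> float:
    """steps: one chain of thought as [{'text': ..., 'abandoned': bool}, ...]."""
    if not steps:
        return 0.0
    return sum(s["abandoned"] for s in steps) / len(steps)

trace = [
    {"text": "set up the equation", "abandoned": False},
    {"text": "try casework on parity", "abandoned": True},   # dead end, backtracked
    {"text": "try a telescoping sum", "abandoned": True},    # second dead end
    {"text": "factor and solve directly", "abandoned": False},
]
print(f"FSF = {failed_step_fraction(trace):.2f}")  # 0.50

# Test-time selection: among candidate solutions, prefer the lowest FSF.
candidates = {"solution_a": 0.10, "solution_b": 0.45}
print("pick:", min(candidates, key=candidates.get))  # solution_a
```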
Read Our TuesdayPaperThoughts analysis

Looking Forward
This fortnight's releases highlight a clear trend—AI infrastructure is maturing fast. Visual builders are replacing hand-rolled orchestration code, hybrid architectures are dramatically cutting compute costs, and context management is making long-running agent work affordable. The tools are finally catching up to the ambition.
What strikes us most is how these efficiency gains unlock entirely new use cases. When context editing cuts token use by 84% and capable models run on cheaper GPUs, AI deployment stops being a luxury reserved for well-funded teams. That's when things get interesting.
