OpenAI Launches AgentKit, while Anthropic releases Claude Sonnet 4.5

Oct 9, 2025

Dear Readers,

Welcome to the 19th edition of Fine-Tuned by Genloop! The past two weeks brought major developer-tool releases and efficiency breakthroughs. OpenAI launched AgentKit with visual workflow builders, Anthropic shipped Claude Sonnet 4.5 achieving 77.2% on SWE-bench, and IBM released Granite 4.0, cutting memory usage by 70% with a hybrid Mamba architecture.

On the research front, we explore how reinforcement learning in pretraining builds reasoning foundations earlier, and why effective AI reasoning is more about structure than length.

Let's dive in!

🌟 AI Industry Highlights

OpenAI DevDay 2025: AgentKit Launch and Codex Updates

OpenAI released AgentKit and related developer tools at DevDay 2025, introducing visual workflow builders, new integration options, and Codex updates aimed at simplifying agent development for enterprises and developers.

AgentKit

  • Agent Builder: Visual canvas with drag-and-drop nodes for multi-agent workflows, supporting versioning, preview runs, and inline eval configuration—Ramp reduced iteration cycles by 70%

  • Connector Registry: Centralized admin panel for managing data sources (Dropbox, Google Drive, SharePoint, Microsoft Teams) across ChatGPT and API, with support for third-party MCPs

  • ChatKit & Enhanced Evals: Embeddable chat UIs for agents plus new evaluation features including datasets, trace grading, automated prompt optimization, and third-party model support

Codex Updates

  • Slack Integration: Tag @Codex in channels to delegate tasks; Codex automatically gathers context and posts links to completed tasks in Codex cloud

  • Codex SDK: Embed Codex agent in custom workflows with TypeScript SDK (more languages coming), includes structured outputs and built-in context management for session resumption

  • Admin Tools: Environment controls, monitoring dashboards, and analytics for tracking usage across CLI, IDE, and web—GPT-5-Codex served 40 trillion tokens in three weeks since launch

AgentKit's ChatKit and Evals are generally available; Agent Builder is in beta with Connector Registry rolling out to Enterprise customers with Global Admin Console.

Anthropic Launches Claude Sonnet 4.5: 77.2% on SWE-bench with 30+ Hour Focus

Anthropic released Claude Sonnet 4.5, achieving the highest score ever on SWE-bench Verified while maintaining focus on complex tasks for over 30 hours straight.

Key highlights:

  • Record Coding Performance: 77.2% on SWE-bench Verified (highest ever achieved), 61.4% on OSWorld computer use benchmark (up from 42.2% four months ago), and 76.5% on Cybench cybersecurity tasks

  • 30+ Hour Autonomous Work: Can maintain focus on multi-step tasks for more than 30 hours without losing context, enabled by a new checkpoints feature, context editing that reduces token use by 84%, and a memory tool for long-term knowledge storage

  • Most Aligned Model Yet: 60% improvement in alignment metrics with 99.29% harmless response rate, 30x reduction in scheming behavior, and dramatically reduced sycophancy—while maintaining same $3/$15 per million token pricing as Sonnet 4

Major product updates include Claude Agent SDK (open-sourcing the infrastructure behind Claude Code), native VS Code extension, file creation for Excel/Word/PowerPoint directly in chat, and Chrome extension for Max users.

Learn more

IBM Granite 4.0: Open-Weight Models from 3B to 32B Parameters

IBM launched Granite 4.0 with a hybrid Mamba-transformer architecture that cuts memory requirements while maintaining competitive performance on cheaper GPUs.

Key highlights:

  • 70% Memory Reduction: Uses over 70% less RAM than conventional transformers when handling long contexts and concurrent sessions, enabling deployment on cheaper hardware

  • Hybrid Architecture: Combines Mamba-2 layers with transformer blocks in 9:1 ratio—scales linearly instead of quadratically with sequence length while keeping memory constant

  • Strong Performance: Granite-4.0-H-Small tops the IFEval benchmark among open models (trailing only Llama 4 Maverick, which is 12x larger), available under Apache 2.0 with ISO 42001 certification

Four model variants from 3B to 32B parameters were released, with reasoning variants and Nano models coming by end of 2025.
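The 9:1 hybrid layout described above can be sketched as a simple layer schedule. This is an illustrative sketch only (layer names and the interleaving pattern are our assumptions, not IBM's published implementation): nine linear-scaling Mamba-2 layers for every one quadratic attention layer.

```python
# Illustrative sketch of a 9:1 hybrid Mamba-2/transformer stack.
# Layer names and placement are assumptions for illustration, not
# IBM's actual Granite 4.0 implementation.

def hybrid_stack(num_layers: int, ratio: int = 9) -> list[str]:
    """Return a layer schedule with `ratio` Mamba-2 layers per attention layer."""
    layers = []
    for i in range(num_layers):
        # every (ratio + 1)-th layer is a transformer block, the rest Mamba-2
        layers.append("attention" if (i + 1) % (ratio + 1) == 0 else "mamba2")
    return layers

schedule = hybrid_stack(20)
print(schedule.count("mamba2"), schedule.count("attention"))  # 18 2
```

Because most layers are Mamba-2 state-space layers, compute grows linearly with sequence length and the recurrent state stays constant in size, which is where the memory savings on long contexts come from.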

Learn more

🔬 Research Corner

Check out the top papers of the week on LLM Research Hub. Each week, our AI agents scour the internet for the best research papers, evaluate their relevance, and our experts carefully curate the top selections.

Don't forget to follow us to stay up to date with our weekly research curation!

Now, let's deep dive into the top research from the last two weeks:

RLP: Reinforcement Learning Pretraining - Teaching Models to Think Earlier

NVIDIA Research introduces RLP (Reinforcement Learning Pretraining), moving reinforcement learning from post-training into pretraining to teach models step-by-step reasoning much earlier in their development.

Key findings:

  • Information-Gain Pretraining: Treats chain-of-thought as internal action at each token prediction, sampling short thoughts and measuring reward by how much extra information the thought adds to next word prediction—eliminating need for human judges or verifiers with dense feedback at every step

  • Stable Training Mechanics: Only adjusts special "thought tokens" (not all text predictions), uses clipping to limit adjustment sizes, samples multiple thoughts per step for balanced feedback, and employs slower-moving baseline model as steady reference to prevent reward exploitation

  • Scalability and Efficiency: Lifts Qwen3-1.7B-Base by 19% vs base on math/science benchmarks; on Nemotron-Nano-12B-v2, accuracy jumps from 42.81% to 61.32% (35% relative improvement) using just 0.125% of baseline's data—200B fewer tokens

This work demonstrates that reinforcement principles integrated early can establish reasoning foundations without requiring curated datasets or massive computational overhead.

Read Our TuesdayPaperThoughts analysis

What Characterizes Effective Reasoning? Structure Over Length

Meta Superintelligence Labs and NYU challenge the assumption that longer, more detailed reasoning chains produce better answers, finding that structure matters more than length in AI reasoning.

Key findings:

  • More Thinking Isn't Always Smarter: Systematic analysis across 10 models on math and scientific reasoning shows that longer reasoning or adding review steps often reduces accuracy—within the same question, shorter CoTs and lower review ratios consistently correlate with higher correctness across difficulty levels

  • Failed-Step Fraction (FSF): Mapping reasoning as branching graphs, researchers introduce FSF—the proportion of abandoned reasoning branches—which emerges as the strongest predictor of success, showing consistent correlations across all models and datasets

  • Causal Evidence Through Intervention: Ranking 64 candidate solutions by FSF improves accuracy by 5-13% on AIME 2025; surgically removing failed branches from incorrect CoTs boosts continuation accuracy by 8-14%, revealing that failed attempts bias subsequent reasoning even after backtracking

The takeaway: effective reasoning isn't about being long-winded, it's about failing less and staying on track.
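The Failed-Step Fraction metric lends itself to a tiny sketch. Assuming (our simplification) that a reasoning trace has already been parsed into branches labeled abandoned or kept, FSF is just the abandoned share, and candidates can be ranked by preferring lower values:

```python
# Hypothetical sketch of the Failed-Step Fraction (FSF): treat a
# chain-of-thought as a branching graph and compute the fraction of
# branches that were abandoned (backtracked from). Branch labels here
# are assumed inputs; the paper derives them from the reasoning graph.

def failed_step_fraction(branches: list[bool]) -> float:
    """branches[i] is True if branch i was abandoned before the final answer."""
    if not branches:
        return 0.0
    return sum(branches) / len(branches)

# A trace with 2 abandoned branches out of 5 explored:
fsf = failed_step_fraction([False, True, False, True, False])
print(fsf)  # 0.4

# Ranking candidate solutions: lower FSF first, mirroring the paper's
# test-time selection intervention.
candidates = {"A": [True, True, False], "B": [False, False, False]}
best = min(candidates, key=lambda k: failed_step_fraction(candidates[k]))
print(best)  # B
```

Selecting the candidate with the lowest FSF is what yields the reported 5-13% accuracy gain on AIME 2025.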

Read Our TuesdayPaperThoughts analysis


Looking Forward

This fortnight's releases highlight a clear trend: AI infrastructure is maturing fast. We're seeing visual builders replace complex code orchestration, hybrid architectures dramatically cut compute costs, and context-management features keep agents on task for 30+ hours. The tools are finally catching up to the ambition.

What strikes us most is how these efficiency gains unlock entirely new use cases. When a model can work autonomously for more than a day, or run capably on cheaper GPUs, AI deployment stops being a luxury reserved for well-funded teams. That's when things get interesting.

Ready to Elevate Your Business with Personalized LLMs?

Santa Clara, California, United States 95051

© 2025 Genloop™. All Rights Reserved.
