🤖 AI Agents Weekly: Code World Model, Gemini Robotics-ER 1.5, Figma MCP server, Overhearing LLM Agents, Qwen3-Max, Gamma API

🤖 AI Agents Weekly: Code World Model, Gemini Robotics-ER 1.5, Figma MCP server, Overhearing LLM Agents, Qwen3-Max, Gamma API

Code World Model, Gemini Robotics-ER 1.5, Figma MCP server, Overhearing LLM Agents, Qwen3-Max, Gamma API

Elvis Saravia's NLP Blog
platform
GitHub Copilot CLIがリリース

GitHub Copilot CLIがリリース

2025年9月25日、GitHubが「GitHub Copilot CLI」をパブリックプレビューとして公開しました。 GitHub Copilot CLI is now in public preview - GitHub ChangelogGitHub Copilot CLI is now in public preview We’re bringing the power of GitHub Copilot coding agent directly to your terminal. With GitHub Copilot CLI, you can work locally and…The GitHub BlogAllison

Lai.so Blog
api tool
Chrome DevTools MCP で AI エージェントのフロントエンド開発をサポートする

Chrome DevTools MCP で AI エージェントのフロントエンド開発をサポートする

自律的な AI エージェントを利用したコーディングでは、生成したコードを実行した結果からフィードバックを得て、コードを改善していく反復的なプロセスが重要です。しかし、フロントエンド開発では、生成したコードはブラウザ上で実行されるため、AI エージェントが直接コードを実行したり、ブラウザのコンソールログを取得したりすることは困難です。Chrome DevTools MCP はこの課題を解決するためのツールです。

azukiazusa のテックブログ2
api tool
No Image

ForcedLeak: AI Agent risks exposed in Salesforce AgentForce

Classic lethal trifecta image exfiltration bug reported against Salesforce AgentForce by Sasi Levi and Noma Security. Here the malicious instructions come in via the Salesforce Web-to-Lead feature. When a Salesforce …

Simon Willison's Blog
api cloud security
No Image

How to stop AI’s “lethal trifecta”

This is the second mention of the lethal trifecta in the Economist in just the last week! Their earlier coverage was Why AI systems may never be secure on September …

Simon Willison's Blog
security
YANS2025 参加報告

YANS2025 参加報告

AI ShiftのTECH BLOGです。AI技術の情報や活用方法などをご案内いたします。

AI-Shift Tech Blog
platform
Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G

Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G

<h2><a id="introduction" class="anchor" href="#introduction" aria-hidden="true"><svg aria-hidden="true" class="octicon octicon-link" height="16" version="1.1...

LMSYS Blog
framework tool
No Image

GitHub Copilot CLI is now in public preview

GitHub now have their own entry in the coding terminal CLI agent space: Copilot CLI. It's the same basic shape as Claude Code, Codex CLI, Gemini CLI and a growing …

Simon Willison's Blog
api tool
Improved Gemini 2.5 Flash and Flash-Lite

Improved Gemini 2.5 Flash and Flash-Lite

Two new preview models from Google - updates to their fast and inexpensive Flash and Flash Lite families: The latest version of Gemini 2.5 Flash-Lite was trained and built based …

Simon Willison's Blog
api tool
No Image

Don't hide your best documentation

If you hide the system prompt and tool descriptions for your LLM agent, what you're actually doing is deliberately hiding the most useful documentation describing your service from your most …

Simon Willison's Blog
platform
Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part II): 3.8x Prefill, 4.8x Decode Throughput

Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part II): 3.8x Prefill, 4.8x Decode Throughput

<p>The GB200 NVL72 is one of the most powerful hardware for deep learning. In this blog post, we share our progress to optimize the inference performance of ...

LMSYS Blog
library tool
No Image

Quoting Stanford CS221 Autumn 2025

[2 points] Learn basic NumPy operations with an AI tutor! Use an AI chatbot (e.g., ChatGPT, Claude, Gemini, or Stanford AI Playground) to teach yourself how to do basic vector …

Simon Willison's Blog
tool
No Image

Cross-Agent Privilege Escalation: When Agents Free Each Other

Here's a clever new form of AI exploit from Johann Rehberger, who has coined the term Cross-Agent Privilege Escalation to describe an attack where multiple coding agents - GitHub Copilot …

Simon Willison's Blog
security
6 easy ways to level up Claude Code

6 easy ways to level up Claude Code

Walk through six tips and tricks that help you level up Claude Code to move beyond simply entering prompts into a text box.

logrocket-dev
api tool
GPT-5-Codex

GPT-5-Codex

OpenAI half-relased this model earlier this month, adding it to their Codex CLI tool but not their API. Today they've fixed that - the new model can now be accessed …

Simon Willison's Blog
api library tool
No Image

Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action

I've been looking forward to this. Qwen 2.5 VL is one of the best available open weight vision LLMs, so I had high hopes for Qwen 3's vision models. Firstly, …

Simon Willison's Blog
platform
YAML ファイルで AI エージェントを構築する cagent

YAML ファイルで AI エージェントを構築する cagent

cagent は Docker 社が開発した AI エージェントフレームワークです。YAML ファイルでエージェントの振る舞い・役割・使用するツールを宣言的に定義でき、コードを 1 行も書かずにエージェントを構築できます。この記事では cagent の概要とインストール方法、YAML ファイルの書き方、実際にエージェントを動作させるまでの手順を解説します。

azukiazusa のテックブログ2
api tool
No Image

Why AI systems might never be secure

The Economist have a new piece out about LLM security, with this headline and subtitle: Why AI systems might never be secure A “lethal trifecta” of conditions opens them to …

Simon Willison's Blog
security
No Image

Quoting Kate Niederhoffer, Gabriella Rosen Kellerman, Angela Lee, Alex Liebscher, Kristina Rapuano and Jeffrey T. Hancock

We define workslop as AI generated work content that masquerades as good work, but lacks the substance to meaningfully advance a given task. Here’s how this happens. As AI tools …

Simon Willison's Blog
tool
Four new releases from Qwen

Four new releases from Qwen

It's been an extremely busy day for team Qwen. Within the last 24 hours (all links to Twitter, which seems to be their preferred platform for these announcements): Qwen3-Next-80B-A3B-Instruct-FP8 and …

Simon Willison's Blog
library tool
CompileBench: Can AI Compile 22-year-old Code?

CompileBench: Can AI Compile 22-year-old Code?

Interesting new LLM benchmark from Piotr Grabowski and Piotr Migdał: how well can different models handle compilation challenges such as cross-compiling gucr for ARM64 architecture? This is one of my …

Simon Willison's Blog
api tool
No Image

ChatGPT Is Blowing Up Marriages as Spouses Use AI to Attack Their Partners

Maggie Harrison Dupré for Futurism. It turns out having an always-available "marriage therapist" with a sycophantic instinct to always take your side is catastrophic for relationships. The tension in the …

Simon Willison's Blog
platform
Enabling Deterministic Inference for SGLang

Enabling Deterministic Inference for SGLang

<p>This post highlights our initial efforts to achieve deterministic inference in SGLang. By integrating batch invariant kernels released by Thinking Machine...

LMSYS Blog
api tool
No Image

Locally AI

Handy new iOS app by Adrien Grondin for running local LLMs on your phone. It just added support for the new iOS 26 Apple Foundation model, so you can install …

Simon Willison's Blog
mobile
🥇Top AI Papers of the Week

🥇Top AI Papers of the Week

The Top AI Papers of the Week (September 15-21)

Elvis Saravia's NLP Blog
platform
GPT‑5 Codexがリリース

GPT‑5 Codexがリリース

OpenAIが2025年9月15日にGPT‑5 Codexを発表しました。GPT‑5 CodexはGPT‑5を土台にして、エージェントのコーディング能力に適した学習と強化が加えられたモデルです。長時間の自律的な作業に特に強みがあります。 We’re releasing new Codex features to make it a more effective coding collaborator: - A new IDE extension - Easily move tasks between the cloud and your local environment - Code reviews in GitHub - Revamped Codex CLI Powered by

Lai.so Blog
api tool
No Image

llm-openrouter 0.5

New release of my LLM plugin for accessing models made available via OpenRouter. The release notes in full: Support for tool calling. Thanks, James Sanford. #43 Support for reasoning options, …

Simon Willison's Blog
api tool
Optimizing FP4 Mixed-Precision Inference on AMD GPUs

Optimizing FP4 Mixed-Precision Inference on AMD GPUs

<p>Haohui Mai (CausalFlow.ai), Lei Zhang (AMD)</p> <h2><a id="introduction" class="anchor" href="#introduction" aria-hidden="true"><svg aria-hidden="true" cl...

LMSYS Blog
library tool
Grok 4 Fast

Grok 4 Fast

New hosted vision-enabled reasoning model from xAI that's designed to be fast and extremely competitive on price. It has a 2 million token context window and "was trained end-to-end with …

Simon Willison's Blog
tool
🤖 AI Agents Weekly: GPT-5-Codex, Grok 4 Fast, Tongyi DeepResearch, Magistral Small 1.2, Agent Payments Protocol (AP2)

🤖 AI Agents Weekly: GPT-5-Codex, Grok 4 Fast, Tongyi DeepResearch, Magistral Small 1.2, Agent Payments Protocol (AP2)

GPT-5-Codex, Grok 4 Fast, Tongyi DeepResearch, Magistral Small 1.2, Agent Payments Protocol (AP2)

Elvis Saravia's NLP Blog
api tool
AI エージェントのための Agent Payments Protocol (AP2) を試してみた

AI エージェントのための Agent Payments Protocol (AP2) を試してみた

現状の決済システムでは人間が信頼できる画面上で直接購入ボタンをクリックすることを前提としており、自立型の AI エージェントがユーザーに代わって決済することは想定されていません。そこで Google により Agent Payments Protocol (AP2) と呼ばれる新しいプロトコルが提案されました。プラットフォーム間でエージェント主導の決済を安全に開始・処理することを可能にします。この記事では AP2 のサンプルコードを実際に試してみた手順を紹介します。

azukiazusa のテックブログ2
api security tool
No Image

Magistral 1.2

Mistral quietly released two new models yesterday: Magistral Small 1.2 (Apache 2.0, 96.1 GB on Hugging Face) and Magistral Medium 1.2 (not open weights same as Mistral's other "medium" models.) …

Simon Willison's Blog
platform
No Image

The Hidden Risk in Notion 3.0 AI Agents: Web Search Tool Abuse for Data Exfiltration

Abi Raghuram reports that Notion 3.0, released yesterday, introduces new prompt injection data exfiltration vulnerabilities thanks to enabling lethal trifecta attacks. Abi's attack involves a PDF with hidden text (white …

Simon Willison's Blog
security
Environment-aware model routing: Build smarter AI apps with AI SDK

Environment-aware model routing: Build smarter AI apps with AI SDK

Discover a handy pattern for routing LLM calls in an “environment-aware” manner, using AI SDK’s middleware.

logrocket-dev
api tool
No Image

Quoting Steve Jobs

Well, the types of computers we have today are tools. They’re responders: you ask a computer to do something and it will do it. The next stage is going to …

Simon Willison's Blog
tool
I think "agent" may finally have a widely enough agreed upon definition to be useful jargon now

I think "agent" may finally have a widely enough agreed upon definition to be useful jargon now

I’ve noticed something interesting over the past few weeks: I’ve started using the term “agent” in conversations where I don’t feel the need to then define it, roll my eyes …

Simon Willison's Blog
platform
No Image

Anthropic: A postmortem of three recent issues

Anthropic had a very bad month in terms of model reliability: Between August and early September, three infrastructure bugs intermittently degraded Claude's response quality. We've now resolved these issues and …

Simon Willison's Blog
platform
No Image

ICPC medals for OpenAI and Gemini

In July it was the International Math Olympiad (OpenAI, Gemini), today it's the International Collegiate Programming Contest (ICPC). Once again, both OpenAI and Gemini competed with models that achieved Gold …

Simon Willison's Blog
platform
How to stop your AI agents from hallucinating: A guide to n8n’s Eval Node

How to stop your AI agents from hallucinating: A guide to n8n’s Eval Node

Walk through a practical example of n8n's Eval feature, which helps developers reduce hallucinations and increase reliability of AI products.

logrocket-dev
tool
No Image

Announcing the 2025 PSF Board Election Results!

I'm happy to share that I've been re-elected for second term on the board of directors of the Python Software Foundation. Jannis Leidel was also re-elected and Abigail Dogbe and …

Simon Willison's Blog
tool
Let’s kill vibe coding and bring back prompt engineering

Let’s kill vibe coding and bring back prompt engineering

Vibe coding is trending, but is it sustainable? Explore why prompt engineering still matters for building reliable, high-quality AI apps.

logrocket-dev
tool
openai/codex でのプロジェクト固有MCPを設定する

openai/codex でのプロジェクト固有MCPを設定する

この記事では、OpenAIのCodexを使用してプロジェクト固有のMCP(Model Context Protocol)を設定する方法について説明しています。CodexはグローバルにMCPを設定することしかできないため、プロジェクトごとに独立した設定が必要です。2つの手段が提案されており、1つ目は環境変数を使用して読み込みディレクトリを変更し、プロジェクト固有の設定をロードする方法です。しかし、この方法では認証情報が含まれるため、普段使いには適していません。2つ目の手段は、Codexのコマンドラインオプションを使用して直接TOML設定をロードする方法で、こちらの方が安全です。具体的なコマンドや設定例も示されており、実装に関する注意点も記載されています。 • CodexはグローバルにMCPを設定するが、プロジェクトごとに独立した設定が必要な場合がある。 • 手段1では環境変数を使用してプロジェクト固有のMCP設定をロードできるが、認証情報が含まれるため普段使いには不向き。 • 手段2では--configオプションを使用して直接TOMLをロードする方法があり、こちらが安全とされる。 • 具体的なコマンドや設定例が示されており、実装方法が詳細に説明されている。 • JSONからTOMLへの変換に関する注意点も記載されている。

Zenn mizchi
api tool
GPT‑5-Codex and upgrades to Codex

GPT‑5-Codex and upgrades to Codex

OpenAI half-released a new model today: GPT‑5-Codex, a fine-tuned GPT-5 variant explicitly designed for their various AI-assisted programming tools. I say half-released because it's not yet available via their API, …

Simon Willison's Blog
api library tool
No Image

Models can prompt now

Here's an interesting example of models incrementally improving over time: I am finding that today's leading models are competent at writing prompts for themselves and each other. A year ago …

Simon Willison's Blog
platform
🥇Top AI Papers of the Week

🥇Top AI Papers of the Week

The Top AI Papers of the Week (September 8-14)

Elvis Saravia's NLP Blog
platform
メインブラウザをEdgeに切り替えた理由とAIブラウザの可能性

メインブラウザをEdgeに切り替えた理由とAIブラウザの可能性

ChromeからEdgeに乗り換え 最近、筆者はAI統合型のブラウザを常用するべくメインブラウザをGoogle ChromeからMicrosoft Edgeに切り替えました。EdgeのCopilot Modeは8月にGPT-5が搭載され、かなり使い勝手が良くなりました。2年前にこの前哨戦となる「Bing AIチャットをデフォルトのウェブ検索にして使ってみた」を投稿したのですが、当時と比べると雲泥の差です。 この記事では、筆者がEdgeへの移行を検討するに至った背景や、実際の使用感について整理しました。また、AIブラウザの台頭に伴い、セキュリティ面での新たなリスクについても考えることになったのでそれを喚起します。 移行の動機 筆者がメインブラウザをChromeからEdgeに移行した最大の理由は、AI統合型のウェブブラウジングを日常にしたかったからでした。実は2年前にもプログラミングにAI機能を使いたいという理由で、エディタをJetBrainsから強制的にVSCode/Cursorに移行した経験があり、それを思い出します。 現在、ブラウザやOSとLLMの統合は急速に進んでいます

Lai.so Blog
tool ui
🤖 AI Agents Weekly: Agent 3, ChatGPT Developer Mode, MCP Registry, Writing Effective Tools for Agents, Qodo Aware

🤖 AI Agents Weekly: Agent 3, ChatGPT Developer Mode, MCP Registry, Writing Effective Tools for Agents, Qodo Aware

Agent 3, ChatGPT Developer Mode, MCP Registry, Writing Effective Tools for Agents, Qodo Aware

Elvis Saravia's NLP Blog
library tool
自然言語で CI/CD パイプラインを定義する Agentic Workflows

自然言語で CI/CD パイプラインを定義する Agentic Workflows

Agentic Workflows は自然言語で CI/CD パイプラインを定義できるツールとして GitHub Next が開発しています。自然言語で定義されたワークフローは GitHub CLI の拡張機能として提供される gh aw コマンドでコンパイルして実行できます。これは継続体なAI(Continuous AI)を実現するためのツールです。

azukiazusa のテックブログ2
api tool
No Image

gpt-5 and gpt-5-mini rate limit updates

OpenAI have increased the rate limits for their two main GPT-5 models. These look significant: gpt-5 Tier 1: 30K → 500K TPM (1.5M batch) Tier 2: 450K → 1M (3M …

Simon Willison's Blog
api
No Image

Quoting Matt Webb

The trick with Claude Code is to give it large, but not too large, extremely well defined problems. (If the problems are too large then you are now vibe coding… …

Simon Willison's Blog
platform
今週の話題:Claudeの劣化問題の修正、Claude Code API差し替え、sonoma-alpha

今週の話題:Claudeの劣化問題の修正、Claude Code API差し替え、sonoma-alpha

AnthropicがClaudeの性能劣化に対応 Anthropicが公式に、8月からコミュニティで報告されていたClaude Sonnetの性能劣化を修正したと発表しました。原因は推論スタックのインフラ層にあり、独立したバグによるものであり「モデル本体の意図的な性能ダウン」や「需要対策によるダウングレード」は否定されています。 Model output qualityAnthropic’s Status Page - Model output quality.Model output quality 発表には、2025年8月下旬〜9月初旬にかけてSonnet 4系で品質劣化(degraded output quality)が発生し、8月5日〜9月4日には少数のSonnet 4.0リクエストに出力品質の低下が見られたという記載があります。Opus 4.1にはいまだ未解決の問題もあります。 8月中にはRedditでClaude Codeの応答劣化の件は炎上していました。有料プランの週次制限の開始あたりから加熱した印象です。一部ではCodex CLIに乗り換えようという声がありまし

Lai.so Blog
api tool
No Image

Comparing the memory implementations of Claude and ChatGPT

Shlok Khemani has been doing excellent work reverse-engineering LLM systems and documenting his discoveries. Last week he wrote about ChatGPT memory. This week it's Claude. Claude's memory system has two …

Simon Willison's Blog
api tool
Qwen3-Next-80B-A3B: 🐧🦩 Who needs legs?!

Qwen3-Next-80B-A3B: 🐧🦩 Who needs legs?!

Qwen announced two new models via their Twitter account (nothing on their blog yet): Qwen3-Next-80B-A3B-Instruct and Qwen3-Next-80B-A3B-Thinking. They make some big claims on performance: Qwen3-Next-80B-A3B-Instruct approaches our 235B flagship. Qwen3-Next-80B-A3B-Thinking …

Simon Willison's Blog
tool
No Image

Defeating Nondeterminism in LLM Inference

A very common question I see about LLMs concerns why they can't be made to deliver the same response to the same prompt by setting a fixed random number seed. …

Simon Willison's Blog
library tool
No Image

Claude API: Web fetch tool

New in the Claude API: if you pass the web-fetch-2025-09-10 beta header you can add {"type": "web_fetch_20250910", "name": "web_fetch", "max_uses": 5} to your "tools" list and Claude will gain the …

Simon Willison's Blog
api tool
What you actually need to build and ship AI-powered apps in 2025

What you actually need to build and ship AI-powered apps in 2025

Discover what you actually need to build and ship AI-powered apps in 2025, with tips for which tools to choose and how to implement them.

logrocket-dev
platform tool
No Image

I Replaced Animal Crossing's Dialogue with a Live LLM by Hacking GameCube Memory

Brilliant retro-gaming project by Josh Fonseca, who figured out how to run 2002 Game Cube Animal Crossing in the Dolphin Emulator such that dialog with the characters was instead generated …

Simon Willison's Blog
api tool
AI dev tool power rankings & comparison [Sept 2025]

AI dev tool power rankings & comparison [Sept 2025]

Compare the top AI development tools and models of September 2025. View updated rankings, feature breakdowns, and find the best fit for you.

logrocket-dev
api cloud tool
SGLang HiCache: Fast Hierarchical KV Caching with Your Favorite Storage Backends

SGLang HiCache: Fast Hierarchical KV Caching with Your Favorite Storage Backends

<h2><a id="from-the-community" class="anchor" href="#from-the-community" aria-hidden="true"><svg aria-hidden="true" class="octicon octicon-link" height="16" ...

LMSYS Blog
library tool
No Image

Quoting Apple Security Engineering and Architecture

There has never been a successful, widespread malware attack against iPhone. The only system-level iOS attacks we observe in the wild come from mercenary spyware, which is vastly more complex …

Simon Willison's Blog
security
My review of Claude's new Code Interpreter, released under a very confusing name

My review of Claude's new Code Interpreter, released under a very confusing name

Today on the Anthropic blog: Claude can now create and edit files: Claude can now create and edit Excel spreadsheets, documents, PowerPoint slide decks, and PDFs directly in Claude.ai and …

Simon Willison's Blog
api tool
MCP is replacing the browser: Here’s how devs should prepare

MCP is replacing the browser: Here’s how devs should prepare

Learn how MCP will replace the traditional browser, what this shift means for frontend devs, and how to start prepping for an AI-first future.

logrocket-dev
api framework tool
No Image

The 2025 PSF Board Election is Open!

The Python Software Foundation's annual board member election is taking place right now, with votes (from previously affirmed voting members) accepted from September 2nd, 2:00 pm UTC through Tuesday, September …

Simon Willison's Blog
api cloud platform
No Image

Geoffrey Huntley is cursed

Geoffrey Huntley vibe-coded an entirely new programming language using Claude: The programming language is called "cursed". It's cursed in its lexical structure, it's cursed in how it was built, it's …

Simon Willison's Blog
api library tool
Improve your AI code output with AGENTS.md (+ my best tips)

Improve your AI code output with AGENTS.md (+ my best tips)

Stop re-prompting. Put the rules in AGENTS.md: do and don’ts, file-level tests, and real examples so agents ship code that matches your project.

Builder.io Blog
library tool
Recreating the Apollo AI adoption rate chart with GPT-5, Python and Pyodide

Recreating the Apollo AI adoption rate chart with GPT-5, Python and Pyodide

Apollo Global Management’s “Chief Economist” Dr. Torsten Sløk released this interesting chart which appears to show a slowdown in AI adoption rates among large (>250 empoloyees) companies: Here’s the full …

Simon Willison's Blog
api library tool
No Image

Anthropic status: Model output quality

Anthropic previously reported model serving bugs that affected Claude Opus 4 and 4.1 for 56.5 hours. They've now fixed additional bugs affecting "a small percentage" of Sonnet 4 requests for …

Simon Willison's Blog
platform
No Image

Quoting TheSoftwareGuy

Having worked inside AWS I can tell you one big reason [that they don't document their internals] is the attitude/fear that anything we put in out public docs may end …

Simon Willison's Blog
cloud
No Image

Load Llama-3.2 WebGPU in your browser from a local folder

Inspired by a comment on Hacker News I decided to see if it was possible to modify the transformers.js-examples/tree/main/llama-3.2-webgpu Llama 3.2 chat demo (online here, I wrote about it last …

Simon Willison's Blog
tool
No Image

Quoting James Luan

I recently spoke with the CTO of a popular AI note-taking app who told me something surprising: they spend twice as much on vector search as they do on OpenAI …

Simon Willison's Blog
api
No Image

Is the LLM response wrong, or have you just failed to iterate it?

More from Mike Caulfield (see also the SIFT method). He starts with a fantastic example of Google's AI mode usually correctly handling a common piece of misinformation but occasionally falling …

Simon Willison's Blog
platform
No Image

Quoting Anil Dash

I agree with the intellectual substance of virtually every common critique of AI. And it's very clear that turning those critiques into a competition about who can frame them in …

Simon Willison's Blog
platform
No Image

The SIFT method

The SIFT method is "an evaluation strategy developed by digital literacy expert, Mike Caulfield, to help determine whether online content can be trusted for credible or reliable sources of information." …

Simon Willison's Blog
tool
🥇Top AI Papers of the Week

🥇Top AI Papers of the Week

The Top AI Papers of the Week (September 1-7)

Elvis Saravia's NLP Blog
platform
AI mode is good, actually

AI mode is good, actually

When I wrote about how good ChatGPT with GPT-5 is at search yesterday I nearly added a note about how comparatively disappointing Google's efforts around this are. I'm glad I …

Simon Willison's Blog
api cloud tool
仕様駆動開発を支える Spec Kit を試してみた

仕様駆動開発を支える Spec Kit を試してみた

仕様駆動開発(Specification-Driven Development, SDD)は、AI コーディングエージェントを活用した新しいソフトウェア開発スタイルです。GitHub が提供する Spec Kit は、仕様駆動開発を支援するためのツールキットであり、AI との対話を通じて正確な受け入れ基準の定義とコード生成を支援します。この記事では Spec Kit を使用して仕様駆動開発を試してみます。

azukiazusa のテックブログ2
api tool
GPT-5 Thinking in ChatGPT (aka Research Goblin) is shockingly good at search

GPT-5 Thinking in ChatGPT (aka Research Goblin) is shockingly good at search

“Don’t use chatbots as search engines” was great advice for several years... until it wasn’t. I wrote about how good OpenAI’s o3 was at using its Bing-backed search tool back …

Simon Willison's Blog
api tool
No Image

Quoting Jason Liu

I am once again shocked at how much better image retrieval performance you can get if you embed highly opinionated summaries of an image, a summary that came out of …

Simon Willison's Blog
api
Kimi-K2-Instruct-0905

Kimi-K2-Instruct-0905

New not-quite-MIT licensed model from Chinese Moonshot AI, a follow-up to the highly regarded Kimi-K2 model they released in July. This one is an incremental improvement - I've seen it …

Simon Willison's Blog
library tool
🤖 AI Agents Weekly: Universal Deep Research, GPT-4b micro, Self-Evolving Agents, Tracking Multi-Agent Failures

🤖 AI Agents Weekly: Universal Deep Research, GPT-4b micro, Self-Evolving Agents, Tracking Multi-Agent Failures

Universal Deep Research, GPT-4b micro, Self-Evolving Agents, Tracking Multi-Agent Failures

Elvis Saravia's NLP Blog
tool
No Image

Anthropic to pay $1.5 billion to authors in landmark AI settlement

I wrote about the details of this case when it was found that Anthropic's training on book content was fair use, but they needed to have purchased individual copies of …

Simon Willison's Blog
platform
TypeScriptファーストなコーディングAIエージェントのベンチマーク「ts-bench」を公開しました

TypeScriptファーストなコーディングAIエージェントのベンチマーク「ts-bench」を公開しました

AIコーディングエージェントのTypeScriptコード編集能力を評価するための、手軽に再現可能なベンチマークプロジェクト「ts-bench」を公開しました。この記事では、筆者がなぜ ts-bench を作ったのか、今後どうしていきたいかについてお話しします。 GitHub - laiso/ts-benchContribute to laiso/ts-bench development by creating an account on GitHub.GitHublaiso ts-benchの仕組み ts-benchは、プログラミング学習プラットフォーム Exercism のTypeScript問題セットを利用します。各問題には、仕様を説明するドキュメント、エージェントが編集すべきソースコードのひな形、そして正解判定に使うテストコードが含まれています。 ベンチマークタスクは、各問題に対して以下の4つのステップを順番に実行します。 1. AIエージェントの実行: 問題の指示書をプロンプトとしてAIエージェントに渡し、ソースコードを編集させます。 2. テストファイルの復元

Lai.so Blog
library tool
Introducing EmbeddingGemma

Introducing EmbeddingGemma

Brand new open weights (under the slightly janky Gemma license) 308M parameter embedding model from Google: Based on the Gemma 3 architecture, EmbeddingGemma is trained on 100+ languages and is …

Simon Willison's Blog
library tool
No Image

Highlighted tools

Any time I share my collection of tools built using vibe coding and AI-assisted development (now at 124, here's the definitive list) someone will inevitably complain that they're mostly trivial. …

Simon Willison's Blog
tool
Beyond Vibe Coding

Beyond Vibe Coding

Back in May I wrote Two publishers and three authors fail to understand what “vibe coding” means where I called out the authors of two forthcoming books on "vibe coding" …

Simon Willison's Blog
tool
AI coding tools still suck at context — here’s how to work around it

AI coding tools still suck at context — here’s how to work around it

Discover why you might be having difficulty with AI coding tools, and learn some practical strategies to work with AI more effectively.

logrocket-dev
api tool
No Image

gov.uscourts.dcd.223205.1436.0_1.pdf

Here's the 230 page PDF ruling on the 2023 United States v. Google LLC federal antitrust case - the case that could have resulted in Google selling off Chrome and …

Simon Willison's Blog
api cloud tool
AGENTS.md Gains Traction as an Open Format for AI Coding Agents

AGENTS.md Gains Traction as an Open Format for AI Coding Agents

AGENTS.md is a fast-growing open format giving AI coding agents a shared, predictable way to understand project setup, style, and workflows.

Socket
api tool
Cursor vs Claude Code: The Ultimate Comparison Guide

Cursor vs Claude Code: The Ultimate Comparison Guide

Cursor or Claude Code? Both start at $20/mo but work differently. Compare features, hidden costs, and real workflows to pick the right AI coding tool.

Builder.io Blog
library tool
Rich Pixels

Rich Pixels

Neat Python library by Darren Burns adding pixel image support to the Rich terminal library, using tricks to render an image using full or half-height colored blocks. Here's the key …

Simon Willison's Blog
library tool
No Image

August 2025 newsletter

I just sent out my August 2025 sponsors-only newsletter summarizing the past month in LLMs and my other work. Topics included GPT-5, gpt-oss, image editing models (Qwen-Image-Edit and Gemini Nano …

Simon Willison's Blog
platform
No Image

Introducing gpt-realtime

Released a few days ago (August 28th), gpt-realtime is OpenAI's new "most advanced speech-to-speech model". It looks like this is a replacement for the older gpt-4o-realtime-preview model that was released …

Simon Willison's Blog
platform
Cloudflare Radar: AI Insights

Cloudflare Radar: AI Insights

Cloudflare launched this dashboard back in February, incorporating traffic analysis from Cloudflare's network along with insights from their popular 1.1.1.1 DNS service. I found this chart particularly interesting, showing which …

Simon Willison's Blog
cloud
LongCat-Flash: Deploying Meituan's Agentic Model with SGLang

LongCat-Flash: Deploying Meituan's Agentic Model with SGLang

<h3><a id="1-introduction-deploying-meituans-agentic-open-source-moe-model" class="anchor" href="#1-introduction-deploying-meituans-agentic-open-source-moe-m...

LMSYS Blog
library tool
🥇Top AI Papers of the Week

🥇Top AI Papers of the Week

The Top AI Papers of the Week (August 25-31)

Elvis Saravia's NLP Blog
platform
エンティティリンキングの性能改善のための効果的な絞り込み手法の検証

エンティティリンキングの性能改善のための効果的な絞り込み手法の検証

AI ShiftのTECH BLOGです。AI技術の情報や活用方法などをご案内いたします。

AI-Shift Tech Blog
api tool
No Image

Claude Opus 4.1 and Opus 4 degraded quality

Notable because often when people complain of degraded model quality it turns out to be unfounded - Anthropic in the past have emphasized that they don't change the model weights …

Simon Willison's Blog
platform
🤖 AI Agents Weekly: Gemini 2.5 Flash Image, gpt-realtime, Anemoi Agent, Fine-tuning LLM Agents, Codex Updates, Agent Client Protocol

🤖 AI Agents Weekly: Gemini 2.5 Flash Image, gpt-realtime, Anemoi Agent, Fine-tuning LLM Agents, Codex Updates, Agent Client Protocol

Gemini 2.5 Flash Image, gpt-realtime, Anemoi Agent, Fine-tuning LLM Agents, Codex Updates, Agent Client Protocol

Elvis Saravia's NLP Blog
tool
No Image

Quoting Benj Edwards

LLMs are intelligence without agency—what we might call "vox sine persona": voice without person. Not the voice of someone, not even the collective voice of many someones, but a voice …

Simon Willison's Blog
platform
AI コーディングエージェントの管理を行う Vibe Kanban を試してみた

AI コーディングエージェントの管理を行う Vibe Kanban を試してみた

Vibe Kanban は、AI コーディングエージェントの管理を支援するためのツールです。カンバン方式の UI でタスク管理を行い、各タスクに対して AI エージェントを割り当てて人間がその進捗を管理できます。この記事では Vibe Kanban を使用して AI コーディングエージェントの管理を実際に試してみます。

azukiazusa のテックブログ2
tool
The perils of vibe coding

The perils of vibe coding

I was interviewed by Elaine Moore for this opinion piece in the Financial Times, which ended up in the print edition of the paper too! I picked up a copy …

Simon Willison's Blog
api tool
How to build a multimodal AI app with voice and vision in Next.js

How to build a multimodal AI app with voice and vision in Next.js

Learn how to build multimodal AI interactions to process images, audio, and even real-time video streams, using Next.js and Gemini.

logrocket-dev
api framework tool
No Image

Lossy encyclopedia

Since I love collecting questionable analogies for LLMs, here's a new one I just came up with: an LLM is a lossy encyclopedia. They have a huge array of facts …

Simon Willison's Blog
platform
No Image

Python: The Documentary

New documentary about the origins of the Python programming language - 84 minutes long, built around extensive interviews with Guido van Rossum and others who were there at the start …

Simon Willison's Blog
youtube