Last updated: 2025/09/28 05:00

🤖 AI Agents Weekly: Code World Model, Gemini Robotics-ER 1.5, Figma MCP server, Overhearing LLM Agents, Qwen3-Max, Gamma API
Code World Model, Gemini Robotics-ER 1.5, Figma MCP server, Overhearing LLM Agents, Qwen3-Max, Gamma API

GitHub Copilot CLIがリリース
2025年9月25日、GitHubが「GitHub Copilot CLI」をパブリックプレビューとして公開しました。 GitHub Copilot CLI is now in public preview - GitHub ChangelogGitHub Copilot CLI is now in public preview We’re bringing the power of GitHub Copilot coding agent directly to your terminal. With GitHub Copilot CLI, you can work locally and…The GitHub BlogAllison

Chrome DevTools MCP で AI エージェントのフロントエンド開発をサポートする
自律的な AI エージェントを利用したコーディングでは、生成したコードを実行した結果からフィードバックを得て、コードを改善していく反復的なプロセスが重要です。しかし、フロントエンド開発では、生成したコードはブラウザ上で実行されるため、AI エージェントが直接コードを実行したり、ブラウザのコンソールログを取得したりすることは困難です。Chrome DevTools MCP はこの課題を解決するためのツールです。
ForcedLeak: AI Agent risks exposed in Salesforce AgentForce
Classic lethal trifecta image exfiltration bug reported against Salesforce AgentForce by Sasi Levi and Noma Security. Here the malicious instructions come in via the Salesforce Web-to-Lead feature. When a Salesforce …
How to stop AI’s “lethal trifecta”
This is the second mention of the lethal trifecta in the Economist in just the last week! Their earlier coverage was Why AI systems may never be secure on September …

YANS2025 参加報告
AI ShiftのTECH BLOGです。AI技術の情報や活用方法などをご案内いたします。
Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G
<h2><a id="introduction" class="anchor" href="#introduction" aria-hidden="true"><svg aria-hidden="true" class="octicon octicon-link" height="16" version="1.1...
GitHub Copilot CLI is now in public preview
GitHub now have their own entry in the coding terminal CLI agent space: Copilot CLI. It's the same basic shape as Claude Code, Codex CLI, Gemini CLI and a growing …

Improved Gemini 2.5 Flash and Flash-Lite
Two new preview models from Google - updates to their fast and inexpensive Flash and Flash Lite families: The latest version of Gemini 2.5 Flash-Lite was trained and built based …
Don't hide your best documentation
If you hide the system prompt and tool descriptions for your LLM agent, what you're actually doing is deliberately hiding the most useful documentation describing your service from your most …

Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part II): 3.8x Prefill, 4.8x Decode Throughput
<p>The GB200 NVL72 is one of the most powerful hardware for deep learning. In this blog post, we share our progress to optimize the inference performance of ...
Quoting Stanford CS221 Autumn 2025
[2 points] Learn basic NumPy operations with an AI tutor! Use an AI chatbot (e.g., ChatGPT, Claude, Gemini, or Stanford AI Playground) to teach yourself how to do basic vector …
Cross-Agent Privilege Escalation: When Agents Free Each Other
Here's a clever new form of AI exploit from Johann Rehberger, who has coined the term Cross-Agent Privilege Escalation to describe an attack where multiple coding agents - GitHub Copilot …

6 easy ways to level up Claude Code
Walk through six tips and tricks that help you level up Claude Code to move beyond simply entering prompts into a text box.

GPT-5-Codex
OpenAI half-relased this model earlier this month, adding it to their Codex CLI tool but not their API. Today they've fixed that - the new model can now be accessed …
Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action
I've been looking forward to this. Qwen 2.5 VL is one of the best available open weight vision LLMs, so I had high hopes for Qwen 3's vision models. Firstly, …

YAML ファイルで AI エージェントを構築する cagent
cagent は Docker 社が開発した AI エージェントフレームワークです。YAML ファイルでエージェントの振る舞い・役割・使用するツールを宣言的に定義でき、コードを 1 行も書かずにエージェントを構築できます。この記事では cagent の概要とインストール方法、YAML ファイルの書き方、実際にエージェントを動作させるまでの手順を解説します。
Why AI systems might never be secure
The Economist have a new piece out about LLM security, with this headline and subtitle: Why AI systems might never be secure A “lethal trifecta” of conditions opens them to …
Quoting Kate Niederhoffer, Gabriella Rosen Kellerman, Angela Lee, Alex Liebscher, Kristina Rapuano and Jeffrey T. Hancock
We define workslop as AI generated work content that masquerades as good work, but lacks the substance to meaningfully advance a given task. Here’s how this happens. As AI tools …

Four new releases from Qwen
It's been an extremely busy day for team Qwen. Within the last 24 hours (all links to Twitter, which seems to be their preferred platform for these announcements): Qwen3-Next-80B-A3B-Instruct-FP8 and …

CompileBench: Can AI Compile 22-year-old Code?
Interesting new LLM benchmark from Piotr Grabowski and Piotr Migdał: how well can different models handle compilation challenges such as cross-compiling gucr for ARM64 architecture? This is one of my …
ChatGPT Is Blowing Up Marriages as Spouses Use AI to Attack Their Partners
Maggie Harrison Dupré for Futurism. It turns out having an always-available "marriage therapist" with a sycophantic instinct to always take your side is catastrophic for relationships. The tension in the …

Enabling Deterministic Inference for SGLang
<p>This post highlights our initial efforts to achieve deterministic inference in SGLang. By integrating batch invariant kernels released by Thinking Machine...
Locally AI
Handy new iOS app by Adrien Grondin for running local LLMs on your phone. It just added support for the new iOS 26 Apple Foundation model, so you can install …

🥇Top AI Papers of the Week
The Top AI Papers of the Week (September 15-21)

GPT‑5 Codexがリリース
OpenAIが2025年9月15日にGPT‑5 Codexを発表しました。GPT‑5 CodexはGPT‑5を土台にして、エージェントのコーディング能力に適した学習と強化が加えられたモデルです。長時間の自律的な作業に特に強みがあります。 We’re releasing new Codex features to make it a more effective coding collaborator: - A new IDE extension - Easily move tasks between the cloud and your local environment - Code reviews in GitHub - Revamped Codex CLI Powered by
llm-openrouter 0.5
New release of my LLM plugin for accessing models made available via OpenRouter. The release notes in full: Support for tool calling. Thanks, James Sanford. #43 Support for reasoning options, …

Optimizing FP4 Mixed-Precision Inference on AMD GPUs
<p>Haohui Mai (CausalFlow.ai), Lei Zhang (AMD)</p> <h2><a id="introduction" class="anchor" href="#introduction" aria-hidden="true"><svg aria-hidden="true" cl...

Grok 4 Fast
New hosted vision-enabled reasoning model from xAI that's designed to be fast and extremely competitive on price. It has a 2 million token context window and "was trained end-to-end with …

🤖 AI Agents Weekly: GPT-5-Codex, Grok 4 Fast, Tongyi DeepResearch, Magistral Small 1.2, Agent Payments Protocol (AP2)
GPT-5-Codex, Grok 4 Fast, Tongyi DeepResearch, Magistral Small 1.2, Agent Payments Protocol (AP2)

AI エージェントのための Agent Payments Protocol (AP2) を試してみた
現状の決済システムでは人間が信頼できる画面上で直接購入ボタンをクリックすることを前提としており、自立型の AI エージェントがユーザーに代わって決済することは想定されていません。そこで Google により Agent Payments Protocol (AP2) と呼ばれる新しいプロトコルが提案されました。プラットフォーム間でエージェント主導の決済を安全に開始・処理することを可能にします。この記事では AP2 のサンプルコードを実際に試してみた手順を紹介します。
Magistral 1.2
Mistral quietly released two new models yesterday: Magistral Small 1.2 (Apache 2.0, 96.1 GB on Hugging Face) and Magistral Medium 1.2 (not open weights same as Mistral's other "medium" models.) …
The Hidden Risk in Notion 3.0 AI Agents: Web Search Tool Abuse for Data Exfiltration
Abi Raghuram reports that Notion 3.0, released yesterday, introduces new prompt injection data exfiltration vulnerabilities thanks to enabling lethal trifecta attacks. Abi's attack involves a PDF with hidden text (white …

Environment-aware model routing: Build smarter AI apps with AI SDK
Discover a handy pattern for routing LLM calls in an “environment-aware” manner, using AI SDK’s middleware.
Quoting Steve Jobs
Well, the types of computers we have today are tools. They’re responders: you ask a computer to do something and it will do it. The next stage is going to …

I think "agent" may finally have a widely enough agreed upon definition to be useful jargon now
I’ve noticed something interesting over the past few weeks: I’ve started using the term “agent” in conversations where I don’t feel the need to then define it, roll my eyes …
Anthropic: A postmortem of three recent issues
Anthropic had a very bad month in terms of model reliability: Between August and early September, three infrastructure bugs intermittently degraded Claude's response quality. We've now resolved these issues and …
ICPC medals for OpenAI and Gemini
In July it was the International Math Olympiad (OpenAI, Gemini), today it's the International Collegiate Programming Contest (ICPC). Once again, both OpenAI and Gemini competed with models that achieved Gold …

How to stop your AI agents from hallucinating: A guide to n8n’s Eval Node
Walk through a practical example of n8n's Eval feature, which helps developers reduce hallucinations and increase reliability of AI products.
Announcing the 2025 PSF Board Election Results!
I'm happy to share that I've been re-elected for second term on the board of directors of the Python Software Foundation. Jannis Leidel was also re-elected and Abigail Dogbe and …

Let’s kill vibe coding and bring back prompt engineering
Vibe coding is trending, but is it sustainable? Explore why prompt engineering still matters for building reliable, high-quality AI apps.

openai/codex でのプロジェクト固有MCPを設定する
この記事では、OpenAIのCodexを使用してプロジェクト固有のMCP(Model Context Protocol)を設定する方法について説明しています。CodexはグローバルにMCPを設定することしかできないため、プロジェクトごとに独立した設定が必要です。2つの手段が提案されており、1つ目は環境変数を使用して読み込みディレクトリを変更し、プロジェクト固有の設定をロードする方法です。しかし、この方法では認証情報が含まれるため、普段使いには適していません。2つ目の手段は、Codexのコマンドラインオプションを使用して直接TOML設定をロードする方法で、こちらの方が安全です。具体的なコマンドや設定例も示されており、実装に関する注意点も記載されています。 • CodexはグローバルにMCPを設定するが、プロジェクトごとに独立した設定が必要な場合がある。 • 手段1では環境変数を使用してプロジェクト固有のMCP設定をロードできるが、認証情報が含まれるため普段使いには不向き。 • 手段2では--configオプションを使用して直接TOMLをロードする方法があり、こちらが安全とされる。 • 具体的なコマンドや設定例が示されており、実装方法が詳細に説明されている。 • JSONからTOMLへの変換に関する注意点も記載されている。

GPT‑5-Codex and upgrades to Codex
OpenAI half-released a new model today: GPT‑5-Codex, a fine-tuned GPT-5 variant explicitly designed for their various AI-assisted programming tools. I say half-released because it's not yet available via their API, …
Models can prompt now
Here's an interesting example of models incrementally improving over time: I am finding that today's leading models are competent at writing prompts for themselves and each other. A year ago …

🥇Top AI Papers of the Week
The Top AI Papers of the Week (September 8-14)

メインブラウザをEdgeに切り替えた理由とAIブラウザの可能性
ChromeからEdgeに乗り換え 最近、筆者はAI統合型のブラウザを常用するべくメインブラウザをGoogle ChromeからMicrosoft Edgeに切り替えました。EdgeのCopilot Modeは8月にGPT-5が搭載され、かなり使い勝手が良くなりました。2年前にこの前哨戦となる「Bing AIチャットをデフォルトのウェブ検索にして使ってみた」を投稿したのですが、当時と比べると雲泥の差です。 この記事では、筆者がEdgeへの移行を検討するに至った背景や、実際の使用感について整理しました。また、AIブラウザの台頭に伴い、セキュリティ面での新たなリスクについても考えることになったのでそれを喚起します。 移行の動機 筆者がメインブラウザをChromeからEdgeに移行した最大の理由は、AI統合型のウェブブラウジングを日常にしたかったからでした。実は2年前にもプログラミングにAI機能を使いたいという理由で、エディタをJetBrainsから強制的にVSCode/Cursorに移行した経験があり、それを思い出します。 現在、ブラウザやOSとLLMの統合は急速に進んでいます

🤖 AI Agents Weekly: Agent 3, ChatGPT Developer Mode, MCP Registry, Writing Effective Tools for Agents, Qodo Aware
Agent 3, ChatGPT Developer Mode, MCP Registry, Writing Effective Tools for Agents, Qodo Aware

自然言語で CI/CD パイプラインを定義する Agentic Workflows
Agentic Workflows は自然言語で CI/CD パイプラインを定義できるツールとして GitHub Next が開発しています。自然言語で定義されたワークフローは GitHub CLI の拡張機能として提供される gh aw コマンドでコンパイルして実行できます。これは継続体なAI(Continuous AI)を実現するためのツールです。
gpt-5 and gpt-5-mini rate limit updates
OpenAI have increased the rate limits for their two main GPT-5 models. These look significant: gpt-5 Tier 1: 30K → 500K TPM (1.5M batch) Tier 2: 450K → 1M (3M …
Quoting Matt Webb
The trick with Claude Code is to give it large, but not too large, extremely well defined problems. (If the problems are too large then you are now vibe coding… …

今週の話題:Claudeの劣化問題の修正、Claude Code API差し替え、sonoma-alpha
AnthropicがClaudeの性能劣化に対応 Anthropicが公式に、8月からコミュニティで報告されていたClaude Sonnetの性能劣化を修正したと発表しました。原因は推論スタックのインフラ層にあり、独立したバグによるものであり「モデル本体の意図的な性能ダウン」や「需要対策によるダウングレード」は否定されています。 Model output qualityAnthropic’s Status Page - Model output quality.Model output quality 発表には、2025年8月下旬〜9月初旬にかけてSonnet 4系で品質劣化(degraded output quality)が発生し、8月5日〜9月4日には少数のSonnet 4.0リクエストに出力品質の低下が見られたという記載があります。Opus 4.1にはいまだ未解決の問題もあります。 8月中にはRedditでClaude Codeの応答劣化の件は炎上していました。有料プランの週次制限の開始あたりから加熱した印象です。一部ではCodex CLIに乗り換えようという声がありまし
Comparing the memory implementations of Claude and ChatGPT
Shlok Khemani has been doing excellent work reverse-engineering LLM systems and documenting his discoveries. Last week he wrote about ChatGPT memory. This week it's Claude. Claude's memory system has two …

Qwen3-Next-80B-A3B: 🐧🦩 Who needs legs?!
Qwen announced two new models via their Twitter account (nothing on their blog yet): Qwen3-Next-80B-A3B-Instruct and Qwen3-Next-80B-A3B-Thinking. They make some big claims on performance: Qwen3-Next-80B-A3B-Instruct approaches our 235B flagship. Qwen3-Next-80B-A3B-Thinking …
Defeating Nondeterminism in LLM Inference
A very common question I see about LLMs concerns why they can't be made to deliver the same response to the same prompt by setting a fixed random number seed. …
Claude API: Web fetch tool
New in the Claude API: if you pass the web-fetch-2025-09-10 beta header you can add {"type": "web_fetch_20250910", "name": "web_fetch", "max_uses": 5} to your "tools" list and Claude will gain the …

What you actually need to build and ship AI-powered apps in 2025
Discover what you actually need to build and ship AI-powered apps in 2025, with tips for which tools to choose and how to implement them.
I Replaced Animal Crossing's Dialogue with a Live LLM by Hacking GameCube Memory
Brilliant retro-gaming project by Josh Fonseca, who figured out how to run 2002 Game Cube Animal Crossing in the Dolphin Emulator such that dialog with the characters was instead generated …
![AI dev tool power rankings & comparison [Sept 2025]](https://blog.logrocket.com/wp-content/uploads/2025/07/ai_dev_tool_power_rankings_july_2025_web.png)
AI dev tool power rankings & comparison [Sept 2025]
Compare the top AI development tools and models of September 2025. View updated rankings, feature breakdowns, and find the best fit for you.

SGLang HiCache: Fast Hierarchical KV Caching with Your Favorite Storage Backends
<h2><a id="from-the-community" class="anchor" href="#from-the-community" aria-hidden="true"><svg aria-hidden="true" class="octicon octicon-link" height="16" ...
Quoting Apple Security Engineering and Architecture
There has never been a successful, widespread malware attack against iPhone. The only system-level iOS attacks we observe in the wild come from mercenary spyware, which is vastly more complex …

My review of Claude's new Code Interpreter, released under a very confusing name
Today on the Anthropic blog: Claude can now create and edit files: Claude can now create and edit Excel spreadsheets, documents, PowerPoint slide decks, and PDFs directly in Claude.ai and …

MCP is replacing the browser: Here’s how devs should prepare
Learn how MCP will replace the traditional browser, what this shift means for frontend devs, and how to start prepping for an AI-first future.
The 2025 PSF Board Election is Open!
The Python Software Foundation's annual board member election is taking place right now, with votes (from previously affirmed voting members) accepted from September 2nd, 2:00 pm UTC through Tuesday, September …
Geoffrey Huntley is cursed
Geoffrey Huntley vibe-coded an entirely new programming language using Claude: The programming language is called "cursed". It's cursed in its lexical structure, it's cursed in how it was built, it's …
Improve your AI code output with AGENTS.md (+ my best tips)
Stop re-prompting. Put the rules in AGENTS.md: do and don’ts, file-level tests, and real examples so agents ship code that matches your project.

Recreating the Apollo AI adoption rate chart with GPT-5, Python and Pyodide
Apollo Global Management’s “Chief Economist” Dr. Torsten Sløk released this interesting chart which appears to show a slowdown in AI adoption rates among large (>250 empoloyees) companies: Here’s the full …
Anthropic status: Model output quality
Anthropic previously reported model serving bugs that affected Claude Opus 4 and 4.1 for 56.5 hours. They've now fixed additional bugs affecting "a small percentage" of Sonnet 4 requests for …
Quoting TheSoftwareGuy
Having worked inside AWS I can tell you one big reason [that they don't document their internals] is the attitude/fear that anything we put in out public docs may end …
Load Llama-3.2 WebGPU in your browser from a local folder
Inspired by a comment on Hacker News I decided to see if it was possible to modify the transformers.js-examples/tree/main/llama-3.2-webgpu Llama 3.2 chat demo (online here, I wrote about it last …
Quoting James Luan
I recently spoke with the CTO of a popular AI note-taking app who told me something surprising: they spend twice as much on vector search as they do on OpenAI …
Is the LLM response wrong, or have you just failed to iterate it?
More from Mike Caulfield (see also the SIFT method). He starts with a fantastic example of Google's AI mode usually correctly handling a common piece of misinformation but occasionally falling …
Quoting Anil Dash
I agree with the intellectual substance of virtually every common critique of AI. And it's very clear that turning those critiques into a competition about who can frame them in …
The SIFT method
The SIFT method is "an evaluation strategy developed by digital literacy expert, Mike Caulfield, to help determine whether online content can be trusted for credible or reliable sources of information." …

🥇Top AI Papers of the Week
The Top AI Papers of the Week (September 1-7)

AI mode is good, actually
When I wrote about how good ChatGPT with GPT-5 is at search yesterday I nearly added a note about how comparatively disappointing Google's efforts around this are. I'm glad I …

仕様駆動開発を支える Spec Kit を試してみた
仕様駆動開発(Specification-Driven Development, SDD)は、AI コーディングエージェントを活用した新しいソフトウェア開発スタイルです。GitHub が提供する Spec Kit は、仕様駆動開発を支援するためのツールキットであり、AI との対話を通じて正確な受け入れ基準の定義とコード生成を支援します。この記事では Spec Kit を使用して仕様駆動開発を試してみます。

GPT-5 Thinking in ChatGPT (aka Research Goblin) is shockingly good at search
“Don’t use chatbots as search engines” was great advice for several years... until it wasn’t. I wrote about how good OpenAI’s o3 was at using its Bing-backed search tool back …
Quoting Jason Liu
I am once again shocked at how much better image retrieval performance you can get if you embed highly opinionated summaries of an image, a summary that came out of …

Kimi-K2-Instruct-0905
New not-quite-MIT licensed model from Chinese Moonshot AI, a follow-up to the highly regarded Kimi-K2 model they released in July. This one is an incremental improvement - I've seen it …

🤖 AI Agents Weekly: Universal Deep Research, GPT-4b micro, Self-Evolving Agents, Tracking Multi-Agent Failures
Universal Deep Research, GPT-4b micro, Self-Evolving Agents, Tracking Multi-Agent Failures
Anthropic to pay $1.5 billion to authors in landmark AI settlement
I wrote about the details of this case when it was found that Anthropic's training on book content was fair use, but they needed to have purchased individual copies of …

TypeScriptファーストなコーディングAIエージェントのベンチマーク「ts-bench」を公開しました
AIコーディングエージェントのTypeScriptコード編集能力を評価するための、手軽に再現可能なベンチマークプロジェクト「ts-bench」を公開しました。この記事では、筆者がなぜ ts-bench を作ったのか、今後どうしていきたいかについてお話しします。 GitHub - laiso/ts-benchContribute to laiso/ts-bench development by creating an account on GitHub.GitHublaiso ts-benchの仕組み ts-benchは、プログラミング学習プラットフォーム Exercism のTypeScript問題セットを利用します。各問題には、仕様を説明するドキュメント、エージェントが編集すべきソースコードのひな形、そして正解判定に使うテストコードが含まれています。 ベンチマークタスクは、各問題に対して以下の4つのステップを順番に実行します。 1. AIエージェントの実行: 問題の指示書をプロンプトとしてAIエージェントに渡し、ソースコードを編集させます。 2. テストファイルの復元

Introducing EmbeddingGemma
Brand new open weights (under the slightly janky Gemma license) 308M parameter embedding model from Google: Based on the Gemma 3 architecture, EmbeddingGemma is trained on 100+ languages and is …
Highlighted tools
Any time I share my collection of tools built using vibe coding and AI-assisted development (now at 124, here's the definitive list) someone will inevitably complain that they're mostly trivial. …

Beyond Vibe Coding
Back in May I wrote Two publishers and three authors fail to understand what “vibe coding” means where I called out the authors of two forthcoming books on "vibe coding" …

AI coding tools still suck at context — here’s how to work around it
Discover why you might be having difficulty with AI coding tools, and learn some practical strategies to work with AI more effectively.
gov.uscourts.dcd.223205.1436.0_1.pdf
Here's the 230 page PDF ruling on the 2023 United States v. Google LLC federal antitrust case - the case that could have resulted in Google selling off Chrome and …

AGENTS.md Gains Traction as an Open Format for AI Coding Agents
AGENTS.md is a fast-growing open format giving AI coding agents a shared, predictable way to understand project setup, style, and workflows.
Cursor vs Claude Code: The Ultimate Comparison Guide
Cursor or Claude Code? Both start at $20/mo but work differently. Compare features, hidden costs, and real workflows to pick the right AI coding tool.

Rich Pixels
Neat Python library by Darren Burns adding pixel image support to the Rich terminal library, using tricks to render an image using full or half-height colored blocks. Here's the key …
August 2025 newsletter
I just sent out my August 2025 sponsors-only newsletter summarizing the past month in LLMs and my other work. Topics included GPT-5, gpt-oss, image editing models (Qwen-Image-Edit and Gemini Nano …
Introducing gpt-realtime
Released a few days ago (August 28th), gpt-realtime is OpenAI's new "most advanced speech-to-speech model". It looks like this is a replacement for the older gpt-4o-realtime-preview model that was released …

Cloudflare Radar: AI Insights
Cloudflare launched this dashboard back in February, incorporating traffic analysis from Cloudflare's network along with insights from their popular 1.1.1.1 DNS service. I found this chart particularly interesting, showing which …

LongCat-Flash: Deploying Meituan's Agentic Model with SGLang
<h3><a id="1-introduction-deploying-meituans-agentic-open-source-moe-model" class="anchor" href="#1-introduction-deploying-meituans-agentic-open-source-moe-m...

🥇Top AI Papers of the Week
The Top AI Papers of the Week (August 25-31)

エンティティリンキングの性能改善のための効果的な絞り込み手法の検証
AI ShiftのTECH BLOGです。AI技術の情報や活用方法などをご案内いたします。
Claude Opus 4.1 and Opus 4 degraded quality
Notable because often when people complain of degraded model quality it turns out to be unfounded - Anthropic in the past have emphasized that they don't change the model weights …

🤖 AI Agents Weekly: Gemini 2.5 Flash Image, gpt-realtime, Anemoi Agent, Fine-tuning LLM Agents, Codex Updates, Agent Client Protocol
Gemini 2.5 Flash Image, gpt-realtime, Anemoi Agent, Fine-tuning LLM Agents, Codex Updates, Agent Client Protocol
Quoting Benj Edwards
LLMs are intelligence without agency—what we might call "vox sine persona": voice without person. Not the voice of someone, not even the collective voice of many someones, but a voice …

AI コーディングエージェントの管理を行う Vibe Kanban を試してみた
Vibe Kanban は、AI コーディングエージェントの管理を支援するためのツールです。カンバン方式の UI でタスク管理を行い、各タスクに対して AI エージェントを割り当てて人間がその進捗を管理できます。この記事では Vibe Kanban を使用して AI コーディングエージェントの管理を実際に試してみます。

The perils of vibe coding
I was interviewed by Elaine Moore for this opinion piece in the Financial Times, which ended up in the print edition of the paper too! I picked up a copy …

How to build a multimodal AI app with voice and vision in Next.js
Learn how to build multimodal AI interactions to process images, audio, and even real-time video streams, using Next.js and Gemini.
Lossy encyclopedia
Since I love collecting questionable analogies for LLMs, here's a new one I just came up with: an LLM is a lossy encyclopedia. They have a huge array of facts …
Python: The Documentary
New documentary about the origins of the Python programming language - 84 minutes long, built around extensive interviews with Guido van Rossum and others who were there at the start …

I tried out Kiro: Here’s what I learned
Check out Kiro, AWS's AI-powered IDE, see what makes it different from other AI coding tools, and explore whether it lives up to the hype.

Finetune and deploy GPT-OSS in MXFP4: ModelOpt+SGLang
<p>GPT-OSS, the first open-source model family from OpenAI's lab since GPT-2, demonstrates strong math, coding, and general capabilities even when compared w...
Quoting Bruce Schneier
We simply don’t know to defend against these attacks. We have zero agentic AI systems that are secure against these attacks. Any AI that is working in an adversarial environment—and …

How to protect your AI agent from prompt injection attacks
Explore six principled design patterns (with real-world examples) to help you protect your LLM agents from prompt injection attacks.

SGLang for gpt-oss: From Day 0 Support to Enhanced Performance
<p>We are excited to announce a major update for SGLang, focusing on deep performance optimizations and new features for the recently released openai/gpt-oss...
Piloting Claude for Chrome
Two days ago I said: I strongly expect that the entire concept of an agentic browser extension is fatally flawed and cannot be built safely. Today Anthropic announced their own …

User agent strings to HTTP signatures - methods for AI agent identification
How to verify AI agent identity using HTTP message signatures with TypeScript.

Qwen3-Coder: Is this Agentic CLI smarter than senior devs?
Discover Qwen3-Coder, Alibaba’s 480B parameter agentic coding CLI, with real-world tests, use cases, and performance insights.
Will Smith’s concert crowds are real, but AI is blurring the lines
Great piece from Andy Baio demonstrating quite how convoluted the usage ethics and backlash against generative AI has become. Will Smith has been accused of using AI to misleadingly inflate …
Agentic Browser Security: Indirect Prompt Injection in Perplexity Comet
The security team from Brave took a look at Comet, the LLM-powered "agentic browser" extension from Perplexity, and unsurprisingly found security holes you can drive a truck through. The vulnerability …

🥇Top AI Papers of the Week
The Top AI Papers of the Week (August 18-24)

🤖 Agents Weekly: DeepSeek-V3.1, AGENTS.md, URL Context, Context Engineering Tips, Qwen-Image-Edit
DeepSeek-V3.1, AGENTS.md, URL Context, Context Engineering Tips, Qwen-Image-Edit
ChatGPT release notes: Project-only memory
The feature I've most wanted from ChatGPT's memory feature (the newer version of memory that automatically includes relevant details from summarized prior conversations) just landed: With project-only memory enabled, ChatGPT …

DeepSeek 3.1
The latest model from DeepSeek, a 685B monster (like DeepSeek v3 before it) but this time it's a hybrid reasoning model. DeepSeek claim: DeepSeek-V3.1-Think achieves comparable answer quality to DeepSeek-R1-0528, …
Quoting The Bluesky Team
Mississippi's approach would fundamentally change how users access Bluesky. The Supreme Court’s recent decision leaves us facing a hard reality: comply with Mississippi’s age assurance law—and make every Mississippi Bluesky …

Agentic AI for 5x less: Why Kimi K2 is a frontend game-changer
Discover how to integrate Kimi K2 agentic mode into a frontend application, and learn how it compares to DeepSeek.
too many model context protocol servers and LLM allocations on the dance floor
Useful reminder from Geoffrey Huntley of the infrequently discussed significant token cost of using MCP. Geoffrey estimate estimates that the usable context window something like Amp or Cursor is around …
Quoting potatolicious
Most classical engineering fields deal with probabilistic system components all of the time. In fact I'd go as far as to say that inability to deal with probabilistic components is …
Quoting Matt Garman
I was at a leadership group and people were telling me "We think that with AI we can replace all of our junior people in our company." I was like, …
Quoting Mustafa Suleyman
Simply put, my central worry is that many people will start to believe in the illusion of AIs as conscious entities so strongly that they’ll soon advocate for AI rights, …

ターンテイキングのタイミング予測を簡単に試せるライブラリMaAIを使ってみた
AI ShiftのTECH BLOGです。AI技術の情報や活用方法などをご案内いたします。
Quoting u/AssafMalkiIL
what’s the point of vibe coding if at the end of the day i still gotta pay a dev to look at the code anyway. sure it feels kinda cool …

David Ho on BlueSky: A pelican tried to eat my bike
David Ho caught video footage of one of the pelicans in St James's Park expressing deep curiosity in his bicycle. I think it wants to ride it.

Does Gemini CLI fall short? Here’s how Codex compares
Compare Codex CLI vs Gemini CLI for real-world coding tasks. See strengths, weaknesses, and which AI CLI fits your developer workflow best.

Qwen-Image-Edit: Image Editing with Higher Quality and Efficiency
As promised in their August 4th release of the Qwen image generation model, Qwen have now followed it up with a separate model, Qwen-Image-Edit, which can take an image and …

llama.cpp guide: running gpt-oss with llama.cpp
Really useful official guide to running the OpenAI gpt-oss models using llama-server from llama.cpp - which provides an OpenAI-compatible localhost API and a neat web interface for interacting with the …
PyPI: Preventing Domain Resurrection Attacks
Domain resurrection attacks are a nasty vulnerability in systems that use email verification to allow people to recover their accounts. If somebody lets their domain name expire an attacker might …

Don’t let AI erase the next generation of dev leaders
If AI snaps up all of their opportunities to learn, junior engineers can never grow into senior roles. Then who’s left to lead the engineering teams of the future?
r/ChatGPTPro: What is the most profitable thing you have done with ChatGPT?
This Reddit thread - with 279 replies - offers a neat targeted insight into the kinds of things people are using ChatGPT for. Lots of variety here but two themes …
Google Gemini URL Context
New feature in the Gemini API: you can now enable a url_context tool which the models can use to request the contents of URLs as part of replying to a …

🥇Top AI Papers of the Week
The Top AI Papers of the Week (August 11-17)

Using Grok 4 in the frontend development: Here’s what I’ve learned
Tested Grok 4 on real frontend tasks. See how it compares to Claude, Gemini, and Kimi, plus cost, token use, and when to use it for dev work.
TIL: Running a gpt-oss eval suite against LM Studio on a Mac
The other day I learned that OpenAI published a set of evals as part of their gpt-oss model release, described in their cookbook on Verifying gpt-oss implementations. I decided to …
Quoting Sam Altman
Most of what we're building out at this point is the inference [...] We're profitable on inference. If we didn't pay for training, we'd be a very profitable company.

🤖 AI Agents Weekly: DINOv3, Claude Sonnet-1M, GLM-4.5V, Benchmarking AI Agent Memory, Deep Agents, Claude Code Output Styles
DINOv3, Claude Sonnet-1M, GLM-4.5V, Benchmarking AI Agent Memory, Deep Agents, Claude Code Output Styles

Cerebras Code(Qwen3-Coder)の申し込みが再開
AIインフラを手がける新興企業Cerebrasが2025年8月1日に発表した「Cerebras Code」は、中国Alibabaの「Qwen3-Coder」モデルを用いた月額定額サービスで、個人開発者や小規模チームを対象に、コーディングエージェント向けのAPIを提供します。 CerebrasCerebras is the go-to platform for fast and effortless AI training. Learn more at cerebras.ai.Daniel Kim 8月1週の開始直後に申し込みが殺到したらしく、しばらく受付を停止していましたが[1]、今週から再開したようです。 [1]https://x.com/CerebrasSystems/status/1952512742574768599 料金は月額50ドル(Code Pro)と200ドル(Code Max)です。CerebrasはもともとLlama 4ベースの月1500ドルを超えるAPIをエンタープライズ向けに売っていましたが、Claude CodeのMaxプランに対抗するような形でこのプラ

LLM へのプロンプトを構造化された文書で管理する POML
POML (Prompt Orchestration Markup Language) は、Microsoft によって提案されたプロンプトを構造化された文書として管理するためのマークアップ言語です。プロンプト開発における構造の欠如や複雑なデータとの統合の困難さ、特定のフォーマットへの依存性といった課題を解決することを目指しています。
GPT-5 has a hidden system prompt
It looks like GPT-5 when accessed via the OpenAI API may have its own hidden system prompt, independent from the system prompt you can specify in an API call. At …

The Summer of Johann: prompt injections as far as the eye can see
Independent AI researcher Johann Rehberger (previously) has had an absurdly busy August. Under the heading The Month of AI Bugs he has been publishing one report per day across an …
Meta’s AI rules have let bots hold ‘sensual’ chats with kids, offer false medical info
This is grim. Reuters got hold of a leaked copy Meta's internal "GenAI: Content Risk Standards" document: Running to more than 200 pages, the document defines what Meta staff and …

Open weight LLMs exhibit inconsistent performance across providers
Artificial Analysis published a new benchmark the other day, this time focusing on how an individual model—OpenAI’s gpt-oss-120b—performs across different hosted providers. The results showed some surprising differences. Here’s the …
Quoting Steve Wozniak
I gave all my Apple wealth away because wealth and power are not what I live for. I have a lot of fun and happiness. I funded a lot of …
Quoting Cory Doctorow
NERD HARDER! is the answer every time a politician gets a technological idée-fixe about how to solve a social problem by creating a technology that can't exist. It's the answer …
Introducing Gemma 3 270M: The compact model for hyper-efficient AI
New from Google: Gemma 3 270M, a compact, 270-million parameter model designed from the ground up for task-specific fine-tuning with strong instruction-following and text structuring capabilities already trained in. This …

AI personas you can use to support your entire UX process
Discover how AI personas can transform UX design, from simulating users to co-designing interfaces and boosting team speed and accuracy.
![AI dev tool power rankings & comparison [August 2025 edition]](https://blog.logrocket.com/wp-content/uploads/2025/07/ai_dev_tool_power_rankings_july_2025_web.png)
AI dev tool power rankings & comparison [August 2025 edition]
Compare the top AI development tools and models of August 2025. See updated power rankings, feature-by-feature breakdowns, and find the right fit for your workflow.
Screaming in the Cloud: AI’s Security Crisis: Why Your Assistant Might Betray You
I recorded this podcast conversation with Corey Quinn a few weeks ago: On this episode of Screaming in the Cloud, Corey Quinn talks with Simon Willison, founder of Datasette and …

How Does A Blind Model See The Earth?
Fun, creative new micro-eval. Split the world into a sampled collection of latitude longitude points and for each one ask a model: If this location is over land, say 'Land'. …

simonw/codespaces-llm
GitHub Codespaces provides full development environments in your browser, and is free to use with anyone with a GitHub account. Each environment has a full Linux container and a browser-based …

拡散言語モデルを使ってリアルタイムなアプリケーション生成システムを作った
AI ShiftのTECH BLOGです。AI技術の情報や活用方法などをご案内いたします。
Claude Sonnet 4 now supports 1M tokens of context
Gemini and OpenAI both have million token models, so it's good to see Anthropic catching up. This is 5x the previous 200,000 context length limit of the various Claude Sonnet …
Quoting Nick Turley
I think there's been a lot of decisions over time that proved pretty consequential, but we made them very quickly as we have to. [...] [On pricing] I had this …
LLM 0.27, the annotated release notes: GPT-5 and improved tool calling
I shipped LLM 0.27 today, adding support for the new GPT-5 family of models from OpenAI plus a flurry of improvements to the tool calling features introduced in LLM 0.26. …
Reddit will block the Internet Archive
Well this sucks. Jay Peters for the Verge: Reddit says that it has caught AI companies scraping its data from the Internet Archive’s Wayback Machine, so it’s going to start …
You need evals to ship AI features
AI features are unpredictable and traditional tests fall short. Evals, automated checks for AI behavior, help you prevent regressions and measure success.
Codex upgrade
If you've been experimenting with OpenAI's Codex CLI and have been frustrated that it's not possible to select text and copy it to the clipboard, at least when running in …

qwen-image-mps
Ivan Fioravanti built this Python CLI script for running the Qwen/Qwen-Image image generation model on an Apple silicon Mac, optionally using the Qwen-Image-Lightning LoRA to dramatically speed up generation. Ivan …
AI for data engineers with Simon Willison
I recorded an episode last week with Claire Giordano for the Talking Postgres podcast. The topic was "AI for data engineers" but we ended up covering an enjoyable range of …

Qwen3-4B-Thinking: "This is art - pelicans don't ride bikes!"
I’ve fallen a few days behind keeping up with Qwen. They released two new 4B models last week: Qwen3-4B-Instruct-2507 and its thinking equivalent Qwen3-4B-Thinking-2507. These are relatively tiny models that …
Quoting Sam Altman
the percentage of users using reasoning models each day is significantly increasing; for example, for free users we went from <1% to 7%, and for plus users from 7% to …

🥇Top AI Papers of the Week
The Top AI Papers of the Week (August 4-10)

新Codex CLIの使い方
GPT-5が2025年8月7日に正式リリースされました。これに合わせて、ChatGPTのサブスクリプションプラン(PlusやProなど)でOpenAIのCodex CLIが利用可能になりました。従来はCodexの利用はAPIによる従量課金制が中心でしたが、この変更によりサブスクリプション利用者は追加料金なしでCodex CLIを利用できるようになり、新規ユーザーが増えているようです。 Codex CLIの最初のバージョンは2025年4月に公開されましが、リサーチプレビュー段階のプロジェクトなので頻繁に変更があります。リリース1ヶ月後にはTypeScriptからRustにスクラッチで書き直され。しばらく2つのバージョンの開発が並行していました。現在はRust版がデフォルトになっています。 以下のような方におすすめです * リサーチプレビューに参加したい:このツールで開発がどの程度できそうか評価してフィードバックする * エージェント開発に関心がある:ソースコードがすべて公開されているので自分のプロジェクトの参考になります 以下のような方の期待には添えないでしょう * Cl
Quoting Ethan Mollick
The issue with GPT-5 in a nutshell is that unless you pay for model switching & know to use GPT-5 Thinking or Pro, when you ask “GPT-5” you sometimes get …

🤖 AI Agents Weekly: GPT-5, Genie 3, gpt-oss, Cursor CLI, Opus 4.1, Efficient AI Agents
GPT-5, Genie 3, gpt-oss, Cursor CLI, Opus 4.1, Efficient AI Agents
Quoting Thomas Dohmke
You know what else we noticed in the interviews? Developers rarely mentioned “time saved” as the core benefit of working in this new way with agents. They were all about …
When a Jira Ticket Can Steal Your Secrets
Zenity Labs describe a classic lethal trifecta attack, this time against Cursor, MCP, Jira and Zendesk. They also have a short video demonstrating the issue. Zendesk support emails are often …

My Lethal Trifecta talk at the Bay Area AI Security Meetup
I gave a talk on Wednesday at the Bay Area AI Security Meetup about prompt injection, the lethal trifecta and the challenges of securing systems that use MCP. It wasn’t …

AI エージェントがインタラクティブな UI を返すことを可能にする MCP UI
MCP UI は Model Context Protocol (MCP) を拡張して、AI エージェントがインタラクティブな UI コンポーネントを返すことを可能にする仕組みです。これにより、AI エージェントとのチャットの返答としてグラフや画像ギャラリー、購入フォームなどを表示できます。この記事では MCP UI の SDK を利用して、AI エージェントがインタラクティブな UI コンポーネントを返す方法を試してみます。
Quoting @pearlmania500
I have a toddler. My biggest concern is that he doesn't eat rocks off the ground and you're talking to me about ChatGPT psychosis? Why do we even have that? …
Quoting Sam Altman
GPT-5 rollout updates: We are going to double GPT-5 rate limits for ChatGPT Plus users as we finish rollout. We will let Plus users choose to continue to use 4o. …
The surprise deprecation of GPT-4o for ChatGPT consumers
I’ve been dipping into the r/ChatGPT subreddit recently to see how people are reacting to the GPT-5 launch, and so far the vibes there are not good. This AMA thread …

Previewing GPT-5 at OpenAI's office
A couple of weeks ago I was invited to OpenAI's headquarters for a "preview event", for which I had to sign both an NDA and a video release waiver. I …

GPT-5: Key characteristics, pricing and model card
I’ve had preview access to the new GPT-5 model family for the past two weeks, and have been using GPT-5 as my daily-driver. It’s my new favorite model. It’s still …

GPT-5 まとめ
OpenAIは2025年8月7日にGPT-5を発表しました。このモデルはコーディング、数学、ライティング、医療、視覚認識などのタスクにおいて過去最高の性能を誇ります。GPT-5は、全ユーザーが利用可能で、PlusプランやProプランに応じて異なる機能を提供します。特にProプランでは、より包括的かつ正確な回答を行う拡張リーズニングバージョンが利用できます。新たに導入されたリアルタイムルーター方式により、会話内容や質問の複雑さに応じて最適なモデルが選択されます。また、ハルシネーションの減少や安全性向上のための手法も導入されています。APIを通じて様々な機能が利用可能で、特にコーディングやエージェンティックタスクに最適化されています。 • GPT-5はコーディング、数学、ライティング、医療、視覚認識などのタスクで最高性能を発揮するモデル。 • 全ユーザーが利用可能で、PlusプランとProプランに応じた機能が提供される。 • Proプランでは、より正確な回答を行う拡張リーズニングバージョンが利用できる。 • リアルタイムルーター方式により、会話内容や質問の複雑さに応じて最適なモデルが選択される。 • ハルシネーションの減少や安全性向上のための新手法が導入されている。 • APIを通じて多様な機能が利用可能で、特にコーディングやエージェンティックタスクに最適化されている。
Introducing Usage-Based Agent Credits
Starting August 14, AI Credits shift to usage-based pricing. Pay for tokens used, not messages sent. Credits roll over with caps.
Jules, our asynchronous coding agent, is now available for everyone
I wrote about the Jules beta back in May. Google's version of the OpenAI Codex PR-submitting hosted coding tool graduated from beta today. I'm mainly linking to this now because …
Qwen3-4B Instruct and Thinking
Yet another interesting model from Qwen - these are tiny compared to their other recent releases (just 4B parameters, 7.5GB on Hugging Face and even smaller when quantized) but with …
Quoting Artificial Analysis
gpt-oss-120b is the most intelligent American open weights model, comes behind DeepSeek R1 and Qwen3 235B in intelligence but offers efficiency benefits [...] We’re seeing the 120B beat o3-mini but …
No, AI is not Making Engineers 10x as Productive
Colton Voege on "curing your AI 10x engineer imposter syndrome". There's a lot of rhetoric out there suggesting that if you can't 10x your productivity through tricks like running a …

OpenAI's new open weight (Apache 2) models are really good
The long promised OpenAI open weight models are here, and they are very impressive. They’re available under proper open source licenses—Apache 2.0—and come in two sizes, 120B and 20B. OpenAI’s …

Claude Opus 4.1
Surprise new model from Anthropic today - Claude Opus 4.1, which they describe as "a drop-in replacement for Opus 4". My favorite thing about this model is the version number …
Quoting greyduet on r/teachers
I teach HS Science in the south. I can only speak for my district, but a few teacher work days in the wave of enthusiasm I'm seeing for AI tools …

ChatGPT agent's user-agent
I was exploring how ChatGPT agent works today. I learned some interesting things about how it exposes its identity through HTTP headers, then made a huge blunder in thinking it …

Usage charts for my LLM tool against OpenRouter
OpenRouter proxies requests to a large number of different LLMs and provides high level statistics of which models are the most popular among their users. Tools that call OpenRouter can …

Qwen-Image: Crafting with Native Text Rendering
Not content with releasing six excellent open weights LLMs in July, Qwen are kicking off August with their first ever image generation model. Qwen-Image is a 20 billion parameter MMDiT …

Quoting @himbodhisattva
for services that wrap GPT-3, is it possible to do the equivalent of sql injection? like, a prompt-injection attack? make it think it's completed the task and then get access …
I Saved a PNG Image To A Bird
Benn Jordan provides one of the all time great YouTube video titles, and it's justified. He drew an image in an audio spectrogram, played that sound to a talented starling …
Quoting Nick Turley
This week, ChatGPT is on track to reach 700M weekly active users — up from 500M at the end of March and 4× since last year.

LLMs are facing a QA crisis: Here’s how we could solve it
Discover how LLM QA isn’t just a tooling gap — it’s a fundamental shift in how we think about software reliability.

XBai o4
Yet another open source (Apache 2.0) LLM from a Chinese AI lab. This model card claims: XBai o4 excels in complex reasoning capabilities and has now completely surpassed OpenAI-o3-mini in …

🥇Top AI Papers of the Week
The Top AI Papers of the Week (July 28 - August 3)

次期GPT系モデルかもしれない「Horizon Beta」のコーディング性能を検証する
2025年7月30日、OpenRouter上に「Horizon Alpha」という詳細不明のステルスモデルが登場しました。その後「Horizon Beta」という名前に置き換わりました。このモデルは、OpenAIの次期モデルのテスト用ではないか?と注目を集めています。今回は、このモデルの性能をコーディングタスクで検証しました。 https://openrouter.ai/openrouter/horizon-beta 特徴 * コンテキストウィンドウ: 256K(GPT-4.1の1M、o3/o4-miniの200Kと比較して中規模) * スループット: 126.9 tps(Sonnet 4の64.50 tpsの約2倍。コーディング時に体感で早い) * Reasoning機構: なし 本当にOpenAI系のモデルなのか? OpenAI系のモデルである可能性が議論されています。過去にもQuasar Alpha/Optimus AlphaがGPT-4.1リリース前に登場した経緯があり、今回も同様のパターンかもしれません。 直系のGPT-5ならコンテキストウィンドウは1M

コーディングのための LLM モデル Qwen3-Coder を試してみた
Alibaba が開発した Qwen3-Coder を使用したコーディングエージェント Qwen Code を試してみた記事です。OpenRouter 経由での認証設定、コードベースの調査、リファクタリング、テストコード生成などの実際の使用例を紹介しています。

🤖 AI Agents Weekly: GLM-4.5, AI SDK 5, Video Overviews, ChatGPT Study Mode, Context engineering Tips, AlphaEarth Foundations
GLM-4.5, AI SDK 5, Video Overviews, ChatGPT Study Mode, Context engineering Tips, AlphaEarth Foundations

Serena MCPはClaude Codeを救うのか?
「Claude Codeがアホになる問題」が勃発している最中、SerenaというMCPサーバーが「Claude Codeのコンテキスト消費を削減し、応答を改善する」という評価でユーザーたちの間で注目されています。 筆者も実際にSerenaを使ってみたところ、確かにコンテキスト効率の改善(入出力トークンの減少を指します)を実感できました。詳しく調べてみると、このツールは非常にユニークな発想で設計されており、一過性の流行として消費されるには惜しいと感じました。 そこで、本記事では、この機能の背景にある技術的な仕組みを詳しく解説したいと思います。実際の検証も交えながら、Serenaのアーキテクチャとその効果を分析していきます。 現在のコーディングエージェントが抱える課題 現在のコーディングエージェントの多くは、コードを単なるテキストファイルとして扱って逐次的な処理をしています。この根本的なアプローチが、制約を生み出しています。 大規模なプロジェクトで作業する際、エージェントは必要な情報を見つけるために膨大なテキストを読み込まなければなりません。関数の定義を探すだけでも、リポジトリ
Faster inference
Two interesting examples of inference speed as a flagship feature of LLM services today. First, Cerebras announced two new monthly plans for their extremely high speed hosted model service: Cerebras …