Last updated: 2025/09/03 11:01
Cursor vs Claude Code: The Ultimate Comparison Guide
Cursor or Claude Code? Both start at $20/mo but work differently. Compare features, hidden costs, and real workflows to pick the right AI coding tool.

Rich Pixels
Neat Python library by Darren Burns adding pixel image support to the Rich terminal library, using tricks to render an image using full or half-height colored blocks. Here's the key …
August 2025 newsletter
I just sent out my August 2025 sponsors-only newsletter summarizing the past month in LLMs and my other work. Topics included GPT-5, gpt-oss, image editing models (Qwen-Image-Edit and Gemini Nano …
Introducing gpt-realtime
Released a few days ago (August 28th), gpt-realtime is OpenAI's new "most advanced speech-to-speech model". It looks like this is a replacement for the older gpt-4o-realtime-preview model that was released …

Cloudflare Radar: AI Insights
Cloudflare launched this dashboard back in February, incorporating traffic analysis from Cloudflare's network along with insights from their popular 1.1.1.1 DNS service. I found this chart particularly interesting, showing which …

LongCat-Flash: Deploying Meituan's Agentic Model with SGLang
<h3><a id="1-introduction-deploying-meituans-agentic-open-source-moe-model" class="anchor" href="#1-introduction-deploying-meituans-agentic-open-source-moe-m...

🥇Top AI Papers of the Week
The Top AI Papers of the Week (August 25-31)

エンティティリンキングの性能改善のための効果的な絞り込み手法の検証
AI ShiftのTECH BLOGです。AI技術の情報や活用方法などをご案内いたします。
Claude Opus 4.1 and Opus 4 degraded quality
Notable because often when people complain of degraded model quality it turns out to be unfounded - Anthropic in the past have emphasized that they don't change the model weights …

🤖 AI Agents Weekly: Gemini 2.5 Flash Image, gpt-realtime, Anemoi Agent, Fine-tuning LLM Agents, Codex Updates, Agent Client Protocol
Gemini 2.5 Flash Image, gpt-realtime, Anemoi Agent, Fine-tuning LLM Agents, Codex Updates, Agent Client Protocol
Quoting Benj Edwards
LLMs are intelligence without agency—what we might call "vox sine persona": voice without person. Not the voice of someone, not even the collective voice of many someones, but a voice …

AI コーディングエージェントの管理を行う Vibe Kanban を試してみた
Vibe Kanban は、AI コーディングエージェントの管理を支援するためのツールです。カンバン方式の UI でタスク管理を行い、各タスクに対して AI エージェントを割り当てて人間がその進捗を管理できます。この記事では Vibe Kanban を使用して AI コーディングエージェントの管理を実際に試してみます。

The perils of vibe coding
I was interviewed by Elaine Moore for this opinion piece in the Financial Times, which ended up in the print edition of the paper too! I picked up a copy …

How to build a multimodal AI app with voice and vision in Next.js
Learn how to build multimodal AI interactions to process images, audio, and even real-time video streams, using Next.js and Gemini.
Lossy encyclopedia
Since I love collecting questionable analogies for LLMs, here's a new one I just came up with: an LLM is a lossy encyclopedia. They have a huge array of facts …
Python: The Documentary
New documentary about the origins of the Python programming language - 84 minutes long, built around extensive interviews with Guido van Rossum and others who were there at the start …

I tried out Kiro: Here’s what I learned
Check out Kiro, AWS's AI-powered IDE, see what makes it different from other AI coding tools, and explore whether it lives up to the hype.

Finetune and deploy GPT-OSS in MXFP4: ModelOpt+SGLang
<p>GPT-OSS, the first open-source model family from OpenAI's lab since GPT-2, demonstrates strong math, coding, and general capabilities even when compared w...
Quoting Bruce Schneier
We simply don’t know to defend against these attacks. We have zero agentic AI systems that are secure against these attacks. Any AI that is working in an adversarial environment—and …

How to protect your AI agent from prompt injection attacks
Explore six principled design patterns (with real-world examples) to help you protect your LLM agents from prompt injection attacks.

SGLang for gpt-oss: From Day 0 Support to Enhanced Performance
<p>We are excited to announce a major update for SGLang, focusing on deep performance optimizations and new features for the recently released openai/gpt-oss...
Piloting Claude for Chrome
Two days ago I said: I strongly expect that the entire concept of an agentic browser extension is fatally flawed and cannot be built safely. Today Anthropic announced their own …

User agent strings to HTTP signatures - methods for AI agent identification
How to verify AI agent identity using HTTP message signatures with TypeScript.

Qwen3-Coder: Is this Agentic CLI smarter than senior devs?
Discover Qwen3-Coder, Alibaba’s 480B parameter agentic coding CLI, with real-world tests, use cases, and performance insights.
Will Smith’s concert crowds are real, but AI is blurring the lines
Great piece from Andy Baio demonstrating quite how convoluted the usage ethics and backlash against generative AI has become. Will Smith has been accused of using AI to misleadingly inflate …
Agentic Browser Security: Indirect Prompt Injection in Perplexity Comet
The security team from Brave took a look at Comet, the LLM-powered "agentic browser" extension from Perplexity, and unsurprisingly found security holes you can drive a truck through. The vulnerability …

🥇Top AI Papers of the Week
The Top AI Papers of the Week (August 18-24)

🤖 Agents Weekly: DeepSeek-V3.1, AGENTS.md, URL Context, Context Engineering Tips, Qwen-Image-Edit
DeepSeek-V3.1, AGENTS.md, URL Context, Context Engineering Tips, Qwen-Image-Edit
ChatGPT release notes: Project-only memory
The feature I've most wanted from ChatGPT's memory feature (the newer version of memory that automatically includes relevant details from summarized prior conversations) just landed: With project-only memory enabled, ChatGPT …

DeepSeek 3.1
The latest model from DeepSeek, a 685B monster (like DeepSeek v3 before it) but this time it's a hybrid reasoning model. DeepSeek claim: DeepSeek-V3.1-Think achieves comparable answer quality to DeepSeek-R1-0528, …
Quoting The Bluesky Team
Mississippi's approach would fundamentally change how users access Bluesky. The Supreme Court’s recent decision leaves us facing a hard reality: comply with Mississippi’s age assurance law—and make every Mississippi Bluesky …

Agentic AI for 5x less: Why Kimi K2 is a frontend game-changer
Discover how to integrate Kimi K2 agentic mode into a frontend application, and learn how it compares to DeepSeek.
too many model context protocol servers and LLM allocations on the dance floor
Useful reminder from Geoffrey Huntley of the infrequently discussed significant token cost of using MCP. Geoffrey estimate estimates that the usable context window something like Amp or Cursor is around …
Quoting potatolicious
Most classical engineering fields deal with probabilistic system components all of the time. In fact I'd go as far as to say that inability to deal with probabilistic components is …
Quoting Matt Garman
I was at a leadership group and people were telling me "We think that with AI we can replace all of our junior people in our company." I was like, …
Quoting Mustafa Suleyman
Simply put, my central worry is that many people will start to believe in the illusion of AIs as conscious entities so strongly that they’ll soon advocate for AI rights, …

ターンテイキングのタイミング予測を簡単に試せるライブラリMaAIを使ってみた
AI ShiftのTECH BLOGです。AI技術の情報や活用方法などをご案内いたします。
Quoting u/AssafMalkiIL
what’s the point of vibe coding if at the end of the day i still gotta pay a dev to look at the code anyway. sure it feels kinda cool …

David Ho on BlueSky: A pelican tried to eat my bike
David Ho caught video footage of one of the pelicans in St James's Park expressing deep curiosity in his bicycle. I think it wants to ride it.

Does Gemini CLI fall short? Here’s how Codex compares
Compare Codex CLI vs Gemini CLI for real-world coding tasks. See strengths, weaknesses, and which AI CLI fits your developer workflow best.

Qwen-Image-Edit: Image Editing with Higher Quality and Efficiency
As promised in their August 4th release of the Qwen image generation model, Qwen have now followed it up with a separate model, Qwen-Image-Edit, which can take an image and …

llama.cpp guide: running gpt-oss with llama.cpp
Really useful official guide to running the OpenAI gpt-oss models using llama-server from llama.cpp - which provides an OpenAI-compatible localhost API and a neat web interface for interacting with the …
PyPI: Preventing Domain Resurrection Attacks
Domain resurrection attacks are a nasty vulnerability in systems that use email verification to allow people to recover their accounts. If somebody lets their domain name expire an attacker might …

Don’t let AI erase the next generation of dev leaders
If AI snaps up all of their opportunities to learn, junior engineers can never grow into senior roles. Then who’s left to lead the engineering teams of the future?
r/ChatGPTPro: What is the most profitable thing you have done with ChatGPT?
This Reddit thread - with 279 replies - offers a neat targeted insight into the kinds of things people are using ChatGPT for. Lots of variety here but two themes …
Google Gemini URL Context
New feature in the Gemini API: you can now enable a url_context tool which the models can use to request the contents of URLs as part of replying to a …

🥇Top AI Papers of the Week
The Top AI Papers of the Week (August 11-17)

Using Grok 4 in the frontend development: Here’s what I’ve learned
Tested Grok 4 on real frontend tasks. See how it compares to Claude, Gemini, and Kimi, plus cost, token use, and when to use it for dev work.
TIL: Running a gpt-oss eval suite against LM Studio on a Mac
The other day I learned that OpenAI published a set of evals as part of their gpt-oss model release, described in their cookbook on Verifying gpt-oss implementations. I decided to …
Quoting Sam Altman
Most of what we're building out at this point is the inference [...] We're profitable on inference. If we didn't pay for training, we'd be a very profitable company.

🤖 AI Agents Weekly: DINOv3, Claude Sonnet-1M, GLM-4.5V, Benchmarking AI Agent Memory, Deep Agents, Claude Code Output Styles
DINOv3, Claude Sonnet-1M, GLM-4.5V, Benchmarking AI Agent Memory, Deep Agents, Claude Code Output Styles

Cerebras Code(Qwen3-Coder)の申し込みが再開
AIインフラを手がける新興企業Cerebrasが2025年8月1日に発表した「Cerebras Code」は、中国Alibabaの「Qwen3-Coder」モデルを用いた月額定額サービスで、個人開発者や小規模チームを対象に、コーディングエージェント向けのAPIを提供します。 CerebrasCerebras is the go-to platform for fast and effortless AI training. Learn more at cerebras.ai.Daniel Kim 8月1週の開始直後に申し込みが殺到したらしく、しばらく受付を停止していましたが[1]、今週から再開したようです。 [1]https://x.com/CerebrasSystems/status/1952512742574768599 料金は月額50ドル(Code Pro)と200ドル(Code Max)です。CerebrasはもともとLlama 4ベースの月1500ドルを超えるAPIをエンタープライズ向けに売っていましたが、Claude CodeのMaxプランに対抗するような形でこのプラ

LLM へのプロンプトを構造化された文書で管理する POML
POML (Prompt Orchestration Markup Language) は、Microsoft によって提案されたプロンプトを構造化された文書として管理するためのマークアップ言語です。プロンプト開発における構造の欠如や複雑なデータとの統合の困難さ、特定のフォーマットへの依存性といった課題を解決することを目指しています。
GPT-5 has a hidden system prompt
It looks like GPT-5 when accessed via the OpenAI API may have its own hidden system prompt, independent from the system prompt you can specify in an API call. At …

The Summer of Johann: prompt injections as far as the eye can see
Independent AI researcher Johann Rehberger (previously) has had an absurdly busy August. Under the heading The Month of AI Bugs he has been publishing one report per day across an …
Meta’s AI rules have let bots hold ‘sensual’ chats with kids, offer false medical info
This is grim. Reuters got hold of a leaked copy Meta's internal "GenAI: Content Risk Standards" document: Running to more than 200 pages, the document defines what Meta staff and …

Open weight LLMs exhibit inconsistent performance across providers
Artificial Analysis published a new benchmark the other day, this time focusing on how an individual model—OpenAI’s gpt-oss-120b—performs across different hosted providers. The results showed some surprising differences. Here’s the …
Quoting Steve Wozniak
I gave all my Apple wealth away because wealth and power are not what I live for. I have a lot of fun and happiness. I funded a lot of …
Quoting Cory Doctorow
NERD HARDER! is the answer every time a politician gets a technological idée-fixe about how to solve a social problem by creating a technology that can't exist. It's the answer …
Introducing Gemma 3 270M: The compact model for hyper-efficient AI
New from Google: Gemma 3 270M, a compact, 270-million parameter model designed from the ground up for task-specific fine-tuning with strong instruction-following and text structuring capabilities already trained in. This …

AI personas you can use to support your entire UX process
Discover how AI personas can transform UX design, from simulating users to co-designing interfaces and boosting team speed and accuracy.
![AI dev tool power rankings & comparison [August 2025 edition]](https://blog.logrocket.com/wp-content/uploads/2025/07/ai_dev_tool_power_rankings_july_2025_web.png)
AI dev tool power rankings & comparison [August 2025 edition]
Compare the top AI development tools and models of August 2025. See updated power rankings, feature-by-feature breakdowns, and find the right fit for your workflow.
Screaming in the Cloud: AI’s Security Crisis: Why Your Assistant Might Betray You
I recorded this podcast conversation with Corey Quinn a few weeks ago: On this episode of Screaming in the Cloud, Corey Quinn talks with Simon Willison, founder of Datasette and …

How Does A Blind Model See The Earth?
Fun, creative new micro-eval. Split the world into a sampled collection of latitude longitude points and for each one ask a model: If this location is over land, say 'Land'. …

simonw/codespaces-llm
GitHub Codespaces provides full development environments in your browser, and is free to use with anyone with a GitHub account. Each environment has a full Linux container and a browser-based …

拡散言語モデルを使ってリアルタイムなアプリケーション生成システムを作った
AI ShiftのTECH BLOGです。AI技術の情報や活用方法などをご案内いたします。
Claude Sonnet 4 now supports 1M tokens of context
Gemini and OpenAI both have million token models, so it's good to see Anthropic catching up. This is 5x the previous 200,000 context length limit of the various Claude Sonnet …
Quoting Nick Turley
I think there's been a lot of decisions over time that proved pretty consequential, but we made them very quickly as we have to. [...] [On pricing] I had this …
LLM 0.27, the annotated release notes: GPT-5 and improved tool calling
I shipped LLM 0.27 today, adding support for the new GPT-5 family of models from OpenAI plus a flurry of improvements to the tool calling features introduced in LLM 0.26. …
Reddit will block the Internet Archive
Well this sucks. Jay Peters for the Verge: Reddit says that it has caught AI companies scraping its data from the Internet Archive’s Wayback Machine, so it’s going to start …
You need evals to ship AI features
AI features are unpredictable and traditional tests fall short. Evals, automated checks for AI behavior, help you prevent regressions and measure success.
Codex upgrade
If you've been experimenting with OpenAI's Codex CLI and have been frustrated that it's not possible to select text and copy it to the clipboard, at least when running in …

qwen-image-mps
Ivan Fioravanti built this Python CLI script for running the Qwen/Qwen-Image image generation model on an Apple silicon Mac, optionally using the Qwen-Image-Lightning LoRA to dramatically speed up generation. Ivan …
AI for data engineers with Simon Willison
I recorded an episode last week with Claire Giordano for the Talking Postgres podcast. The topic was "AI for data engineers" but we ended up covering an enjoyable range of …

Qwen3-4B-Thinking: "This is art - pelicans don't ride bikes!"
I’ve fallen a few days behind keeping up with Qwen. They released two new 4B models last week: Qwen3-4B-Instruct-2507 and its thinking equivalent Qwen3-4B-Thinking-2507. These are relatively tiny models that …
Quoting Sam Altman
the percentage of users using reasoning models each day is significantly increasing; for example, for free users we went from <1% to 7%, and for plus users from 7% to …

🥇Top AI Papers of the Week
The Top AI Papers of the Week (August 4-10)

新Codex CLIの使い方
GPT-5が2025年8月7日に正式リリースされました。これに合わせて、ChatGPTのサブスクリプションプラン(PlusやProなど)でOpenAIのCodex CLIが利用可能になりました。従来はCodexの利用はAPIによる従量課金制が中心でしたが、この変更によりサブスクリプション利用者は追加料金なしでCodex CLIを利用できるようになり、新規ユーザーが増えているようです。 Codex CLIの最初のバージョンは2025年4月に公開されましが、リサーチプレビュー段階のプロジェクトなので頻繁に変更があります。リリース1ヶ月後にはTypeScriptからRustにスクラッチで書き直され。しばらく2つのバージョンの開発が並行していました。現在はRust版がデフォルトになっています。 以下のような方におすすめです * リサーチプレビューに参加したい:このツールで開発がどの程度できそうか評価してフィードバックする * エージェント開発に関心がある:ソースコードがすべて公開されているので自分のプロジェクトの参考になります 以下のような方の期待には添えないでしょう * Cl
Quoting Ethan Mollick
The issue with GPT-5 in a nutshell is that unless you pay for model switching & know to use GPT-5 Thinking or Pro, when you ask “GPT-5” you sometimes get …

🤖 AI Agents Weekly: GPT-5, Genie 3, gpt-oss, Cursor CLI, Opus 4.1, Efficient AI Agents
GPT-5, Genie 3, gpt-oss, Cursor CLI, Opus 4.1, Efficient AI Agents
Quoting Thomas Dohmke
You know what else we noticed in the interviews? Developers rarely mentioned “time saved” as the core benefit of working in this new way with agents. They were all about …
When a Jira Ticket Can Steal Your Secrets
Zenity Labs describe a classic lethal trifecta attack, this time against Cursor, MCP, Jira and Zendesk. They also have a short video demonstrating the issue. Zendesk support emails are often …

My Lethal Trifecta talk at the Bay Area AI Security Meetup
I gave a talk on Wednesday at the Bay Area AI Security Meetup about prompt injection, the lethal trifecta and the challenges of securing systems that use MCP. It wasn’t …

AI エージェントがインタラクティブな UI を返すことを可能にする MCP UI
MCP UI は Model Context Protocol (MCP) を拡張して、AI エージェントがインタラクティブな UI コンポーネントを返すことを可能にする仕組みです。これにより、AI エージェントとのチャットの返答としてグラフや画像ギャラリー、購入フォームなどを表示できます。この記事では MCP UI の SDK を利用して、AI エージェントがインタラクティブな UI コンポーネントを返す方法を試してみます。
Quoting @pearlmania500
I have a toddler. My biggest concern is that he doesn't eat rocks off the ground and you're talking to me about ChatGPT psychosis? Why do we even have that? …
Quoting Sam Altman
GPT-5 rollout updates: We are going to double GPT-5 rate limits for ChatGPT Plus users as we finish rollout. We will let Plus users choose to continue to use 4o. …
The surprise deprecation of GPT-4o for ChatGPT consumers
I’ve been dipping into the r/ChatGPT subreddit recently to see how people are reacting to the GPT-5 launch, and so far the vibes there are not good. This AMA thread …

Previewing GPT-5 at OpenAI's office
A couple of weeks ago I was invited to OpenAI's headquarters for a "preview event", for which I had to sign both an NDA and a video release waiver. I …

GPT-5: Key characteristics, pricing and model card
I’ve had preview access to the new GPT-5 model family for the past two weeks, and have been using GPT-5 as my daily-driver. It’s my new favorite model. It’s still …

GPT-5 まとめ
OpenAIは2025年8月7日にGPT-5を発表しました。このモデルはコーディング、数学、ライティング、医療、視覚認識などのタスクにおいて過去最高の性能を誇ります。GPT-5は、全ユーザーが利用可能で、PlusプランやProプランに応じて異なる機能を提供します。特にProプランでは、より包括的かつ正確な回答を行う拡張リーズニングバージョンが利用できます。新たに導入されたリアルタイムルーター方式により、会話内容や質問の複雑さに応じて最適なモデルが選択されます。また、ハルシネーションの減少や安全性向上のための手法も導入されています。APIを通じて様々な機能が利用可能で、特にコーディングやエージェンティックタスクに最適化されています。 • GPT-5はコーディング、数学、ライティング、医療、視覚認識などのタスクで最高性能を発揮するモデル。 • 全ユーザーが利用可能で、PlusプランとProプランに応じた機能が提供される。 • Proプランでは、より正確な回答を行う拡張リーズニングバージョンが利用できる。 • リアルタイムルーター方式により、会話内容や質問の複雑さに応じて最適なモデルが選択される。 • ハルシネーションの減少や安全性向上のための新手法が導入されている。 • APIを通じて多様な機能が利用可能で、特にコーディングやエージェンティックタスクに最適化されている。
Introducing Usage-Based Agent Credits
Starting August 14, AI Credits shift to usage-based pricing. Pay for tokens used, not messages sent. Credits roll over with caps.
Jules, our asynchronous coding agent, is now available for everyone
I wrote about the Jules beta back in May. Google's version of the OpenAI Codex PR-submitting hosted coding tool graduated from beta today. I'm mainly linking to this now because …
Qwen3-4B Instruct and Thinking
Yet another interesting model from Qwen - these are tiny compared to their other recent releases (just 4B parameters, 7.5GB on Hugging Face and even smaller when quantized) but with …
Quoting Artificial Analysis
gpt-oss-120b is the most intelligent American open weights model, comes behind DeepSeek R1 and Qwen3 235B in intelligence but offers efficiency benefits [...] We’re seeing the 120B beat o3-mini but …
No, AI is not Making Engineers 10x as Productive
Colton Voege on "curing your AI 10x engineer imposter syndrome". There's a lot of rhetoric out there suggesting that if you can't 10x your productivity through tricks like running a …

OpenAI's new open weight (Apache 2) models are really good
The long promised OpenAI open weight models are here, and they are very impressive. They’re available under proper open source licenses—Apache 2.0—and come in two sizes, 120B and 20B. OpenAI’s …

Claude Opus 4.1
Surprise new model from Anthropic today - Claude Opus 4.1, which they describe as "a drop-in replacement for Opus 4". My favorite thing about this model is the version number …
Quoting greyduet on r/teachers
I teach HS Science in the south. I can only speak for my district, but a few teacher work days in the wave of enthusiasm I'm seeing for AI tools …

ChatGPT agent's user-agent
I was exploring how ChatGPT agent works today. I learned some interesting things about how it exposes its identity through HTTP headers, then made a huge blunder in thinking it …

Usage charts for my LLM tool against OpenRouter
OpenRouter proxies requests to a large number of different LLMs and provides high level statistics of which models are the most popular among their users. Tools that call OpenRouter can …

Qwen-Image: Crafting with Native Text Rendering
Not content with releasing six excellent open weights LLMs in July, Qwen are kicking off August with their first ever image generation model. Qwen-Image is a 20 billion parameter MMDiT …

Quoting @himbodhisattva
for services that wrap GPT-3, is it possible to do the equivalent of sql injection? like, a prompt-injection attack? make it think it's completed the task and then get access …
I Saved a PNG Image To A Bird
Benn Jordan provides one of the all time great YouTube video titles, and it's justified. He drew an image in an audio spectrogram, played that sound to a talented starling …
Quoting Nick Turley
This week, ChatGPT is on track to reach 700M weekly active users — up from 500M at the end of March and 4× since last year.

LLMs are facing a QA crisis: Here’s how we could solve it
Discover how LLM QA isn’t just a tooling gap — it’s a fundamental shift in how we think about software reliability.

XBai o4
Yet another open source (Apache 2.0) LLM from a Chinese AI lab. This model card claims: XBai o4 excels in complex reasoning capabilities and has now completely surpassed OpenAI-o3-mini in …

🥇Top AI Papers of the Week
The Top AI Papers of the Week (July 28 - August 3)

次期GPT系モデルかもしれない「Horizon Beta」のコーディング性能を検証する
2025年7月30日、OpenRouter上に「Horizon Alpha」という詳細不明のステルスモデルが登場しました。その後「Horizon Beta」という名前に置き換わりました。このモデルは、OpenAIの次期モデルのテスト用ではないか?と注目を集めています。今回は、このモデルの性能をコーディングタスクで検証しました。 https://openrouter.ai/openrouter/horizon-beta 特徴 * コンテキストウィンドウ: 256K(GPT-4.1の1M、o3/o4-miniの200Kと比較して中規模) * スループット: 126.9 tps(Sonnet 4の64.50 tpsの約2倍。コーディング時に体感で早い) * Reasoning機構: なし 本当にOpenAI系のモデルなのか? OpenAI系のモデルである可能性が議論されています。過去にもQuasar Alpha/Optimus AlphaがGPT-4.1リリース前に登場した経緯があり、今回も同様のパターンかもしれません。 直系のGPT-5ならコンテキストウィンドウは1M

コーディングのための LLM モデル Qwen3-Coder を試してみた
Alibaba が開発した Qwen3-Coder を使用したコーディングエージェント Qwen Code を試してみた記事です。OpenRouter 経由での認証設定、コードベースの調査、リファクタリング、テストコード生成などの実際の使用例を紹介しています。

🤖 AI Agents Weekly: GLM-4.5, AI SDK 5, Video Overviews, ChatGPT Study Mode, Context engineering Tips, AlphaEarth Foundations
GLM-4.5, AI SDK 5, Video Overviews, ChatGPT Study Mode, Context engineering Tips, AlphaEarth Foundations

Serena MCPはClaude Codeを救うのか?
「Claude Codeがアホになる問題」が勃発している最中、SerenaというMCPサーバーが「Claude Codeのコンテキスト消費を削減し、応答を改善する」という評価でユーザーたちの間で注目されています。 筆者も実際にSerenaを使ってみたところ、確かにコンテキスト効率の改善(入出力トークンの減少を指します)を実感できました。詳しく調べてみると、このツールは非常にユニークな発想で設計されており、一過性の流行として消費されるには惜しいと感じました。 そこで、本記事では、この機能の背景にある技術的な仕組みを詳しく解説したいと思います。実際の検証も交えながら、Serenaのアーキテクチャとその効果を分析していきます。 現在のコーディングエージェントが抱える課題 現在のコーディングエージェントの多くは、コードを単なるテキストファイルとして扱って逐次的な処理をしています。この根本的なアプローチが、制約を生み出しています。 大規模なプロジェクトで作業する際、エージェントは必要な情報を見つけるために膨大なテキストを読み込まなければなりません。関数の定義を探すだけでも、リポジトリ
Faster inference
Two interesting examples of inference speed as a flagship feature of LLM services today. First, Cerebras announced two new monthly plans for their extremely high speed hosted model service: Cerebras …

Deep Think in the Gemini app
Google released Gemini 2.5 Deep Think this morning, exclusively to their Ultra ($250/month) subscribers: It is a variation of the model that recently achieved the gold-medal standard at this year's …
July newsletter for sponors is out
This morning I sent out the third edition of my LLM digest newsletter for my $10/month and higher sponsors on GitHub. It included the following section headers: Claude Code Model …
Quoting Logan Kilpatrick
Gemini Deep Think, our SOTA model with parallel thinking that won the IMO Gold Medal 🥇, is now available in the Gemini App for Ultra subscribers!! [...] Quick correction: this …

Reverse engineering some updates to Claude
Anthropic released two major new features for their consumer-facing Claude apps in the past couple of days. Sadly, they don’t do a very good job of updating the release notes …
Quoting Christina Wodtke
The old timers who built the early web are coding with AI like it's 1995. Think about it: They gave blockchain the sniff test and walked away. Ignored crypto (and …
More model releases on 31st July
Here are a few more model releases from today, to round out a very busy July: Cohere released Command A Vision, their first multi-modal (image input) LLM. Like their others …

Trying out Qwen3 Coder Flash using LM Studio and Open WebUI and LLM
Qwen just released their sixth model(!) for this July called Qwen3-Coder-30B-A3B-Instruct—listed as Qwen3-Coder-Flash in their chat.qwen.ai interface. It’s 30.5B total parameters with 3.3B active at any one time. This means …

Windsurf vs. Cursor: When to choose the challenger
Explore Windsurf AI’s Cascade agent, IDE integration, pricing, and how it stacks up against Cursor in this hands-on developer-focused comparison.

Claude Codeがアホになる問題
最近一部のClaude Codeユーザーの間で「性能が急激に劣化している」という報告が多発しています。具体的には、指示の内容を忘れて見当違いの作業をするというもので「これはClaude Codeのコンテキスト処理の問題ではないか?」と憶測を呼んでいます。 ※この話題はバージョン1.0.63時点のものです。 「バージョン1.0.24に固定せよ」 この問題に対して、ユーザーからの報告と対処法が以下で擬音されています。 Critical: Claude Code context amnesia causes silent code deletion · Issue #4487 · anthropics/claude-codeEnvironment Platform: Claude Code CLI Claude CLI version: 1.0.61 Operating System: macOS 15.5 (Build 24F74) Terminal: Terminal App

Ollama's new app
Ollama has been one of my favorite ways to run local models for a while - it makes it really easy to download models, and it's smart about keeping them …

GLM-4.5 Meets SGLang: Reasoning, Coding, and Agentic Abilities
<p>Today, we are excited to introduce our latest flagship models <a href="https://huggingface.co/zai-org/GLM-4.5">GLM-4.5</a> and <a href="https://huggingfac...
Quoting Steve Krouse
When you vibe code, you are incurring tech debt as fast as the LLM can spit it out. Which is why vibe coding is perfect for prototypes and throwaway projects: …

The best available open weight LLMs now come from China
Something that has become undeniable this month is that the best available open weight models now come from the Chinese AI labs. I continue to have a lot of love …

Qwen3-30B-A3B-Thinking-2507
Yesterday was Qwen3-30B-A3B-Instruct-2507. Qwen are clearly committed to their new split between reasoning and non-reasoning models (a reversal from Qwen 3 in April), because today they released the new reasoning …
OpenAI: Introducing study mode
New ChatGPT feature, which can be triggered by typing /study or by visiting chatgpt.com/studymode. OpenAI say: Under the hood, study mode is powered by custom system instructions we’ve written in …

Qwen/Qwen3-30B-A3B-Instruct-2507
New model update from Qwen, improving on their previous Qwen3-30B-A3B release from late April. In their tweet they said: Smarter, faster, and local deployment-friendly. ✨ Key Enhancements: ✅ Enhanced reasoning, …
Quoting Nilay Patel
Our plan is to build direct traffic to our site. and newsletters just one kind of direct traffic in the end. I don’t intend to ever rely on someone else’s …

LLMエージェントオブサーバビリティ基盤についてまとめてみた
AI ShiftのTECH BLOGです。AI技術の情報や活用方法などをご案内いたします。
Quoting Anthropic
We’re rolling out new weekly rate limits for Claude Pro and Max in late August. We estimate they’ll apply to less than 5% of subscribers based on current usage. [...] …

GLM-4.5: Reasoning, Coding, and Agentic Abililties
Another day, another significant new open weight model release from a Chinese frontier AI lab. This time it's Z.ai - who rebranded (at least in English) from Zhipu AI a …
Enough AI copilots! We need AI HUDs
Geoffrey Litt compares Copilots - AI assistants that you engage in dialog with and work with you to complete a task - with HUDs, Head-Up Displays, which enhance your working …

🥇Top AI Papers of the Week
The Top AI Papers of the Week (July 21 - 27)

Kimi K2とLLMのベンチマークスコア
Kimi K2は、中国のMoonshot AIが開発したオープンウェイトの大規模言語モデルです。2025年1月20日に公開されたKimi k1.5以来のKimiの第4世代目のモデルです。 Kimi K2: Open Agentic IntelligenceKimi K2 is our latest Mixture-of-Experts model with 32 billion activated parameters and 1 trillion total parameters. It achieves state-of-the-art performance in frontier knowledge, math, and coding among non-thinking models. 特徴として、128Kトークンのコンテキストウィンドウがあります。参考までにClaude 4が200kでGemini 2.5 が100M。Grok4は256kです。 また、

🤖 AI Agents Weekly: Lovable Agents, GitHub Spark, Qwen3-Coder, Search Arena, Awesome Context Engineering
Lovable Agents, GitHub Spark, Qwen3-Coder, Search Arena, Awesome Context Engineering
Official statement from Tea on their data leak
Tea is a dating safety app for women that lets them share notes about potential dates. The other day it was subject to a truly egregious data leak caused by …

完全自律型AIエージェントのベンチマーク(2): Codex、Jules、OpenHandsを加えて
TL;DR * Devinは長時間タスクの完走能力が他のエージェントより優れています。その分コストも高いです。 * Claude Code Actionはタスク実行速度が最も速く、成功率も高いです。コストパフォーマンスも高いです。 * その他のエージェントは内部セッションタイムアウトがあり、タスクを中断します。長時間タスクには向きません。 最終結果 エージェント名 完了問題数/実行時間 コスト 1問あたり 正解数/正解率 結果 🏅Devin 98問/216分 $36 $0.37 92問/91.1% 長時間タスク完遂能力抜群、コスト高 🥈Claude Code Action 92問/42分 $7.89 $0.09 65問/64.4% 最速・高コスパ 🥉GitHub Copilot Coding Agent

Claude Code でカスタムサブエージェントを作成する
Claude Code では特定の種類のタスクを処理するために呼び出されるカスタムサブエージェントを作成できます。カスタムサブエージェントを使用することでメインの会話セッションとは別に独立したコンテキストウィンドウを持つことができ、コンテキストの汚染を防ぐことができます。この記事では、Claude Code でカスタムサブエージェントを作成する方法とその利点について解説します。

Qwen3-235B-A22B-Thinking-2507
The third Qwen model release week, following Qwen3-235B-A22B-Instruct-2507 on Monday 21st and Qwen3-Coder-480B-A35B-Instruct on Tuesday 22nd. Those two were both non-reasoning models - a change from the previous models in …

AI + a16z Podcast: Vibe Coding, Security Risks, and the Path to Progress
Socket CEO Feross Aboukhadijeh and a16z partner Joel de la Garza discuss vibe coding, AI-driven software development, and how the rise of LLMs, despit...

SpecForge: Accelerating Speculative Decoding Training for SGLang
<p>Speculative decoding is a powerful technique for accelerating Large Language Model (LLM) inference. In this blog post, we are excited to announce the open...

Using GitHub Spark to reverse engineer GitHub Spark
GitHub Spark was released in public preview yesterday. It’s GitHub’s implementation of the prompt-to-app pattern also seen in products like Claude Artifacts, Lovable, Vercel v0, Val Town Townie and Fly.io’s …
Quoting Recurse Center
[...] You learn best and most effectively when you are learning something that you care about. Your work becomes meaningful and something you can be proud of only when you …

Instagram Reel: Veo 3 paid preview
@googlefordevs on Instagram published this reel featuring Christina Warren with prompting tips for the new Veo 3 paid preview (mp4 copy here). (Christine checked first if I minded them using …

TimeScope: How Long Can Your Video Large Multimodal Model Go?
New open source benchmark for evaluating vision LLMs on how well they handle long videos: TimeScope probes the limits of long-video capabilities by inserting several short (~5-10 second) video clips---our …

1KB JS Numbers Station
Terence Eden built a neat and weird 1023 byte JavaScript demo that simulates a numbers station using the browser SpeechSynthesisUtterance, which I hadn't realized is supported by every modern browser …
Quoting Dave White
like, one day you discover you can talk to dogs. it's fun and interesting so you do it more, learning the intricacies of their language and their deepest customs. you …
Quoting ICML 2025
Submitting a paper with a "hidden" prompt is scientific misconduct if that prompt is intended to obtain a favorable review from an LLM. The inclusion of such a prompt is …

Qwen3-Coder: Agentic Coding in the World
It turns out that as I was typing up my notes on Qwen3-235B-A22B-Instruct-2507 the Qwen team were unleashing something much bigger: Today, we’re announcing Qwen3-Coder, our most agentic code model …

Qwen/Qwen3-235B-A22B-Instruct-2507
Significant new model release from Qwen, published yesterday without much fanfare. This is a follow-up to their April release of the full Qwen 3 model family, which included a Qwen3-235B-A22B …

Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data
This new alignment paper from Anthropic wins my prize for best illustrative figure so far this year: The researchers found that fine-tuning a model on data generated by another model …

Our contribution to a global environmental standard for AI
Mistral have released environmental impact numbers for their largest model, Mistral Large 2, in more detail than I have seen from any of the other large AI labs. The methodology …
Gemini 2.5 Flash-Lite is now stable and generally available
The last remaining member of the Gemini 2.5 trio joins Pro and Flash in General Availability today. Gemini 2.5 Flash-Lite is the cheapest of the 2.5 family, at $0.10/million input …

AIコーディングハンズオンの講師をやりました(株式会社DeNA様の事例)
この記事では、株式会社DeNAでのAIコーディングの社内勉強会の講師を務めた経験について述べています。著者は、AIを活用したプログラミングの講師として、受講者に伝えたいことや講師としての心構えをまとめています。事前準備として、作例のリポジトリを用意しつつも、受講者がゼロからプロンプティングする体験を重視しています。また、当日の進行方法や教材の準備、受講者の期待値コントロールについても触れています。特に、AIの環境構築の難しさや、実際のデモを通じて受講者に体験してもらうことの重要性が強調されています。最後に、勉強会の規模や課題についても言及し、今後の依頼については相談が必要であると述べています。 • AIコーディングの講師としての経験を共有 • 受講者にゼロからプロンプティングする体験を重視 • 事前準備として作例のリポジトリを用意 • 当日の進行方法や教材の準備についての考慮 • AIの環境構築の難しさを体感させる • 受講者の期待値コントロールの重要性 • 勉強会の規模や課題についての言及
Textual v4.0.0: The Streaming Release
Will McGugan may no longer be running a commercial company around Textual, but that hasn't stopped his progress on the open source project. He recently released v4 of his Python …
tidwall/pogocache
New project from Josh Baker, author of the excellent tg C geospatial libarry (covered previously) and various other interesting projects: Pogocache is fast caching software built from scratch with a …
Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad
OpenAI beat them to the punch in terms of publicity by publishing their results on Saturday, but a team from Google Gemini achieved an equally impressive result on this year's …
Quoting Daniel Litt
An AI tool that gets gold on the IMO is obviously immensely impressive. Does it mean math is “solved”? Is an AI-generated proof of the Riemann hypothesis clearly on the …

The top 15 MCP servers for your AI projects
Explore 15 essential MCP servers for web developers to enhance AI workflows with tools, data, and automation.
Coding with LLMs in the summer of 2025 (an update)
Salvatore Sanfilippo describes his current AI-assisted development workflow. He's all-in on LLMs for code review, exploratory prototyping, pair-design and writing "part of the code under your clear specifications", but warns …

🥇Top AI Papers of the Week
The Top AI Papers of the Week (July 14 - 20)
Quoting Armin Ronacher
Every day someone becomes a programmer because they figured out how to make ChatGPT build something. Lucky for us: in many of those cases the AI picks Python. We should …
Quoting Tim Sweeney
There’s a bigger opportunity in computer science and programming (academically conveyed or self-taught) now than ever before, by far, in my opinion. The move to AI is like replacing shovels …

Deploying Kimi K2 with PD Disaggregation and Large-Scale Expert Parallelism on 128 H200 GPUs
<h2><a id="1️⃣-introduction-deploying-the-most-advanced-open-source-moe-model" class="anchor" href="#1️⃣-introduction-deploying-the-most-advanced-open-source...
OpenAI's gold medal performance on the International Math Olympiad
OpenAI research scientist Alexander Wei: I’m excited to share that our latest @OpenAI experimental reasoning LLM has achieved a longstanding grand challenge in AI: gold medal-level performance on the world’s …

🤖 AI Agents Weekly: ChatGPT Agent, Gemini Embeddings, Agent Leaderboard v2, Voxtral, CRMAgent
ChatGPT Agent, Gemini Embeddings, Agent Leaderboard v2, Voxtral, CRMAgent

Kiroとコンテキストエンジニアリングの時流
Kiro(kiro.dev)は、AWSが開発したIDE型のコーディングエージェントです。CursorやWindsurfのようなVS Codeフォークエディタに分類されます。現在はパブリックプレビュー中で、サインアップするとKiroでClaude Sonnet 3.7 とClaude 4 Sonnetを利用できます。 KiroThe AI IDE for prototype to productionKiro Kiroの特徴は、スペック駆動開発、エージェントフック、ステアリングファイルといった独自の機能を通じて、ソフトウェア開発のライフサイクル全体を支援します。それぞれ見ていきましょう。 スペック (Specs)駆動開発 Kiroの中核をなすのが「スペック=仕様書」機能です。これは、ユーザーが入力した大まかな指示(例:「ユーザー認証機能を追加して」)をもとに、AIが「要件定義」「設計」「タスクリスト」という3段階のドキュメントを自動で生成するものです。 Markdownファイルが.kiro/specs/${task}/配下にタスク単位で生成されます。
New tags
A few months I added a tool to my blog for bulk-applying tags to old content. It works as an extension to my existing search interface, letting me run searches …
Quoting Steve Yegge
So one of my favorite things to do is give my coding agents more and more permissions and freedom, just to see how far I can push their productivity without …
Quoting Paul Kedrosky
One analyst recently speculated (via Ed Conard) that, based on Nvidia's latest datacenter sales figures, AI capex may be ~2% of US GDP in 2025, given a standard multiplier. [...] …
How to run an LLM on your laptop
I talked to Grace Huckins for this piece from MIT Technology Review on running local models. Apparently she enjoyed my dystopian backup plan! Simon Willison has a plan for the …

ChatGPT agent の発表まとめ
OpenAIは2025年7月17日にChatGPT agentを発表しました。このエージェントシステムは、ChatGPTにブラウザ操作やDeep Research機能を統合し、複雑なタスクを処理できるようになりました。ユーザーは、カレンダーの確認や材料の計画、競合分析などを依頼でき、結果は編集可能なスライドやスプレッドシートとして返されます。ChatGPT agentはPro、Plus、Teamプランのユーザーが利用可能で、さまざまなツールを使ってタスクを解決する選択肢を提供します。また、安全性に関してもプロンプトインジェクション対策やデータ消去機能が整備されています。 • ChatGPT agentは複雑なタスクを処理できるエージェントシステムである。 • ユーザーはカレンダー確認や材料計画、競合分析などを依頼できる。 • 結果は編集可能なスライドやスプレッドシートとして返される。 • Pro、Plus、Teamプランのユーザーが利用可能で、さまざまなツールを使用できる。 • 安全性対策としてプロンプトインジェクション対策やデータ消去機能が整備されている。

How to build better AI apps in React with MediaPipe’s latest APIs
Build an AI-powered object detection app in React using MediaPipe's latest Tasks API, run models in-browser with no backend setup.

AI won’t fix bad thinking — use it to challenge you instead
AI agrees too easily. That’s a problem. Learn how to prompt it to challenge your thinking and improve your product decisions.

Accelerating SGLang with Multiple Token Prediction
<h2><a id="tldr" class="anchor" href="#tldr" aria-hidden="true"><svg aria-hidden="true" class="octicon octicon-link" height="16" version="1.1" viewbox="0 0 1...
Voxtral
Mistral released their first audio-input models yesterday: Voxtral Small and Voxtral Mini. These state‑of‑the‑art speech understanding models are available in two sizes—a 24B variant for production-scale applications and a 3B …
common-pile/caselaw_access_project
Enormous openly licensed (I believe this is almost all public domain) training dataset of US legal cases: This dataset contains 6.7 million cases from the Caselaw Access Project and Court …

How to build unified AI interfaces using the Vercel AI SDK
Learn how to use the Vercel AI SDK to build modern, multimodal frontend apps with streaming, function calling, image analysis, voice output, and generative UI.

How to support new VLMs into SGLang: A Case Study with NVILA
<p>The world of LLMs is evolving at a remarkable pace, with Visual Language Models (VLMs) at the forefront of this revolution. These models power application...
Reflections on OpenAI
Calvin French-Owen spent just over a year working at OpenAI, during which time the organization grew from 1,000 to 3,000 people and Calvin found himself in "the top 30% by …
xAI: "We spotted a couple of issues with Grok 4 recently that we immediately investigated & mitigated"
They continue: One was that if you ask it "What is your surname?" it doesn't have one so it searches the internet leading to undesirable results, such as when its …

AI compliance: A core product competency you shouldn’t skip
AI governance is now a product feature. Learn how to embed trust, transparency, and compliance into your build cycles.

AWS の エージェント IDE Kiro を使ってみた
Kiro は AWS が開発した IDE 内蔵型の AI コーディングエージェントです。Kiro の特徴は単なるバイブコーディングにとどまらず、スペックを使用して仕様駆動開発でアプリケーションを開発できることです。この記事では Kiro を使ったアプリケーション開発の流れを紹介します。

Application development without programmers
This book by James Martin, published in 1982, includes the following in the preface: Applications development did not change much for 20 years, but now a new wave is crashing …
ccusage
Claude Code logs detailed usage information to the ~/.claude/ directory. ccusage is a neat little Node.js tool which reads that information and shows you a readable summary of your usage …

Cost Effective Deployment of DeepSeek R1 with Intel® Xeon® 6 CPU on SGLang
<p>The impressive performance of DeepSeek R1 marked a rise of giant Mixture of Experts (MoE) models in Large Language Models (LLM). However, its massive mode...

🥇Top AI Papers of the Week
The Top AI Papers of the Week (July 7 - 13)

サンドボックス環境を MCP サーバーで提供する Container Use
AI コーディングエージェントは便利ですが、任意の Bash コマンドを実行できるため、ユーザーのシステムに影響を与える可能性があります。Container Use は MCP サーバーとして動作し、AI コーディングエージェントにサンドボックス環境を提供します。この記事では Container Use の利用方法について紹介します。

🤖 AI Agents Weekly: Grok 4, Context Engineering Guide, Kimi K2, SmolLM3, MedGemma 27B, AI SDK 5
Grok 4, Context Engineering Guide, Kimi K2, SmolLM3, MedGemma 27B, AI SDK 5

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
METR - for Model Evaluation & Threat Research - are a non-profit research institute founded by Beth Barnes, a former alignment researcher at OpenAI (see Wikipedia). They've previously contributed to …

Grok 4 Heavy won't reveal its system prompt
Grok 4 Heavy is the "think much harder" version of Grok 4 that's currenly only available on their $300/month plan. Jeremy Howard relays a report from a Grok 4 Heavy …
Quoting @grok
On the morning of July 8, 2025, we observed undesired responses and immediately began investigating. To identify the specific language in the instructions causing the undesired behavior, we conducted multiple …
Musk’s latest Grok chatbot searches for billionaire mogul’s views before answering questions
I got quoted a couple of times in this story about Grok searching for tweets from:elonmusk by Matt O’Brien for the Associated Press. “It’s extraordinary,” said Simon Willison, an independent …

moonshotai/Kimi-K2-Instruct
Colossal new open weights model release today from Moonshot AI, a two year old Chinese AI lab with a name inspired by Pink Floyd’s album The Dark Side of the …
Quoting Django’s security policies
Following the widespread availability of large language models (LLMs), the Django Security Team has received a growing number of security reports generated partially or entirely using such tools. Many of …

GitHub Copilot NESの内部実装が公開、そして続・AIエディタ戦争
Copilot NESとは Copilot NES(Next Edit Suggestions)は2025年2月にリリースされたGitHub Copilotの内部機能です。コードの変更に連動して必要となる次の編集を予測し、タブキーを押しているだけで複数箇所にわたる修正を提案してくれます。通常のコード補完がカーソル位置の続きのコードを予測するのに対して、Copilot NESは「エディタ上の編集操作」の単位で続きを予測して補完します。 GitHub Next | Copilot Next Edit SuggestionsGitHub Next Project: Can we improve Copilot code completion by suggesting the next logical change, wherever it is in your project?GitHub Next この仕組みはCopilot NESの元ネタであるCursor Tab(Copilot++)によって実用化されましたが、Cursorはプロプライエタリなソフトウェアなので内部の詳細が分かり
Generationship: Ep. #39, Simon Willison
I recorded this podcast episode with Rachel Chalmers a few weeks ago. We talked about the resurgence of blogging, the legacy of Google Reader, learning in public, LLMs as weirdly …

Grok: searching X for "from:elonmusk (Israel OR Palestine OR Hamas OR Gaza)"
If you ask the new Grok 4 for opinions on controversial questions, it will sometimes run a search to find out Elon Musk’s stance before providing you with an answer. …

Grok 4
Released last night, Grok 4 is now available via both API and a paid subscription for end-users. Key characteristics: image and text input, text output. 256,000 context length (twice that …