Simon Willison's Blog
simonwillison.net/
Living dangerously with Claude
I gave a talk last night at Claude Code Anonymous in San Francisco, the unofficial meetup for coding agent enthusiasts. I decided to talk about a dichotomy I’ve been struggling …

SLOCCount in WebAssembly
This project/side-quest got a little bit out of hand. I remembered an old tool called SLOCCount which could count lines of code and produce an estimate for how much they …
Don't let Claude Code delete your session logs
Claude Code stores full logs of your sessions as newline-delimited JSON in ~/.claude/projects/encoded-directory/*.jsonl on your machine. I currently have 379MB of these! Here's an example jsonl file which I extracted …

Unseeable prompt injections in screenshots: more vulnerabilities in Comet and other AI browsers
The Brave security team wrote about prompt injection against browser agents a few months ago (here are my notes on that). Here's their follow-up: What we’ve found confirms our initial …

Introducing ChatGPT Atlas
Last year OpenAI hired Chrome engineer Darin Fisher, which sparked speculation they might have their own browser in the pipeline. Today it arrived. ChatGPT Atlas is a Mac-only web browser …
Quoting Bruce Schneier and Barath Raghavan
Prompt injection might be unsolvable in today’s LLMs. LLMs process token sequences, but no mechanism exists to mark token privileges. Every solution proposed introduces new injection vectors: Delimiter? Attackers include …

Claude Code for web - a new asynchronous coding agent from Anthropic
Anthropic launched Claude Code for web this morning. It’s an asynchronous coding agent—their answer to OpenAI’s Codex Cloud and Google’s Jules, and has a very similar shape. I had preview …

Getting DeepSeek-OCR working on an NVIDIA Spark via brute force using Claude Code
DeepSeek released a new model yesterday: DeepSeek-OCR, a 6.6GB model fine-tuned specifically for OCR. They released it as model weights that run using PyTorch and CUDA. I got it running …

TIL: Exploring OpenAI's deep research API model o4-mini-deep-research
I landed a PR by Manuel Solorzano adding pricing information to llm-prices.com for OpenAI's o4-mini-deep-research and o3-deep-research models, which they released in June and document here. I realized I'd never …
The AI water issue is fake
Andy Masley (previously): All U.S. data centers (which mostly support the internet, not AI) used 200--250 million gallons of freshwater daily in 2023. The U.S. consumes approximately 132 billion gallons …
Andrej Karpathy — AGI is still a decade away
Extremely high signal 2 hour 25 minute (!) conversation between Andrej Karpathy and Dwarkesh Patel. It starts with Andrej's claim that "the year of agents" is actually more likely to …
Quoting Alexander Fridriksson and Jay Miller
Using UUIDv7 is generally discouraged for security when the primary key is exposed to end users in external-facing applications or APIs. The main issue is that UUIDv7 incorporates a 48-bit …
Quoting Barry Zhang
Skills actually came out of a prototype I built demonstrating that Claude Code is a general-purpose agent :-) It was a natural conclusion once we realized that bash + filesystem …

Claude Skills are awesome, maybe a bigger deal than MCP
Anthropic this morning introduced Claude Skills, a new pattern for making new abilities available to their models: Claude can now use Skills to improve how it performs specific tasks. Skills …
NVIDIA DGX Spark + Apple Mac Studio = 4x Faster LLM Inference with EXO 1.0
EXO Labs wired a 256GB M3 Ultra Mac Studio up to an NVIDIA DGX Spark and got a 2.8x performance boost serving Llama-3.1 8B (FP16) with an 8,192 token prompt. …
Quoting Riana Pfefferkorn
Pro se litigants account for the majority of the cases in the United States where a party submitted a court filing containing AI hallucinations. In a country where legal representation …
Coding without typing the code
Last year the most useful exercise for getting a feel for how good LLMs were at writing code was vibe coding (before that name had even been coined) - seeing …
Quoting Catherine Wu
While Sonnet 4.5 remains the default [in Claude Code], Haiku 4.5 now powers the Explore subagent which can rapidly gather context on your codebase to build apps even faster. You …

Introducing Claude Haiku 4.5
Anthropic released Claude Haiku 4.5 today, the cheapest member of the Claude 4.5 family that started with Sonnet 4.5 a couple of weeks ago. It's priced at $1/million input tokens …
Quoting Claude Haiku 4.5 System Card
Previous system cards have reported results on an expanded version of our earlier agentic misalignment evaluation suite: three families of exotic scenarios meant to elicit the model to commit blackmail, …

NVIDIA DGX Spark: great hardware, early days for the ecosystem
NVIDIA sent me a preview unit of their new DGX Spark desktop “AI supercomputer”. I’ve never had hardware to review before! You can consider this my first ever sponsored post …
Just Talk To It - the no-bs Way of Agentic Engineering
Peter Steinberger's long, detailed description of his current process for using Codex CLI and GPT-5 Codex. This is information dense and full of actionable tips, plus plenty of strong opinions …
nanochat
Really interesting new project from Andrej Karpathy, described at length in this discussion post. It provides a full ChatGPT-style LLM, including training, inference and a web Ui, that can be …
Claude Code sub-agents
Claude Code includes the ability to run sub-agents, where a separate agent loop with a fresh token context is dispatched to achieve a goal and report back when it's done. …
Vibing a Non-Trivial Ghostty Feature
Mitchell Hashimoto provides a comprehensive answer to the frequent demand for a detailed description of shipping a non-trivial production feature to an existing project using AI-assistance. In this case it's …
Note on 11th October 2025
I'm beginning to suspect that a key skill in working effectively with coding agents is developing an intuition for when you don't need to closely review every line of code …
simonw/claude-skills
One of the tips I picked up from Jesse Vincent's Claude Code Superpowers post (previously) was this: Skills are what give your agents Superpowers. The first time they really popped …

Superpowers: How I'm using coding agents in October 2025
A follow-up to Jesse Vincent's post about September, but this is a really significant piece in its own right. Jesse is one of the most creative users of coding agents …
A Retrospective Survey of 2024/2025 Open Source Supply Chain Compromises
Filippo Valsorda surveyed 18 incidents from the past year of open source supply chain attacks, where package updates were infected with malware thanks to a compromise of the project itself. …
Video of GPT-OSS 20B running on a phone
GPT-OSS 20B is a very good model. At launch OpenAI claimed: The gpt-oss-20b model delivers similar results to OpenAI o3‑mini on common benchmarks and can run on edge devices with …
Quoting Gergely Orosz
I get a feeling that working with multiple AI agents is something that comes VERY natural to most senior+ engineers or tech lead who worked at a large company You …
Claude can write complete Datasette plugins now
This isn’t necessarily surprising, but it’s worth noting anyway. Claude Sonnet 4.5 is capable of building a full Datasette plugin now. I’ve seen models complete aspects of this in the …
Quoting Simon Højberg
The cognitive debt of LLM-laden coding extends beyond disengagement of our craft. We’ve all heard the stories. Hyped up, vibed up, slop-jockeys with attention spans shorter than the framework-hopping JavaScript …

Gemini 2.5 Computer Use can solve Google's own CAPTCHAs
Google just introduced a new Gemini 2.5 Computer Use model, specially designed to help operate a GUI interface by interacting with visible elements using a virtual mouse and keyboard. I …
Vibe engineering
I feel like vibe coding is pretty well established now as covering the fast, loose and irresponsible way of building software with AI—entirely prompt-driven, and with no attention paid to …
Deloitte to pay money back to Albanese government after using AI in $440,000 report
Ouch: Deloitte will provide a partial refund to the federal government over a $440,000 report that contained several errors, after admitting it used generative artificial intelligence to help produce it. …
a system that can do work independently on behalf of the user
I've settled on agents as meaning "LLMs calling tools in a loop to achieve a goal" but OpenAI continue to muddy the waters with much more vague definitions. Swyx spotted …

gpt-image-1-mini
OpenAI released a new image model today: gpt-image-1-mini, which they describe as "A smaller image generation model that’s 80% less expensive than the large model." They released it very quietly …

GPT-5 pro
Here's OpenAI's model documentation for their GPT-5 pro model, released to their API today at their DevDay event. It has similar base characteristics to GPT-5: both share a September 30, …
OpenAI DevDay 2025 live blog
I’m at OpenAI DevDay in Fort Mason, San Francisco today. As I did last year, I’m going to be live blogging the announcements from the kenote. Unlike last year, this …
Embracing the parallel coding agent lifestyle
For a while now I’ve been hearing from engineers who run multiple coding agents at once—firing up several Claude Code or Codex CLI instances at the same time, sometimes in …

Let the LLM Write the Prompts: An Intro to DSPy in Compound Al Pipelines
I've had trouble getting my head around DSPy in the past. This half hour talk by Drew Breunig at the recent Databricks Data + AI Summit is the clearest explanation …
Sora 2 prompt injection
It turns out Sora 2 is vulnerable to prompt injection! When you onboard to Sora you get the option to create your own "cameo" - a virtual video recreation of …

Daniel Stenberg's note on AI assisted curl bug reports
Curl maintainer Daniel Stenberg on Mastodon: Joshua Rogers sent us a massive list of potential issues in #curl that he found using his set of AI assisted tools. Code analyzer …
Quoting Nadia Eghbal
When attention is being appropriated, producers need to weigh the costs and benefits of the transaction. To assess whether the appropriation of attention is net-positive, it’s useful to distinguish between …

aavetis/PRarena
Albert Avetisian runs this repository on GitHub which uses the Github Search API to track the number of PRs that can be credited to a collection of different coding agents. …

Two more Chinese pelicans
Two new models from Chinese AI labs in the past few days. I tried them both out using llm-openrouter: DeepSeek-V3.2-Exp from DeepSeek. Announcement, Tech Report, Hugging Face (690GB, MIT license). …
September monthly sponsors newsletter
I just sent out the September edition of my sponsors-only monthly newsletter. If you are a sponsor (or if you start a sponsorship now) you can access a copy here. …
Sora 2
Having watched this morning's Sora 2 introduction video, the most notable feature (aside from audio generation - original Sora was silent, Google's Veo 3 supported audio in May 2025) looks …
Designing agentic loops
Coding agents like Anthropic’s Claude Code and OpenAI’s Codex CLI represent a genuine step change in how useful LLMs can be for producing working code. These agents can now directly …

Claude Sonnet 4.5 is probably the "best coding model in the world" (at least for now)
Anthropic released Claude Sonnet 4.5 today, with a very bold set of claims: Claude Sonnet 4.5 is the best coding model in the world. It’s the strongest model for building …
Armin Ronacher: 90%
The idea of AI writing "90% of the code" to-date has mostly been expressed by people who sell AI tooling. Over the last few months, I've increasingly seen the same …
Quoting Scott Aaronson
Given a week or two to try out ideas and search the literature, I’m pretty sure that Freek and I could’ve solved this problem ourselves. Instead, though, I simply asked …
Quoting Nick Turley
We’ve seen the strong reactions to 4o responses and want to explain what is happening. We’ve started testing a new safety routing system in ChatGPT. As we previously mentioned, when …

Video models are zero-shot learners and reasoners
Fascinating new paper from Google DeepMind which makes a very convincing case that their Veo 3 model - and generative video models in general - serve a similar role in …
ForcedLeak: AI Agent risks exposed in Salesforce AgentForce
Classic lethal trifecta image exfiltration bug reported against Salesforce AgentForce by Sasi Levi and Noma Security. Here the malicious instructions come in via the Salesforce Web-to-Lead feature. When a Salesforce …
How to stop AI’s “lethal trifecta”
This is the second mention of the lethal trifecta in the Economist in just the last week! Their earlier coverage was Why AI systems may never be secure on September …
GitHub Copilot CLI is now in public preview
GitHub now have their own entry in the coding terminal CLI agent space: Copilot CLI. It's the same basic shape as Claude Code, Codex CLI, Gemini CLI and a growing …

Improved Gemini 2.5 Flash and Flash-Lite
Two new preview models from Google - updates to their fast and inexpensive Flash and Flash Lite families: The latest version of Gemini 2.5 Flash-Lite was trained and built based …
Don't hide your best documentation
If you hide the system prompt and tool descriptions for your LLM agent, what you're actually doing is deliberately hiding the most useful documentation describing your service from your most …
Quoting Stanford CS221 Autumn 2025
[2 points] Learn basic NumPy operations with an AI tutor! Use an AI chatbot (e.g., ChatGPT, Claude, Gemini, or Stanford AI Playground) to teach yourself how to do basic vector …
Cross-Agent Privilege Escalation: When Agents Free Each Other
Here's a clever new form of AI exploit from Johann Rehberger, who has coined the term Cross-Agent Privilege Escalation to describe an attack where multiple coding agents - GitHub Copilot …

GPT-5-Codex
OpenAI half-relased this model earlier this month, adding it to their Codex CLI tool but not their API. Today they've fixed that - the new model can now be accessed …
Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action
I've been looking forward to this. Qwen 2.5 VL is one of the best available open weight vision LLMs, so I had high hopes for Qwen 3's vision models. Firstly, …
Why AI systems might never be secure
The Economist have a new piece out about LLM security, with this headline and subtitle: Why AI systems might never be secure A “lethal trifecta” of conditions opens them to …
Quoting Kate Niederhoffer, Gabriella Rosen Kellerman, Angela Lee, Alex Liebscher, Kristina Rapuano and Jeffrey T. Hancock
We define workslop as AI generated work content that masquerades as good work, but lacks the substance to meaningfully advance a given task. Here’s how this happens. As AI tools …

Four new releases from Qwen
It's been an extremely busy day for team Qwen. Within the last 24 hours (all links to Twitter, which seems to be their preferred platform for these announcements): Qwen3-Next-80B-A3B-Instruct-FP8 and …

CompileBench: Can AI Compile 22-year-old Code?
Interesting new LLM benchmark from Piotr Grabowski and Piotr Migdał: how well can different models handle compilation challenges such as cross-compiling gucr for ARM64 architecture? This is one of my …
ChatGPT Is Blowing Up Marriages as Spouses Use AI to Attack Their Partners
Maggie Harrison Dupré for Futurism. It turns out having an always-available "marriage therapist" with a sycophantic instinct to always take your side is catastrophic for relationships. The tension in the …
Locally AI
Handy new iOS app by Adrien Grondin for running local LLMs on your phone. It just added support for the new iOS 26 Apple Foundation model, so you can install …
llm-openrouter 0.5
New release of my LLM plugin for accessing models made available via OpenRouter. The release notes in full: Support for tool calling. Thanks, James Sanford. #43 Support for reasoning options, …

Grok 4 Fast
New hosted vision-enabled reasoning model from xAI that's designed to be fast and extremely competitive on price. It has a 2 million token context window and "was trained end-to-end with …
Magistral 1.2
Mistral quietly released two new models yesterday: Magistral Small 1.2 (Apache 2.0, 96.1 GB on Hugging Face) and Magistral Medium 1.2 (not open weights same as Mistral's other "medium" models.) …
The Hidden Risk in Notion 3.0 AI Agents: Web Search Tool Abuse for Data Exfiltration
Abi Raghuram reports that Notion 3.0, released yesterday, introduces new prompt injection data exfiltration vulnerabilities thanks to enabling lethal trifecta attacks. Abi's attack involves a PDF with hidden text (white …
Quoting Steve Jobs
Well, the types of computers we have today are tools. They’re responders: you ask a computer to do something and it will do it. The next stage is going to …

I think "agent" may finally have a widely enough agreed upon definition to be useful jargon now
I’ve noticed something interesting over the past few weeks: I’ve started using the term “agent” in conversations where I don’t feel the need to then define it, roll my eyes …
Anthropic: A postmortem of three recent issues
Anthropic had a very bad month in terms of model reliability: Between August and early September, three infrastructure bugs intermittently degraded Claude's response quality. We've now resolved these issues and …
ICPC medals for OpenAI and Gemini
In July it was the International Math Olympiad (OpenAI, Gemini), today it's the International Collegiate Programming Contest (ICPC). Once again, both OpenAI and Gemini competed with models that achieved Gold …
Announcing the 2025 PSF Board Election Results!
I'm happy to share that I've been re-elected for second term on the board of directors of the Python Software Foundation. Jannis Leidel was also re-elected and Abigail Dogbe and …

GPT‑5-Codex and upgrades to Codex
OpenAI half-released a new model today: GPT‑5-Codex, a fine-tuned GPT-5 variant explicitly designed for their various AI-assisted programming tools. I say half-released because it's not yet available via their API, …
Models can prompt now
Here's an interesting example of models incrementally improving over time: I am finding that today's leading models are competent at writing prompts for themselves and each other. A year ago …
gpt-5 and gpt-5-mini rate limit updates
OpenAI have increased the rate limits for their two main GPT-5 models. These look significant: gpt-5 Tier 1: 30K → 500K TPM (1.5M batch) Tier 2: 450K → 1M (3M …
Quoting Matt Webb
The trick with Claude Code is to give it large, but not too large, extremely well defined problems. (If the problems are too large then you are now vibe coding… …
Comparing the memory implementations of Claude and ChatGPT
Shlok Khemani has been doing excellent work reverse-engineering LLM systems and documenting his discoveries. Last week he wrote about ChatGPT memory. This week it's Claude. Claude's memory system has two …

Qwen3-Next-80B-A3B: 🐧🦩 Who needs legs?!
Qwen announced two new models via their Twitter account (nothing on their blog yet): Qwen3-Next-80B-A3B-Instruct and Qwen3-Next-80B-A3B-Thinking. They make some big claims on performance: Qwen3-Next-80B-A3B-Instruct approaches our 235B flagship. Qwen3-Next-80B-A3B-Thinking …
Defeating Nondeterminism in LLM Inference
A very common question I see about LLMs concerns why they can't be made to deliver the same response to the same prompt by setting a fixed random number seed. …
Claude API: Web fetch tool
New in the Claude API: if you pass the web-fetch-2025-09-10 beta header you can add {"type": "web_fetch_20250910", "name": "web_fetch", "max_uses": 5} to your "tools" list and Claude will gain the …
I Replaced Animal Crossing's Dialogue with a Live LLM by Hacking GameCube Memory
Brilliant retro-gaming project by Josh Fonseca, who figured out how to run 2002 Game Cube Animal Crossing in the Dolphin Emulator such that dialog with the characters was instead generated …
Quoting Apple Security Engineering and Architecture
There has never been a successful, widespread malware attack against iPhone. The only system-level iOS attacks we observe in the wild come from mercenary spyware, which is vastly more complex …

My review of Claude's new Code Interpreter, released under a very confusing name
Today on the Anthropic blog: Claude can now create and edit files: Claude can now create and edit Excel spreadsheets, documents, PowerPoint slide decks, and PDFs directly in Claude.ai and …
The 2025 PSF Board Election is Open!
The Python Software Foundation's annual board member election is taking place right now, with votes (from previously affirmed voting members) accepted from September 2nd, 2:00 pm UTC through Tuesday, September …
Geoffrey Huntley is cursed
Geoffrey Huntley vibe-coded an entirely new programming language using Claude: The programming language is called "cursed". It's cursed in its lexical structure, it's cursed in how it was built, it's …

Recreating the Apollo AI adoption rate chart with GPT-5, Python and Pyodide
Apollo Global Management’s “Chief Economist” Dr. Torsten Sløk released this interesting chart which appears to show a slowdown in AI adoption rates among large (>250 empoloyees) companies: Here’s the full …
Anthropic status: Model output quality
Anthropic previously reported model serving bugs that affected Claude Opus 4 and 4.1 for 56.5 hours. They've now fixed additional bugs affecting "a small percentage" of Sonnet 4 requests for …
Quoting TheSoftwareGuy
Having worked inside AWS I can tell you one big reason [that they don't document their internals] is the attitude/fear that anything we put in out public docs may end …
Load Llama-3.2 WebGPU in your browser from a local folder
Inspired by a comment on Hacker News I decided to see if it was possible to modify the transformers.js-examples/tree/main/llama-3.2-webgpu Llama 3.2 chat demo (online here, I wrote about it last …
Quoting James Luan
I recently spoke with the CTO of a popular AI note-taking app who told me something surprising: they spend twice as much on vector search as they do on OpenAI …
Is the LLM response wrong, or have you just failed to iterate it?
More from Mike Caulfield (see also the SIFT method). He starts with a fantastic example of Google's AI mode usually correctly handling a common piece of misinformation but occasionally falling …
Quoting Anil Dash
I agree with the intellectual substance of virtually every common critique of AI. And it's very clear that turning those critiques into a competition about who can frame them in …
The SIFT method
The SIFT method is "an evaluation strategy developed by digital literacy expert, Mike Caulfield, to help determine whether online content can be trusted for credible or reliable sources of information." …

AI mode is good, actually
When I wrote about how good ChatGPT with GPT-5 is at search yesterday I nearly added a note about how comparatively disappointing Google's efforts around this are. I'm glad I …

GPT-5 Thinking in ChatGPT (aka Research Goblin) is shockingly good at search
“Don’t use chatbots as search engines” was great advice for several years... until it wasn’t. I wrote about how good OpenAI’s o3 was at using its Bing-backed search tool back …
Quoting Jason Liu
I am once again shocked at how much better image retrieval performance you can get if you embed highly opinionated summaries of an image, a summary that came out of …

Kimi-K2-Instruct-0905
New not-quite-MIT licensed model from Chinese Moonshot AI, a follow-up to the highly regarded Kimi-K2 model they released in July. This one is an incremental improvement - I've seen it …
Anthropic to pay $1.5 billion to authors in landmark AI settlement
I wrote about the details of this case when it was found that Anthropic's training on book content was fair use, but they needed to have purchased individual copies of …

Introducing EmbeddingGemma
Brand new open weights (under the slightly janky Gemma license) 308M parameter embedding model from Google: Based on the Gemma 3 architecture, EmbeddingGemma is trained on 100+ languages and is …
Highlighted tools
Any time I share my collection of tools built using vibe coding and AI-assisted development (now at 124, here's the definitive list) someone will inevitably complain that they're mostly trivial. …

Beyond Vibe Coding
Back in May I wrote Two publishers and three authors fail to understand what “vibe coding” means where I called out the authors of two forthcoming books on "vibe coding" …
gov.uscourts.dcd.223205.1436.0_1.pdf
Here's the 230 page PDF ruling on the 2023 United States v. Google LLC federal antitrust case - the case that could have resulted in Google selling off Chrome and …

Rich Pixels
Neat Python library by Darren Burns adding pixel image support to the Rich terminal library, using tricks to render an image using full or half-height colored blocks. Here's the key …
August 2025 newsletter
I just sent out my August 2025 sponsors-only newsletter summarizing the past month in LLMs and my other work. Topics included GPT-5, gpt-oss, image editing models (Qwen-Image-Edit and Gemini Nano …
Introducing gpt-realtime
Released a few days ago (August 28th), gpt-realtime is OpenAI's new "most advanced speech-to-speech model". It looks like this is a replacement for the older gpt-4o-realtime-preview model that was released …

Cloudflare Radar: AI Insights
Cloudflare launched this dashboard back in February, incorporating traffic analysis from Cloudflare's network along with insights from their popular 1.1.1.1 DNS service. I found this chart particularly interesting, showing which …
Claude Opus 4.1 and Opus 4 degraded quality
Notable because often when people complain of degraded model quality it turns out to be unfounded - Anthropic in the past have emphasized that they don't change the model weights …
Quoting Benj Edwards
LLMs are intelligence without agency—what we might call "vox sine persona": voice without person. Not the voice of someone, not even the collective voice of many someones, but a voice …

The perils of vibe coding
I was interviewed by Elaine Moore for this opinion piece in the Financial Times, which ended up in the print edition of the paper too! I picked up a copy …
Lossy encyclopedia
Since I love collecting questionable analogies for LLMs, here's a new one I just came up with: an LLM is a lossy encyclopedia. They have a huge array of facts …
Python: The Documentary
New documentary about the origins of the Python programming language - 84 minutes long, built around extensive interviews with Guido van Rossum and others who were there at the start …
Quoting Bruce Schneier
We simply don’t know to defend against these attacks. We have zero agentic AI systems that are secure against these attacks. Any AI that is working in an adversarial environment—and …
Piloting Claude for Chrome
Two days ago I said: I strongly expect that the entire concept of an agentic browser extension is fatally flawed and cannot be built safely. Today Anthropic announced their own …
Will Smith’s concert crowds are real, but AI is blurring the lines
Great piece from Andy Baio demonstrating quite how convoluted the usage ethics and backlash against generative AI has become. Will Smith has been accused of using AI to misleadingly inflate …
Agentic Browser Security: Indirect Prompt Injection in Perplexity Comet
The security team from Brave took a look at Comet, the LLM-powered "agentic browser" extension from Perplexity, and unsurprisingly found security holes you can drive a truck through. The vulnerability …
ChatGPT release notes: Project-only memory
The feature I've most wanted from ChatGPT's memory feature (the newer version of memory that automatically includes relevant details from summarized prior conversations) just landed: With project-only memory enabled, ChatGPT …

DeepSeek 3.1
The latest model from DeepSeek, a 685B monster (like DeepSeek v3 before it) but this time it's a hybrid reasoning model. DeepSeek claim: DeepSeek-V3.1-Think achieves comparable answer quality to DeepSeek-R1-0528, …
Quoting The Bluesky Team
Mississippi's approach would fundamentally change how users access Bluesky. The Supreme Court’s recent decision leaves us facing a hard reality: comply with Mississippi’s age assurance law—and make every Mississippi Bluesky …
too many model context protocol servers and LLM allocations on the dance floor
Useful reminder from Geoffrey Huntley of the infrequently discussed significant token cost of using MCP. Geoffrey estimate estimates that the usable context window something like Amp or Cursor is around …
Quoting potatolicious
Most classical engineering fields deal with probabilistic system components all of the time. In fact I'd go as far as to say that inability to deal with probabilistic components is …
Quoting Matt Garman
I was at a leadership group and people were telling me "We think that with AI we can replace all of our junior people in our company." I was like, …
Quoting Mustafa Suleyman
Simply put, my central worry is that many people will start to believe in the illusion of AIs as conscious entities so strongly that they’ll soon advocate for AI rights, …
Quoting u/AssafMalkiIL
what’s the point of vibe coding if at the end of the day i still gotta pay a dev to look at the code anyway. sure it feels kinda cool …

David Ho on BlueSky: A pelican tried to eat my bike
David Ho caught video footage of one of the pelicans in St James's Park expressing deep curiosity in his bicycle. I think it wants to ride it.

Qwen-Image-Edit: Image Editing with Higher Quality and Efficiency
As promised in their August 4th release of the Qwen image generation model, Qwen have now followed it up with a separate model, Qwen-Image-Edit, which can take an image and …

llama.cpp guide: running gpt-oss with llama.cpp
Really useful official guide to running the OpenAI gpt-oss models using llama-server from llama.cpp - which provides an OpenAI-compatible localhost API and a neat web interface for interacting with the …
PyPI: Preventing Domain Resurrection Attacks
Domain resurrection attacks are a nasty vulnerability in systems that use email verification to allow people to recover their accounts. If somebody lets their domain name expire an attacker might …
r/ChatGPTPro: What is the most profitable thing you have done with ChatGPT?
This Reddit thread - with 279 replies - offers a neat targeted insight into the kinds of things people are using ChatGPT for. Lots of variety here but two themes …
Google Gemini URL Context
New feature in the Gemini API: you can now enable a url_context tool which the models can use to request the contents of URLs as part of replying to a …
TIL: Running a gpt-oss eval suite against LM Studio on a Mac
The other day I learned that OpenAI published a set of evals as part of their gpt-oss model release, described in their cookbook on Verifying gpt-oss implementations. I decided to …
Quoting Sam Altman
Most of what we're building out at this point is the inference [...] We're profitable on inference. If we didn't pay for training, we'd be a very profitable company.
GPT-5 has a hidden system prompt
It looks like GPT-5 when accessed via the OpenAI API may have its own hidden system prompt, independent from the system prompt you can specify in an API call. At …

The Summer of Johann: prompt injections as far as the eye can see
Independent AI researcher Johann Rehberger (previously) has had an absurdly busy August. Under the heading The Month of AI Bugs he has been publishing one report per day across an …
Meta’s AI rules have let bots hold ‘sensual’ chats with kids, offer false medical info
This is grim. Reuters got hold of a leaked copy Meta's internal "GenAI: Content Risk Standards" document: Running to more than 200 pages, the document defines what Meta staff and …

Open weight LLMs exhibit inconsistent performance across providers
Artificial Analysis published a new benchmark the other day, this time focusing on how an individual model—OpenAI’s gpt-oss-120b—performs across different hosted providers. The results showed some surprising differences. Here’s the …
Quoting Steve Wozniak
I gave all my Apple wealth away because wealth and power are not what I live for. I have a lot of fun and happiness. I funded a lot of …
Quoting Cory Doctorow
NERD HARDER! is the answer every time a politician gets a technological idée-fixe about how to solve a social problem by creating a technology that can't exist. It's the answer …
Introducing Gemma 3 270M: The compact model for hyper-efficient AI
New from Google: Gemma 3 270M, a compact, 270-million parameter model designed from the ground up for task-specific fine-tuning with strong instruction-following and text structuring capabilities already trained in. This …
Screaming in the Cloud: AI’s Security Crisis: Why Your Assistant Might Betray You
I recorded this podcast conversation with Corey Quinn a few weeks ago: On this episode of Screaming in the Cloud, Corey Quinn talks with Simon Willison, founder of Datasette and …

How Does A Blind Model See The Earth?
Fun, creative new micro-eval. Split the world into a sampled collection of latitude longitude points and for each one ask a model: If this location is over land, say 'Land'. …

simonw/codespaces-llm
GitHub Codespaces provides full development environments in your browser, and is free to use with anyone with a GitHub account. Each environment has a full Linux container and a browser-based …
Claude Sonnet 4 now supports 1M tokens of context
Gemini and OpenAI both have million token models, so it's good to see Anthropic catching up. This is 5x the previous 200,000 context length limit of the various Claude Sonnet …
Quoting Nick Turley
I think there's been a lot of decisions over time that proved pretty consequential, but we made them very quickly as we have to. [...] [On pricing] I had this …
LLM 0.27, the annotated release notes: GPT-5 and improved tool calling
I shipped LLM 0.27 today, adding support for the new GPT-5 family of models from OpenAI plus a flurry of improvements to the tool calling features introduced in LLM 0.26. …
Reddit will block the Internet Archive
Well this sucks. Jay Peters for the Verge: Reddit says that it has caught AI companies scraping its data from the Internet Archive’s Wayback Machine, so it’s going to start …
Codex upgrade
If you've been experimenting with OpenAI's Codex CLI and have been frustrated that it's not possible to select text and copy it to the clipboard, at least when running in …

qwen-image-mps
Ivan Fioravanti built this Python CLI script for running the Qwen/Qwen-Image image generation model on an Apple silicon Mac, optionally using the Qwen-Image-Lightning LoRA to dramatically speed up generation. Ivan …
AI for data engineers with Simon Willison
I recorded an episode last week with Claire Giordano for the Talking Postgres podcast. The topic was "AI for data engineers" but we ended up covering an enjoyable range of …

Qwen3-4B-Thinking: "This is art - pelicans don't ride bikes!"
I’ve fallen a few days behind keeping up with Qwen. They released two new 4B models last week: Qwen3-4B-Instruct-2507 and its thinking equivalent Qwen3-4B-Thinking-2507. These are relatively tiny models that …
Quoting Sam Altman
the percentage of users using reasoning models each day is significantly increasing; for example, for free users we went from <1% to 7%, and for plus users from 7% to …
Quoting Ethan Mollick
The issue with GPT-5 in a nutshell is that unless you pay for model switching & know to use GPT-5 Thinking or Pro, when you ask “GPT-5” you sometimes get …
Quoting Thomas Dohmke
You know what else we noticed in the interviews? Developers rarely mentioned “time saved” as the core benefit of working in this new way with agents. They were all about …
When a Jira Ticket Can Steal Your Secrets
Zenity Labs describe a classic lethal trifecta attack, this time against Cursor, MCP, Jira and Zendesk. They also have a short video demonstrating the issue. Zendesk support emails are often …

My Lethal Trifecta talk at the Bay Area AI Security Meetup
I gave a talk on Wednesday at the Bay Area AI Security Meetup about prompt injection, the lethal trifecta and the challenges of securing systems that use MCP. It wasn’t …
Quoting @pearlmania500
I have a toddler. My biggest concern is that he doesn't eat rocks off the ground and you're talking to me about ChatGPT psychosis? Why do we even have that? …
Quoting Sam Altman
GPT-5 rollout updates: We are going to double GPT-5 rate limits for ChatGPT Plus users as we finish rollout. We will let Plus users choose to continue to use 4o. …
The surprise deprecation of GPT-4o for ChatGPT consumers
I’ve been dipping into the r/ChatGPT subreddit recently to see how people are reacting to the GPT-5 launch, and so far the vibes there are not good. This AMA thread …

Previewing GPT-5 at OpenAI's office
A couple of weeks ago I was invited to OpenAI's headquarters for a "preview event", for which I had to sign both an NDA and a video release waiver. I …

GPT-5: Key characteristics, pricing and model card
I’ve had preview access to the new GPT-5 model family for the past two weeks, and have been using GPT-5 as my daily-driver. It’s my new favorite model. It’s still …
Jules, our asynchronous coding agent, is now available for everyone
I wrote about the Jules beta back in May. Google's version of the OpenAI Codex PR-submitting hosted coding tool graduated from beta today. I'm mainly linking to this now because …
Qwen3-4B Instruct and Thinking
Yet another interesting model from Qwen - these are tiny compared to their other recent releases (just 4B parameters, 7.5GB on Hugging Face and even smaller when quantized) but with …
Quoting Artificial Analysis
gpt-oss-120b is the most intelligent American open weights model, comes behind DeepSeek R1 and Qwen3 235B in intelligence but offers efficiency benefits [...] We’re seeing the 120B beat o3-mini but …
No, AI is not Making Engineers 10x as Productive
Colton Voege on "curing your AI 10x engineer imposter syndrome". There's a lot of rhetoric out there suggesting that if you can't 10x your productivity through tricks like running a …

OpenAI's new open weight (Apache 2) models are really good
The long promised OpenAI open weight models are here, and they are very impressive. They’re available under proper open source licenses—Apache 2.0—and come in two sizes, 120B and 20B. OpenAI’s …

Claude Opus 4.1
Surprise new model from Anthropic today - Claude Opus 4.1, which they describe as "a drop-in replacement for Opus 4". My favorite thing about this model is the version number …
Quoting greyduet on r/teachers
I teach HS Science in the south. I can only speak for my district, but a few teacher work days in the wave of enthusiasm I'm seeing for AI tools …

ChatGPT agent's user-agent
I was exploring how ChatGPT agent works today. I learned some interesting things about how it exposes its identity through HTTP headers, then made a huge blunder in thinking it …

Usage charts for my LLM tool against OpenRouter
OpenRouter proxies requests to a large number of different LLMs and provides high level statistics of which models are the most popular among their users. Tools that call OpenRouter can …

Qwen-Image: Crafting with Native Text Rendering
Not content with releasing six excellent open weights LLMs in July, Qwen are kicking off August with their first ever image generation model. Qwen-Image is a 20 billion parameter MMDiT …

Quoting @himbodhisattva
for services that wrap GPT-3, is it possible to do the equivalent of sql injection? like, a prompt-injection attack? make it think it's completed the task and then get access …
I Saved a PNG Image To A Bird
Benn Jordan provides one of the all time great YouTube video titles, and it's justified. He drew an image in an audio spectrogram, played that sound to a talented starling …
Quoting Nick Turley
This week, ChatGPT is on track to reach 700M weekly active users — up from 500M at the end of March and 4× since last year.

XBai o4
Yet another open source (Apache 2.0) LLM from a Chinese AI lab. This model card claims: XBai o4 excels in complex reasoning capabilities and has now completely surpassed OpenAI-o3-mini in …
Faster inference
Two interesting examples of inference speed as a flagship feature of LLM services today. First, Cerebras announced two new monthly plans for their extremely high speed hosted model service: Cerebras …

Deep Think in the Gemini app
Google released Gemini 2.5 Deep Think this morning, exclusively to their Ultra ($250/month) subscribers: It is a variation of the model that recently achieved the gold-medal standard at this year's …
July newsletter for sponors is out
This morning I sent out the third edition of my LLM digest newsletter for my $10/month and higher sponsors on GitHub. It included the following section headers: Claude Code Model …
Quoting Logan Kilpatrick
Gemini Deep Think, our SOTA model with parallel thinking that won the IMO Gold Medal 🥇, is now available in the Gemini App for Ultra subscribers!! [...] Quick correction: this …

Reverse engineering some updates to Claude
Anthropic released two major new features for their consumer-facing Claude apps in the past couple of days. Sadly, they don’t do a very good job of updating the release notes …
Quoting Christina Wodtke
The old timers who built the early web are coding with AI like it's 1995. Think about it: They gave blockchain the sniff test and walked away. Ignored crypto (and …
More model releases on 31st July
Here are a few more model releases from today, to round out a very busy July: Cohere released Command A Vision, their first multi-modal (image input) LLM. Like their others …

Trying out Qwen3 Coder Flash using LM Studio and Open WebUI and LLM
Qwen just released their sixth model(!) for this July called Qwen3-Coder-30B-A3B-Instruct—listed as Qwen3-Coder-Flash in their chat.qwen.ai interface. It’s 30.5B total parameters with 3.3B active at any one time. This means …

Ollama's new app
Ollama has been one of my favorite ways to run local models for a while - it makes it really easy to download models, and it's smart about keeping them …
Quoting Steve Krouse
When you vibe code, you are incurring tech debt as fast as the LLM can spit it out. Which is why vibe coding is perfect for prototypes and throwaway projects: …

The best available open weight LLMs now come from China
Something that has become undeniable this month is that the best available open weight models now come from the Chinese AI labs. I continue to have a lot of love …

Qwen3-30B-A3B-Thinking-2507
Yesterday was Qwen3-30B-A3B-Instruct-2507. Qwen are clearly committed to their new split between reasoning and non-reasoning models (a reversal from Qwen 3 in April), because today they released the new reasoning …
OpenAI: Introducing study mode
New ChatGPT feature, which can be triggered by typing /study or by visiting chatgpt.com/studymode. OpenAI say: Under the hood, study mode is powered by custom system instructions we’ve written in …

Qwen/Qwen3-30B-A3B-Instruct-2507
New model update from Qwen, improving on their previous Qwen3-30B-A3B release from late April. In their tweet they said: Smarter, faster, and local deployment-friendly. ✨ Key Enhancements: ✅ Enhanced reasoning, …
Quoting Nilay Patel
Our plan is to build direct traffic to our site. and newsletters just one kind of direct traffic in the end. I don’t intend to ever rely on someone else’s …
Quoting Anthropic
We’re rolling out new weekly rate limits for Claude Pro and Max in late August. We estimate they’ll apply to less than 5% of subscribers based on current usage. [...] …

GLM-4.5: Reasoning, Coding, and Agentic Abililties
Another day, another significant new open weight model release from a Chinese frontier AI lab. This time it's Z.ai - who rebranded (at least in English) from Zhipu AI a …
Enough AI copilots! We need AI HUDs
Geoffrey Litt compares Copilots - AI assistants that you engage in dialog with and work with you to complete a task - with HUDs, Head-Up Displays, which enhance your working …
Official statement from Tea on their data leak
Tea is a dating safety app for women that lets them share notes about potential dates. The other day it was subject to a truly egregious data leak caused by …

Qwen3-235B-A22B-Thinking-2507
The third Qwen model release week, following Qwen3-235B-A22B-Instruct-2507 on Monday 21st and Qwen3-Coder-480B-A35B-Instruct on Tuesday 22nd. Those two were both non-reasoning models - a change from the previous models in …