この記事では、コーディングエージェントを使用する際に自動テストが不可欠であることが強調されています。従来のテスト作成に関する言い訳は通用せず、エージェントが数分でテストを整形できるため、テストはAI生成コードの信頼性を確保するために重要です。特に、エージェントが既存のコードベースに対してテストを実行することで、コードが本当に機能するかどうかを確認できます。著者は、エージェントに「最初にテストを実行する」というプロンプトを与えることで、テストスイートの存在を認識させ、将来的に新しい変更に対してもテストを実行するよう促す方法を提案しています。これにより、エージェントはテストの重要性を理解し、テストを拡張することが自然になります。 • 自動テストはコーディングエージェントにとって必須である。 • 従来のテスト作成に関する言い訳はもはや通用しない。 • AI生成コードの信頼性を確保するためにテストが重要。 • エージェントは既存のコードベースに対してテストを実行することで、コードの機能を確認できる。 • 「最初にテストを実行する」というプロンプトがエージェントにテストの重要性を認識させる。
Really interesting case-study from Andreas Kling on advanced, sophisticated use of coding agents for ambitious coding projects with critical code. After a few years hoping Swift's platform support outside of …

I’ve started a new project to collect and document Agentic Engineering Patterns—coding practices and patterns to help get the best results out of this new era of coding agent development …
The paper asked me to explain vibe coding, and I did so, because I think something big is coming there, and I'm deep in, and I worry that normal people …
The latest scourge of Twitter is AI bots that reply to your tweets with generic, banal commentary slop, often accompanied by a question to "drive engagement" and waste as much …
Nothing humbles you like telling your OpenClaw “confirm before acting” and watching it speedrun deleting your inbox. I couldn’t stop it from my phone. I had to RUN to my …
On February 5th Anthropic's Nicholas Carlini wrote about a project to use parallel Claudes to build a C compiler on top of the brand new Opus 4.6 Chris Lattner (Swift, …

Striking graph illustrating stock in the UK Raspberry Pi holding company spiking on Tuesday: The Telegraph credited excitement around OpenClaw: Raspberry Pi's stock price has surged 30pc in two days, …
Gabriel Chua (Developer Experience Engineer for APAC at OpenAI) provides his take on the confusing terminology behind the term "Codex", which can refer to a bunch of of different things …
We’ve made GPT-5.3-Codex-Spark about 30% faster. It is now serving at over 1200 tokens per second.
Andrej Karpathy tweeted a mini-essay about buying a Mac Mini ("The apple store person told me they are selling like hotcakes and everyone is confused") to tinker with Claws: I'm …
This new Canadian hardware startup just announced their first product - a custom hardware implementation of the Llama 3.1 8B model (from July 2024) that can run at a staggering …
I don't normally cover acquisition news like this, but I have some thoughts. It's hard to overstate the impact Georgi Gerganov has had on the local model space. Back in …
Long running agentic products like Claude Code are made feasible by prompt caching which allows us to reuse computation from previous roundtrips and significantly decrease latency and cost. [...] At …

The first in the Gemini 3.1 series, priced the same as Gemini 3 Pro ($2/million input, $12/million output under 200,000 tokens, $4/$18 for 200,000 to 1,000,000). They boast about its …

SWE-bench is one of the benchmarks that the labs love to list in their model releases. The official leaderboard is infrequently updated but they just did a full run of …
25+ years into my career as a programmer I think I may finally be coming around to preferring type hints or even strong typing. I resisted those in the past …
New opinion piece from Paul Ford in the New York Times. Unsurprisingly for a piece by Paul it's packed with quoteworthy snippets, but a few stood out for me in …
LLMs are eating specialty skills. There will be less use of specialist front-end and back-end developers as the LLM-driving skills become more important than the details of platform usage. Will …

Sonnet 4.6 is out today, and Anthropic claim it offers similar performance to November's Opus 4.5 while maintaining the Sonnet pricing of $3/million input and $15/million output tokens (the Opus …
First chick of the 2026 breeding season! Kākāpō Yasmine hatched an egg fostered from kākāpō Tīwhiri on Valentine's Day, bringing the total number of kākāpō to 237 – though it …
But the intellectually interesting part for me is something else. I now have something close to a magic box where I throw in a question and a first answer comes …

Given the threat of cognitive debt brought on by AI-accelerated software development leading to more projects and less deep understanding of how they work and what they actually do, it's …

Alibaba's Qwen just released the first two models in the Qwen 3.5 series - one open weights, one proprietary. Both are multi-modal for vision input. The open weight one is …

I'm a very heavy user of Claude Code on the web, Anthropic's excellent but poorly named cloud version of Claude Code where everything runs in a container environment managed by …
Steve Yegge's take on agent fatigue, and its relationship to burnout. Let's pretend you're the only person at your company using AI. In Scenario A, you decide you're going to …
I'm occasionally accused of using LLMs to write the content on my blog. I don't do that, and I don't think my writing has much of an LLM smell to …

We coined a new term on the Oxide and Friends podcast last month (primary credit to Adam Leventhal) covering the sense of psychological ennui leading into existential dread that many …
It's wild that the first commit to OpenClaw was on November 25th 2025, and less than three months later it's hit 10,000 commits from 600 contributors, attracted 196,000 GitHub stars …
This piece by Margaret-Anne Storey is the best explanation of the term cognitive debt I've seen so far. Cognitive debt, a term gaining traction recently, instead communicates the notion that …
Someone has to prompt the Claudes, talk to customers, coordinate with other teams, decide what to build next. Engineering is changing and great engineers are more important than ever.
The retreat challenged the narrative that AI eliminates the need for junior developers. Juniors are more profitable than they have ever been. AI tools get them past the awkward initial …
Someone asked if there was an Anthropic equivalent to OpenAI's IRS mission statements over time. Anthropic are a "public benefit corporation" but not a non-profit, so they don't have the …

As a USA 501(c)(3) the OpenAI non-profit has to file a tax return each year with the IRS. One of the required fields on that tax return is to “Briefly …
OpenAI announced a partnership with Cerebras on January 14th. Four weeks later they're already launching the first integration, "an ultra-fast model for real-time coding in Codex". Despite being named GPT-5.3-Codex-Spark …
Claude Code was made available to the general public in May 2025. Today, Claude Code’s run-rate revenue has grown to over $2.5 billion; this figure has more than doubled since …
One of the sub-threads of the AI energy usage discourse has been the impact new data centers have on the cost of electricity to nearby residents. Here's detailed analysis from …

New from Google. They say it's "built to push the frontier of intelligence and solve modern challenges across science, research, and engineering". It drew me a really good SVG of …
Scott Shambaugh helps maintain the excellent and venerable matplotlib Python charting library, including taking on the thankless task of triaging and reviewing incoming pull requests. A GitHub account called @crabby-rathbun …
In my post about my Showboat project I used the term "overseer" to refer to the person who manages a coding agent. It turns out that's a term tied to …
An AI-generated report, delivered directly to the email inboxes of journalists, was an essential tool in the Times’ coverage. It was also one of the first signals that conservative media …
OpenAI's adoption of Skills continues to gain ground. You can now use Skills directly in the OpenAI API with their shell tool. You can zip skills up and upload them …
)
This is a huge new MIT-licensed model: 754B parameters and 1.51TB on Hugging Face twice the size of GLM-4.7 which was 368B and 717GB (4.5 and 4.6 were around that …

Charles Leifer has been maintaining pysqlite3 - a fork of the Python standard library's sqlite3 module that makes it much easier to run upgraded SQLite versions - since 2018. He's …
New paper by Damon McMillan exploring challenging LLM context tasks involving large SQL schemas (up to 10,000 tables) across different models and file formats: Using SQL generation as a proxy …
Aruna Ranganathan and Xingqi Maggie Ye from Berkeley Haas School of Business report initial findings in the HBR from their April to December 2025 study of 200 employees at a …
People on the orange site are laughing at this, assuming it's just an ad and that there's nothing to it. Vulnerability researchers I talk to do not think this is …
Mitchell Hashimoto's new system to help address the deluge of worthless AI-generated PRs faced by open source projects now that the friction involved in contributing has dropped so low. He …
New "research preview" from Anthropic today: you can now access a faster version of their frontier model Claude Opus 4.6 by typing /fast in Claude Code... but at a cost …
I am having more fun programming than I ever have, because so many more of the programs I wish I could find the time to write actually exist. I wish …

Last week I hinted at a demo I had seen from a team implementing what Dan Shapiro called the Dark Factory level of AI adoption, where no human even looks …
I don't know why this week became the tipping point, but nearly every software engineer I've talked to is experiencing some degree of mental health crisis. [...] Many people assuming …
When I want to quickly implement a one-off experiment in a part of the codebase I am unfamiliar with, I get codex to do extensive due diligence. Codex explores relevant …
Some really good and unconventional tips in here for getting to a place with coding agents where they demonstrably improve your workflow and productivity. I particularly liked: Reproduce your own …

Two major new model releases today, within about 15 minutes of each other. Anthropic released Opus 4.6. Here's its pelican: OpenAI release GPT-5.3-Codex, albeit only via their Codex app, not …

Mistral just released Voxtral Transcribe 2 - a family of two new models, one open weights, for transcribing audio to text. This is the latest in their Whisper-like model family, …
I just sent the January edition of my sponsors-only monthly newsletter. If you are a sponsor (or if you start a sponsorship now) you can access it here. In the …
This is the difference between Data and a large language model, at least the ones operating right now. Data created art because he wanted to grow. He wanted to become …

OpenAI just released a new macOS app for their Codex coding agent. I've had a few days of preview access - it's a solid app that provides a nice UI …
I talked to Cade Metz for this New York Times piece on OpenClaw and Moltbook. Cade reached out after seeing my blog post about that from the other day. In …
I've been running OpenClaw using Docker on my Mac. Here are the first in my ongoing notes on how I set that up and the commands I'm using to administer …
Originally in 2019, GPT-2 was trained by OpenAI on 32 TPU v3 chips for 168 hours (7 days), with $8/hour/TPUv3 back then, for a total cost of approx. $43K. It …
Getting agents using Beads requires much less prompting, because Beads now has 4 months of “Desire Paths” design, which I’ve talked about before. Beads has evolved a very complex command-line …

The hottest project in AI right now is Clawdbot, renamed to Moltbot, renamed to OpenClaw. It’s an open source implementation of the digital personal assistant pattern, built by Peter Steinberger …
Chris Ashworth is the creator and CEO of QLab, a macOS software package for “cue-based, multimedia playback” which is designed automate lighting and audio for live theater productions. I recently …
New Datasette alpha this morning. Key new features: Datasette's Request object can now handle multipart/form-data file uploads via the new await request.form(files=True) method. I plan to use this for a …
Dan Shapiro proposes a five level model of AI-assisted programming, inspired by the five (or rather six, it's zero-indexed) levels of driving automation. Spicy autocomplete, aka original GitHub Copilot or …

embedding-shapes was so infuriated by the hype around Cursor's FastRender browser project - thousands of parallel agents producing ~1.6 million lines of Rust - that they were inspired to take …

Kimi K2 landed in July as a 1 trillion parameter open weight LLM. It was joined by Kimi K2 Thinking in November which added reasoning capabilities. Now they've made it …
Someone asked on Hacker News if I had any tips for getting coding agents to write decent quality tests. Here's what I said: I work in Python which helps a …

One of my favourite features of ChatGPT is its ability to write and execute code in a container. This feature launched as ChatGPT Code Interpreter nearly three years ago, was …

Paul Kinlan is a web platform developer advocate at Google and recently turned his attention to coding agents. He quickly identified the importance of a robust sandbox for agents to …

Jenny Wen, Design Lead at Anthropic (and previously Director of Design at Figma) gave a provocative keynote at Hatch Conference in Berlin last September. Jenny argues that the Design Process …
If you tell a friend they can now instantly create any app, they’ll probably say “Cool! Now I need to think of an idea.” Then they will forget about it, …
[...] i was too busy with work to read anything, so i asked chatgpt to summarize some books on state formation, and it suggested circumscription theory. there was already the …

I haven't been paying much attention to the state-of-the-art in speech generation models other than noting that they've got really good, so I can't speak for how notable this new …
Most people's mental model of Claude Code is that "it's just a TUI" but it should really be closer to "a small game engine". For each frame our pipeline constructs …
Late last year Richard Weiss found something interesting while poking around with the just-released Claude Opus 4.5: he was able to talk the model into regurgitating a document which was …
Previous work estimating the energy and water cost of LLMs has generally focused on the cost per prompt using a consumer-level system such as ChatGPT. Simon P. Couch notes that …
Detailed and thoughtful description of an open-book and open-chatbot exam run by Ploum at École Polytechnique de Louvain for an "Open Source Strategies" class. Students were told they could use …
Plenty of people have mused about what a new programming language specifically designed to be used by LLMs might look like. Jordan Hubbard (co-founder of FreeBSD, with serious stints at …

Wilson Lin at Cursor has been doing some experiments to see how far you can push a large fleet of "autonomous" coding agents: This post describes what we've learned from …
On 15th January Black Forest Labs, a lab formed by the creators of the original Stable Diffusion, released black-forest-labs/FLUX.2-klein-4B - an Apache 2.0 licensed 4 billion parameter version of their …
[On agents using CLI tools in place of REST APIs] To save on context window, yes, but moreso to improve accuracy and success rate when multiple tool calls are involved, …

The long-rumored introduction of ads to ChatGPT just became a whole lot more concrete: In the coming weeks, we’re also planning to start testing ads in the U.S. for the …
This is the standardization effort I've most wanted in the world of LLMs: a vendor-neutral specification for the JSON API that clients can use to talk to hosted LLMs. Open …
When we optimize responses using a reward model as a proxy for “goodness” in reinforcement learning, models sometimes learn to “hack” this proxy and output an answer that only “looks …
Claude Cowork defaults to allowing outbound HTTP traffic to only a specific list of domains, to help protect the user against prompt injection attacks that exfiltrate their data. Prompt Armor …
This is outstanding news, especially given our decision to withdraw from that NSF grant application back in October. We are thrilled to announce that Anthropic has entered into a two-year …
Classic prompt injection attack: When asked to summarize the user’s recent mail, a prompt injection in an untrusted email manipulated Superhuman AI to submit content from dozens of other sensitive …

New from Anthropic today is Claude Cowork, a “research preview” that they describe as “Claude Code for the rest of your work”. It’s currently available only to Max subscribers ($100 …
I'm glad someone was brave enough to say this. There is a lot of anti-AI sentiment in the software development community these days. Much of it is justified, but if …
Last month I wrote about porting JustHTML from Python to JavaScript using Codex CLI and GPT-5.2 in a few hours while also buying a Christmas tree and watching Knives Out …
Also note that the python visualizer tool has been basically written by vibe-coding. I know more about analog filters -- and that's not saying much -- than I do about …
Provocative experiment from Drew Breunig, who designed a new library for time formatting ("3 hours ago" kind of thing) called "whenwords" that has no code at all, just a carefully …

I joined a recording of the Oxide and Friends podcast on Tuesday to talk about 1, 3 and 6 year predictions for the tech industry. This is my second appearance …
I picked up a few interesting tidbits from this Wall Street Journal piece on Google's recent hard won success with Gemini. Here's the origin of the name "Nano Banana": Naina …
[...] the reality is that 75% of the people on our engineering team lost their jobs here yesterday because of the brutal impact AI has had on our business. And …
AGI is here! When exactly it arrived, we’ll never know; whether it was one company’s Pro or another company’s Pro Max (Eddie Bauer Edition) that tip-toed first across the line …
This guide to the current sandboxing landscape by Luis Cardoso is comprehensive, dense and absolutely fantastic. He starts by differentiating between containers (which share the host kernel), microVMs (their own …
I joined the Oxide and Friends podcast last year to predict the next 1, 3 and 6 years(!) of AI developments. With hindsight I did very badly, but they're inviting …
It genuinely feels to me like GPT-5.2 and Opus 4.5 in November represent an inflection point - one of those moments where the models get incrementally better in a way …
Something I like about our weird new LLM-assisted world is the number of people I know who are coding again, having mostly stopped as they moved into management roles or …
I'm not joking and this isn't funny. We have been trying to build distributed agent orchestrators at Google since last year. There are various options, not everyone is aligned... I …
Depending on how you measure it, the tempo of Harder, Better, Faster, Stronger appears to be 123.45 beats per minute. This is one of those things that's so cool I'm …
My experience is that real AI adoption on real problems is a complex blend of: domain context on the problem, domain experience with AI tooling, and old-fashioned IT issues. I’m …
I sent the December edition of my sponsors-only monthly newsletter. If you are a sponsor (or if you start a sponsorship now) you can access a copy here. In the …
[Claude Code] has the potential to transform all of tech. I also think we’re going to see a real split in the tech industry (and everywhere code is written) between …

This is the third in my annual series reviewing everything that happened in the LLM space over the past 12 months. For previous years see Stuff we figured out about …

It looks like OpenAI's Codex cloud (the cloud version of their Codex coding agent) was quietly rebranded to Codex web at some point in the last few days. Here's a …
[...] The puzzle is still there. What’s gone is the labor. I never enjoyed hitting keys, writing minimal repro cases with little insight, digging through debug logs, or trying to …
In essence a language model changes you from a programmer who writes lines of code, to a programmer that manages the context the model has access to, prunes irrelevant things, …
The hard part of computer programming isn't expressing what we want the machine to do in code. The hard part is turning human thinking -- with all its wooliness and …
Jevons paradox is coming to knowledge work. By making it far cheaper to take on any type of task that we can possibly imagine, we’re ultimately going to be doing …
Today in extremely niche projects, I got fed up of Claude Code creating GitHub Actions workflows for me that used stale actions: actions/setup-python@v4 when the latest is actions/setup-python@v6 for example. …

I just sent out the latest edition of the newsletter version of this blog. It's a long one! Turns out I wrote a lot of stuff in the past 10 …
In advocating for LLMs as useful and important technology despite how they're trained I'm beginning to feel a little bit like John Cena in Pluribus. Pluribus spoiler (episode 6) Given …
A year ago, Claude struggled to generate bash commands without escaping issues. It worked for seconds or minutes at a time. We saw early signs that it may become broadly …

Rob Pike (that Rob Pike) is furious. Here’s a Bluesky link for if you have an account there and a link to it in my thread viewer if you don’t. …

I’ve been having an absurd amount of fun recently using LLMs for cooking. I started out using them for basic recipes, but as I’ve grown more confident in their culinary …

I just had my first success using a browser agent - in this case the Claude in Chrome extension - to solve an actual problem. A while ago I set …
In 2025, Reinforcement Learning from Verifiable Rewards (RLVR) emerged as the de facto new major stage to add to this mix. By training LLMs against automatically verifiable rewards across a …
Sam Rose is one of my favorite authors of explorable interactive explanations - here's his previous collection. Sam joined ngrok in September as a developer educator. Here's his first big …

The latest in OpenAI's Codex family of models (not the same thing as their Codex CLI or Codex Cloud coding agent tools). GPT‑5.2-Codex is a version of GPT‑5.2 further optimized …
Anthropic have turned their skills mechanism into an "open standard", which I guess means it lives in an independent agentskills/agentskills GitHub repository now? I wouldn't be surprised to see this …
Mehmet Ince describes a very elegant chain of attacks against the PostHog analytics platform, combining several different vulnerabilities (now all reported and fixed) to achieve RCE - Remote Code Execution …
Anil Madhavapeddy is running an Advent of Agentic Humps this year, building a new useful OCaml library every day for most of December. Inspired by Emil Stenström's JustHTML and my …

It continues to be a busy December, if not quite as busy as last year. Today’s big news is Gemini 3 Flash, the latest in Google’s “Flash” line of faster …

OpenAI shipped an update to their ChatGPT Images feature - the feature that gained them 100 million new users in a week when they first launched it back in March, …
New release of my s3-credentials CLI tool for managing credentials needed to access just one S3 bucket. Here are the release notes in full: New commands get-bucket-policy and set-bucket-policy. #91 …
Oh, so we're seeing other people now? Fantastic. Let's see what the "competition" has to offer. I'm looking at these notes on manifest.json and content.js. The suggestion to remove scripting …
I’ve been watching junior developers use AI coding assistants well. Not vibe coding—not accepting whatever the AI spits out. Augmented coding: using AI to accelerate learning while maintaining quality. [...] …
Slop lost to "brain rot" for Oxford Word of the Year 2024 but it's finally made it this year thanks to Merriam-Webster! Merriam-Webster’s human editors have chosen slop as the …

I recently came across JustHTML, a new Python library for parsing HTML released by Emil Stenström. It’s a very interesting piece of software, both as a useful library and as …
Brian Merchant has been collecting personal stories for his series AI Killed My Job - previously covering tech workers, translators, and artists - and this latest piece includes anecdotes from …
If the part of programming you enjoy most is the physical act of writing code, then agents will feel beside the point. You’re already where you want to be, even …
How to use a skill (progressive disclosure): After deciding to use a skill, open its SKILL.md. Read only enough to follow the workflow. If SKILL.md points to extra folders such …

One of the things that most excited me about Anthropic’s new Skills mechanism back in October is how easy it looked for other platforms to implement. A skill is just …
I released a new version of my LLM Python library and CLI tool for interacting with Large Language Models. Highlights from the release notes: New OpenAI models: gpt-5.1, gpt-5.1-chat-latest, gpt-5.2 …

OpenAI reportedly declared a “code red” on the 1st of December in response to increasingly credible competition from the likes of Google’s Gemini 3. It’s less than two weeks later …
This thought-provoking essay from Johann Rehberger directly addresses something that I’ve been worrying about for quite a while: in the absence of any headline-grabbing examples of prompt injection vulnerabilities causing …

I've never been particularly invested dark v.s. light mode but I get enough people complaining that this site is "blinding" that I decided to see if Claude Code for web …

Two new models from Mistral today: Devstral 2 and Devstral Small 2 - both focused on powering coding agents such as Mistral's newly released Mistral Vibe which I wrote about …

I talked to Brendan Samek about Canada Spends, a project from Build Canada that makes Canadian government financial data accessible and explorable using a combination of Datasette, a neat custom …
Announced today as a new foundation under the parent umbrella of the Linux Foundation (see also the OpenJS Foundation, Cloud Native Computing Foundation, OpenSSF and many more). The AAIF was …

Here's the Apache 2.0 licensed source code for Mistral's new "Vibe" CLI coding agent, released today alongside Devstral 2. It's a neat implementation of the now standard terminal coding agent …
I found the problem and it's really bad. Looking at your log, here's the catastrophic command that was run: rm -rf tests/ patches/ plan/ ~/ See that ~/ at the …
Martin Kleppmann makes the case for formal verification languages (things like Dafny, Nagini, and Verus) to finally start achieving more mainstream usage. Code generated by LLMs can benefit enormously from …
Now I want to talk about how they're selling AI. The growth narrative of AI is that AI will disrupt labor markets. I use "disrupt" here in its most disreputable, …
Thoughtful guidance from Bryan Cantrill, who evaluates applications of LLMs against Oxide's core values of responsibility, rigor, empathy, teamwork, and urgency.
What to try first? Run Claude Code in a repo (whether you know it well or not) and ask a question about how something works. You'll see how it looks …
Chris Lewis decompiles N64 games. He wrote about this previously in Using Coding Agents to Decompile Nintendo 64 Games, describing his efforts to decompile Snowboard Kids 2 (released in 1999) …
If you work slowly, you will be more likely to stick with your slightly obsolete work. You know that professor who spent seven years preparing lecture notes twenty years ago? …
Launched today at WIRED’s The Big Interview event, this manifesto (of which I'm a founding signatory) pushes for a positive framework for thinking about building hyper-personalized AI-powered software. This part …
Anthropic just acquired the company behind the Bun JavaScript runtime, which they adopted for Claude Code just in July. Their announcement includes an impressive revenue update on Claude Code: In …

Four new models from Mistral today: three in their "Ministral" smaller model series (14B, 8B, and 3B) and a new Mistral Large 3 MoE model with 675B parameters, 41B active. …
Richard Weiss managed to get Claude 4.5 Opus to spit out this 14,000 token document which Claude called the "Soul overview". Richard says: While extracting Claude 4.5 Opus' system message …

Two new open weight (MIT licensed) models from DeepSeek today: DeepSeek-V3.2 and DeepSeek-V3.2-Speciale, both 690GB, 685B parameters. Here's the PDF tech report. DeepSeek-V3.2 is DeepSeek's new flagship model, now running …
I just send out the November edition of my sponsors-only monthly newsletter. If you are a sponsor (or if you start a sponsorship now) you can access a copy here. …
I am increasingly worried about AI in the video game space in general. [...] I'm not sure that the CEOs and the people making the decisions at these sorts of …
It's ChatGPT's third birthday today. It's fun looking back at Sam Altman's low key announcement thread from November 30th 2022: today we launched ChatGPT. try talking with it here: chat.openai.com …
Matt Webb coins the term context plumbing to describe the kind of engineering needed to feed agents the right context at the right time: Context appears at disparate sources, by …
Large language models (LLMs) can be useful tools, but they are not good at creating entirely new Wikipedia articles. Large language models should not be used to generate new Wikipedia …
In June 2025 Sam Altman claimed about ChatGPT that "the average query uses about 0.34 watt-hours". In March 2020 George Kamiya of the International Energy Agency estimated that "streaming a …

I've been having a lot of fun hacking on my Bluesky Thread Viewer JavaScript tool with Claude Code recently. Here it renders a thread (complete with demo video) talking about …
To evaluate the model’s capability in processing long-context inputs, we construct a video “Needle-in- a-Haystack” evaluation on Qwen3-VL-235B-A22B-Instruct. In this task, a semantically salient “needle” frame—containing critical visual evidence—is inserted …
New on Hugging Face, a specialist mathematical reasoning LLM from DeepSeek. This is their entry in the space previously dominated by proprietary models from OpenAI and Google DeepMind, both of …
PromptArmor demonstrate a concerning prompt injection chain in Google's new Antigravity IDE: In this attack chain, we illustrate that a poisoned web source (an integration guide) can manipulate Gemini into …
Substantial LLVM contribution from Trail of Bits. Timing attacks against cryptography algorithms are a gnarly problem: if an attacker can precisely time a cryptographic algorithm they can often derive details …
New plugin release adding support for Claude Opus 4.5, including the new thinking_effort option: llm install -U llm-anthropic llm -m claude-opus-4.5 -o thinking_effort low 'muse on pelicans' This took longer …

Here's a delightful project by Tom Gally, inspired by my pelican SVG benchmark. He asked Claude to help create more prompts of the form Generate an SVG of [A] [doing] …
If the person is unnecessarily rude, mean, or insulting to Claude, Claude doesn't need to apologize and can insist on kindness and dignity from the person it’s talking with. Even …

Anthropic released Claude Opus 4.5 this morning, which they call “best model in the world for coding, agents, and computer use”. This is their attempt to retake the crown for …
Armin Ronacher presents a cornucopia of lessons learned from building agents over the past few months. There are several agent abstraction libraries available now (my own LLM library is edging …

Olmo is the LLM series from Ai2—the Allen institute for AI. Unlike most open weight models these are notable for including the full training data, training process and checkpoints along …

Hot on the heels of Tuesday’s Gemini 3 Pro release, today it’s Nano Banana Pro, also known as Gemini 3 Pro Image. I’ve had a few days of preview access …
Previously, when malware developers wanted to go and monetize their exploits, they would do exactly one thing: encrypt every file on a person's computer and request a ransome to decrypt …

Hot on the heels of yesterday's Gemini 3 Pro release comes a new model from OpenAI called GPT-5.1-Codex-Max. (Remember when GPT-5 was meant to bring in a new era of …
New release of my LLM plugin for Google's Gemini models: Support for nested schemas in Pydantic, thanks Bill Pugh. #107 Now tests against Python 3.14. Support for YouTube URLs as …

Inspired by this conversation on Hacker News I decided to upgrade MacWhisper to try out NVIDIA Parakeet and the new Automatic Speaker Recognition feature. It appears to work really well! …

Google's other major release today to accompany Gemini 3 Pro. At first glance Antigravity is yet another VS Code fork Cursor clone - it's a desktop application you install that …
Three years ago, we were impressed that a machine could write a poem about otters. Less than 1,000 days later, I am debating statistical methodology with an agent that built …

Google released Gemini 3 Pro today. Here’s the announcement from Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu, their developer blog announcement from Logan Kilpatrick, the Gemini 3 Pro Model Card, …
Nolan Lawson asks if LLM assistance means that the category of tiny open source libraries like his own blob-util is destined to fade away. Why take on additional supply chain …
With AI now, we are able to write new programs that we could never hope to write by hand before. We do it by specifying objectives (e.g. classification accuracy, reward …
New release of my llm-anthropic plugin: Support for Claude's new structured outputs feature for Sonnet 4.5 and Opus 4.1. #54 Support for the web search tool using -o web_search 1 …
Neat MLX project by Senstella bringing NVIDIA's Parakeet ASR (Automatic Speech Recognition, like Whisper) model to to Apple's MLX framework. It's packaged as a Python CLI tool, so you can …
I was confused about whether the new "adaptive thinking" feature of GPT-5.1 meant they were moving away from the "router" mechanism where GPT-5 in ChatGPT automatically selected a model for …

OpenAI announced GPT-5.1 yesterday, calling it a smarter, more conversational ChatGPT. Today they've added it to their API. We actually got four new models today: gpt-5.1 gpt-5.1-chat-latest gpt-5.1-codex gpt-5.1-codex-mini There …

Max Woolf provides an exceptional deep dive into Google's Nano Banana aka Gemini 2.5 Flash Image model, still the best available image manipulation LLM tool three months after its initial …
On Monday, this Court entered an order requiring OpenAI to hand over to the New York Times and its co-plaintiffs 20 million ChatGPT user conversations [...] OpenAI is unaware of …

Almost every time I share a new example of an SVG of a pelican riding a bicycle a variant of this question pops up: how do you know the labs …
The fact that MCP is a difference surface from your normal API allows you to ship MUCH faster to MCP. This has been unlocked by inference at runtime Normal APIs …

Robert Glaser took my pelican riding a bicycle benchmark and applied an agentic loop to it, seeing if vision models could draw a better pelican if they got the chance …

I've been upgrading a ton of Datasette plugins recently for compatibility with the Datasette 1.0a20 release from last week - 35 so far. A lot of the work is very …
Netflix asks partners to consider the following guiding principles before leveraging GenAI in any creative workflow: The outputs do not replicate or substantially recreate identifiable characteristics of unowned or copyrighted …

beetle_b ran this prompt against a bunch of recent LLMs: Write a POV-Ray file that shows a pelican riding on a bicycle. This turns out to be a harder challenge …

OpenAI partially released a new model yesterday called GPT-5-Codex-Mini, which they describe as "a more compact and cost-efficient version of GPT-5-Codex". It’s currently only available via their Codex CLI tool …
The big advantage of MCP over OpenAPI is that it is very clear about auth. [...] Maybe an agent could read the docs and write code to auth. But we …
I have AiDHD It has never been easier to build an MVP and in turn, it has never been harder to keep focus. When new features always feel like they're …