Your Content Has Already Been Scraped. The AI Blind Spot Exposed by Cookpad's Backlash
Chefs raged at Cookpad for scraping recipes. But AI companies have been doing the same at far greater scale. A technical investigation reveals the blind spot of the AI era.
Column
kkm
Backend Engineer / AWS / Django
Cookpad's new feature "Recipe Scrap" sparked a massive backlash. The feature uses AI to automatically extract recipes from external websites and save them within the Cookpad app. Food researchers and recipe creators, led by the popular cooking influencer Ryuji, pushed back in unison, saying "This shows zero respect for recipe creators."
The criticism boils down to one simple point: "Don't take other people's content and turn it into value for your own platform."
This anger is justified. But there's another entity doing the exact same thing on a far larger scale, far more aggressively. And almost nobody is angry at them.
Something Far Bigger Than Cookpad Is Happening Every Day
Cookpad's Recipe Scrap works like this: a user enters a URL, AI extracts the ingredients and steps, and saves them within the app. A link back to the original post is included.
Meanwhile, AI companies like OpenAI, Google, and Anthropic use crawlers to automatically collect every kind of content on the internet as training data for their AI models. No link back. No notification.
| Category | Cookpad's Recipe Scrap | AI Companies' Training Data Collection |
|---|---|---|
| Collection Method | User enters URL → Server fetches content | Crawlers auto-scan → Mass collection |
| Scope | One page at a time, as specified by user | Virtually every page on the internet |
| Treatment of Original Content | Near-verbatim display with source link | Absorbed into model; no source link |
| Notification to Creators | None | None |
| Opt-out Mechanism | Unclear | Can block via robots.txt |
| Scale | 5 per week per user (free tier) | Trillions of pages |
This isn't just about scale. AI company crawling extends to every type of content imaginable, not just recipes.
- ▸ Written works: The New York Times sued OpenAI and Microsoft, alleging that its articles were used without permission to train ChatGPT. Author Sarah Silverman has also sued OpenAI and Meta
- ▸ Images: Getty Images sued Stability AI, claiming millions of stock photos were used without authorization to train image-generating AI
- ▸ Music: Major record labels are filing successive lawsuits against AI music generation services over unauthorized use of songs for training
- ▸ Code: Public repositories on GitHub were used as training data for GitHub Copilot. Developer Matthew Butterick filed a class action lawsuit against Microsoft and others
The recipes on Sirogohan.com, the subtitles from Ryuji's YouTube videos -- they're almost certainly included in LLM training data. And yet, there's no record of any food creator raising their voice about that.
I Checked the Technical Defenses on the Critics' Own Websites
Website owners do have a way to block AI crawlers. The standard method is to place a robots.txt file on your site specifying "this crawler is not allowed to access my content." Major AI crawlers -- OpenAI's GPTBot, Google's Google-Extended, Anthropic's ClaudeBot -- have all stated they comply with robots.txt.
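As a concrete sketch, a robots.txt that opts out of these crawlers might look like the following. The user-agent tokens are the ones each vendor has published; note that Google-Extended is an opt-out token for AI training rather than a separate crawler, and `Disallow: /` blocks the entire site:

```text
# Opt out of major AI training crawlers, site-wide
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /
```

This only works against crawlers that actually honor robots.txt, and it does nothing about data collected before the file was added.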
So what about the sites belonging to the food creators who criticized Cookpad? I checked them directly.
| Site | Owner | robots.txt | AI Crawler Blocked? | AI Training Explicitly Prohibited? |
|---|---|---|---|---|
| Sirogohan.com | Tadasuke Tomita | Does not exist (404 error) | No | No |
| bazurecipe.com | Ryuji | Exists, but only lists AmazonAdBot | No | No |
Sirogohan.com doesn't have a robots.txt file at all. Any crawler that visits the site can collect every page without restriction. Bazurecipe.com has a robots.txt, but it only contains a single line allowing AmazonAdBot. There's no mention whatsoever of GPTBot, ClaudeBot, Google-Extended, Bytespider, or any other AI crawlers.
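The check I ran is easy to reproduce. Here is a minimal sketch using Python's standard-library `urllib.robotparser`, fed the raw text of a robots.txt file (the `only_amazon` example mirrors a file that lists only AmazonAdBot; the site URL is a placeholder):

```python
from urllib import robotparser

AI_CRAWLERS = ["GPTBot", "ClaudeBot", "Google-Extended", "Bytespider"]

def ai_crawler_access(robots_txt: str, site: str = "https://example.com") -> dict:
    """Given raw robots.txt content, report whether each AI crawler
    may fetch the site's root page (True = allowed)."""
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{site}/robots.txt")
    rp.parse(robots_txt.splitlines())
    return {bot: rp.can_fetch(bot, f"{site}/") for bot in AI_CRAWLERS}

# A robots.txt that only mentions AmazonAdBot leaves every
# AI crawler fully allowed -- the rules simply don't apply to them.
only_amazon = "User-agent: AmazonAdBot\nDisallow: /private/\n"
print(ai_crawler_access(only_amazon))

# Explicitly blocking GPTBot changes only that one entry.
blocked = "User-agent: GPTBot\nDisallow: /\n"
print(ai_crawler_access(blocked))
```

A missing robots.txt (the 404 case) behaves the same as the first example: with no rules to apply, every crawler is allowed everywhere.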
Their terms of service and privacy policies also contain no language prohibiting the use of their data for AI training. Sirogohan.com does state "unauthorized reproduction or duplication is prohibited," but this wording was clearly not written with AI crawlers in mind.
This isn't about blaming them for not taking precautions. The point is this: they either don't realize their content is being used to train AI, or they do realize it but haven't taken concrete steps to prevent it -- and yet they're directing all their anger at Cookpad. It's the asymmetry that stands out.
The People Who Adapted Instead of Protesting
Facing the exact same structural problem, some chose not to protest but to adapt: software engineers.
In 2021, GitHub used source code from public repositories as training data to develop GitHub Copilot. Code written by engineers around the world was used, without their consent, to train a commercial AI product.
And that very Copilot began taking their jobs.
- ▸ Meta laid off 16,000 people and redirected $135 billion toward AI investment
- ▸ Atlassian said "we're hiring more engineers thanks to AI," then cut 900 engineers five months later
- ▸ System integrators started facing pushback: "AI can do this in 3 days for $5,000 -- why are you quoting 4 months and $80,000?"
- ▸ Junior engineer hiring has plummeted. The reason: "The work that new hires used to do? AI does it now"
Having your own code used to train an AI that then takes your job. This goes far beyond having a recipe scraped. Their very livelihoods are under direct threat.
Yet most engineers chose to move in the direction of mastering AI. They adopted Copilot, started writing code with ChatGPT, and began figuring out how to coexist with AI. Even those who lost their jobs started acquiring AI skills to find their next opportunity.
Engineers didn't choose adaptation because they were noble. By the time GitHub Copilot was announced, their code had already been consumed. Being angry was simply too late. In cooking terms: by the time they noticed, all their recipes had already been read, and new dishes made from them were being served right in front of them. The only option left was to use those same tools to cook something even better. It wasn't so much "acceptance" as "there was no other choice."
Why Could They Only Get Angry at Cookpad?
The answer is simple: Cookpad was a "visible enemy."
It's a Japanese-language service, run by a Japanese company, and you can visibly see your recipe being imported. They have an official X (Twitter) account you can reply to and get a response. Criticize them, and you'll get sympathy. Media will cover it.
OpenAI's crawlers, on the other hand, are invisible. There's no way to confirm whether your recipe was used to train GPT-5. There's nowhere to direct your anger. File a complaint in English with a San Francisco company, and you won't hear back.
We can get angry at "visible enemies" but not at "invisible" ones. The result: Cookpad -- the smaller, arguably more well-behaved party (they include links, it's personal bookmarking) -- takes the full brunt of the backlash, while AI companies continue their far larger and more aggressive training as if nothing happened.
The Anger Isn't Misdirected. It's Just Incomplete
To be clear, the criticism of Cookpad is justified. I also see problems with a design that scrapes other people's content to promote premium memberships.
But there's something I hope this incident helps people understand.
The content you've published on the web has already been used to train AI. Recipes, photos, blog posts, YouTube subtitles -- all of it. If you want to prevent that, you need to set up robots.txt, explicitly prohibit AI training in your terms of service, and take both technical and legal measures. Getting angry at Cookpad alone won't stop the much larger tide.
And here's another thing: even if you set up robots.txt today, you can't undo data that's already been consumed. Publishing content in the age of AI is increasingly something you have to do on the assumption that "someone will train on this."
Engineers accepted this reality and moved to the side of wielding AI as a tool. These are people whose own code was used to train an AI that then threatened their jobs -- and they still chose the path of adaptation.
As one of those engineers, let me say just one thing. Every time someone says "AI makes it so easy to build a website" or "programmers won't be needed anymore," our work gets chipped away a little more. Project estimates keep dropping, hiring freezes mount, and layoff headlines roll in month after month. Tech stacks we spent years building become obsolete overnight. Best practices we internalized get replaced by Copilot. It never stops. Yet here we are, using AI with one hand while running for our lives from the AI closing in behind us with the other. It's easy to imagine being overtaken before long. But we keep running, trying to stay ahead just a little longer.
So I truly understand why food creators are angry at Cookpad. The fury of having your content used without permission -- we know that feeling better than anyone. But I can't help feeling a twinge of loneliness that the outrage only surfaced "when their recipes were scraped." When our work was being replaced by AI day after day, we barely heard anyone speak up for us.
Getting angry and adapting -- neither alone is enough in this era. And that anger should extend beyond your own field to every creator and engineer affected by the same structural forces. That, I believe, is the true scale of this problem.
Sources
- Cookpad "Recipe Scrap" Controversy Timeline (This Site)
- The New York Times: "The Times Sues OpenAI and Microsoft Over A.I. Use of Copyrighted Work"
- GitHub Copilot Litigation
- Getty Images: "Statement"
- Ryuji's X Post
- Meta AI Investment and Layoffs (This Site)
- Atlassian 1,600 Layoffs (This Site)
- System Integrator Price Disruption (This Site)