2025-07-23 14:11:51 UTC
On July 1st, Matthew Prince, CEO of Cloudflare, published a blog post titled "Content Independence Day: no AI crawl without compensation!".
The most important part of the blog is this:
Cloudflare, along with a majority of the world's leading publishers and AI companies, is changing the default to block AI crawlers unless they pay creators for their content.
He announced that, starting July 1st, AI bots cannot crawl a site protected by Cloudflare unless the site owner explicitly allows them to.
The operative phrase is "changing the default". In June, OpenAI could crawl the neeto.com website, but in July, if OpenAI tries to crawl neeto.com, its bots will be blocked. They'll remain blocked unless I (the content owner) explicitly go and turn on the flag to allow AI bots to crawl the site.
However, Matthew made an exception: by default, Googlebot is allowed. Now, Google actually has two crawler tokens. Googlebot crawls the web for Google Search, while the Google-Extended token controls whether crawled content can be used for Gemini. First of all, Matthew is not blocking Google-Extended. Secondly, even if he did, it would matter little because, per Google's terms and conditions, the Googlebot crawl that powers their search shares data with Gemini.
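For context, Google-Extended is honored through robots.txt rather than being a separate bot you can firewall. A site that wanted Google Search indexing but not Gemini training could, in principle, publish rules like this (a sketch; the blanket Disallow is illustrative):

```
# Allow Google Search crawling
User-agent: Googlebot
Allow: /

# Opt out of use for Gemini/AI training via the Google-Extended token
User-agent: Google-Extended
Disallow: /
```

Note that opting out of Google-Extended doesn't stop Googlebot itself from crawling, which is exactly why the exception matters.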
So why did Matthew make an exception for Google? Well, in today's world, it's hard to survive if you block Google. If you block Google, then neither Google AI nor Google Search will find you, and that's not good for your business. Hence, Matthew made the exception.
This move by Matthew is obviously great news for Google. Google gets the data from creators, but other AI companies will not, because by default they are blocked. Now, tell me: how many site owners will log in to the Cloudflare dashboard and find the flag to allow AI bots?
People won't do it because there is no direct incentive.
In the old world (the pre-AI world), there was a give-and-take relationship with Google. Google would take data from neeto.com, and in this way Google became the aggregator of data. If I needed some information, I would go to Google. In return, Google would send visitors back to neeto.com via search result links.
The AI tools consume the data but return nothing. In fact, that's why a group of authors and publishers banded together and sued Anthropic a few months back.
Anthropic bought millions of physical books, stripped the bindings, cut the pages, scanned them into PDFs, and then fed those PDFs to its LLM. It also downloaded over 7 million pirated books from sources like LibGen, PiliMI, and the Books3 dataset.
We all know that books are copyrighted. I can't scan the pages of a book and start selling them; Amazon would ban me. The authors' and publishers' point was that what Anthropic was doing might not even be legal.
The case went before a judge. To me it seemed like a pretty open-and-shut case: Anthropic scanned the books without explicit permission from the authors, so they seem guilty. This looks like a clear case of harm done by AI.
Don't take my word for it. Listen to David Baldacci. He is one of the most renowned authors in the world. His 60+ novels have been translated into over 45 languages, published in over 80 countries, and have sold over 130 million copies worldwide. AI companies have crawled his books, and now people are using AI to publish books imitating his writing style without putting in any effort at all. The rate at which people are copying and publishing is such that some book vendors are placing a limit on the number of books one can publish in a week.
Obviously, they are not writing those books. They are using AI to copy David Baldacci's style and putting it on the store. David, in a recent testimony, said that he feels as if someone "backed up a truck to his imagination and stole everything he ever created. What AI does is take what authors produce and provide shortcuts".
You should listen to his full testimony. It's only 6 minutes, but it's valuable to hear from the people whose work is being stolen by AI.
If you were the judge, how would you rule? Before I tell you how the judge ruled, I need to explain the "fair use" doctrine outlined in the Copyright Act. The U.S. Copyright Office website has a few things to say about fair use.
The most important section over there is:
Additionally, "transformative" uses are more likely to be considered fair.
Transformative - that's the keyword Anthropic's lawyers were using. See, if I take an author's book and copy it word for word, then I'm violating copyright. However, if I learn from that book and then provide my own interpretation of what I have learned, that's not a violation of copyright; rather, I'm providing "transformed" information.
If you ask AI "how the atomic bomb was built," it would not copy word for word from any one book. Rather, it would give you an answer that is "transformed" from the various books written on that topic.
Now that you know about the "transformative" part of "fair use": if you were the judge, how would you rule? Personally, I'm very sympathetic to all the authors whose content was used to build AI, but the way copyright law is written, it allows "transformative" use. And that's what the judge said. The judge ruled that Anthropic is not violating the copyrights of the book authors.
That's a huge win not only for Anthropic but for the whole AI community.
So if I'm an author, what are my choices? Remember, previously the deal with Google was simple: using robots.txt, I would govern what Google could crawl, and in return Google would send me links. No need to involve the government. Google respected robots.txt, and everyone played by the rules.
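To make that honor system concrete, here is what such a robots.txt might look like today, extended with the AI-crawler user agents the major AI companies document (GPTBot for OpenAI, ClaudeBot for Anthropic, PerplexityBot for Perplexity); the exact rules are a sketch:

```
# Let traditional search crawlers in
User-agent: Googlebot
Allow: /

# Ask AI crawlers to stay out -- this is a request, not an enforcement mechanism
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

robots.txt is purely advisory: a crawler that chooses to ignore it faces no technical barrier.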
In mid-2024, some folks noticed that Perplexity AI was not honoring robots.txt. Perplexity publicly acknowledged that it was not honoring robots.txt, but its executives took the posture that if you want better search, you have to allow us. Since robots.txt is not a government-mandated law, no one can do anything about it. Perplexity kept crawling sites that had banned it in robots.txt even after the matter was brought to the public's attention.
As you can see, Google had set up an honor system and everything worked fine. Then these AI cowboys came along and decided that the old norms did not apply to them.
Publishers and content creators were getting frustrated. Their content was being scraped, and while these AI companies are worth billions, the content authors don't reap any benefits.
So that's the current environment. Matthew thinks he can be the savior of these content creators. Cloudflare today handles about 20% of web traffic, and it may soon handle 30% or more. Matthew came up with this default-ban idea so that content creators who use Cloudflare are protected.
Matthew is also allowing content creators to set a price for crawling. This way, the more the AI companies crawl, the more they pay. To me that seems fair. After all, these AI companies are benefiting from the content while bringing no benefits to the content creators.
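Cloudflare's pay-per-crawl announcement describes building this on the long-reserved HTTP 402 Payment Required status code. A hypothetical exchange might look like the following (the bot name and the price header are illustrative assumptions on my part, not Cloudflare's actual API):

```
# Crawler requests a page (hypothetical bot name)
GET /articles/latest HTTP/1.1
Host: example-publisher.com
User-Agent: ExampleAIBot/1.0

# The edge responds with a price instead of the content
# ("crawler-price" is an illustrative header name)
HTTP/1.1 402 Payment Required
crawler-price: USD 0.01
```

A crawler that agrees to pay would retry with payment credentials attached; one that doesn't simply never sees the content.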
Before I finish this long article, I have a thought: OpenAI should buy Cloudflare. By default, Cloudflare has blocked all AI bots except Googlebot. If OpenAI buys Cloudflare, then obviously OpenAI would be able to get new content, and it could starve the other AI companies of that content.
OpenAI is fighting on multiple fronts. It's competing with Google Search, Microsoft's Bing, and other AI companies. Buying Cloudflare won't help in the fight against Google, but it would help cut off content for Bing and the other AI companies. I'm assuming that in the future Microsoft and OpenAI will not be partners.
Or perhaps Microsoft should acquire Cloudflare and cut off its supply to all AI companies, including OpenAI.