| | |
|---|---|
| Type of business | 501(c)(3) non-profit |
| Founded | 2007 |
| Headquarters | San Francisco, California; Los Angeles, California, United States |
| Founder | Gil Elbaz |
| Key people | Peter Norvig, Rich Skrenta, Eva Ho |
| URL | commoncrawl.org |
| Content license | Apache 2.0 (software) |
Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. [1] [2]
Common Crawl was founded by Gil Elbaz. [1] [2] It is funded by the Elbaz Family Foundation Trust and significant donations from the AI industry. [3]
Contents archived by Common Crawl are mirrored [4] and made available online [5] in the Wayback Machine. They are used by researchers, as well as by AI companies to train large language models. [3]
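The archives themselves are directly accessible: each crawl publishes a CDX index that maps URLs to byte ranges inside gzipped WARC files. The following Python sketch is illustrative only; it assumes the third-party `requests` library, and `CC-MAIN-2024-33` stands in for whichever collection is current (collections are listed at index.commoncrawl.org):

```python
import gzip
import io
import json

import requests  # third-party: pip install requests

# Ask the public CDX index for captures of a URL. "CC-MAIN-2024-33" is
# one example collection; substitute any collection listed at
# https://index.commoncrawl.org/.
index = "https://index.commoncrawl.org/CC-MAIN-2024-33-index"
resp = requests.get(index, params={"url": "commoncrawl.org", "output": "json"})
resp.raise_for_status()

# The response is newline-delimited JSON, one object per capture.
capture = json.loads(resp.text.splitlines()[0])

# Fetch only that capture's bytes from the archive via an HTTP Range request.
offset, length = int(capture["offset"]), int(capture["length"])
raw = requests.get(
    "https://data.commoncrawl.org/" + capture["filename"],
    headers={"Range": f"bytes={offset}-{offset + length - 1}"},
).content

# Each slice is an independently gzip-compressed WARC record.
record = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
print(record[:400].decode("utf-8", errors="replace"))
```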
In November 2025, an investigation by The Atlantic reported that, contrary to its public claims, Common Crawl neither respected paywalls in its scraping nor honored publishers' requests to have their content removed from its databases. [6] [3]
Advisors to the non-profit have included Peter Norvig and Joi Ito. [7]
By 2013, sites such as TinEye were building their products on top of Common Crawl. [8]
As of 2016, the Common Crawl dataset includes copyrighted work and is distributed from the US under fair use claims. Researchers in other countries have used techniques such as shuffling sentences or referencing the Common Crawl dataset to work around copyright law in their own jurisdictions. [9]
A filtered version of Common Crawl was used to train OpenAI's GPT-3 language model, announced in 2020. [10] In 2023, the organization began receiving significant financial support from AI companies, including Anthropic and OpenAI, each of which donated $250,000. [3]
As of 2024, Common Crawl had been cited in more than 10,000 academic studies. [11]
In November 2025, an investigation by technology journalist Alex Reisner for The Atlantic revealed that, despite claiming to respect paywalls and to honor publishers' requests to have their content removed from its databases, Common Crawl did neither. [3] The public search function on its website misleadingly returned no entries for websites that had requested removal, even though those sites were still included in the scrapes supplied to AI companies. [3]
Google's version of the Common Crawl is called the Colossal Clean Crawled Corpus, or C4 for short. It was constructed in 2019 for the training of the T5 language model series. [12] There are concerns over copyrighted content in C4. [13] One study found that 45% of its content had since been explicitly restricted by websites that do not want it scraped without compensation for purposes such as AI training by for-profit companies. [11]
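C4 is publicly redistributed, for example through AllenAI's mirror on Hugging Face. A minimal sketch of inspecting it in Python, assuming the third-party `datasets` library and the `allenai/c4` mirror (both are assumptions here, not part of the article's sources):

```python
from datasets import load_dataset  # third-party: pip install datasets

# Stream the English configuration of the AllenAI-hosted C4 mirror rather
# than downloading the full corpus, which runs to hundreds of gigabytes.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, doc in enumerate(c4):
    # Each record carries the cleaned text plus its source URL and timestamp.
    print(doc["url"], "->", doc["text"][:80].replace("\n", " "))
    if i >= 2:
        break
```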