Common Crawl

Type of business: 501(c)(3) non-profit
Founded: 2007
Headquarters: San Francisco, California; Los Angeles, California, United States
Founder: Gil Elbaz
Key people: Peter Norvig, Rich Skrenta, Eva Ho
URL: commoncrawl.org
Content license: Apache 2.0 (software)

Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. [1] [2]

Common Crawl was founded by Gil Elbaz. [1] [2] It is funded by the Elbaz Family Foundation Trust and significant donations from the AI industry. [3]

Contents archived by Common Crawl are mirrored [4] and made available online [5] in the Wayback Machine. They are used by researchers, as well as by AI companies to train large language models. [3]
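
Common Crawl also publishes a URL index for each crawl, so individual captures can be located and fetched without downloading whole archives. The following is a minimal sketch in Python, assuming the publicly documented CDX index API at index.commoncrawl.org; the crawl label CC-MAIN-2024-33 and the queried URL are examples only, and the available crawl labels are listed on that site.

    import json
    import requests

    # Query the Common Crawl CDX index for captures of a URL.
    # "CC-MAIN-2024-33" is one example crawl label; see
    # https://index.commoncrawl.org/ for the list of available crawls.
    INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-33-index"
    resp = requests.get(
        INDEX,
        params={"url": "commoncrawl.org", "output": "json"},
        timeout=30,
    )
    resp.raise_for_status()

    # The index returns one JSON object per line, one per capture.
    record = json.loads(resp.text.splitlines()[0])

    # Each record points at a byte range inside a WARC file, which can be
    # fetched from the public data host with an HTTP Range request.
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    warc = requests.get(
        "https://data.commoncrawl.org/" + record["filename"],
        headers={"Range": f"bytes={start}-{end}"},
        timeout=30,
    )
    print(record["url"], record["status"], len(warc.content), "bytes (gzipped WARC record)")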

In November 2025, an investigation by The Atlantic revealed that Common Crawl had lied in claiming that its scraping respected paywalls and that it honored publishers' requests to remove their content from its databases. [6] [3]

History

Advisors to the non-profit have included Peter Norvig and Joi Ito. [7]

By 2013, companies such as TinEye were building their products on top of Common Crawl. [8]

As of 2016, the Common Crawl dataset included copyrighted work and was distributed from the US under fair-use claims. Researchers in other jurisdictions have used techniques such as shuffling sentences, or referencing the Common Crawl dataset rather than redistributing its text, to work around copyright law in their own legal jurisdictions. [9]

A filtered version of Common Crawl was used to train OpenAI's GPT-3 language model, announced in 2020. [10] In 2023, Common Crawl began receiving significant financial support from AI companies, including Anthropic and OpenAI, each of which donated $250,000. [3]

As of 2024, Common Crawl had been cited in more than 10,000 academic studies. [11]

In November 2025, an investigation by technology journalist Alex Reisner for The Atlantic found that Common Crawl had lied in claiming that its scraping respected paywalls and that it honored publishers' requests to remove their content from its databases. [3] The public search function on Common Crawl's website misleadingly showed no entries for websites that had requested removal, even though those sites were still included in the scrapes used by AI companies. [3]

Colossal Clean Crawled Corpus

Google's version of Common Crawl is the Colossal Clean Crawled Corpus, or C4 for short, constructed in 2019 for training the T5 series of language models. [12] Concerns have been raised over copyrighted content in C4. [13] A 2024 study found that 45% of its content had since been explicitly restricted by websites that do not want to be scraped without compensation for purposes such as AI training by for-profit companies. [11]
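
Such opt-outs are typically expressed in a site's robots.txt file, addressed to crawler user agents such as CCBot, the agent Common Crawl's crawler identifies itself as. A minimal sketch using Python's standard urllib.robotparser, with www.example.com as a placeholder site:

    from urllib.robotparser import RobotFileParser

    # Fetch and parse the site's robots.txt (www.example.com is a placeholder).
    rp = RobotFileParser("https://www.example.com/robots.txt")
    rp.read()

    # CCBot is the user agent Common Crawl's crawler identifies itself as;
    # "*" is the catch-all rule applied to any crawler not named explicitly.
    for agent in ("CCBot", "*"):
        url = "https://www.example.com/articles/some-page"
        verdict = "allowed" if rp.can_fetch(agent, url) else "disallowed"
        print(f"{agent}: {verdict}")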

References

  1. Rosanna Xia (February 5, 2012). "Tech entrepreneur Gil Elbaz made it big in L.A." Los Angeles Times. Retrieved July 31, 2014.
  2. "Gil Elbaz and Common Crawl". NBC News. April 4, 2013. Retrieved July 31, 2014.
  3. Reisner, Alex (November 4, 2025). "The Company Quietly Funneling Paywalled Articles to AI Developers". The Atlantic. Retrieved November 14, 2025.
  4. Leetaru, Kalev (January 28, 2016). "The Internet Archive Turns 20: A Behind the Scenes Look at Archiving the Web". Forbes (Contributor). Archived from the original on October 16, 2017. Retrieved October 16, 2017.
  5. "Internet Archive: Digital Library of Free & Borrowable Texts, Movies, Music & Wayback Machine". archive.org. Retrieved May 26, 2025.
  6. Knibbs, Kate. "Publishers Target Common Crawl In Fight Over AI Training Data". Wired. ISSN 1059-1028. Retrieved December 10, 2025.
  7. Tom Simonite (January 23, 2013). "A Free Database of the Entire Web May Spawn the Next Google". MIT Technology Review. Archived from the original on June 26, 2014. Retrieved July 31, 2014.
  8. Brandom, Russell (March 1, 2013). "Common Crawl: going after Google on a non-profit budget". The Verge. Retrieved December 10, 2025.
  9. Schäfer, Roland (May 2016). "CommonCOW: Massively Huge Web Corpora from CommonCrawl Data and a Method to Distribute them Freely under Restrictive EU Copyright Laws". Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). Portorož, Slovenia: European Language Resources Association (ELRA): 4501.
  10. Brown, Tom; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini (June 1, 2020). "Language Models Are Few-Shot Learners". p. 14. arXiv:2005.14165 [cs.CL]. "The majority of our data is derived from raw Common Crawl with only quality-based filtering."
  11. Roose, Kevin (July 19, 2024). "The Data That Powers A.I. Is Disappearing Fast". The New York Times.
  12. Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". Journal of Machine Learning Research. 21 (140): 1–67. arXiv:1910.10683. ISSN 1533-7928.
  13. Hern, Alex (April 20, 2023). "Fresh concerns raised over sources of training material for AI systems". The Guardian. ISSN 0261-3077. Retrieved April 21, 2023.