Diffbot

Company type Private company
Industry Internet
Founder Mike Tung
Headquarters U.S.
Area served Worldwide
Key people Mike Tung (CEO)
Services Web APIs, Enterprise Search, Web Scraping, Web Crawling
Website www.diffbot.com

Diffbot develops machine learning and computer vision algorithms, along with public APIs, for extracting data from web pages (web scraping) in order to build a knowledge base.

The company has attracted interest for applying computer vision technology to web pages: it visually parses a page for important elements and returns them in a structured format. [1] In 2015, Diffbot announced that it was building its own automated "Knowledge Graph" by crawling the web and using its automatic web page extraction to assemble a large database of structured web data. [2] In 2019, Diffbot released its Knowledge Graph, which has since grown to include over 2 billion entities (corporations, people, articles, products, discussions, and more) and 10 trillion "facts".

The company's products allow software developers to analyze web home pages and article pages, [3] and extract the "important information" while ignoring elements deemed not core to the primary content. [4]
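As a rough illustration of what returning a page "in a structured format" can look like from a developer's side, the sketch below builds a request URL for a hypothetical extraction endpoint and keeps only the core fields of a JSON response. The endpoint URL, the parameter names (`token`, `url`), and the response fields (`objects`, `title`, `text`) are assumptions made for this example and are not taken from the cited sources.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint: a placeholder, not a documented URL.
EXTRACT_ENDPOINT = "https://api.example.com/v1/article"


def build_extract_url(page_url: str, token: str) -> str:
    """Return the request URL for extracting one web page."""
    query = urlencode({"token": token, "url": page_url})
    return f"{EXTRACT_ENDPOINT}?{query}"


def main_content(response_body: str) -> dict:
    """Keep only the core fields of the first extracted object.

    Assumes a response shaped like {"objects": [{"title": ..., "text": ...}]}.
    """
    payload = json.loads(response_body)
    objects = payload.get("objects", [])
    if not objects:
        return {}
    first = objects[0]
    return {"title": first.get("title"), "text": first.get("text")}


# Made-up response used only to exercise the parser.
sample = json.dumps({
    "objects": [
        {"title": "Example headline", "text": "Body text.", "sidebar": "ignored"}
    ]
})
print(build_extract_url("https://example.com/story", "YOUR_TOKEN"))
print(main_content(sample))
```

Elements that are not part of the primary content (the `sidebar` field in the made-up sample) are simply dropped by `main_content`, mirroring the idea of ignoring non-core page elements.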

In August 2012, the company released its Page Classifier API, which automatically categorizes web pages into specific "page types". [5] As part of the launch, Diffbot analyzed 750,000 web pages shared on the social media service Twitter and found that photos, followed by articles and videos, were the predominant web media shared on the social network. [6]
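A ranking of this kind amounts to tallying the page type assigned to each shared link. The sketch below shows that tally on a handful of invented (URL, page type) pairs; the labels, URLs, and the `rank_page_types` helper are hypothetical and do not reproduce the study's data or the API's actual output.

```python
from collections import Counter

# Hypothetical classifier output: (shared URL, page type) pairs.
# The URLs and labels are invented for illustration, not real results.
classified = [
    ("https://example.com/photo-1", "image"),
    ("https://example.com/story-1", "article"),
    ("https://example.com/photo-2", "image"),
    ("https://example.com/clip-1", "video"),
    ("https://example.com/story-2", "article"),
    ("https://example.com/photo-3", "image"),
]


def rank_page_types(pairs):
    """Count how often each page type occurs, most frequent first."""
    counts = Counter(page_type for _, page_type in pairs)
    return counts.most_common()


for page_type, count in rank_page_types(classified):
    print(f"{page_type}: {count}")
```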

In September 2020, the company released a Natural Language Processing API for automatically building knowledge graphs from text. [7] [8]

The company raised $2 million in funding in May 2012 from investors including Andy Bechtolsheim and Sky Dayton. [9]

Diffbot's customers include Adobe, AOL, Cisco, DuckDuckGo, eBay, Instapaper, Microsoft, Onswipe and Springpad. [4] [5] [10]

References

  1. "Diffbot Lets Developers Navigate Code the Way Our Eyes See the World". The Next Web. August 25, 2011. Retrieved April 21, 2013.
  2. "Startup Unleashes Its Clone of Google's 'Knowledge Graph'". Wired. June 4, 2015. Retrieved June 15, 2015.
  3. "Diffbot Helps Apps Read the Web Like Humans". GigaOm. August 25, 2011. Retrieved March 14, 2013.
  4. "Investors Back Diffbot's Visual Learning Robot for Web Content". The Wall Street Journal. May 31, 2012. Retrieved March 14, 2013.
  5. "DiffBot's new API brilliantly reveals what's hiding behind any link". August 16, 2012. Retrieved March 14, 2013.
  6. "Twitter: A Day in the Life". Mashable. August 16, 2012. Retrieved March 14, 2013.
  7. "New AI Tool Maps the Families of the Bible, A Song of Ice and Fire". Datanami. September 17, 2020. Retrieved June 8, 2022.
  8. Peter, Alex. "Web Scraping". Retrieved March 28, 2021.
  9. "Diffbot raises $2 million to help apps understand the open, unstructured web". The Verge. May 31, 2012. Retrieved March 14, 2013.
  10. "Diffbot Bests Google's Knowledge Graph To Feed The Need For Structured Data". Forbes. June 4, 2015. Retrieved June 15, 2015.