Diffbot is a developer of machine learning and computer vision algorithms and public APIs for extracting data from web pages (web scraping) to create a knowledge base.
The company has gained interest from its application of computer vision technology to web pages, wherein it visually parses a web page for important elements and returns them in a structured format.
In 2015 Diffbot announced it was working on its version of an automated "Knowledge Graph" by crawling the web and using its automatic web page extraction to build a large database of structured web data.
In 2019 Diffbot released its Knowledge Graph, which has since grown to include over 2 billion entities (corporations, people, articles, products, discussions, and more) and 10 trillion "facts."
The company's products allow software developers to analyze web home pages and article pages and extract the "important information" while ignoring elements deemed not core to the primary content.
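As a rough illustration of what requesting such an extraction might look like, the following minimal Python sketch builds a request URL and reduces a response to a few fields. The endpoint path, the `token` and `url` parameter names, and the `objects`, `title`, `author` and `text` response fields are assumptions made for the example and should be checked against Diffbot's current API documentation; the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against Diffbot's current API documentation.
ARTICLE_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_extraction_url(page_url: str, token: str) -> str:
    """Return a GET URL asking the (assumed) article endpoint to extract page_url."""
    # urlencode percent-encodes the target URL so it survives as a query value.
    query = urlencode({"token": token, "url": page_url})
    return f"{ARTICLE_ENDPOINT}?{query}"


def summarize(response: dict) -> dict:
    """Keep a few fields from a response shaped like the assumed JSON layout."""
    # Fall back to one empty object when "objects" is missing or empty.
    objects = response.get("objects") or [{}]
    first = objects[0]
    return {key: first.get(key) for key in ("title", "author", "text")}
```

The sketch performs no network request; sending the URL and parsing the JSON reply is left to whichever HTTP client the developer prefers.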
In August 2012 the company released its Page Classifier API, which automatically categorizes web pages into specific "page types".
As part of this, Diffbot analyzed 750,000 web pages shared on the social media service Twitter and revealed that photos, followed by articles and videos, are the predominant web media shared on the social network.
In September 2020 the company released a Natural Language Processing API for automatically building Knowledge Graphs from text.
The company raised $2 million in funding in May 2012 from investors including Andy Bechtolsheim and Sky Dayton.
Diffbot's customers include Adobe, AOL, Cisco, DuckDuckGo, eBay, Instapaper, Microsoft, Onswipe and Springpad.
See also
* GPT-3