Distributed search engine

A distributed search engine is a search engine with no central server. Unlike in traditional centralized search engines, work such as crawling, data mining, indexing, and query processing is distributed among several peers in a decentralized manner, with no single point of control.
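As a rough illustration of how indexing and query processing can be split among peers, the following minimal Python sketch assigns each search term to one peer by hashing it. The `Peer` class, the `owner` function, and the simple modulo partitioning are assumptions made for this example; they do not describe any particular engine mentioned in this article.

```python
import hashlib
from collections import defaultdict


class Peer:
    """Hypothetical peer holding one shard of the inverted index."""

    def __init__(self, name):
        self.name = name
        self.postings = defaultdict(set)  # term -> set of document ids


def owner(term, peers):
    """Pick the peer responsible for a term by hashing the term."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return peers[int.from_bytes(digest[:8], "big") % len(peers)]


def index_document(doc_id, text, peers):
    """Send each distinct term of a document to the peer that owns it."""
    for term in set(text.lower().split()):
        owner(term, peers).postings[term].add(doc_id)


def search(query, peers):
    """Ask each term's owner for its postings and intersect the answers."""
    results = None
    for term in query.lower().split():
        docs = owner(term, peers).postings.get(term, set())
        results = set(docs) if results is None else results & docs
    return sorted(results or [])


# Four in-process peers stand in for machines on a network.
peers = [Peer(f"peer-{i}") for i in range(4)]
index_document("doc1", "distributed search engine", peers)
index_document("doc2", "centralized search engine", peers)
index_document("doc3", "distributed hash table", peers)

print(search("distributed search", peers))
```

No peer in this sketch holds the whole index, and a query only contacts the peers that own its terms. Modulo partitioning reassigns most terms whenever the number of peers changes, so it is used here only for brevity.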

History

Presearch

Started in 2017, Presearch is a search engine built around an ERC20 token (PRE) and powered by a distributed network of community-operated nodes that aggregate results from a variety of sources. This network serves the searches at presearch.com. It is planned as a precursor to a system in which the nodes collaborate on a global decentralised index. [1] Presearch averages 5 million searches per day and has 2.2 million registered users. On September 1, 2021, Presearch was added as a default option to the search engine list on Android in the EU. [2] On May 27, 2022, Presearch officially transitioned from its Testnet to a Mainnet, which means that all search traffic through the service now runs over Presearch's decentralized network of volunteer-run nodes. [3]

YaCy

On December 15, 2003, Michael Christen announced the development of a P2P-based search engine, eventually named YaCy, on the heise online forums. [4] [5]

Dews

Dews is a theoretical design for a distributed search engine discussed in the academic literature. [6]

Seeks

Seeks was an open-source web search proxy and collaborative distributed tool for web search. It has not had a usable release since 2016.

InfraSearch

In April 2000, several programmers (including Gene Kan and Steve Waterhouse) built a prototype P2P web search engine based on Gnutella called InfraSearch. The technology was later acquired by Sun Microsystems and incorporated into the JXTA project. [7] It was meant to run inside the participating websites' databases, creating a P2P network that could be accessed through the InfraSearch website. [8] [9] [10]

Opencola

On May 31, 2000, Steelbridge Inc. announced the development of OpenCOLA, a collaborative, distributed, open-source search engine. [11] It runs on the user's computer, crawls the web pages and links the user puts in their opencola folder, and shares the resulting index over its P2P network. [12]

Faroo

In February 2001, Wolf Garbe published the idea of a peer-to-peer search engine, [13] started work on the Faroo prototype in 2004, [14] and released it in 2005. [15] [16]

Goals

The goals of building a distributed search engine include:

1. to create an independent search engine powered by the community;

2. to make the search operation open and transparent by relying on open-source software;

3. to distribute the advertising revenue to node maintainers, which may help create more robust web infrastructure;

4. to allow researchers to contribute to the development of open-source and publicly-maintainable ranking algorithms and to oversee the training of the algorithm parameters.

Challenges

1. The amount of data to be processed is enormous. The size of the visible web is estimated at 5 PB spread across about 10 billion pages.

2. The latency of the distributed operation must be competitive with the latency of the commercial search engines.

3. A mechanism is needed that prevents malicious users from corrupting the distributed data structures or the ranking.
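To give a sense of scale for the first challenge, the quoted figures can be turned into a back-of-the-envelope estimate. The sketch below is only illustrative: the peer count and the replication factor are hypothetical values chosen for the example, decimal units are assumed (1 PB = 10^15 bytes), and the calculation treats the raw page data as the thing being stored rather than a compressed index.

```python
# Figures quoted in the first challenge.
WEB_SIZE_BYTES = 5 * 10**15   # 5 PB, assuming decimal petabytes
PAGE_COUNT = 10 * 10**9       # 10 billion pages


def average_page_size(total_bytes=WEB_SIZE_BYTES, pages=PAGE_COUNT):
    """Average number of bytes per page implied by the two figures."""
    return total_bytes / pages


def storage_per_peer(peer_count, replication=3, total_bytes=WEB_SIZE_BYTES):
    """Bytes each peer stores if the data is spread evenly and every
    item is kept on `replication` peers (both values are assumptions)."""
    return total_bytes * replication / peer_count


print(average_page_size() / 1e3)            # kilobytes per page
print(storage_per_peer(1_000_000) / 1e9)    # gigabytes per peer
```

Under these assumptions the average page is about 500 KB, and a network of one million peers that keeps three copies of everything would need about 15 GB per peer.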

References

  1. "Presearch is a Decentralized Search Engine".
  2. "Google Adds Presearch As A Default Option on Android Devices in EU". Search Engine Journal. 2021-09-01. Retrieved 2021-11-10.
  3. Kan, Michael (2022-05-26). "The Next Google? Decentralized Search Engine 'Presearch' Exits Testing Phase". PC Magazine.
  4. "YaCy: News". Archived from the original on 2005-11-24.
  5. Michael Christen. "Ich entwickle eine P2P-basierende Suchmaschine. Wer macht mit?". heise online.
  6. Ahmed, Reaz; Bari, Md. Faizul; Haque, Rakibul; Boutaba, Raouf; Mathieu, Bertrand (2014). "DEWS: A decentralized engine for Web search". 10th International Conference on Network and Service Management (CNSM) and Workshop. pp. 254–259. doi:10.1109/CNSM.2014.7014168. ISBN 978-3-901882-67-8. S2CID 1659299.
  7. Justin Hibbard. "Can peer-to-peer grow up?". Red Herring. [permanent dead link]
  8. Simon Foust. "Move Over Yahoo, Here Comes InfraSearch". Dmusic. Archived from the original on 2000-10-13.
  9. Sean M. Dugan. "Peer-to-peer networking is poised to revolutionize the Internet once again". InfoWorld. Archived from the original on 2000-10-18.
  10. John Borland. "Napster-like technology takes Web search to new level". Cnet.
  11. David Akin. "Software launched with a little pop". Financial Post. [dead link]
  12. Paul Heltzel. "OpenCola-Have Some Code and a Smile". Technology Review.
  13. Wolf Garbe. "BINGOOO - Die Transformation des World Wide Web zur virtuellen Datenbank" (in German). Wirtschaftinformatik. Archived from the original on 2014-02-02. Retrieved 2010-12-21. ... Wir setzen dem das Konzept einer verteilten Peer-to-Peer-Suchmaschine entgegen [We counter with the concept of a distributed peer-to-peer search engine] ...
  14. Bernard Lunn. "Technical Q&A With FAROO Founder". ReadWriteWeb. Archived from the original on 2011-02-14. ... When I started to work on the first prototype in 2004 ...
  15. "FAROO: History". Archived from the original on 2008-03-22.
  16. "Revisited: Deriving crawler start points from visited pages by monitoring HTTP traffic". Faroo.