Privacy in file sharing networks

Peer-to-peer (P2P) file sharing systems such as Gnutella, KaZaA, and eDonkey/eMule have become extremely popular in recent years, with an estimated user population in the millions. An academic research paper analyzed the Gnutella and eMule protocols and found weaknesses in both; many of the issues found in these networks are fundamental and probably common to other P2P networks. [1] Users of file sharing networks such as eMule and Gnutella are subject to monitoring of their activity. Clients may be tracked by IP address, DNS name, the software version they use, the files they share, the queries they initiate, and the queries they answer. [1] Clients may also unknowingly share private files with the network because of inappropriate settings. [2]

Much is known about the network structure, routing schemes, performance load and fault tolerance of P2P systems in general. [3] Perhaps surprisingly, the eMule protocol provides little privacy to its users, even though it is a P2P protocol and is therefore expected to be decentralized. [4]

The Gnutella and eMule protocols

The eMule protocol

eMule is one of the clients that implement the eDonkey network. The eMule protocol consists of more than 75 types of messages. When an eMule client connects to the network, it first obtains a list of known eMule servers, which is available on the Internet. Although there are millions of eMule clients, there are only a small number of servers. [5] [6] The client connects to a server over a TCP connection, which stays open as long as the client is connected to the network. Upon connecting, the client sends the server a list of its shared files, from which the server builds a database of the files that reside on this client. [7] The server also returns a list of other known servers, and it assigns the client an ID, which is a unique client identifier within the system. The server can only generate query replies for clients that are directly connected to it. A download is performed by dividing the file into parts and requesting a part from each client.[ citation needed ]
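The server-side bookkeeping described above can be sketched as a toy in-memory model. This is not the eMule wire protocol: the class `ToyServer`, its method names and the sample addresses are hypothetical, chosen only to show why a server that indexes file lists ends up knowing which client shares which file.

```python
from collections import defaultdict


class ToyServer:
    """Toy model of an index server (hypothetical, not the real protocol)."""

    def __init__(self):
        self._next_id = 1
        self.clients = {}               # client ID -> IP address
        self.index = defaultdict(set)   # file name -> IDs of clients sharing it

    def connect(self, ip, shared_files):
        """Register a client: assign an ID and index every file it offers."""
        client_id = self._next_id
        self._next_id += 1
        self.clients[client_id] = ip
        for name in shared_files:
            self.index[name].add(client_id)
        return client_id

    def search(self, name):
        """Return the IP addresses of connected clients that share `name`."""
        return sorted(self.clients[cid] for cid in self.index.get(name, ()))

    def files_of(self, client_id):
        """Everything the server knows a given client shares."""
        return sorted(n for n, ids in self.index.items() if client_id in ids)


server = ToyServer()
alice = server.connect("192.0.2.10", ["song.mp3", "thesis.pdf"])
bob = server.connect("192.0.2.20", ["song.mp3"])
print(server.search("song.mp3"))   # who shares this file
print(server.files_of(alice))      # what this client shares
```

In this model, both lookups are simple dictionary reads, which illustrates how little effort a server operator would need to profile a connected client.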

The Gnutella protocol

Gnutella protocol v0.4

In Gnutella protocol v0.4 all nodes are identical, and every node may choose to connect to any other. [8] The Gnutella protocol consists of 5 message types, including the query message used for file search. Query messages use a flooding mechanism, i.e. each node that receives a query forwards it on all of its adjacent links in the graph. [9] A node that receives a query and has the appropriate file replies with a query hit message. A hop count field in the header limits the message lifetime.[ citation needed ] Ping and Pong messages are used for detecting new nodes to link to. The actual file download is performed by opening a TCP connection and using the HTTP GET mechanism. [10]
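The flooding mechanism can be illustrated with a small simulation. The sketch below is a simplification under stated assumptions: the overlay is a plain adjacency dictionary, the lifetime limit is modelled as a single hypothetical parameter `max_hops`, and nodes drop queries they have already seen. The function name `flood_query` is illustrative, not part of any Gnutella client.

```python
from collections import deque


def flood_query(graph, origin, max_hops):
    """Return {node: hops} for every node reached by a flooded query.

    Each node forwards the query to all neighbours except the one it came
    from, and ignores a query it has already seen.
    """
    hops_seen = {origin: 0}
    queue = deque([(origin, None, 0)])   # (node, sender, hops so far)
    while queue:
        node, sender, hops = queue.popleft()
        if hops == max_hops:             # lifetime exhausted, stop forwarding
            continue
        for neighbour in graph[node]:
            if neighbour == sender or neighbour in hops_seen:
                continue
            hops_seen[neighbour] = hops + 1
            queue.append((neighbour, node, hops + 1))
    return hops_seen


# Hypothetical five-node overlay.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
print(flood_query(graph, "A", max_hops=2))
```

Every node in the returned mapping has seen the query text, which is the property that makes flooding relevant to privacy.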

Gnutella protocol v0.6

Gnutella protocol v0.6 includes several modifications. A node has one of two operational modes: "leaf node" or "ultrapeer".[ citation needed ] Initially, each node starts in leaf node mode, in which it can only connect to ultrapeers. A leaf node sends a query to an ultrapeer; the ultrapeer forwards the query and waits for the replies. When a node has enough bandwidth and uptime, it may become an ultrapeer.[ citation needed ] Ultrapeers periodically ask their clients to send a list of the files they share. If a query arrives with a search string that matches one of the files on the leaves, the ultrapeer replies, pointing to the specific leaf.[ citation needed ]
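A minimal sketch of an ultrapeer answering on behalf of its leaves follows. It assumes, as the text describes, that the ultrapeer holds a plain list of file names per leaf and matches queries by case-insensitive substring; the class `Ultrapeer` and its methods are hypothetical, and real clients may exchange more compact summaries than full file lists.

```python
class Ultrapeer:
    """Toy ultrapeer that indexes the file lists reported by its leaves."""

    def __init__(self):
        self.leaf_files = {}   # leaf address -> list of shared file names

    def update_leaf(self, leaf, files):
        """Store (or replace) the file list most recently sent by a leaf."""
        self.leaf_files[leaf] = list(files)

    def answer(self, search):
        """Return (leaf, file name) pairs whose name contains the search string."""
        needle = search.lower()
        return [
            (leaf, name)
            for leaf, files in sorted(self.leaf_files.items())
            for name in files
            if needle in name.lower()
        ]


up = Ultrapeer()
up.update_leaf("198.51.100.7", ["Holiday Photos.zip", "notes.txt"])
up.update_leaf("198.51.100.9", ["lecture.pdf"])
print(up.answer("photos"))
```

The reply identifies the specific leaf, so anyone who can issue queries learns which address holds a matching file.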

Tracking initiators and responders

In version 0.4 of the Gnutella protocol, an ultrapeer that receives a message from a leaf node (a message with hop count zero) knows for sure that the message originated from that leaf node.[ citation needed ]

In version 0.6 of the protocol, if an ultrapeer receives a message with hop count zero from another ultrapeer, it knows that the message originated from that ultrapeer or from one of its leaves (the average number of leaf nodes connected to an ultrapeer is 200).[ citation needed ]
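The two cases above differ in how many nodes could have sent a hop-count-zero message. As a rough worked example, assuming the average of 200 leaves quoted above and treating every candidate as equally likely (both the function name and the uniform-likelihood assumption are illustrative):

```python
def candidate_originators(sender_is_leaf, avg_leaves=200):
    """Number of nodes that could have originated a hop-count-zero message."""
    if sender_is_leaf:
        return 1                 # the leaf itself is the only candidate
    return 1 + avg_leaves        # the ultrapeer plus its leaves


for sender_is_leaf in (True, False):
    n = candidate_originators(sender_is_leaf)
    print(f"sender_is_leaf={sender_is_leaf}: {n} candidate(s), "
          f"chance of a correct guess = {1 / n:.3%}")
```

Under these assumptions, a message relayed by an ultrapeer hides its originator among about 201 nodes, whereas a message received directly from a leaf hides nothing.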

Tracking a single node

Many Gnutella clients have an HTTP monitor feature. This feature allows any node to send an empty HTTP request to a node that supports it and to receive information about that node in response.[ citation needed ] Research shows that a simple crawler connected to the Gnutella network can obtain, from an initial entry point, a list of the IP addresses connected to that entry point.[ citation needed ] The crawler can then continue by querying those IP addresses in turn. An academic study performed the following experiment: at NYU, a regular Gnucleus client was connected to the Gnutella network as a leaf node, listening on the distinctive TCP port 44121. At the Hebrew University, Jerusalem, Israel, a crawler ran looking for a client listening on port 44121. In less than 15 minutes the crawler found the IP address of the Gnucleus client at NYU with the unique port.[ citation needed ]
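The crawl described above amounts to a breadth-first search over neighbour lists. The sketch below performs no network I/O: the callback `neighbours_of` stands in for whatever a monitor request would return, and the `network` dictionary, the addresses and the function name `crawl_for_port` are hypothetical test data rather than the researchers' tool.

```python
from collections import deque


def crawl_for_port(entry, neighbours_of, target_port, max_visits=10_000):
    """Breadth-first crawl from `entry`; return the first (ip, port) node
    that listens on `target_port`, or None if none is found."""
    seen = {entry}
    queue = deque([entry])
    visits = 0
    while queue and visits < max_visits:
        node = queue.popleft()
        visits += 1
        if node[1] == target_port:
            return node
        for peer in neighbours_of(node):
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return None


# Hypothetical overlay: each node maps to the peers it reports.
network = {
    ("203.0.113.1", 6346): [("203.0.113.2", 6346), ("203.0.113.3", 6346)],
    ("203.0.113.2", 6346): [("203.0.113.4", 44121)],
    ("203.0.113.3", 6346): [],
    ("203.0.113.4", 44121): [],
}
found = crawl_for_port(("203.0.113.1", 6346),
                       lambda node: network.get(node, []), 44121)
print(found)
```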

IP address harvesting

If a user has been connected to the Gnutella network within, say, the last 24 hours, that user's IP address can easily be harvested by hackers, since the HTTP monitoring feature can collect about 300,000 unique addresses within 10 hours.[ citation needed ]

Tracking nodes by GUID creation

A globally unique identifier (GUID) is a 16-byte field in the Gnutella message header that uniquely identifies every Gnutella message. The protocol does not specify how to generate the GUID.[ citation needed ]

Gnucleus on Windows uses the Ethernet MAC address as the 6 lower bytes of the GUID. Therefore, Windows clients reveal their MAC address when sending queries. [11]
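To see why such a GUID leaks the hardware address, consider the sketch below. It assumes that the "6 lower bytes" are the last six bytes of the 16-byte field and that the remaining ten bytes are random; both the layout and the helper names are assumptions made for illustration, not taken from the Gnucleus source.

```python
import os


def guid_with_mac(mac: bytes) -> bytes:
    """Build a 16-byte GUID: 10 random bytes followed by the 6-byte MAC
    (assumed layout)."""
    if len(mac) != 6:
        raise ValueError("a MAC address is exactly 6 bytes")
    return os.urandom(10) + mac


def mac_from_guid(guid: bytes) -> str:
    """Recover the MAC address an observer would read from the GUID."""
    return ":".join(f"{byte:02x}" for byte in guid[-6:])


mac = bytes.fromhex("020000abcdef")   # made-up, locally administered address
guid = guid_with_mac(mac)
print(mac_from_guid(guid))
```

Because the address is constant across sessions, every message from the same machine carries the same suffix, even if the IP address changes.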

In the JTella 0.7 client software the GUID is created using the Java random number generator without initialization. Therefore, in each session, the client creates a sequence of queries with the same repeating IDs. Over time, a correlation between the user's queries can be found.[ citation needed ]
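The effect of a generator that starts from the same state in every session can be demonstrated as follows. This is not JTella code: Python's `random.Random` stands in for the Java generator, and the constant `FIXED_SEED` models the repeated starting state that the text describes.

```python
import random

FIXED_SEED = 0   # assumption: models a generator that always starts the same way


def session_guids(seed, count):
    """Generate the first `count` 16-byte GUIDs of a session."""
    rng = random.Random(seed)
    return [rng.getrandbits(128).to_bytes(16, "big") for _ in range(count)]


monday = session_guids(FIXED_SEED, 3)
tuesday = session_guids(FIXED_SEED, 3)
print(monday == tuesday)   # identical GUID sequences across sessions
```

An observer who has seen one session's GUID sequence can recognise the same client in later sessions by matching the sequence, regardless of its IP address.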

Collecting miscellaneous information on users

The monitoring facility of Gnutella reveals an abundance of valuable information about its users. It is possible to collect information about the software vendor and the version that the clients use. Other statistical information about the client is available as well: capacity, uptime, local files, etc.[ citation needed ]

In Gnutella v0.6, information about the client software can be collected even if the client does not support HTTP monitoring. The information is found in the first two messages of the connection handshake.[ citation needed ]
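The v0.6 handshake is text-based, with HTTP-style header lines, so extracting the software name takes only a few lines. The sketch below parses a made-up handshake message; the sample bytes, the client name `ExampleClient` and the function name are hypothetical, and real clients may send additional headers.

```python
def handshake_headers(raw: bytes) -> dict:
    """Parse the header lines of a handshake message into a dict
    keyed by lower-cased header name."""
    lines = raw.decode("ascii", errors="replace").split("\r\n")
    headers = {}
    for line in lines[1:]:          # skip the first (connect/status) line
        if not line:                # an empty line ends the header block
            break
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


sample = (
    b"GNUTELLA CONNECT/0.6\r\n"
    b"User-Agent: ExampleClient/1.2.3\r\n"
    b"X-Ultrapeer: False\r\n"
    b"\r\n"
)
info = handshake_headers(sample)
print(info.get("user-agent"))
```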

Tracking users by partial information

Some Gnutella users have a small look-alike set (few other users appear similar to them), which makes it easier to track them from only partial information.[ citation needed ]
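One way to make the idea concrete is to count, for each user, how many users share the same observable attributes. The sketch assumes that a "look-alike set" means the group of users with identical attribute tuples; that reading, the chosen attributes (software version and listening port) and the sample data are all assumptions for illustration.

```python
from collections import Counter


def look_alike_sizes(users):
    """Map each user to the number of users (itself included) that share
    exactly the same observable attributes."""
    counts = Counter(users.values())
    return {name: counts[attrs] for name, attrs in users.items()}


# Hypothetical observations: (software version, listening port).
users = {
    "u1": ("ClientA/1.0", 6346),
    "u2": ("ClientA/1.0", 6346),
    "u3": ("ClientB/0.9", 44121),
}
sizes = look_alike_sizes(users)
print(sizes)
```

A user whose set has size 1 is unique under these attributes and can be re-identified from them alone.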

Tracking users by queries

An academic research team performed the following experiment: the team ran five Gnutella clients as ultrapeers (in order to listen to other nodes' queries). The team was able to observe about 6% of the queries.[ citation needed ]

Usage of hash functions

SHA-1 hashes refer to the SHA-1 of files, not of search strings.

Half of the search queries are strings and half of them are the output of a hash function (SHA-1) applied to the string. Although the use of a hash function is intended to improve privacy, academic research showed that the query content can easily be exposed by a dictionary attack: collaborating ultrapeers can gradually collect common search strings, calculate their hash values and store them in a dictionary. When a hashed query arrives, each collaborating ultrapeer can check it against the dictionary and expose the original string accordingly.[ citation needed ] [12]
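The dictionary attack described above needs nothing beyond a hash table. The following sketch follows that description, assuming queries are hashed as UTF-8 text with plain SHA-1; the encoding, the sample search strings and the helper names are assumptions for illustration.

```python
import hashlib


def sha1_hex(text: str) -> str:
    """SHA-1 of a search string, as a hex digest."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def build_dictionary(observed_strings):
    """Map hash -> original string for every plaintext query seen so far."""
    return {sha1_hex(s): s for s in observed_strings}


def reveal(hashed_query, dictionary):
    """Return the original search string, or None if it is not in the dictionary."""
    return dictionary.get(hashed_query)


dictionary = build_dictionary(["holiday photos", "free music", "linux iso"])
intercepted = sha1_hex("free music")     # what a hashed query would carry
print(reveal(intercepted, dictionary))   # prints "free music"
```

The attack succeeds whenever the hashed string has been seen in plaintext before, which is likely for common searches because the hash is unsalted and deterministic.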

Measures

A common countermeasure is to conceal a user's IP address when downloading or uploading content by using an anonymous network such as I2P. Such networks also encrypt data and use indirect connections (mix networks) to exchange data between peers, [13] so that all traffic is anonymized and encrypted. Anonymity and safety come at the price of much lower speeds, and because these networks are closed internal networks, they currently offer less content. This may change as more users join.[ citation needed ]

References

  1. Bickson, Danny; Malkhi, Dahlia (2004). "A Study of Privacy in File Sharing Networks". Archived from the original on 12 October 2013. Retrieved 12 February 2013.
  2. Liu, Bingshuang; Liu, Zhaoyang; Zhang, Jianyu; Wei, Tao; Zou, Wei (2012-10-15). "How many eyes are spying on your shared folders?". Proceedings of the 2012 ACM workshop on Privacy in the electronic society. WPES '12. Raleigh, North Carolina, USA: Association for Computing Machinery. pp. 109–116. doi:10.1145/2381966.2381982. ISBN 978-1-4503-1663-7. S2CID 13840840.
  3. Eng Keong Lua; Jon Crowcroft. "A Survey and Comparison of Peer-to-Peer Overlay Network Schemes". IEEE Communications Surveys & Tutorials. 7 (2): 72–93.
  4. Silva, Pedro Moreira da (19 June 2017). "Mistrustful P2P: Deterministic privacy-preserving P2P file sharing model to hide user content interests in untrusted peer-to-peer networks". Computer Networks. 120: 87–104. doi:10.1016/j.comnet.2017.04.005.
  5. "Top Project Listings". sourceforge.net. Retrieved 2021-09-18.
  6. "Safe Server List for eMule. Generated: September 17 2021 18:28:20 UTC+3". www.emule-security.org. Retrieved 2021-09-18.
  7. Yoram Kulbak; Danny Bickson. "The eMule protocol specification". EMule Project.
  8. "privacy in file sharing". inba.info. Retrieved 2020-10-23.
  9. Yingwu Zhu; Yiming Hu (2006-12-01). "Enhancing Search Performance on Gnutella-Like P2P Systems". IEEE Transactions on Parallel and Distributed Systems. 17 (12): 1482–1495. doi:10.1109/tpds.2006.173. ISSN 1045-9219. S2CID 496918.
  10. "Gnutella Protocol Development". rfc-gnutella.sourceforge.net. Retrieved 2020-11-12.
  11. Courtney, Kylan; Murdock, Keon (2012). Information and internet privacy handbook (1st ed.). Delhi, India: College Publishing House. ISBN 978-81-323-1280-2. OCLC 789644329.
  12. Zink, Thomas (October 2020). "Analysis and Efficient Classification of P2P File Sharing Traffic". Universität Konstanz.
  13. "Don't get confused! Check Out the Simple Steps to Configure the Right IP Address". wincah.com. 2023.
