Data exhaust

Last updated

Data exhaust or exhaust data is the trail of data left by the activities of an Internet or other computer system users during their online activity, behavior, and transactions. This is part of a broader category of unconventional data [1] that includes geospatial, network, and time-series data and may be useful for predictive analytics. Every visited website, clicked link, and even hovering with a mouse is collected, leaving behind a trail of data. [2] An enormous amount of often raw data are created, which can be in the form of cookies, temporary files, logfiles, storable choices, and more. [3] This information can help to improve the online experience, for example through customized content. It can be used to improve tracking trends and studying data exhaust also improves the user interface and the layout design. On the other hand, they can also compromise privacy, as they offer a valuable insight into the user's habits. For example, as the world's most popular website, Google, uses this data exhaust to refine the predictive value of their products. [4]

Contents

The data that is collected by companies is often information that does not seem immediately useful. Although the information is not used by the company right away, it can be stored for future use or sold to someone else who can use the information. The data can help with quality control, performance, and revenue. [5] Unlike primary content, these data are not purposefully created by the user, who is often unaware of their very existence. A bank for example would consider as primary data information concerning the sums and parties of a transaction, whilst secondary data might include the percentage of transactions carried out at a cash machine instead of a real bank. [6]

Medical exhaust data

Most medical devices emit some form of exhaust data, such as many pacemakers, dialysis machines, and cameras used during surgery. [7] The majority of this data is never captured, and is primarily abandoned after the surgery is completed, or the device makes its next routine check. Some issues have arisen regarding the use of the data captured by devices like pacemakers. This can lead to larger issues surrounding the use of this exhaust data. [8] Using electronic health records (EMR) for research poses a large number of challenges, the most prevalent being the amount of data there is. This surplus of data is too much for people to sort through and analyze, thus creating a need for algorithms. [9]

Solutions

Although data exhaust is not a new concept, the ubiquity of internet-enabled gadgetry has exacerbated the scope and impacts of our passive digital trail. The collection and distribution of data thus generated is not illegal, but there are steps that must be taken to ensure that the use of this data is ethical. In order to ensure privacy of users, when the information is sold it can be anonymized. Also, users can be given the opportunity to opt-out of the selling of their information if they choose. Lastly, to build trust, websites can update their privacy policies so that they include all the data in which they will be collecting about the user. [10]

See also

Related Research Articles

<span class="mw-page-title-main">Privacy</span> Seclusion from unwanted attention

Privacy is the ability of an individual or group to seclude themselves or information about themselves, and thereby express themselves selectively.

In connection-oriented communication, a data stream is the transmission of a sequence of digitally encoded signals to convey information. Typically, the transmitted symbols are grouped into a series of packets.

<span class="mw-page-title-main">Machine learning</span> Study of algorithms that improve automatically through experience

Machine learning (ML) is an umbrella term for solving problems for which development of algorithms by human programmers would be cost-prohibitive, and instead the problems are solved by helping machines "discover" their "own" algorithms, without needing to be explicitly told what to do by any human-developed algorithms. Recently, generative artificial neural networks have been able to surpass results of many previous approaches. Machine-learning approaches have been applied to large language models, computer vision, speech recognition, email filtering, agriculture and medicine, where it is too costly to develop algorithms to perform the needed tasks.

Information privacy is the relationship between the collection and dissemination of data, technology, the public expectation of privacy, contextual information norms, and the legal and political issues surrounding them. It is also known as data privacy or data protection.

Internet privacy involves the right or mandate of personal privacy concerning the storage, re-purposing, provision to third parties, and display of information pertaining to oneself via the Internet. Internet privacy is a subset of data privacy. Privacy concerns have been articulated from the beginnings of large-scale computer sharing and especially relate to mass surveillance enabled by the emergence of computer technologies.

Health technology is defined by the World Health Organization as the "application of organized knowledge and skills in the form of devices, medicines, vaccines, procedures, and systems developed to solve a health problem and improve quality of lives". This includes pharmaceuticals, devices, procedures, and organizational systems used in the healthcare industry, as well as computer-supported information systems. In the United States, these technologies involve standardized physical objects, as well as traditional and designed social means and methods to treat or care for patients.

A click path or clickstream is the sequence of hyperlinks one or more website visitors follows on a given site, presented in the order viewed. A visitor's click path may start within the website or at a separate third party website, often a search engine results page, and it continues as a sequence of successive webpages visited by the user. Click paths take call data and can match it to ad sources, keywords, and/or referring domains, in order to capture data.

<span class="mw-page-title-main">Digital footprint</span> Ones unique set of traceable digital activities

Digital footprint or digital shadow refers to one's unique set of traceable digital activities, actions, contributions, and communications manifested on the Internet or digital devices. Digital footprints can be classified as either passive or active. The former is composed of a user's web-browsing activity and information stored as cookies. The latter is often released deliberately by a user to share information on websites or social media. While the term usually applies to a person, a digital footprint can also refer to a business, organization or corporation.

<span class="mw-page-title-main">Virtual assistant</span> Software agent

A virtual assistant (VA) is a software agent that can perform a range of tasks or services for a user based on user input such as commands or questions, including verbal ones. Such technologies often incorporate chatbot capabilities to simulate human conversation, such as via online chat, to facilitate interaction with their users. The interaction may be via text, graphical interface, or voice - as some virtual assistants are able to interpret human speech and respond via synthesized voices.

Web tracking is the practice by which operators of websites and third parties collect, store and share information about visitors’ activities on the World Wide Web. Analysis of a user's behaviour may be used to provide content that enables the operator to infer their preferences and may be of interest to various parties, such as advertisers. Web tracking can be part of visitor management.

The social data revolution is the shift in human communication patterns towards increased personal information sharing and its related implications, made possible by the rise of social networks in the early 2000s. This phenomenon has resulted in the accumulation of unprecedented amounts of public data.

<span class="mw-page-title-main">Dataveillance</span> Monitoring and collecting online data and metadata

Dataveillance is the practice of monitoring and collecting online data as well as metadata. The word is a portmanteau of data and surveillance. Dataveillance is concerned with the continuous monitoring of users' communications and actions across various platforms. For instance, dataveillance refers to the monitoring of data resulting from credit card transactions, GPS coordinates, emails, social networks, etc. Using digital media often leaves traces of data and creates a digital footprint of our activity. Unlike sousveillance, this type of surveillance is not often known and happens discreetly. Dataveillance may involve the surveillance of groups of individuals. There exist three types of dataveillance: personal dataveillance, mass dataveillance, and facilitative mechanisms.

Crowdsensing, sometimes referred to as mobile crowdsensing, is a technique where a large group of individuals having mobile devices capable of sensing and computing collectively share data and extract information to measure, map, analyze, estimate or infer (predict) any processes of common interest. In short, this means crowdsourcing of sensor data from mobile devices.

<span class="mw-page-title-main">Artificial intelligence in healthcare</span> Overview of the use of artificial intelligence in healthcare

Artificial intelligence in healthcare is an overarching term used to describe the use of machine-learning algorithms and software, or artificial intelligence (AI), to mimic human cognition in the analysis, presentation, and comprehension of complex medical and health care data, or to exceed human capabilities by providing new ways to diagnose, treat, or prevent disease. Specifically, AI is the ability of computer algorithms to approximate conclusions based solely on input data.

The industrial internet of things (IIoT) refers to interconnected sensors, instruments, and other devices networked together with computers' industrial applications, including manufacturing and energy management. This connectivity allows for data collection, exchange, and analysis, potentially facilitating improvements in productivity and efficiency as well as other economic benefits. The IIoT is an evolution of a distributed control system (DCS) that allows for a higher degree of automation by using cloud computing to refine and optimize the process controls.

A data management platform (DMP) is a software platform used for collecting and managing data. They allow businesses to identify audience segments, which can be used to target specific users and contexts in online advertising campaigns. DMPs may use big data and artificial intelligence algorithms to process and analyze large data sets about users from various sources. Some advantages of using DMPs include data organization, increased insight on audiences and markets, and effective advertisement budgeting. On the other hand, DMPs often have to deal with privacy concerns due to the integration of third-party software with private data. This technology is continuously being developed by global entities such as Nielsen and Oracle.

Privacy settings are "the part of a social networking website, internet browser, piece of software, etc. that allows you to control who sees information about you". With the growing prevalence of social networking services, opportunities for privacy exposures also grow. Privacy settings allow a person to control what information is shared on these platforms.

Click tracking is when user click behavior or user navigational behavior is collected in order to derive insights and fingerprint users. Click behavior is commonly tracked using server logs which encompass click paths and clicked URLs. This log is often presented in a standard format including information like the hostname, date, and username. However, as technology develops, new software allows for in depth analysis of user click behavior using hypervideo tools. Given that the internet can be considered a risky environment, research strives to understand why users click certain links and not others. Research has also been conducted to explore the user experience of privacy with making user personal identification information individually anonymized and improving how data collection consent forms are written and structured.

Soft privacy technologies fall under the category of PETs, Privacy-enhancing technologies, as methods of protecting data. Soft privacy is a counterpart to another subcategory of PETs, called hard privacy. Soft privacy technology has the goal of keeping information safe, allowing services to process data while having full control of how data is being used. To accomplish this, soft privacy emphasizes the use of third-party programs to protect privacy, emphasizing auditing, certification, consent, access control, encryption, and differential privacy. Since evolving technologies like the internet, machine learning, and big data are being applied to many long-standing fields, we now need to process billions of datapoints every day in areas such as health care, autonomous cars, smart cards, social media, and more. Many of these fields rely on soft privacy technologies when they handle data.

Automated decision-making (ADM) involves the use of data, machines and algorithms to make decisions in a range of contexts, including public administration, business, health, education, law, employment, transport, media and entertainment, with varying degrees of human oversight or intervention. ADM involves large-scale data from a range of sources, such as databases, text, social media, sensors, images or speech, that is processed using various technologies including computer software, algorithms, machine learning, natural language processing, artificial intelligence, augmented intelligence and robotics. The increasing use of automated decision-making systems (ADMS) across a range of contexts presents many benefits and challenges to human society requiring consideration of the technical, legal, ethical, societal, educational, economic and health consequences.

References

  1. "What is Unconventional Data? - Definition from EU Glossary" . Retrieved 2019-04-28.
  2. Kosciejew, M. (2013). The individual and big data. Feliciter, 59(6), 47
  3. "What is Data Exhaust? - Definition from Techopedia". Techopedia.com. Retrieved 2018-11-01.
  4. Zuboff, Shoshana (2015). "Big other: Surveillance Capitalism and the Prospects of an Information Civilization". Journal of Information Technology. 30: 75–89. doi:10.1057/jit.2015.5. S2CID   15329793.
  5. "What is Data Exhaust and What Can You Do With It?". www.datasciencecentral.com. Retrieved 2018-11-01.
  6. "5 things you need to know about data exhaust".
  7. Rob, Kitchin (2014-08-26). The data revolution : big data, open data, data infrastructures & their consequences. Los Angeles, California. ISBN   978-1446287484. OCLC   871211376.{{cite book}}: CS1 maint: location missing publisher (link)
  8. "Our Medical Data Must Become Free". WIRED. Retrieved 2017-10-12.
  9. "Healthcare data everywhere: the waste problem - AI Med". AI Med. 2018-05-09. Retrieved 2018-11-01.
  10. "Dealing with data exhaust. - Free Online Library". www.thefreelibrary.com. Retrieved 2018-11-01.