Session (web analytics)

In web analytics, a session, or visit, is a unit of measurement of a user's actions taken within a period of time or with regard to the completion of a task. Sessions are also used in operational analytics and in the provision of user-specific recommendations. There are two primary methods used to define a session: time-oriented approaches, based on continuity in user activity, and navigation-oriented approaches, based on continuity in a chain of requested pages.

Definition

The definition of "session" varies, particularly when applied to search engines. [1] Generally, a session is understood to consist of "a sequence of requests made by a single end-user during a visit to a particular site". [2] In the context of search engines, "sessions" and "query sessions" have at least two definitions. [1] A session or query session may be all queries made by a user in a particular time period, [3] or it may be a series of queries or navigations with a consistent underlying user need. [4] [5]

Uses

Sessions per user can be used as a measurement of website usage. [6] [7] Other metrics used in research and applied web analytics include session length [8] and user actions per session. [9] Session length is seen as a more accurate alternative to measuring page views. [10]

Reconstructed sessions have also been used to measure total user input, for example to estimate the number of labour hours spent constructing Wikipedia. [11] Sessions are also used for operational analytics, data anonymization, identifying networking anomalies, and synthetic workload generation for testing servers with artificial traffic. [12] [13]

Session reconstruction

An illustration of the different criteria used by time-oriented and navigation-oriented session reconstruction approaches.

Essential to the use of sessions in web analytics is being able to identify them. This is known as "session reconstruction". Approaches to session reconstruction can be divided into two main categories: time-oriented, and navigation-oriented. [14]

Time-oriented approaches

Time-oriented approaches to session reconstruction look for a set period of user inactivity, commonly called an "inactivity threshold". Once this period of inactivity is reached, the user is assumed to have left the site or stopped using the browser entirely, and the session is ended. Further requests from the same user are considered a second session. A common value for the inactivity threshold is 30 minutes, which is sometimes described as the industry standard. [15] [16] Some have argued that a 30-minute threshold produces artifacts around naturally long sessions, and have experimented with other thresholds. [17] [18] Others simply state: "no time threshold is effective at identifying [sessions]". [19]
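
As a minimal sketch of the time-oriented heuristic (in Python; the function name, the threshold constant, and the record format of (timestamp, url) tuples for a single user are all assumptions for illustration), a new session begins whenever the gap since the user's previous request exceeds the threshold:

    from datetime import timedelta

    # 30 minutes is the commonly cited "industry standard" threshold.
    INACTIVITY_THRESHOLD = timedelta(minutes=30)

    def sessionize(requests):
        """Split one user's requests into sessions by inactivity gaps.

        `requests` is assumed to be a list of (timestamp, url) tuples
        for a single user, where timestamps are datetime objects.
        """
        sessions = []
        current = []
        last_time = None
        for timestamp, url in sorted(requests):
            # A gap longer than the threshold ends the current session.
            if last_time is not None and timestamp - last_time > INACTIVITY_THRESHOLD:
                sessions.append(current)
                current = []
            current.append((timestamp, url))
            last_time = timestamp
        if current:
            sessions.append(current)
        return sessions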

One alternative that has been proposed is using user-specific thresholds rather than a single, global threshold for the entire dataset. [20] [21] This has the problem of assuming that the thresholds follow a bimodal distribution, and is not suitable for datasets that cover a long period of time. [17]
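
The user-specific variant could be sketched as below. The crude two-means split of the log-gaps is a deliberate simplification standing in for the distribution analysis used in the cited proposals, and the function name and record format are assumptions, not taken from those papers:

    import math

    def user_threshold(requests, default_minutes=30.0):
        """Derive a per-user inactivity threshold (in minutes) from that
        user's own inter-request gaps, assuming short within-session gaps
        and long between-session gaps form two separate modes.
        """
        times = sorted(t for t, _ in requests)
        gaps = [(b - a).total_seconds() / 60.0
                for a, b in zip(times, times[1:]) if b > a]
        if len(gaps) < 2:
            return default_minutes
        # Crude two-means split of the log-gaps to locate the valley
        # between the two assumed modes.
        logs = [math.log(g) for g in gaps]
        lo, hi = min(logs), max(logs)
        for _ in range(20):
            near_lo = [x for x in logs if abs(x - lo) <= abs(x - hi)]
            near_hi = [x for x in logs if abs(x - lo) > abs(x - hi)]
            if not near_lo or not near_hi:
                return default_minutes  # no bimodal structure found
            lo = sum(near_lo) / len(near_lo)
            hi = sum(near_hi) / len(near_hi)
        # The midpoint between the two cluster centres, mapped back to
        # the original scale, serves as this user's threshold.
        return math.exp((lo + hi) / 2.0)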

Navigation-oriented approaches

Navigation-oriented approaches exploit the structure of websites, specifically the presence of hyperlinks and the tendency of users to navigate between pages on the same website by clicking on them rather than typing the full URL into their browser. [14] One way of identifying sessions from this data is to build a map of the website: if the user's first page can be identified, the "session" of actions lasts until they land on a page which cannot be reached from any of the previously accessed pages. This takes backtracking into account, where a user retraces their steps before opening a new page. [22] A simpler approach, which does not account for backtracking, is to require that the HTTP referer of each request be a page that is already in the session; if it is not, a new session is created. [23] This class of heuristics "exhibits very poor performance" on websites that contain framesets. [24]
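
The simpler referer heuristic could be sketched as follows (in Python; the function name and the record format of (url, referer) pairs in time order are assumptions for illustration):

    def referer_sessionize(requests):
        """Split one user's chronological requests into sessions using
        the referer heuristic (no site map, so no backtracking support).

        `requests` is assumed to be a list of (url, referer) pairs in
        time order; `referer` is None for direct navigation.
        """
        sessions = []
        current = []
        pages_in_session = set()
        for url, referer in requests:
            # A request whose referer is not a page already in the
            # session starts a new session.
            if current and referer not in pages_in_session:
                sessions.append(current)
                current = []
                pages_in_session = set()
            current.append((url, referer))
            pages_in_session.add(url)
        if current:
            sessions.append(current)
        return sessions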

References

  1. Gayo-Avello 2009, p. 1824.
  2. Arlitt 2000, p. 2.
  3. Donato, Bonchi & Chi 2010, p. 324.
  4. Gayo-Avello 2009, p. 1825.
  5. Lam, Russell & Tang 2007, p. 147.
  6. Weischedel & Huizingh 2006, p. 464.
  7. Catledge & Pitkow 1995, p. 5.
  8. Jansen & Spink 2006, p. 10.
  9. Jansen, Spink & Saracevic 2000, p. 12.
  10. Khoo et al. 2008, p. 377.
  11. Geiger & Halfaker 2014, p. 1.
  12. Meiss et al. 2009, p. 177.
  13. Arlitt 2000, p. 8.
  14. Spiliopoulou et al. 2003, p. 176.
  15. Ortega & Aguillo 2010, p. 332.
  16. Eickhoff et al. 2014, p. 3.
  17. Mehrzadi & Feitelson 2012, p. 3.
  18. He, Goker & Harper 2002, p. 733.
  19. Jones & Klinkner 2008, p. 2.
  20. Murray, Lin & Chowdhury 2006, p. 3.
  21. Mehrzadi & Feitelson 2012, p. 1.
  22. Cooley, Mobasher & Srivastava 1999, p. 19.
  23. Cooley, Mobasher & Srivastava 1999, p. 23.
  24. Berendt et al. 2003, p. 179.

Bibliography