Google Flu Trends (GFT) was a web service operated by Google that provided estimates of influenza activity for more than 25 countries. By aggregating Google Search queries, it attempted to make accurate predictions about flu activity. The project was first launched in 2008 by Google.org to help predict outbreaks of flu. [1]
Google Flu Trends stopped publishing current estimates on 9 August 2015. Historical estimates are still available for download, and current data are offered for declared research purposes. [2]
The idea behind Google Flu Trends was that, by monitoring millions of users' health-related search behavior online, the large volume of Google search queries could be analyzed to reveal the presence of flu-like illness in a population. Google Flu Trends compared these findings to a historic baseline level of influenza activity for the corresponding region and then reported the activity level as minimal, low, moderate, high, or intense. These estimates were generally consistent with conventional surveillance data collected by health agencies, both nationally and regionally.
Roni Zeiger helped develop Google Flu Trends. [3]
Google Flu Trends was described as using the following method to gather information about flu trends. [4] [5]
First, a time series is computed for about 50 million common queries entered weekly within the United States from 2003 to 2008. By identifying the IP address associated with each search, the state in which the query was entered can be determined. A query's time series is then computed separately for each state and normalized into a fraction by dividing the weekly count of that query by the total number of all queries in that state.
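As a rough illustration, this normalization step can be sketched in Python, assuming raw weekly query counts are available in a table (the column names and toy counts here are hypothetical, not from the original GFT pipeline):

```python
import pandas as pd

# Hypothetical input: one row per (week, state, query) with a raw count.
counts = pd.DataFrame({
    "week":  ["2004-01-05", "2004-01-05", "2004-01-05", "2004-01-12"],
    "state": ["NY", "NY", "CA", "NY"],
    "query": ["flu symptoms", "cough", "flu symptoms", "flu symptoms"],
    "count": [120, 340, 95, 150],
})

# Total queries per (state, week) -- the denominator for normalization.
# In the real system this would be the total of ALL queries in the state,
# not just the few shown in this toy table.
totals = counts.groupby(["state", "week"])["count"].transform("sum")

# Each query's fraction of all queries in that state for that week.
counts["fraction"] = counts["count"] / totals
print(counts)
```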
A linear model is used to relate the log-odds of an influenza-like illness (ILI) physician visit to the log-odds of an ILI-related search query:

logit(P) = β0 + β1 · logit(Q) + ε

where logit(P) = ln(P / (1 − P)). Here P is the percentage of ILI-related physician visits and Q is the ILI-related query fraction computed in the previous steps; β0 is the intercept, β1 is the coefficient, and ε is the error term.[citation needed]
Each of the 50 million queries is tested as Q to see whether the result computed from a single query matches the actual historical ILI data obtained from the U.S. Centers for Disease Control and Prevention (CDC). This process produces a list of the queries that give the most accurate predictions of CDC ILI data under the linear model. The top 45 queries are then chosen because, when aggregated together, they fit the historical data most accurately. Using the sum of the top 45 ILI-related query fractions, the linear model is fitted to the weekly ILI data between 2003 and 2007 to obtain the coefficients. Finally, the trained model is used to predict flu outbreaks across all regions of the United States.
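A minimal sketch of this fitting step, using ordinary least squares on the logit scale (the toy numbers are illustrative; the real model was fitted to CDC ILI percentages and the aggregated top-45 query fraction):

```python
import numpy as np

def logit(x):
    """Log-odds transform used on both sides of the GFT model."""
    return np.log(x / (1.0 - x))

# Toy weekly data (fractions in [0, 1]).
P = np.array([0.012, 0.020, 0.035, 0.028, 0.015])      # ILI physician visits
Q = np.array([0.0009, 0.0015, 0.0027, 0.0022, 0.0011])  # query fraction

# Ordinary least squares on the logit scale: logit(P) = b0 + b1 * logit(Q).
b1, b0 = np.polyfit(logit(Q), logit(P), deg=1)

def predict(q):
    """Invert the logit to recover a predicted ILI percentage."""
    z = b0 + b1 * logit(q)
    return 1.0 / (1.0 + np.exp(-z))

print(predict(0.002))
```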
The algorithm was subsequently revised by Google, partly in response to concerns about accuracy, and attempts to replicate its results have suggested that the algorithm developers "felt an unarticulated need to cloak the actual search terms identified". [6]
Google Flu Trends tried to avoid privacy violations by aggregating only millions of anonymous search queries, without identifying the individuals who performed the searches. [1] [7] Its search logs contain the IP address of the user, which could be used to trace back the region where the search query was originally submitted. Google ran automated programs to access and process the data, so no human was involved in the process. Google also implemented a policy of anonymizing the IP addresses in its search logs after nine months. [8]
However, Google Flu Trends raised concerns among some privacy groups. In 2008, the Electronic Privacy Information Center and Patient Privacy Rights sent a letter to Eric Schmidt, then the CEO of Google. [9] They conceded that the use of user-generated data could support public health efforts in significant ways, but expressed their worry that "user-specific investigations could be compelled, even over Google's objection, by court order or Presidential authority".
An initial motivation for GFT was that identifying disease activity early and responding quickly could reduce the impact of seasonal and pandemic influenza. One report stated that Google Flu Trends was able to predict regional outbreaks of flu up to 10 days before they were reported by the CDC. [10]
During the 2009 flu pandemic, Google Flu Trends tracked flu activity in the United States. [11] In February 2010, the CDC identified a spike in influenza cases in the mid-Atlantic region of the United States. Google's data on search queries about flu symptoms, however, showed that same spike two weeks before the CDC report was released.[citation needed]
“The earlier the warning, the earlier prevention and control measures can be put in place, and this could prevent cases of influenza,” said Dr. Lyn Finelli, lead for surveillance at the influenza division of the CDC. “From 5 to 20 percent of the nation's population contract the flu each year, leading to roughly 36,000 deaths on average.” [10]
Google Flu Trends is an example of collective intelligence that can be used to identify trends and calculate predictions. The data amassed by search engines is significantly insightful because the search queries represent people's unfiltered wants and needs. “This seems like a really clever way of using data that is created unintentionally by the users of Google to see patterns in the world that would otherwise be invisible,” said Thomas W. Malone, a professor at the Sloan School of Management at MIT. “I think we are just scratching the surface of what's possible with collective intelligence.” [10]
The initial Google paper stated that the Google Flu Trends predictions were 97% accurate compared with CDC data. [4] However, subsequent reports asserted that Google Flu Trends' predictions were at times very inaccurate, especially in two high-profile cases. Google Flu Trends failed to predict the 2009 spring pandemic [12] and over the interval 2011–2013 it consistently overestimated relative flu incidence, [6] predicting twice as many doctors' visits as the CDC recorded over one interval in the 2012–2013 flu season. [6] [13] A 2022 study published (with commentaries) in the International Journal of Forecasting [14] found that Google Flu Trends was outperformed by the recency heuristic, an instance of so-called "naive" forecasting, in which the predicted flu incidence equals the most recently observed flu incidence. Across all weeks from March 18, 2007, to August 9, 2015 (the horizon for which Google Flu Trends predictions are available), the mean absolute error of Google Flu Trends was 0.38 percentage points and that of the recency heuristic 0.20 percentage points. Linear regression with a single predictor, the most recently observed flu incidence, also had a mean absolute error of 0.20, and the benchmark of random prediction had 1.80.
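For illustration, the recency heuristic and the error comparison can be sketched as follows (the series below are made up; the study used CDC ILI data and published GFT estimates):

```python
import numpy as np

# Toy weekly ILI series in percentage points, plus a hypothetical GFT output.
ili = np.array([1.1, 1.4, 2.0, 2.6, 2.2, 1.7, 1.3])
gft = np.array([1.3, 1.8, 2.9, 3.4, 2.8, 2.1, 1.5])

# Recency heuristic: this week's prediction is last week's observation.
naive = ili[:-1]
actual = ili[1:]

mae_naive = np.mean(np.abs(naive - actual))
mae_gft = np.mean(np.abs(gft[1:] - actual))
print(f"recency MAE={mae_naive:.2f}, GFT MAE={mae_gft:.2f}")
```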
One source of problems is that people making flu-related Google searches may know very little about how to diagnose flu; searches for flu or flu symptoms may well be researching diseases with symptoms similar to flu, but which are not actually flu. [15] Furthermore, analysis of search terms reportedly tracked by Google, such as "fever" and "cough", as well as the effects of changes in its search algorithm over time, raised concerns about the meaning of its predictions. [6] In fall 2013, Google began attempting to compensate for increases in searches due to the prominence of flu in the news, which had previously been found to skew results. [16] However, one analysis concluded that "by combining GFT and lagged CDC data, as well as dynamically recalibrating GFT, we can substantially improve on the performance of GFT or the CDC alone". [6] A later study also demonstrated that Google search data can indeed be used to improve estimates, reducing the errors seen in a model using CDC data alone by up to 52.7 percent. [17]
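A hedged sketch of the dynamic-recalibration idea, re-fitting a linear correction of GFT against recently observed CDC data each week (an illustration of the general approach described in [6], not the authors' exact procedure):

```python
import numpy as np

def recalibrated(gft, cdc, window=4):
    """Re-fit GFT against recent CDC observations each week.

    gft, cdc: aligned weekly series; CDC data arrives with a lag, so at
    week t only cdc[:t] is treated as observable.
    """
    preds = []
    for t in range(window, len(gft)):
        # Fit a linear correction on the last `window` weeks of CDC data.
        b1, b0 = np.polyfit(gft[t - window:t], cdc[t - window:t], deg=1)
        preds.append(b0 + b1 * gft[t])
    return np.array(preds)

# Toy aligned series (percentage points).
gft = np.array([1.3, 1.8, 2.9, 3.4, 2.8, 2.1, 1.5, 1.2])
cdc = np.array([1.1, 1.4, 2.0, 2.6, 2.2, 1.7, 1.3, 1.1])
print(recalibrated(gft, cdc))
```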
By re-assessing the original GFT model, researchers found that the model aggregated queries about unrelated health conditions, which could lead to over-prediction of ILI rates; the same work proposed a series of more advanced, better-performing linear and nonlinear approaches to ILI modelling. [18]
However, follow-up work was able to substantially improve the accuracy of GFT through the use of a random forest regression model trained on both the incidence of influenza-like illness and the output of the original GFT model. [19]
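A minimal sketch of such a hybrid model using scikit-learn (the feature choice of last week's ILI plus the current GFT estimate is an assumption for illustration, not necessarily the setup in [19]):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical aligned weekly series (percentage points).
ili = np.array([1.1, 1.4, 2.0, 2.6, 2.2, 1.7, 1.3, 1.2, 1.5, 2.1])
gft = np.array([1.3, 1.8, 2.9, 3.4, 2.8, 2.1, 1.5, 1.3, 1.7, 2.6])

# Features: last week's ILI incidence and this week's GFT output;
# target: this week's ILI incidence.
X = np.column_stack([ili[:-1], gft[1:]])
y = ili[1:]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:-2], y[:-2])          # train on all but the last two weeks
print(model.predict(X[-2:]))       # predict the held-out weeks
```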
Similar projects, such as the flu-prediction project [20] by the Institute of Cognitive Science at Universität Osnabrück, carry the basic idea forward by combining social media data (e.g., Twitter) with CDC data and structural models that infer the spatial and temporal spreading [21] of the disease.