Abstract
The enormous volumes of data generated by web users form the basis of several research activities in an innovative new field of research: online forecasting. Online forecasting is concerned with the proper processing of web users' data with the aim of producing accurate predictions of future developments in several areas of human socio-economic activity. In this paper an algorithm is applied in order to predict the results of the Greek referendum held in 2015, using as input the data generated by users of the Google search engine. The proposed algorithm allows us to predict the results of the referendum with great accuracy. We strongly believe that, given the high internet penetration and the high usage of web search engines, the proper analysis of data generated by web search users reveals useful information about people's preferences and/or future actions in several areas of human activity.
Introduction
Almost a decade ago, Google opened to the public data on web users' preferences in relation to their search behavior. Several researchers realized that the proper processing of web users' search behavior may allow them to reveal useful information about users' needs, wants and concerns and, in general, about their feelings and preferences (Ettredge et al. ).
Web users generate data in almost all web activities, such as visiting a website, buying online, sending/receiving emails and participating in social networks. Where the popularity of such activities is high, there is plenty of room for researchers and companies to use these data in order to reach valuable conclusions not only about web users, but about the general population. The most indicative case of user-generated data is web search, since it is characterized by high popularity among web users and by an almost monopolized market structure, with the Google Search engine holding more than 85 % of the market (source: www.statista.com ).
A recent study published by Eurostat indicates that 59 % of Europeans use web search services to find information about goods and services. As the rates of internet penetration and web search use increase, the generated data on web search behavior become statistically significant. Thus, forecasting based on web search data is becoming increasingly accurate.
Within this context, the aim of this paper is to explore whether there is a correlation between users' web search preferences during a time period before the Greek referendum, held in July 2015, and the actual results of the referendum. In particular, an algorithm is applied in this paper to analyze the data generated by users of the Google search engine, aiming to predict the actual results of the referendum.
The paper is structured as follows: the next section reviews the relevant literature, the section that follows describes the proposed algorithm and its application to our case study, and in the final section the main findings of this paper are discussed.
Literature Review
Online forecasting based on users' web search data is becoming one of the most promising fields in the research area of forecasting. Several efforts have been carried out by Google's own researchers, who have attempted predictions using search term popularity in a number of areas ranging from home, automobile and retail sales to travel behavior (Bangwayo-Skeete and Skeete ).
With respect to elections, an initial approach in (Pion and Hamel ) provided predictions for the 2010 UK elections by applying twice the concept behind Galton's predictive "wisdom of the crowds".
The Proposed Algorithm
The proposed algorithm is applied to the data generated by the users of the Google search engine. Each time a user searches the Web with the Google search engine, the relevant data, such as the typed word or phrase, the date, the time, the location and data related to his/her profile, are stored by Google. The data are analyzed by Google and some of them become publicly available through the Google Trends service. In particular, Google Trends returns a normalized, averaged number that corresponds to the volume of daily searches for a specific term compared to the rest of the search terms.
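To make the notion of this normalized popularity concrete, the following minimal sketch shows how such daily search-volume data could be retrieved programmatically. It relies on the unofficial pytrends Python package, which is not used or mentioned in the paper, and the search terms, geography and time window below are illustrative placeholders only.

# Minimal sketch: fetching normalized daily search-volume values from Google Trends.
# Assumes the unofficial "pytrends" package (pip install pytrends); the terms,
# geography and timeframe below are illustrative placeholders, not the paper's inputs.
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=0)

# Words/phrases whose popularity is to be compared (hypothetical examples).
keywords = ["oxi", "nai"]

# Restrict to Greece and to a time window before the event under study.
pytrends.build_payload(keywords, geo="GR", timeframe="2015-06-01 2015-07-04")

# interest_over_time() returns a DataFrame of daily values normalized to 0-100.
wi = pytrends.interest_over_time().drop(columns=["isPartial"])
print(wi.head())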
The proposed algorithm uses the search popularity of selected words/phrases, as provided by Google Trends, in order to analyze the feelings, intentions and thoughts of web users in relation to these words/phrases, aiming to predict their future behavior. Early versions of the proposed algorithm have been applied in several election races (Polykalas et al. ). The algorithm consists of four main phases: initial, words set, noise elimination and runs.

At the initial phase the examined time period before the event under study is determined and the geographic restriction for the web search users is set.

During the next phase the popularity of selected words/phrases relevant to the case study is examined, in order to determine the set of words/phrases that will be used as input data. As stated earlier, Google Trends returns a normalized value of the popularity of each word/phrase typed by web search users. We call this popularity the Web Interest (WI) of each typed word/phrase. The WI of each examined word/phrase should fulfill two criteria in order for the relevant word/phrase to become part of the selected words/phrases. The first criterion concerns the variance of the relevant WI during the determined time period, while the second is related to the absolute values of the WI during that period. If the WI varies significantly during the determined period and its value is comparable with the WI of previously selected words/phrases, then the examined word/phrase is selected as algorithm input. Several words/phrases should be examined during this phase, in order to include in the final set all potential words/phrases that meet the aforementioned two conditions.

Having determined the geographical restrictions, the time period and the final set of words/phrases, the next phase is related to the elimination of potential noise. The noise elimination phase consists of three sub-phases. The first deals with the elimination of noise generated by indecisive or confused web users. An indecisive/confused web user is defined as a web user who searches, at the same time, for words/phrases that indicate contradictory feelings or unpredictable future behavior. Further explanation of this sub-phase is given in the next section, where the proposed algorithm is applied to our case study. The second sub-phase is related to the examination of previous (if any) events similar to the one under study. If such historical events exist, the relevant data that were generated by web search engine users, as well as the relevant actual historical results, are used as feedback for the current data of the case under study. The third sub-phase concerns the exclusion of the influence of non-representative facts during the determined time period. In order to determine the non-representative facts, a day-to-day examination of the WI of the selected words is required. If a selected word presents very high variation during a short period (a high increase followed by a high decrease within 1–2 days), which is not matched by a respective variation of the WI of the other selected words/phrases, then these WI values should not be considered valid input values (in practical terms this means that another event related to the main one, such as a TV interview, a scandal etc., has drawn high media attention and has skewed the respective WIs).
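As an illustration of the word/phrase selection criteria and the spike-based noise check described above, the following Python sketch operates on a DataFrame of daily WI values (one column per word/phrase). The thresholds MIN_STD, MIN_RATIO and SPIKE_FACTOR are assumptions introduced for the example; the paper does not report specific values.

# Hedged sketch of the selection criteria and the spike-based noise check.
# MIN_STD, MIN_RATIO and SPIKE_FACTOR are illustrative assumptions.
import pandas as pd

MIN_STD = 5.0       # minimum WI standard deviation over the period (assumed)
MIN_RATIO = 0.3     # minimum mean WI relative to already-selected terms (assumed)
SPIKE_FACTOR = 3.0  # a day is a "spike" if WI exceeds this multiple of the median (assumed)

def select_terms(wi: pd.DataFrame) -> list[str]:
    """Keep terms whose WI varies enough and is comparable in magnitude to those already selected."""
    selected: list[str] = []
    for term in wi.columns:
        series = wi[term]
        varies_enough = series.std() >= MIN_STD
        comparable = (not selected or
                      series.mean() >= MIN_RATIO * max(wi[t].mean() for t in selected))
        if varies_enough and comparable:
            selected.append(term)
    return selected

def flag_spike_days(wi: pd.DataFrame, term: str) -> pd.Index:
    """Return days where one term spikes while the other terms do not (candidate noise)."""
    others = [t for t in wi.columns if t != term]
    term_spike = wi[term] > SPIKE_FACTOR * wi[term].median()
    others_calm = (wi[others] <= SPIKE_FACTOR * wi[others].median()).all(axis=1)
    return wi.index[term_spike & others_calm]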
The last phase of the proposed algorithm contains the final runs of the algorithm, which in turn generate the final results. A normalization of the final results is required only if the number of different sets of words/phrases used in the proposed algorithm is smaller than the actual number of tendencies under study.
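A minimal sketch of one way such a normalization could be carried out, assuming the final run produces an aggregated WI value per tendency that is simply rescaled into percentage shares; the function name and the figures in the usage line are illustrative placeholders, not results from the paper.

def normalize_shares(aggregated_wi: dict[str, float]) -> dict[str, float]:
    """Rescale aggregated WI values per tendency into percentage shares summing to 100."""
    total = sum(aggregated_wi.values())
    return {tendency: 100.0 * value / total
            for tendency, value in aggregated_wi.items()}

# Hypothetical usage with two tendencies:
print(normalize_shares({"no": 61.0, "yes": 39.0}))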