Discovering the future of social trends using social media tools

Social media is widely used in daily life and can be regarded as a rich source of insight into trending topics around the world. Inferences drawn from social media platforms such as Twitter and Facebook can be applied in a variety of fields. Techniques from computer science, including data mining, natural language processing (NLP), text mining and machine learning, have recently been used for social media analysis. A comprehensive analysis of the social web can reveal public trends in any field; for instance, it may help in understanding political tendencies or cultural and global beliefs. Twitter is one of the most dominant and popular social media tools and provides a huge amount of data. Accordingly, this study proposes a new methodology that employs Twitter data to infer meaningful information and identify prominent trending topics. Experimental results verify the feasibility of the proposed approach.


Introduction
Social media is a prominent venue for expressing feelings and ideas openly. People reveal their character and comment more easily there than in everyday life, because they can voice their ideas through social media mechanisms such as re-tweets. According to recent statistical studies, the number of social network users worldwide is in the billions and is increasing day by day [1]. Scholars, advertisers and political activists see massive online social networks as a representation of social interactions that can be used to study the propagation of ideas, social bond dynamics and viral marketing, among others [2]. Consequently, social media forms a huge repository from which valuable information can be inferred in many fields, from politics to science. Inferences can be drawn using opinion mining, sentiment analysis and clustering [3]. Analyses obtained from social media mining may make it easier to make predictions about natural events, health and politics.
In this study, Twitter is used to examine trending topics around the world or on a specific subject. It is a highly popular short-message service for sharing ideas, and its messages can be analysed with dedicated tools. One way to describe Twitter is as a micro-blogging service that allows people to communicate with short, 140-character messages that roughly correspond to thoughts or ideas. In that regard, Twitter can be thought of as being akin to a free, high-speed, global text-messaging service. In other words, it is a valuable piece of infrastructure that enables rapid and easy communication [4]. Twitter does not require people to be friends; they only need to share similar topics under hashtags. It has a decentralized structure: it is not a friend graph but an interest graph, which offers better opportunities for data mining.
In this decentralized structure, hashtags act as labels, and people gather around these labels independently of one another. Accordingly, this study uses Twitter posts as its web source and draws inferences from them using statistical and machine learning methods.

Previous Studies
There are many studies on Twitter mining because it offers insight into what is happening around the world and, moreover, into what may happen next. Twitter data should therefore be analysed from many aspects. For instance, alongside tweet content, interactions such as mentions of the same people, replies and re-tweets can be considered. In one study, topic derivation is performed through a two-step matrix factorization process, and a number of experiments on several Twitter datasets reveal both the individual and integrated effects of the various features considered [5].
Another study observes that when an earthquake occurs, people make many Twitter posts related to it, which enables prompt detection of the earthquake simply by observing tweets. The real-time interaction of events such as earthquakes on Twitter is investigated, and an algorithm is proposed to monitor tweets and detect a target event. To detect a target event, a tweet classifier is devised based on features such as the keywords in a tweet, the number of words and their context. Subsequently, a probabilistic spatiotemporal model is produced for the target event that can find the centre and the trajectory of the event location [6].
Twitter data can also be used to understand how an illness spreads or where a virus emerges. One comprehensive study presents a more general approach that discovers many different ailments and learns the associated symptoms and treatments from tweets. To create structured information from the data, it develops a new topic model that organizes health terms into ailments, including associated symptoms and treatments [7].
Another study makes predictions about stock markets from public mood by analysing Twitter texts. The study notes that, according to economics, emotions can profoundly affect individual behaviour and decision-making. It asks whether this also applies to societies at large, i.e. whether societies can experience mood states that affect their collective decision-making and, by extension, whether public mood is correlated with or even predictive of economic indicators. The study investigates whether measurements of collective mood states derived from large-scale Twitter feeds are correlated with the value of the Dow Jones Industrial Average (DJIA) over time. The text content of daily Twitter feeds is analysed with two mood-tracking tools: OpinionFinder, which measures positive vs. negative mood, and Google-Profile of Mood States (GPOMS), which measures mood in terms of 6 dimensions (Calm, Alert, Sure, Vital, Kind, and Happy). The results indicate that the accuracy of DJIA predictions can be significantly improved by the inclusion of specific public mood dimensions but not others [8]. Another subject of study on Twitter data is sentiment analysis.
Microblogging has recently become a very popular communication tool among Internet users, and millions of users share opinions on different aspects of life every day. Microblogging websites are therefore rich sources of data for opinion mining and sentiment analysis. Because microblogging appeared relatively recently, only a few research works have been devoted to this topic. One such study focuses on using Twitter, the most popular microblogging platform, for the task of sentiment analysis. It shows how to automatically collect a corpus for sentiment analysis and opinion mining, performs a linguistic analysis of the collected corpus and explains the phenomena discovered. Using the corpus, a sentiment classifier is built that can determine positive, negative and neutral sentiment for a document. Experimental evaluations show that the proposed techniques are efficient and perform better than previously proposed methods. The work uses English text; however, it is claimed that the proposed technique can be used with any other language [9].

Methods
In this study, five different machine learning algorithms are used to classify the data obtained from tweets and decide whether the reviews are positive or negative.
One of the methods is naive Bayes, which has been used in similar studies [10] and can classify text successfully, especially when its parameters are tuned properly [11]. Bayesian classifiers assign the most likely class to a given example described by its feature vector. Learning such classifiers can be greatly simplified by assuming that features are independent given the class, that is, $P(X \mid C) = \prod_{i=1}^{n} P(X_i \mid C)$, where $X = (X_1, \ldots, X_n)$ is a feature vector and $C$ is a class. Despite this unrealistic assumption, the resulting classifier, known as naive Bayes, is remarkably successful in practice, often competing with much more sophisticated techniques. Naive Bayes has proven effective in many practical applications, including text classification, medical diagnosis and systems performance management [12].
The support vector machine (SVM) is a learning machine used as a tool for data classification, function approximation and similar tasks owing to its generalization ability, and it has found success in many applications [13]. A key feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyperplane and the data set. SVM has the additional advantage of automatic model selection, in the sense that both the optimal number and the locations of the basis functions are obtained automatically during training. The performance of SVM largely depends on the kernel [14].
The random forest classifier consists of a combination of tree classifiers in which each classifier is generated using a random vector sampled independently from the input vector, and each tree casts a unit vote for the most popular class to classify an input vector [15].
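The independence assumption behind naive Bayes can be illustrated with a minimal sketch. It is not the implementation used in this study: it assumes a multinomial model over word counts with add-one (Laplace) smoothing, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Collect per-class document counts and word counts from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # word_counts[c][w] = occurrences of w in class c
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_naive_bayes(model, tokens):
    """Return the class maximizing log P(C) + sum_i log P(x_i | C)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in tokens:
            if w not in vocab:
                continue  # skip words never seen in training
            # Add-one smoothing (an assumption) avoids zero probabilities
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data invented for illustration: 1 = positive, 0 = negative
docs = [["great", "phone"], ["love", "this", "tablet"],
        ["terrible", "battery"], ["hate", "this", "screen"]]
labels = [1, 1, 0, 0]
model = train_naive_bayes(docs, labels)
prediction = predict_naive_bayes(model, ["love", "great", "battery"])
```

Working in log space turns the product over features into a sum, which avoids numerical underflow on longer texts.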
The decision tree classifier recursively partitions the instance space using hyperplanes that are orthogonal to the axes. The model is built from a root node that represents an attribute, and the instance space is split according to a function of the attribute values, most frequently the values themselves (split values are chosen differently by different algorithms). Each new sub-space of the data is then split into further sub-spaces iteratively until a stopping criterion is met, and each terminal node (leaf node) is assigned a class label that represents the classification outcome (the class of all or the majority of the instances contained in the sub-space) [16].
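A single axis-orthogonal split, the basic step that a decision tree repeats recursively, might be sketched as follows. The Gini impurity criterion is an assumption made here for illustration (the text notes that split values are chosen differently by different algorithms), and the feature vectors are invented.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature, threshold, impurity) of the split 'row[feature] <= threshold'
    with the lowest weighted Gini impurity over both sides."""
    best = (None, None, float("inf"))
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must send instances to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (f, threshold, score)
    return best

# Hypothetical features: [count of positive words, count of negative words]
rows = [[2, 0], [1, 0], [0, 1], [0, 3]]
labels = [1, 1, 0, 0]
feature, threshold, impurity = best_split(rows, labels)
```

A full tree would apply `best_split` again to each side until a stopping criterion is met and then label each leaf with its majority class.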
kNN is one of the most widely used lazy learning approaches. Given a set of n training examples, upon receiving a new instance to predict, the kNN classifier identifies the k nearest neighbouring training examples of the new instance and then assigns it the class label held by the largest number of neighbours [17]. kNN is a typical supervised algorithm.
Reducing or eliminating statistical redundancy between the components of high-dimensional vector data enables a lower-dimensional representation without significant loss of information. Recognizing the limitations of principal component analysis (PCA), researchers in the statistics and neural network communities have developed nonlinear extensions of PCA [18].
Another method commonly used in web page and text classification is the SVM, which has been used in similar studies [19] and mostly gives the best results, especially on larger data sets [20]. In this study, SVM gives the highest F1 score. The random forest approach, also used in this study, performs well in classification and regression [21,12]. kNN and decision tree algorithms are also commonly used in text classification [23,24,25,26].
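The kNN voting rule described above can be sketched in a few lines. This is an illustrative sketch only: Euclidean distance, k = 3 and the toy vectors are assumptions, not settings reported in this study.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a query vector by majority vote among its k nearest training vectors."""
    # Sort training examples by Euclidean distance to the query
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    # Count the class labels of the k closest examples
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical features: [count of positive words, count of negative words]
train_vectors = [[2, 0], [1, 0], [3, 1], [0, 2], [0, 3], [1, 2]]
train_labels = [1, 1, 1, 0, 0, 0]
positive_query = knn_predict(train_vectors, train_labels, [2, 1])
negative_query = knn_predict(train_vectors, train_labels, [0, 1])
```

Because kNN is a lazy learner, there is no training step: all the work happens at prediction time, when distances to every stored example are computed.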
As illustrated in Fig. 1, which shows the workflow of the study, the tweets are taken from a tech review corpus that contains tweet IDs, topics and labels. Each label takes one of two values: negative (0) or positive (1). After the tweets are collected by tweet ID, pre-processing is applied to all of the text and features are extracted. In the pre-processing stage, regular expressions and the Python "nltk" library are used. A bag of words is built from the relevant words and applied to all tweets. The 1000 tweets are then separated into training and test data: 20% of the tweets are selected randomly for the test set and the rest form the training set. In one case, maximum-feature dimensionality reduction is applied for all methods; in the other case, PCA is additionally applied to the training and test data to avoid a sparse data matrix. Performance results are obtained and compared in both cases. Five main machine learning methods are applied to the data set and the models are trained. The naive Bayes model is applied first, followed by the other methods in turn, and the performance results are used to evaluate which model fits best and gives the better results.
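The pre-processing, bag-of-words and splitting steps can be sketched with the standard library alone. This is a simplified stand-in, not the pipeline of this study: it uses no nltk, it assumes that maximum-feature reduction means keeping only the most frequent terms, and the helper names and toy tweets are hypothetical.

```python
import random
import re
from collections import Counter

def tokenize(text):
    """Lowercase a tweet, drop URLs and @mentions, and keep alphabetic tokens."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    return re.findall(r"[a-z']+", text)

def build_vocabulary(token_lists, max_features):
    """Keep the max_features most frequent tokens as the bag-of-words vocabulary."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return [word for word, _ in counts.most_common(max_features)]

def vectorize(tokens, vocabulary):
    """Map a token list to a vector of term counts over the vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of the items and hold out test_fraction of them for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy tweets invented for illustration: 1 = positive, 0 = negative
tweets = [
    ("Love the new phone! http://example.com", 1),
    ("@bob this tablet is great", 1),
    ("Battery life is terrible #fail", 0),
    ("I hate this screen", 0),
    ("Great camera, love it", 1),
]
train, test = train_test_split(tweets)
train_tokens = [tokenize(text) for text, _ in train]
vocabulary = build_vocabulary(train_tokens, max_features=5)
X_train = [vectorize(tokens, vocabulary) for tokens in train_tokens]
X_test = [vectorize(tokenize(text), vocabulary) for text, _ in test]
```

In this sketch the vocabulary is built from the training tweets only, so that no information from the test set influences the features; that is a design choice of the sketch rather than something stated in the text.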

Results & Findings
For evaluation, accuracy, precision, recall and F1 score are calculated. As shown in Table-1 and Table-2, predictions are made in two different cases. In the first case, maximum-feature dimensionality reduction is applied and the performance results are calculated; in the second case, PCA dimensionality reduction is additionally applied to the data set and the performance values are recalculated from scratch. In the first case, the best results are obtained with the naive Bayes method, as can be seen in Table-1. When PCA is also applied, the best results are obtained with the SVM method, which also has the best F1 score. It can therefore be concluded that sentiment analysis can be performed with a prediction success rate of at least 71%. This shows that the methods can be applied to sentiment analysis of tweets, reviews or similar text content, and that the results can serve as indicators for predicting future trends. The results demonstrate that dimensionality reduction has more impact on naive Bayes and SVM than on the other methods. Similar results are obtained with and without PCA for random forest, decision trees and kNN, because they do not depend heavily on dimensionality. Because random forest already performs a fair amount of regularization without assuming linearity, PCA is not necessarily an advantage for this method, and the same holds for the others. The results are reported for five different methods applied to the same data set in the two cases. PCA is an effective dimensionality reduction technique; the linear maximum-feature approach is also a prominent dimensionality reduction technique [27] that gives efficient results.
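The four evaluation measures can be computed directly from the counts of true positives, false positives and false negatives. The sketch below uses invented labels and predictions, not the results of this study, and the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Invented example: 1 = positive, 0 = negative
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

F1 is the harmonic mean of precision and recall, so it stays low unless both are high, which makes it a useful single figure when comparing classifiers.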

Conclusion & Future Works
In conclusion, short texts such as tweets can be analysed with pre-processing techniques, and the sentiment they express can be captured by machine learning methods with a success rate of more than 71%. The resulting positive or negative sentiment can serve as an indicator for future plans and actions, and further evaluation and prediction can be performed on the basis of these results. This study demonstrates that the naive Bayes and SVM methods are more successful and give better results than the other methods applied to the data set. Decision trees were another successful method suited to text classification on this review data set. An important part of this study is dimensionality reduction, which makes the methods work quickly and accurately. The linear maximum-feature approach and PCA are used here to obtain good results, but other techniques such as linear discriminant analysis (LDA), self-organizing maps (SOM) and feature embeddings could also be applied to see their effect on the model. The approach gives good classification results for sentiment analysis and works efficiently.
Neural networks based on deep structured learning could also be applied and may give better results. Deep learning uses many hidden layers, which help it capture the structure underlying the input data, so it may reach more accurate results. In future work, further dimensionality reduction techniques can be tested on different data sets, and prominent machine learning models can be applied in combination, or deep neural networks can be used to build a more successful model.
Overall, several prominent machine learning algorithms were applied to the reviews for sentiment analysis, and good accuracy was obtained. The study shows that text content such as tweets can be analysed with pre-processing and machine learning techniques to make predictions about future trends.