Problem description and motivation.

The question we want to answer is: “Can we use models to predict whether a tweet about Clash Royale expresses approval, disapproval, or neutrality toward the recent game updates and balance changes?” Knowing how the player base reacts to each game update is crucial for any game developer. However, traditional surveys take time to administer and do not provide the instant feedback that is often most valuable; surveying is also a passive way of collecting data. Other methods, such as analyzing revenue changes or player counts, are even less efficient and require a long waiting period before the data can be collected at all. Analyzing Twitter data, by contrast, provides quick and easy access to the players’ immediate reactions as soon as an update is released. Furthermore, a machine learning classifier can account for the fact that these reactions are a mixture of opinions through labels and categories that represent different attitudes. Although we found no prior analysis directly related to “Clash Royale” data, similar analyses of game reviews appeared several times over the course of our investigation; for example, “Twitter sentiment analysis of game reviews using machine learning techniques”¹ is closely related, and we use it as a point of comparison throughout this analysis.

¹ Kiran, T. D. V. et al. “Twitter sentiment analysis of game reviews using machine learning techniques.” (2016).

Describe the data.

Using the Python library ‘Tweepy’, we extracted 397 relevant tweets with the search queries “@ClashRoyale Update” and “@ClashRoyale Balance”, the two phrases players most commonly include when tweeting about the updates. We restricted the search results to tweets only to simplify the analysis, and used the parameter tweet_mode = "extended" so that the Twitter API returns the full text of each tweet instead of truncating it to 140 characters. We then manually labeled the sentiment of each tweet: ‘0’ for tweets that are neutral about the update or purely informative (e.g., reporting a bug), ‘1’ for tweets in favor of/approving the update, and ‘2’ for tweets disapproving of/hating the update. Finally, we split the tweets 70-30 into training and testing sets. Some strengths of our data: the tweets are relevant and recent (all within 7 days); they are all manually labeled, which allows us to use supervised learning methods; and most tweets express “strong opinions”, especially the negative ones, where even vulgar language is used, which is actually advantageous for NLP. The limitations are, in a sense, related to the strengths. Manual labeling is affected by human inconsistency (which we reduced by having both individuals label the data together); because the data were collected within a short period of time, they are more prone to bias; and because tweets contain many ambiguous wordings involving internet slang, abbreviations, and occasional sarcasm, the performance of our analysis has likely suffered to some degree. In comparison, the game review analysis used 21,000 tweets as its data, yet it still reported very similar concerns about data limitations and the difficulty of tackling these problems.
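To make the collection step concrete, the following is a minimal sketch of how such a pull might look, assuming Tweepy 3.x (where the search endpoint is api.search; in Tweepy 4.x it was renamed api.search_tweets). The credentials are placeholders, and the per-query item count is illustrative rather than the exact value we used.

    import tweepy

    # Placeholder credentials -- substitute real API keys.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    api = tweepy.API(auth, wait_on_rate_limit=True)

    tweets = []
    for query in ('"@ClashRoyale Update"', '"@ClashRoyale Balance"'):
        # tweet_mode="extended" lifts the 140-character truncation;
        # the full text is then found in status.full_text.
        for status in tweepy.Cursor(api.search, q=query,
                                    tweet_mode="extended",
                                    lang="en").items(200):
            tweets.append(status.full_text)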

Exploratory data analysis.

After manually labeling all the tweets, we first created a bar plot of the distribution of people’s attitudes toward the game to better understand our data. Around 40% of the tweets turned out to be irrelevant or neutral, while positive and negative tweets each accounted for roughly 30%. We then performed a series of preprocessing steps to transform the raw tweets into a computer-understandable form while extracting key information. These steps were: tokenization, which splits each tweet into individual words; removal of URLs, special characters, and stop words², which excludes unnecessary (meaningless) words that have no real effect on the result; and lemmatization and lowercasing, which convert all vocabulary to its simplest form to avoid duplication in meaning. Lastly, we converted these lists of words into a word count matrix using count vectorization and a word ratio matrix using TF-IDF, giving us numerical representations of our tweets that can be easily fitted to a model.

² We removed the top 100 English stop words using the Python package nltk.corpus.stopwords.
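The sketch below illustrates this preprocessing pipeline, assuming NLTK for tokenization, stop word removal, and lemmatization, and scikit-learn for the two vectorizations; the function and variable names are ours for illustration, and `tweets` is the list of raw tweet texts from the collection step.

    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    # One-time downloads of the NLTK resources used below.
    nltk.download("stopwords"); nltk.download("wordnet"); nltk.download("punkt")

    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def preprocess(tweet):
        # Remove URLs and special characters, then lowercase.
        tweet = re.sub(r"http\S+|[^A-Za-z\s]", " ", tweet).lower()
        # Tokenize, drop stop words, and lemmatize each remaining word.
        tokens = [lemmatizer.lemmatize(w) for w in nltk.word_tokenize(tweet)
                  if w not in stop_words]
        return " ".join(tokens)

    cleaned = [preprocess(t) for t in tweets]

    # Two numerical representations: raw word counts and TF-IDF ratios.
    X_counts = CountVectorizer().fit_transform(cleaned)
    X_tfidf = TfidfVectorizer().fit_transform(cleaned)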

Machine Learning Model.

The main goal of our investigation is to find the model that categorizes the tweets most accurately. We therefore constructed four different models and compared them to choose the “best”. The first two are Naive Bayes models, one using the count vectors as input and the other using TF-IDF. The Naive Bayes algorithm constructs a posterior distribution that lets us predict the most probable class given the input (the count vector). It is a supervised model and therefore requires a labeled training set; since our data is properly pre-labeled, Naive Bayes is an appropriate choice for our research question. Its strength is that it draws on every word feature; however, when there are too many unnecessary words, this becomes a weakness that dilutes the significance of the key words. To address this weakness, we also constructed a classification decision tree with a depth of five. With a classification tree, we can visualize how each class is categorized by certain key words, and a depth of five still uses some of the word features while avoiding overfitting. The tree’s biggest weakness is that it makes predictions based on only a small subset of word features, but it remains appropriate, especially because it lets us visualize our model, and it works better when word features are limited. Lastly, we fitted a support vector machine (SVM) to compare against our existing models. Like Naive Bayes, SVM minimizes error while splitting classes given the input, and it is another strong classification model that has been applied in much other NLP research. For each model, we compute both its test and train accuracy, as well as its confusion matrix, to assess overall behavior and check for evidence of overfitting. In addition, we introduced another metric, “accuracy for class 1 (positive) & 2 (negative)”, which measures the probability of correctly identifying the positive and negative tweets.

Results and Conclusions.

It is evident that our models, although not in the best form possible, still assisted to some degree in predicting the corresponding attitude of a tweet about the Clash Royale update. Out of all the methods, the Naive Bayes classifier with the CountVectorizer performed best, with the highest overall accuracy and the highest accuracy for classes 1 & 2: a train accuracy of 92%, a test accuracy of 67%, and a test accuracy on classes 1 & 2 of 61%. However, all models demonstrate signs of overfitting, seen in the large difference between train and test accuracy. This is most likely due to the small dataset we fed into our models and would be best addressed with more data. In regards to the initial question, we can claim to have at least found a more efficient way to determine the general reactions and feedback of the players after a Clash Royale update, though there is certainly still room for improvement. A smarter EDA process, such as removing repeated tweets and filtering spam, would be a good starting point for improving accuracy.
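As one illustration of the suggested cleanup, here is a hypothetical sketch of deduplicating near-identical tweets with pandas before labeling; the normalization rule is an assumption for illustration, not a step we applied in this analysis.

    import pandas as pd

    df = pd.DataFrame({"text": tweets})
    # Normalize before comparing so that retweets and copy-pasted spam
    # collapse to one row: strip URLs/mentions, lowercase, squeeze whitespace.
    df["key"] = (df["text"]
                 .str.replace(r"http\S+|@\w+", "", regex=True)
                 .str.lower()
                 .str.split().str.join(" "))
    df = df.drop_duplicates(subset="key").drop(columns="key")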