Data science is now an integral part of game design, covering the whole lifecycle. Game designers and producers need insight into player behaviour, and this information is used from the earliest sketches of the game all the way to the soft launch and the real launch. After launch, data analysis plays an even more important role in marketing and in possible adjustments to game features. And of course, the results of all this analysis feed into the design of new games.
Record everything
To acquire all that insight into the game, we need data scientists to pinpoint all the actions and events in the game that need to be recorded in databases. This has to be carefully planned according to the needs of later analysis. That is why a data scientist has to participate actively in game design and development from the beginning of the project.
Basically, any choice the player makes should be recorded, along with all UI actions. Depending on the type of game, it may also be relevant to record some events or actions in the game environment to give context for player actions, e.g. how fast the player reacts to a certain event.
There are several aspects of data that should be recorded, not just player data, game scores, purchases or game events. One important aspect is data used to evaluate UX, since it can be used to iron out any wrinkles that get in the way of a smooth gaming experience. Bad UX can ruin an otherwise good game. There are other types of data too, but that is a topic for another post.
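To make the idea of recording context alongside player actions concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `GameEvent` fields, the event names and the `reaction_time_ms` helper are hypothetical and not taken from any particular game or analytics library.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Optional


@dataclass
class GameEvent:
    """One recorded event (hypothetical schema, for illustration only)."""
    player_id: str
    session_id: str
    seq: int        # per-session sequence number; gaps hint at lost events
    name: str       # e.g. "enemy_spawned" or "player_fired" (made-up names)
    t_ms: int       # client timestamp, milliseconds since session start
    payload: Dict[str, Any] = field(default_factory=dict)


def reaction_time_ms(events: Iterable[GameEvent],
                     stimulus: str,
                     response: str) -> Optional[int]:
    """Milliseconds from the first `stimulus` event to the first
    `response` event that follows it, or None if either is missing."""
    stimulus_t = None
    for ev in sorted(events, key=lambda e: e.t_ms):
        if stimulus_t is None and ev.name == stimulus:
            stimulus_t = ev.t_ms
        elif stimulus_t is not None and ev.name == response:
            return ev.t_ms - stimulus_t
    return None
```

Recording the environment event ("enemy_spawned") as well as the player action ("player_fired") is what makes the reaction time computable afterwards. If either record is lost, the helper returns None instead of a misleading number.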
Unreliable internet
There is one thing that you have to consider carefully when you add these "recording hooks". All this data is collected over the internet, and your players may use any kind of internet connection. They may also be using old and slow devices. This means you have no guarantee that every data packet from a player's device ever reaches your backend databases, if it is sent at all. It is safe to assume that at least a certain percentage of data packets are lost during transmission.
The biggest reasons for missing or corrupted data packets are:
- 2G/3G/4G mobile connections
- Old and unreliable wired connections
- Heavy traffic
- A computer or mobile device that is clearly below the required minimum specification
The last item on that list usually happens with Android games. People install them on old devices that do not have enough memory or processing power. The game starts to choke in graphics-intensive scenes, and then some data packet transmissions are dropped.
Actually, this problem appears not only in games but in any kind of application that uses the internet for data communication. In some cases you can solve it with two-way communication, where every data packet has to be acknowledged as received and is otherwise resent. However, this makes the communication system more complex and increases the load on both server and client. In any case, there is no perfect solution.
So you should design your recorded events in such a way that, if the data turns out to be incomplete, you can reconstruct at least the most important events with the help of other recorded events. I have encountered these situations myself, and careful planning can diminish the problem significantly.
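The acknowledge-and-resend idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `AckedSender`, its `transport` callable and the `max_pending` limit are hypothetical names, and a real client would also need timers, persistence and server-side de-duplication by packet id.

```python
from collections import OrderedDict
from typing import Any, Callable


class AckedSender:
    """Keeps every sent packet until the server acknowledges it.

    `transport` is a hypothetical callable(packet_id, payload) that
    tries to deliver one packet; it may silently lose it.
    """

    def __init__(self, transport: Callable[[int, Any], None],
                 max_pending: int = 1000) -> None:
        self._transport = transport
        self._pending: "OrderedDict[int, Any]" = OrderedDict()
        self._next_id = 0
        self._max_pending = max_pending

    def send(self, payload: Any) -> int:
        packet_id = self._next_id
        self._next_id += 1
        # Bound memory use on weak devices: drop the oldest unacked packet.
        if len(self._pending) >= self._max_pending:
            self._pending.popitem(last=False)
        self._pending[packet_id] = payload
        self._transport(packet_id, payload)
        return packet_id

    def on_ack(self, packet_id: int) -> None:
        # The server confirmed receipt, so the packet can be forgotten.
        self._pending.pop(packet_id, None)

    def resend_pending(self) -> int:
        """Resend everything still unacknowledged; return how many."""
        for packet_id, payload in list(self._pending.items()):
            self._transport(packet_id, payload)
        return len(self._pending)
```

The extra bookkeeping on the client, and the acknowledgement traffic and duplicate handling on the server, are exactly the added complexity and load mentioned above. The `max_pending` cap also shows why this is not a perfect solution: on a device that stays offline long enough, the oldest packets are still lost.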
Where did players go?
As an example, I had a case where a small group of players did not seem to be playing at all, in the sense that they had no recorded scores. This aroused my curiosity: I wanted to see what they were doing and perhaps understand why this happened. Fortunately, I had instrumented all kinds of game events with data recording hooks, so I had plenty of data on which to base my analysis.
I quickly found out that most of these non-scoring players were in fact playing. I could easily reconstruct this from the fact that there were recorded events of game progress, even though the game-end and score records were missing. Of course, it is always possible that they quit the game before it ended.
It was clear that some of them had been playing, but not how much. So I dug deeper and found that part of these players must have played at least one game through, perhaps more, because they had earned certain bonuses that a player gains only by playing the game to the end.
All this work was worthwhile. Of this non-scoring group, less than 2% were left totally unexplained and 5% clearly showed no signs of play. 23% showed signs of play progress, but it was not clear how far they had actually progressed. The rest (70%) could be reconstructed with a reliable number of game sessions. Based on these findings, I could update the automatic database queries that feed the data analysis algorithms so that they produce correct statistics.
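The kind of reasoning used in this case can be sketched as a small classifier. This is not the actual query logic from the project: the event names are hypothetical, and two rules are assumptions of mine, namely that each end-of-game bonus corresponds to one completed game, and that a player with no recorded events at all counts as "unexplained".

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical event names; a real game would substitute its own.
SCORE_EVENT = "game_end_score"
COMPLETION_ONLY_EVENTS = {"end_of_game_bonus"}   # only earned by finishing
PROGRESS_EVENTS = {"level_started", "checkpoint_reached"}


def classify_player(event_names: Iterable[str]) -> Tuple[str, int]:
    """Return (category, lower bound on completed games) for one player."""
    counts = Counter(event_names)
    if counts[SCORE_EVENT] > 0:
        return "scored", counts[SCORE_EVENT]
    # Assumption: one completion-only bonus per finished game.
    completed = sum(counts[name] for name in COMPLETION_ONLY_EVENTS)
    if completed > 0:
        return "reconstructed", completed
    if any(counts[name] > 0 for name in PROGRESS_EVENTS):
        return "progress_only", 0
    if counts:
        # Some events exist (e.g. login), but none of them indicate play.
        return "no_play", 0
    return "unexplained", 0
```

For example, `classify_player(["login", "level_started", "end_of_game_bonus"])` falls into the "reconstructed" category with at least one completed game, even though no score record ever arrived.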
Data is always incomplete
I have worked for almost a decade analysing medical and bioscience data and designing data acquisition and storage systems for those fields of scientific research. In those fields you have to accept the fact that data is always incomplete. Due to human-related issues, you are always missing some parts of the data on many individuals. It is quite easy to believe that in games you could get accurate and complete data on all player actions, since everything happens within the computer. But it just does not happen that way.
The lesson to learn from the incident described above is that in computer and mobile games you really should record as many player events and actions as possible, at all reasonable points of the game flow. Since data acquisition is quite unreliable (for the reasons mentioned above), you can never know in advance which bits of data you will need to reconstruct the missing events. With enough of them recorded, you may be able to turn most of those questionable cases into real evidence.