Tuesday, 24 November 2015

Importance of Computational Complexity

At frequent intervals I organize virtual workshops for selected individuals who have some background in databases and are interested in learning SQL and NoSQL. These are purely learning experiences for every participant, giving some exposure to elementary procedures in NoSQL. But I also have one hidden agenda: I try to teach some basic ideas of computational complexity, which are quite often omitted from computing courses.

Algorithms and computational complexity are the key to understanding why your program or DB query performs so poorly - in other words, why it is so slow. Unfortunately, many computer science students hate the algorithm analysis courses where the principles of computational complexity are taught. They seem to miss the point that algorithm analysis is a stairway to better programming, and a necessary skill for DB query optimization, Data Science and Big Data analysis.

The latest workshop focused on MySQL and ArangoDB as example engines. Participants first practised some queries on a premade SQL database, then exported all the data to ArangoDB and built similar queries there. This gave everybody an opportunity to compare SQL and NoSQL query styles and performance, and to see how the principles of computational complexity relate to query optimization.

During the workshop there was some really interesting progress on the queries, which nicely shows what computational complexity actually means in practice.


Familiar country - SQL

After familiarizing themselves with the database, the participants had several tasks in which they had to write queries to collect certain kinds of data sets. One simple task was to find the top scoring players, with player names included. Player data was in its own table, and scores were recorded in a separate table together with game type and other game information.

One participant's early version of the query looked like this:


Query A.

SELECT u.playername, h.gamescore
FROM (SELECT playerid, MAX(gamescore) as highscore 
FROM scoretable 
WHERE gametype = 'CTF' 
GROUP BY playerid) x
JOIN
 scoretable h ON x.playerid = h.playerid AND x.highscore = h.gamescore
JOIN
 player u ON h.playerid = u.playerid
ORDER BY h.gamescore DESC LIMIT 10


Participants were quite satisfied with it, since its running time seemed quite fast: 76 ms, to be exact. But I had to point out that it is possible to make it faster. You can easily see that Query A is doing unnecessary work: it joins the maximum score of each and every player with the player data, and only after that sorts the rows in descending order and picks out the 10 best.

This means that the query first groups the game scores and picks the maximum value for each scoring player (N rows), then joins this with the player data (M rows) and sorts the joined table. But we are only interested in the 10 best scoring players, so it is better to sort those maximum scores immediately and only after that join the 10 best players with the player data table, like this:


Query B.

SELECT u.playername, x.highscore
FROM (SELECT playerid, MAX(score) as highscore 
FROM scoretable 
WHERE gametype = 'CTF' 
GROUP BY playerid
ORDER BY highscore DESC
LIMIT 10) x
JOIN
 player u ON x.playerid = u.playerid

and also get rid of the unnecessary second join to scoretable from the original query. The running time is roughly halved, to 35 ms. Joins cost performance, and now we are joining only the 10 best players with the player table, not all scoring players.
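If you want to verify where the time goes rather than just reason about it, you can prefix both queries with EXPLAIN and compare the row estimates MySQL reports for each step; in Query B the derived table carries only 10 rows into the join. A minimal sketch, using the same table and column names as in Query B above:

EXPLAIN
SELECT u.playername, x.highscore
FROM (SELECT playerid, MAX(score) as highscore
FROM scoretable
WHERE gametype = 'CTF'
GROUP BY playerid
ORDER BY highscore DESC
LIMIT 10) x
JOIN
 player u ON x.playerid = u.playerid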


Hostile territory - NoSQL


When the SQL tasks were completed, the whole database was exported into ArangoDB. This was a very simple task, since you can export a MySQL database in JSON format, which is really easy to import into ArangoDB.

After the import the participants had to recreate the same queries as before, but now with ArangoDB's query language AQL. They had only a few problems with AQL. It has some similarity to SQL, which makes it easier to "jump into" for anyone familiar with SQL - compared to JavaScript-style queries. But there were also enough differences to lead to many mistakes.

The biggest difference seemed to be the processing order, but when the participants finally understood that AQL is like a mix of procedural SQL and basic SQL, they started to grasp the skill of building AQL queries. Their first try at the '10 best players' query was far from efficient, however.


Query C.

FOR d IN players
LET hscores = (
FOR s IN scoretable
FILTER s.playerid == d.idplayer
FILTER s.gametype == 'CTF'
LET numscore = TO_NUMBER(s.score)
COLLECT pid = s.playerid INTO scores
RETURN {'hscore': MAX(scores[*].numscore)}
)
SORT hscores[0].hscore DESC
LIMIT 0,10
RETURN {'player': d.playername, 'score': hscores[0].hscore}

This query works and gives the correct result, but it has a disastrous running time of 26 seconds (26,000 ms), which is unacceptable. After I guided the participants to analyse its computational complexity, they soon understood why. The query goes through every player in the database, and for each player it scans the whole score data to calculate that player's maximum score (even if the player has no scores at all), then sorts the results and picks the 10 highest scoring players. With N players and M score documents that is roughly N * M document reads before the sort even starts.

The real problem is the sort operation, however. The sort happens at a stage where the query optimizer cannot do anything about it. SORT is in general a bit of a bottleneck in ArangoDB, but in this case there is also a lot of unnecessary calculation. The problem was that the participants tried to join player data and player scores and collect the result simultaneously, as they did in SQL.

In AQL you can operate procedurally, step by step. I guided the participants to first collect just the 10 best scoring player ids, and only then join those with the player names, leading to this query:


Query D.

LET leader = (FOR s IN scoretable
FILTER s.gametype == 'CTF'
LET numscore = TO_NUMBER(s.score)
COLLECT pid = TO_NUMBER(s.playerid) INTO scoresbypid
LET hscore = MAX(scoresbypid[*].numscore)
SORT hscore DESC
LIMIT 0,10
RETURN {'pid': pid, 'score':hscore}
)
FOR l IN leader
FOR d IN players
FILTER d.idplayer == l.pid
RETURN {'player': d.playername, 'score': l.score}

The participants were astonished when the running time dropped to 132 ms. This performance is good, but not yet the best. The only difference is that we first operate only on the score data: group it by player id, calculate the maximums and sort them, picking the 10 best. Only after that is this result, now just 10 objects, joined with the player names.

Even in this case the sort operation is a bottleneck, but less so than before. To conclude the session, I presented a final modification to Query D in order to optimize it even further.


Query E.

LET leader = (FOR s IN scoretable
FILTER s.gametype == 'CTF'
SORT s.score DESC
LIMIT 400
LET numscore = TO_NUMBER(s.score)
COLLECT pid = TO_NUMBER(s.playerid) INTO scoresbypid
LET hscore = MAX(scoresbypid[*].numscore)
SORT hscore DESC
LIMIT 0,10
RETURN {'pid': pid, 'score':hscore}
)
FOR l IN leader
FOR d IN players
FILTER d.idplayer == l.pid
RETURN {'player': d.playername, 'score': l.score}

The only difference is the extra sort operation before grouping. This may sound like a stupid idea, because we already know that sort is a bottleneck. However, it performs quite fast on the raw score data. Since we are looking for the 10 best scoring players, we are not interested in low scores at all. In this case no player had more than 200 scores recorded, so I simply sorted all the scores and took only the 400 best ones into the subsequent calculations. (Strictly speaking 400 rows is not a guarantee - in the worst case the nine better players' entries could fill up to 9 * 200 = 1 800 of the top positions - but on this data it was more than enough.) The running time of this query dropped to 33 ms, which was even better than the best MySQL query on the fully optimized database.
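The 'no player had more than 200 scores' assumption is easy to check against the data itself before hard-coding a limit like 400. A small AQL sketch, assuming the same collection and field names as in the queries above:

FOR s IN scoretable
FILTER s.gametype == 'CTF'
COLLECT pid = s.playerid WITH COUNT INTO cnt
SORT cnt DESC
LIMIT 1
RETURN {'pid': pid, 'scorecount': cnt}

This returns the player with the most recorded CTF scores, which tells you how large the safety margin of the pre-sort LIMIT really is.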


Aftermath


Participants found it interesting how fast an SQL database is when it is properly indexed and uses foreign keys, so that the optimizer has something to work with. So you should not think of RDBMSs and SQL as obsolete old technology. They are still very competent at what they do, if you know how to tune them.
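To give a rough idea of what 'properly indexed' means for the score queries above - the index and constraint names below are made up, and the column names are assumed to be the ones used in Query B - something like this goes a long way in MySQL:

-- composite index: MAX(score) per player for one game type can be resolved largely from the index
CREATE INDEX idx_scoretable_type_player_score ON scoretable (gametype, playerid, score);
-- foreign key (InnoDB indexes it) keeps the join to the player table cheap and the data consistent
ALTER TABLE scoretable ADD CONSTRAINT fk_scoretable_player FOREIGN KEY (playerid) REFERENCES player (playerid)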

It was also interesting how easy it is to write sloppy queries in NoSQL, and how you really have to plan your queries. On the other hand, ArangoDB's AQL gives you lots of options for filtering out exactly the data you need.

Docker, the future of virtualization?

Calling Docker a virtualization platform does not do it justice. In a sense it is more than that: it is a kind of virtualization, but it covers a wide spectrum from microservices to software distribution. Docker units are not virtual machines (VMs) in the same sense as VirtualBox VMs; they are called containers. Unlike a VM, which has a guest OS, a container is more like a package of software with a filesystem. The guest OS is missing completely, and the Docker engine runs the software inside the container.

Docker is an open source platform for developing, deploying and running distributed applications. It is extremely useful for developers, not just administrators. I have used virtualization platforms like VirtualBox and VMware for more than a decade, and I quickly saw the possibilities in Docker.

Since I quite often have to develop various software solutions and analysis pipelines, I need a quick and reliable way to run software on various platforms so that components stay isolated and contained. Sometimes I need an SQL or NoSQL database on an ad hoc basis for quick data import and analysis. Docker is an invaluable tool for scenarios where you have to develop and test something without the risk of messing up other active processes.

Docker has several advantages. It is lightweight. It is isolated and secure, yet uses host memory and processing resources more efficiently than a VM. Perhaps the best part is that Docker is based on open standards, and the software itself is open source and free.

What makes it even better is portability and flexibility with regard to infrastructure. When properly packaged, a Docker container should run in any Docker-supported environment, because the container carries its software dependencies with it.

In my opinion, Docker should be part of any Data Scientist's toolbox.

Monday, 23 November 2015

Data Science, Mobile Gaming and Incomplete Data

Data Science is currently an integral part of game design, covering its whole lifecycle. Game designers and producers need insights into player behaviour, and this information is used from the earliest sketches of the game all the way to the soft launch and the real launch. After the launch, data analysis plays an even more important role in marketing and in possible adjustments to game features. Of course, all these analysis results will also be used when designing new games.

Record everything

In order to acquire all that insight into the game, we need data scientists to pinpoint all the in-game actions and events that need to be recorded in databases. This has to be carefully planned according to the needs of later analysis. That is why the data scientist has to participate actively in game design and development from the very beginning of the project.

Basically any choice the player makes should be recorded, along with all UI actions. Depending on the type of game, it may also be relevant to record some events or actions in the game environment to give context to the player's actions - e.g. how fast the player reacts to a certain event.

There are several aspects of data that should be recorded - not just player data, game scores, purchases or game events. One important aspect is all the data used to evaluate UX, since it can be used to iron out any wrinkles that get in the way of a smooth gaming experience. Bad UX can ruin an otherwise good game. There are other types of data too, but that is a topic for another post.

Unreliable internet

There is one thing that you have to consider carefully when implementing these "recording hooks". All this data is collected over the internet, and your players may be using any kind of internet connection. They may also be using old and slow devices. This means you have no guarantee that all data packets from a player's device ever reach your backend databases, if they are sent at all. It is safe to assume that at least a certain percentage of data packets are lost in transmission.

The biggest reasons for missing or corrupted data packets are
- 2G/3G/4G mobile connections
- Old and unreliable wired connections
- Heavy traffic
- A computer or mobile device that is clearly below the minimum specification needed.
The last of these is something that usually happens with Android games. People install them on old devices that do not have enough memory or processing power. The game starts to choke in graphics-intensive scenes, and then some data packet transmissions are dropped.

Actually this problem appears not only in games but in any kind of application that uses the internet for data communication. In some cases you can mitigate it with two-way communication, where every data packet has to be acknowledged as received, and is otherwise resent. However, this makes the communication system more complex and increases the load on both server and client. In any case, there is no perfect solution.

So you should design your recording events in such a manner that you can reconstruct at least the most important events with the help of other recorded events if the data turns out to be incomplete. I have encountered these situations myself, and careful planning can diminish the problem significantly.

Where did players go?

As an example, I had a case where a small group of players did not seem to be playing at all, in the sense that they had no recorded scores. This aroused my curiosity, and I wanted to see what they were doing and perhaps understand why this happened. Fortunately I had instrumented all kinds of game events with data recording hooks, so I had plenty of data on which to base my analysis.

I quickly found out that most of these non-scoring players were actually playing. I could easily reconstruct this from the fact that there were recorded events of game progress, even though the game-end and score records were missing. Of course it is always possible that they quit the game before the end.

It was clear that some of them had been playing, but it was not clear how much. So I dug deeper and found that some of these players must have played at least one game, perhaps more, all the way through, because they had achieved certain bonuses that a player gains only by playing the game to the end.

All this work was worthwhile. Of this non-scoring group, less than 2% were left totally unexplained and 5% clearly showed no signs of play. 23% showed signs of play progress, but it was not clear how far they had actually got. For the rest (70%), a reliable number of game sessions could be reconstructed. Based on these findings I could update the automatic database queries behind the data analysis so that the statistics came out correct.

Data is always incomplete

I have worked for almost a decade analysing medical and bioscience data and designing data acquisition and storage systems for those fields of research. There you have to accept the fact that data is always incomplete. Due to human-related issues, you are always missing some parts of the data for many individuals. It is easy to believe that in games you could get accurate and complete data on all player actions, since everything happens inside a computer. But it just does not work out that way.

The lesson to learn from the case described above is that in computer and mobile games you really should record player events and actions as much as possible, at all reasonable points of the game flow. Since data acquisition is quite unreliable (for the reasons mentioned above), you never know which bits of data you will need to reconstruct the missing events. With them, you may be able to turn most of the questionable cases into real evidence.

Sunday, 22 November 2015

Avoid absurd population growth in a game world

Game designers have a responsibility to make their game world logical and realistic within its own reality. So instead of just making up some numbers that describe the behaviour of entities, you should try to calculate the implications.
As an example of a miscalculation, I remember an old fantasy strategy simulation game where - according to the rules of the system - a humanlike population grew by 10% every month. My guess is that the creators of the game considered it an appropriate number in relation to the game turn and the general flow of the game. And when you play the game and observe only two or three consecutive turns, it may feel OK.

Exponential Growth


From a larger perspective this scenario starts to look quite weird. Unless there is a huge death rate, this population will explode in a couple of years, because it more than triples every year - that is what exponential growth means. Besides, when the rules say "grows 10%", my assumption is that all decreasing effects have already been taken into account.
Let's assume that these humanlike creatures have an average human life span. If there is no cause of death other than old age, then a population of 1 000 will grow in 6 years to approximately 1 000 000. In 10 years the population will be almost 100 000 000.
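These figures follow directly from compounding the monthly factor of 1.1:

1.1^12 ≈ 3.1, so the population more than triples every year
1.1^72 ≈ 950, so 1 000 creatures become roughly 950 000 in 6 years
1.1^120 ≈ 93 000, so 1 000 creatures become roughly 93 000 000 in 10 years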

Breeding like rabbits


However, when you start to think about it more deeply, this is absurd from the reproduction point of view as well. Assuming that this humanlike population follows human reproduction patterns, we know that there are two sexes, that pregnancy lasts 9 months, and that humans are not reproductive during at least their first 10 years of life. For simplicity's sake we assume that the male/female distribution is 50/50.
Let's start with a population of 240. Since the system clearly says that the population grows by 10% every month, there have to be women giving birth every month. Let's assume that a woman is not constantly pregnant but has a 3-month "rest period" between pregnancies. This way we can divide the 120 women (50% of the whole population) into 12 groups, with one group giving birth each month.
When the first group of 10 women gives birth, they have to produce 24 children to meet the 10% growth quota. This means that on average every woman has twins (and we know twins are rare in real life), and 4 of them (40%) have to have triplets. As the months and years go by, even twins are not enough, since the population has grown while the newborns will not produce children of their own for at least a decade. So after a while those mothers have to give birth to triplets, then quadruplets, and so on. Well, at least they still have that 3-month rest period to gather strength.
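You can see how quickly the quota escalates by repeating the same calculation later on, still with only 10 of the original mothers giving birth each month (the newborns are not fertile yet):

Month 1: population 240, quota 24 children, i.e. 2-3 per mother
Month 12: population about 685 (240 * 1.1^11), quota about 68 children, i.e. roughly 7 per mother
Month 24: population about 2 150 (240 * 1.1^23), quota about 215 children, i.e. over 20 per mother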

Burden to community


Combine reproduction and exponential growth with the age distribution and you have a population where a large part of the population is little children needing food and care, all mothers are almost constantly pregnant, and the men are working hard to provide sustenance for all of them - sustenance that will be harder and harder to find. At the end of year 3 there will be the original 240 adults, roughly 500 two-year-olds, about 1 600 one-year-olds and about 5 000 newborns. In total, every adult has about 30 children to take care of.
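Those numbers come from the same 10% monthly growth: 240 * 1.1^12 ≈ 753, 240 * 1.1^24 ≈ 2 360 and 240 * 1.1^36 ≈ 7 420, so the yearly birth cohorts are roughly 510, 1 610 and 5 060 children - about 7 180 children in total for the 240 adults.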
Actually those adults were lucky. There was also a game where the population increased by 5% every week. I leave those calculations as an exercise for the reader.

Lesson learned


I can only guess the reasons for this absurd growth rate. Maybe it was needed to compensate for losses in battle, sickness, famine, child mortality and so on. Maybe it was needed to give players the impression that something really happens in every monthly game turn.
But even then the wording should be different - birth rate, for example. Growth usually means the net increase (or decrease): the number of newborns adjusted by the death figures (battles, sickness, accidents etc.). And even if that 10% were a birth rate, it would still be absurd.
What should have been done is quite simple. First do some groundwork and calculate a realistic birth rate, taking into account the age structure of the population, the fertility rate, the reproduction age and the length of pregnancy. Then adjust it with figures for death at birth and death from old age. Then you can list all the possible random causes of death, such as famine, disease, accidents and battles. This gives you equations for estimating how your population could grow, and you can simulate what happens if some variable is increased or decreased.
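As a very rough sanity check - the figures here are ballpark historical estimates, not from any game - a pre-industrial crude birth rate of about 40 per 1 000 per year and a death rate of about 35 per 1 000 gives

net growth ≈ (40 - 35) / 1 000 = 0.5% per year ≈ 0.04% per month

which is roughly 250 times smaller than the game's 10% per month.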
In reality, the annual growth rate of the human population on Earth has peaked at only about 2%, and before the modern era it was well under 0.2%. Currently the growth rate is declining.

My point is that any game designer should do some calculations and check whether the scenario is plausible or totally ridiculous. More importantly, those calculations help in game design, because you can simulate various scenarios - especially if it is any kind of strategy or simulation game. Otherwise there will always be at least one nitpicker like me who will quickly do the maths and put the results on the Internet for everybody's amusement.