Tuesday, 24 November 2015

Importance of Computational Complexity

At frequent intervals I organize virtual workshops for selected individuals who have some background in databases and are interested in learning SQL and NoSQL. These are purely learning experiences for every participant, where you can get some exposure to elementary procedures in NoSQL. But I also have one hidden agenda: I try to teach some basic ideas in computational complexity, which are quite often omitted from many computer courses.

Algorithms and computational complexity are the key to understanding why your program or DB query performs so poorly - in other words, why it is so slow. Unfortunately, many computer science students hate the algorithm analysis courses where the principles of computational complexity are taught. They seem to miss the point that algorithm analysis is a stairway to better programming, and a necessary skill for DB query optimization, Data Science and Big Data analysis.

The latest workshop focused on MySQL and ArangoDB as example engines. Participants had to write some queries against a premade SQL database, then export all the data to ArangoDB and build similar queries there. This gave everybody an opportunity to compare SQL and NoSQL query styles and performance, and to see the principles of computational complexity at work in query optimization.

During the workshop there was some really interesting progress on the queries, which nicely shows what computational complexity actually means in practice.


Familiar ground - SQL

After familiarizing themselves with the database, the participants had several tasks where they had to write queries to collect certain kinds of data sets. One simple task was to find the top-scoring players, with player names included. Player data was in its own table, and scores were recorded in a separate table together with the game type and other game information.
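
The exact workshop schema is not shown here, but reconstructed from the queries below it was roughly like the following sketch (the column types and the scoreid surrogate key are illustrative assumptions, not the exact workshop schema):

CREATE TABLE player (
    playerid   INT PRIMARY KEY,
    playername VARCHAR(64)
);

CREATE TABLE scoretable (
    scoreid  INT PRIMARY KEY,      -- hypothetical surrogate key
    playerid INT,                  -- which player achieved the score
    gametype VARCHAR(16),          -- e.g. 'CTF'
    score    INT,                  -- the score itself
    FOREIGN KEY (playerid) REFERENCES player (playerid)
);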

An early version of the necessary query, written by one participant, looked like this:


Query A.

SELECT u.playername, h.score
FROM (SELECT playerid, MAX(score) AS highscore
      FROM scoretable
      WHERE gametype = 'CTF'
      GROUP BY playerid) x
JOIN scoretable h ON x.playerid = h.playerid AND x.highscore = h.score
JOIN player u ON h.playerid = u.playerid
ORDER BY h.score DESC
LIMIT 10


Participants were quite satisfied with it, since its running time seemed quite fast - 76 ms, to be exact. But I had to point out that it is possible to make it faster. You can easily see that Query A is doing unnecessary work: it joins the maximum score of each and every player with the player data, and only after that sorts the rows in descending order and picks out the 10 best.
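
If you want to see this extra work for yourself, you can prefix the query with MySQL's EXPLAIN; the derived table, the join back to scoretable and the filesort caused by ORDER BY all show up as separate steps in the plan. A minimal sketch - the exact plan output depends on your MySQL version and indexes:

EXPLAIN
SELECT u.playername, h.score
FROM (SELECT playerid, MAX(score) AS highscore
      FROM scoretable
      WHERE gametype = 'CTF'
      GROUP BY playerid) x
JOIN scoretable h ON x.playerid = h.playerid AND x.highscore = h.score
JOIN player u ON h.playerid = u.playerid
ORDER BY h.score DESC
LIMIT 10;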

This means that the query first groups the game scores and picks the maximum value for each scoring player (N rows), then joins this with the player data (M rows) and sorts the whole joined table. But we are only interested in the 10 best-scoring players, so it is better to sort those scores immediately and only after that join the 10 best players with the player data table, like this:


Query B.

SELECT u.playername, x.highscore
FROM (SELECT playerid, MAX(score) AS highscore
      FROM scoretable
      WHERE gametype = 'CTF'
      GROUP BY playerid
      ORDER BY highscore DESC
      LIMIT 10) x
JOIN player u ON x.playerid = u.playerid

and also get rid of the unnecessary join in the original query. The running time is actually more than halved, down to 35 ms. Joins cost performance, and now we are joining only the 10 best players, not every scoring player.
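
Proper indexing helps even further; we come back to this in the aftermath. As a sketch (the index name is made up), a composite index like the following would let MySQL resolve the gametype filter, the GROUP BY and the MAX(score) aggregate from the index alone, without touching the table rows:

-- Covering index for Query B's inner aggregation: the gametype filter,
-- GROUP BY playerid and MAX(score) can all be answered from the index.
CREATE INDEX idx_gametype_player_score
    ON scoretable (gametype, playerid, score);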


Hostile territory - NoSQL


When the SQL tasks were completed, the whole database was exported into ArangoDB. This was a very simple task, since you can export a MySQL database in JSON format, which is really easy to import into ArangoDB.

After the import, participants had to recreate the same queries as before, but now with ArangoDB's query language AQL. Participants had only a few problems with AQL. It has some similarity with SQL, which makes it easier to "jump into" for anyone familiar with SQL - compared to JavaScript-style queries. But there were also enough differences to lead to many mistakes.

The biggest difference seemed to be the processing order, but when participants finally understood that AQL is like a mix of procedural SQL and plain SQL, they started to grasp the skill of building AQL queries. Their first try at the '10 best players' query was far from efficient, however.


Query C.

FOR d IN players
    LET hscores = (
        FOR s IN scoretable
            FILTER s.playerid == d.idplayer
            FILTER s.gametype == 'CTF'
            LET numscore = TO_NUMBER(s.score)
            COLLECT pid = s.playerid INTO scores
            RETURN {'hscore': MAX(scores[*].numscore)}
    )
    SORT hscores[0].hscore DESC
    LIMIT 0, 10
    RETURN {'player': d.playername, 'score': hscores[0].hscore}

This query works and gives the correct result, but it has a disastrous running time of 26 seconds (26,000 ms), which is unacceptable. After I guided the participants to analyse its computational complexity, they soon understood why. The query goes through every player in the database, and for each player it scans the score data again to calculate that player's maximum score (even if the player has no scores at all); only then does it sort the results and pick the 10 highest-scoring players. With N players and M score records, that nested scan makes the query roughly O(N·M) work, instead of a single pass over the score data.

The real problem, however, is the sort operation. The sort happens at a phase where the query optimizer cannot optimize it. Actually, SORT is a bit of a bottleneck in ArangoDB. But in this case there is also a lot of unnecessary calculation. The problem was that the participants tried to join player data and player scores and collect the result simultaneously, like they did in SQL.

In AQL you can operate procedurally, step by step. I guided the participants to first collect just the 10 best-scoring player ids, and only then join those with the player names, leading to this query:


Query D.

LET leader = (
    FOR s IN scoretable
        FILTER s.gametype == 'CTF'
        LET numscore = TO_NUMBER(s.score)
        COLLECT pid = TO_NUMBER(s.playerid) INTO scoresbypid
        LET hscore = MAX(scoresbypid[*].numscore)
        SORT hscore DESC
        LIMIT 0, 10
        RETURN {'pid': pid, 'score': hscore}
)
FOR l IN leader
    FOR d IN players
        FILTER d.idplayer == l.pid
        RETURN {'player': d.playername, 'score': l.score}

Participants were astonished when the running time dropped to 132 ms. This performance is good, but not the best we can do. The only difference was that we first operated on the score data alone: grouped it by player id, calculated the maximum for each player, sorted the result and picked the 10 best. Only after that was this 10-object result joined with the player names.

Even in this case the sort operation is a bottleneck, but less so than in the previous case. To conclude the session, I presented a final modification to Query D in order to optimize it even further.


Query E.

LET leader = (
    FOR s IN scoretable
        FILTER s.gametype == 'CTF'
        LET numscore = TO_NUMBER(s.score)
        SORT numscore DESC
        LIMIT 400
        COLLECT pid = TO_NUMBER(s.playerid) INTO scoresbypid
        LET hscore = MAX(scoresbypid[*].numscore)
        SORT hscore DESC
        LIMIT 0, 10
        RETURN {'pid': pid, 'score': hscore}
)
FOR l IN leader
    FOR d IN players
        FILTER d.idplayer == l.pid
        RETURN {'player': d.playername, 'score': l.score}

The only difference is the extra sort and LIMIT before the grouping (sorting on the numeric value, so string-encoded scores do not get ordered lexicographically). This may sound like a stupid idea, because we already know that sort is a bottleneck. However, it performs quite fast on the original data, and since we are looking for the 10 best-scoring players, we are not interested in low scores at all. In this case, no player had more than 200 scores recorded, so I simply sorted all scores and passed only the 400 best on to the subsequent calculations. (Strictly speaking, with up to 200 scores per player a cutoff of 400 is only guaranteed to capture the top two players; a guaranteed-safe cutoff for the top 10 would be around 10 × 200 = 2000 scores. In practice 400 was more than enough for this data.) The running time of this query dropped to 33 ms, which was even better than the best MySQL query on the fully optimized database.


Aftermath


Participants found it interesting how fast a SQL database is when it is properly indexed and uses foreign keys, so that the optimizer has something to work with. So you should not think that RDBMSs and SQL are obsolete old technology. They are still very competent at what they do, if you know how to tune them.

It was also interesting to see how easy it is to write sloppy queries in NoSQL, and how much you actually have to plan your queries. On the other hand, in this case ArangoDB's AQL gives you lots of options for filtering out exactly the data you need.
