Wrangling and Chess Analysis

Data Source


This analysis focuses on extracting data from an unusual/irregular form (i.e. not nice and rectangular in shape, with each observation in a different row and each column representing a different variable) and making usable datasets from them. Chess data comes in a '.pgn' file which is a specialized filetype used to record the information of any chess game. This project thus used publically available .pgn files from Lichess. The first section goes into detail about how we can manipulate the irregular '.pgn' file to make a regular, retangular dataset to analyze. If you want to skip this part and just go to the analysis, click 'Jump to Analysis' below. Otherwise, lets talk about how we can deal with an irregular data structure.

Jump to analysis


If we open a .pgn file (using a standard notepad opener), we see that the file contains data on thousands of different games. Usually the first 17 or so lines of the file contains the metadata information of each game played (highlighted in blue). This metadata contains the names of the players, their ELO (a score representing their ranking, higher is better), UTC date and time, their ELO rating difference and a whole lot of other information. Then we see that there is a linebreak and the following line (highlighted in orange) contains info about the actual moves played and it is all contained within one single line (in the example image there is a soft wrap on the line but it's all technically one line).


So, we can view this .pgn file as a giant list of a singluar column data. Each row represents a different type of data [Date, time, round, name of black player, name of white player etc.] and each observation is a string which can be parsed through using string selectors. Finally each set of 19 lines contains data about one game of chess and any line that starts with the strings "[Event" means that the following rows contain data about another game.


So, now we need to separate the metadata from the actual moves played and create a dataset that takes all this "column list" data and make it into a rectangular format; where each row is a different game played, and each column contains the metadata info and the actual moves played. In essence, we're widening this column type of dataset. Our final output should be much wider, where each new column has data that were previously stored within different rows.

library(knitr)
library(formatR)
library(rmarkdown)
library(plyr)
library(tidyverse)
library(XML)
library(stringr)
library(dplyr)
library(tidyr)
library(quantreg)
library(radiant)
library(skimr)

Create a data frame or a tibble of tags/metadata portion of the games in chess_classic_games.pgn. One row should represent one game, and one column should represent one tag.

# Create a dataframe or a tibble of tags/metadata portion of
# the games in chess_classic_games.pgn 

pgn_ex <- readLines("chess_classic_games.pgn")

pgn2 <- str_split_fixed(pgn_ex, " ", 2)

# create a boolean for metadata chars
is_metadata = str_detect(pgn_ex, "^\\[.*\\]$")
is_metadata[1:25]
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## [25]  TRUE

Let's explain what we just did. So, we know that from the image and contents of the .pgn file that each game has a bunch of rows of data that is all metadata, then there is a line skip, and then there is info about the game moves, another line skip, and then the medata for the NEXT game starts. Since all metadata values start and end with "[...METADATA...]", we can use str_detect() to go through out .pgn file, line by line, and ask, 'is this row a metadata row?'. This is the function of is_metadata(). We can double check our process and use the image of the .pgn file to see that lines 18,19,20 should all fail the metadata check, as these lines do not contain "[]". Looking at the first 25 lines, our output confirms this result, with three FALSE values exactly at lines 18,19, and 20.
This idea of string searching and string splitting is our bread and butter in separating out each 'variables' of the columns we want to created, which we will do below.

# now to blank out the non-metadata, 
pgn_meta <- pgn_ex
	
# this one blanks out all the FALSE results from is_metadata
pgn_meta[!is_metadata] = ""

# string split based on a space 
pgn_meta2 <- str_split_fixed(pgn_meta, " ", 2)

# print results for checking
pgn_meta2[1:25,]
##       [,1]              
##  [1,] "[Event"          
##  [2,] "[Site"           
##  [3,] "[Date"           
##  [4,] "[Round"          
##  [5,] "[White"          
##  [6,] "[Black"          
##  [7,] "[Result"         
##  [8,] "[UTCDate"        
##  [9,] "[UTCTime"        
## [10,] "[WhiteElo"       
## [11,] "[BlackElo"       
## [12,] "[WhiteRatingDiff"
## [13,] "[BlackRatingDiff"
## [14,] "[ECO"            
## [15,] "[Opening"        
## [16,] "[TimeControl"    
## [17,] "[Termination"    
## [18,] ""                
## [19,] ""                
## [20,] ""                
## [21,] "[Event"          
## [22,] "[Site"           
## [23,] "[Date"           
## [24,] "[Round"          
## [25,] "[White"          
##       [,2]                                                                  
##  [1,] "\"Rated Bullet tournament https://lichess.org/tournament/rs53Xpda\"]"
##  [2,] "\"https://lichess.org/Xa3soL0G\"]"                                   
##  [3,] "\"2018.09.30\"]"                                                     
##  [4,] "\"-\"]"                                                              
##  [5,] "\"YessidP\"]"                                                        
##  [6,] "\"pushka_1995\"]"                                                    
##  [7,] "\"1-0\"]"                                                            
##  [8,] "\"2018.09.30\"]"                                                     
##  [9,] "\"22:00:08\"]"                                                       
## [10,] "\"1607\"]"                                                           
## [11,] "\"1574\"]"                                                           
## [12,] "\"+10\"]"                                                            
## [13,] "\"-11\"]"                                                            
## [14,] "\"D05\"]"                                                            
## [15,] "\"Colle System\"]"                                                   
## [16,] "\"60+0\"]"                                                           
## [17,] "\"Time forfeit\"]"                                                   
## [18,] ""                                                                    
## [19,] ""                                                                    
## [20,] ""                                                                    
## [21,] "\"Rated Bullet game\"]"                                              
## [22,] "\"https://lichess.org/oYMqpagH\"]"                                   
## [23,] "\"2018.09.30\"]"                                                     
## [24,] "\"-\"]"                                                              
## [25,] "\"zulike2004\"]"
# remove the [ in the first one to get the varnames 
varnames <- str_replace(pgn_meta2[,1], "\\[","")


# This sets up the names, replacing empty strings with ""
varnames <- replace(varnames, varnames == "", NA)

# now for the actual values, cleaning the values of the variablenames is harder
# removing quotation marks takes one step at a time 

values = pgn_meta2[,2]

# check first 20
values[1:20]
##  [1] "\"Rated Bullet tournament https://lichess.org/tournament/rs53Xpda\"]"
##  [2] "\"https://lichess.org/Xa3soL0G\"]"                                   
##  [3] "\"2018.09.30\"]"                                                     
##  [4] "\"-\"]"                                                              
##  [5] "\"YessidP\"]"                                                        
##  [6] "\"pushka_1995\"]"                                                    
##  [7] "\"1-0\"]"                                                            
##  [8] "\"2018.09.30\"]"                                                     
##  [9] "\"22:00:08\"]"                                                       
## [10] "\"1607\"]"                                                           
## [11] "\"1574\"]"                                                           
## [12] "\"+10\"]"                                                            
## [13] "\"-11\"]"                                                            
## [14] "\"D05\"]"                                                            
## [15] "\"Colle System\"]"                                                   
## [16] "\"60+0\"]"                                                           
## [17] "\"Time forfeit\"]"                                                   
## [18] ""                                                                    
## [19] ""                                                                    
## [20] ""
# replaces multiple instances of \ with empty ""
values = str_replace_all(values, "\"", "")
values[1:20]
##  [1] "Rated Bullet tournament https://lichess.org/tournament/rs53Xpda]"
##  [2] "https://lichess.org/Xa3soL0G]"                                   
##  [3] "2018.09.30]"                                                     
##  [4] "-]"                                                              
##  [5] "YessidP]"                                                        
##  [6] "pushka_1995]"                                                    
##  [7] "1-0]"                                                            
##  [8] "2018.09.30]"                                                     
##  [9] "22:00:08]"                                                       
## [10] "1607]"                                                           
## [11] "1574]"                                                           
## [12] "+10]"                                                            
## [13] "-11]"                                                            
## [14] "D05]"                                                            
## [15] "Colle System]"                                                   
## [16] "60+0]"                                                           
## [17] "Time forfeit]"                                                   
## [18] ""                                                                
## [19] ""                                                                
## [20] ""
# check first 20 of new values
values = str_replace(values, "\\]$", "")
values[1:20]
##  [1] "Rated Bullet tournament https://lichess.org/tournament/rs53Xpda"
##  [2] "https://lichess.org/Xa3soL0G"                                   
##  [3] "2018.09.30"                                                     
##  [4] "-"                                                              
##  [5] "YessidP"                                                        
##  [6] "pushka_1995"                                                    
##  [7] "1-0"                                                            
##  [8] "2018.09.30"                                                     
##  [9] "22:00:08"                                                       
## [10] "1607"                                                           
## [11] "1574"                                                           
## [12] "+10"                                                            
## [13] "-11"                                                            
## [14] "D05"                                                            
## [15] "Colle System"                                                   
## [16] "60+0"                                                           
## [17] "Time forfeit"                                                   
## [18] ""                                                               
## [19] ""                                                               
## [20] ""
# One row should represent one game, and one column should represent one tag
# try without using a for loop, pivot from long to wide 
metadata_df <- data.frame(varnames, values)

test2 <- metadata_df %>%
  group_by(varnames) %>%
  mutate(row = row_number()) %>%
  tidyr::pivot_wider(names_from = varnames, values_from = values)%>%
  select(-row)

# Remove NA's from pivoting
test2 <- test2[1:sum(str_count(varnames, "Event"), na.rm = TRUE),]

Cleaned_chess_classic <- test2[,2:length(test2)] # remove rows column from start
Cleaned_chess_classic[1:5,] # first 5
Cleaned_chess_classic[20:25,] # next 5 after 20
# make tag_line that has the row number of [Event], and moves_line
# that has the row number of [1. d4...], tag_line uses test2 and move_line 
# should / can use pgn_ex 

tag_line <- c(which(metadata_df$varnames == "Event"))

newtest <- cbind(test2, tag_line)

# make move_line next 

# move_line <- c(which(str_detect(pgn_ex, "^[:digit:]"))) doesn't cover all cases
# add in the second case of " 0 -1" into the str_detect and use an | operator

move_line <- which(str_detect(pgn_ex, paste(c("^[:digit:]", " 0-1"), collapse ="|")))

newtest <- cbind(newtest, move_line)


Cleaned_chess_classic <- cbind(Cleaned_chess_classic, tag_line, move_line)

Now we have a more proper dataset that we can work with, if we check the 'Cleaned_chess_classic' table we see that each row is a separate game and all the metadata have now been placed in columns and filled in with their respective values. Essentially, this was using some clever string selection to first create variable names (in varnames) and using the similar method to create the values (in values) for the data that goes into each variable name. So varnames and values are two separate tables created. Finally, they're all merged together using the groupby() function alongside the pivot_wider() function. We're pivoting the original narrow column data with each row depicting a new variable, and pulling it apart across, widening it such that each column (going left to right) now represents the variables.

Now lets start analyzing this data



How much ELO is playing first worth?

Say you're playing a chess match with your friend. You notice that they always go on about how much better than you they are because of their (just barely) higher ELO rating. But you also notice that whenever you play, they always go first. So, you think to yourself, I wonder how much ELO going first is worth?

Lets think about this problem a bit more. There is always a slight advantage going first in a chess match, white pieces tends to have slightly higher than 50% winrate and this is seen across all ELOs. This means, that there is some advantage to going first and thus, it should reflect on the ELO rating. We want to calculate what this ELO rating would be.
To do this, we can think of the problem backwards and ask, "To achieve a 50% winrate between White and Black, what should the ELO differences between White and Black be?" Thus, this ELO difference would represent the ELO value of going first.


Caveat: You may be thinking, isn't this too simple? And the answer is, yes. This is probably too simplified but lets just do some napkin math and have fun with it.

Lets create a linear model of the proportion of time the first player (white) wins as a function of the amount of higher rating the first player has over the second player. We'll use this model to determine how many rating points playing first is worth (i.e., how much lower in ELO does the player using white have to be in order to win exactly half of the time).

# y = mx  + b ; 
# y = proportion of wins (Result = 1-0),
# x = Diff between positive WhiteElo - Black Elo 
# b = intercept

### ok let convert that resutls to 1 0 (White win or loss, count draws as loss)

# Create new data row that has a binary var of 1 or 0 based on white wins 
Result_binary <- c(nrow = nrow(newtest))

for (i in 1:nrow(newtest)){
  if (newtest$Result[i] == "1-0"){
    Result_binary[i] <- 1
  }  else {
    Result_binary[i] <- 0
  }
}

# create smaller df, with just elo vars and Result_binary for lm
Ratings_worth <- data.frame(WhiteElo = newtest$WhiteElo,BlackElo = newtest$BlackElo, Result_binary = Result_binary)

# filter where WhiteElo > BlackElo
Ratings_worth <- Ratings_worth[which((as.numeric(Ratings_worth$WhiteElo) > as.numeric(Ratings_worth$BlackElo))),]

# create WhiteDifference vector
Ratings_worth$WhiteDiff <- as.numeric(Ratings_worth$WhiteElo) - as.numeric(Ratings_worth$BlackElo)

# create the linear model 
Ratings_worth_mod <- lm(Result_binary~WhiteDiff, data = Ratings_worth)

summary(Ratings_worth_mod)
## 
## Call:
## lm(formula = Result_binary ~ WhiteDiff, data = Ratings_worth)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9176 -0.5402  0.3243  0.4464  0.4818 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.174e-01  1.242e-02  41.668   <2e-16 ***
## WhiteDiff   7.878e-04  9.278e-05   8.491   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4863 on 2529 degrees of freedom
## Multiple R-squared:  0.02772,    Adjusted R-squared:  0.02734 
## F-statistic:  72.1 on 1 and 2529 DF,  p-value: < 2.2e-16
# calculate white_worth as y = 0.5, half of the time white wins 
white_worth = (0.5 - Ratings_worth_mod$coefficients[1])/ Ratings_worth_mod$coefficients[2]

white_worth
## (Intercept) 
##   -22.06279

Using the linear model above, playing first is worth about 22 ELO rating points, or the white pieces should be below 22 elo points compared to the black pieces in order to win 50% of the time. But does this behaviour hold when really good players play chess?

To answer this question, lets repeat the linear model from before, but filter() the data so that only games between players where both players have a 2000 rating or better is selected.

filtered2k <- filter(Ratings_worth, as.numeric(WhiteElo) >= 2000 & 
                       as.numeric(BlackElo) >= 2000)

mod2k <- lm(Result_binary~WhiteDiff, data = filtered2k)
summary(mod2k)
## 
## Call:
## lm(formula = Result_binary ~ WhiteDiff, data = filtered2k)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8943 -0.5541  0.2856  0.4069  0.4856 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.513249   0.061319    8.37 2.63e-14 ***
## WhiteDiff   0.001184   0.000538    2.20   0.0293 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4813 on 161 degrees of freedom
## Multiple R-squared:  0.02918,    Adjusted R-squared:  0.02315 
## F-statistic: 4.839 on 1 and 161 DF,  p-value: 0.02925
white_worth2k <- (0.5 - mod2k$coefficients[1])/ mod2k$coefficients[2]
white_worth2k
## (Intercept) 
##   -11.19527

Interesting, using the linear model for 2k elo and above, playing first is worth about 11 ELO rating points, or the white pieces should be below 11 elo points compared to the black pieces in order to win 50% of the time. There is about a 10 ELO difference when looking at players above 2k elo compared to looking at all players, showing that the advantage white has may not be as significant as skill improves. This makes sense with known chess data that going first usually is a small advantage but it's nice to try to put some ELO numbers behind this analysis. Next time you're playing chess with someone as black and they win make sure to bring up this 10 - 20 point ELO difference (but only if you lose).



If you enjoyed reading this analysis feel free to check out some of my other work in the data science space.