Machine Learning Prediction of S&P 500 Movements using QDA in R – DataDrivenInvestor

Quadratic Discriminant Analysis is a classification method in statistics and machine learning. It is similar to Linear Discriminant Analysis (LDA), but it assumes that the classes have different covariance matrices, whereas LDA assumes that the classes have the same covariance matrix. If you want to learn more about LDA, here is my previous article where I talk about it.

In QDA, the goal is to find a quadratic decision boundary that separates the classes in a given dataset. This boundary is based on the estimated means and covariance matrices of the classes. Moreover, QDA can be used for both binary and multiclass classification problems. It is often used in situations where the classes have nonlinear boundaries or where the classes have different variances.

In R, QDA can be performed using the qda() function in the MASS package. We will use it on the SMarket data, part of the ISLR2 library. The syntax is identical to that of lda(). In the context of the Smarket data, the QDA model is being used to predict whether the stock market will go up or down (represented by the Direction variable) based on the percentage returns for the previous two days (represented by the Lag1 and Lag2 variables). The QDA model estimates the covariance matrices for the up and down classes separately and uses them to calculate the probability of each observation belonging to each class. The observation is then assigned to the class with the highest probability.

library(MASS)   # provides qda()
library(ISLR2)  # provides the Smarket data

train <- (Smarket$Year < 2005)
Smarket.2005 <- Smarket[!train, ]
Direction.2005 <- Smarket$Direction[!train]

We first load the libraries and then split the data into a training subset (observations before 2005) and a test subset (observations from 2005), so that the model can be evaluated on data it has never seen.

Then we fit a QDA model to the training data (subset = train), using the qda function.
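The fitting code itself is not shown above; here is a minimal sketch consistent with the description, following the standard qda() syntax (the object name qda.fit is my choice):

```r
library(MASS)   # provides qda()
library(ISLR2)  # provides the Smarket data

train <- (Smarket$Year < 2005)

# Fit QDA on the pre-2005 observations, using Lag1 and Lag2 as predictors
qda.fit <- qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
qda.fit
```

Printing the fitted object produces the prior probabilities and group means shown below.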

# OUTPUT:
Prior probabilities of groups:
    Down       Up
0.491984 0.508016

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

We only use the Lag1 and Lag2 variables because they are the ones that seem to have the highest explanatory power (we discovered this in a previous article about logistic regression: basically, they are the ones with the smallest p-values). Here is the article if you want to delve deeper into the topic:

The output contains the group means. But it does not contain the coefficients of the linear discriminants, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors.

Next, we make predictions on the test data using the predict function and calculate the confusion matrix and the classification accuracy.
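This step can be sketched as follows, assuming the fitted model is stored in qda.fit and the test objects created earlier:

```r
# Predict on the held-out 2005 observations
qda.pred <- predict(qda.fit, Smarket.2005)

# Confusion matrix: predicted class vs. actual Direction in 2005
table(qda.pred$class, Direction.2005)

# Overall classification accuracy
mean(qda.pred$class == Direction.2005)
```

predict() returns both the predicted class for each observation (qda.pred$class) and the posterior probabilities (qda.pred$posterior), which we will reuse later for the contour plot.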

mean(qda.pred$class == Direction.2005)
# OUTPUT:
[1] 0.599

The output of the table function shows the confusion matrix, and the output of the mean function shows the classification accuracy.

Interestingly, the QDA predictions are accurate almost 60% of the time, even though the 2005 data was not used to fit the model. This level of accuracy is quite impressive for stock market data, which is notoriously hard to model accurately. It suggests that the quadratic form assumed by QDA may capture the true relationship more accurately than the linear forms assumed by LDA and logistic regression. However, I would definitely recommend evaluating this method's performance on a larger test set before betting that this approach will consistently beat the market!

We can create a scatterplot with contours to visualize the decision boundaries for the Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) models on the Smarket data.

len1 <- 80; len2 <- 80; delta <- 0.1
grid.X1 <- seq(from = min(Smarket$Lag1) - delta, to = max(Smarket$Lag1) + delta, length = len1)
grid.X2 <- seq(from = min(Smarket$Lag2) - delta, to = max(Smarket$Lag2) + delta, length = len2)
dataT <- expand.grid(Lag1 = grid.X1, Lag2 = grid.X2)

lda.pred <- predict(lda.fit, dataT)
zp <- lda.pred$posterior[, 2] - lda.pred$posterior[, 1]
contour(grid.X1, grid.X2, matrix(zp, nrow = len1),
        levels = 0, las = 1, drawlabels = FALSE, lwd = 1.5, add = TRUE, col = "violet")

qda.pred <- predict(qda.fit, dataT)
zp <- qda.pred$posterior[, 2] - qda.pred$posterior[, 1]
contour(grid.X1, grid.X2, matrix(zp, nrow = len1),
        levels = 0, las = 1, drawlabels = FALSE, lwd = 1.5, add = TRUE, col = "brown")

The first two lines of code create a color indicator variable for the Direction variable based on whether it is Up or Down in the training data. The plot function is then used to create a scatterplot of the Lag2 variable against the Lag1 variable, with points colored according to the color indicator variable.
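The scatterplot code itself does not appear above; a sketch consistent with the description might look like this (the specific colors and plotting symbol are my assumptions):

```r
# Color indicator: one color for Up days, another for Down days (training data)
col.ind <- ifelse(Smarket$Direction[train] == "Up", "blue", "red")

# Scatterplot of Lag2 against Lag1, colored by market direction
plot(Smarket$Lag1[train], Smarket$Lag2[train], col = col.ind,
     xlab = "Lag1", ylab = "Lag2", pch = 20)
```

The contour() calls that follow use add = TRUE, so they draw the decision boundaries on top of this existing scatterplot.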

The next four lines of code define a grid of points to be used for generating the contours. The expand.grid function creates a data frame with all possible combinations of Lag1 and Lag2 values within the specified grid range.

The subsequent chunks of code use the predict function to generate the predicted class probabilities for each point in the grid, for both the LDA and QDA models. The contour function is then used to create a contour plot of the decision boundaries for each model, with the levels set to 0 to show the decision boundary between the two classes. The LDA contour is colored violet, while the QDA contour is colored brown.

Thank you for reading the article. If you enjoyed it, please consider following me.

