Succeeding in Data Science — Chapter 3 Theory


Data science is a vast field. It has several topics, concepts, and methods. Call them tools.

A common mistake is to try to learn the tools first and forget that we are here to learn to solve problems. The tools are only a means to solve them.

Imagine you want to become a car mechanic. Would you want to start with learning the origins of a wrench, or first understand a car problem and learn how to fix it? Possibly using the wrench and/or other tools?

Similarly, in data science look at the problem first. Tools must be learned as needed. Become a master of solving problems. And you’ll automatically become a master of the tools. However, the reverse is not true!

Figure 1. Learning triangle.

Figure 1 succinctly states the above. The starting point is a problem. Then the needed tools. Finally, a solution using the tools to solve the problem.

As the figure shows, the problem falls in the smallest circle. It is because we solve a defined problem. It is finite. Therefore, it is good to start from here.

The tool set, on the other hand, shown in a large circle, is massive. Imagine starting your learning from this circle. It is so easy to get lost and never come out of it. Sometimes we try to come out by forcing a few tools we learned on a problem. Obviously, this takes us nowhere.

How to learn?

Learning data science is just like learning a new language. Become comfortable with misunderstanding it at the beginning. The beginning is always difficult. But you’ll start to understand!

Importantly, learning takes time. And it should take time. Accepting this makes us patient. Patience is the first rule of learning.

Figure 1. Data Science Wheel.

With that in mind, look at the Data Science Wheel in Figure 1. The wheel isn’t an exhaustive list of data science tools. Instead, it is meant to provide an initial learning path.

The wheel has the problem at its core. To solve a problem, we first understand whether it is supervised or unsupervised. Thereafter, we choose a suitable method present in the next rim, e.g., linear regression. We then go to the higher rims to understand, diagnose, and advance a solution.

For example, the concept of independent and identically distributed (iid) errors is the fundamental assumption behind linear regression. Violation of this assumption might show up as poor performance during diagnosis. Weighted regression, for example, can address this issue.

In the following, we will go through an example of how to start from a problem and learn from the wheel. In this example, you’ll notice the language is different from our usual. It is intentional. Every field has its language. For example, attorneys talk differently. We find their conversation alien, but for them it is convenient. Similarly, data science has its language for convenience. It might be off-putting at the beginning, but you’ll soon appreciate the convenience it brings.


Figure 2. An illustrative example of learning data science tools by keeping the problem at its core.

Suppose the CEO of a retail company wants to boost its revenue through ad promotions. She wants to know how much of an increase in ad expense is required to boost the revenue by 20%. She asks you to give her an answer.

You must focus on solving the problem and see what you need to learn. During the learning process, you will go back-and-forth to some of the tools. Going back-and-forth is an essential part of learning and is, hence, presented here.


How much ad expense increase is required by a retailer to increase the revenue by 20%?


The ad expense is the independent variable (x) whose effect on the dependent variable (y), revenue, needs to be assessed. Since there are both x and y, it is a supervised problem.

Linear regression

The purpose is to determine the relationship between ad expense (x) and revenue (y). The relationship should explain the effect of x on y. Using this, we can tell the required increase in x to achieve a 20% increase in y. Linear regression is a good choice to estimate the relationship.

Figure 3. Regression line.

The relationship is estimated by fitting a regression line denoted as,

y = b0 + b1x + ε

The line is a model. A model is an expression or a system that represents a pattern in data. The pattern is used for estimations, predictions, and/or inferences.

Here, the model has three components, viz.,

b0: The baseline revenue when there is no ad expense (x), i.e., the expected value of y when x = 0.

b1: The estimated change in y for a unit change in x, i.e., b1 = dy/dx. This is our parameter of interest for the CEO’s problem.

ε: The error in the model. We minimize the model error to estimate its parameters b0 and b1.

OLS (Ordinary Least Squares)

Ordinary least squares (OLS) is an approach to estimate a model’s parameters. It works by minimizing the sum of the squares of the errors.

(\hat{b}_0, \hat{b}_1) = arg min_{b0, b1} Σ_{i=1}^{n} (yi - (b0 + b1xi))²

This OLS objective function is interpreted as: return the values (arg) of b0 and b1 that minimize (min) the summation of the squares of the errors (yi - (b0 + b1xi)), where i = 1, …, n.
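To make the objective concrete, here is a minimal sketch in Python with NumPy. The data is synthetic and the numbers (a baseline of 50 and a slope of 3) are assumptions made up for illustration. For a single x, the minimizer has a well-known closed form, which the hypothetical helper `fit_simple_ols` uses.

```python
import numpy as np

# Synthetic data (assumed for illustration): revenue = 50 + 3 * ad expense + noise.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(10, 100, size=n)          # ad expense
y = 50 + 3 * x + rng.normal(0, 5, size=n)  # revenue


def fit_simple_ols(x, y):
    """Return (b0, b1) that minimize the sum of squared errors."""
    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0_hat, b1_hat = fit_simple_ols(x, y)

# The quantity that OLS minimizes, evaluated at the fitted parameters.
sse = np.sum((y - (b0_hat + b1_hat * x)) ** 2)
print(b0_hat, b1_hat, sse)
```

Nudging either parameter away from the fitted values can only increase `sse`, which is exactly what "arg min" means.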

A question: why sum of squares? Answer:

  • Sum: Estimating a model means estimating its parameters. The parameters b0 and b1 that yield the smallest overall error are the best fit. The overall error is the sum of errors.
  • Square: The errors can be positive or negative, so in a plain sum they nullify each other. In fact, the sum of errors of a best-fit model is close to zero. Instead, we want to minimize the overall magnitude of error. The square preserves the magnitude and prevents the nullification. Still, why square? Why not the sum of the absolute values or the fourth powers of the errors? Both of them would do the same. Good question. It is discussed in the comments.
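The nullification is easy to see numerically. The sketch below uses synthetic data (an assumption for illustration) and NumPy's `np.polyfit` to fit a line, then compares the plain sum of errors with the sum of squares and the sum of absolute values.

```python
import numpy as np

# Synthetic data, assumed for illustration only.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=100)

# np.polyfit with deg=1 returns the least-squares slope and intercept.
b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

signed_sum = residuals.sum()             # positives and negatives cancel
sum_of_squares = np.sum(residuals ** 2)  # magnitude is preserved
sum_of_abs = np.sum(np.abs(residuals))   # magnitude is preserved here too

print(signed_sum, sum_of_squares, sum_of_abs)
```

The signed sum is zero up to floating-point error, so it says nothing about how far the points are from the line. The other two sums do.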

Multiple regression

There could be additional factors at a retailer that affect the revenue. For example, the holidays. Sales are generally higher during holidays such as Black Friday or Christmas. Ignoring them might result in estimating a spurious relation between ad expense and revenue. Similarly, there is a weather effect. For instance, snowfall results in higher shovel sales.

These factors become additional x’s in our model. Now we have three variables, ad expense (x1), holidays (x2), and weather (x3).


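A multiple regression model can be succinctly expressed as,

y = Xb + ε

where y is the n×1 vector of revenues, X is the n×(p+1) matrix holding a column of ones for the intercept and the p = 3 variables, b is the (p+1)×1 vector of coefficients, and ε is the n×1 vector of errors.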

Notation conventions:

Matrix and vector operations strictly follow certain rules. For instance,

Multiplication: Look at the Xb multiplication. Here, the number of columns in X must be equal to the number of rows in b. Here they are both equal to 4 (the intercept plus the three variables). The result of Xb will be an n×1 vector.

Sum: Look at Xb + ε. Both are n×1 sized matrices (vectors to be exact). Therefore, they can be added.

Matrix compatibility is a must. It is not optional. For example, bX is incorrect! This comes in handy while verifying any matrix algebraic expression.
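NumPy enforces the same rules, which makes it a handy way to check an expression. A small sketch with made-up values (n = 5 observations, 4 coefficients):

```python
import numpy as np

n = 5
X = np.ones((n, 4))                          # n x 4
b = np.array([[1.0], [2.0], [3.0], [4.0]])   # 4 x 1
eps = np.zeros((n, 1))                       # n x 1

# (n x 4) times (4 x 1) gives (n x 1), which can be added to eps.
y = X @ b + eps
print(y.shape)  # (5, 1)

# (4 x 1) times (n x 4) is incompatible, so NumPy refuses it.
try:
    b @ X
    incompatible = False
except ValueError:
    incompatible = True
print(incompatible)  # True
```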


The error ε is a random variable. A random variable is unknown but has some defined characteristics. It is characterized in linear regression as,

E[ε] = 0
cov(ε) = σ²I

These expressions mean,

  • The expected value of the error is zero. And,
  • The variances of the errors for the observations i = 1, …, n are all equal to σ². Also, the covariance between the errors of any two different observations is zero, i.e., cov(εi, εj) = 0 for i ≠ j.


In the characterization of ε, its expectation is 0. That means,

E[ε] = E[y] - E[Xb] = 0

If E[ε]≠0, the model will be biased, i.e., on average the predictions Xb won’t be equal to the actuals y. This ε characteristic is used to estimate the best fitting model.

Since y and X are given (known), E[y] = y and E[Xb] = XE[b]. We denote

\hat{b} = E[b]

and try to estimate the expected value of b from y = X\hat{b}.

Here, I cannot perform a normal arithmetic operation \hat{b} = y/X because we are working with matrices. The analogous operation is \hat{b} = X^{-1} y. But this is still incorrect because X isn’t a square matrix and, therefore, X^{-1} does not exist. Instead, we multiply the transpose of X on both sides and invert the resultant (p+1)×(p+1) square matrix.

X^T y = X^T X \hat{b}

\hat{b} = (X^T X)^{-1} X^T y
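The estimate \hat{b} = (X^T X)^{-1} X^T y takes only a couple of lines in NumPy. The sketch below uses synthetic data. The coefficient values and the way holidays and weather are simulated are assumptions for illustration. Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly, and `np.linalg.lstsq` serves as a cross-check.

```python
import numpy as np

# Synthetic data, assumed for illustration.
rng = np.random.default_rng(2)
n = 300
ad = rng.uniform(10, 100, n)                   # x1: ad expense
holiday = rng.integers(0, 2, n).astype(float)  # x2: 1 on a holiday, else 0
weather = rng.normal(0, 1, n)                  # x3: a weather index

# Design matrix: a column of ones for the intercept, then the three variables.
X = np.column_stack([np.ones(n), ad, holiday, weather])  # n x 4
b_true = np.array([50.0, 3.0, 20.0, -4.0])
y = X @ b_true + rng.normal(0, 5, n)

# Multiply both sides by X^T, then solve the square system (X^T X) b = X^T y.
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b_hat)
print(b_lstsq)
```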


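The estimate for b is an estimate after all. Therefore, it will have some uncertainty. The uncertainty is measured with its covariance as,

cov(\hat{b}) = (X^T X)^{-1} X^T cov(y) X (X^T X)^{-1} = σ² (X^T X)^{-1}

The variance of y is σ², which is the same as that of ε. Why? Discussed in comments. For now, we can estimate σ² as,

\hat{σ}² = (y - X\hat{b})^T (y - X\hat{b}) / (n - (p+1))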

Why is (p+1) subtracted from n? Discussed in comments.
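As a rough sketch, the uncertainty of the estimate can be computed as follows. The data is synthetic, with an assumed true σ of 2, and the names `sigma2_hat` and `cov_b_hat` are introduced here only for illustration.

```python
import numpy as np

# Synthetic data, assumed for illustration.
rng = np.random.default_rng(3)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # n x (p+1)
b_true = np.array([1.0, 2.0, -1.0, 0.5])
sigma = 2.0
y = X @ b_true + rng.normal(0, sigma, n)

XtX_inv = np.linalg.inv(X.T @ X)
b_hat = XtX_inv @ X.T @ y

# Estimate sigma^2 from the residuals, dividing by n - (p + 1).
residuals = y - X @ b_hat
sigma2_hat = residuals @ residuals / (n - (p + 1))

# Covariance of the coefficient estimates: sigma^2 (X^T X)^{-1}.
cov_b_hat = sigma2_hat * XtX_inv
std_errors = np.sqrt(np.diag(cov_b_hat))
print(sigma2_hat)
print(std_errors)
```

The square roots of the diagonal of `cov_b_hat` are the standard errors that regression software reports next to each coefficient.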

iid (Independent and identically distributed)

The independent and identically distributed (iid) assumption is on the errors εi. It is explained as follows.

Identical: The errors are assumed to share the same distribution, with expectation equal to 0 and variance equal to σ², as shown earlier in Statistics. This property of equal variance is called homoskedasticity.

Consider a simple scenario when the observations do not have identical distributions. As shown in Figure 4 (right), suppose the variance is nonidentical and increasing with x. It means the revenue (y) can fluctuate in a wide range as the ad expense (x) increases. Still, the regression line in red is the same in both. This regression line could be misleading because in reality the revenue can only marginally increase for a larger ad expense due to high variance. Therefore, accounting for the nonidentical variance, also called heteroskedasticity, becomes essential.

Figure 4. Illustration of iid vs non-iid data. The left chart is iid with identical variance, also called homoskedasticity. The variance increases with x in the right chart, showing heteroskedasticity.

Independent: The assumption is that the correlation between any two observations i, j is zero, i.e., cov(εi, εj) = 0 for all i, j with i ≠ j. This means the observations are independent. Said differently, the observations are unrelated.

But this assumption is invalid in some real-world problems. For example, my purchase of a book influences my friend’s purchase. He might buy the same book because I did (or not buy it for the same reason). Due to this, our data points are not independent. Instead, they are related.

Ignoring the dependence between observations when it actually exists causes inaccurate model estimation.

Weighted regression

The cov(ε) denoted as Ω under iid and non iid scenario is shown in Figure 5. The iid data has a diagonal Ω with equal variance. On the other hand, non iid data has non-zero off diagonal elements due to dependence and unequal variances due to them being nonidentical. Weighted regression can be used in such a scenario.

Figure 5. Illustrating the difference between covariance of errors in iid (left) and non-iid (right) data.

Weighted regression is an algebraic approach to incorporate non-constant variance and/or dependence of observations in model estimation. It is shown as follows.


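We start with scaling our model by dividing its expression by sqrt(Ω), i.e., by multiplying every term with Ω^{-1/2}.

Ω^{-1/2} y = Ω^{-1/2} X b + Ω^{-1/2} ε, written as y’ = X’b + ε’

After the scaling, the covariance of ε’ = Ω^{-1/2} ε is

cov(ε’) = Ω^{-1/2} cov(ε) Ω^{-1/2} = Ω^{-1/2} Ω Ω^{-1/2} = I

Therefore, the re-expressed (scaled) model has iid errors. Now, we can derive the expression for b by reusing the OLS estimate.

\hat{b} = (X’^T X’)^{-1} X’^T y’ = (X^T Ω^{-1} X)^{-1} X^T Ω^{-1} y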
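Here is a minimal sketch of the scaling trick for the heteroskedastic case, where Ω is diagonal. It assumes Ω is known, which is rarely true in practice (it usually has to be estimated). The data is synthetic, with an error standard deviation that grows with x.

```python
import numpy as np

# Synthetic heteroskedastic data, assumed for illustration.
rng = np.random.default_rng(4)
n = 400
x = rng.uniform(1, 10, n)
X = np.column_stack([np.ones(n), x])
b_true = np.array([5.0, 2.0])
sd = 0.5 * x                          # error spread increases with x
y = X @ b_true + rng.normal(0, sd)

# Omega is diagonal here, so Omega^{-1/2} is just 1 / sd per observation.
omega_diag = sd ** 2                  # assumed known for this sketch
scale = 1.0 / np.sqrt(omega_diag)
X_scaled = X * scale[:, None]
y_scaled = y * scale

# Reuse the OLS estimate on the scaled model.
b_wls = np.linalg.solve(X_scaled.T @ X_scaled, X_scaled.T @ y_scaled)

# Same estimate written directly as (X^T Omega^{-1} X)^{-1} X^T Omega^{-1} y.
Omega_inv = np.diag(1.0 / omega_diag)
b_direct = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
print(b_wls)
print(b_direct)
```

Both routes give the same coefficients. The scaled version simply gives less weight to the noisy observations at large x.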


A phenomenon of the predictors, x’s, being linearly dependent is multicollinearity. For example, in our model we considered including holidays (x2) and weather (x3).

A holiday always falls on the same time of a year. The climate during this time is, therefore, expected to be the same. The weather will, therefore, be similar (with some fluctuations) on a holiday every year.

Consequently, x2 and x3 are collinear. Due to their collinearity, their coefficients b2 and b3 can be severely inflated. Why? Because they can compensate for each other’s inflation. For example, if b2 is estimated as 10 times its true value, b3 can shrink, say to 1/10th of its true value, to compensate. b2 and b3 can play this game in a wide range, resulting in a high estimate variance.

A high estimate variance means the b2 and b3 estimates can be extremely different if the model is trained on another data set. Basically, their estimates will vary from data set to data set and are, therefore, unreliable.
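A small simulation makes the "data set to data set" variation visible. The sketch below refits the same model on many synthetic data sets, once with unrelated x2 and x3 and once with x3 almost a copy of x2. The function name `coef_spread` and all numbers are assumptions for illustration.

```python
import numpy as np


def coef_spread(rho, n_datasets=200, n=100, seed=5):
    """Std. dev. of the OLS estimates across simulated data sets.

    rho is the correlation between x2 and x3.
    """
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_datasets):
        x2 = rng.normal(size=n)
        x3 = rho * x2 + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
        X = np.column_stack([np.ones(n), x2, x3])
        y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=n)
        b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates.append(b_hat)
    return np.std(estimates, axis=0)


spread_independent = coef_spread(rho=0.0)
spread_collinear = coef_spread(rho=0.99)
print(spread_independent)  # spread of (intercept, b2, b3) estimates
print(spread_collinear)
```

The intercept is estimated about equally well in both cases, while the spread of the b2 and b3 estimates grows several times larger under collinearity.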


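Mathematically, multicollinearity causes the high variance in coefficient estimates because,

cov(\hat{b}) = σ² (X^T X)^{-1}

and when the columns of X are nearly linearly dependent, X^T X is nearly singular (its determinant approaches zero), so the entries of its inverse become very large.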

The variance inflation is resolved using regularization. Regularization aims at regularizing the coefficient estimates to prevent their variance inflation.

There are two commonly used regularization methods:

  1. Ridge (L2) regularization

A regularization term is added in the OLS objective function to constrict the values of coefficient estimates. The ridge objective, therefore, has a constricting component called regularizer.

\hat{b} = arg min_b Σ_{i=1}^{n} (yi - xi^T b)² + λ Σ_j bj²

where xi^T is the i-th row of X and λ ≥ 0 controls the strength of the regularizer.

Using matrix algebra, the closed-form solution is

\hat{b} = (X^T X + λI)^{-1} X^T y

The term λI added to X^TX “stabilizes” the matrix inverse. The determinant of the matrix being inverted no longer goes to zero in the presence of multicollinearity. This prevents the variance inflation. Also, the coefficients become regularized, i.e., their estimates aren’t arbitrarily large values.


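The solution for ridge is derived using matrix algebra. To get there, we express the OLS objective function in the form of matrix operations.

Σ_{i=1}^{n} εi² = ε^T ε = (y - Xb)^T (y - Xb)

We know the solution of OLS. So if we could express the ridge objective in the form of OLS, we have our solution. Augmenting X and y does this trick.

X’ = [X; sqrt(λ) I], i.e., X with sqrt(λ) I stacked below it
y’ = [y; 0], i.e., y with a vector of zeros stacked below it

The ridge objective can be expressed as an OLS using the augmented X’ and y’.

(y’ - X’b)^T (y’ - X’b) = (y - Xb)^T (y - Xb) + λ b^T b

Therefore, the solution is straightforward,

\hat{b} = (X’^T X’)^{-1} X’^T y’ = (X^T X + λI)^{-1} X^T y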
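The augmentation trick can be checked numerically. The sketch below computes the ridge estimate (X^T X + λI)^{-1} X^T y directly and again as plain least squares on X with sqrt(λ)I stacked below it and y with zeros stacked below it. The data is synthetic with two nearly identical columns, λ = 1 is an arbitrary choice, and the intercept is left out to keep the sketch short (a simplifying assumption).

```python
import numpy as np

# Synthetic data with two nearly collinear columns, assumed for illustration.
rng = np.random.default_rng(6)
n, k = 100, 3
x2 = rng.normal(size=n)
X = np.column_stack([rng.normal(size=n), x2, x2 + 0.01 * rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=n)
lam = 1.0  # regularization strength (arbitrary for this sketch)

# Closed-form ridge solution.
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Same solution through OLS on the augmented data.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(k)])
y_aug = np.concatenate([y, np.zeros(k)])
b_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

# The added lam * I also makes the matrix to invert far better conditioned.
cond_ols = np.linalg.cond(X.T @ X)
cond_ridge = np.linalg.cond(X.T @ X + lam * np.eye(k))
print(b_ridge, b_aug)
print(cond_ols, cond_ridge)
```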

  2. Lasso (L1) regularization

Lasso is an acronym for “least absolute shrinkage and selection operator.” But it is used like a common word “lasso” due to its relevant definition: a rope with a noose to catch and tighten an errant animal.

Similarly, lasso estimates the coefficients while tightening them. It differs from Ridge because it also acts as a variable selector by making some bj’s equal to 0. This means the corresponding xj is dropped out of the model and, hence, only the remaining x’s are selected.

The objective function is

\hat{b} = arg min_b (y - Xb)^T (y - Xb) + λ Σ_j |bj|

Gradient descent

A closed-form solution for lasso does not exist. Therefore, proximal gradient descent can be used (other algorithms such as coordinate-wise gradient descent, grafting, and e-boosting are also used).

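The gradient descent method works by iteratively estimating b until convergence.

b^{(k+1)} = b^{(k)} - η ∂f(b^{(k)})/∂b

where η > 0 is the step size and f is the function being minimized.

The derivative in a gradient descent method can be computed only if f is differentiable with respect to b. In proximal gradient descent method, f is the OLS component in the lasso’s objective function, i.e.,

f(b) = (y - Xb)^T (y - Xb)

Its partial derivative is,

∂f/∂b = -2 X^T (y - Xb)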
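Below is a minimal sketch of proximal gradient descent for the lasso objective (y - Xb)^T (y - Xb) + λ Σ|bj|. Each iteration takes a gradient step on the differentiable OLS part, using the gradient -2 X^T (y - Xb), and then applies soft-thresholding, which is the proximal step for the absolute-value penalty. The fixed step size, the iteration count, λ = 100, and the synthetic data are all assumptions for illustration. The intercept is omitted for brevity.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink each entry of z toward zero by t; entries within t become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_proximal_gd(X, y, lam, n_iter=1000):
    """Minimize (y - Xb)^T (y - Xb) + lam * sum(|b_j|) by proximal gradient."""
    b = np.zeros(X.shape[1])
    # Step size 1/L, where L = 2 * (largest singular value of X)^2 bounds
    # how fast the gradient of the OLS part can change.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ b)                   # gradient of OLS part
        b = soft_threshold(b - step * grad, step * lam)  # proximal step
    return b


# Synthetic data: only the first and fourth variables matter.
rng = np.random.default_rng(7)
n = 200
X = rng.normal(size=(n, 5))
b_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ b_true + rng.normal(0, 0.5, n)

b_lasso = lasso_proximal_gd(X, y, lam=100.0)
print(b_lasso)
```

The soft-thresholding step is what produces exact zeros, which is how lasso ends up acting as a variable selector.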


The lasso approach performs regularization and variable selection. A simpler approach for variable selection is stepwise regression.

In this approach, a variable is added or dropped one-by-one. The variable is added if its coefficient bj is statistically significant. Otherwise, it is dropped. At each step, OLS is used for estimation. This procedure is called forward stepwise. We can also do backward stepwise which is the reverse.

Stepwise addresses multicollinearity when the number of variables isn’t too large. For high-dimensional problems, regularization methods work better.
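One possible sketch of forward stepwise is given below. It is a common variant, not necessarily the exact procedure described above: at each step it adds the candidate whose coefficient has the largest |t| statistic, and it stops when no candidate exceeds a threshold. The threshold of 2 is a rough stand-in for significance at about the 5% level, and the helper names and synthetic data are assumptions for illustration.

```python
import numpy as np


def t_statistics(X, y):
    """OLS t statistics (estimate / standard error) for each column of X."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b_hat = XtX_inv @ X.T @ y
    resid = y - X @ b_hat
    sigma2_hat = resid @ resid / (n - k)
    return b_hat / np.sqrt(sigma2_hat * np.diag(XtX_inv))


def forward_stepwise(X, y, t_threshold=2.0):
    """Return the indices of the selected columns, in order of selection."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    while remaining:
        best_j, best_t = None, t_threshold
        for j in remaining:
            design = np.column_stack([np.ones(n), X[:, selected + [j]]])
            t_j = abs(t_statistics(design, y)[-1])  # t of the candidate
            if t_j > best_t:
                best_j, best_t = j, t_j
        if best_j is None:  # no remaining candidate is significant
            break
        selected.append(best_j)
        remaining.remove(best_j)
    return selected


# Synthetic data: weather is correlated with holidays but has no own effect.
rng = np.random.default_rng(8)
n = 300
ad = rng.uniform(10, 100, n)
holiday = rng.integers(0, 2, n).astype(float)
weather = holiday + 0.5 * rng.normal(size=n)
X = np.column_stack([ad, holiday, weather])
y = 50 + 3 * ad + 20 * holiday + rng.normal(0, 5, n)

selected = forward_stepwise(X, y)
print(selected)  # column indices: 0 = ad, 1 = holiday, 2 = weather
```

In this setup, ad expense and holidays enter first. Weather is usually left out because, once holidays are in the model, it has little left to explain.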


From the tools we learned, I will start with the simplest approach. I will perform multiple linear regression with forward stepwise for variable selection as we have only a few variables.

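Since x3 is collinear with x2, it is likely to be dropped by forward stepwise. Suppose the resulting estimated model is,

\hat{y} = \hat{b}_0 + \hat{b}_1 x1 + \hat{b}_2 x2

The parameter of my interest is the b1 estimate because,

\hat{b}_1 = ∂y/∂x1

i.e., it is the estimated change in revenue for a unit change in ad expense.

From this, I will estimate the percent increase in ad expense x to achieve a 20% increase in revenue y by also looking at their current values, xt and yt, as follows,

Δy = 0.2 yt
Δx = Δy / \hat{b}_1
percent increase in x = 100 × Δx / xt = 20 yt / (\hat{b}_1 xt)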
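The final arithmetic fits in a tiny function. The numbers in the usage line (a slope of 3, a current ad expense of 100, and a current revenue of 1000) are made up for illustration, and the function name is hypothetical.

```python
def required_ad_increase_pct(b1_hat, x_current, y_current, revenue_lift=0.20):
    """Percent increase in ad expense needed for the given revenue lift."""
    delta_y = revenue_lift * y_current   # revenue we need to add
    delta_x = delta_y / b1_hat           # extra ad expense that buys it
    return 100.0 * delta_x / x_current


# Hypothetical values: each extra unit of ad expense adds 3 units of revenue.
pct = required_ad_increase_pct(b1_hat=3.0, x_current=100.0, y_current=1000.0)
print(pct)  # about 66.7
```

With these made-up values, a 20% revenue lift means 200 more in revenue, which costs about 66.7 more in ad expense, i.e., roughly a 66.7% increase.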


We went back and forth between the tools. It might appear haphazard. It conflicts with a belief of sequentially going over topics or chapters. But contrary to this belief, natural learning is haphazard.

Despite being haphazard, the dust settles when you reach a solution. You’ll realize you achieved the goal of solving a problem. And, learned several tools along the way without getting lost.

Q1. I am a beginner. How do I know which tools to learn?

Read a textbook. But only cursorily. Get acquainted with the tools without getting into the details. For example, when I read a textbook chapter or a research paper, I only look at what they can solve.

The topics there come back to me when I am solving a problem. Then I know where to look for the details. It is like becoming aware that a wrench exists to tighten bolts. But learn about its size, strength, torque, etc., when the wrench is possibly useful in a problem.

Do not become anxious about not learning every tool detail early on. Your knowledge of the tools adds up with time as you solve more problems. Eventually, you become a master of problem solving and, in the process, of every tool.

Q2. I am a student. I am just starting to learn. I do not have any project or problem to solve. How to begin?

Actually you do. You are overlooking the problem given right in front of you in textbooks. Most textbooks explain topics using an exemplifying problem. For example, the textbook Elements of Statistical Learning uses the 1. Email Spam, 2. Prostate Cancer, 3. Handwritten Digit Recognition, and 4. DNA Expression Microarrays problems to elucidate the topics in its chapters.

But we tend to ignore the problem a chapter is solving. Instead, we focus on the topics. Perhaps because the topics are new to us. Also, sometimes the textbook problems appear trivial or unrelatable but the topics are glorified. So, we naturally gravitate towards learning the topics.

For example, the ad expense and revenue problem appears ordinary. I can easily overlook it to focus only on a supervised learning approach: linear regression.

Sometimes our excitement also makes us skip a textbook problem. It is common because our mind is entering a new territory. We want to become knowledgeable quickly. Dwelling on an unrelatable problem appears as an impediment.

This is where we go wrong! Instead of becoming knowledgeable, we get exhausted. And, we end up neither learning the tools, nor solving problems. A pursuit that began with vigor ends with a fatigued mind.

To stay on our pursuit, we must start with a problem. Get into tools’ details step-by-step. In the beginning, you will have learned only a few tools. Your solution, as a result, could be basic. But that is okay.

A basic solution is better than no solution! Also, you are a beginner. So it is okay.

But as you do this many times, you learn more. After a few iterations, you know many tools and their details. You will soon develop advanced solutions. Remember, iteration is the key!

Finally, one can very well break the data science wheel here to recommend a different order of learning. It is great if you do that. It means you have gotten good at learning. The idea is that you learn enough to make your own wheel!

Disclaimer: The post contains original content copyrighted to the author.

Director of Science at ProcessMiner | Book Author