50 Questions on Statistics & Machine Learning – Can you answer?
We have summarized 50 tricky and insightful questions on statistics and machine learning that should benefit you in two ways:
- a. Evaluating what you know and what you do not know.
- b. Preparing you to address similar questions in job interviews and to generalize your skills.
The purpose of this article is not to explain machine learning from scratch but to clarify the essential principles that recruiters frequently ask about in job interviews. Yes, these questions come to the rescue if you are planning to make your career in Machine Learning and Statistics.
Q1. What are the important skills to have in Python with regard to data analysis?
Answer: The following are some of the important skills to possess which will come in handy when performing data analysis using Python.
- a. Good understanding of the built-in data types especially lists, dictionaries, tuples, and sets.
- b. Mastery of N-dimensional NumPy Arrays.
- c. Mastery of Pandas data frames.
- d. Ability to perform element-wise vector and matrix operations on NumPy arrays.
- e. Knowing that you should use the Anaconda distribution and the conda package manager.
- f. Familiarity with Scikit-learn.
- g. Ability to write efficient list comprehensions instead of traditional for loops (a short sketch follows this list).
- h. Ability to write small, clean functions (important for any developer), preferably pure functions that don't alter objects, and knowing how to profile the performance of a Python script and optimize its bottlenecks.
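To illustrate points d and g, here is a minimal, illustrative sketch (the values are made up) of an element-wise NumPy operation and a list comprehension replacing an explicit loop:

```python
import numpy as np

# Point d: element-wise vector operations, no explicit Python loop needed
prices = np.array([10.0, 12.5, 9.8])
quantities = np.array([3, 1, 4])
revenue = prices * quantities        # element-wise multiplication
print(revenue.sum())                 # total revenue

# Point g: a list comprehension instead of a traditional for loop
squares = [x ** 2 for x in range(10) if x % 2 == 0]
print(squares)                       # [0, 4, 16, 36, 64]
```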
Q2. What is the real definition of Artificial Intelligence?
Answer: Over the years, the term 'Artificial Intelligence' (AI) has come to cover a complex set of technologies, and many different definitions of AI have been suggested. The most common, practical definition given by researchers and AI enthusiasts describes AI as "computer systems or machines used to analyze or process data which, unlike natural intelligence, are programmed to think like human beings and animals and also imitate their actions."
According to the European Union, “Artificial intelligence (AI) refers to systems that show intelligent behavior: by analyzing their environment they can perform various tasks with some degree of autonomy to achieve specific goals.”
Q3. Explain machine learning for 5-year-old children.
Answer: It is clear and simple. It is just like how babies learn to walk. Every time they fall down, they learn and understand (unconsciously) that their legs should be straight rather than bent. The next time they fall, they feel pain and cry, but they learn not to stand that way again and try harder to avoid the pain. They also take support from a door, a wall, or whatever is next to them, which helps them stand firm.
This is how a machine works and learns: by building intuition from its environment through trial and feedback.
NOTE: The interviewer is only testing whether you can describe complex ideas in simple terms.
Q4. You are given a data set. From the long list of machine learning algorithms, how do you determine which one to use?
Answer: You should say that the choice of machine learning algorithm depends solely on the type of data set. If the data set shows linearity, then linear regression is the right algorithm to use. If the data set is based on audio or images, a neural network will help you build a robust model.
If the data set contains non-linear interactions, then a boosting or bagging algorithm is a better choice. If the business requirement is an interpretable model, then instead of black-box algorithms such as SVM or GBM we can use a decision tree or a regression model, which is easy to explain and interpret.
In short, no single master algorithm exists for all cases. We have to be scrupulous enough to know which algorithm to use.
Q5. Do you think Deep Learning is Better than Machine Learning? If so, why?
Answer: Though traditional ML algorithms solve many of our cases, they are not useful when working with high-dimensional data, that is, where we have a large number of inputs and outputs. For example, in handwriting recognition we have a huge number of inputs, with different inputs associated with different styles of handwriting.
The second major challenge with traditional ML is that we have to tell the computer which features it should look for that will play an important role in predicting the outcome, and do so accurately; deep learning largely learns such feature representations on its own.
Q6. Consider a training data set with 1 million rows and 1000 columns given to you. The data set is based on a classification problem. Your manager has asked you to reduce the dimensionality of the data so that model computation time can be minimized. What would you do?
Note: Your Computer has limitations on memory.
Answer: It is a strenuous job to process high-dimensional data on a machine with limited memory, and your interviewer will be well aware of it. The strategies you may use to handle such conditions are as follows:
- a. We can close all other programs on our computer, including the web browser, so that more of the limited RAM can be put to use.
- b. We can randomly sample the data set. For example, we can construct a smaller data set of, say, 30,000 rows and 1,000 variables and do the computations on it.
- c. We should separate the numerical and categorical variables and eliminate the correlated ones to reduce dimensionality. For numerical variables we will use correlation; for categorical variables we will use the chi-square test.
- d. Also, PCA can be applied to pick the components that describe the maximum variance in the data set (a brief sketch follows this list).
- e. We may also apply our domain understanding to estimate which predictors influence the response variable. However, this is an intuitive approach, and failing to identify valuable predictors could result in a significant loss of information.
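A minimal sketch of strategies c and d, assuming a numeric pandas DataFrame of predictors (the random stand-in data, the 0.85 correlation threshold, and the 95% variance target are illustrative choices, not part of the question):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for the sampled numeric predictors
df = pd.DataFrame(np.random.rand(30000, 50))

# Strategy c: drop one variable from every highly correlated pair
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))   # upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.85).any()]
reduced = df.drop(columns=to_drop)

# Strategy d: PCA on standardized data, keeping enough components for ~95% of the variance
X = StandardScaler().fit_transform(reduced)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)
print(X_pca.shape, pca.explained_variance_ratio_.sum())
```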
Q7. If you have 4GB of RAM in your machine and you want to train your model on a 10GB data set, how would you go about this problem? Have you ever faced this kind of problem in your machine learning/data science experience so far?
Answer: First of all, you have to ask which ML model you want to train.
For Neural networks: batching with a memory-mapped NumPy array will work.
Steps involved:
- a. Load the whole data as a memory-mapped NumPy array (np.memmap). The memmap keeps the complete data set on disk and maps it into the address space, so it doesn't load the complete data set into memory.
- b. You can pass an index to the NumPy array to get the required slice of data.
- c. Pass this data to the neural network.
- d. Keep the batch size small (see the sketch after these steps).
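A minimal sketch of these steps with np.memmap; the file name, shapes, and the commented training call are assumptions for illustration only:

```python
import numpy as np

# Create a small on-disk array just for illustration (in practice the file already exists)
data = np.memmap("features.dat", dtype="float32", mode="w+", shape=(1000, 20))
data[:] = np.random.rand(1000, 20)
data.flush()

# Re-open read-only: the file is memory-mapped, not loaded into RAM
X = np.memmap("features.dat", dtype="float32", mode="r", shape=(1000, 20))

batch_size = 256
for start in range(0, X.shape[0], batch_size):
    X_batch = np.asarray(X[start:start + batch_size])   # only this slice is read into memory
    # model.train_on_batch(X_batch, ...)                # hypothetical call: feed the batch to your network
```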
For SVM: Partial fit will work
Steps involved:
- a. Divide the one big data set into small data sets.
- b. Use a partial-fit method, which trains on one subset of the complete data set at a time (see the sketch after these steps).
- c. Repeat step b for the other subsets.
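A minimal sketch of incremental training. Note that scikit-learn's kernel SVC has no partial_fit, so a linear SVM trained with SGDClassifier(loss="hinge") is used here as the usual out-of-core substitute (an assumption, not stated in the steps above), and the chunks are synthetic:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="hinge", random_state=0)   # linear SVM trained incrementally
classes = np.array([0, 1])                          # every class must be declared on the first call

rng = np.random.default_rng(0)
for _ in range(10):                                 # each iteration stands in for one small subset
    X_chunk = rng.normal(size=(500, 20))
    y_chunk = (X_chunk[:, 0] > 0).astype(int)
    clf.partial_fit(X_chunk, y_chunk, classes=classes)
```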
Q8. How do you select important variables while working on a data set? Explain your strategies.
Answer: Following are the variable selection strategies you can use:
- a. Before choosing important variables, eliminate all the correlated variables.
- b. Use Forward Selection, Backward Selection, and Stepwise Selection.
- c. Fit a linear regression and choose variables based on their p-values.
- d. Use Random Forest or XGBoost and plot a variable-importance chart (illustrated below).
- e. Use Lasso Regression.
Note: For the available set of features, calculate the information gain and pick the top n features accordingly.
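A minimal sketch of strategies d and e on synthetic data (the data, column names, and hyperparameters are illustrative assumptions):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV

# Synthetic numeric data standing in for the real data set
X, y = make_regression(n_samples=500, n_features=20, n_informative=5, random_state=0)
X = pd.DataFrame(X, columns=[f"var_{i}" for i in range(20)])

# Strategy d: variable importance from a tree ensemble
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
importance = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance.head(5))

# Strategy e: Lasso shrinks uninformative coefficients exactly to zero
lasso = LassoCV(cv=5).fit(X, y)
print(X.columns[lasso.coef_ != 0].tolist())
```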
Q9. In PCA, is rotation necessary? If yes, why? If you do not rotate the components, what will happen?
Answer: Yes, rotation (orthogonal rotation) is necessary in PCA because it maximizes the variance captured by each component. This makes the components easier to interpret.
Note: The purpose of doing PCA is to choose fewer components (than features) that can explain the maximum variance in the data set.
Rotation does not change the relative locations of the components; it only changes the actual coordinates of the points.
If we do not rotate the components, the effect of PCA will diminish, and we will have to choose more components to explain the variance in the data set.
Q10. A data set is given to you. It contains several variables, some of which are highly correlated, and you know which ones. Your manager has asked you to run PCA. Would you first eliminate the correlated variables? Why?
Answer: You may be tempted to say no, but that would be wrong. Discarding correlated variables has a significant influence on PCA, because in the presence of correlated variables the variance explained by a particular component is inflated.
For example: suppose a data set has 3 variables, 2 of which are correlated. If you run PCA on this data set, the first principal component would exhibit roughly twice the variance that it would exhibit with uncorrelated variables. Correlated variables also make PCA put greater emphasis on those variables, which is misleading.
Q11. A data set is given to you. Suppose there are missing values in the data set, spread within 1 standard deviation of the median. What percentage of the data would remain unaffected? Why?
Answer: This question gives you enough hints to start thinking. Since the data is spread around the median, let's assume it follows a normal distribution (where the mean, median, and mode coincide). We know that about 68 percent of the data in a normal distribution lies within 1 standard deviation of the mean (and hence the median), which leaves about 32 percent of the data outside that range. Thus, roughly 32 percent of the data would remain unaffected by the missing values.
Q12. A data set is given to you in which some variables have more than 30% missing values. Say 8 out of 50 variables have more than 30% of their values missing. How will you deal with them?
Answer: We can deal with them in the following ways:
- a. We may simply delete those variables.
- b. We can assign the missing values to a unique category; the pattern of missingness itself may carry useful information.
- c. We can sensibly check their distribution against the target variable and, if a trend is detected, keep those variables by adding a new 'missing' category, while deleting the others.
Q13. A data set on cancer detection is given to you. You've developed a classification model and achieved 96 percent accuracy. Why shouldn't you be happy with this performance? What would you do about it?
Answer: If you have worked with enough data sets, you can deduce that cancer detection produces imbalanced data. Accuracy cannot be used as the measure of performance of a classification model on an imbalanced data set, because 96 percent accuracy (as given) may merely reflect correct predictions on the majority class, while our class of interest is the minority class (4 percent), the individuals who actually have cancer. Therefore, to assess model output we should use sensitivity (true positive rate), specificity (true negative rate), and the F measure to evaluate the class-wise performance of the classifier. If the performance on the minority class is poor, we can take the following steps:
- a. To balance the data, we can use undersampling, oversampling, or SMOTE.
- b. We can also use anomaly detection.
- c. We can assign class weights so that the minority class gets a greater weight.
- d. We can adjust the prediction threshold by using probability calibration and finding an optimal threshold from the ROC curve and AUC (see the sketch after this list).
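A minimal sketch of steps c and d on synthetic imbalanced data (the data, the logistic regression model, and the Youden's J threshold rule are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~4% positive class) standing in for the cancer data set
X, y = make_classification(n_samples=5000, weights=[0.96, 0.04], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

# Step c: give the minority class a greater weight
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

# Step d: pick a prediction threshold from the ROC curve (here, Youden's J statistic)
probs = clf.predict_proba(X_val)[:, 1]
fpr, tpr, thresholds = roc_curve(y_val, probs)
best_threshold = thresholds[np.argmax(tpr - fpr)]
y_pred = (probs >= best_threshold).astype(int)
```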
Q14. You are working on a time-series data set. Your manager has asked you to build a high-accuracy model. You start with a decision-tree algorithm, since you know decision trees work reasonably well on all sorts of data. Later, you try a time-series regression model and get better accuracy than the decision tree. Can this happen? Why?
Answer: Time-series data is generally understood to have linearity. A decision-tree algorithm, on the other hand, is known to work better at detecting non-linear relationships. A decision tree cannot map a linear relationship as well as a regression model can, so it fails to provide robust predictions here. Therefore, provided the data set satisfies its linearity assumptions, a linear regression model can provide robust predictions.
Q15. During analysis, how do you treat missing values?
Answer: After identifying the variables with missing values, the extent of the missing values is assessed. If any patterns are identified, the analyst has to concentrate on them, as they could lead to interesting and meaningful business insights.
If no patterns are identified, the missing values can be substituted with the mean or median value (imputation) or simply ignored. Another option is assigning a default value, which can be the mean, minimum, or maximum value; getting to know the data is important.
If it is a categorical variable, a default category is assigned to the missing values. If the data follows a known distribution, for example a normal distribution, impute with the mean value.
If 80% of the values for a variable are missing, you can answer that you would drop the variable instead of treating the missing values.
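A minimal sketch of these treatments in pandas (the frame, column names, and the 80% threshold rule are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Small illustrative frame; column names and values are made up
df = pd.DataFrame({
    "age": [25, np.nan, 41, 37, np.nan],
    "city": ["Pune", None, "Delhi", None, "Mumbai"],
    "score": [np.nan, np.nan, np.nan, np.nan, 7.0],   # mostly missing
})

print(df.isna().mean())                                      # extent of missingness per variable
df = df.drop(columns=df.columns[df.isna().mean() >= 0.8])    # drop variables with ~80%+ missing

df["age"] = df["age"].fillna(df["age"].median())     # numeric: impute with the median (or mean)
df["city"] = df["city"].fillna("Unknown")            # categorical: assign a default category
```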
Q16. How do you think Google is training data for self-driving cars?
Answer: Google is currently using reCAPTCHA to source labeled data on storefronts and traffic signs. They are also building on training data collected by Sebastian Thrun at GoogleX, some of which was obtained by his grad students driving buggies on desert dunes! Of course, you probably don't know that. But have you ever wondered why, for the last two-odd years, the CAPTCHAs you solve on file-sharing websites increasingly involve identifying cars in tiny pixelated images, or recognizing fire hydrants, storefronts, bicycles, or buses? This is not random. It's by design, because Google wants you to identify these images so that its artificially intelligent systems can learn from your knowledge.
Q17. Why is Naive Bayes so ‘naive’?
Answer: Naive Bayes is so 'naive' because it assumes that all of the features in a data set are equally important and independent of each other. As we know, these assumptions are rarely true in real-world situations.
Q18. Explain prior probability, likelihood, and marginal likelihood in the context of the Naive Bayes algorithm.
Answer: The prior probability is nothing but the proportion of the dependent (binary) variable in the data set. It is the closest guess you can make about a class without any additional information. For example, suppose the dependent variable in a data set is binary (1 = spam, 0 = not spam), with 70% of messages labeled 1 (spam) and 30% labeled 0 (not spam). Then we would predict that any new email has a 70% chance of being spam.
The likelihood is the probability of observing a certain other variable given that an observation is classified as 1. For example, the probability that the word 'FREE' appears in previous spam messages is a likelihood. The marginal likelihood is the probability that the word 'FREE' appears in any message.
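A tiny numeric sketch of how these pieces combine through Bayes' rule; the 70% prior comes from the example above, while the two likelihood values are assumed purely for illustration:

```python
# Prior: 70% of messages in the data set are spam (from the example above)
p_spam = 0.70
# Likelihood: P("FREE" appears | spam) -- an assumed value for illustration
p_free_given_spam = 0.40
# Marginal likelihood: P("FREE" appears in any message) -- also assumed
p_free = 0.35

# Bayes' rule gives the posterior probability that a message containing "FREE" is spam
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 2))   # 0.8
```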
Q19. You are given a new assignment that involves helping a food-delivery company save money. The concern is that the company's delivery staff does not deliver food on time, and as a consequence their clients are unhappy. To keep them satisfied, the company ends up providing free meals. Which machine learning algorithm will save them?
Answer: You may have started running through the list of Machine Learning (ML) algorithms in your mind. Yeah, but, wait! Such questions are asked to test your grasp of the machine learning basics.
This is not a machine learning problem; it is a route-optimization problem. A machine learning problem consists of three things:
- a. You have data on it.
- b. Mathematically, you cannot solve it (even by writing exponential equations)
- c. There exists a pattern
To determine whether machine learning is the right approach for a specific problem, always look for the three factors mentioned above.
Q20. What is the difference between correlation and covariance?
Answer: Correlation is the standardized form of covariance. Covariances are difficult to compare: for example, if we calculate the covariance of age (years) and salary ($), we get a value that cannot be compared with other covariances because the two variables have unequal scales. In such a condition, we compute the correlation instead, which always lies between -1 and 1, regardless of the variables' scales.
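A minimal numeric sketch of the difference (the age and salary values are made up for illustration):

```python
import numpy as np

age = np.array([25, 32, 41, 50, 58])                            # years
salary = np.array([40_000, 52_000, 61_000, 80_000, 95_000])     # dollars

print(np.cov(age, salary)[0, 1])        # covariance: scale-dependent, hard to compare
print(np.corrcoef(age, salary)[0, 1])   # correlation: always between -1 and 1
```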
Q21. You have come to realize that your model has a low bias and high variance. To tackle it, which algorithm do you use? Why?
Answer: Low bias occurs when the model's predicted values are close to the actual values. In other words, the model is flexible enough to mimic the distribution of the training data. While this seems like a fantastic achievement, do not forget that a flexible model has no generalization capability: it produces disappointing results when evaluated on unseen data.
We may use bagging algorithms (such as random forests) in such cases to resolve the issue of high variance. A bagging algorithm divides the data set into subsets drawn by repeated randomized sampling. These samples are then used to generate a set of models with a single learning algorithm. Later, the predictions are combined by voting (classification) or averaging (regression).
In order to combat high variance, we can also:
- a. Use regularization, which penalizes the higher coefficients of the model and thereby reduces its complexity.
- b. Use the top n features from the variable-importance chart.
Q22. Is it possible to capture the correlation between categorical and continuous variables? If yes, then how?
Answer: Yes, we can use ANCOVA (analysis of covariance) to capture the association between categorical and continuous variables.
Q24. After investing many hours, you are ready to create a high-accuracy model. You build 5 GBM models, assuming a boosting algorithm will work its magic. Unfortunately, none of the models performs better than the benchmark score. Finally, you decide to combine those models. Although ensemble models are known to return high accuracy, you are out of luck. Where did you go wrong?
Answer: As we know, the foundation of ensemble learning is combining weak learners to produce a strong learner. But these learners deliver superior performance only when the combined models are uncorrelated. Since we used 5 GBM models with no increase in accuracy, it indicates that the models are correlated. The issue with correlated models is that they all provide essentially the same information.
For example: if model 1 classifies user1122 as 1, there is a high chance that models 2 and 3 would have done the same, even if its real value is 0. Ensemble learners are thus built on the premise that combining weak, uncorrelated models yields stronger predictions.
Q25. How is KNN different from Kmeans clustering?
Answer: Do not get fooled by the 'k' in their names. The fundamental difference between these two algorithms is that k-means clustering is unsupervised in nature, while k-nearest neighbors (KNN) is supervised. K-means is a clustering algorithm, whereas KNN is a classification (or regression) algorithm.
The k-means algorithm partitions the data set into clusters such that each cluster is homogeneous and the points within a cluster are similar to each other. The clusters carry no labels because the method is unsupervised; the algorithm aims to keep adequate separation between the clusters.
The KNN algorithm is a straightforward algorithm that stores the whole training data set. Whenever a prediction is needed for an unknown data instance, it looks for the k most similar instances across the training data and returns a prediction based on those neighbors. KNN is often used when searching for similar items. The algorithm essentially assumes that you are like your nearest neighbors.
Q26. We use Euclidean distance in k-means or KNN to measure the distance between the nearest neighbors. Why not Manhattan distance?
Answer: We do not use the Manhattan distance because it measures distance only horizontally and vertically, which restricts it dimensionally. The Euclidean metric, in contrast, can be used for distance measurement in any direction of the space. Since data points may lie in any dimension, Euclidean distance is the more feasible choice.
For example: consider a chessboard. Because of their restricted horizontal and vertical movements, the movement made by a rook or a bishop is described by Manhattan distance.
Q27. You developed a multiple regression model. The model's R² is not as good as you were hoping for: it improves from 0.3 to 0.8 when you remove the intercept term. Is this possible? How?
Answer: Yes, it is possible. We need to understand the role of the intercept term in a regression model. The intercept term represents the model's prediction without any independent variables, i.e., the mean prediction.
We know that,
R² = 1 – ∑(y – y´)²/∑(y – ymean)² …… (i)
Here, y´ is the predicted value and ymean is the mean of y. When the intercept term is present, R² evaluates your model with respect to this mean model. In the absence of an intercept term, the model makes no such evaluation: R² is instead computed against ∑y² (the denominator no longer subtracts ymean), and with this larger denominator the ratio ∑(y – y´)²/∑(y)² becomes smaller than it really is, so R² appears artificially higher.
Q28. After testing the model, your manager has informed you that your regression model suffers from multicollinearity. How would you verify whether this is true? Can you still build a better model without losing any information?
Answer: To verify multicollinearity, we can build a correlation matrix to identify and remove variables with a correlation above 75% (deciding on a threshold is subjective). Moreover, we can calculate the variance inflation factor (VIF): a VIF value <= 4 suggests no multicollinearity, while a VIF value >= 10 indicates severe multicollinearity. Tolerance can also be used as an indicator of multicollinearity.
But removing correlated variables could lead to information loss. To preserve those variables, we may use penalized regression models like ridge or lasso regression. We can also add some random noise to the correlated variables so that they appear distinct from each other; however, adding noise may affect prediction accuracy, so this technique should be used carefully.
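A minimal sketch of the VIF check using statsmodels, on synthetic predictors where one column is deliberately almost a copy of another (the data and threshold are illustrative):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Synthetic predictors where x3 is almost a copy of x1, so it should show a high VIF
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
X["x3"] = X["x1"] + rng.normal(scale=0.05, size=200)

X_const = add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif.drop("const"))   # VIF >= 10 flags severe multicollinearity
```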
Q29. When is Ridge regression favorable over Lasso regression?
Answer: According to the authors of ISLR, Hastie and Tibshirani, lasso regression is preferable in the presence of a few variables with medium-to-large effects, and ridge regression is preferable in the presence of many variables with small-to-medium effects.
Conceptually, lasso regression performs both parameter shrinkage and variable selection, whereas ridge regression performs only parameter shrinkage and ends up keeping all coefficients in the model. Ridge regression might be the better choice in the presence of correlated variables, and it also fits well in cases where the least-squares estimates have high variance. Ultimately, it depends on the objective of our model.
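A minimal sketch of the contrast on synthetic data (the data and the alpha values are illustrative assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: many predictors, only a few with real effects
X, y = make_regression(n_samples=200, n_features=50, n_informative=5, noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks all coefficients but keeps every variable
lasso = Lasso(alpha=1.0).fit(X, y)   # drives weak coefficients exactly to zero (variable selection)

print((ridge.coef_ != 0).sum(), "non-zero ridge coefficients")   # typically all 50
print((lasso.coef_ != 0).sum(), "non-zero lasso coefficients")   # typically far fewer
```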
Q30. The increase in the global average temperature has coincided with a decrease in the number of pirates globally. Does this mean that climate change is caused by the decrease in the number of pirates?
Answer: This question is based on the classic distinction between causation and correlation. No, we should not conclude that climate change is caused by a decrease in the number of pirates, because the relationship might be driven by other factors, such as lurking or confounding variables.
Thus, there might be a correlation between the number of pirates and the global average temperature, but we cannot claim from this information alone that pirates died out because of an increase in the global average temperature.
Q31. You have a data set where p (the number of variables) is greater than n (the number of observations). Why is OLS a bad choice for this data? Which techniques would be better? Why?
Answer: In such high-dimensional data sets, we cannot use classical regression techniques, because their estimates tend to fail. When p > n, a unique least-squares coefficient estimate can no longer be determined, so OLS cannot be used at all. To combat this condition, we can use penalized regression techniques such as lasso, LARS, or ridge, which shrink the coefficients to reduce variance. Specifically, ridge regression performs better in situations where the least-squares estimates have higher variance.
Note: Subset regression and forward stepwise regression are among other methods.
Q32. Running a binary classification tree algorithm is the easy part. Do you know how a tree split happens, i.e., how the tree determines which variable to split on at the root node and at the succeeding nodes?
Answer: A classification tree makes its decisions based on the Gini index and node entropy. In simple terms, the tree algorithm finds the best possible feature that divides the data set into the purest possible child nodes.
The Gini index says that if we randomly pick two items from a sample, they must be of the same class, and the probability of this is 1 if the sample is pure. We can measure Gini as follows (both measures are also sketched in code below):
- a. Calculate Gini for each sub-node as the sum of squares of the probabilities of success and failure (p^2 + q^2).
- b. Calculate the Gini of the split using the weighted Gini score of each child node.
Note: Entropy is another measure of impurity (for a binary class) and is calculated as:
Entropy = -p log2(p) - q log2(q)
Here, p and q are the probabilities of success and failure in that node. Entropy is zero when the node is homogeneous and maximal when both classes are present 50%-50% in a node. Lower entropy is advantageous.
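A minimal sketch of both node measures as small functions (the probabilities passed in are illustrative):

```python
import numpy as np

def gini(p):
    """Gini score of a binary node with success probability p (the p^2 + q^2 form above)."""
    q = 1 - p
    return p ** 2 + q ** 2

def entropy(p):
    """Entropy of a binary node: 0 when the node is pure, 1 when it is split 50/50."""
    if p in (0, 1):
        return 0.0
    q = 1 - p
    return -p * np.log2(p) - q * np.log2(q)

print(gini(0.5), entropy(0.5))   # 0.5 and 1.0: the least pure node
print(gini(1.0), entropy(1.0))   # 1.0 and 0.0: a perfectly pure node
```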
Q33. You have built a random forest model with 10,000 trees and were delighted to get a training error of 0.00. But the validation error is 34.23. What is going on? Haven't you trained your random forest model perfectly?
Answer: The model has overfitted. A training error of 0.00 indicates that the classifier has mimicked the training data patterns to such an extent that they are not present in the unseen data. When run on unseen data, the classifier could not find those patterns and returned a higher prediction error. This occurs when we use a greater number of trees than required in a random forest. Hence, to prevent this, we could use cross-validation to tune the number of trees.
Q34. Explain the convex hull. (Hint: Think SVM)
Answer: In the case of linearly separable data, the convex hull represents the outer boundary of each of the two sets of data points. Once the convex hulls are created, the Maximum Margin Hyperplane (MMH) is drawn as the perpendicular bisector between the two hulls. The MMH is the line that creates the greatest separation between the two groups.
Q35. What cross-validation strategy would you use on a time-series data set: LOOCV or k-fold?
Answer: Neither; k-fold and LOOCV cross-validation strategies are not suitable for time-series data.
K-fold can be problematic for time series because there could be a trend in year 4 or 5 that is not present in year 3. Resampling the data set would separate these patterns, and we might end up validating on past years, which is wrong. Instead, we can use a 5-fold forward-chaining strategy, as shown below:
- a. Fold 1: training [1], test [2]
- b. Fold 2: training [1 2], test [3]
- c. Fold 3: training [1 2 3], test [4]
- d. Fold 4: training [1 2 3 4], test [5]
- e. Fold 5: training [1 2 3 4 5], test [6]
Here 1, 2, 3, 4, 5, and 6 represent the years. (The sketch after this list produces the same folds with scikit-learn.)
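A minimal sketch of the same forward-chaining scheme using scikit-learn's TimeSeriesSplit (the six-point array stands in for six years of data):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)          # six "years" of data, numbered 1..6 below
tscv = TimeSeriesSplit(n_splits=5)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    print(f"Fold {fold}: training {train_idx + 1}, test {test_idx + 1}")
```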
Q36. 'People who bought this, also bought…'. Which algorithm produces the recommendations you see on Amazon?
Answer: The underlying principle of this type of recommendation engine is collaborative filtering. A collaborative-filtering algorithm considers 'user behavior' to recommend items: it exploits the behavior of other users and their items in terms of transaction history, ratings, selections, and purchase information. Other users' preferences and behavior toward items are used to recommend items to new users. With this kind of recommendation engine, the features of the items themselves are not used.
Q37. While working on a project, we realized that one-hot encoding increases the dimensionality of the data, but label encoding does not. How?
Answer: Do not get confused by this question. It simply asks you to explain the difference between one-hot encoding and label encoding.
One-hot encoding increases the dimensionality (the number of features) of a data set because it generates a new variable for each level present in a categorical variable. For example: let's assume we have a variable 'flower' with three levels: Rose, Lotus, and Sunflower. One-hot encoding the 'flower' variable creates three new 0/1 variables: flower.Rose, flower.Lotus, and flower.Sunflower.
In label encoding, the levels of a categorical variable are represented as numbers such as 0 and 1. For example, if a variable takes the values 'Yes' and 'No', then 'Yes' is represented as 1 and 'No' as 0. Label encoding is commonly used for binary variables, and the dimensionality does not increase.
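A minimal sketch of both encodings in pandas, using the 'flower' example above:

```python
import pandas as pd

df = pd.DataFrame({"flower": ["Rose", "Lotus", "Sunflower", "Rose"]})

# One-hot encoding: one new 0/1 column per level, so dimensionality grows
one_hot = pd.get_dummies(df, columns=["flower"])
print(list(one_hot.columns))   # ['flower_Lotus', 'flower_Rose', 'flower_Sunflower']

# Label encoding: levels become integer codes in a single column, so dimensionality is unchanged
df["flower_code"] = df["flower"].astype("category").cat.codes
print(df)
```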
Q38. Would you suggest that treating a categorical variable as a continuous variable leads to a better predictive model?
Answer: A categorical variable should be treated as continuous for stronger predictions only when the variable is ordinal in nature.
Q39. When is regularization needed in machine learning?
Answer: Regularization becomes necessary when the model starts to overfit. The technique adds a penalty term to the objective function, which pushes the coefficients of many variables toward zero and thereby reduces their contribution to the cost. This reduces the complexity of the model so that it generalizes (predicts on unseen data) better.
Q40. What do you think about the Type I vs Type II error?
Answer: A Type I error is committed when the null hypothesis is true and we reject it; it is also referred to as a false positive (FP). A Type II error is committed when the null hypothesis is false and we fail to reject it; it is also referred to as a false negative (FN).
In terms of the confusion matrix, a Type I error arises when we predict a value as positive (1) but it is actually negative (0). A Type II error arises when we predict a value as negative (0) but it is actually positive (1).
Q41. You are working on a classification problem and have randomly sampled the training data set into training and validation sets. You are confident that, because your validation accuracy is high, your model will perform extremely well on unseen data. However, you are shocked when you obtain poor test accuracy. What went wrong?
Answer: In a classification problem, stratified sampling should be used instead of random sampling. Random sampling does not take the proportion of target classes into account, whereas stratified sampling preserves the target-class distribution in the resulting samples.
Q42. Adjusted R² or the F value is generally used to evaluate a linear regression model. How can you evaluate a logistic regression model?
Answer: There are the following methods we can use:
- a. Since the logistic regression model predicts probabilities, we can evaluate its performance using the AUC-ROC curve along with the confusion matrix.
- b. In logistic regression, the metric analogous to adjusted R² is AIC. AIC is a measure of fit that penalizes the model for the number of its coefficients; it is used to evaluate and compare the models you fit. Therefore, a model with a lower AIC value is preferred.
- c. Null deviance indicates the response predicted by a model with nothing but an intercept; the lower the value, the better the model. Residual deviance indicates the response predicted by a model after adding the independent variables; again, the lower the value, the better the model.
Q43. OLS is to linear regression what Maximum Likelihood is to logistic regression. Explain this statement.
Answer: Ordinary least squares (OLS) and maximum likelihood are the approaches the respective regression methods use to approximate the unknown parameter (coefficient) values. In basic terms, OLS is a linear regression approach that finds the parameters resulting in the minimal distance between actual and predicted values. Maximum likelihood chooses the parameter values that maximize the probability of the observed data, i.e., the parameters most likely to have generated it.
Q44. You were asked to evaluate a regression model based on R², adjusted R², and tolerance. What will your criteria be?
Answer: Tolerance (1/VIF) is used as an indicator of multicollinearity; it is the percentage of variance in a predictor that cannot be accounted for by the other predictors. Large tolerance values are desirable.
To evaluate model fit, we prefer adjusted R², because R² increases whenever we add more variables, regardless of any improvement in prediction accuracy. Adjusted R² increases only if the additional variable improves the model's accuracy; otherwise it remains the same. A general threshold value for adjusted R² is difficult to assign because it varies between data sets.
For example: a gene-mutation data set may provide pretty good predictions despite a lower adjusted R² value, whereas a stock-market model with a similarly low adjusted R² would not be good.
Q45. What are Tensors?
Answer: Tensors are nothing but the de facto way of representing data in deep learning. They are just multidimensional arrays that allow you to represent data with higher dimensions. In general, in deep learning you deal with high-dimensional data sets, where the dimensions refer to the different features present in the data set.
Q46. What are dimensionality reduction and its benefits?
Answer: Dimensionality reduction refers to the process of converting a data set with vast dimensions into data with fewer dimensions (fields) to convey similar information concisely.
This reduction helps in compressing data and reducing storage space. It also reduces computation time as fewer dimensions lead to less computing. It removes redundant features; for example, there’s no point in storing a value in two different units (meters and inches).
Q47. Explain Gradient Descent.
Answer: A gradient measures how much the output of a function changes if you change the inputs a little bit. It simply measures the change in all weights with regard to the change in error. You can also think of a gradient as the slope of a function.
Gradient Descent can be thought of as climbing down to the bottom of a valley, instead of climbing up a hill. This is because it is a minimization algorithm that minimizes a given function (the cost or loss function).
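A minimal sketch of the idea on a simple one-dimensional function (the function, learning rate, and step count are illustrative choices):

```python
# Minimize f(w) = (w - 3)^2; its gradient is 2 * (w - 3)
w = 0.0
learning_rate = 0.1

for step in range(100):
    gradient = 2 * (w - 3)          # slope of the function at the current w
    w -= learning_rate * gradient   # move a small step "downhill"

print(round(w, 4))                  # converges toward the minimum at w = 3
```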
Q48. What is the bias-variance trade-off?
Answer: Bias is the error introduced in your model due to oversimplification by the machine learning algorithm; it can lead to underfitting. When you train your model, it makes simplified assumptions to make the target function easier to learn.
Low bias machine learning algorithms -> Decision Trees, k-NN, and SVM
High bias machine learning algorithms -> Linear Regression, Logistic Regression
Variance is the error introduced in your model by overly complex machine learning algorithms: the model learns the noise in the training data set as well and performs badly on the test data set. It can lead to high sensitivity and overfitting. Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias. However, this only happens up to a particular point; as you continue to make your model more complex, you end up over-fitting it, and your model will start suffering from high variance.
Bias-Variance trade-off: The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance.
- a. The k-nearest neighbor algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k which increases the number of neighbors that contribute to the prediction and in turn increases the bias of the model.
- b. The support vector machine algorithm has low bias and high variance, but the trade-off can be changed by increasing the C parameter that influences the number of violations of the margin allowed in the training data which increases the bias but decreases the variance.
Q49. How does a confusion matrix work?
Answer: The confusion matrix is a square matrix that has Actual and Predicted as Rows and Columns respectively. It determines the number of Correct and Incorrect Predictions. It is used in evaluating the results of a classification machine learning model.
A confusion matrix is usually computed as the cross-tabulation of the observed (true) classes and the classes predicted by the model, for classification algorithms such as logistic regression, decision forests, Naïve Bayes, and many more. From it we derive metrics such as precision and recall, which help measure the model's accuracy and pick the best model.
Suppose a 2-class confusion matrix for Fraud and Genuine transactions, where each row represents the instances of an actual class and each column the instances of a predicted class. Its fields are:
- a. True Positive (TP): a fraudulent transaction correctly predicted as Fraud.
- b. False Negative (FN): a fraudulent transaction incorrectly predicted as Genuine.
- c. False Positive (FP): a genuine transaction incorrectly predicted as Fraud.
- d. True Negative (TN): a genuine transaction correctly predicted as Genuine.
Now, we can describe important performance measures in machine learning:
Accuracy: the proportion of transactions that have been correctly classified; it is the most widely used performance metric.
Accuracy (ACC) = (TN + TP) / (TN + TP + FP + FN)
Precision: also known as the positive predicted value, the proportion of transactions predicted as fraudulent that really are fraudulent.
Precision / Positive Predicted Value = TP / (TP + FP)
Recall: also known as sensitivity, probability of detection, or true positive rate; it measures the fraction of fraudulent records that are properly identified by the system.
Recall / Sensitivity / True Positive Rate = TP / (TP + FN)
Specificity: also known as the true negative rate; it measures the percentage of genuine (normal) records properly classified by the system.
Specificity / True Negative Rate = TN / (TN + FP)
False alarm rate: the fraction of genuine records that are incorrectly reported as fraudulent.
False Alarm Rate = FP / (FP + TN)
Cost: a measure of how cost-effective the system is; here each case flagged as fraudulent (FP or TP) is assumed to cost 10 units and each missed fraud (FN) 100 units.
Cost = 100 * FN + 10 * (FP + TP)
F1-Score: the harmonic mean of precision and recall.
F1-Score = (2 * Precision * Recall) / (Precision + Recall)
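A minimal sketch computing these measures from a confusion matrix (the labels are made-up examples, with 1 = Fraud and 0 = Genuine):

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy         = (tp + tn) / (tp + tn + fp + fn)
precision        = precision_score(y_true, y_pred)    # TP / (TP + FP)
recall           = recall_score(y_true, y_pred)       # TP / (TP + FN), a.k.a. sensitivity / TPR
specificity      = tn / (tn + fp)
false_alarm_rate = fp / (fp + tn)
cost             = 100 * fn + 10 * (fp + tp)           # the cost formula above
f1               = f1_score(y_true, y_pred)
print(accuracy, precision, recall, specificity, false_alarm_rate, cost, f1)
```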
Q50. How is True Positive Rate (TPR) and Recall related? Write the equations.
Answer: Yes, True Positive Rate (TPR) and Recall are the same; both are given by the formula TP / (TP + FN).
My View
Statistics is one of the most crucial and important branches of mathematics. It is the part of mathematics concerned with data collection, organization, presentation, and summarization. In other words, statistics is about turning raw information into useful insight, and statistical models can be applied to solve social, industrial, and scientific problems. Similarly, machine learning is one of the crucial aspects of computer science, in which several mathematical techniques are used to make a machine learn directly from data.
Machine learning and statistics are two closely connected domains. At times, the line between machine learning and statistics can be quite blurry, although several methods clearly belong to the statistics camp. If you are working on machine learning projects, statistics is not just beneficial but essential: it would be fair to say that statistical methods are needed to work effectively on a predictive modeling project.
Note: If you have recently appeared for, or are planning, a career interview, or are struggling to make progress in data science/machine learning, share your interview experience in the comments below. We'd love to help as best we can!
☺ Thanks for your time ☺
What do you think of these questions on statistics and machine learning? Let us know by leaving a comment below. (Appreciation, Suggestions, and Questions are highly appreciated).