Machine Learning for Engineers
Introduction to Machine Learning, Alpaydin, 2014. ISBN 9780262325745
Objectives
1. Understand the fundamentals of ML methodologies such as decision trees, perceptrons, kernel machines, graphical
models, Markov models, and Bayesian estimation.
2. Understand the capabilities, limitations, and model selection with supervised, unsupervised, reinforcement, and
deep learning techniques.
3. Learn to automate data acquisition, feature engineering, and ML pipelines using real-world data to solve
engineering problems.
4. Analyze ML models through performance evaluation metrics and validate hypotheses through statistical methods.
5. Understand how to turn an ML application into a functional website and deploy it, both to a personal computer with Docker and to Amazon Web Services, through continuous integration.
Takeaways
Artificial Intelligence (AI) is a simulation of human intelligence through computers. It leverages machine learning (ML) and various other technologies. AI is often referenced in the context of natural language processing (NLP), computer vision, and speech recognition.
Machine Learning automates analytical model building by examining data. It is suitable for classification, prediction, and optimization problems. Descriptive analytics relies on visual analytics, diagnostic analytics uses statistics, predictive analytics uses ML, and prescriptive analytics uses data science.
Programming computers to optimize a performance criterion using example data or past experience is the core of ML. We have a model defined up to some parameters, and learning is the execution of a computer program to optimize the parameters of the model using the training data or past experience. The model may be predictive to make predictions in the future, descriptive to gain knowledge from data, or both.
Machine learning uses the theory of statistics in building mathematical models because the core task is making inferences from a sample. The role of computer science is twofold: in training, we need efficient algorithms to solve the optimization problem and to store and process the massive amounts of data we generally have; once a model is learned, its representation and algorithmic solution for inference must also be efficient. In certain applications, the efficiency of the learning or inference algorithm, namely its space and time complexity, may be as important as its predictive accuracy.
Supervised learning is when the aim is to learn a mapping from the input to an output whose correct values are provided by a supervisor. It can be used for prediction of future cases, knowledge extraction, compression, outlier detection, and more. In unsupervised learning, the aim is to find the regularities in the input. There is a structure to the input space such that certain patterns occur more often than others, and we want to see what generally happens and what does not. In statistics, this is called density estimation.
Reinforcement learning is used to make sequential decisions. The learner is a decision-making agent that takes actions in an environment and receives a reward (or penalty) for its actions in trying to solve a problem. After a set of trial-and-error runs, it should learn the best policy, which is the sequence of actions that maximizes the total reward.
When considering a broader system that uses ML, the system should have at least one of the following needs; if it does not, ML may not be the best solution: automate, alert or prompt, organize, annotate, extract, recommend, classify, quantify, synthesize, answer a straightforward question, transform its input, or detect novelty or an anomaly. ML should be used when the problem is too complex to code by hand, constantly changing, perceptive (e.g., image, speech, or video recognition), an unstudied phenomenon, has a simple objective (e.g., yes/no), or is cost-effective to solve with ML. However, ML should not be used when system actions, decisions, or changes in behavior must be explainable, when system errors or failures are too costly, when getting the correct data is too complicated or impossible, when more straightforward methods or heuristics would work reasonably well, or when you can manually fill an exhaustive lookup table (listing the expected output for any input).
Time series data refers to observations recorded at regular time intervals. Regression explains an output variable from other input variables; autoregression, on the other hand, explains a variable from past values of that same variable. Three quantities characterize a regression model: bias, variance, and error. Bias refers to the tendency to learn the wrong things. High bias leads to under-fitting, while high variance leads to over-fitting. Error refers to the deviation of the data from the predicted line and is measured as the sum of squared errors.
Different types of regression include Simple Linear Regression, Multiple Linear Regression, Multivariate Linear Regression, and Polynomial Regression. Gradient descent is an optimization technique for fitting the model parameters, while regularization is a technique for reducing over-fitting. Ridge addresses over-fitting by penalizing the squared magnitude of the coefficients (L2), while Lasso penalizes their absolute values (L1) and can shrink some coefficients to exactly zero.
A high-bias model class is too restricted to contain the solution, while high variance indicates that the model class is too general. A model with low bias and low variance is desirable. Under-fitting occurs when the model performs poorly even on training data, while over-fitting occurs when the model performs well on training data but poorly on unseen test data.
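A minimal sketch (scikit-learn, with a made-up noisy quadratic dataset) contrasting an unregularized high-degree polynomial fit with Ridge (L2) and Lasso (L1) regularization; the degree and alpha values are arbitrary choices for illustration:

# Sketch: plain linear regression vs Ridge (L2) and Lasso (L1) on deliberately over-flexible features.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.normal(scale=1.0, size=60)   # noisy quadratic (made up)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, reg in [("plain", LinearRegression()),
                  ("ridge", Ridge(alpha=1.0)),
                  ("lasso", Lasso(alpha=0.1, max_iter=10000))]:
    # degree-10 features invite over-fitting; regularization tames it
    model = make_pipeline(PolynomialFeatures(degree=10), StandardScaler(), reg)
    model.fit(X_train, y_train)
    print(name, "train R^2:", model.score(X_train, y_train),
          "test R^2:", model.score(X_test, y_test))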
Joint probability distributions are at the core of probabilistic machine learning approaches. It is possible to compute any joint or conditional probability defined over any subset of variables if we know the joint probability distribution P(X1...Xn) over a set of random variables. Learning or estimating the joint probability distribution from training data can be easy if the dataset is large, but it may require methods that rely on prior knowledge or assumptions when the data is sparse.
Maximum likelihood estimation (MLE) is one of two widely used principles for estimating the parameters that define a probability distribution. The principle is to choose the set of parameter values that makes the observed training data most probable over all possible choices of parameters. The other principle is maximum a posteriori probability (MAP) estimation, which chooses the most probable value of the parameters given the observed training data and a prior probability distribution that captures prior knowledge or assumptions about the parameter values.
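As a concrete illustration (the coin-flip data and the Beta prior below are assumptions made up for this sketch), MLE and MAP estimates of a Bernoulli parameter:

# Sketch: MLE vs MAP for the probability of heads from toy coin-flip data.
import numpy as np

flips = np.array([1, 1, 0, 1, 0, 1, 1, 1])     # 1 = heads; made-up observations

# MLE: the parameter value that makes the observed data most probable.
theta_mle = flips.mean()

# MAP with a Beta(a, b) prior: mode of the posterior Beta(a + heads, b + tails).
a, b = 2.0, 2.0                                 # prior "pseudo-counts" (an assumption)
heads, tails = flips.sum(), len(flips) - flips.sum()
theta_map = (heads + a - 1) / (len(flips) + a + b - 2)

print("MLE:", theta_mle, "MAP:", theta_map)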
A decision tree is a predictive model used in data mining and machine learning for classification and identifying important features. The model predicts the value of an output variable at the leaf nodes of the tree based on input variables or attributes at the root and interior nodes. A table with a single class is considered homogeneous or pure, while a table with more than one class is heterogeneous or impure. To measure the degree of impurity, three methods can be used: entropy, gini index, and classification error.
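A small sketch of the three impurity measures applied to a vector of class proportions (the example proportions are arbitrary):

# Sketch: entropy, Gini index, and classification error for class proportions p.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                              # avoid log(0)
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1.0 - np.sum(np.asarray(p, dtype=float) ** 2)

def classification_error(p):
    return 1.0 - np.max(p)

pure, impure = [1.0, 0.0], [0.5, 0.5]
print(entropy(pure), gini(pure), classification_error(pure))        # all 0: homogeneous (pure) table
print(entropy(impure), gini(impure), classification_error(impure))  # 1.0, 0.5, 0.5: maximally impure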
The decision tree is a hierarchical model for supervised learning that can represent any Boolean function. The tree moves an example down its branches through a series of yes/no questions on the features, predicting the label at a leaf. The model is deterministic, works with discrete or continuous parameters, and is an eager learner that operates mainly through batch processing.
The tree is built by adding nodes through an eager learning process that constructs a classification model before receiving new data to classify. The main advantage is that the target function is approximated globally during training, requiring less memory than using a lazy learning system. Examples of eager learners include decision trees, Naive Bayes, and artificial neural networks.
In contrast, lazy learning can provide good local approximations in the target function without requiring prior assumptions about data parameters. It can also solve multiple problems and deal successfully with changes in the problem domain. However, it has disadvantages such as large memory requirements, poor performance with noisy training data, and being easily fooled by irrelevant attributes.
Decision trees have limitations, including orthogonal decision boundaries, sensitivity to rotation of the training set, and instability: they are sensitive to small variations in the training data. To avoid overfitting, the tree's accuracy on the training dataset should not be much higher than its accuracy on the test data. This can be achieved through pre-pruning or post-pruning techniques.
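A hedged sketch of pre-pruning with scikit-learn, limiting max_depth so training accuracy does not run far ahead of test accuracy (the built-in dataset is an arbitrary choice):

# Sketch: fully grown vs pre-pruned decision tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in [None, 3]:                       # None = grow fully; 3 = pre-pruned
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print("max_depth =", depth,
          "train acc:", tree.score(X_train, y_train),
          "test acc:", tree.score(X_test, y_test))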
Support vector machines (SVMs) are used for linear and nonlinear classification, regression, and outlier detection. They are ideal for linearly separable datasets but use soft margin or kernel methods for non-linear datasets. The kernel trick uses existing features, applies transformations, and creates new features to find nonlinear decision boundaries.
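A minimal sketch of the kernel trick with scikit-learn, assuming a dataset that is not linearly separable (the built-in moons generator is used only for illustration):

# Sketch: linear vs RBF-kernel SVM on data that is not linearly separable.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "rbf"]:
    clf = SVC(kernel=kernel, C=1.0).fit(X_train, y_train)   # C controls margin softness
    print(kernel, "test accuracy:", clf.score(X_test, y_test))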
K-nearest neighbors (KNN) use the Euclidean distance between two points to determine the nearest neighbors. The value of k affects the model's complexity, where a small k results in high complexity and a large k results in low complexity. Cross-validation is used to fine-tune k. KNN has advantages such as fast training and learning complex target functions but is slow at testing time, requires a lot of storage, and is easily fooled by irrelevant attributes and noise.
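A small sketch of tuning k with cross-validation (scikit-learn; the built-in dataset and the candidate k values are arbitrary):

# Sketch: choosing k for KNN via 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in [1, 3, 5, 11, 25]:            # small k -> complex boundary; large k -> smooth boundary
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print("k =", k, "mean CV accuracy:", scores.mean())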
Big Data:
Volume: refers to the size of data
Variety: refers to the structured and unstructured nature of data
Velocity: refers to the rate of data generation or change
Veracity: refers to the reliability of data
Value: refers to the worth or usefulness of data
Variability: refers to inconsistency in the data, such as its meaning or format changing over time
Methods to reduce features:
Missing values ratio
Low variance
High correlation
Backward feature elimination
Forward feature selection
Covariance and Correlation:
Covariance: indicates the direction of the relationship between two variables, but the values are not standardized
Correlation: measures how strongly two variables go up or down together, and the values are standardized
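A quick numpy illustration of the difference (the data are made up):

# Sketch: covariance is unstandardized; correlation is scaled to [-1, 1].
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 10 * x + np.array([0.5, -0.3, 0.2, -0.1, 0.4])   # strongly related, different scale

print(np.cov(x, y)[0, 1])       # covariance: shows direction, but is scale-dependent
print(np.corrcoef(x, y)[0, 1])  # correlation: standardized strength (close to 1 here)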
PCA (Principal Component Analysis):
Eigenvector: does not change direction in a transformation
Standardization is needed so that features with high variance (due to their scale) do not dominate the selection of the first principal components
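A minimal PCA sketch with standardization (scikit-learn; the built-in dataset is an arbitrary choice):

# Sketch: standardize features before PCA so large-scale features do not dominate the first PCs.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_std)
print("explained variance ratio:", pca.explained_variance_ratio_)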
Clustering:
Most clustering algorithms are distance-based, and high-range features will have a bigger influence
KNN and SVM leverage distance, meaning large values will have a higher influence
Standardization is not necessary for logistic regression, decision tree, random forest, and gradient boosting
Eigenvector and Eigenvalues:
A v = λv (A is a matrix, v an eigenvector, λ the corresponding eigenvalue)
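A short numpy check of A v = λv (the matrix below is arbitrary):

# Sketch: verify that A v = lambda v for each eigenpair of a small matrix.
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
for lam, v in zip(eigenvalues, eigenvectors.T):   # columns of eigenvectors are the eigenvectors
    print(np.allclose(A @ v, lam * v))            # True: the direction is unchanged, only scaled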
K-means:
Finds groups in data based on feature similarity
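A minimal k-means sketch (scikit-learn; the blob data stands in for, e.g., customer features):

# Sketch: k-means finds groups based on feature similarity (toy blob data).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)     # one centroid per discovered group
print(km.labels_[:10])         # cluster assignment for the first few points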
Unsupervised Learning:
Customer segmentation in CRM
Image compression: Color quantization
Bioinformatics: Learning motifs
Regression:
When there is more than one input variable, it is multiple linear regression; when the input variable's degree is greater than 1, it is polynomial regression
Lasso can reduce over-fitting by penalizing the absolute values of the coefficients, which can shrink some of them to zero
The linear kernel is the least expensive to execute
RBF (radial basis function) is another, nonlinear kernel
Hyperparameter C controls the tolerance for margin violations: a smaller C tolerates more misclassified points (a softer margin)
Distance Measures:
Euclidean distance measures the distance between two data points: square the difference in each coordinate, sum the squares, and take the square root, i.e., d(a, b) = sqrt(sum_i (a_i - b_i)^2).
Machine Learning Metrics:
Accuracy: measures how many of both positive and negative values are correctly classified. Not suitable for heavily imbalanced data since it is easy to get high accuracy by simply classifying all observations as the majority class.
Precision: measures how many of the predicted observations are correct. Use when positive values are more important or the dataset is balanced.
Recall: measures how many of the "actual" positive observations were predicted correctly. Use when positive values are more important or the dataset is balanced.
F-score: the harmonic mean of precision and recall. Use when the dataset is balanced; it is also suitable for heavily imbalanced data provided the positive class is set to the class of interest, because the harmonic mean balances precision and recall.
ROC - AUC: the ROC curve plots the True Positive Rate (recall) against the False Positive Rate; higher AUC values indicate better performance. AUC-ROC is used when we have a balanced dataset. Not suitable for heavily imbalanced data, since the False Positive Rate is pulled down by the many True Negatives.
PR AUC: use it when you have an imbalanced dataset, but not when negative values are more important than positive values. Whereas ROC AUC looks at TPR and FPR, PR AUC looks at the positive predictive value (precision) and TPR (recall). The precision-recall formulation is helpful for imbalanced classes because True Negatives do not appear in the calculation.
Matthews Correlation Coefficient (MCC): great when a classifier's results need to be evaluated against expected (ground-truth) results, but not suitable when the task is to compare a base model's results against a proposed model's results, as in the praxis. Classification model output is discrete, whereas regression model output is continuous.
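A short sketch computing these classification metrics with scikit-learn (the label vectors and scores are made up, and an imbalanced mix is used on purpose):

# Sketch: accuracy, precision, recall, F1, MCC, ROC AUC, and PR AUC on made-up imbalanced labels.
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             matthews_corrcoef, roc_auc_score, average_precision_score)

y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]                         # imbalanced: few positives
y_pred  = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]                         # hard predictions
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.6, 0.4, 0.9, 0.45]    # predicted probabilities

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("MCC      :", matthews_corrcoef(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_score))
print("PR AUC   :", average_precision_score(y_true, y_score))    # average precision approximates PR AUC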
MSE: a straightforward metric that calculates the error (i.e., Error = Actual - Predicted Value), squares it, and then provides the mean of all the errors. There is no right or wrong MSE value. Lower values typically mean a better model. For instance, zero MSE means a perfect regression model. If you have outliers in your result, do not use MSE. It is very sensitive to outliers.
RMSE: Both RMSE and the output variable are in the same unit, making RMSE easier to plot and interpret than MSE. RMSE is also known as the standard deviation of the residuals. RMSE suffers from the same outlier sensitivity as MSE.
MAE: more robust to outliers than MSE and RMSE. Lower MAE, MSE, and RMSE imply higher accuracy.
R-squared: tells you how well points fit a curve or a line. Note that R-squared always increases as more predictor variables are added, even if the new variables are insignificant; adjusted R-squared penalizes the number of variables to correct this inflated R-squared.
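A quick sketch of the regression metrics (the actual and predicted values are made up):

# Sketch: MSE, RMSE, MAE, and R^2 on made-up regression outputs.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mse = mean_squared_error(y_true, y_pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))                          # same unit as the output variable
print("MAE :", mean_absolute_error(y_true, y_pred))   # more robust to outliers
print("R^2 :", r2_score(y_true, y_pred))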
Silhouette Score: measures how close each point is to points in its own cluster compared with points in other clusters. It ranges from -1 to +1: a score near +1 indicates the point is far from other clusters, near 0 indicates it lies close to a cluster boundary, and near -1 indicates it was likely assigned to the wrong cluster.
Markov Property: known as the memoryless property; the next state depends only on the current state, not on the earlier history.
Markov Chain: a set of states together with the transition probabilities between them.
Q-Value: explicitly tells the agent which action should be chosen at each state, according to the Q-Value score.
Episodic Learning: agents run trials, constantly collecting samples, getting rewards, thereby evaluating the V and Q functions.
Monte-Carlo: runs complete episodes, all the way through the chain, to estimate expected values, whereas temporal-difference (TD) learning updates from the current step's value instead of the entire chain. Monte-Carlo is time-consuming but accurate, while TD is less accurate but good at estimation. Q-learning is a TD method; TD(lambda) combines Monte-Carlo and TD. Instead of running every iteration to generate the Q-table ahead of time, Q-learning uses a policy such as decaying epsilon-greedy to keep updating the Q-table with the best value from every decision step.
Unlike Bayesian networks, which are directed acyclic graphs, the state-transition graph of a reinforcement learning problem can contain cycles, since states can be revisited. Every reinforcement learning problem can be formulated as a Markov Decision Process (MDP), which uses only the current state and action to determine the next state, not the previous states. Every action may return a reward.
Reward calculation using an MDP requires a tuple of (State, Action, Probability of state transition, Reward, Future-reward discount). The Bellman optimality equation for the state value V(s) tells you how good a state is, but not which action to take. The Bellman equation for the state-action value Q(s, a), also known as the Q-value, tells you which action to take.
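A minimal Q-learning sketch on a made-up 5-state chain environment (epsilon-greedy with decaying epsilon; the states, rewards, and hyperparameters are all assumptions), just to show the temporal-difference update of the Q-table:

# Sketch: tabular Q-learning on a toy 5-state chain; reaching the last state gives reward 1.
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
alpha, gamma = 0.1, 0.9               # learning rate, future-reward discount
Q = np.zeros((n_states, n_actions))
rng = np.random.RandomState(0)

for episode in range(500):
    s = 0
    epsilon = max(0.05, 1.0 - episode / 400)          # decaying epsilon-greedy policy
    while s != n_states - 1:
        a = rng.randint(n_actions) if rng.rand() < epsilon else int(np.argmax(Q[s]))
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Temporal-difference update (Bellman equation for the Q-value):
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))   # learned policy: mostly action 1 (right); the terminal state's row stays zero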
A problem statement should be clear, concise, specific, and single issue. It can also include a thesis statement, research objective, research question, hypothesis, methodology, evaluation process, validation process, and conclusion.
Bias = -threshold. A neuron without an activation function is just linear regression. Optimizer algorithms change attributes of the neural network, such as the weights and learning rate, to reduce the loss; examples include Adam and SGD.
Uncertainty analysis involves determining a base value (best guess) and a range for every decision variable, selecting the bases, and performing one-way sensitivity analysis for all variables. A spider diagram plots the one-way analyses separately, while a tornado diagram is a simultaneous plot of the objective value (e.g., annual profit) on the x-axis as a function of changes in the free variables. A limitation of one-way sensitivity analysis is that it underestimates sensitivity because it ignores the additive effects of varying more than one variable at a time.
CNNs are used for image convolution, while RNNs are used for text. An RNN uses the same weights at each step, while a CNN uses a fixed number of inputs and outputs. Non-semantic models include Bag of Words and TF-IDF. Non-context semantic models include Word2Vec, N-grams, RNNs, and LSTMs. Context-based semantic models include ELMo and BERT. BERT (Bidirectional Encoder Representations from Transformers) uses positional encoding, applies attention across the whole bag of words, allows transfer learning, and allows hyperparameter tuning.
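A small sketch contrasting a plain bag-of-words count with TF-IDF (scikit-learn >= 1.0 assumed for get_feature_names_out; the sentences are made up):

# Sketch: bag-of-words counts vs TF-IDF weights for a few made-up sentences.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the pump failed under load",
        "the pump ran normally",
        "sensor noise caused a false alarm"]

bow = CountVectorizer().fit(docs)
tfidf = TfidfVectorizer().fit(docs)

print(bow.get_feature_names_out())               # vocabulary
print(bow.transform(docs).toarray())             # raw counts (non-semantic)
print(tfidf.transform(docs).toarray().round(2))  # counts reweighted by how rare each word is across documents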
For no-code ML, there should be a minimum of 500 instances and a header for intuitive interpretation. The data should be prepared by normalizing and taking care of missing data. Tasks can be either classification or regression.
Meta-learning is the process of learning from machine learning models, while ensemble learning involves combining multiple models. Bagging and boosting are two ensemble approaches: bagging weights all observations equally, while boosting weights observations differently. The steps involved in machine learning include reading, processing, optimization, and application. It is not recommended to use auto-sklearn (AutoML) when you need an explanation of the model's behavior or want to tune hyperparameters yourself.
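A minimal scikit-learn sketch contrasting bagging (equal-weight resampling) and boosting (reweighting hard observations); the synthetic dataset and estimator counts are arbitrary, and recent scikit-learn is assumed for the estimator= parameter name:

# Sketch: bagging vs boosting ensembles of shallow decision trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
base = DecisionTreeClassifier(max_depth=1, random_state=0)

# (older scikit-learn versions call this parameter base_estimator)
bagging  = BaggingClassifier(estimator=base, n_estimators=100, random_state=0)
boosting = AdaBoostClassifier(estimator=base, n_estimators=100, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, "mean CV accuracy:", cross_val_score(model, X, y, cv=5).mean())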
MLJAR is a tool that is great for understanding models, but it is not optimized for performance. It is recommended to use Baseline, Linear, Decision Tree, Random Forest, XGBoost, Neural Network, and Ensemble models, along with full explanations, learning curves, plots, and SHAP plots. Spark is a distributed processing system for big data that employs in-memory caching and optimized query execution for analytic queries against large data sets. Unlike Hadoop, once the controller (driver) is done, the whole system is done. Spark is written in Scala.
PPO and SSE are two approaches to reinforcement learning; PPO is a model-free, on-policy method, while SSE is model-free and off-policy. Typical neural networks include ANNs, CNNs, and RNNs; these three kinds are used for statistical learning problems with great results. Neural networks are part of the parametric family, with activation function categories such as binary step, linear activation, and nonlinear activation. The optimizer changes attributes such as the weights and learning rate to reduce losses. An axon represents the output of a neuron in the human brain.
Logistic regression is used when you want to predict a categorical variable from continuous or categorical variables. In logistic regression, the dependent variable is divided into two categories. The drawback of neural network modeling is that the parameters are typically uninterpretable and the response variable is a nonlinear function of the linear predictor values. Logistic regression is used when the response variable is qualitative or categorical. An ANN consists of nodes and connections between nodes. The multilayer perceptron is an artificial neural network structure that can be used for classification and regression.
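A small logistic regression sketch (scikit-learn; the binary built-in dataset is an arbitrary choice) predicting a categorical target from continuous inputs:

# Sketch: logistic regression on a binary (two-category) target.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("class probabilities for one case:", clf.predict_proba(X_test[:1]))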
Parallel processing in neural networks includes SIMD and MIMD. The perceptron is the basic processing element; its inputs may come from the environment or from the outputs of other perceptrons. An algorithm is unstable if small changes in the training set cause large differences in what is learned. A real neuron can map inputs to outputs of any size: single-to-single, single-to-multiple, or multiple-to-single. A node computes its weighted input and can be in an excited or non-excited state. Neural networks learn via back-propagation.
Hyperparameters of a neural network are variables that determine network structure and how the network is trained. Hyperparameters related to network structure include dropout and activation function. Methods used to find out hyperparameters include manual search, grid search, random search, and Bayesian optimization.
Backpropagation is a system used for calculating partial derivatives by working backwards in a neural network.
A limitation of one-way sensitivity analysis is that it underestimates sensitivity because it ignores the additive and multiplicative effects of varying several variables together. Convolution is a data-reduction technique used in convolutional neural networks (CNNs). Activation functions are used in neural networks to account for interactions and non-linear effects; without an activation function, a neuron would simply be a regression. Transfer learning is the process of using an existing trained model to perform a new task. Deep learning is computationally expensive but can handle high-dimensional data; feature extraction is done automatically, and the hidden layers of a neural network help with the learning process. Convolution helps reduce the number of features, and adding weights in a neural network helps to reduce uncertainty.
A perceptron is a type of neuron that can be used for classification and regression.
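A from-scratch perceptron sketch in numpy (the AND-gate data, learning rate, and epoch count are arbitrary choices), showing the weights, the bias, and the binary step activation:

# Sketch: perceptron learning rule on a toy AND-gate dataset.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])                    # AND of the two inputs

w = np.zeros(2)
b = 0.0                                       # bias = -threshold
lr = 0.1

for epoch in range(20):
    for xi, target in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0     # binary step activation
        w += lr * (target - pred) * xi        # update only when the prediction is wrong
        b += lr * (target - pred)

print(w, b)
print([1 if xi @ w + b > 0 else 0 for xi in X])   # correctly classifies the AND gate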
In terms of performance metrics, if positive and negative observations are balanced and the prediction of positive observations is more important, precision is the best metric to use. TPR is recall. Neural networks are highly parallel and have a distributed processing architecture; they emphasize tuning weights automatically and contain many perceptrons. A bright pixel in the output image of a CNN indicates a strong edge at that location in the original image. Convolution helps us look for specific localized image features (like edges) that we can use later in the network.
RNNs are recurrent because they use the same weights at each step. A typical vanilla RNN uses only three sets of weights and two biases, and RNNs use a variable number of inputs and outputs. MLPs are designed to achieve statistical generalization and are composed of many different functions; for classification, the function maps an input to a class. Ways to avoid overfitting in a neural network include early stopping, weight sharing, and penalizing large weights. Softmax and sigmoid are two types of nonlinear activation functions.
Neural networks can be used to analyze the level of uncertainty in data. The purpose of a CNN is to look for specific localized features in images, which are then used later in the network. CNNs use a fixed number of inputs and outputs, while RNNs use a variable number of inputs and outputs.
A biological neural network (BNN) has no control unit directing its computing activities.
A perceptron is an effective classifier: an algorithm for supervised learning of binary classifiers. The bias does not have to be unique for each neuron in a hidden layer. Both CNNs and plain NNs use a fixed number of inputs and outputs.
A Markov decision process is not related to data leak prevention (DLP).
Q-learning determines what action to take given the state.
Monte-Carlo is a method that runs through the entire chain of sequences rather than accounting only for the current value, and it is time-consuming but accurate.