December 18th, 2020 · 7 minute read

A Machine Learning Primer for Everyone

Updated: January 25th, 2021

Machine learning (ML) is all around us

Machine Learning is everywhere, and it affects your daily life. Here is what you need to know.

You know what a microwave is and how to use it, but you probably don't know how it works. You don't have to. Machine Learning (ML) is a bit like that.

Unlike with a microwave, though, your interactions with ML are not always obvious, and you may be completely unaware of some of them. Because machine learning affects your daily life, it is worth knowing where it shows up and, at least roughly, how it works.

A popular application of machine learning is your Netflix recommendations feed. Netflix recommendations are created with machine learning to personalize every viewer's list. If a viewer frequently watches a certain kind of movie genre or television series, the system will automatically start suggesting similar offerings to the viewer.

How does that magic happen? The software applies statistical analysis and predictive analytics to spot patterns in the viewer's history. It uses those patterns to predict what the viewer will want to see next and places that content in the list of recommendations.
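To make "spotting patterns in the viewer's history" a little more concrete, here is a minimal Python sketch. It is not how Netflix actually works: the titles, the genres, and the `recommend` function are all made up for illustration. It simply counts which genres a viewer has watched most and ranks the titles they haven't seen by that count.

```python
from collections import Counter

# Hypothetical catalog: title -> genre (made up for illustration).
CATALOG = {
    "Space Quest": "sci-fi",
    "Robot Dawn": "sci-fi",
    "Star Drift": "sci-fi",
    "Love in Paris": "romance",
    "Laugh Track": "comedy",
}

def recommend(history, catalog=CATALOG, top_n=2):
    """Rank unseen titles by how often the viewer watched their genre."""
    genre_counts = Counter(catalog[title] for title in history)
    unseen = [title for title in catalog if title not in history]
    # Most-watched genre first; ties are broken alphabetically by title.
    ranked = sorted(unseen, key=lambda t: (-genre_counts[catalog[t]], t))
    return ranked[:top_n]

print(recommend(["Space Quest", "Robot Dawn", "Laugh Track"]))
# prints ['Star Drift', 'Love in Paris']
```

Real recommendation systems use far richer signals and models, but the basic idea of learning preferences from past behavior is the same.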

What exactly is machine learning, and how does it work?

Machine learning is a branch of artificial intelligence. It happens when you load lots of data into a computer program and choose a model or algorithm to recognize patterns in that data. This allows the computer to come up with its own predictions.

The model is trained using a machine learning algorithm. Once the model is trained, it can be used to make predictions about new data. With machine learning, computers learn to recognize patterns in data and make predictions based on that data without being explicitly programmed with the rules. In general, the more good-quality data you provide your model, the better its predictions will be.
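Here is what "train, then predict" can look like in a few lines of Python. This is only a sketch under simple assumptions: the model is a straight line fitted by ordinary least squares, and the study-hours data, `train`, and `predict` are invented for illustration. Notice that nobody writes the rule "each extra hour adds two points"; the program works it out from the examples.

```python
def train(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Apply the fitted line to a new input."""
    slope, intercept = model
    return slope * x + intercept

# Made-up training data: hours of study -> exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

model = train(hours, scores)   # the "learning" step
print(predict(model, 6))       # prints 62.0
```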

Bias raises its ugly head

The efficacy and accuracy of a machine learning model depend on the quality, objectivity, and size of the data used to train it.

Poor quality or inadequate data leads to bias in machine learning models, which invariably leads to poor predictions. How can something inanimate like a machine learning model develop bias, though?

The problem originates with the humans who design and train the machine learning models. All humans have prejudices and unconscious cognitive biases. Their algorithms will invariably reflect these biases.

And we are not talking here about the biases of one person. The whole machine learning team and their personal biases are involved, including designers, data scientists, and data engineers.

Biases can develop because the data sets that are used to train or validate machine learning models are biased, incomplete, or flawed.

Why is bias in machine learning models a problem?

Simply put, where bias comes into the picture, we can't trust the decisions and predictions of machine learning models. We can't rely on them to help us preempt challenges so we can take action to handle them.

Our lives decided by algorithms

The increasing use of artificial intelligence and machine learning in areas such as policing, hiring, healthcare, and criminal justice has led to grave warnings about bias and threats to personal freedom.

Flawed decisions based on the recommendations of algorithms, programmed by biased humans, are a cause for great concern. COMPAS and Faception are just two cases in point.

COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), developed by Northpointe Inc., is a risk assessment algorithm that predicts whether an offender is likely to commit a crime in the future. The algorithm bases its assessments on answers to a 137-question questionnaire.

An investigation of COMPAS by ProPublica found that the algorithm was biased against Black defendants: it produced a higher false-positive rate for them. Black defendants who did not go on to reoffend were more likely to have been flagged as high risk, while white defendants who did reoffend were more likely to have been flagged as low risk. These assessments by algorithms affect an individual's future and freedom.
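A false-positive rate is simple to compute: among the people who did not reoffend, what share were flagged as high risk anyway? The sketch below shows the arithmetic in Python. The records are toy values invented purely for illustration (they are not ProPublica's or COMPAS data), and the group labels "A" and "B" and the function name are hypothetical.

```python
def false_positive_rate(records, group):
    """Share of people in `group` who did NOT reoffend but were flagged high risk."""
    did_not_reoffend = [
        r for r in records if r["group"] == group and not r["reoffended"]
    ]
    if not did_not_reoffend:
        raise ValueError(f"no non-reoffenders in group {group!r}")
    false_positives = sum(1 for r in did_not_reoffend if r["flagged_high_risk"])
    return false_positives / len(did_not_reoffend)

# Toy records, invented for illustration only (NOT real COMPAS data).
records = [
    {"group": "A", "flagged_high_risk": True,  "reoffended": False},
    {"group": "A", "flagged_high_risk": True,  "reoffended": False},
    {"group": "A", "flagged_high_risk": False, "reoffended": False},
    {"group": "A", "flagged_high_risk": True,  "reoffended": True},
    {"group": "B", "flagged_high_risk": True,  "reoffended": False},
    {"group": "B", "flagged_high_risk": False, "reoffended": False},
    {"group": "B", "flagged_high_risk": False, "reoffended": False},
    {"group": "B", "flagged_high_risk": False, "reoffended": False},
]

for g in ("A", "B"):
    print(g, round(false_positive_rate(records, g), 2))
# prints: A 0.67
#         B 0.25
```

When the rate differs a lot between groups, as it does in this toy data, the model's mistakes fall more heavily on one group than the other.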

The Israeli startup Faception claims to have developed computer vision and machine learning technology that reveals people's personality based only on an image of their face. According to the startup, the system can identify geniuses, white-collar offenders, pedophiles, and terrorists just from facial images.

That is letting a computer decide if you're a terrorist just from your face. Experts warn that there is a very real danger of unacceptable numbers of false positives – imagine standing at the airport and being flagged as a possible terrorist just because your eyes are slanted a certain way!

Validating machine learning models

There are several validation methods, and no single method works for all cases. We will briefly discuss some of them here, but not in any technical detail.

1. Train-test split

With this method, you randomly split your data into two parts: about 70% is used to train the model and the remaining 30% is used to test it. The data used to test the model is also referred to as the validation or hold-out set.

The model is "fit" on the training set, and the fitted model is then validated on the hold-out or validation set.
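A train-test split can be sketched in a few lines of Python. This is a minimal version using only the standard library; the function name, the fixed `seed`, and the ten-item dataset are assumptions made for the example. Libraries such as scikit-learn offer a ready-made equivalent.

```python
import random

def train_test_split(data, test_fraction=0.3, seed=0):
    """Shuffle a copy of `data` and split it into (train, test) lists."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))                      # stand-in for ten real examples
train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))        # prints 7 3
```

The shuffle matters: if the data happens to be sorted (by date, say), taking the last 30% without shuffling could give a test set that looks nothing like the training set.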

2. K-fold cross-validation

Here you don't split your data once; you split it multiple times and validate every resulting combination of those splits.

The data is divided into what are called "folds". It is split into k folds; the model is trained on k-1 of them and then tested on the one fold that was held out. This is repeated until every fold has served as the test set once.

"k" is a hyperparameter: it controls the number of folds the data is split into.

The more folds you use, the more rounds of training and testing you have to run. Note that this makes the estimate of your model's performance more thorough; it does not by itself make the model more accurate.
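The splitting itself can be sketched in plain Python. The generator below is a simplified illustration (the name `k_fold_splits` is made up, and it does not shuffle the data first, which you would normally do). It yields, for each fold, the indices to train on and the indices to test on.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train, test in k_fold_splits(6, k=3):
    print("train:", train, "test:", test)
# prints:
# train: [2, 3, 4, 5] test: [0, 1]
# train: [0, 1, 4, 5] test: [2, 3]
# train: [0, 1, 2, 3] test: [4, 5]
```

In each round you would train a fresh model on the `train` indices, score it on the `test` indices, and then average the k scores.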

3. LOOCV Model Evaluation

Leave-one-out cross-validation, or LOOCV, is a special case of k-fold cross-validation in which k is set to the total number of examples in the dataset, so the data is split into the largest possible number of folds. This means that for each sample in the dataset, a separate model is trained on all the other samples and evaluated on that one sample.

The benefit is a reliable estimate of model performance, but it comes at a great computational cost.
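Here is a small Python sketch of leave-one-out cross-validation. To keep it short, the "model" is deliberately trivial: it just predicts the average of whatever it was trained on. The function name and the three data points are made up for the example.

```python
def loocv_mean_absolute_error(values):
    """Leave-one-out CV for a trivial 'predict the mean' model."""
    if len(values) < 2:
        raise ValueError("need at least two samples")
    errors = []
    for i, held_out in enumerate(values):
        training = values[:i] + values[i + 1:]       # every sample except one
        prediction = sum(training) / len(training)   # "train" the model
        errors.append(abs(held_out - prediction))    # test on the held-out sample
    return sum(errors) / len(errors)

print(loocv_mean_absolute_error([2.0, 4.0, 6.0]))    # prints 2.0
```

With three samples the loop trains three models. With a million samples it would train a million, which is where the computational cost comes from.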

Other methods for validating and comparing machine learning models, not discussed here, include:

· Nested Cross-Validation
· Time-series Cross-Validation
· Wilcoxon signed-rank test
· McNemar's test
· 5x2CV paired t-test
· 5x2CV combined F test

Machine learning involves feeding huge volumes of data into algorithms to train them. However, much of this data can't be used as-is. It must first be cleansed. Data cleansing involves actions like filling in missing values, removing unusable rows, removing errors and duplicate data, and more.
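A few of those cleansing steps can be sketched in plain Python. The records, the field names (`name`, `age`), and the `cleanse` function are all hypothetical, and filling a missing age with the average is just one possible choice among many.

```python
def cleanse(rows):
    """Drop duplicates and rows without a name; fill missing ages with the mean."""
    seen = set()
    unique = []
    for row in rows:
        key = (row.get("name"), row.get("age"))
        if row.get("name") and key not in seen:   # skip unusable and duplicate rows
            seen.add(key)
            unique.append(dict(row))              # copy so the input is untouched
    known_ages = [r["age"] for r in unique if r.get("age") is not None]
    mean_age = sum(known_ages) / len(known_ages) if known_ages else None
    for r in unique:
        if r.get("age") is None:
            r["age"] = mean_age                   # fill in the missing value
    return unique

raw = [
    {"name": "Ana", "age": 30},
    {"name": "Ana", "age": 30},     # duplicate
    {"name": "Ben", "age": None},   # missing value
    {"name": "", "age": 41},        # unusable row
    {"name": "Cy", "age": 40},
]
print(cleanse(raw))
# prints [{'name': 'Ana', 'age': 30}, {'name': 'Ben', 'age': 35.0}, {'name': 'Cy', 'age': 40}]
```

Every one of these choices (what counts as a duplicate, how to fill a gap) shapes what the model later learns, which is why cleansing deserves care.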

Data cleansing improves the quality of the training data and therefore the quality of the machine learning models. Incorrect or inconsistent data leads to bias in machine learning models and ultimately to false predictions.

Data cleansing is a time-consuming process, and data scientists spend an inordinate amount of time improving the quality of their data for machine learning. By a commonly cited estimate, they spend only about 20% of their time on actual data analysis and modeling; the rest goes to cleansing and reorganizing data. To speed up data cleansing and AI readiness, they use tools that leverage ML to understand their data quickly and accurately.

Final thoughts

Machine learning puts society in a position to make more informed, data-driven decisions. It also allows companies to provide more customized services to their customers, creating unique experiences that build brand affinity. However, it requires great care to ensure that it is implemented and maintained well, so that its outcomes are not only accurate but also ethical.

To learn more, check out other content from Apption and Datahunter.



