## Getting Started with Machine Learning and Predictive Analytics

Machine learning repeatedly reduces errors to improve predictions. This may sound like science fiction, as if it gave computers a mythical power they do not have. In fact, machine learning for prediction is a practical tool that many companies use to make sense of the metrics they collect.

What is predictive analytics?

Prediction is one of the major purposes of correlational statistics, although “prediction” is not always quite the right word for it. The idea is to find out which components make up a certain important variable. You have a number of measurements, and you want to see what proportion of the “variance” of the important variable each of them accounts for. You want to know how much each measure uniquely contributes to that variance, and how much the whole array of measurements accounts for when taken together. Here “prediction” often has nothing to do with the future; it simply means being able to tell the value of the important variable without looking at it.

You do this predicting with statistical techniques based on an inter-correlation matrix, a table of the correlations between every pair of variables. You want to find the weight to assign to each of your measures so that the weighted sum of all your measured variables gives the best possible “prediction” of the important variable.
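In symbols, one standard way to write this is the following sketch. The notation is assumed here and does not come from the original text: $x_1, \dots, x_k$ are the measures, $y$ is the important variable, $R_{xx}$ is the matrix of correlations among the measures, $r_{xy}$ is the vector of correlations between each measure and $y$, and $s$ denotes a standard deviation.

$$
\hat{y} = b_0 + b_1 x_1 + \dots + b_k x_k,
\qquad
\boldsymbol{\beta} = R_{xx}^{-1}\, r_{xy},
\qquad
b_i = \beta_i \,\frac{s_y}{s_{x_i}},
\qquad
b_0 = \bar{y} - \sum_{i=1}^{k} b_i \bar{x}_i
$$

$$
R^2 = \sum_{i=1}^{k} \beta_i \, r_{x_i y}
$$

The standardized weights $\beta_i$ come straight from the inter-correlation matrix, and $R^2$ is the proportion of the variance of $y$ that all the measures account for together.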

For example, you might want to predict sales volume based on a lot of data-mined information about the population that includes your customers.

• For a sampled time period in which you know the sales volume, you gather the data you have in numeric form.
• You compute an inter-correlation matrix that includes the sales volume.
• From the inter-correlation matrix you derive a “multiple regression formula.” When you plug your measures into this formula, the result is as close as possible to the actual sales volume.
• How far off your predictions are, on average, from the actual sales volume is the amount of error in your prediction.
• That amount of error is the number you want to reduce the next time you do this. That’s where machine learning comes in.
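The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two measures (`ad_spend`, `visits`), the sales figures, and all function names are hypothetical, and a real project would normally use a statistics library rather than a hand-written solver.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    # Population standard deviation.
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def corr(xs, ys):
    # Pearson correlation between two equally long columns.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (std(xs) * std(ys))

def solve(a, b):
    # Solve the linear system a * w = b by Gauss-Jordan elimination
    # with partial pivoting (fine for a handful of measures).
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_from_correlations(columns, target):
    # Derive (intercept, weights) from the inter-correlation matrix.
    r_xx = [[corr(a, b) for b in columns] for a in columns]
    r_xy = [corr(a, target) for a in columns]
    betas = solve(r_xx, r_xy)  # standardized weights
    sy = std(target)
    weights = [beta * sy / std(c) for beta, c in zip(betas, columns)]
    intercept = mean(target) - sum(w * mean(c) for w, c in zip(weights, columns))
    return intercept, weights

def predict(intercept, weights, row):
    return intercept + sum(w * x for w, x in zip(weights, row))

# Hypothetical data for eight past periods with known sales volume.
ad_spend = [10, 12, 15, 18, 20, 24, 25, 30]
visits = [200, 180, 260, 240, 300, 310, 280, 360]
sales = [52, 53, 68, 70, 83, 91, 88, 108]

intercept, weights = fit_from_correlations([ad_spend, visits], sales)
predictions = [predict(intercept, weights, row) for row in zip(ad_spend, visits)]

# Average distance between prediction and actual sales: the error to reduce.
mean_abs_error = mean([abs(p - s) for p, s in zip(predictions, sales)])
print(f"weights={weights}, intercept={intercept:.2f}, error={mean_abs_error:.2f}")
```

Mean absolute error is used here as one simple reading of "how far off" the predictions are; squared-error measures are an equally common choice.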

What does machine learning do to help prediction?

Computers employed in machine learning operate according to certain models or algorithms. These include models that use the multiple regression technique and those which develop their own formulas of prediction in other ways.

• Regression (as in our example above).
• Decision trees (if-then scenarios).
• Bayesian methods (probabilistic, using known statistical distributions).
• Neural networks (fitted by iterative optimization rather than by a closed-form statistical formula).

You split your data into three sets to train and test the machine learning model.

1. A training set consisting of the majority (60 to 80 percent) of the preliminary data you collected. This data is used to train the machine.
2. A validation set consisting of 10 to 20 percent of the data collected. You use this data to see how accurate your predictions are after the machine is trained. You may try out different algorithms here.
3. A test set consisting of the remaining data, which is not used in training or validation. You use this data as a final test of the validity of your chosen algorithm.
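A three-way split like this might be sketched as follows. The function name `split_dataset`, the 70/15/15 proportions, and the fixed random seed are illustrative assumptions chosen from within the ranges given above, not a prescribed recipe.

```python
import random

def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle rows and split them into (train, validation, test) lists."""
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    shuffled = list(rows)
    # A seeded shuffle keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever is left over
    return train, validation, test

records = list(range(100))  # stand-in for 100 collected customer records
train, validation, test = split_dataset(records)
print(len(train), len(validation), len(test))
```

Shuffling before splitting avoids accidentally putting, say, only the oldest records in the training set. For data ordered in time, a chronological split may be more appropriate.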

Machine learning is an iterative process. You train the machine to make the best predictions it can using a particular algorithm. Then, based on the accuracy of its predictions, you retrain it, modifying the algorithm, its parameters, or the data you feed it. The goal is to approach the point where the machine predicts every sale (no misses, or false negatives) and never says a case will be a sale when it is not (no false positives).
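To make misses and false alarms concrete, here is a small sketch that counts them for a batch of predictions. The labels (1 for "sale", 0 for "no sale"), the sample data, and the helper names are assumptions made for illustration.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["tp"] += 1  # predicted a sale, and it was one
        elif p and not a:
            counts["fp"] += 1  # predicted a sale that never happened
        elif not p and a:
            counts["fn"] += 1  # missed a real sale
        else:
            counts["tn"] += 1  # correctly predicted no sale
    return counts

def recall(counts):
    # Share of real sales the model caught (1.0 means no false negatives).
    found = counts["tp"] + counts["fn"]
    return counts["tp"] / found if found else 0.0

def precision(counts):
    # Share of predicted sales that were real (1.0 means no false positives).
    flagged = counts["tp"] + counts["fp"]
    return counts["tp"] / flagged if flagged else 0.0

actual = [1, 1, 1, 0, 0, 0, 1, 0]     # what really happened
predicted = [1, 0, 1, 0, 1, 0, 1, 0]  # what the model said

counts = confusion_counts(actual, predicted)
print(counts, recall(counts), precision(counts))
```

In practice the two kinds of error usually trade off against each other, so each retraining round typically aims to improve one without giving up too much of the other.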

Once a satisfactory level of performance has been reached on the validation set, use the test set to assess the performance of the fully trained system on unseen data. If the test set performance is satisfactory, you have a computer model that can be used to predict the behavior of real customers with acceptable accuracy. 
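The division of labor between the validation set and the test set might look like the sketch below. The two candidate models are hypothetical stand-ins for already-trained algorithms, and the data points are invented. What matters is that the choice is made on validation data and the test set is consulted only once, at the end.

```python
def mean_abs_error(model, rows):
    """Average absolute gap between model(x) and the known value y."""
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)

def select_and_test(candidates, validation, test):
    """Pick the candidate with the lowest validation error, then score it on test."""
    best_name = min(
        candidates,
        key=lambda name: mean_abs_error(candidates[name], validation),
    )
    return best_name, mean_abs_error(candidates[best_name], test)

# Hypothetical trained candidates mapping one measure to predicted sales.
candidates = {
    "flat": lambda x: 60.0,
    "linear": lambda x: 10.0 + 3.0 * x,
}

# Invented (measure, actual sales) pairs held out from training.
validation = [(10, 41), (20, 69), (30, 102)]
test = [(15, 56), (25, 84)]

best_name, test_error = select_and_test(candidates, validation, test)
print(best_name, test_error)
```

Because the test rows played no part in choosing the model, the test error is a fairer estimate of how it will do on real customers than the validation error is.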