My experiences with Amazon Machine Learning
One of the things I like most at Qualogy is the exposure I get to all kinds of different technologies. As a data scientist, my job involves hands-on experience at many levels: database systems, programming, statistics, data mining, machine learning, visualization, and so on.
Given that technology evolves continuously and new products appear on a daily basis, keeping up with the latest developments is a must. In that context, I would like to share my experience with the Amazon Machine Learning (AML) platform. Amazon Machine Learning is a service that makes it easy for developers of all skill levels to use machine learning technology: without having to write code or manage any infrastructure, you can make predictions simply by calling APIs. Below, starting from scratch, I describe my experience with AML.
We needed a dataset
Initially, we needed a dataset for our experiments that was relatively large but otherwise straightforward. Open datasets are available in numerous repositories, and looking around the UCI Machine Learning Repository, we ended up with the Bank Marketing dataset, which seemed to fit our needs.
This dataset is related to the direct marketing campaigns of a Portuguese banking institution. The story behind it is that one or more phone calls were made to each client, and the goal is to predict whether or not the client will buy the product. There are 45,211 entries with 20 input variables (numeric and categorical), such as age, job and marital status, and the output variable is binary. More details about this dataset can be found here.
Uploading and defining the schema
The dataset we downloaded from UCI is in CSV format, and we needed to upload it to Amazon. As with other AWS services, storage is handled by S3 (Simple Storage Service): we created a bucket and uploaded our file into it. The upload itself was straightforward (be careful to use a truly comma-separated file, not a semicolon-separated one).
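As a rough sketch of this preparation step, the snippet below rewrites a semicolon-separated file as a true CSV and uploads it with boto3. The bucket and file names are placeholders, not the ones we used; the upload call requires configured AWS credentials.

```python
import csv


def normalize_delimiter(src_path: str, dst_path: str) -> None:
    """Rewrite a semicolon-separated file (as the UCI download may be)
    into a genuinely comma-separated CSV, which AML expects."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=";")
        writer = csv.writer(dst)
        for row in reader:
            writer.writerow(row)


def upload_dataset(path: str, bucket: str, key: str) -> None:
    """Upload the cleaned CSV to an S3 bucket (placeholder names)."""
    import boto3  # deferred import: needs boto3 installed and AWS credentials

    s3 = boto3.client("s3")
    s3.upload_file(path, bucket, key)  # e.g. bucket="my-aml-demo", key="bank-full.csv"
```

In practice we did the upload through the S3 web console; the API route is equivalent.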
Once uploaded, you can create a datasource from the CSV file. The datasource is an additional abstraction layer: multiple datasources can refer to (parts of) the same CSV file, and a datasource is required in order to create a machine learning model. Amazon automatically tries to infer the schema of the data (the data type of each attribute), although the types can be adjusted manually, with binary, categorical, numeric and text as the possible options.
Finally, you must designate the target variable by marking the corresponding attribute for prediction. After these first steps, Amazon takes over. The first thing you see is a small summary of the data, such as the distribution of the target attribute, the data types, and whether there are any missing values in the dataset.
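We did all of this through the console, but the same datasource-plus-schema step can be done via the AML API. Below is a hedged sketch: the schema shows only a few of the attributes for illustration, the attribute names follow the UCI file, and the IDs and bucket name are placeholders.

```python
import json

# Illustrative fragment of an AML schema for the banking data: only a handful
# of the 20 attributes are shown, and "y" is the binary target column.
BANK_SCHEMA = {
    "version": "1.0",
    "targetAttributeName": "y",
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "attributes": [
        {"attributeName": "age",     "attributeType": "NUMERIC"},
        {"attributeName": "job",     "attributeType": "CATEGORICAL"},
        {"attributeName": "marital", "attributeType": "CATEGORICAL"},
        {"attributeName": "y",       "attributeType": "BINARY"},  # target
    ],
}


def create_datasource(bucket: str, key: str, datasource_id: str):
    """Create an AML datasource over the uploaded CSV (placeholder IDs)."""
    import boto3  # deferred import: needs boto3 and AWS credentials to run

    ml = boto3.client("machinelearning")
    return ml.create_data_source_from_s3(
        DataSourceId=datasource_id,
        DataSpec={
            "DataLocationS3": f"s3://{bucket}/{key}",
            "DataSchema": json.dumps(BANK_SCHEMA),
        },
        ComputeStatistics=True,  # produces the data summary mentioned above
    )
```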
Modeling and evaluation
We were then ready to create our model. Here we had two options: create it manually or automatically. In the automatic case, Amazon tries to find the most suitable parameters for the problem; in the manual case, you are free to customize the parameters yourself.
When you choose to customize the settings, you will be prompted to define a recipe: a series of transformations that may be applied to your data to improve the machine learning process. Among other things, you can fine-tune the speed and accuracy of the model, which includes the maximum number of passes over the data that Amazon is allowed to make, and various regularization types that may be used to avoid overfitting.
In manual mode, the options were already set to the defaults (the ones used in automatic mode), so unless you want to experiment with them there is little reason to change anything. For that reason, we let Amazon create the model for us automatically. The moment the model was created, we could start the evaluation phase.
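For completeness, here is a sketch of what the manual route looks like through the API: a binary model with explicit training parameters, including the maximum number of passes and the L2 regularization mentioned above. The parameter keys are AML's documented SGD settings; the specific values and IDs are placeholders, not what Amazon chose for us.

```python
# Illustrative training parameters for an AML model (values are placeholders).
TRAINING_PARAMETERS = {
    "sgd.maxPasses": "10",                 # maximum passes over the data
    "sgd.l2RegularizationAmount": "1e-6",  # L2 regularization against overfitting
    "sgd.shuffleType": "auto",             # let AML shuffle between passes
}


def create_model(model_id: str, datasource_id: str):
    """Create a binary classification model from a datasource (placeholder IDs)."""
    import boto3  # deferred import: needs boto3 and AWS credentials to run

    ml = boto3.client("machinelearning")
    return ml.create_ml_model(
        MLModelId=model_id,
        MLModelType="BINARY",  # the banking target variable is yes/no
        Parameters=TRAINING_PARAMETERS,
        TrainingDataSourceId=datasource_id,
    )
```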
We were given two options for evaluating the model: either split the given datasource 70-30, or use a separate datasource for evaluation purposes. We chose the first option and, equipped with confidence and a cup of coffee, waited for the results.
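Conceptually, the first option amounts to the following holdout split, sketched here in plain Python (AML performs the split for you internally; this is just to make the 70-30 idea concrete):

```python
import random


def train_eval_split(rows, eval_fraction=0.30, seed=42):
    """Shuffle the rows and hold out eval_fraction of them for evaluation,
    mirroring the 70-30 split AML applies to a single datasource."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    cut = int(len(rows) * (1 - eval_fraction))
    return rows[:cut], rows[cut:]
```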
What are the results?
The training phase, which is the most expensive, took approximately 20 minutes. The results were then presented, together with an eye-catching interactive figure, shown below.
The accuracy is 91%, evaluated on the 30% holdout of the dataset, which corresponds to roughly 13,500 entries. Precision is 0.6887 and recall is 0.3668.
Amazon offers a very nice option to change the cut-off threshold: scores below the threshold are classified as 0, and scores above it as 1. This creates a trade-off between the two types of errors. Moving the threshold to the left yields more True Positives but also increases the False Positives; moving it to the right has the opposite effect. The figure below depicts this in more detail.
AUC: The striped areas indicate records for which the answer was predicted incorrectly based on the selected cutoff.
Threshold adjustment: The threshold can be adjusted to best fit the user's needs, by decreasing either FP or FN.
Our evaluation
In this section, we evaluate our experience with AML from several perspectives.
Do you want to know how Amazon predicts the result? Unfortunately, you cannot find out. The Amazon prediction models are opaque. Despite the various settings that can be changed, much remains a black box: you have no idea which machine learning algorithms are used (neural networks, SVMs, ...) or in what combination, and how the parameters of each classifier are chosen is a mystery. Even the precise form of evaluation is unclear: does it use cross-validation, and if so, with how many folds? The latter is especially worrying, because it matters if you want to compare AWS accurately against other machine learning solutions.
Shuffling of the data is not an option in AML. Without shuffled data, significant bias can be introduced (imagine a source file sorted by the target variable, so the evaluation holdout contains mostly one class). That is something everyone should take into account in order to avoid some frustration later on!
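A simple workaround is to shuffle the CSV yourself before uploading it. The sketch below keeps the header row in place and shuffles only the data rows; the fixed seed is just for reproducibility.

```python
import csv
import random


def shuffle_csv(src_path: str, dst_path: str, seed: int = 0) -> None:
    """Shuffle the data rows of a CSV (keeping the header first) before
    uploading, so a sorted source file does not bias the evaluation split."""
    with open(src_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    random.Random(seed).shuffle(rows)
    with open(dst_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```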
Let's take the time needed into consideration. Assuming that AML tries many (all?) statistical models, with automatic parameter tuning and cross-validation, and taking into account the large size of the banking dataset, it performed really fast: as mentioned, the training phase took approximately 20 minutes.
Another plus is the user-friendly interface of the Amazon platform. It took us about an hour to get our first results; the platform is very well documented and offers guidance at every step. Charges: the costs seem low. We spent $4 in total while experimenting with two datasets several times each, and the banking dataset is fairly large by research standards.
Overall, we were impressed with the results so far, although further, more detailed comparisons with other machine learning solutions should be attempted. As future work, it is imperative to try different platforms, such as the Google Prediction API or the Microsoft Azure ML platform, and compare them against each other.