Background and Overview
At Unicon, our security operations (sec-ops) staff engages with project teams early to determine what support is needed throughout the project. Considerations and risk indicators for projects include the security classification of the data that will be handled, and the project activities which may include data analytics, software development, infrastructure design and deployment, and production operations.
We also regularly survey projects to assess risk levels and determine if extra security support might be needed. We wanted to build a tool to help with this analysis and quickly identify potential risks.
As a side goal, we also wanted to develop the tool on AWS services to gain additional experience with deploying ML/AI solutions on AWS (we already build and run learning analytics and course completion risk models on AWS services). This included not only implementing the tool, but understanding how to move from the data exploration/model development activities to production implementation.
In this case study/blog post, I'll describe how we rapidly developed and deployed a risk model using a common machine learning algorithm to estimate the risk of a project suffering a security incident. We took advantage of a number of AWS services to speed development as well as deploy the service at very low cost.
The approach is to follow a fairly standard process for developing analytics models, as illustrated below. The process is iterative, exploring the data with various visual and analytical techniques, as described in the following sections.
Fig. 1 - Machine Learning Development Process
Data Collection and Preparation
To start, we took project security survey data along with historical security incident data and used Jupyter notebooks to explore the data, do feature analysis to determine whether a viable model could be developed, identify the features and independent variables to be used, and experiment with various risk prediction models. AWS SageMaker also supports Jupyter notebooks as an algorithm development environment, so that seemed a good choice in which to do our exploration.
Initial Data Exploration
Since the output of the model would be binary (at risk, not at risk), we believed a Logistic Regression model would be applicable. First, we re-mapped the raw survey form data (from a Google form) into normalized variable ranges and modeled categorical inputs as multiple dichotomous independent variables. This was done in a utility Python program that pre-processes the CSV file exported from Google Forms. From there, the preprocessed CSV was read into a Jupyter notebook with Pandas and sliced across various independent variables. We produced multiple 2D and 3D scatter plots with Matplotlib to look for initial trends and relationships.
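As a rough illustration of that pre-processing step, here is a minimal sketch using only the Python standard library. The column names, the category values, and the team-size cap are hypothetical, since the actual survey fields are not shown in this post.

```python
import csv
import io

# Hypothetical column names and value ranges; the real survey fields differ.
TEAM_SIZE_MAX = 20  # assumed cap used to scale team size into [0, 1]
HOSTING_CATEGORIES = ["client", "unicon", "third_party"]  # assumed categories


def preprocess_row(row):
    """Map one raw survey row to normalized numeric features."""
    features = {
        # Scale a count into [0, 1], clipping values above the assumed cap.
        "team_size": min(int(row["team_size"]), TEAM_SIZE_MAX) / TEAM_SIZE_MAX,
        # Yes/No answers become 1/0.
        "prod_data_access": 1 if row["prod_data_access"].strip().lower() == "yes" else 0,
    }
    # One dichotomous (0/1) variable per category value.
    for category in HOSTING_CATEGORIES:
        features[f"hosting_{category}"] = 1 if row["hosting"] == category else 0
    return features


def preprocess_csv(text):
    """Read CSV text exported from the form and return a list of feature dicts."""
    return [preprocess_row(row) for row in csv.DictReader(io.StringIO(text))]


# Two made-up survey responses.
raw = "team_size,prod_data_access,hosting\n5,Yes,unicon\n40,No,client\n"
rows = preprocess_csv(raw)
print(rows[0])
```

Keeping every feature in the same 0 to 1 range matters for the distance-based techniques used later, since an unscaled count would otherwise dominate the binary variables.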
Fig. 2 - Scatter Plot Sample
The security incident data points (red) cluster in the upper right of these feature axes. Note: Since the independent variables are discrete, many of the data points are coincident.
A variety of K-Means clustering models were generated (using the sklearn Python package) and evaluated for accuracy in the Jupyter notebook. Recursive Feature Elimination (RFE) was also used to explore the feature space. K-Means is an "unsupervised" clustering algorithm, meaning that it does not need labeled examples or "ground truth" in advance. This makes K-Means useful for finding or confirming relationships between variables or features in your data, which is very helpful for initial data exploration. Once a candidate set of features looked promising, we analyzed it further with Principal Components Analysis (PCA) to see what the transformed feature space looked like, whether there was clear clustering in the transformed space, and how much variability the principal components accounted for. This further refined our feature selection and let us settle on a final set of features (production data access, team size, and Unicon operating the infrastructure) to develop the model.
Fig. 3 - Clustering Using K-Means
Clustering with various features and number of classes.
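The kind of exploration described above can be sketched as follows. This is an illustrative example on made-up data, not our notebook code: the feature values, the sample sizes, and the choice of k values are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Made-up data: columns stand in for (prod_data_access, team_size, unicon_operates),
# all scaled to roughly [0, 1].
low = rng.normal(loc=0.2, scale=0.05, size=(20, 3))
high = rng.normal(loc=0.8, scale=0.05, size=(6, 3))
X = np.vstack([low, high])
incident = np.array([0] * 20 + [1] * 6)  # 1 = project had a security incident

inertias = {}
for k in (2, 3, 4):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Inertia is the within-cluster sum of squared distances; lower means tighter clusters.
    inertias[k] = model.inertia_
    print(f"k={k} inertia={model.inertia_:.3f}")
    for cluster in range(k):
        # The incident labels are not used for fitting, only to inspect each cluster.
        members = incident[model.labels_ == cluster]
        print(f"  cluster {cluster}: {len(members)} projects, {members.sum()} with incidents")
```

If incidents concentrate in one cluster for a given feature set, that is a hint the features carry useful signal.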
Fig. 4 - Recursive Feature Elimination (RFE)
RFE also helps suggest features contributing most strongly.
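For readers who have not used RFE, here is a minimal sketch with sklearn on synthetic data. The feature names, the Logistic Regression estimator, and the number of features to keep are assumptions for illustration, not a record of our actual settings.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Hypothetical feature names; the last two are deliberately uninformative.
feature_names = ["prod_data_access", "team_size", "unicon_operates", "noise_a", "noise_b"]

X = rng.random((200, 5))
# Synthetic label driven only by the first three columns, so RFE has something to find.
y = (X[:, 0] + X[:, 1] + X[:, 2] > 1.5).astype(int)

# RFE repeatedly fits the estimator and drops the weakest feature until 3 remain.
selector = RFE(estimator=LogisticRegression(), n_features_to_select=3).fit(X, y)
for name, keep, rank in zip(feature_names, selector.support_, selector.ranking_):
    print(f"{name:18s} selected={keep} rank={rank}")
```

Features that are kept have rank 1; higher ranks were eliminated earlier.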
Fig. 5 - Plot of the first two PCA dimensions
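Checking how much variability the principal components account for takes only a few lines with sklearn. The sketch below uses made-up, correlated features as a stand-in for the survey data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Made-up data: three features that all follow one underlying factor plus small noise.
base = rng.random((30, 1))
X = np.hstack([base + rng.normal(scale=0.05, size=(30, 1)) for _ in range(3)])

pca = PCA(n_components=2).fit(X)
projected = pca.transform(X)  # coordinates in the transformed feature space

# Fraction of total variance captured by each retained component.
print("explained variance ratio:", pca.explained_variance_ratio_)
print("projected shape:", projected.shape)
```

When the first component captures most of the variance, as it does by construction here, a plot of the first two dimensions is a fair summary of the data.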
Once the features were identified, a Logistic Regression (LR) model was developed. Here, we ran into some problems. We do not yet have an extensive enough set of training or test data, and as a result the LR model currently produces relatively poor predictions. However, we were able to use the classifier derived from K-Means as a fairly effective classification model, so we went forward with that as our implementation approach. A second model, based on the Principal Components transformation with a threshold value on the first PCA dimension, was also developed. With our limited training data, we have a (very) biased error rate estimate on the order of 15-25%, which is enough for us to use the risk prediction as an additional guide in evaluating project risk.
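One common way to turn K-Means clusters into a classifier is to label each cluster with the majority class of its training members. The sketch below shows that idea on made-up data. It is an assumption about how such a classifier can be built, not a copy of our implementation, and the helper names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def fit_cluster_classifier(X, y, n_clusters=2, seed=0):
    """Fit K-Means, then label each cluster with the majority class of its members."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    cluster_to_label = {}
    for cluster in range(n_clusters):
        members = y[kmeans.labels_ == cluster]
        # Majority vote; an empty cluster defaults to "not at risk".
        cluster_to_label[cluster] = int(members.mean() >= 0.5) if len(members) else 0
    return kmeans, cluster_to_label


def predict_risk(kmeans, cluster_to_label, X):
    """Assign each row to its nearest cluster and return that cluster's label."""
    return np.array([cluster_to_label[int(c)] for c in kmeans.predict(X)])


rng = np.random.default_rng(3)
# Made-up, well-separated data: 20 projects without incidents, 6 with.
X = np.vstack([rng.normal(0.2, 0.05, (20, 3)), rng.normal(0.8, 0.05, (6, 3))])
y = np.array([0] * 20 + [1] * 6)

kmeans, mapping = fit_cluster_classifier(X, y)
predictions = predict_risk(kmeans, mapping, X)
error_rate = float((predictions != y).mean())
print(f"error rate on the training data: {error_rate:.2f}")
```

Note that the error rate here is computed on the same data used to fit the model, which is exactly why such an estimate is optimistically biased. Real survey data is far less cleanly separated than this toy example.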
To build the Lambda, we persisted the fitted sklearn model object to a file and uploaded it to S3. This lets the Lambda rehydrate the model object at run time and allows the model to be updated without rebuilding the Lambda. Because we needed modules not natively included in the AWS Lambda Python environment, we built a deployment package on an EC2 instance containing the Lambda handler code and the Python libraries (including the compiled libraries for NumPy, Sklearn, and other modules). The package was bundled as a zip file, copied to S3, and then deployed as a Lambda function.
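A handler along these lines might look like the sketch below. Several details are assumptions for illustration: that the model was persisted with pickle, the MODEL_BUCKET and MODEL_KEY environment variable names, the feature names and their order, a request body in the API Gateway proxy format, and a model object whose predict method returns 0 or 1.

```python
import json
import os
import pickle

# Hypothetical feature order; it must match the order used when fitting the model.
FEATURE_ORDER = ["prod_data_access", "team_size", "unicon_operates"]

_model = None  # cached so warm invocations skip the S3 download


def load_model():
    """Fetch and unpickle the model from S3 on first use."""
    global _model
    if _model is None:
        import boto3  # provided by the Lambda Python runtime

        s3 = boto3.client("s3")
        obj = s3.get_object(Bucket=os.environ["MODEL_BUCKET"], Key=os.environ["MODEL_KEY"])
        _model = pickle.loads(obj["Body"].read())
    return _model


def predict_from_event(event, model):
    """Parse the request body, run the model, and build the JSON response."""
    answers = json.loads(event["body"])
    row = [float(answers[name]) for name in FEATURE_ORDER]
    label = int(model.predict([row])[0])
    return {"statusCode": 200, "body": json.dumps({"at_risk": bool(label)})}


def handler(event, context):
    """Lambda entry point."""
    return predict_from_event(event, load_model())
```

Keeping the prediction logic in a separate function from the S3 loading makes it easy to unit test with a stand-in model. Replacing the file in S3 updates the model for new Lambda containers without touching the deployment package.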
The Lambda handler and unit test code, web assets, and supporting files such as the model files are managed in AWS CodeCommit. All of the build/test/package/deploy/test steps currently run as a manually triggered shell script that uses the AWS CLI (Command Line Interface) to minimize the manual effort needed to launch the app and update code. We will be looking to have AWS CodeCommit events launch an AWS CodePipeline so that deployment is fully automated.
Fig. 6 - AWS Services
Illustration of AWS services used to deploy the app and risk model.
Below are a couple of screen shots of our web app in action. Note that the whole form with all inputs is not shown.
Fig. 7 - Low Risk
App display showing results for project properties with low risk for a security incident.
Fig. 8 - High Risk
App display showing results for project properties with high risk for a security incident.
Having completed the project, we have a few take-aways and observations that are worth sharing:
- Although we were new to Jupyter notebooks, the combination of Pandas, NumPy, Sklearn, Matplotlib, and Jupyter had a fairly short learning curve, and it was easy to be productive quickly. These packages all have excellent documentation and examples available, and Stack Overflow almost always had answers when we were stuck. Annotating the work in the notebook and sharing it with others is a big plus too.
- Implementing all the AWS bits went smoothly. There are some subtleties in getting security and CORS set up properly between API Gateway and Lambda. It pays to really understand the API Gateway integration model, including when API Gateway can issue the proper CORS headers and when your Lambda function needs to take care of them. Although the need to produce a zip file with the needed Python libraries at first seemed like it might be one of those maddening exercises that takes hackery to make succeed, it was painless. Following the steps in the AWS documentation to create a deployment package worked perfectly using Python virtualenv as described.
- The combination of working in Jupyter and then implementing the model as a Python Lambda was much less painful than anticipated and feels like a highly productive environment. Where the model execution workload is substantially larger and lends itself to dedicated instances, we are eager to go down the full AWS SageMaker path.
- While a Logistic Regression model is a natural approach for binary classification problems, it didn't work for us, most likely because of the limited training data available. The lesson here is that you have to be open to what your data is telling you: your theory may simply not work, or you may need to try other approaches. We certainly want to collect more data over time, re-evaluate the features we have today, and do a more rigorous evaluation of the model prediction accuracy with independent test data. Ideally, we would also like to revisit the LR model with more data and see if we can resolve the issues with the LR results. At present though, the results are valuable enough to support the risk evaluations our security operations folks have with project teams.
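On the CORS point in the list above: with a Lambda proxy integration, API Gateway passes the function's response through as-is, so the function has to supply the CORS headers itself. The sketch below shows one way to do that. The allowed origin, the allowed methods, and the placeholder result are assumptions for illustration.

```python
import json

ALLOWED_ORIGIN = "https://example.com"  # hypothetical origin of the web app


def cors_response(status_code, payload):
    """Build a proxy-integration response that carries the CORS headers."""
    return {
        "statusCode": status_code,
        "headers": {
            "Access-Control-Allow-Origin": ALLOWED_ORIGIN,
            "Access-Control-Allow-Headers": "Content-Type",
            "Access-Control-Allow-Methods": "OPTIONS,POST",
        },
        "body": json.dumps(payload),
    }


def handler(event, context):
    # Answer the browser's preflight request if it is routed to the function.
    if event.get("httpMethod") == "OPTIONS":
        return cors_response(200, {})
    # Placeholder result; a real handler would run the risk model here.
    return cors_response(200, {"at_risk": False})
```

With a non-proxy integration, the same headers can instead be configured on the API Gateway method and integration responses, which is where the two models are easy to confuse.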
Notes: First, I don't want to misrepresent the quality of the risk model that we've developed. We really don't have enough data yet to be highly confident in the model or our model evaluation. As we gather more data over time, we will be re-evaluating the model and developing a more robust evaluation.
Secondly, my academic background was in "rocket science" with fairly extensive graduate course work and grad school research project work using a number of statistical techniques now referred to as "machine learning". The first third of my professional life was spent developing and implementing algorithms for image analysis, pattern recognition, and computer vision, as well as working with innovative PhDs (e.g. from the MIT Media Lab) developing and implementing novel approaches to computer vision applications. The point is that, while it is easy with current tools to train and run ML/AI models, experience in feature selection and in training and testing models is critical to meeting the objectives of an ML project. The ability to recognize when the available data do not support the goals is essential for any ML/AI project. Make sure your team has the variety of skills needed for the problem at hand and understands the consequences of a model not working at the needed accuracy levels.