

At Unicon, our security operations (sec-ops) staff engages with project teams early to determine what support is needed throughout the project. Considerations and risk indicators for projects include the security classification of the data that will be handled and the project activities, which may include data analytics, software development, infrastructure design and deployment, and production operations.
We also regularly survey projects to assess risk levels and determine if extra security support might be needed. We wanted to build a tool to help with this analysis and quickly identify potential risks.
As a side goal, we also wanted to develop the tool on AWS services to gain additional experience with deploying ML/AI solutions on AWS (we already build and run learning analytics and course completion risk models on AWS services). This included not only implementing the tool, but also understanding how to move from data exploration and model development activities to a production implementation.
In this case study/blog post, I'll describe how we rapidly developed and deployed a risk model using a common machine learning algorithm to estimate the risk of a project suffering a security incident. We took advantage of a number of AWS services to speed development as well as deploy the service at very low cost.
The approach is to follow a fairly standard process for developing analytics models, as illustrated below. The process is iterative, exploring the data with various visual and analytical techniques, as described in the following sections.
To start, we took project security survey data along with historical security incident data and used Jupyter notebooks to explore the data, do feature analysis to determine whether a viable model could be developed, identify the features and independent variables to be used, and experiment with various risk prediction models. AWS SageMaker also supports Jupyter notebooks as an algorithm development environment, so that seemed a good choice for our exploration.
Since the output of the model would be binary (at risk, not at risk), we believed a Logistic Regression model would be applicable. First, we re-mapped the raw survey form data (from a Google form) into normalized variable ranges and modeled categorical inputs as multiple dichotomous independent variables. This was done in a utility Python program that pre-processes the CSV file exported from Google Forms. From there, the pre-processed CSV was read into a Jupyter notebook with Pandas and sliced across various independent variables, with multiple 2D and 3D scatter plots produced using Matplotlib to look for initial trends and relationships.
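As a rough illustration, the pre-processing and exploration looked something like the sketch below. The column names (prod_data_access, team_size, had_incident) and file path are placeholders, not our actual survey fields:

```python
# Sketch of the pre-processing and exploration step (column names are hypothetical)
import pandas as pd
import matplotlib.pyplot as plt

# Read the CSV exported from Google Forms (path is illustrative)
df = pd.read_csv("survey_export.csv")

# Expand a categorical survey question into dichotomous (0/1) variables
df = pd.get_dummies(df, columns=["prod_data_access"])

# Normalize a numeric survey question into the 0-1 range
df["team_size_norm"] = df["team_size"] / df["team_size"].max()

# Quick 2D scatter plot to look for trends between candidate variables
df.plot.scatter(x="team_size_norm", y="had_incident")
plt.show()
```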
A variety of K-Means clustering models were generated (using the scikit-learn Python package) and evaluated for accuracy in the Jupyter notebook. Recursive Feature Elimination (RFE) was also used to explore the feature space. K-Means is an "unsupervised" clustering algorithm, meaning that you do not have to know representative data or "ground truth" in advance. This makes K-Means useful for finding or confirming relationships between variables or features in your data, which is very useful for initial data exploration. Once a candidate set of features looked promising, these were further analyzed with Principal Components Analysis (PCA) to see what the transformed feature space looked like, whether there was clear clustering in the transformed space, and how much variability the principal components accounted for. This further refined our feature selection and let us settle on a final set of features (production data access, team size, and whether Unicon operates the infrastructure) to develop the model.
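Continuing the sketch above, the K-Means and PCA exploration with scikit-learn might look roughly like this (the feature column names and cluster count are illustrative):

```python
# Sketch of the K-Means and PCA exploration (feature names are illustrative)
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

features = ["prod_data_access_Yes", "team_size_norm", "unicon_operates_infra"]
X = df[features].values

# Unsupervised clustering: no labels required, just look for structure
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.labels_)

# Project into principal-component space and check how much variance
# the leading components explain
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)
X_pca = pca.transform(X)
```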
Once the features were identified, a Logistic Regression model was developed. Here we ran into a problem: we do not yet have an extensive enough set of training or test data, and as a result the Logistic Regression model is currently producing relatively poor predictions. However, the K-Means classifier turned out to be fairly effective, so we went forward with that as our implementation approach. A second model, based on applying a threshold to the first dimension of the Principal Components transformation, was also developed. With our limited training data, we have a (very) biased error rate estimate on the order of 15-25%, which is enough for us to use the risk prediction as an additional guide in evaluating project risk.
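For completeness, here is a sketch of the two candidate models, again continuing from the snippets above (the label column and the PCA threshold value are placeholders):

```python
# Sketch of the two candidate models (label column and threshold are placeholders)
from sklearn.linear_model import LogisticRegression

y = df["had_incident"].values   # 1 = project had a security incident, 0 = did not

# Logistic Regression -- to be revisited once we have more training data
lr = LogisticRegression().fit(X, y)
print(lr.score(X, y))           # biased estimate: same data used for fit and score

# Alternative model: threshold on the first principal component
first_pc = pca.transform(X)[:, 0]
at_risk = (first_pc > 0.5).astype(int)   # 0.5 is an illustrative threshold
```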
With the model developed, we proceeded to an implementation plan. Although SageMaker was one of the initial options, given how infrequently the model would be run, we didn't need or want to pay for EC2 instances spun up to host it. Instead, we decided to implement the model evaluation as an AWS Lambda function. Since Lambda supports Python, we had a ready path from the model developed in Python to a Lambda implementation. The tool would be a simple web app that makes a request to an API Gateway endpoint, which forwards the request to the Lambda function to evaluate the model and return a risk score for presentation in the web app. The HTML and JavaScript (and other static resources) would be served out of S3 configured as a static web site with public access.
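A minimal sketch of what such a Lambda handler behind API Gateway could look like (the request field names, response shape, and the model_predict helper are all hypothetical, not our production code):

```python
# Sketch of a Lambda handler behind API Gateway (field names are hypothetical)
import json

def lambda_handler(event, context):
    # With the API Gateway proxy integration, the POST body arrives as a JSON string
    body = json.loads(event.get("body") or "{}")
    features = [
        body.get("prod_data_access", 0),
        body.get("team_size_norm", 0),
        body.get("unicon_operates_infra", 0),
    ]
    score = model_predict(features)  # model_predict is sketched further below
    return {
        "statusCode": 200,
        # CORS header so the S3-hosted web app can call the API
        "headers": {"Access-Control-Allow-Origin": "*"},
        "body": json.dumps({"risk_score": score}),
    }
```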
To build the Lambda, we persisted the fitted sklearn model object to a file and uploaded it to S3, so that the Lambda could rehydrate the model object and the model could be updated without rebuilding the Lambda. Because we needed modules not natively included in the AWS Lambda Python environment, we built a deployment package on an EC2 instance containing the Lambda handler code and the Python libraries (including the compiled libraries for NumPy, scikit-learn, and other modules), bundled it up as a zip file, copied it to S3, and then deployed it as a Lambda function.
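Assuming joblib for serialization (pickle would work equally well), rehydrating the model inside the Lambda might look roughly like this; the bucket and key names are placeholders:

```python
# Sketch of rehydrating the persisted sklearn model from S3 (names are placeholders)
import boto3
import joblib

s3 = boto3.client("s3")

def load_model(bucket="example-model-bucket", key="risk_model.joblib"):
    local_path = "/tmp/model.joblib"          # Lambda can only write under /tmp
    s3.download_file(bucket, key, local_path)
    return joblib.load(local_path)

# Load once per container (outside the handler) so warm invocations reuse it
model = load_model()

def model_predict(features):
    # e.g., treat membership in the "at risk" K-Means cluster as the risk score
    return int(model.predict([features])[0])
```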
The Lambda handler and unit test code, web assets, and supporting files such as the model files are managed in AWS CodeCommit. All of the build/test/package/deploy/test code currently runs as a manually triggered shell script that uses the AWS CLI (Command Line Interface) to minimize the manual effort of launching and updating the code, but we will be looking to have AWS CodeCommit launch an AWS CodePipeline so that deployment is fully automated based on triggering events.
Below are a couple of screen shots of our web app in action. Note that the whole form with all inputs is not shown.
Having completed the project, we have a few take-aways and observations that are worth sharing:
First, I don't want to misrepresent the quality of the risk model that we've developed. We really don't have enough data yet to be highly confident in the model or our model evaluation. As we gather more data over time, we will be re-evaluating the model and developing a more robust evaluation.
Secondly, my academic background was in "rocket science," with fairly extensive graduate course work and grad school research using a number of statistical techniques now referred to as "machine learning". The first third of my professional life was spent developing and implementing algorithms for image analysis, pattern recognition, and computer vision, as well as working with innovative PhDs (e.g., from the MIT Media Lab) developing and implementing novel approaches to computer vision applications. The point being: while current tools make it easy to train and run ML/AI models, experience in feature selection and knowing how to train and test models are critical to meeting the objectives of an ML project. The ability to recognize when the available data do not support the goals is essential for any ML/AI project. Make sure your team has the variety of skills the problem at hand requires and understands the consequences of a model not working at the needed accuracy levels.