During a recent webinar I heard the presenter say that "when people talk about big data what they really mean is Hadoop." I'm not sure that is entirely accurate, but I do understand where the presenter was coming from - most big data projects use Hadoop in some capacity. We see that here at Unicon in our Learning Analytics practice as well. All of our large-scale learning analytics projects use Hadoop or Hadoop ecosystem components.
Of course, there is a reason people use Hadoop for big data projects. Generally speaking, for a wide variety of use cases, Hadoop is the right tool for the job. However, the processing capabilities and tooling that come with Hadoop are not free; they come at the cost of administration and support. Hadoop, as they say, is a fickle beast. Running your own Hadoop cluster, whether on-premises or in the cloud, is a challenge, and you'll need people with a wide range of skills to meet that challenge.
As usage of Hadoop has grown, finding people with the necessary skills to run a Hadoop cluster has become increasingly difficult. This is especially true in academic environments, where resources are already constrained. Due to those constraints, we often see software developers or LMS administrators tapped to manage a Hadoop cluster on top of their existing workload. This strategy may work in the short term, but as usage of the cluster grows, maintenance and administration tasks begin to resemble a full-time job. Eventually organizations are left with solid technical infrastructure but not enough resources to support it.
As organizations try to find ways to leverage the power of Hadoop in a resource-constrained environment, a trend that we see emerging is a move toward Hadoop "on-demand" or Hadoop-as-a-service offerings such as Amazon Elastic MapReduce (EMR). Amazon EMR is a managed Hadoop framework that is a core component of Amazon Web Services' Big Data product suite. When paired with other AWS services such as Data Pipeline and S3, EMR allows you to create workflows that dynamically provision a Hadoop cluster, execute your Hadoop workload, and terminate the cluster when the work is done. This flexibility allows you to reduce (but not eliminate) the maintenance and administration costs of your Hadoop infrastructure compared to running your own Hadoop cluster 24/7. In addition to the potential for fewer maintenance and administration headaches, here are some other reasons to consider a move to Hadoop-as-a-service with Amazon EMR:
- Underutilization: In cases where a cluster is underutilized (maybe you are running a job or two a day or even a week) Amazon EMR makes sense because it allows you to pay only for the capacity and resources that you are using.
- Tight integration with other AWS offerings: If you are already "all-in" on AWS, using EMR allows you to easily integrate with other AWS services you are likely already using. EMR is tightly integrated with, and specially tuned to work with, S3 through AWS-proprietary libraries. EMR also works seamlessly with other Amazon services, including Amazon Redshift and Amazon Kinesis.
- Auto-scaling: Besides the master node, instances in an EMR cluster fall into two types: core nodes and task nodes. Core nodes also store HDFS data, while task nodes, as the name implies, only process your Hadoop workloads, so they can be easily added or removed as the demand for processing power changes. You can also save money by running task nodes on Spot Instances.
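To make the provision, run, and terminate pattern more concrete, here is a minimal sketch in Python. It only builds the keyword arguments that boto3's EMR `run_job_flow` call accepts; it does not contact AWS. The helper name `build_transient_cluster_request`, the bucket paths, the instance types and counts, and the release label are all hypothetical placeholders for illustration, not a recommended configuration.

```python
# Sketch only: builds kwargs for boto3's EMR `run_job_flow` call.
# Names, S3 paths, instance types, counts and the release label are
# hypothetical placeholders chosen for illustration.

def build_transient_cluster_request(name, job_jar, job_args, log_uri,
                                    core_count=2, task_count=4):
    """Describe a cluster that runs one step and then shuts itself down."""
    instance_groups = [
        # One master node coordinates the cluster.
        {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": 1},
        # Core nodes run tasks and also hold HDFS data.
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes only add processing power, so Spot capacity is a
        # common fit for them.
        instance_groups.append(
            {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": task_count})

    return {
        "Name": name,
        "LogUri": log_uri,  # cluster logs are written to S3
        "ReleaseLabel": "emr-5.36.0",  # assumption: pick a release you support
        "Instances": {
            "InstanceGroups": instance_groups,
            # False means the cluster terminates once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [{
            "Name": name + "-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {"Jar": job_jar, "Args": list(job_args)},
        }],
        # Default role names created by EMR; yours may differ.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_transient_cluster_request(
    name="nightly-analytics",
    job_jar="s3://example-bucket/jobs/analytics.jar",
    job_args=["s3://example-bucket/input/", "s3://example-bucket/output/"],
    log_uri="s3://example-bucket/logs/",
)
```

With boto3 installed and AWS credentials configured, a request like this could be submitted with `boto3.client("emr").run_job_flow(**request)`. Setting `KeepJobFlowAliveWhenNoSteps` to `False` is what makes the cluster transient, so you pay only while the job runs.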
Of course, there are trade-offs in choosing AWS EMR over managing your own Hadoop cluster. First and foremost, you can only use the Hadoop distributions that EMR supports. Many popular Hadoop distributions, such as Cloudera and Hortonworks, are not available with EMR. Troubleshooting can also be a challenge with AWS EMR. While you do have root access to cluster instances and log files are available on S3, many of the management and troubleshooting tools that are available with other Hadoop distributions are not yet available with EMR. Finally, EMR is likely not the right choice if you need to customize your Hadoop cluster extensively or if you need a longer-running cluster that is accessible to users such as data analysts.
As that webinar presenter suggested, if you are thinking about big data projects you are probably also thinking about Hadoop. Hopefully this blog post gave you some insight into whether the Hadoop-as-a-service option is right for you. If you would like a partner to help you determine whether AWS EMR is right for you, or if you have other Amazon Web Services questions, Unicon can help. Contact us to set up a discussion with one of our AWS technologists.