Simplify your Spark application dependency management with Docker and Hadoop 3 with EMR 6.0.0 (Beta)

Today, PySpark and SparkR users must install their dependencies on each host in a cluster. As a result, teams operating multi-tenant clusters struggle to manage dependencies and keep library versions consistent across hosts, which limits developer productivity, increases the time spent preparing a cluster for use, and adds complexity to cluster upgrades.

Using Hadoop 3, Docker, and EMR, Spark users no longer have to install library dependencies on individual cluster hosts, and application dependencies can now be scoped to individual Spark applications. This is achieved by running Spark applications in Docker containers instead of directly on EMR cluster hosts. To use Docker with your Spark application, reference the name of the Docker image when submitting jobs to an EMR cluster. YARN, running on the EMR cluster, will automatically retrieve the image from Docker Hub or Amazon ECR and run your application inside it. You can use Docker images to package your own library dependencies, and you can even run containers with different versions of R and Python on the same cluster.
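As a sketch of what a submission looks like, the Docker image is passed to YARN through Spark configuration properties at spark-submit time. The registry URI, image name, and script name below are placeholders for illustration; the YARN_CONTAINER_RUNTIME_* properties are the Hadoop 3 settings that select the Docker container runtime for the application master and executors.

```shell
# Placeholder image URI in Amazon ECR -- substitute your own
# account ID, region, repository, and tag.
DOCKER_IMAGE=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-repo:pyspark-example

# Run both the YARN application master and the Spark executors
# inside containers built from that image.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE \
  my_pyspark_job.py
```

Because the image is specified per submission, two jobs on the same cluster can reference different images, each packaging its own Python or R library versions.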

Also included in the EMR 6.0.0 (Beta) release is support for Amazon Linux 2 and Amazon Corretto JDK 8. Amazon Linux 2 is the latest generation of the Amazon Linux server operating system, providing new system tools like the systemd init system and the performance-tuned Amazon Linux LTS kernel. Amazon Corretto JDK 8 is a certified, Java SE-compatible JDK that includes long-term support, performance enhancements, and security fixes.

You can stay up to date on EMR releases by subscribing to the EMR release notes feed. Use the icon at the top of the EMR Release Guide to link the feed URL directly to your favorite feed reader.



https://aws.amazon.com/about-aws/whats-new/2019/09/simplify-your-spark-application-dependency-management-with-docker-and-hadoop-3-with-emr-6-0-0-beta/