HOW TO: Set Up EMR Clusters with EC2 Spot Instances (2025)
AWS EMR is a fully managed big data platform. It handles the heavy lifting for you, so you can focus on running and scaling Apache Hadoop, Apache Spark, Apache Hive, Trino, and other open source frameworks on AWS. EMR can operate in several modes: EMR on EC2, EMR on EKS, and EMR Serverless. Here, we will focus on the classic EMR on EC2 setup. One key benefit of EMR on EC2 is the ability to use AWS EC2 Spot Instances, which can reduce compute costs by up to ~90% compared to On-Demand pricing. AWS also reports that less than 5% of Spot workloads are interrupted, making Spot a low-risk choice for most batch jobs.
In this article, we will cover everything you need to set up EMR Spot Instances, including when to use Spot versus On-Demand instances, how to configure EMR instance fleets for maximum cost savings, and troubleshooting common issues.
Step-by-Step Guide to Set Up AWS EMR Spot Instances
Prerequisites
First things first, get all the essentials ready before you start setting up EMR Spot Instances.
- AWS Account & Permissions — You need an AWS account with permissions to create EMR clusters, AWS EC2 instances, AWS S3 buckets, VPC resources, and AWS IAM roles.
- AWS S3 Bucket — Create an AWS S3 bucket in the same region to hold your input data, Spark code, and EMR log files.
- AWS IAM Roles — EMR requires specific service roles for the EMR service itself and for the AWS EC2 instances within the cluster. You can use AWS's default roles or create more granular, scoped AWS IAM roles.
- AWS VPC and AWS Subnets — Make sure you have a VPC (you can use the default VPC) with at least one subnet. Public subnets work for testing, but production clusters often use private subnets with a NAT gateway.
- Console Familiarity — You should know how to navigate the AWS Management Console and understand basic EMR/Spark concepts.
With those in place, let’s log in and start setting up EMR Spot Instances.
Step 1—Sign in to AWS Management Console
First things first, log in to the AWS Management Console and make sure you’re in the AWS region you want to use.
Step 2—Set Up an AWS S3 Bucket
In the AWS Console, search for "S3" and navigate to the AWS S3 service. Click "Create bucket". Give your bucket a unique name (bucket names must be globally unique across all AWS accounts). Select the same region where your EMR cluster will run, which minimizes latency and keeps data transfer local.
It is recommended to enable Bucket Versioning and configure Server-Side Encryption if your use case requires them for data protection and recovery. It's also good practice to create a few folders inside your bucket to keep data organized (logs/, data/, scripts/).
For a more in-depth setup guide, see setting up an AWS S3 bucket.
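If you'd rather script this step, here is a minimal boto3 sketch; the bucket name and region below are placeholders (bucket names must be globally unique):

```python
import boto3

REGION = "us-east-1"                      # assumption: pick your own region
BUCKET = "my-emr-spot-demo-bucket-12345"  # assumption: must be globally unique

s3 = boto3.client("s3", region_name=REGION)

# us-east-1 is the one region where CreateBucket must NOT set a LocationConstraint
if REGION == "us-east-1":
    s3.create_bucket(Bucket=BUCKET)
else:
    s3.create_bucket(
        Bucket=BUCKET,
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )

# Optional: enable versioning for recovery
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# "Folders" in S3 are just key prefixes; create zero-byte marker objects
for prefix in ("logs/", "data/", "scripts/"):
    s3.put_object(Bucket=BUCKET, Key=prefix)
```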
Step 3—Set Up AWS VPC for Network Isolation
To set up your AWS VPC, go to the AWS VPC Dashboard in the AWS Management Console. Click "Create VPC". Provide a name for your AWS VPC and select a CIDR block (e.g., 10.0.0.0/16).
You will also need to set up public and private subnets. Generally, if your EMR master node requires internet access (for downloading libraries), place it in a public subnet. However, private AWS subnets offer better security for backend resources by restricting direct internet access.
For a more in-depth setup guide, see setting up AWS VPC for Network Isolation.
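The same setup can be scripted with boto3; a rough sketch, with illustrative CIDR blocks and Availability Zone names:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create the VPC with a /16 CIDR block
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]
ec2.create_tags(Resources=[vpc_id], Tags=[{"Key": "Name", "Value": "emr-spot-vpc"}])

# One subnet per Availability Zone so instance fleets can spread across AZs
subnet_a = ec2.create_subnet(
    VpcId=vpc_id, CidrBlock="10.0.1.0/24", AvailabilityZone="us-east-1a"
)
subnet_b = ec2.create_subnet(
    VpcId=vpc_id, CidrBlock="10.0.2.0/24", AvailabilityZone="us-east-1b"
)

# For a public subnet, attach an Internet Gateway (a default route via a
# route table is still needed; omitted here for brevity)
igw = ec2.create_internet_gateway()
igw_id = igw["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)
```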
Step 4—Open AWS EMR Console
Next, navigate to the AWS console's services search box, type "EMR", and select "AWS EMR".
On the EMR page, under “EMR on EC2”, click Clusters and then Create cluster.
This action opens the AWS EMR cluster creation wizard, where you will configure your cluster step by step.
Step 5—Set Cluster Name, Release Version, and Applications
In the EMR cluster creation wizard, first give your cluster a name (e.g., MySpotCluster). Then, select an EMR release version (emr-7.x, which typically includes Spark 3.x and Hadoop 3.x). Under "Applications", select the big data tools you need (Spark, Hadoop, Hive). Make sure you select all necessary libraries and applications for your jobs.
You can also add Steps (jobs) to run automatically upon cluster launch. (This is optional; you can skip this and submit jobs later to a running cluster).
Note: A cluster name can’t contain characters like <, >, $, |, or backtick.
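If you script cluster creation, these choices map to the Name, ReleaseLabel, and Applications parameters of the RunJobFlow API. A small sketch (the release label is an example; check the console for the versions currently offered in your region):

```python
# Assumption: emr-7.2.0 is a stand-in for whatever emr-7.x release you choose
cluster_config = {
    "Name": "MySpotCluster",
    "ReleaseLabel": "emr-7.2.0",
    "Applications": [
        {"Name": "Spark"},
        {"Name": "Hadoop"},
        {"Name": "Hive"},
    ],
}
```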
Step 6—Configure EMR Instance Fleets and Multiple AZs
Immediately below the "Name and applications" section, you will find the "Cluster configuration" section. To select EC2 instance types, switch from the default "Uniform instance groups" to "Flexible instance fleets". Instance fleets provide the most flexibility with Spot Instances: you can specify multiple instance types per node type and distribute capacity between On-Demand and Spot Instances. In the console, you can add up to 5 instance types per fleet (EMR supports up to 30 types via the CLI).
You will then specify a fleet each for the Master (primary), Core, and optionally Task nodes. Fleets enable you to mix different AWS EC2 instance types to meet your capacity targets efficiently.
It's recommended to select "Use high availability" to launch the cluster with three primary nodes. This configuration applies for the lifetime of your cluster and uses On-Demand instances for improved resiliency.
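If you script your cluster instead of using the wizard, these fleet choices map to the InstanceFleets parameter of RunJobFlow. A minimal sketch for the primary and core fleets, assuming On-Demand capacity and illustrative instance types:

```python
# Primary and core fleets on On-Demand capacity (instance types are examples)
primary_and_core_fleets = [
    {
        "Name": "PrimaryFleet",
        "InstanceFleetType": "MASTER",
        "TargetOnDemandCapacity": 1,
        "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
    },
    {
        "Name": "CoreFleet",
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": 2,
        "InstanceTypeConfigs": [
            {"InstanceType": "m5.xlarge"},
            {"InstanceType": "m5a.xlarge"},
            {"InstanceType": "m6i.xlarge"},
        ],
    },
]
```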
Step 7—Configure the EMR Master (Primary) Node
The Master (Primary) node is the central component of the EMR cluster; it controls and directs the cluster's operations, and if it terminates, the entire cluster shuts down. For that reason, it is generally not recommended to run the primary node on a Spot Instance. The exceptions are short-lived, fault-tolerant tests where sudden termination is acceptable, or clusters that save all critical data to an external store like AWS S3 and where cost matters more than uninterrupted operation. AWS explicitly advises against running the master node on Spot for production workloads because an interruption terminates the cluster.
When you request the primary instance as a Spot Instance, the cluster won't launch until that Spot request is fulfilled, which is something to consider when setting your maximum Spot price.
Also note that you can only choose a Spot primary node when you create the cluster, not while it's running. In practice, you'd only use a Spot primary node if the whole cluster runs on Spot Instances.
For the EMR master node, pick an instance type that fits your needs. A moderate instance (like m5.xlarge) usually works fine, unless the master hosts heavyweight services such as a Hive Metastore, a Spark driver running in client mode, or EMR Studio workloads. If your cluster needs high availability for the master, consider EMR's multi-master mode (three masters) for failover.
Step 8—Configure Core Nodes (Data Nodes)
Core nodes process data and store information in HDFS. Losing a core node can mean losing the HDFS data stored on it. Only launch core nodes as Spot Instances if some HDFS data loss is acceptable.
When creating a core instance group with Spot Instances, AWS EMR pauses the cluster launch until all requested instances become available. If, for example, you request six EC2 instances but only five are available at or below your maximum Spot price, the group will not launch; EMR keeps waiting for the sixth instance, or until you terminate the cluster. You can also modify the number of Spot Instances in a running core instance group to add capacity.
Note: a lost core node can mean lost HDFS data if not replicated. Only use Spot for core nodes if you can tolerate partial data loss or rely on AWS S3/EMRFS for persistence.
Step 9—Configure Task Nodes (100% Spot Instances)
Task nodes process data but do not store persistent data in HDFS. If a task node terminates because of a Spot interruption, no data is lost and the impact on your cluster is minimal. This makes task nodes ideal candidates for 100% Spot Instances.
When you launch one or more task instance groups as Spot Instances, AWS EMR provisions as many task nodes as it can at or below your maximum Spot price. If you request a task instance group with six nodes and only five Spot Instances are available, EMR launches with five nodes and adds the sixth later if it becomes available.
Launching task instance groups as Spot Instances is a strategic way to expand your cluster's processing capacity while significantly minimizing costs. If your primary and core instance groups run on On-Demand Instances, their capacity is guaranteed for the duration of the cluster, and you can dynamically add Spot task instances as needed to handle peak traffic or accelerate data processing.
Set your On-Demand target capacity to 0 and your Spot target capacity to the number of nodes you need for Task nodes (Task = 100% Spot). Include multiple instance types across different families (e.g., a mix of C, M, and R types) to allow EMR to pick the cheapest available capacity.
Also, set a Spot provisioning timeout (around 5 minutes) and choose "Switch to On-Demand" as the timeout action. This way, if EMR can't acquire enough Spot Instances in time, it falls back to On-Demand capacity.
Next, for each fleet's Spot Instances, apply an allocation strategy (for example, price-capacity-optimized, which balances low price against the deepest available capacity pools).
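In API terms, the task-fleet settings above look roughly like this (instance types, capacities, and the 5-minute timeout are all illustrative):

```python
# Task fleet: 100% Spot, multiple instance families, 5-minute provisioning
# timeout with On-Demand fallback, price-capacity-optimized allocation
task_fleet = {
    "Name": "TaskFleet",
    "InstanceFleetType": "TASK",
    "TargetOnDemandCapacity": 0,   # Task = 100% Spot
    "TargetSpotCapacity": 6,
    "InstanceTypeConfigs": [
        {"InstanceType": "c5.xlarge"},
        {"InstanceType": "m5.xlarge"},
        {"InstanceType": "r5.xlarge"},
    ],
    "LaunchSpecifications": {
        "SpotSpecification": {
            "TimeoutDurationMinutes": 5,
            "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            "AllocationStrategy": "price-capacity-optimized",
        }
    },
}
```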
Step 10—Configure Networking, Security, and IAM
Now, let's configure Networking, Security, and AWS IAM settings.
For AWS VPC & Subnets:
In the "Network" section, select the VPC you configured earlier. Then, choose the AWS subnets (one or more) where you want your master and core nodes to launch. If you picked multiple Availability Zones (one subnet in us-east-1a, another in us-east-1b), ensure you select the corresponding AWS subnets here. Confirm that these AWS subnets have proper connectivity (like Internet Gateway or NAT Gateway) if your cluster needs to access the internet or other AWS services.
For Security Groups:
EMR automatically creates security groups for the master and core nodes. By default, the master security group allows SSH access from your IP and includes necessary EMR ports. While you can modify these later, the defaults are usually sufficient for most setups.
Remember to add an EC2 key pair for SSH access, which will be invaluable for direct troubleshooting if issues arise.
For AWS IAM Roles:
Assign appropriate service roles. You can either choose an existing role or create a new one. These roles manage permissions for EC2 and EMR.
For a more in-depth guide on security, see EMR cluster configuration for security.
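For reference, the same networking and IAM choices appear as RunJobFlow parameters. A hedged sketch with placeholder subnet IDs, key pair name, and the default EMR role names:

```python
# Subnet IDs, key name, and role names below are placeholders
network_and_iam = {
    "Instances": {
        "Ec2SubnetIds": ["subnet-0abc1234", "subnet-0def5678"],  # multi-AZ for fleets
        "Ec2KeyName": "my-emr-keypair",   # an existing EC2 key pair for SSH
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",  # instance profile for the EC2 nodes
    "ServiceRole": "EMR_DefaultRole",      # role the EMR service itself assumes
}
```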
Step 11—Create an EC2 instance profile
Choose "Create an instance profile". Then, grant read and write access to all AWS S3 buckets in your account so the cluster can interact with your data. This profile defines the permissions for the EC2 instances within your EMR cluster to access other AWS services, particularly S3.
Step 12—Enable Logging and Debugging
Finally, enable logging and direct it to a logs folder within your designated AWS S3 bucket. This allows you to track cluster activity and debug issues effectively. (For EMR 6.9 and later, log archiving to AWS S3 is automatically enabled, but you still need to specify the S3 bucket location).
Also consider setting up AWS SNS or CloudWatch event notifications if you want alerts on step completion or failures (for example, to email you).
In short, configure AWS S3 logging and any alerting you need so you have full visibility into the cluster’s operation and failures.
Step 13—Review and launch the cluster
Review all your settings to make sure everything is correct. Then click Create cluster. EMR will begin provisioning the instances and bootstrapping the software. The cluster will enter a STARTING state and eventually a RUNNING state. This can take a few minutes as EC2 instances launch and the Hadoop/Spark stack initializes.
In the EMR console, you can click on the cluster ID to monitor progress and view any bootstrap or error messages in the Events tab. Once the cluster is RUNNING, you’re ready to submit jobs.
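If you prefer launching from code, here is a hedged end-to-end sketch that assembles the fragments from the earlier steps. It reuses the primary_and_core_fleets and task_fleet dictionaries defined above; the release label, bucket, subnets, and key pair are placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Assemble the pieces from the previous steps into one RunJobFlow call
response = emr.run_job_flow(
    Name="MySpotCluster",
    ReleaseLabel="emr-7.2.0",   # assumption: pick a current emr-7.x release
    Applications=[{"Name": "Spark"}, {"Name": "Hadoop"}, {"Name": "Hive"}],
    LogUri="s3://my-emr-spot-demo-bucket-12345/logs/",
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    Instances={
        "Ec2SubnetIds": ["subnet-0abc1234", "subnet-0def5678"],
        "Ec2KeyName": "my-emr-keypair",
        "KeepJobFlowAliveWhenNoSteps": True,  # keep running so we can submit jobs
        # Fleet definitions from the Step 6 and Step 9 sketches above
        "InstanceFleets": primary_and_core_fleets + [task_fleet],
    },
)

cluster_id = response["JobFlowId"]
print("Launching cluster:", cluster_id)

# Block until the cluster reaches RUNNING/WAITING
emr.get_waiter("cluster_running").wait(ClusterId=cluster_id)
```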
Diagnosing Common Issues
Even a well-configured cluster can encounter problems. Here are some troubleshooting tips:
1) Instance Fleet Launch Failures
If the cluster or an instance fleet fails to launch, check the cluster Events in the EMR console. A common issue is a VALIDATION_ERROR when your chosen AWS subnets lack sufficient free IP addresses. Ensure each selected subnet can accommodate all instances. Also, look for Spot capacity events – if EC2 cannot fulfill your Spot request, either add more instance types/AZs or rely on the On-Demand fallback mechanism you configured.
2) Spot Interruptions
If a Spot node is reclaimed, EMR retries the affected work. By default, EMR runs application masters on On-Demand core nodes, so losing a task node typically only triggers task retries. To mitigate failures, enable job-level fault tolerance (like Spark checkpointing or periodic writes) so a restarted job doesn't lose all progress, and increase retry counts (like mapreduce.map.maxattempts or spark.task.maxFailures) as needed. If you use HDFS on Spot core nodes, increase the replication factor to prevent data loss when a node goes away.
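One way to raise those retry counts is through EMR's configuration classifications at launch; a minimal sketch (the values are illustrative, so tune them for your jobs):

```python
# Higher retry counts passed as EMR configuration classifications
fault_tolerance_configs = [
    {
        "Classification": "spark-defaults",
        "Properties": {"spark.task.maxFailures": "8"},
    },
    {
        "Classification": "mapred-site",
        "Properties": {"mapreduce.map.maxattempts": "8"},
    },
]
# Pass this list as the Configurations= argument of run_job_flow.
```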
3) Performance and Data Skew
If performance is poor, examine the Spark or YARN UI. A few straggling tasks often indicate data skew. If some tasks are significantly slower, repartition or rebalance your data to even out the load. Monitor garbage collection and memory usage: if tasks spend excessive time in GC or encounter OutOfMemory errors, you may need more executor memory or fewer cores per executor. Ensure you've allocated sufficient memory and CPU for your workload.
4) Resource Limits
Verify that you haven't hit AWS service limits. Check your EC2 vCPU quotas and request increases if you need more instances. Also, monitor subnet IP availability. The EMR Events tab will indicate if capacity or quota issues prevented nodes from launching.
Always remember to terminate the cluster (or set it to auto-terminate on step completion) when done to avoid incurring unnecessary costs. The largest cost is often leaving instances running. If an interruption occurs, you can fix your configuration or code and relaunch a fresh cluster quickly due to EMR's elasticity.
Best Practices for Using EC2 Spot Instances with EMR
1) Use Spot for Task Nodes Only
Run the primary (master) and typically the core nodes on On-Demand instances to protect the cluster's control plane and HDFS. Only place the non-persistent task nodes on Spot. This way, losing tasks doesn’t lose data.
2) Mix Instance Types (Fleets)
Always use Instance Fleets with multiple instance types (mix an Intel-based and a Graviton-based type) for your task fleet. This spreads the risk – if one type is unavailable, EMR can launch another.
3) Capacity-Optimized Allocation
Choose the capacity-optimized (or the newer price-capacity-optimized) Spot allocation strategy. AWS EMR will then search across Availability Zones (AZs) and instance types for the deepest pools of spare capacity, which significantly reduces the likelihood of interruption.
4) Set Max Price to On-Demand
Configure your Spot maximum price to match the On-Demand price (this is the default). Under the current Spot model there is no bidding war; interruptions happen when EC2 needs the capacity back, so setting the max price at the On-Demand level simply caps what you pay while keeping you eligible for whatever capacity exists.
5) Short-Lived Clusters
Spot is ideal for short batch workloads. For long-running or variable clusters (like a data warehouse), AWS guidance suggests On-Demand for core capacity with Spot mixed in for peak load; conversely, for short-lived, cost-sensitive workloads, you might run most of the cluster on Spot.
6) Monitor & Fallback
Keep a close eye on your cluster's status. Leverage EMR's retries and fallbacks: if Spot capacity isn't found within a configured timeout, set the fleet to fall back to On-Demand or terminate quickly so you can retry with a different configuration.
7) Checkpoint and Replication
In your Spark or Hadoop jobs, enable checkpointing and high replication for fault tolerance. For Spark Streaming, checkpoint to S3. For Hadoop/HDFS data, keep the replication factor at 3 (or 2 for small clusters) so losing one node doesn’t drop data blocks.
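As a quick illustration, a PySpark job can point its checkpoint directory at S3 (the bucket and prefix below are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spot-tolerant-job").getOrCreate()

# Checkpoint lineage to S3 so a lost Spot executor doesn't force
# a full recomputation from the start of the job
spark.sparkContext.setCheckpointDir("s3://my-emr-spot-demo-bucket-12345/checkpoints/")
```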
8) Optimize Data and Parallelism
Keep tasks small so they complete quickly (Spot provides only a 2-minute interruption warning). Tune Spark's defaultParallelism, adjust file partitioning, and use compressed, splittable file formats to optimize shuffle operations. This way, if an executor is lost, recomputing a small task is cheap. Also, lay out data in AWS S3 sensibly (e.g., partitioned by date) so Spark can prune input efficiently.
9) Instance Sizing
Select instance sizes that align with your job's requirements. Compute-intensive Spark jobs may prefer compute-optimized instances, while memory-heavy tasks might prefer memory-optimized instances. As a rule of thumb, size executors so their cores and memory divide evenly into the instance's vCPUs and RAM. For example, r5.xlarge (4 vCPU, 32 GB RAM) suits memory-heavy tasks, while c6g.xlarge (4 vCPU, 8 GB RAM) suits CPU-heavy tasks.
10) Split On-Demand/Spot
For data-critical or SLA-critical jobs, run the cluster's master and core nodes on On-Demand instances and supplement with Spot task nodes. This approach guarantees your data always resides on at least On-Demand nodes while still achieving significant cost savings on the compute portion.
Following these practices allows you to leverage most of the cost benefits of Spot Instances without compromising your cluster's reliability.
Conclusion
And that’s a wrap! Running EMR on Spot Instances requires careful planning, but the cost savings are substantial. This article covered setting up an EMR cluster in the AWS Console with Spot Instances, from selecting instance fleets and type mixes to handling networking, logging, and AWS IAM roles. Follow these steps and best practices and you can build a stable, cost-efficient EMR cluster: the process is straightforward, provided you monitor it and adjust as needed, and a Spot-backed EMR lets you run big data jobs without overspending or sacrificing reliability.
FAQs
What are AWS EC2 Spot Instances?
Spot Instances are spare EC2 capacity offered at deep discounts (often up to ~90% off On-Demand prices). They can be reclaimed by AWS with little notice (typically a 2-minute warning). In practice, Spot Instances let you run compute workloads very cheaply if you can tolerate occasional interruptions.
What is the difference between a Spot Instance and a Reserved Instance?
A Spot Instance is a short-term, interruptible instance you request when spare capacity is available; its price fluctuates based on supply and demand. A Reserved Instance is a billing discount for an EC2 instance you commit to for 1 or 3 years. Reserved Instances give you a lower hourly rate (and some capacity reservation) for those years, but they operate like normal On-Demand instances (they aren’t interrupted as long as capacity exists). Spot Instances, by contrast, are variable-priced and can be terminated by AWS when demand is high.
How long do EC2 Spot Instances last?
Spot Instances can run indefinitely – there is no fixed maximum runtime – until they are terminated by you or reclaimed by AWS. In other words, they run until AWS needs the capacity back. AWS provides a two-minute interruption notice, so design your applications to handle that scenario.
What do AWS EC2 Spot Instances allow you to do?
Spot Instances let you use AWS’s unused EC2 capacity at very low cost. This means you can run more or larger instances for the same budget. For big data processing on EMR, that translates to huge cost savings. In effect, Spot Instances give you the same EC2 features as On-Demand, but at a fraction of the price, in exchange for the possibility of interruption.
What is the frequency of interruption of a Spot Instance?
Interruption rates vary by instance type, region, and current demand. AWS reports that historically less than about 5% of workloads on Spot Instances are interrupted. In other words, most Spot Instances run without interruption, but you should always architect as if interruptions could happen.
What is the difference between an EMR cluster and EC2?
An EC2 instance is just a single virtual server (compute node). An EMR cluster is a managed service that launches multiple EC2 instances and configures them to run big data frameworks (Hadoop, Spark, Hive, etc.). EMR takes care of installing and configuring the big data stack on your EC2 nodes.
What’s the difference between instance fleets and instance groups?
Instance groups are an older EMR feature: each group is a fixed-size set of identical instances of one type. Instance fleets are more flexible: each fleet can target a set of instances and include multiple types (up to 5 in the console, 30 via CLI), and you can set separate On-Demand vs Spot capacities and allocation strategies. Fleets allow EMR to mix and match types to fulfill capacity, while groups are simpler single-type pools.
Why not run the EMR master node on Spot?
The EMR master node runs the cluster’s coordination services (like the YARN ResourceManager and HDFS NameNode). If the master is interrupted, the cluster stops; AWS explicitly warns that when the master terminates, the cluster ends. For that reason, always use an On-Demand instance for the master (unless you don’t mind the cluster ending unexpectedly).
Can I run core nodes on Spot Instances?
Core nodes hold HDFS data and also run tasks (like reducers or Spark executors). Losing a core node means losing the data on it. AWS recommends running core nodes On-Demand, or using Spot only if you can tolerate partial data loss. If you do use Spot for core nodes, make sure HDFS has sufficient replication, or better yet keep important data in S3.
What allocation strategy works best for Spot?
For Spot Instance fleets, capacity-optimized is usually the best choice. It launches Spot requests into the pools with the most available capacity, which lowers interruption risk. AWS also offers a capacity-optimized-prioritized variant and diversified. Avoid relying on lowest-price alone if you need reliability, since it may pick oversubscribed pools.
How do I choose instance types for my task fleet?
Pick instance types that match your workload profile and include a mix to give EMR options. As a rule of thumb, include some general-purpose (m5.large/xlarge) and specialized types you need. AWS guidance suggests compute-bound tasks on C-series, memory-bound on R-series, etc. In practice, you list several suitable types in the fleet so EMR can pick the cheapest available. If one type is not available, others can take its place.
Can I resize my EMR cluster after launch?
Yes. If you use instance groups, you can increase or decrease the group size (via the console or CLI). If you use instance fleets, you can edit the target capacities. In the EMR console, go to the cluster's configuration, and change the number of instances or On-Demand/Spot targets for the core/task fleets. EMR will add or terminate instances to match the new settings. You can even add a new task fleet to a running cluster if needed.
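For instance fleets, the resize can also be scripted with boto3; a hedged sketch using placeholder cluster and fleet IDs:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Cluster ID is a placeholder; list_instance_fleets returns the real fleet IDs
CLUSTER_ID = "j-XXXXXXXXXXXXX"
fleets = emr.list_instance_fleets(ClusterId=CLUSTER_ID)
task_fleet_id = next(
    f["Id"] for f in fleets["InstanceFleets"] if f["InstanceFleetType"] == "TASK"
)

# Scale the task fleet to 10 Spot units (0 On-Demand)
emr.modify_instance_fleet(
    ClusterId=CLUSTER_ID,
    InstanceFleet={
        "InstanceFleetId": task_fleet_id,
        "TargetOnDemandCapacity": 0,
        "TargetSpotCapacity": 10,
    },
)
```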