
4TB RAM, Yet an OOM Error? Debugging a Spark Memory Mystery


Everything seemed right: ample resources, a well-sized cluster, and yet the Spark job kept failing with an out-of-memory error. The logs pointed to memory allocation failures, but with a 63-node cluster and 64GB of RAM per node, this shouldn't have been an issue. We tweaked configurations, analyzed logs, and even considered scaling up the cluster. But the real solution? It wasn't what we expected.

The Data Challenge

Our task involved processing three datasets:

Our Infrastructure: a 63-node cluster with 64GB of RAM per node, roughly 4TB of RAM in total.

The job failed with this error:

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007f0a62000000, 494927872, 0) failed; error='Not enough space' (errno=12)

At first glance, this error made little sense. Our cluster had plenty of RAM, yet the JVM couldn’t allocate memory properly. The failure suggested that the issue wasn’t a lack of overall memory, but rather how memory was being allocated within the executors.

Our first instinct? Increase the cluster size. But was that really the right approach?
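
For context, the failing layout corresponds roughly to the sketch below. Only the executor count (63) and the 64GB heap come from this post; the app name and the SparkSession-based wiring are illustrative assumptions, and the same settings could just as well be passed as spark-submit --conf flags.

    import org.apache.spark.sql.SparkSession

    // Sketch of the original layout: one large 64GB executor per node.
    // Only spark.executor.instances=63 and spark.executor.memory=64g come
    // from the post; everything else here is illustrative.
    val spark = SparkSession.builder()
      .appName("three-dataset-job")              // hypothetical app name
      .config("spark.executor.instances", "63")
      .config("spark.executor.memory", "64g")
      .getOrCreate()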

Debugging Process: What Went Wrong?

  1. Initial Assumption: Not Enough Resources

    At first, we assumed our cluster was too small. More nodes mean more memory, right? But with 63 nodes and a total of 4TB RAM, this didn’t add up. Something else was at play.

  2. Heap Allocation Problems

    Each executor was given a 64GB heap. While this seemed like a good idea (more memory per executor), it actually led to heap inefficiencies:

    • Large heaps increase fragmentation.
    • Committing such large memory regions for a single executor process becomes harder for the OS, which is exactly what the os::commit_memory failure pointed to.
    • Longer GC pauses slow down the whole job.
  3. JVM Garbage Collection (GC) Bottlenecks

    Garbage collection should ideally free up unused memory quickly. However, with a large heap size, we observed:

    • Long major GC pauses as the collector worked through the oversized heap.
    • Inefficient cleanup cycles that still ended in out-of-memory errors.
    • We tried tuning GC with the following JVM options (see the sketch after this list for how they were applied to the executors):
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'
      -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=30
      

    But the job still failed.
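
For reference, executor JVM flags like the ones above are passed through spark.executor.extraJavaOptions. The snippet below is a minimal sketch of that wiring, not our exact submission command; the flags themselves are the ones listed above.

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: pass the GC/diagnostic flags above to every executor JVM
    // via spark.executor.extraJavaOptions. Note that heap size itself (-Xmx)
    // cannot be set this way; it is controlled by spark.executor.memory.
    val gcOptions = Seq(
      "-verbose:gc",
      "-XX:+PrintGCDetails",
      "-XX:+PrintGCDateStamps",
      "-XX:OnOutOfMemoryError='kill -9 %p'",
      "-XX:+UseG1GC",
      "-XX:InitiatingHeapOccupancyPercent=30"
    ).mkString(" ")

    val spark = SparkSession.builder()
      .config("spark.executor.extraJavaOptions", gcOptions)
      .getOrCreate()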

The Fix: More Executors, Smaller Heaps

Instead of increasing the cluster size, we changed the executor layout: rather than 63 executors with a 64GB heap each, we ran 252 executors with a 16GB heap each (four per node), keeping the total memory the same. A sketch of the change follows below.

Result? The job passed. ✅
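
Concretely, the change amounted to the settings in the sketch below. The executor count (252) and the 16GB heap come from the benchmark table that follows; cores per executor and the rest of the wiring are illustrative assumptions.

    import org.apache.spark.sql.SparkSession

    // Sketch of the fix: several smaller executors per node instead of one huge one.
    // 252 executors x 16GB heap comes from the benchmark table; the core count
    // below is an assumed, illustrative value.
    val spark = SparkSession.builder()
      .config("spark.executor.instances", "252")
      .config("spark.executor.memory", "16g")
      .config("spark.executor.cores", "4")       // assumption, not from the post
      .getOrCreate()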

Why This Worked

  1. Smaller Containers, Better Efficiency

    JVM memory management becomes inefficient with extremely large heap sizes. By reducing the heap size from 64GB to 16GB, memory allocation became more efficient and predictable.

  2. Improved Garbage Collection (GC)

    • Large heaps cause longer GC pauses, delaying memory cleanup.
    • Smaller heaps = shorter GC pauses = better performance.
    • With more executors, GC cycles ran faster and in parallel.
  3. Better Load Distribution

    • Before: Some nodes were underutilized, while others struggled.
    • After: More executors meant better workload balance across nodes (the rough packing arithmetic is sketched below).
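
To make the load-distribution point concrete, the rough packing arithmetic behind those numbers is sketched below; it deliberately ignores per-executor memory overhead and the headroom the OS needs.

    // Rough packing arithmetic implied by the before/after numbers.
    // Per-executor memory overhead and OS headroom are ignored for simplicity.
    val nodes        = 63
    val ramPerNodeGb = 64

    val heapBeforeGb = 64
    val execsBefore  = nodes * (ramPerNodeGb / heapBeforeGb) // 63 * 1 = 63

    val heapAfterGb  = 16
    val execsAfter   = nodes * (ramPerNodeGb / heapAfterGb)  // 63 * 4 = 252

    println(s"before: $execsBefore executors, after: $execsAfter executors")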

Benchmarking: Before vs. After

Configuration | Executors | Heap Size per Executor | GC Pause Time | Job Status
Before        | 63        | 64GB                   | Long          | ❌ Failed
After         | 252       | 16GB                   | Short         | ✅ Passed

Although runtime remained the same, memory stability improved significantly.

Key Takeaways

  • Total cluster RAM is not the whole story: the job failed with 4TB available because each 64GB heap was too large for the JVM and the OS to manage efficiently.
  • Oversized heaps bring fragmentation and long GC pauses; more executors with smaller heaps keep GC fast and spread the load evenly.
  • Next time you face a Spark memory issue, don't just scale up the cluster; try optimizing the executor layout first. It might save you a lot of headaches (and money)! 🚀


Lastly, thank you for reading this post. For more posts like this, you can follow me on Medium (amarlearning) and on GitHub (amarlearning).

#Apache-Spark #Spark #Big-Data #Data-Processing #Performance #Performance-Optimization #Distributed-Computing #Data-Engineering