Apache Spark has emerged as one of the most popular big data processing frameworks due to its speed, scalability, and ease of use. However, harnessing the full power of Spark requires a good understanding of its best practices. In this article, we will explore the do's and don'ts of Apache Spark to help you maximize its potential and avoid common pitfalls.
I admit that the separation between do's and don'ts is quite arbitrary - in most cases, one can be rewritten in terms of the other. So I consider both parts equally important and inseparable from each other.
Also keep in mind that most of these are just recommendations, really - and if you see a good reason to break them, you probably should. But be aware of what you're "paying" and what you're getting in return.
Do's:
Understand your data, and what you want to do with it. Several times I've been humbled after diving in and optimizing pipelines purely based on their code, without understanding their purpose from a business perspective. I've improved pipelines that should have been deleted; I've come up with better ways to join datasets, only to learn that we no longer use the results of that join down the line; I've optimized the calculation of metrics that users were not reading. Such understanding also improves the tests you cover your code with (you won't be inventing scenarios on the spot, but testing use cases from real users) and the invariant checks on your input and output data - you should understand what outliers might look like, which values can or cannot be null, and what variations are expected - and hence protect your pipeline from processing invalid data or from publishing incorrect results.
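As a minimal sketch of what such an invariant check might look like (the dataset and column names here are hypothetical, not from any real pipeline):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// A minimal sketch of an input invariant check; "order_id" and "amount" are made-up columns.
def validateOrders(orders: DataFrame): Unit = {
  // Fail fast if required fields are null or values fall outside the expected range.
  val invalid = orders.filter(col("order_id").isNull || col("amount") < 0).count()
  require(invalid == 0, s"Found $invalid invalid order rows - refusing to run the pipeline")
}
```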
Partition wisely. Spark operates on partitions, which are its units of parallelism. If you're not utilizing partitions and all your data fits into the memory of one machine, that's a reason to ask whether you should be using Spark in the first place. Partitioning your data optimally is crucial for performance. Consider how your data will be read, how it will be joined later, and whether you can shuffle it in a way that avoids skew and spares further re-shuffles down the line (the shuffle is one of the most expensive operations). Consider how the data will be queried after your pipeline ends and which partitions consumers will need - don't make people read terabytes of data when they just want a few hundred records.
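Here's a sketch of what consumer-friendly, partition-aware output could look like - the path and column names are made up for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.col

// A sketch of writing output partitioned by the columns consumers actually filter on.
def writeEvents(events: DataFrame): Unit = {
  events
    .repartition(col("event_date"))          // co-locate rows of the same day before writing
    .write
    .mode(SaveMode.Overwrite)
    .partitionBy("event_date", "country")    // consumers can then read a single day or country
    .parquet("s3://bucket/events")           // hypothetical output location
}
```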
Use the Dataset API. You shouldn't use the RDD API - I don't think this comes as any surprise; it has long been dead and gone. The Dataset API lets you write type-checked code and be sure that, if you've defined the schemas of your input data correctly, your code will work as expected rather than fail somewhere with an unexpected null or a ClassCastException. There are still cases where you won't get the results you expect (some utility functions return meaningless results instead of actually throwing an exception), but the experience is still far better than using DataFrames and watching a job fail after three hours because you misspelled a column name.
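For illustration, a small sketch of the typed approach - the case class, path, and fields are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// The schema we expect from the input data.
case class Order(orderId: Long, userId: Long, amount: Double)

val spark = SparkSession.builder().appName("dataset-example").getOrCreate()
import spark.implicits._

// Converting to a typed Dataset up front: a schema mismatch or a misspelled field
// surfaces here or at compile time, not three hours into the job.
val orders = spark.read.parquet("s3://bucket/orders").as[Order]
val largeOrders = orders.filter(_.amount > 100.0)
```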
Cache intermediate results. If you expect to reuse data you've just computed multiple times later on, cache it (in memory or on disk) using the persist or cache methods. Remember that without intermediate caching, your datasets are just graphs of computations to perform, not results that exist in memory. And remember that caching is not free either. Data cached only to memory won't survive an executor restart, and Spark will simply drop it if it starts running out of memory for ongoing operations. Caching to disk is more durable, but it requires serializing the data.
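A sketch of the pattern, with hypothetical inputs and column names:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// Persist an expensive intermediate result that is reused by two aggregations.
def writeReports(orders: DataFrame, users: DataFrame): Unit = {
  val enriched = orders.join(users, "userId")
    .persist(StorageLevel.MEMORY_AND_DISK)   // spill to disk instead of recomputing the join

  enriched.groupBy("country").count().write.parquet("s3://bucket/by_country")
  enriched.groupBy("orderDate").count().write.parquet("s3://bucket/by_day")

  enriched.unpersist()                        // free the cache once both outputs are written
}
```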
Optimize shuffles. Shuffle operations redistribute data across partitions, and this is usually expensive. Aim to reduce the size of your data as much as you can before shuffling, so if there is any filtering you can apply first - definitely do that. Another good option is the byKey methods (like reduceByKey or aggregateByKey) - they combine values for each key locally on every partition before the shuffle, so far less data crosses the network than with a plain groupByKey. You probably won't get rid of shuffles altogether, so also pay attention to how you partition your datasets (maybe one partitioning scheme can serve several operations down the line, saving resources a bit later).
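A small sketch of shrinking data before it gets shuffled - the dataset and columns are hypothetical:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Drop the rows and columns you don't need before the shuffle caused by the aggregation.
def dailyRevenue(orders: DataFrame): DataFrame = {
  orders
    .filter(col("status") === "COMPLETED")   // filter first: less data to redistribute
    .select("orderDate", "amount")           // and only the columns the result actually needs
    .groupBy("orderDate")                    // the aggregation still shuffles,
    .sum("amount")                           // but partial sums are combined on each partition first
}
```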
Don'ts:
Don't use collect. The collect() action brings the entire dataset to the driver, which with high probability will cause out-of-memory errors (since the dataset is probably large). If you're debugging or validating your pipeline and want to check that the data looks the way you expect, prefer methods like take() or first() - that way you control exactly how many records are brought back.
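A quick sketch of bounded inspection while debugging, reusing the hypothetical Order case class from above:

```scala
import org.apache.spark.sql.Dataset

// Pull a bounded sample to the driver instead of the whole dataset.
def peek(orders: Dataset[Order]): Unit = {
  orders.take(20).foreach(println)   // at most 20 rows ever reach the driver
  // orders.collect()                // would ship the entire dataset to the driver - avoid
}
```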
Avoid unnecessary serialization/deserialization (SerDe). Serializing and deserializing data can be time-consuming. Spark makes very good use of the metadata in columnar storage formats (like Parquet or ORC): it pushes down filters to skip unnecessary partitions (for example, if you partition data by country and a job explicitly reads only "country=UK", the rest of the partitions won't even be touched, dramatically reducing the amount of data read), and it ignores columns that the job doesn't use. However, if you're using an "opaque" data format (for instance, you've put all your data into a single Parquet column containing serialized JSON), Spark has no option but to read it all. Remember that columnar storage is built to handle a large number of columns and sparse data well - so 20 columns are preferable to a single map column with 20 known keys and differing values.
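For illustration, a sketch of a read that lets Spark prune both partitions and columns - the path and columns are hypothetical, and the data is assumed to have been written with partitionBy("country"):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("pruning-example").getOrCreate()

val ukSales = spark.read.parquet("s3://bucket/sales")
  .filter(col("country") === "UK")       // only the country=UK directories are listed and read
  .select("orderDate", "amount")         // only these columns are decoded from the Parquet files

// ukSales.explain() - PartitionFilters / PushedFilters in the plan confirm what was pruned
```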
Don't ignore resource usage. Spark needs sufficient resources to perform efficiently. Don't forget that when you run a worker (or a driver) on a machine, it can't use the machine's whole memory - there is JVM overhead, and other services run there as well. When you're trying to optimize a job, look at the errors you're getting and where exactly they occur. Where does the OOM happen? If it's in one of the workers, increasing driver memory won't help - look for a possible data skew and fix that instead. If the driver is OOMing, check your collects and the filters before them - are you collecting millions of rows when you expected only a few hundred?
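A sketch of making executor memory and its overhead explicit - the numbers are purely illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Driver memory generally has to be set at submit time (--driver-memory),
// before the driver JVM starts, so it is not shown here.
val spark = SparkSession.builder()
  .appName("memory-example")
  .config("spark.executor.memory", "8g")            // heap available to each executor
  .config("spark.executor.memoryOverhead", "2g")    // off-heap / JVM overhead on top of that heap
  .getOrCreate()
```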
Avoid data skew. Data skew occurs when a few partitions hold significantly more data than the others, leading to performance degradation. The obvious symptom is straggling tasks - if in the Spark UI you see 998 out of 1000 tasks finishing in minutes while 2 take hours (or die), that's a very strong sign of skew. There are multiple ways to resolve it. One of the easiest is simply increasing the number of partitions, so that multiple hot keys don't end up on the same machine. A better way is to add a salt to the keys before aggregation - this changes the partitioning key and lets Spark distribute the data more evenly.
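A sketch of salting a hot aggregation key - the dataset, column names, and bucket count are hypothetical:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Two-stage aggregation: partial counts per (key, salt), then combine the partials per key.
def countBySaltedKey(events: DataFrame, saltBuckets: Int = 16): DataFrame = {
  events
    .withColumn("salt", (rand() * saltBuckets).cast("int"))   // spread each hot key over N buckets
    .groupBy("userId", "salt")
    .count()
    .groupBy("userId")
    .agg(sum("count").as("count"))
}
```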
Avoid UDFs and UDAFs. UDFs are powerful, but they also cause inefficiencies when not used sparingly. UDFs are arbitrary Scala (or Java) functions you define, and workers execute them on each row of data, at scale. They come at a cost, though, and that cost is both ser/de and the loss of vectorization. When data is represented in Spark's "native" formats and you use built-in functions on it, Spark can optimize freely - pushing down filters, reordering or merging operations, vectorizing calculations. UDFs, however, are totally opaque to Spark, and it cannot optimize them in any way - they have to be executed as-is. They also force Spark to materialize Scala objects with all their memory overhead, instead of using its more compact internal representation. There are cases where a UDF is the only way to get the result you need, but if you can do some of the data preparation or filtering with built-in functions first, that's the better way to do it.
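A small sketch of replacing a UDF with built-ins the optimizer can see through - the dataset and column are hypothetical:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Instead of: val normalize = udf((s: String) => s.trim.toLowerCase)
// use the equivalent built-in functions, which Spark can optimize and vectorize.
def normalizeNames(users: DataFrame): DataFrame = {
  users.withColumn("name", lower(trim(col("name"))))
}
```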
Apache Spark is a powerful tool, but with its power comes the responsibility of using it wisely and sparingly. You don't have to throw thousands of worker nodes and terabytes of memory at a problem just because you have them available - that approach won't scale.
Keep this advice in mind, but don't rush into action immediately - profile your jobs, see where they spend their time, find the low-hanging fruit (if you've mostly been investing in new features rather than optimization, there will be plenty of it), and don't be afraid to plan some bigger refactorings once you've proven to yourself there's no other way around a particular problem.