By Marian Dvorsky, Software Engineer, Google Cloud
Grasping the trade-offs between cost and speed is the first step in improving the efficiency of your data pipelines
Google Cloud Dataflow is a fully managed service from Google Cloud Platform (GCP) for scalable batch and streaming data processing. In this post, you’ll learn about the relationship between cost and speed in Cloud Dataflow pipelines, particularly for batch workloads, and how understanding that relationship can lead to insights about efficiency.
In Cloud Dataflow, the cost of a pipeline is proportional to its resource usage of CPU, memory and disk. (See the Cloud Dataflow pricing model for details.) The speed of the pipeline is how quickly the job completes (also called elapsed time).
At any given time, the pipeline is using a certain number of workers of a given type. The worker type determines the resource requirements; for example, the default batch worker type, n1-standard-1, provides 1 vCPU, 3.75 GB of memory and 250 GB of local disk. According to the pricing model, running a pipeline with one such worker for 10 minutes costs approximately $0.14. A pipeline that uses two such workers but runs for only 5 minutes also costs $0.14. So, as shown below, the second pipeline costs the same but runs 2x faster.
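As a sketch of the arithmetic above, the per-worker-minute rate below is back-derived from the article's $0.14-for-10-minutes example; it is an illustrative constant, not an official price:

```python
# Assumed per-worker-minute rate for one n1-standard-1 batch worker,
# inferred from the $0.14-for-10-minutes example (illustrative only).
RATE_PER_WORKER_MINUTE = 0.014

def job_cost(num_workers, minutes):
    # Cost is proportional to total worker-minutes consumed,
    # regardless of how those minutes are spread across workers.
    return num_workers * minutes * RATE_PER_WORKER_MINUTE

print(job_cost(1, 10))  # one worker for 10 minutes
print(job_cost(2, 5))   # two workers for 5 minutes: same cost, 2x faster
```

Both calls consume the same 10 worker-minutes, which is why the faster configuration costs the same.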
With autoscaling, the resource usage graph is not a simple rectangle shape, but rather the amount of resources ($ per minute) varies over time:
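With autoscaling, cost becomes the area under the workers-over-time curve rather than a simple rectangle. A minimal sketch, reusing the same hypothetical per-worker-minute rate and a made-up scaling profile:

```python
# Hypothetical per-worker-minute rate, for illustration only.
RATE_PER_WORKER_MINUTE = 0.014

def pipeline_cost(workers_per_minute):
    """Sum resource spend minute by minute: the area under the curve."""
    return sum(w * RATE_PER_WORKER_MINUTE for w in workers_per_minute)

# An autoscaled job that ramps up, plateaus, then scales back down:
profile = [1, 4, 8, 8, 8, 4, 1]
print(pipeline_cost(profile))
```

The job runs for 7 minutes but is billed for only 34 worker-minutes, less than a flat 8-worker rectangle over the same elapsed time would cost.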
Improving pipeline speed via efficiency and parallelism
There are generally two ways to improve the speed of a pipeline: by improving efficiency or by increasing parallelism.
Improving efficiency means using fewer resources, and it generally improves both the cost and the speed of the pipeline. An example of improving efficiency would be switching to a more efficient Coder, or shuffling less data by applying a filter before the GroupByKey operation instead of after it. (Methods for improving pipeline efficiency would be a topic for a separate blog post.)
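To make the shuffle-reduction idea concrete, here is a minimal, Beam-free Python sketch; the records and the threshold are made up for illustration. Filtering upstream of the GroupByKey boundary means fewer records cross the shuffle:

```python
# Hypothetical (key, value) records; suppose the pipeline only needs values >= 10.
records = [("a", 5), ("a", 50), ("b", 7), ("b", 70), ("c", 90)]

def records_shuffled(records, filter_before_group):
    """Count records that would cross the GroupByKey shuffle boundary."""
    if filter_before_group:
        # Filter applied before GroupByKey: less data is shuffled.
        records = [(k, v) for k, v in records if v >= 10]
    return len(records)

print(records_shuffled(records, filter_before_group=False))  # all 5 records shuffled
print(records_shuffled(records, filter_before_group=True))   # only 3 records shuffled
```

The grouped output is identical either way; only the volume of data moved through the shuffle differs.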
Increasing parallelism means running more of the computation in parallel. This approach generally improves speed but may come with an increase in cost, due to the overhead of parallelization. With its autoscaling feature enabled, Cloud Dataflow automatically chooses the degree of parallelism for a job, provided the job has inherent parallelism to exploit; for example, it cannot parallelize computation over a single key.
You can limit the parallelism (and therefore the per-minute spend) of an autoscaled job by setting --maxNumWorkers. The graph below illustrates the cost-versus-speed tradeoff for an example Google Cloud ML preprocessing pipeline run with varying values of --maxNumWorkers. As the maximum number of workers grows, speed improves significantly for some increase in cost: the default configuration is 6x faster, but costs only 28% more, than the version with max 300 workers.
In this example, there’s a large speed improvement at little additional cost in going from max=300 to max=600. There’s a significant further speed improvement from lifting the autoscaling ceiling altogether (“max default”), but that comes at a modest additional cost. Note that in all cases, an autoscaled job will be cheaper than a non-autoscaled version of the same pipeline sized to complete in the same time as the autoscaled job.
Autoscaling is not just more efficient than non-autoscaled execution; it also provides a simple lever, --maxNumWorkers, that can in some cases be used to control the trade-off between speed and cost. In most cases you’re better off not setting it, but you may find it useful if you’d like to cap the per-minute spend, or to decrease the cost of a job whose execution time doesn’t matter. (Note that it’s always better to use autoscaling with --maxNumWorkers than to use --numWorkers alone, as autoscaling can save additional cost by downscaling, for example during a non-parallelizable part of the pipeline.)
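As a sketch of how the flag is passed in practice, here is an illustrative launch command using the Dataflow Java SDK flag spelling from this post; the main class, project, and bucket names are placeholders:

```shell
# Illustrative only: launching a Dataflow batch job with an autoscaling cap.
# com.example.MyPipeline, my-project, and gs://my-bucket are placeholders.
mvn compile exec:java -Dexec.mainClass=com.example.MyPipeline \
  -Dexec.args="--runner=DataflowRunner \
               --project=my-project \
               --tempLocation=gs://my-bucket/tmp \
               --autoscalingAlgorithm=THROUGHPUT_BASED \
               --maxNumWorkers=600"
```

Omitting --maxNumWorkers gives the "max default" configuration discussed above, where the service chooses the ceiling itself.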
To learn more: