Understanding TensorFlow’s Execution Modes: Eager vs. Graph

Is your TensorFlow code running slower than a snail’s pace? You’re not alone. Many developers find their powerful machine learning models hitting performance bottlenecks. While TensorFlow offers incredible flexibility, unlocking its full speed potential often means understanding how it truly operates under the hood.

This guide will take you beneath the surface of TensorFlow and Keras to demonstrate how TensorFlow works, focusing on how you can make simple changes to your code to leverage graphs. We’ll explore how these graphs are stored and represented, and crucially, how they can accelerate your models, turning slow TensorFlow code into a streamlined powerhouse.

Understanding TensorFlow’s Execution Modes: Eager vs. Graph

In the past, you might have run TensorFlow eagerly. This means TensorFlow operations are executed by Python, operation by operation, returning results back to Python immediately. While eager execution is incredibly intuitive and great for debugging, it often leaves significant performance on the table, especially for complex computations.

Graph execution, on the other hand, is where TensorFlow truly shines for speed and portability. Graph execution means that tensor computations are executed as a TensorFlow graph, often referred to as a `tf.Graph` or simply a "graph."

Graphs are sophisticated data structures containing a set of `tf.Operation` objects, representing units of computation, and `tf.Tensor` objects, which represent the data flowing between these operations. They are defined within a `tf.Graph` context. Because these graphs are self-contained data structures, they can be saved, run, and restored without needing the original Python code.

For those familiar with TensorFlow 1.x, this guide demonstrates a very different view of graphs, reflecting TensorFlow 2.x’s emphasis on integrating graphs seamlessly into a more Pythonic workflow. This is a big-picture overview that covers how tf.function allows you to switch from eager execution to graph execution, offering a path to significantly speed up TensorFlow.

The Benefits of Graph Execution

With a graph, your TensorFlow models gain immense flexibility. You can deploy your TensorFlow graph in environments without a Python interpreter, like mobile applications, embedded devices, and backend servers. TensorFlow uses graphs as the standard format for saved models when exporting them from Python.
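As a quick illustration of that portability, a `tf.function` can be exported as a SavedModel and loaded back with no original Python source. This is a minimal sketch, not the only export path; the `scale` function, the module attribute name, and the temporary directory are illustrative choices:

```python
import tempfile

import tensorflow as tf

# A graph-backed function with a fixed input signature.
@tf.function(input_signature=[tf.TensorSpec([None], tf.float32)])
def scale(x):
  return x * 2.0

# Attach the function to a module and export it as a SavedModel.
module = tf.Module()
module.scale = scale
export_dir = tempfile.mkdtemp()
tf.saved_model.save(module, export_dir)

# The restored object runs the saved graph; the Python body of
# `scale` is no longer needed.
restored = tf.saved_model.load(export_dir)
result = restored.scale(tf.constant([1.0, 2.0]))
```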

Graphs are also highly optimizable. The TensorFlow compiler can perform powerful transformations like statically inferring tensor values by folding constant nodes in your computation (“constant folding”). It can also separate independent sub-parts of a computation and distribute them across threads or devices, and simplify arithmetic operations by eliminating common subexpressions. An entire optimization system called Grappler is dedicated to performing these and many other speedups.

In short, graphs are immensely useful, allowing your TensorFlow code to run fast, run in parallel, and run efficiently across multiple devices. The convenience of defining your models in Python, while automatically constructing graphs when needed, offers the best of both worlds.

Accelerating with `tf.function`: Your Key to Graph Execution

The magic bullet for transitioning from eager to graph execution in TensorFlow is `tf.function`. You create and run a graph by using `tf.function`, either as a direct call or as a decorator. `tf.function` takes a regular Python function as input and returns a `tf.types.experimental.PolymorphicFunction`. This `PolymorphicFunction` is a Python callable that builds TensorFlow graphs from your original Python function, and you use it just like its Python equivalent.

To get started, let’s set up our environment:

import tensorflow as tf
import timeit
from datetime import datetime

Here’s a practical example:

# Define a Python function.
def a_regular_function(x, y, b):
  x = tf.matmul(x, y)
  x = x + b
  return x

# The Python type of `a_function_that_uses_a_graph` will now be a
# `PolymorphicFunction`.
a_function_that_uses_a_graph = tf.function(a_regular_function)

# Make some tensors.
x1 = tf.constant([[1.0, 2.0]])
y1 = tf.constant([[2.0], [3.0]])
b1 = tf.constant(4.0)

orig_value = a_regular_function(x1, y1, b1).numpy()
# Call a `tf.function` like a Python function.
tf_function_value = a_function_that_uses_a_graph(x1, y1, b1).numpy()
assert orig_value == tf_function_value

A `tf.function` is powerful because it applies to the function it decorates and all other functions it calls internally, capturing a comprehensive graph of your computation. TensorFlow uses a library called AutoGraph (`tf.autograph`) to convert Python-specific logic like `if-then` clauses and loops into graph-compatible operations, ensuring your entire computation benefits from graph execution.

The `PolymorphicFunction` encapsulates several `tf.Graphs` behind one API, enabling the benefits of graph execution such as speed and deployability. Each `tf.Graph` is specialized for specific input types (e.g., tensors with a particular `dtype` or objects with the same `id()`). If `tf.function` encounters new input types, it creates a new `tf.Graph`, specialized to those arguments, represented by a `ConcreteFunction` which wraps the graph.

@tf.function
def my_relu(x):
  return tf.maximum(0., x)

# `my_relu` creates new graphs as it observes different input types.
print(my_relu(tf.constant(5.5)))
print(my_relu([1, -1]))
print(my_relu(tf.constant([3., -3.])))
tf.Tensor(5.5, shape=(), dtype=float32)
tf.Tensor([1. 0.], shape=(2,), dtype=float32)
tf.Tensor([3. 0.], shape=(2,), dtype=float32)

Subsequent calls with matching input types will reuse existing graphs, preventing redundant tracing and ensuring optimal performance:

# These two calls do *not* create new graphs.
print(my_relu(tf.constant(-2.5))) # Input type matches `tf.constant(5.5)`.
print(my_relu(tf.constant([-1., 1.]))) # Input type matches `tf.constant([3., -3.])`.
tf.Tensor(0.0, shape=(), dtype=float32)
tf.Tensor([0. 1.], shape=(2,), dtype=float32)

Practical Considerations and Performance Gains

While `tf.function` can seem like a magic wand, there are practical aspects to consider. Code within a `tf.function` can execute both eagerly and as a graph. By default, it runs as a graph. You can verify your `tf.function`’s graph computation matches its Python equivalent by temporarily enabling eager execution globally with `tf.config.run_functions_eagerly(True)`.
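For example, you can flip that global switch, run the same function both ways, and confirm the results agree. A small sketch, where `double` is a hypothetical function used only for the comparison:

```python
import tensorflow as tf

@tf.function
def double(x):
  return x * 2

# Force every tf.function to run its Python body eagerly (handy for debugging).
tf.config.run_functions_eagerly(True)
eager_result = double(tf.constant(3))

# Restore the default graph execution.
tf.config.run_functions_eagerly(False)
graph_result = double(tf.constant(3))

assert eager_result.numpy() == graph_result.numpy() == 6
```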

A key difference between graph and eager execution involves Python side effects, like the `print` function. During graph execution, the `print` statement is executed only once when `tf.function` first traces the code to create the graph. It’s not part of the graph itself and won’t re-execute on subsequent calls using the same traced graph. If you need to print values in both eager and graph execution, use `tf.print` instead.
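The difference is easy to demonstrate. In this sketch (`traced_fn` is an illustrative name), the Python `print` fires only while tracing, while `tf.print` is captured as a graph operation and fires on every call:

```python
import tensorflow as tf

@tf.function
def traced_fn(x):
  print("Python print: runs only during tracing")  # Eager-only side effect.
  tf.print("tf.print: runs on every call")         # Captured in the graph.
  return x + 1

r1 = traced_fn(tf.constant(1))  # First call traces: both lines print.
r2 = traced_fn(tf.constant(2))  # Reuses the graph: only tf.print fires.
```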

Another important concept is non-strict execution. Graph execution only runs operations necessary to produce observable effects, such as the function’s return value or documented side effects like `tf.print` and `tf.Variable` mutations. This means unnecessary operations, even those that might raise errors in eager mode, will be skipped. Do not rely on runtime errors from unnecessary operations in graph execution.

def unused_return_eager(x):
  # Getting index 1 will fail when `len(x) == 1`.
  tf.gather(x, [1])  # unused
  return x

try:
  print(unused_return_eager(tf.constant([0.0])))
except tf.errors.InvalidArgumentError as e:
  # All operations are run during eager execution, so an error is raised.
  print(f'{type(e).__name__}: {e}')
InvalidArgumentError: indices[0] = 1 is not in [0, 1) [Op:GatherV2]
@tf.function
def unused_return_graph(x):
  tf.gather(x, [1])  # unused
  return x

# Only needed operations are run during graph execution. The error is not raised.
print(unused_return_graph(tf.constant([0.0])))
tf.Tensor([0.], shape=(1,), dtype=float32)

`tf.function` Best Practices

To maximize performance and avoid common pitfalls when your TensorFlow code is slow, consider these best practices:

  • Toggle between eager and graph execution frequently with `tf.config.run_functions_eagerly` to quickly identify behavioral differences.
  • Create `tf.Variables` outside your Python function and modify them within. This also applies to Keras layers, models, and optimizers.
  • Avoid functions that rely heavily on outer Python variables, excluding `tf.Variables` and Keras objects.
  • Prefer functions that take tensors and other TensorFlow types as input.
  • Include as much computation as possible under a single `tf.function` to gain maximum performance. Decorate entire training steps or loops.
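Putting several of these practices together, a common pattern creates the model, optimizer, and variables outside the function and wraps the entire training step in a single `tf.function`. This is a sketch under simple assumptions; the model architecture, loss, and data below are placeholders:

```python
import tensorflow as tf

# Variables and Keras objects are created *outside* the function...
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()

# ...and the whole training step is wrapped in one tf.function,
# so the forward pass, gradients, and update all run as one graph.
@tf.function
def train_step(x, y):
  with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x))
  grads = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(grads, model.trainable_variables))
  return loss

x = tf.random.normal([8, 4])
y = tf.random.normal([8, 1])
loss = train_step(x, y)
```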

Seeing the Speed-Up

The performance boost from `tf.function` varies, but it’s often substantial. For small computations, the overhead of graph creation might dominate, but for larger, repeated tasks, the benefits are clear. Here’s a comparison:

x = tf.random.uniform(shape=[10, 10], minval=-1, maxval=2, dtype=tf.dtypes.int32)

def power(x, y):
  result = tf.eye(10, dtype=tf.dtypes.int32)
  for _ in range(y):
    result = tf.matmul(x, result)
  return result

print("Eager execution:", timeit.timeit(lambda: power(x, 100), number=1000), "seconds")
Eager execution: 4.1027931490000356 seconds
power_as_graph = tf.function(power)
print("Graph execution:", timeit.timeit(lambda: power_as_graph(x, 100), number=1000), "seconds")
Graph execution: 0.7951284349999241 seconds

This simple example clearly shows a significant speedup with graph execution. For even greater boosts, especially with TensorFlow control flow and many small tensors, try `tf.function(jit_compile=True)` to leverage XLA compilation.
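Enabling XLA is a one-flag change. As a sketch, the same `power` function from above can be compiled by passing `jit_compile=True` to `tf.function`; the results match the uncompiled version (actual speedups depend on your hardware and workload):

```python
import tensorflow as tf

def power(x, y):
  result = tf.eye(10, dtype=tf.dtypes.int32)
  for _ in range(y):
    result = tf.matmul(x, result)
  return result

x = tf.random.uniform(shape=[10, 10], minval=-1, maxval=2, dtype=tf.dtypes.int32)

# jit_compile=True asks TensorFlow to compile the whole function with XLA.
power_as_xla = tf.function(power, jit_compile=True)

plain = power(x, 3)
compiled = power_as_xla(x, 3)
```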

When is a `tf.function` Tracing?

Tracing is the process where `tf.function` converts your Python code into a `tf.Graph`. It incurs overhead, so you want to minimize frequent retracing. A simple `print` statement inside your `tf.function` can act as an indicator; it will execute every time the function traces.

@tf.function
def a_function_with_python_side_effect(x):
  print("Tracing!")  # An eager-only side effect.
  return x * x + tf.constant(2)

# This is traced the first time.
print(a_function_with_python_side_effect(tf.constant(2)))
# The second time through, you won't see the side effect.
print(a_function_with_python_side_effect(tf.constant(3)))
Tracing!
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(11, shape=(), dtype=int32)

Notice how new Python arguments, not TensorFlow tensors, can trigger retracing, as `tf.function` interprets them as potentially new hyperparameters requiring a new graph:

# This retraces each time the Python argument changes,
# as a Python argument could be an epoch count or other
# hyperparameter.
print(a_function_with_python_side_effect(2))
print(a_function_with_python_side_effect(3))
Tracing!
tf.Tensor(6, shape=(), dtype=int32)
Tracing!
tf.Tensor(11, shape=(), dtype=int32)
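You can observe this retracing directly with `experimental_get_tracing_count()`, which reports how many graphs a `tf.function` has traced so far. A small sketch (`square_plus_two` is an illustrative function):

```python
import tensorflow as tf

@tf.function
def square_plus_two(x):
  return x * x + tf.constant(2)

# Tensor arguments with the same dtype and shape share one traced graph.
square_plus_two(tf.constant(2))
square_plus_two(tf.constant(3))
count_after_tensors = square_plus_two.experimental_get_tracing_count()  # 1

# Each distinct Python value triggers a fresh trace.
square_plus_two(2)
square_plus_two(3)
count_after_python = square_plus_two.experimental_get_tracing_count()  # 3
```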

Conclusion

If your TensorFlow code is slow, harnessing the power of graph execution with `tf.function` is often the most effective solution. By understanding the nuances of eager vs. graph modes, leveraging AutoGraph, and applying best practices, you can significantly accelerate your machine learning models, making them faster, more efficient, and readily deployable. Dive deeper into the `tf.function` guide for more advanced techniques and complete specifications to truly master TensorFlow performance.

Originally published on the TensorFlow website, this article appears here under a new headline and is licensed under CC BY 4.0. Code samples shared under the Apache 2.0 License.
