The Functional Paradigm of JAX and Flax
Unlike PyTorch or TensorFlow, JAX treats models as pure functions rather than stateful objects. This requires a shift in how developers handle randomness and model state.
- Explicit PRNG: JAX eliminates global random states. Developers must pass a `PRNGKey` to functions and split it using `jax.random.split` to generate independent, reproducible streams of randomness. Given the same initial key, every random draw in an experiment can be reproduced.
- Model State Management: In Flax, parameters and optimizer states are stored in plain Python data structures such as nested dictionaries. The `TrainState` utility acts as a container for the model's `apply_fn`, current `params`, and the optimizer (`tx`).
- Defining Architectures: Using the `@nn.compact` decorator allows for inline definition of sub-modules within the `__call__` method. Parameters are created when `init` runs the forward pass for the first time, which keeps the code concise and readable.
Bridging PyTorch and JAX
While JAX handles computation, it lacks built-in dataset management. The author demonstrates a common, effective pattern: using torchvision for data downloading and preprocessing, while overriding the DataLoader's collate_fn to output NumPy arrays instead of PyTorch tensors. This allows JAX to consume the data directly without unnecessary overhead.
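A collate function of this kind might look like the sketch below. The name `numpy_collate` and the fake batch are illustrative, and the sketch assumes each dataset item is already a NumPy array paired with an integer label (for example via a transform that converts images with `np.array`). With PyTorch installed, the function would be passed as the `collate_fn` argument of `torch.utils.data.DataLoader`.

```python
import numpy as np


def numpy_collate(batch):
    """Stack a list of samples into NumPy arrays instead of PyTorch tensors."""
    first = batch[0]
    if isinstance(first, np.ndarray):
        # A list of arrays becomes one array with a leading batch axis.
        return np.stack(batch)
    if isinstance(first, (tuple, list)):
        # A list of (image, label) pairs is transposed and collated per field.
        return [numpy_collate(samples) for samples in zip(*batch)]
    # Scalars such as integer labels become a 1-D array.
    return np.array(batch)


# Stand-in for what a dataset would yield: (image, label) pairs.
fake_batch = [
    (np.zeros((28, 28), dtype=np.float32), 3),
    (np.ones((28, 28), dtype=np.float32), 7),
]
images, labels = numpy_collate(fake_batch)

# Hypothetical wiring (requires torch; shown only as a comment):
# loader = torch.utils.data.DataLoader(dataset, batch_size=64,
#                                      collate_fn=numpy_collate)
```

JAX functions accept NumPy arrays directly, so batches produced this way can be fed to a jitted step without an explicit tensor conversion.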
Training and Activation Functions
To evaluate performance, the author implements a multi-layer perceptron (MLP) on the FashionMNIST dataset, comparing six different activation functions: Sigmoid, Tanh, ReLU, LeakyReLU, ELU, and Swish.
- Numerical Stability: The model outputs raw logits rather than probabilities. This is standard practice because the cross-entropy loss function is numerically more stable when it handles `log_softmax` internally.
- Initialization: The author uses `lecun_uniform` initialization to maintain parity with PyTorch defaults, which is critical for training stability in deep networks.
- Performance: The training loop utilizes `@jit` (Just-In-Time compilation) to accelerate the `train_step` and `eval_step` functions, demonstrating how JAX achieves high performance through XLA compilation.
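The sketch below ties these three points together in plain JAX. It is not the author's training loop: to stay short it replaces Flax's `TrainState` and an optax optimizer with a list of parameter dictionaries and hand-written SGD, and the layer sizes, learning rate, and random stand-in data are assumptions.

```python
import jax
import jax.numpy as jnp

LEARNING_RATE = 0.05  # assumed value for illustration


def init_params(key, sizes):
    """One weight/bias pair per layer, weights drawn with lecun_uniform."""
    init = jax.nn.initializers.lecun_uniform()
    keys = jax.random.split(key, len(sizes) - 1)
    return [
        {"w": init(k, (n_in, n_out), jnp.float32), "b": jnp.zeros(n_out)}
        for k, n_in, n_out in zip(keys, sizes[:-1], sizes[1:])
    ]


def forward(params, x):
    """Return raw logits; no softmax is applied inside the model."""
    for layer in params[:-1]:
        x = jax.nn.relu(x @ layer["w"] + layer["b"])
    return x @ params[-1]["w"] + params[-1]["b"]


def cross_entropy(logits, labels):
    """Mean cross-entropy computed through log_softmax for stability."""
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    picked = jnp.take_along_axis(log_probs, labels[:, None], axis=-1)
    return -picked.mean()


def loss_fn(params, x, y):
    return cross_entropy(forward(params, x), y)


@jax.jit
def train_step(params, x, y):
    """One SGD update; compiled by XLA on the first call."""
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    params = jax.tree_util.tree_map(
        lambda p, g: p - LEARNING_RATE * g, params, grads
    )
    return params, loss


@jax.jit
def eval_step(params, x, y):
    """Fraction of correct predictions on a batch."""
    return (forward(params, x).argmax(axis=-1) == y).mean()


# Random stand-in for a batch of flattened 28x28 images with 10 classes.
key = jax.random.PRNGKey(0)
init_key, x_key, y_key = jax.random.split(key, 3)
params = init_params(init_key, [784, 64, 10])
x = jax.random.normal(x_key, (32, 784))
y = jax.random.randint(y_key, (32,), 0, 10)

params, first_loss = train_step(params, x, y)
for _ in range(20):
    params, loss = train_step(params, x, y)
accuracy = eval_step(params, x, y)
```

Only the first call to each jitted function pays the tracing and compilation cost; later calls with the same shapes reuse the compiled program. Computing the loss through `log_softmax` also keeps it finite for very large logits, where exponentiating first would overflow.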