Generate Videos by Slerp-Walking Stable Diffusion Latents

Interpolate random latents with slerp under a fixed prompt to create smooth, hypnotic videos from Stable Diffusion frames (50 inference steps, guidance scale 7.5, 200 interpolation steps per latent pair).

Latent Space Walking Creates Hypnotic Videos

Sample two random latents (shape 1x4x64x64 for 512x512 images), then spherically interpolate (slerp) across 200 steps from init1 to init2. For each interpolated latent, run diffusion conditioned on a fixed text prompt (e.g., "blueberry spaghetti") with classifier-free guidance: concatenate unconditional and conditional embeddings, predict noise with the UNet, apply guidance_scale=7.5, and denoise over num_inference_steps=50 using LMSDiscreteScheduler. Decode the final latents via the VAE to produce one frame per step. Repeat with fresh latent pairs until max_frames=10000 frames are written, saving JPEGs at 90% quality, then stitch with ffmpeg -r 10 -f image2 -s 512x512 -i frame%06d.jpg -vcodec libx264 -crf 10 -pix_fmt yuv420p output.mp4. This random walk yields surreal, morphing visuals without any prompt changes; a sketch of the walk loop is shown below.
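
A minimal sketch of the slerp walk, assuming a diffuse helper like the one described in the next section; text_embeddings, outdir, and diffuse itself are placeholders here, not the script's exact names:

```python
import numpy as np
import torch

def slerp(t, v0, v1, dot_threshold=0.9995):
    """Spherically interpolate between latents v0 and v1 at fraction t in [0, 1]."""
    v0_np, v1_np = v0.cpu().numpy(), v1.cpu().numpy()
    dot = np.sum(v0_np * v1_np) / (np.linalg.norm(v0_np) * np.linalg.norm(v1_np))
    if np.abs(dot) > dot_threshold:
        v2 = (1 - t) * v0_np + t * v1_np          # nearly colinear: plain lerp suffices
    else:
        theta = np.arccos(dot)                    # angle between the two latents
        v2 = (np.sin((1 - t) * theta) * v0_np + np.sin(t * theta) * v1_np) / np.sin(theta)
    return torch.from_numpy(v2).to(v0.device)

num_steps, max_frames = 200, 10000                # interpolation steps per pair, total frame cap
init1 = torch.randn(1, 4, 64, 64, device="cuda")  # 512x512 images use 64x64 latents
init2 = torch.randn(1, 4, 64, 64, device="cuda")
frame = 0
while frame < max_frames:
    for t in np.linspace(0, 1, num_steps):
        latents = slerp(float(t), init1, init2)
        # diffuse() and text_embeddings come from the custom loop in the next section
        image = diffuse(text_embeddings, latents, num_inference_steps=50, guidance_scale=7.5)
        image.save(f"outdir/frame{frame:06d}.jpg", quality=90)
        frame += 1
    init1, init2 = init2, torch.randn_like(init2) # chain the walk to a fresh random target
```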

Custom Diffuse Handles Guidance and Schedulers

Bypass the pipeline for fine control: compute unconditional embeddings from an empty prompt and concatenate them with the conditional embeddings (each 1x77x768). Set timesteps with offset=1 where supported, and eta=0.0 for DDIM compatibility. For each timestep, double the latents for CFG, predict noise_pred, combine it as uncond + guidance_scale*(text - uncond), and step the scheduler to prev_sample. Scale latents by 1/0.18215 before the VAE decode, then clamp and post-process to a uint8 numpy array. LMSDiscreteScheduler gets special handling: initial latents are multiplied by the first sigma, and the model input is divided by sqrt(sigma^2 + 1). Slerp avoids the straight-line artifacts of linear interpolation in high-dimensional latent space, taking theta = arccos(dot) and blending with sin terms, falling back to plain lerp when dot exceeds 0.9995.
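
A condensed sketch of such a diffuse function, assuming unet, vae, and an LMSDiscreteScheduler scheduler are already loaded as diffusers components and that text_embeddings is the 2x77x768 unconditional+conditional concat; the calls shown match recent diffusers releases (older versions stepped the LMS scheduler by index rather than timestep):

```python
import torch
from PIL import Image

@torch.no_grad()
def diffuse(text_embeddings, latents, num_inference_steps=50, guidance_scale=7.5):
    scheduler.set_timesteps(num_inference_steps)
    latents = latents * scheduler.sigmas[0]                    # LMS: scale init noise by first sigma
    for i, t in enumerate(scheduler.timesteps):
        latent_input = torch.cat([latents] * 2)                # double the batch for CFG
        sigma = scheduler.sigmas[i]
        latent_input = latent_input / ((sigma**2 + 1) ** 0.5)  # same as scheduler.scale_model_input
        noise_pred = unet(latent_input, t, encoder_hidden_states=text_embeddings).sample
        uncond, text = noise_pred.chunk(2)
        noise_pred = uncond + guidance_scale * (text - uncond) # classifier-free guidance
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    image = vae.decode(latents / 0.18215).sample               # undo SD latent scaling, decode
    image = (image / 2 + 0.5).clamp(0, 1)                      # [-1, 1] -> [0, 1]
    arr = (image[0].permute(1, 2, 0).cpu().float().numpy() * 255).astype("uint8")
    return Image.fromarray(arr)
```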

Setup, Params, and Optimizations

Requires a Hugging Face access token for CompVis/stable-diffusion-v1-3-diffusers (or v1-4), the diffusers library, torch, einops, PIL, fire (pip install fire), and ~10GB of VRAM at 512x512. Run: python stablediffusionwalk.py --prompt "blueberry spaghetti" --name outdir --num_steps 200 --num_inference_steps 50 --guidance_scale 7.5 --seed 1337 --max_frames 10000. Wrap diffuse in torch.autocast('cuda') for a half-precision speedup. Higher inference steps (100-200) improve quality; guidance in the 3-10 range tunes prompt adherence. Users have extended the script to prompt interpolation, fp16 models (fixing dtype mismatches by upgrading diffusers/transformers/scipy), and pipeline simplifications such as pipe(prompt, latents=init, ...), as sketched below.
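
A possible minimal form of that pipeline simplification, assuming diffusers' StableDiffusionPipeline, which accepts pre-made latents through its latents argument (the v1-4 model id and fp16 dtype are one common setup, not the script's requirement):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# A slerped latent would be passed here in place of this random draw.
init = torch.randn(1, 4, 64, 64, device="cuda", dtype=torch.float16)
image = pipe("blueberry spaghetti", latents=init,
             num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("frame000000.jpg", quality=90)
```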
