Stitch it in Time: GAN-Based Facial Editing of Real Videos

Tel Aviv University

Our method can apply semantic manipulations to real facial videos without requiring any temporal components.


Abstract

The ability of Generative Adversarial Networks to encode rich semantics within their latent space has been widely adopted for facial image editing. However, replicating their success with videos has proven challenging. Sets of high-quality facial videos are lacking, and working with videos introduces a fundamental barrier to overcome: temporal coherency. We propose that this barrier is largely artificial. The source video is already temporally coherent, and deviations from this state arise in part due to careless treatment of individual components in the editing pipeline. We leverage the natural alignment of StyleGAN and the tendency of neural networks to learn low-frequency functions, and demonstrate that they provide a strongly consistent prior. We draw on these insights and propose a framework for semantic editing of faces in videos, demonstrating significant improvements over the current state of the art. Our method produces meaningful face manipulations, maintains a higher degree of temporal consistency, and can be applied to challenging, high-quality talking-head videos that current methods struggle with.

So, what's the gist?

Video outputs at different stages of our editing pipeline

We're using the same StyleGAN image-based editing techniques you all know and love, and we're doing it at the frame level, without any components that try to enforce temporal consistency! How come it's so smooth? Why is there so little jitter? The answer is simple: the original video is already temporally consistent. Instead of using the wrong tools, ruining consistency, and then working hard to restore it, we analyze the different components of a GAN editing pipeline, determine which ones preserve consistency, and just use those!
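To make the "already consistent" point concrete, here is a minimal toy sketch. It is not our actual pipeline. It assumes the edit is a single linear direction added to every frame's latent code, and it fakes the per-frame codes with a smooth synthetic trajectory. The names `edit_latents` and `temporal_jitter` are made up for this illustration, and everything happens in latent space only; no generator is involved.

```python
import numpy as np

def edit_latents(latents, direction, strength):
    """Shift every per-frame latent code by the same offset (assumed linear edit)."""
    unit = direction / np.linalg.norm(direction)
    return latents + strength * unit

def temporal_jitter(sequence):
    """Mean L2 distance between the codes of consecutive frames."""
    diffs = np.diff(sequence, axis=0)
    return float(np.linalg.norm(diffs, axis=1).mean())

# Synthetic stand-in for per-frame latent codes: 30 frames, 8 dimensions,
# moving smoothly over time (a real encoder would produce these).
t = np.linspace(0.0, 1.0, 30)[:, None]
latents = np.sin(2 * np.pi * t * np.arange(1, 9)[None, :] * 0.1)

# Hypothetical edit direction (e.g. something like "+Smile").
direction = np.ones(8)
edited = edit_latents(latents, direction, strength=3.0)

print("jitter before edit:", temporal_jitter(latents))
print("jitter after edit: ", temporal_jitter(edited))
```

Because the same offset is added to every frame, the frame-to-frame differences are untouched, so the two jitter values match up to floating-point error. A real generator is non-linear, so this says nothing about pixel-space smoothness on its own; it only shows that a frame-level latent edit does not have to introduce inconsistency by itself.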


Effects of removing / replacing pipeline components

tl;dr?

Encoders are smooth at the local scale, while generator tuning works great globally and maintains alignment. Together, they almost perfectly preserve the original video's consistency.
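What does "smooth at the local scale" buy you? Here is a small toy sketch, under loud assumptions: the "encoder" is a single fixed random linear layer followed by tanh (a hypothetical stand-in, nothing like a real inversion network), and the "frames" are 16-dimensional vectors that drift slowly over time. For this toy map, the largest jump between consecutive codes is bounded by the largest jump between consecutive frames times the spectral norm of the weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a learned encoder: one fixed linear layer + tanh.
W = rng.normal(size=(16, 4)) / 4.0

def toy_encoder(frames):
    """Map each frame (row vector) to a 4-D code with the same fixed weights."""
    return np.tanh(frames @ W)

def max_step(sequence):
    """Largest L2 distance between consecutive rows of a sequence."""
    return float(np.linalg.norm(np.diff(sequence, axis=0), axis=1).max())

# Synthetic "video": 50 frames of 16 values each, changing smoothly over time.
t = np.linspace(0.0, 1.0, 50)[:, None]
frames = np.cos(2 * np.pi * t + np.arange(16)[None, :])

codes = toy_encoder(frames)

# tanh is 1-Lipschitz, so the spectral norm of W bounds how much the
# toy encoder can amplify a small change between neighbouring frames.
lipschitz = float(np.linalg.norm(W, ord=2))

print("max frame step:", max_step(frames))
print("max code step: ", max_step(codes))
print("bound:         ", lipschitz * max_step(frames))
```

Deep encoders do not come with such a clean bound, so treat this as intuition only: a fixed, smooth function applied frame by frame turns small input changes into small output changes.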

What does it look like, compared to the alternatives?

Even without optical flow or other temporal components, our method can tackle complex scenes with motion that destabilizes state-of-the-art methods.


+Smile

+Young

+Old

+Old

Our method can be applied not only to real videos, but also to animated media!

Additional Examples

BibTeX

If you find our work useful, please cite our paper:

@misc{tzaban2022stitch,
      title={Stitch it in Time: GAN-Based Facial Editing of Real Videos},
      author={Rotem Tzaban and Ron Mokady and Rinon Gal and Amit H. Bermano and Daniel Cohen-Or},
      year={2022},
      eprint={2201.08361},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}