We're using the same StyleGAN image-based editing techniques you all know and love, and we're applying them at the frame level - without any components that try to enforce temporal consistency! How come it's so smooth? Why is there so little jitter? The answer is simple - the original video is already temporally consistent. Instead of reaching for the wrong tools, ruining that consistency, and then working hard to restore it, we analyze the different components of a GAN editing pipeline, determine which ones preserve consistency, and just use those!
Encoders are smooth at the local scale, while generator tuning works well globally and maintains alignment. Together, they almost perfectly preserve the original video's consistency.
Even without optical flow or other temporal components, our method can tackle complex scenes with motion that destabilizes the state-of-the-art.
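Conceptually, the per-frame pipeline looks something like the minimal sketch below. It assumes an e4e-style encoder, a generator fine-tuned on the input video (PTI-style), and a precomputed latent edit direction; the function and argument names are hypothetical placeholders, not this repository's API.

import torch

@torch.no_grad()
def edit_video_frames(frames, encoder, tuned_generator, edit_direction, strength=1.0):
    """Edit a video frame-by-frame, with no temporal smoothing.

    frames:          iterable of aligned face crops as (1, 3, H, W) tensors
    encoder:         image-to-latent encoder (e.g. e4e-style), assumed locally
                     smooth: similar frames map to similar latents
    tuned_generator: StyleGAN generator fine-tuned on this video (PTI-style),
                     assumed globally consistent and alignment-preserving
    edit_direction:  latent-space editing direction (e.g. from InterFaceGAN or
                     StyleCLIP), shared across all frames
    """
    edited = []
    for frame in frames:
        w = encoder(frame)                      # invert this frame to a latent code
        w_edit = w + strength * edit_direction  # apply the same edit to every frame
        edited.append(tuned_generator(w_edit))  # synthesize the edited frame
    return edited

Because the edit direction and the tuned generator weights are shared across all frames, whatever consistency the output has comes from the encoder's local smoothness and the tuned generator's global coherence, not from any explicit temporal machinery.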
If you find our work useful, please cite our paper:
@misc{tzaban2022stitch,
    title={Stitch it in Time: GAN-Based Facial Editing of Real Videos},
    author={Rotem Tzaban and Ron Mokady and Rinon Gal and Amit H. Bermano and Daniel Cohen-Or},
    year={2022},
    eprint={2201.08361},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}