Generating Videos with Scene Dynamics
Carl Vondrick
MIT
Hamed Pirsiavash
University of Maryland Baltimore County
Antonio Torralba
MIT
NIPS 2016
Abstract
We capitalize on large amounts of unlabeled video in order to learn a model of scene dynamics for both video recognition tasks (e.g. action classification) and video generation tasks (e.g. future prediction). We propose a generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background. Experiments suggest this model can generate tiny videos up to a second at full frame rate better than simple baselines, and we show its utility at predicting plausible futures of static images. Moreover, experiments and visualizations show the model internally learns useful features for recognizing actions with minimal supervision, suggesting scene dynamics are a promising signal for representation learning. We believe generative video models can impact many applications in video understanding and simulation.
Video Generations
Below are some selected videos that are generated by our model. These videos are not real; they are hallucinated by a generative video model. While they are not photo-realistic, the motions are fairly reasonable for the scene category the model was trained on.
Conditional Video Generations
We can also use the model to add animations to static images by training it to generate the future. Of course, the future is uncertain, so the model rarely generates the "correct" future, but we think the predictions have some plausibility.
More Results
For more results, check out the links below. Warning: The pages are fairly large (100 MB), so please browse on a computer and not a phone, and wait for the GIFs to download completely.
Video Representation Learning
Learning models that generate videos may also be a promising way to learn
representations. For example, we can train generators on a large repository of unlabeled videos, then fine-tune the discriminator on a small
labeled dataset in order to recognize some actions with minimal
supervision. We report accuracy on UCF101 and compare to other unsupervised learning methods for videos:
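To make the fine-tuning recipe concrete, here is a minimal PyTorch-style sketch (not the released Torch7 code): the pretrained convolutional trunk of the discriminator is reused as a feature extractor, and a small action-classification head is trained on a labeled set such as UCF101 (101 action classes). The names discriminator_features, feature_dim, and train_loader are placeholders.

# Sketch (PyTorch-style): fine-tune a pretrained video discriminator
# trunk for action classification on a small labeled dataset.
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    def __init__(self, discriminator_features, feature_dim, num_classes=101):
        super().__init__()
        # `discriminator_features` is assumed to be the pretrained
        # convolutional trunk (everything but the final real/fake layer).
        self.features = discriminator_features
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, video):  # video: (batch, 3, time, height, width)
        h = self.features(video)
        h = h.flatten(1)               # collapse spatio-temporal features
        return self.head(h)

def finetune(model, train_loader, epochs=10, lr=1e-4):
    # `train_loader` is assumed to yield (video, action_label) batches.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for video, label in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(video), label)
            loss.backward()
            optimizer.step()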
We can also visualize what emerges in the representation for predicting the future. While not all units are semantic, we found a few hidden units that fire on objects that are sources of motion, such as people or train tracks. Since generating the future requires understanding moving objects, the network may learn to recognize these objects internally, even though it is not supervised to do so.
Hidden unit for "people"
Hidden unit for "train tracks"
The images above highlight regions where a particular convolutional hidden unit fires.
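One simple way to produce such visualizations (a sketch; the trunk, unit index, and threshold are illustrative choices, not the exact procedure used for the figures) is to take the activation map of a single channel, upsample it to the input resolution, threshold it, and keep only the highlighted regions.

# Sketch: highlight regions where one convolutional hidden unit fires.
import torch
import torch.nn.functional as F

def unit_heatmap(trunk, video, unit, threshold=0.5):
    # video: (1, 3, time, height, width); trunk output: (1, channels, t, h, w)
    with torch.no_grad():
        act = trunk(video)[:, unit]                      # (1, t, h, w)
    act = F.interpolate(act.unsqueeze(1), size=video.shape[2:],
                        mode="trilinear", align_corners=False)
    act = (act - act.min()) / (act.max() - act.min() + 1e-8)
    mask = (act > threshold).float()                     # binary "fires here" mask
    return video * mask                                  # keep only highlighted regions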
Brief Technical Overview
Our approach builds on generative image models that leverage adversarial learning, which we apply to video. The basic idea is to pit two deep networks against each other. One network ("the generator") tries to generate a synthetic video, and another network ("the discriminator") tries to discriminate synthetic from real videos. The generator is trained to fool the discriminator.
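To make the adversarial game concrete, here is a minimal PyTorch-style training step (a sketch, not the released Torch7 implementation); the generator, discriminator, optimizers, and noise dimension are assumed to be set up elsewhere, and the discriminator is assumed to output one real/fake logit per video.

# Sketch of one adversarial training step: the discriminator learns to
# separate real from generated videos, and the generator is updated to
# fool it.
import torch
import torch.nn.functional as F

def train_step(generator, discriminator, real_videos, opt_g, opt_d, noise_dim=100):
    batch = real_videos.size(0)
    noise = torch.randn(batch, noise_dim)

    # 1) Discriminator step: push real videos toward 1, generated videos toward 0.
    fake_videos = generator(noise).detach()
    d_loss = (F.binary_cross_entropy_with_logits(discriminator(real_videos),
                                                 torch.ones(batch, 1)) +
              F.binary_cross_entropy_with_logits(discriminator(fake_videos),
                                                 torch.zeros(batch, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Generator step: try to make the discriminator output "real".
    fake_videos = generator(noise)
    g_loss = F.binary_cross_entropy_with_logits(discriminator(fake_videos),
                                                torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()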
For the generator, we use a deep convolutional network that takes low-dimensional random noise as input and outputs a video. To model video, we use spatio-temporal up-convolutions (2D for space, 1D for time). The generator also models the background separately from the foreground: the network produces a static background (replicated over time) and a moving foreground, which are combined using a mask. We illustrate this below:
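As a rough sketch of this two-stream composition (PyTorch-style, with illustrative layer sizes rather than the paper's exact architecture): a 3D up-convolutional stream produces the moving foreground and a mask, a 2D stream produces a single background image that is replicated over time, and the output is mask * foreground + (1 - mask) * background.

# Sketch of a two-stream video generator with a foreground mask.
import torch
import torch.nn as nn

class TwoStreamGenerator(nn.Module):
    def __init__(self, noise_dim=100):
        super().__init__()
        self.fg = nn.Sequential(  # foreground: spatio-temporal up-convolutions
            nn.ConvTranspose3d(noise_dim, 256, kernel_size=(2, 4, 4)),
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(256, 64, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.fg_rgb = nn.ConvTranspose3d(64, 3, 4, stride=2, padding=1)
        self.fg_mask = nn.ConvTranspose3d(64, 1, 4, stride=2, padding=1)
        self.bg = nn.Sequential(  # background: 2D up-convolutions only
            nn.ConvTranspose2d(noise_dim, 256, kernel_size=4),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 3, 4, stride=4),
            nn.Tanh(),
        )

    def forward(self, z):  # z: (batch, noise_dim)
        h = self.fg(z.view(z.size(0), -1, 1, 1, 1))
        foreground = torch.tanh(self.fg_rgb(h))
        mask = torch.sigmoid(self.fg_mask(h))            # (batch, 1, T, H, W)
        background = self.bg(z.view(z.size(0), -1, 1, 1))
        background = background.unsqueeze(2).expand_as(foreground)  # replicate over time
        return mask * foreground + (1 - mask) * background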
We simultaneously train a discriminator network to distinguish real videos from fake videos. We use a deep spatio-temporal convolutional network for the discriminator, visualized here.
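A minimal sketch of such a discriminator (layer sizes are illustrative, not the paper's exact architecture): a stack of 3D convolutions over space and time, followed by a single real/fake logit.

# Sketch of a spatio-temporal convolutional discriminator.
import torch.nn as nn

class VideoDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(128, 256, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.classifier = nn.Linear(256, 1)

    def forward(self, video):  # video: (batch, 3, time, height, width)
        h = self.net(video)
        h = h.mean(dim=(2, 3, 4))   # global average over space-time
        return self.classifier(h)   # real/fake logit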
We downloaded two years of videos from Flickr, which we stabilize, automatically filter by scene category, and use for training.
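One plausible reading of the SIFT+RANSAC stabilization step, sketched with OpenCV (the exact procedure and thresholds here are assumptions, not the released pipeline): match SIFT keypoints between consecutive frames, fit a homography with RANSAC, and warp each frame onto the previous one.

# Sketch: align one frame to the previous frame with SIFT + RANSAC.
import cv2
import numpy as np

def stabilize_pair(prev_frame, next_frame):
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY), None)
    kp2, des2 = sift.detectAndCompute(cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY), None)
    matches = cv2.BFMatcher().knnMatch(des2, des1, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # ratio test
    src = np.float32([kp2[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = prev_frame.shape[:2]
    return cv2.warpPerspective(next_frame, H, (w, h))  # next frame aligned to prev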
To predict the future, one can attach an encoder to the generator, visualized here.
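A minimal sketch of that idea: a small 2D convolutional encoder (illustrative layers, not the paper's exact architecture) maps the input frame to a latent code, which replaces the random noise fed to the generator.

# Sketch: encode a static frame into a latent code for conditional generation.
import torch.nn as nn

class FrameEncoder(nn.Module):
    def __init__(self, code_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, code_dim)

    def forward(self, frame):            # frame: (batch, 3, height, width)
        h = self.net(frame).flatten(1)
        return self.fc(h)                # latent code fed to the generator

# Usage sketch: predicted_video = generator(FrameEncoder()(first_frame))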
Notes on Failures & Limitations
Adversarial learning is tricky to get right, and our model has several limitations we wish to point out.
- The generations are usually distinguishable from real videos. They are also fairly low resolution: 64x64 for 32 frames.
- Evaluation of generative models is hard. We used a psychophysical 2AFC test on Mechanical Turk, asking workers "Which video is more realistic?" We think this evaluation is okay, but it is important for the community to settle on robust automatic evaluation metrics.
- For better generations, we automatically filtered videos by scene category and trained a separate model per category. We used the PlacesCNN on the first few frames to get scene categories (see the filtering sketch after this list).
- The future extrapolations do not always match the first frame very well, which may happen because the bottleneck is too strong.
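A sketch of the scene-filtering step mentioned above (the classifier, its category list, the target category, and the threshold are assumptions; the released pipeline may differ): run a Places-style scene classifier on an early frame and keep the clip only if the top prediction matches the target category.

# Sketch: keep a clip if its early frame is classified as the target scene.
import torch

def keep_clip(scene_model, categories, frame, target="beach", min_prob=0.5):
    # frame: (1, 3, height, width), preprocessed for the classifier
    with torch.no_grad():
        probs = torch.softmax(scene_model(frame), dim=1)[0]
    prob, idx = probs.max(dim=0)
    return categories[int(idx)] == target and prob.item() >= min_prob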
Related Works
This project builds upon many great ideas, which we encourage you to check out. They have very cool results!
Adversarial Learning
Generative Models of Video
Please send us any references we may have accidentally missed.
Code
The code is available on GitHub, implemented in Torch7, and open source. You can also download pretrained models (1 GB zip file).
Data
We are making the stabilized video frames available for others to use. You may download the full dataset or individual parts. Each tarball contains JPEG files, each of which stores 32 frames concatenated vertically, corresponding to roughly one second of video. The videos have been stabilized with SIFT+RANSAC.
The total dataset is 9 TB, which consists of 35 million clips. This is over one year of video.
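For reference, a small sketch of reading one clip from this format (assuming Pillow and NumPy): since the 32 frames are stacked vertically, the image height is 32 times the frame height.

# Sketch: split a vertically concatenated JPEG into its 32 frames.
from PIL import Image
import numpy as np

def load_clip(path, num_frames=32):
    strip = np.array(Image.open(path))           # (32 * H, W, 3)
    frame_height = strip.shape[0] // num_frames
    frames = strip.reshape(num_frames, frame_height, strip.shape[1], 3)
    return frames                                 # (32, H, W, 3) uint8 array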
By Scene:
All Videos:
Or you can download pre-processed and stabilized frames directly:
File Listings:
Videos from "Videos1" are part of the Yahoo Flickr Creative Commons Dataset. Videos from "Videos2" are downloaded by querying Flickr for common tags and English words. There should be no overlap.
If you use this data in your research project, please cite the Yahoo dataset and our paper.
Acknowledgements
We thank Yusuf Aytar for dataset discussions. We thank MIT TIG, especially Garrett Wollman, for troubleshooting issues with storing the 26 terabytes of video. We are grateful to the Torch7 community for answering many questions. NVIDIA donated GPUs used for this research. This work was partially supported by NSF grant #1524817 to AT, the START program at UMBC to HP, and a Google PhD fellowship to CV.