Last month, Google’s GameNGen AI model showed that generalized image diffusion techniques can be used to generate a passable, playable version of Doom. Now, researchers are using some similar techniques with a model called MarioVGG to see if an AI model can generate plausible video of Super Mario Bros. in response to user inputs.
The results of the MarioVGG model—available as a pre-print paper published by the crypto-adjacent AI company Virtuals Protocol—still display a lot of apparent glitches, and it’s too slow for anything approaching real-time gameplay at the moment. But the results show how even a limited model can infer some impressive physics and gameplay dynamics just from studying a bit of video and input data.
The researchers hope this represents a first step toward “producing and demonstrating a reliable and controllable video game generator,” or possibly even “replacing game development and game engines completely using video generation models” in the future.
Watching 737,000 frames of Mario
To train their model, the MarioVGG researchers (GitHub users erniechew and Brian Lim are listed as contributors) started with a public data set of Super Mario Bros. gameplay containing 280 “levels'” worth of input and image data arranged for machine-learning purposes (level 1-1 was removed from the training data so images from it could be used in the evaluation). The more than 737,000 individual frames in that data set were “preprocessed” into 35 frame chunks so the model could start to learn what the immediate results of various inputs generally looked like.
To “simplify the gameplay situation,” the researchers decided to focus only on two potential inputs in the data set: “run right” and “run right and jump.” Even this limited movement set presented some difficulties for the machine-learning system, though, since the preprocessor had to look backward for a few frames before a jump to figure out if and when the “run” started. Any jumps that included mid-air adjustments (i.e., the “left” button) also had to be thrown out because “this would introduce noise to the training dataset,” the researchers write.
After preprocessing (and about 48 hours of training on a single RTX 4090 graphics card), the researchers used a standard convolution and denoising process to generate new frames of video from a static starting game image and a text input (either “run” or “jump” in this limited case). While these generated sequences only last for a few frames, the last frame of one sequence can be used as the first of a new sequence, feasibly creating gameplay videos of any length that still show “coherent and consistent gameplay,” according to the researchers.