Learning to play Minecraft with Video PreTraining

By neub9

The internet is a treasure trove of publicly available videos that offer valuable learning opportunities. From witnessing a captivating presentation to observing a digital artist at work or learning from a skilled Minecraft player, there is a great deal to take in. However, while these videos capture what happened, they rarely show precisely how it was accomplished: the exact sequence of mouse movements and key presses, for instance. This is a real obstacle if we aspire to build large-scale foundation models in these domains, akin to GPT in language. In the language domain, the "action labels" are simply the next words in a sentence, but for videos there is no comparably direct way to define them.

To exploit the abundance of unlabeled video data on the internet, we introduce a novel yet straightforward semi-supervised imitation learning method called Video PreTraining (VPT). We begin by gathering a small dataset from contractors, recording not just their video but also the actions they take: key presses and mouse movements. With this dataset, we train an inverse dynamics model (IDM) that predicts the action taken at each step of a video. Crucially, the IDM can use both past and future frames to make each prediction, which makes the task far less data-intensive than behavioral cloning, where the model sees only past frames and must therefore infer the player's intentions and strategies. We then use the trained IDM to label a much larger set of online videos, and learn a policy from those pseudo-labels via behavioral cloning.
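The three-step pipeline (train an IDM on a small labeled set, pseudo-label web videos with it, then behaviorally clone from the pseudo-labels) can be sketched with a deliberately simplified toy model. Everything here is an illustrative assumption, not VPT's actual implementation: frames are random feature vectors, the "true" action is a linear function of the previous, current, and next frame (so the IDM, which sees the future, has an easier job than a causal policy), and least-squares regression stands in for the neural networks used in the real system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy world: frames are 4-d feature vectors; the "true" action at step t
# depends on frames t-1, t, and t+1 (hypothetical stand-in for real video).
T, D = 200, 4
frames = rng.normal(size=(T, D))
W_true = rng.normal(size=3 * D)

def idm_context(frames, t):
    # The IDM conditions on past, present, AND future frames.
    return np.concatenate([frames[t - 1], frames[t], frames[t + 1]])

# Small "contractor" dataset: frames paired with recorded actions.
X_labeled = np.stack([idm_context(frames, t) for t in range(1, T - 1)])
actions = X_labeled @ W_true

# Step 1: train the IDM on the small labeled set.
W_idm, *_ = np.linalg.lstsq(X_labeled, actions, rcond=None)

# Step 2: pseudo-label a much larger unlabeled "web video" with the IDM.
T_web = 1000
web_frames = rng.normal(size=(T_web, D))
X_web = np.stack([idm_context(web_frames, t) for t in range(1, T_web - 1)])
pseudo_actions = X_web @ W_idm

# Step 3: behavioral cloning on the pseudo-labels. The policy is causal:
# it only sees past and present frames, never the future.
X_bc = np.stack(
    [np.concatenate([web_frames[t - 1], web_frames[t]])
     for t in range(1, T_web - 1)]
)
W_policy, *_ = np.linalg.lstsq(X_bc, pseudo_actions, rcond=None)
```

Because the IDM gets the future frame, it can recover the action-generating function almost exactly from few labels, while the causal policy must cope with the missing information; this asymmetry is the intuition behind training the non-causal IDM first and spending the cheap pseudo-labels on the harder causal policy.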
