Wombo is a nice app that lets you create short videos of people singing certain songs. You simply upload a photo, select a song and out pops a video. Wombo is one of many deepfake apps, but this one is getting quite a bit of traction. I have been experimenting with different types of deepfake models and I think I can recognise most of them. Wombo calls itself a lip sync app, but it could do much more. I know something of the strengths and weaknesses of deepfake models and how to get the most out of them. I think Wombo can be used to make anything sing and dance. To prove this, I am going to use the thing closest to me: my coffee cup. Let's see if we can make this fellow dance:
Wombo most likely makes use of the First Order Motion Model (FOMM). This is a motion transfer deep learning model first published in 2019. The model takes a single picture and a single video and tries to transfer the motion in the video to the picture. This results in the picture becoming animated, moving or acting as the person in the video did. The interesting thing about this model is that it's not necessarily a deepfake model, for it transfers any motion. I use the term "deepfake" rather loosely here. It is not a well-defined term, but what I mean is that the model can do much more than lip sync. The model has been trained on a variety of objects including faces. This makes the model extremely versatile and makes it possible to make other objects move, or to make animations. Even animals or animated figures can suddenly sing and dance.
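To make the idea of motion transfer a bit more concrete, here is a toy sketch in Python. It is not the real FOMM: the published model learns its own keypoints and warps the picture with a neural network, and none of that is shown here. The sketch only illustrates the basic idea of moving points on the picture by the same amount that the matching points move in the video. The function name, the keypoint values and the `movement_scale` parameter are all made up for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def transfer_relative_motion(
    picture_kp: List[Point],
    video_first_kp: List[Point],
    video_frame_kp: List[Point],
    movement_scale: float = 1.0,
) -> List[Point]:
    """Shift each picture keypoint by the displacement of the matching video keypoint.

    The displacement is measured between the first video frame and the
    current video frame, so the picture keeps its own shape and only
    borrows the movement.
    """
    moved = []
    for (px, py), (x0, y0), (xt, yt) in zip(picture_kp, video_first_kp, video_frame_kp):
        dx = (xt - x0) * movement_scale
        dy = (yt - y0) * movement_scale
        moved.append((px + dx, py + dy))
    return moved


# Hypothetical keypoints in normalized image coordinates (0.0 to 1.0).
cup_kp = [(0.5, 0.5), (0.5, 0.75)]          # e.g. rim and base of the cup
singer_first = [(0.25, 0.25), (0.25, 0.5)]  # singer in the first video frame
singer_later = [(0.5, 0.25), (0.25, 0.625)] # singer a few frames later

animated_cup_kp = transfer_relative_motion(cup_kp, singer_first, singer_later)
print(animated_cup_kp)
```

Because only the difference between video frames is used, the singer and the cup do not have to sit in the same place in their images. They only have to move in a way that makes sense for both.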
The results can vary. Some inputs transfer really well, but others come out warped and distorted. So what can you do to make the most lively animation? The app accepts any picture, so in theory I should be able to make my cup of coffee dance. This is fun, but the result won’t look realistic.
Notice that the teapot behind it moves along? That's because the model does not know that the teapot is separate from the coffee cup. We can help the model out by making the background as clean as possible. Here I made a little stage for my coffee cup to perform in and now the result is much smoother.
The cup is now nicely swinging to the beat in his little saucer. So the first tip is: create a clean background for your Wombo star to perform in. But I wonder if we can do even better?
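I built a physical stage, but you could also clean up the background digitally. Below is a minimal sketch, assuming you already have a mask that marks which pixels belong to your subject. The tiny hand-written image and mask are placeholders, and `flatten_background` is just a name I picked. For a real photo you would load the pixels with an image library and get the mask from an editing or segmentation tool.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def flatten_background(
    image: Image,
    subject_mask: List[List[bool]],
    fill: Pixel = (255, 255, 255),
) -> Image:
    """Return a copy of the image where every pixel outside the mask is set to `fill`."""
    return [
        [pixel if keep else fill for pixel, keep in zip(row, mask_row)]
        for row, mask_row in zip(image, subject_mask)
    ]


# Placeholder 2x3 "photo": a brown cup with a blue teapot behind it.
CUP = (120, 80, 40)
TEAPOT = (30, 30, 200)
photo = [
    [TEAPOT, TEAPOT, CUP],
    [TEAPOT, CUP, CUP],
]
# True marks the pixels that belong to the cup.
mask = [
    [False, False, True],
    [False, True, True],
]

clean = flatten_background(photo, mask)
```

With the teapot pixels replaced by a flat color, there is nothing left in the background for the model to drag along.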
A second trick is to make the picture and the driving video as similar as possible. Remember that FOMM is a motion transfer model. So somewhere hidden behind the app is a video of a person singing. The motions of that person are being transferred to our picture. If the picture and the video look alike, the results can be amazing. It could make a big difference if we could somehow make our cup of coffee look similar to the driving video. This is hard to do in Wombo because you don't know which video they are using. But if your result fails to capture the dynamic movements of the artist, you can always cheat. When I was experimenting with FOMM, it often happened that the model would not catch the mouth in my picture, resulting in a horribly distorted jaw movement. A simple trick you can use is to place the mouth from the video onto the picture. This can help the model realize where the mouth is.
In the app I selected the song “Tell Me You Know” by Good Kid. I don't know if they are using a video of Good Kid to drive the movement, but we can try using Good Kid's own mouth to make our cup sing. Let's take the mouth of one of the band members and place it on our cup.
In the video we can see some distortions below the mustache. This is probably the motion transfer of the mouth movements. Let's place a mouth there on the cup using a simple Photoshop tool.
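I did this by hand in Photoshop, but the same step can be sketched in a few lines of Python. This is only an illustration under some assumptions: images are plain lists of RGB pixel rows, the `paste_patch` helper and the position are made up, and the paste has hard edges, where a photo editor lets you blend them.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def paste_patch(image: Image, patch: Image, top: int, left: int) -> Image:
    """Return a copy of `image` with `patch` pasted so its top-left corner is at (top, left).

    Patch pixels that would fall outside the image are skipped.
    """
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]  # copy rows so the original stays untouched
    for dy, patch_row in enumerate(patch):
        for dx, pixel in enumerate(patch_row):
            y, x = top + dy, left + dx
            if 0 <= y < height and 0 <= x < width:
                result[y][x] = pixel
    return result


# Placeholder data: a 4x4 plain brown "cup" and a 1x2 red "mouth".
CUP = (120, 80, 40)
MOUTH = (200, 40, 40)
cup_image = [[CUP] * 4 for _ in range(4)]
mouth_patch = [[MOUTH, MOUTH]]

# Paste the mouth on the third row, in the middle two columns.
cup_with_mouth = paste_patch(cup_image, mouth_patch, top=2, left=1)
```

Where exactly the mouth should go is a matter of trial and error. I placed mine where the distortions showed up in the first video.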
Now he is ready to sing. We generate a new video in the Wombo app using the same song and, as a finishing touch, I used iMovie to place the video back in the photo of the stage I created earlier. So now it looks like he's singing on stage.
Now let's see if our cup will finally sing...
Seeing the result I realize that I prefer my coffee without teeth, but it did work!
So the second trick is: if the model doesn’t catch a facial feature, you can replace it or make it more prominent in your input image. And now you know: Wombo is not just a lip sync app, it's a motion transfer app, and you can use it to make anything, even your coffee, dance to any tune.