Neural Head Avatars: What They Are and How They Reach Megapixel Resolution
The authors presented a variety of novel neural architectures and training strategies to reach the required levels of rendered image quality and generalization to novel views and motion, using medium-resolution video data as well as high-resolution image data.
They also demonstrated how to distill a trained high-resolution neural head avatar into a lightweight student model.
This student runs in real time and locks avatar identities to hundreds of pre-defined source images.
Many practical applications of neural head avatar systems demand exactly this combination: real-time operation and an identity lock.
The team at the Samsung AI Center presents megapixel portraits, or MegaPortraits for short. This technique allows high-resolution human head avatars to be created quickly from a single photograph.
The model combines the appearance of the source frame with the motion of the driving frame, including head pose and facial expression, to generate an output image.
Because the source and driver frames come from the same video, the model's prediction is trained to match the driver frame.
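This training setup can be sketched as a simple sampling routine (the function and variable names below are illustrative, not from the paper's code):

```python
import random

def sample_training_pair(video_frames):
    """Pick a random source/driver pair from the same video.

    Because both frames come from one video, the driver frame itself
    serves as the ground truth for the model's prediction.
    """
    source = random.choice(video_frames)
    driver = random.choice(video_frames)
    return source, driver

# Toy usage: frames represented by their indices.
frames = list(range(100))
src, drv = sample_training_pair(frames)
```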
Neural head avatars are a novel, subject-specific representation of articulated human heads. They explicitly reconstruct the full head geometry and produce photorealistic results even under significant viewpoint changes.
Neural head avatars offer a new and fascinating way of creating virtual head models. They sidestep the difficulty of accurate physics-based modeling of human avatars by learning shape and appearance directly from videos of talking people.
This allows them to create more lifelike characters. Over the last several years, approaches have been developed that can generate realistic avatars from a single image; these methods are referred to as "one-shot."
In the one-shot mode, general knowledge about human appearance is used to create the avatars. This knowledge is acquired through substantial pretraining on massive datasets containing footage of many distinct individuals.
In the first training stage, two frames, a source and a driver, were sampled randomly from a training video.
An appearance encoder (Eapp) takes the source frame and encodes it into a set of volumetric features and a global descriptor.
A 3D convolutional network, G3D, processes these features before combining them with motion data to produce a 4D warping field.
A latent descriptor was used to represent the expression in place of explicit keypoints.
A 2D convolutional network then decodes the resulting 2D feature map into the output image.
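The pipeline above can be sketched purely in terms of tensor shapes. All sizes below are illustrative placeholders, not the paper's actual dimensions, and the networks themselves are stubbed out with random arrays:

```python
import numpy as np

rng = np.random.default_rng(0)

# Appearance encoder E_app: source image -> volumetric features + global descriptor.
source = rng.standard_normal((3, 256, 256))          # RGB source frame
vol_feats = rng.standard_normal((64, 16, 64, 64))    # C x D x H x W volumetric features
global_desc = rng.standard_normal((512,))            # global appearance descriptor

# Motion (latent expression descriptor + head pose) drives a warping field
# predicted by the 3D conv network G_3D; here just a placeholder offset grid.
warp_field = rng.standard_normal((3, 16, 64, 64)) * 0.01  # xyz offsets per voxel

# The warped volume is flattened along depth into a 2D feature map,
# which the 2D conv decoder turns into the output image.
warped = vol_feats  # (the warp itself is omitted in this sketch)
feat_2d = warped.reshape(64 * 16, 64, 64)            # collapse depth into channels
output = rng.standard_normal((3, 256, 256))          # decoded image (placeholder)
```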
Several loss functions were used for training; they fall into two groups.
In the second training stage, the primary neural head avatar model was frozen, and an image-to-image translation network was trained to map input images with a resolution of 512x512 to an enhanced version at 1024x1024.
The model was trained on high-resolution pictures, each treated as a separate identity.
This means that cross-driving source-driver pairs, as in the first training stage, cannot be formed.
Two sets of loss functions were used when training the high-resolution model. The first group contains the main super-resolution objectives.
The second group is unsupervised and was used to ensure that the model performs well on cross-driving images.
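At its core, this second stage is a 512x512 to 1024x1024 image-to-image mapping. A trivial stand-in for the learned enhancer is nearest-neighbor upsampling; the real network instead adds plausible high-frequency detail:

```python
import numpy as np

def naive_upsample_2x(img):
    """Nearest-neighbor 2x upsampling: (H, W, C) -> (2H, 2W, C).

    A learned enhancer replaces this with a network that also
    hallucinates high-frequency detail rather than copying pixels.
    """
    return img.repeat(2, axis=0).repeat(2, axis=1)

lowres = np.zeros((512, 512, 3), dtype=np.float32)
highres = naive_upsample_2x(lowres)
print(highres.shape)  # (1024, 1024, 3)
```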
Finally, a small appearance-conditioned image-to-image translation network, called the student, was created from the one-shot model.
The student was trained to match the predictions of the full model, consisting of the base model and the enhancer, which acts as the teacher.
The student is trained only in cross-driving mode, with the teacher model providing a pseudo ground truth.
Since the student network is trained on only a small set of avatars, it receives an index that selects the desired avatar from that set.
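The identity lock can be sketched as a table lookup: the student stores one appearance embedding per pre-defined avatar and selects it by index. Class name, embedding size, and avatar count below are illustrative assumptions:

```python
import numpy as np

class AvatarBank:
    """Fixed bank of per-avatar appearance embeddings."""

    def __init__(self, num_avatars, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.embeddings = rng.standard_normal((num_avatars, dim))

    def select(self, index):
        # The index acts as the identity lock: only avatars
        # present in the bank can be rendered by the student.
        return self.embeddings[index]

bank = AvatarBank(num_avatars=100, dim=128)
emb = bank.select(42)
```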
Face-Vid2Vid is a recent self-reenactment method in which the source and driving images share the same identity.
Its main features are a volumetric representation of the avatar's appearance and an explicit head motion model based on 3D keypoints learned without supervision.
The base model here uses a similar volumetric encoding of the appearance, but the facial motion is encoded implicitly, which improves cross-reenactment.
The First Order Motion Model (FOMM) represents motion with 2D keypoints and is another strong baseline for self-reenactment.
These keypoints, as in Face-Vid2Vid, were learned without supervision. However, this method does not produce realistic images in the cross-reenactment setting.
The HeadGAN system was also evaluated; it represents motion with the expression coefficients of a 3D morphable model.
A pre-trained dense 3D keypoint regressor is used to estimate these coefficients. This approach does an excellent job of separating motion data from appearance, but it limits the space of representable motions.
Since no high-resolution data were available for self-reenactment, high-resolution synthesis could only be evaluated in cross-reenactment mode. Separate parts of a filtered dataset were used for training and testing.
The baseline super-resolution techniques were trained using two randomly augmented copies of the training picture as the source and the driver, with the output of a pre-trained base model Gbase as input.
Only random crops and rotations were used, because stronger augmentations could alter attributes such as a person's head size.
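A minimal version of this crop-based augmentation, with rotation omitted for brevity (function names are illustrative):

```python
import numpy as np

def random_crop(img, crop_size, rng):
    """Take a random square crop of an (H, W, C) image."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    return img[top:top + crop_size, left:left + crop_size]

rng = np.random.default_rng(0)
image = np.zeros((1024, 1024, 3))

# Two independently augmented copies of the same image act as
# the source and the driver for the super-resolution baselines.
source = random_crop(image, 512, rng)
driver = random_crop(image, 512, rng)
```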
In the quantitative comparison, picture quality was measured with an additional image quality assessment metric.
By adding high-frequency detail without distorting the original picture, this method outperformed its competitors both qualitatively and quantitatively.
The distilled architecture runs at 130 frames per second on an NVIDIA RTX 3090 graphics card in FP16 mode.
The full student model, with all 100 avatars, takes up 800 MB. This method performs roughly on par with the teacher model.
Across all avatars, it achieves an average Peak Signal-to-Noise Ratio (PSNR) of 23.14 and a Learned Perceptual Image Patch Similarity (LPIPS) of 0.208, close to the teacher model's scores.
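PSNR, one of the two reported metrics, can be computed directly from pixel values; LPIPS, by contrast, requires a pretrained perceptual network and is omitted here:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB between two images in [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return 10 * np.log10(max_val ** 2 / mse)

a = np.full((64, 64, 3), 0.5)
b = np.full((64, 64, 3), 0.6)  # uniform 0.1 error -> MSE = 0.01
print(round(psnr(a, b), 2))  # 20.0
```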
The recent success of neural implicit scene representations for 3D reconstruction has sparked several efforts to build 4D head avatars.
An alternative to talking-head synthesis is direct video generation using convolutional neural networks conditioned on appearance and motion descriptors.
The method can add motion from an arbitrary video sequence to the look of a single image while keeping the megapixel resolution.
Most of these works use explicit motion representations, such as keypoints or blendshapes, but some use a latent motion parameterization.
If motion is disentangled from appearance during training, the learned motion representation becomes more expressive.
The current video datasets, which contain videos with a maximum resolution of 512x512, limit how detailed talking head models can be.
This limitation also makes it hard to apply traditional high-quality image and video synthesis methods to improve the outputs of models trained on existing datasets.
The problem could also be framed as a single-image super-resolution challenge.
However, the output quality of a one-shot talking head model varies strongly with the motion, so typical single-image super-resolution approaches do not work well.
These classical methods rely on supervised training with known ground truth, which is impossible for novel motion data because only one picture per person is available.
Best Cartoon Avatar Maker Apps for Android to Try
- Cartoon Avatar Photo Maker
- Star Idol
- Avatar Maker
It offers many choices for faces, eyes, mouths, noses, hairstyles, and everything else you need to make your avatar look like you.
The site is easy to use, loads all options quickly, and is free.
5 Steps to Making Your Photo Avatar (iPhone & Android)
- Get an avatar generator app. There are several avatar creator applications available on the App Store.
- Make your own cartoon avatar.
- Personalize every aspect of your avatar.
- Add a new cartoon avatar or remove an existing one.
- Discover More Laughter with Photobooth.
On iOS, open Messages, tap Edit, then Name and Photo. Tap your avatar circle, select Edit, and finally tap the plus symbol beneath Memoji.
If you aren't already sharing your name and photo with contacts, choose Name and Photo, tap the three dots, and then tap the plus symbol under Memoji.
Based on the characteristics of the training set, two primary limitations of the system were noted.
First, the VoxCeleb2 and FFHQ datasets used for training mostly contain near-frontal views, which lowers rendering quality for strongly non-frontal head poses.
Second, since high resolution is learned only from static images, the results exhibit a certain amount of temporal flicker.
Ideally, this should be addressed with dedicated losses or architectural choices.
Finally, shoulder movement is not modeled by this method. Further study is needed to address these problems.