April 02, 2024

3D head avatars built with neural implicit volumetric representations have achieved unprecedented levels of photorealism. However, the computational cost of these methods remains a significant barrier to their widespread adoption, particularly in real-time applications such as virtual reality and teleconferencing. While attempts have been made to develop fast neural rendering approaches for static scenes, these methods cannot simply be applied to dynamic facial performances with realistic expressions. To address these challenges, we propose a novel fast 3D neural implicit head avatar model that achieves real-time rendering while maintaining fine-grained controllability and high rendering quality. Our key idea lies in the introduction of local hash table blendshapes, which are learned and attached to the vertices of an underlying face parametric model. These per-vertex hash tables are linearly merged with weights predicted via a CNN, resulting in expression-dependent embeddings. Our novel representation enables efficient density and color predictions using a lightweight MLP, which is further accelerated by a hierarchical nearest neighbor search method. Extensive experiments show that our approach runs in real time while achieving rendering quality comparable to the state of the art and decent results on challenging expressions.

The demand for high-performing photo-realistic human avatars has dramatically increased with emerging VR/AR applications, e.g., VR gaming [1], [2], virtual assistants [3], tele-presence [4], and 3D videos [5], [6]. Building efficient, high-quality avatars from monocular RGB videos has become a promising direction due to the convenience of monocular data acquisition. While early works mostly adopt surface-based models for convenient controllability, recent methods (e.g., MonoAvatar [7]) leverage a sophisticated pipeline to build human avatars on neural radiance fields, which delivers vivid animations as well as significantly better rendering quality, especially on challenging parts such as hair and glasses. On the downside, these approaches tend to be prohibitively slow, and most of the computation is consumed by neural radiance field inference with large Multilayer Perceptrons (MLPs).

Recently, fast approaches for neural radiance fields (e.g., hash encoding in Instant NGP [8]) have been proposed, which are designed mostly for static scenes or pre-recorded temporal sequences. Despite their great success, it is not straightforward to extend these approaches to human avatars, which require real-time rendering of dynamic facial performances when controlling the avatar. NeRFBlendshape [9] addresses these issues by learning multiple feature hash tables, one for each face blendshape. These hash tables are linearly combined with blending weights to render the target facial expression via hash encoding. However, the expressiveness of the avatar is compromised by the global nature of the blendshapes, which cannot accurately capture vertex-level local deformations. On the other hand, INSTA [10] builds the appearance model in a canonical space using a vanilla Instant NGP [8] with expression codes as an additional MLP input to capture dynamic details, and then transforms it into the target expression via a face parametric model (e.g., a 3D morphable model (3DMM)). However, the lightweight MLP used in the vanilla Instant NGP limits the model capacity, resulting in inferior animation quality, especially on extreme expressions.

In this work, we propose a novel 3D neural avatar system that achieves efficient inference while maintaining fine-grained controllability and high-fidelity quality. To achieve this, we introduce mesh-anchored hash table blendshapes, where we attach multiple small hash tables to each of the 3DMM mesh vertices. These hash tables act as per-vertex “local blendshapes” (i.e., each “blendshape” is controlled by one local hash table) and influence only a local region. The mesh-anchored blendshapes are linearly merged with per-vertex weights predicted by a convolutional neural network in UV space from avatar driving signals, such as expression and head rotation. This results in expression-dependent hash table embeddings, which offer several advantages over a global linear combination of blendshapes. Indeed, by associating hash tables with individual vertices, we enhance the expressiveness of the model, allowing for more localized and nuanced facial expressions. This contrasts with global blendshapes, which apply uniform transformations across the entire face, limiting expressiveness.

In more detail, our model starts from 3D query points, uses hash encoding [8] to gather the merged hash table embeddings from \(k\)-nearest-neighbor vertices around the query point, and predicts the density and color via a small MLP. The hash encoding [8] allows us to use a very lightweight MLP to significantly reduce computation, leading to efficient inference. Additionally, the vertex-attached hash table blendshapes represent a 3DMM-anchored neural radiance field (NeRF), which can be easily controlled by the underlying 3DMM and produce high-fidelity renderings as demonstrated by MonoAvatar [7]. To further accelerate our rendering speed, we propose a hierarchical \(k\)-nearest-neighbor search method.

Our contributions are summarized as follows. We propose a novel approach for high-quality and efficient 3D neural implicit head avatars. At the core of our model, vertex-attached local hash table blendshapes are proposed to support efficient rendering, controllability, and the capture of fine-grained rendering details in dynamic facial performances. We also design a hierarchical querying solution to speed up the \(k\)-nearest-neighbor search when pulling hash table embeddings from neighbor vertices. Extensive experiments on multiple datasets verify that we are able to speed up avatar rendering to real time (i.e., an average of \(\ge 30\) FPS to render a \(512 \times 512\) video) while maintaining rendering quality comparable to the state-of-the-art high-quality 3D avatar [7] and substantially outperforming existing efficient 3D avatars [9]–[11] on challenging expressions.

Constructing photorealistic digital humans has been an extensively researched topic. Here, we focus on discussing prior work on implicit monocular head avatars and efficient rendering. We refer readers to state-of-the-art surveys [12]–[14] for a comprehensive literature review.

**High Quality Head Avatar.** Traditionally, high-quality head avatars have been achieved under expensive equipment configurations, such as camera arrays [4], [15], [16], depth sensors [17], and light stages [5], [18], or have required laborious manual intervention [19]. Recent research efforts have focused on constructing high-quality avatars from monocular RGB videos. One typical class of approaches [20]–[23] uses implicit 3D representations (e.g., neural radiance fields (NeRFs), implicit occupancy fields) to build the head avatar, which are parameterized by Multilayer Perceptrons (MLPs). Although reasonable results are obtained, their rendering quality is still unsatisfactory, especially for more challenging expressions. More recently, Bai et al. [7] proposed a head avatar based on 3DMM-anchored NeRFs with expression-dependent features produced by a convolutional neural network in UV space. Chen et al. [24] designed local deformation fields to capture expression-dependent deformations applied on canonical NeRFs. Despite the impressive results, these methods did not demonstrate real-time rendering capability due to expensive inference using large MLPs. In contrast, with our proposed mesh-anchored hash table blendshapes (Sec. 3), we achieve much faster rendering speed while maintaining high-fidelity results.

**Efficient Neural Radiance Fields.** There has been a plethora of work in recent years attempting to accelerate rendering with neural implicit representations for static objects and scenes. SNeRG [25], DVGO [26] and Plenoxels [27] propose to directly optimize voxel grids of (neural or SH) features for faster performance. However, these approaches still require a large memory footprint to store per-voxel features in 3D space. KiloNeRF [28] dramatically accelerates the original NeRF by representing the scene with thousands of tiny MLPs; however, this approach requires a complex training strategy. TensoRF [29] factorizes the feature grid into compact components, resulting in significantly higher memory efficiency. Concurrently, Instant NGP [8] utilizes multi-resolution hashing for efficient encoding, resulting in high compactness. MobileNeRF [30] proposes to represent NeRF based on polygons, which allows leveraging a traditional polygon pipeline to run in real time on mobile devices. 3D Gaussian Splatting [31] represents a radiance field with 3D Gaussian point clouds, and leverages a point rasterization pipeline to enable fast rendering. These methods, however, cannot be easily extended to controllable dynamic content. More recently, NeRFBlendshape [9] was proposed to handle controllable expressions by learning multiple hash tables for different global blendshapes and linearly combining them with expression codes. INSTA [10] transforms all expressions into a shared 3D canonical space, then adopts the vanilla Instant NGP [8] conditioned on expression codes to model the head avatar. Despite the fast rendering speed achieved, these methods suffer from unsatisfactory rendering quality. PointAvatar [11] utilizes point clouds to represent the head avatar and uses large MLPs to predict the colors and motions of each point, leading to slow inference. Another line of work [32]–[37] uses 2D convolutional neural networks to directly synthesize images (2D neural rendering) from rasterized 3DMM meshes or low-resolution feature maps generated by volumetric rendering. Despite their fast speed, the 2D CNNs may break 3D consistency, leading to temporally unstable results, especially for high-frequency details. In contrast, our method simultaneously achieves controllability, high quality, and efficient rendering with a fully 3D representation.

Given a monocular RGB video, our method learns a neural radiance field (NeRF) based head avatar, which can be rendered under any specified cameras, articulated poses (e.g., neck, jaw, and eyes) and facial expressions defined by a face parametric model (3DMM). We use FLAME [38] as the parametric model in this work, but our method can be generalized to any other mesh-based parametric model. Fig. 1 shows the overview of our method.

Our goal is to design a 3D neural implicit head avatar architecture that can simultaneously achieve high image quality, controllability, and computationally efficient rendering. To achieve this, we propose mesh-anchored hash table blendshapes (Sec. 3.1) as a novel avatar representation that combines the advantages of recent high-quality (e.g., 3DMM-anchored NeRF [7]) and efficient (e.g., hash encoding [8]) frameworks. More specifically, we propose to attach multiple small hash tables to each 3DMM vertex. These vertex-attached hash tables form a set of local “blendshapes”, which are linearly merged with predicted blending weights (Sec. 3.2) and decoded into a 3DMM-anchored NeRF to support fine-grained control and high-fidelity rendering. During NeRF decoding on a query point (Sec. 3.3), we pull the embeddings from the linearly merged hash tables attached to the \(k\)-nearest-neighbor (\(k\)-NN) vertices of the 3DMM mesh. Using hash encoding allows us to use a very lightweight MLP (only 2 hidden layers) to predict the final densities and colors, which is the key to efficient rendering. To further accelerate our approach to real time, we leverage the fact that close query points likely share similar \(k\)-NN vertices, and thus propose to group the query points into voxels and hierarchically search for \(k\)-NN vertices (Sec. 3.4). Finally, our proposed avatar representation can be trained with only monocular RGB videos without any 3D scans or multi-view data (Sec. 3.5).

The core of our model is an avatar representation that can represent a 3DMM-anchored neural radiance field (NeRF) while allowing us to adopt the hash encoding [8] technique for acceleration. Recent approaches adopt hash encoding into head avatars with different avatar representations (e.g., global blendshapes [9] and canonical NeRF [10]). In contrast, our solution is built upon the most recent 3DMM-anchored NeRF [7], which is superior for high-quality renderings as demonstrated in our experiments (Sec. 4.2).

We propose mesh-anchored hash table blendshapes as the new avatar representation. Given a target expression \(i\) with 3DMM pose code \(\boldsymbol{\theta}_i\) and expression code \(\boldsymbol{\psi}_i\), we first get the deformed 3DMM mesh \(\mathbf{V}_i=\mathcal{F}_\mathrm{3DMM}(\boldsymbol{\psi}_i, \boldsymbol{\theta}_i)\) with \(J\) vertices. To the \(j\)-th 3DMM mesh vertex \(\mathbf{v}_{ij}\), we attach \(M\) small hash tables \(\{\mathbf{H}_{j}^{(m)}\}_M\), where each hash table has multiple resolutions following Instant NGP [8]. Intuitively, these hash tables form a set of vertex-level “blendshapes” anchored on the mesh, where each “blendshape” is a hash table whose embeddings encode the information of a local radiance field around vertex \(\mathbf{v}_{ij}\). Given a target expression to render, these hash table blendshapes are linearly summed via expression-dependent weights (Sec. 3.2), such that the merged embeddings encode the fine details specific to the target expression. Simultaneously, the coarse motion of the target expression is captured by the 3DMM vertex movement, which moves the attached hash tables accordingly, and hence the corresponding local radiance field.
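
To make the hash encoding [8] behind each vertex-attached table more concrete, the following minimal sketch looks up one resolution level of a single hash table with trilinear interpolation. It is an illustration under stated assumptions, not the exact implementation: the helper names `hash_index` and `hash_encode`, the normalization of the local coordinates to \([0,1]^3\), and the table size are hypothetical, while the three hashing primes are the ones given for Instant NGP [8].

```python
import numpy as np

# Spatial-hash primes from Instant NGP [8].
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_index(corner, table_size):
    """Map an integer grid corner (3,) to a slot of a table with `table_size` rows."""
    c = np.asarray(corner).astype(np.uint64) * PRIMES
    h = c[..., 0] ^ c[..., 1] ^ c[..., 2]
    return (h % np.uint64(table_size)).astype(np.int64)

def hash_encode(x, table, resolution):
    """Trilinearly interpolate the embeddings of the 8 grid corners around x.

    x:          (3,) local coordinates, assumed normalized to [0, 1]^3
    table:      (T, F) hash table with T slots of F features
    resolution: grid resolution of this (single) level
    """
    pos = np.asarray(x, dtype=np.float64) * resolution
    base = np.floor(pos).astype(np.int64)
    frac = pos - base
    out = np.zeros(table.shape[1])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                offset = np.array([dx, dy, dz])
                # Trilinear weight of this corner.
                w = np.prod(np.where(offset == 1, frac, 1.0 - frac))
                out += w * table[hash_index(base + offset, table.shape[0])]
    return out

rng = np.random.default_rng(0)
table = rng.normal(size=(64, 2))          # hypothetical table: 64 slots, 2 features
embedding = hash_encode([0.3, 0.6, 0.1], table, resolution=4)
```

A multi-resolution encoding would repeat this lookup at several grid resolutions and concatenate the per-level embeddings.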

We obtain per-vertex blending weights by running a convolutional neural network (CNN) on the 3DMM deformation represented in UV atlas space. Specifically, we calculate the vertex displacements with respect to the neutral face \(\mathbf{D}_i = \mathcal{F}_\mathrm{3DMM}(\boldsymbol{\psi}_i, \boldsymbol{\theta}_i) - \mathcal{F}_\mathrm{3DMM}(\boldsymbol{0}, \boldsymbol{0})\). The displacements are then warped into the UV space and fed into a U-Net to predict a weight map in \(\mathbb{R}^{\mathrm{H_t} \times \mathrm{W_t} \times M}\), where \(\mathrm{H_t} \times \mathrm{W_t}\) is the UV resolution, and \(M\) is the number of hash table blendshapes on each vertex (set to 5 in our experiments). The weight map is then sampled back to the 3DMM vertices, serving as the expression-dependent weights \(\{w_{ij}^{(m)}\}_M\) used to take a weighted sum of the embeddings in the hash tables on each vertex, which produces the merged hash tables \[\hat{\mathbf{H}}_{ij} = \sum_{m=1}^M w_{ij}^{(m)} \mathbf{H}_{j}^{(m)}.\] The U-Net also produces a UV feature map. We sample a per-vertex feature \(\mathbf{f}_{ij}\) from it, similar to MonoAvatar [7]. We empirically found that this benefits the geometry quality. The mesh-anchored hash tables \(\hat{\mathbf{H}}_{ij}\) and features \(\mathbf{f}_{ij}\) are decoded into a neural radiance field as described in Sec. 3.3.
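
The merging step \(\hat{\mathbf{H}}_{ij} = \sum_{m} w_{ij}^{(m)} \mathbf{H}_{j}^{(m)}\) is a per-vertex linear combination and can be sketched in a few lines of NumPy. The array layout below (all resolution levels of a table flattened into \(T\) slots of \(F\) features) and the function name `merge_hash_tables` are assumptions made for illustration; the weights are random stand-ins for the values sampled from the U-Net weight map.

```python
import numpy as np

def merge_hash_tables(tables, weights):
    """Blend the M hash tables of every vertex into one merged table.

    tables:  (J, M, T, F) learned embeddings H_j^(m)
    weights: (J, M) expression-dependent weights w_ij^(m) for one expression i
    returns: (J, T, F) merged tables H_hat_ij
    """
    return np.einsum("jm,jmtf->jtf", weights, tables)

rng = np.random.default_rng(0)
J, M, T, F = 4, 5, 16, 2                  # toy sizes; M = 5 as in the text
tables = rng.normal(size=(J, M, T, F))
weights = rng.normal(size=(J, M))         # stand-in for sampled U-Net outputs
merged = merge_hash_tables(tables, weights)
```

Because only the small per-vertex weights change with the expression, the learned tables themselves stay fixed at inference time.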

Given the mesh-anchored hash tables \(\hat{\mathbf{H}}_{ij}\) and features \(\mathbf{f}_{ij}\) described in Sec. 3.2, the final step is to decode them into a Neural Radiance Field (NeRF) to render the output image as shown in Fig. 2. The key idea is to associate a query point with its neighbor vertices, and pull the embeddings from the attached hash tables via hash encoding [8]. Finally, we decode these pulled embeddings and the nearest per-vertex feature into the color and the density of this query point, followed by volumetric rendering to obtain the output image.

For a 3D query point \(\mathbf{q}\) when rendering a particular facial expression \(i\), we first obtain its \(k\)-nearest-neighbors, denoted as \(\{\mathbf{v}_{ik}\}_{k\in \mathcal{N}_{\mathbf{q}}^K}\), from the 3DMM vertices, with \(k^{\ast}\) denoting the nearest vertex index. For each neighbor vertex \(\mathbf{v}_{ik}\) with an attached hash table \(\hat{\mathbf{H}}_{ik}\), we denote \(\mathbf{q}_{ik}\) as the coordinates of \(\mathbf{q}\) in the tangent space of \(\mathbf{v}_{ik}\). We then use \(\mathbf{q}_{ik}\) to query the hash table \(\hat{\mathbf{H}}_{ik}\) using a hash encoding function \(\mathcal{H}(\cdot)\) and obtain the embedding \(\mathbf{h}_{ik} = \mathcal{H}(\mathbf{q}_{ik}; \hat{\mathbf{H}}_{ik})\). To interpolate the embeddings from all \(k\)-nearest-neighbors, we take a weighted sum whose weights are the normalized inverse distances \(z_k= 1 / \| \mathbf{q}_{ik} \|_2\). Next, the summed embedding, together with the nearest per-vertex feature \(\mathbf{f}_{ik^{\ast}}\) and the query point tangent coordinate \(\mathbf{q}_{ik^{\ast}}\) of the nearest vertex, is fed into a two-hidden-layer MLP to predict the density and color as \[\begin{gather} \overline{\mathbf{h}}_{i} = \sum_{k\in \mathcal{N}_{\mathbf{q}}^K} \overline{w}_{k} \mathbf{h}_{ik}, \quad \text{where}~\overline{w}_{k} = \frac{z_k}{\sum_{{k^\prime} \in \mathcal{N}_{\mathbf{q}}^K} z_{k^\prime}},\\ \Bigl[\mathbf{c}_i (\mathbf{q}, \mathbf{d}), \sigma_i (\mathbf{q})\Bigr] = \mathcal{F}_{\mathrm{MLP}} \left(\overline{\mathbf{h}}_i, \mathbf{f}_{ik^{\ast}}, \mathrm{PE}(\mathbf{q}_{ik^{\ast}}), \mathrm{PE}(\mathbf{d}) \right), \nonumber \end{gather}\] where \(\overline{w}_k\) is the normalized inverse-distance based weight, \(\mathbf{d}\) denotes the camera view direction, \(\mathrm{PE}(\cdot)\) denotes positional encoding, \(\mathbf{c}_i\) denotes color, and \(\sigma_i\) denotes density.
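
As a small worked example of the interpolation above, the sketch below computes the normalized inverse-distance weights \(\overline{w}_k\) and the summed embedding \(\overline{\mathbf{h}}_i\) for one query point. The function name `interpolate_embeddings` and the small `eps` guard against division by zero are assumptions added for illustration.

```python
import numpy as np

def interpolate_embeddings(q_local, h, eps=1e-8):
    """Blend the embeddings pulled from the K neighbor vertices.

    q_local: (K, 3) query coordinates q_ik in each neighbor's tangent space
    h:       (K, F) embeddings h_ik pulled from the merged hash tables
    returns: ((F,) blended embedding, (K,) normalized weights)
    """
    z = 1.0 / np.maximum(np.linalg.norm(q_local, axis=-1), eps)  # z_k = 1 / ||q_ik||
    w = z / z.sum()                                              # normalized weights
    return w @ h, w

q_local = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 4.0]])
h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
h_bar, w = interpolate_embeddings(q_local, h)
# Distances 1, 2, 4 give z = (1, 1/2, 1/4) and weights (4/7, 2/7, 1/7).
```

Closer vertices therefore dominate the blended embedding, and the weights always sum to one.
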
Finally, we render the output pixel with the given camera ray \(\mathbf{r}\) by volumetric rendering, where we reparameterize the query point with samples on the ray \(\mathbf{q}= \mathbf{r}(t) = \mathbf{o} + t \mathbf{d}\): \[\begin{align} \mathbf{C}_i(\mathbf{r}) &= \int_{t_n}^{t_f} T(t) \sigma_i(\mathbf{r}(t)) \mathbf{c}_i(\mathbf{r}(t), \mathbf{d}) dt, \nonumber \\ \mathrm{where}~T(t) &= \mathrm{exp} \left( - \int_{t_n}^{t} \sigma_i(\mathbf{r}(s)) ds \right). \end{align}\] Following prior works [7], [39], we also introduce a per-frame error-correction warping field during training to reduce misalignments due to the noise in 3DMM tracking and unmodeled per-frame contents such as hair movements. We feed the query point \(\mathbf{q}\), together with a per-frame latent code \(\mathbf{e}_i\), into an MLP \(\mathcal{F}_{\mathcal{E}}(\cdot)\) to obtain a rigid transformation applied on the original query point, denoted as \(\mathbf{q}^{\prime} = \mathcal{T}_i(\mathbf{q}) = \mathcal{F}_{\mathcal{E}}(\mathbf{q}, \mathbf{e}_i)\). The warped query point \(\mathbf{q}^{\prime}\) is then used to compute the density and color for volumetric rendering. Since the warping fields are overfit to corresponding training frames, we disable the warping field during testing similar to previous works [7], [39], and hence \(\mathcal{F}_{\mathcal{E}}\) does not affect rendering efficiency.
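
In practice, the rendering integral \(\mathbf{C}_i(\mathbf{r})\) is evaluated numerically. The sketch below uses the standard NeRF quadrature rule (per-sample opacity \(\alpha = 1 - \exp(-\sigma \delta)\) and accumulated transmittance); the text does not state which discretization is used, so this choice, as well as the function name `render_ray`, should be read as an assumption.

```python
import numpy as np

def render_ray(sigma, color, t):
    """Approximate the volume rendering integral along one ray.

    sigma: (N,) densities at the N samples
    color: (N, 3) colors at the N samples
    t:     (N + 1,) interval boundaries between t_n and t_f
    returns: ((3,) pixel color, (N,) per-sample weights)
    """
    delta = np.diff(t)                                  # interval lengths
    alpha = 1.0 - np.exp(-sigma * delta)                # per-interval opacity
    # Transmittance T before each sample: product of (1 - alpha) of earlier samples.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha
    return weights @ color, weights

t = np.linspace(0.0, 2.0, 9)                            # 8 intervals on [0, 2]
sigma = np.ones(8)
color = np.tile(np.array([1.0, 0.5, 0.0]), (8, 1))
pixel, weights = render_ray(sigma, color, t)
```

For a constant density of 1 over a ray segment of length 2, the weights sum to \(1 - e^{-2}\), which matches the closed-form opacity of the continuous integral.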

As described in Sec. 3.3, our method involves a \(k\)-nearest-neighbor (\(k\)-NN) search, which is computationally expensive and cannot be naively accelerated with pre-calculated structures (e.g., a KD-tree) due to the dynamically changing search pool (i.e., the 3DMM vertices driven by poses and expressions). To speed up the process, we propose a hierarchical \(k\)-NN search algorithm following a coarse-to-fine strategy. The key idea is to group nearby query points into a cluster, as they likely share similar nearest neighbors. Specifically, we use a 3D grid with resolution 64 and treat all query points that fall in each voxel as a cluster. For each cluster, we first search for the \(K^\prime\) (where \(K^\prime>K\)) nearest neighbors of the voxel center among all 3DMM vertices. Then, for each query point, we search for its \(K\) nearest neighbors among the \(K^\prime\) nearest neighbors of the corresponding cluster. In practice, for a 3DMM with \(J=1772\) vertices, we set \(K^\prime=12\) and \(K=3\). Our experiments empirically show that, with a proper grid resolution, this design significantly improves the nearest neighbor search speed and does not introduce noticeable rendering artifacts, even though the \(k\)-NNs may not be exact for some of the query points.
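
The coarse-to-fine search can be sketched as follows. This is a plain NumPy illustration of the two stages described above, not the optimized implementation: the function name `hierarchical_knn`, the axis-aligned bounding box \([-1, 1]^3\) of the grid, and the brute-force distance computations are assumptions.

```python
import numpy as np

def hierarchical_knn(queries, vertices, grid_res=64, k_coarse=12, k_fine=3,
                     bbox_min=-1.0, bbox_max=1.0):
    """Approximate K-NN vertex indices, shape (Q, k_fine), for each query point."""
    voxel = (bbox_max - bbox_min) / grid_res
    # Cluster query points by the voxel they fall into.
    vox = np.floor((queries - bbox_min) / voxel).astype(np.int64)
    vox = np.clip(vox, 0, grid_res - 1)
    uniq, inverse = np.unique(vox, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)                       # query -> cluster id
    centers = bbox_min + (uniq + 0.5) * voxel           # (C, 3) voxel centers
    # Coarse stage: K' nearest vertices of every occupied voxel center.
    d_coarse = np.linalg.norm(centers[:, None] - vertices[None], axis=-1)
    cand = np.argpartition(d_coarse, k_coarse - 1, axis=1)[:, :k_coarse]
    # Fine stage: K nearest among the K' candidates of the query's own voxel.
    cand_q = cand[inverse]                              # (Q, K')
    d_fine = np.linalg.norm(queries[:, None] - vertices[cand_q], axis=-1)
    order = np.argsort(d_fine, axis=1)[:, :k_fine]
    return np.take_along_axis(cand_q, order, axis=1)

rng = np.random.default_rng(0)
vertices = rng.uniform(-1.0, 1.0, size=(50, 3))         # stand-in for 3DMM vertices
queries = rng.uniform(-1.0, 1.0, size=(200, 3))
neighbors = hierarchical_knn(queries, vertices)
```

The coarse stage runs once per occupied voxel instead of once per query point, so each query only compares against \(K^\prime\) candidates rather than all \(J\) vertices; the result can differ from the exact \(k\)-NN when a true neighbor is missing from the voxel's candidate list.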

Only monocular RGB videos are required to train our model. Three losses are used during the training process: (1) a photometric loss that minimizes the \(l_2\)-norm distance between the rendered and ground truth pixel colors over all camera rays \(\mathbf{r}\) from all training frames \(i\). Formally, we have \(\mathcal{L}_\mathrm{rgb} = \sum_i \sum_{\mathbf{r}} \| \mathbf{C}_i(\mathbf{r}) - \mathbf{I}_i(\mathbf{r}) \|_2\); (2) an elastic regularization loss \(\mathcal{L}_\mathrm{elastic}\) applied on the learned error-correction warping field \(\mathcal{T}(\mathbf{q})\), which is introduced in Nerfies [40]; (3) a magnitude regularization loss to encourage small warping fields, which is defined as \(\mathcal{L}_\mathrm{mag} = \sum_{\mathbf{q}} \| \mathbf{q}- \mathcal{T}(\mathbf{q}) \|_2^2\). Finally, we combine all three loss terms: \[\begin{align} \mathcal{L} = \mathcal{L}_\mathrm{rgb} + \lambda_\mathrm{elastic} \mathcal{L}_\mathrm{elastic} + \lambda_\mathrm{mag} \mathcal{L}_\mathrm{mag}, \end{align}\] where we set \(\lambda_\mathrm{elastic} = 10^{-4}\) at the beginning of the training and decay it to \(10^{-5}\) after \(150\)k iterations, and set \(\lambda_\mathrm{mag} = 10^{-2}\). To warm start training, we replace the \(l_2\)-norm distance in the photometric loss \(\mathcal{L}_\mathrm{rgb}\) with the \(l_2\) distance for the first \(10\)k iterations. Please refer to the supplementary for more details.
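
The combination of the loss terms can be sketched as below. The exact shape of the \(\lambda_\mathrm{elastic}\) decay is not specified in the text, so the hard switch at 150k iterations is an assumption, as are the function names; the elastic term itself is taken as a precomputed scalar.

```python
import numpy as np

def photometric_loss(pred, gt):
    """Sum over rays of the l2-norm between rendered and ground truth colors."""
    return np.linalg.norm(pred - gt, axis=-1).sum()

def magnitude_loss(q, q_warped):
    """Sum of squared displacements introduced by the warping field."""
    return np.sum((q - q_warped) ** 2)

def lambda_elastic(step, start=1e-4, end=1e-5, decay_step=150_000):
    # Assumption: a hard switch; the text only says "decay ... after 150k iterations".
    return start if step < decay_step else end

def total_loss(l_rgb, l_elastic, l_mag, step, lambda_mag=1e-2):
    """L = L_rgb + lambda_elastic * L_elastic + lambda_mag * L_mag."""
    return l_rgb + lambda_elastic(step) * l_elastic + lambda_mag * l_mag

early = total_loss(1.0, 100.0, 10.0, step=0)        # 1 + 0.01 + 0.1
late = total_loss(1.0, 100.0, 10.0, step=200_000)   # 1 + 0.001 + 0.1
```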

The idea is to attach a local 3D feature grid to each vertex of the FLAME mesh, and efficiently represent the 3D feature grid with multi-level hash tables as in NGP [8]. For convenience, we denote these "hash table represented grids" as "NGP grids". As a result, we obtain a controllable version of NGP: by deforming the FLAME mesh with pose and expression codes, these NGP grids move together with their associated vertices, making the decoded radiance field controllable by the given pose and expression codes. However, merely moving the NGP grids with FLAME motions can only model coarse-level deformations of poses and expressions and fails to capture fine-grained details such as wrinkles, since the detailed contents inside each NGP grid are fixed. To this end, we further condition the NGP grids on poses and expressions to better model detailed pose and expression variations.

To combine the advantages of both preliminaries (Sec. 3.6) (i.e., high rendering quality, controllability, and fast rendering speed), we propose the Expression Dependent Local NGP. The core idea is to locally attach a small NGP hash table to each FLAME vertex to encode the neural radiance field anchored on the vertex. These NGP-encoded NeRFs can thus be deformed according to the motion of the FLAME mesh driven by the pose and expression codes, inheriting the controllability of the Parametric Model Driven Avatar [41]. To capture details (e.g., expression-dependent wrinkles) that cannot be modeled by the coarse FLAME mesh motions, we further condition the hash table features on the FLAME poses and expressions, by attaching multiple NGP hash tables to each vertex and linearly blending them with weights predicted by a CNN in UV space. The CNN inherited from Bai et al. [41] brings in the high rendering quality, while the NGP-encoded NeRFs enable fast rendering speed.

To better introduce our method, we use the Parametric Model Driven Avatar proposed by Bai et al. [41] as a starting point, which is a 3DMM-anchored neural radiance field controlled via the FLAME poses and expressions, and adapt two main components of their pipeline: *Predicting expression-dependent local features*, and *Neural Radiance Field Decoding*.

**Predicting expression-dependent local features.** Given specified FLAME pose codes \(\boldsymbol{\theta}_i\) and expression codes \(\boldsymbol{\psi}_i\), the deformed FLAME mesh is computed as \(\boldsymbol{V}_i=\mathcal{F}_\mathrm{3DMM}(\boldsymbol{\psi}_i, \boldsymbol{\theta}_i)\), where \(i\) denotes the frame index and the shape code is omitted for brevity. To obtain the expression-dependent local features, we first compute the FLAME vertex displacements \(\boldsymbol{D}_i\) between the deformed mesh \(\boldsymbol{V}_i\) and the neutral mesh \(\boldsymbol{V}_\mathrm{neutral} = \mathcal{F}_\mathrm{3DMM}(\boldsymbol{0}, \boldsymbol{0})\) as \(\boldsymbol{D}_i = \boldsymbol{V}_i - \boldsymbol{V}_\mathrm{neutral}\). The displacements are then rasterized into UV space and fed into a U-Net to obtain a feature map. The feature map is then sampled back to the FLAME vertices, serving as the expression-dependent local features \(\{\boldsymbol{z}_i^j\}\) attached to each FLAME vertex, with \(j\) denoting the vertex index.

**Neural Radiance Field Decoding.** Given the deformed FLAME mesh \(\boldsymbol{V}_i\) with attached local features \(\{\boldsymbol{z}_i^j\}\), we can decode them into a radiance field and obtain the output image with volumetric rendering. For a \(3\)D query point,

In this section, we first introduce the data and metrics used for training and evaluation (Sec. 4.1). Then, we show that our avatar model achieves real-time rendering speed, while producing better rendering quality on challenging expressions than recent efficient avatars [9]–[11] and remaining comparable to previous high-quality approaches [7] (Sec. 4.2). Finally, we provide ablation studies to justify the design choices and hyper-parameters of our avatar representation, and demonstrate the rendering speed improvements contributed by each of our newly proposed algorithmic components.

We use monocular RGB videos of multiple subjects (i.e., one video per subject) to train and evaluate our method, and compare to prior state-of-the-art (SOTA) approaches. Our dataset consists of \(10\) videos in total, which are a mix of videos captured by us and videos from prior works including PointAvatar [11], INSTA [10], and MonoAvatar [7]. We filter out the background of the videos with off-the-shelf segmentation [42] and matting [43] methods, then crop and resize the videos to a VGA resolution that preserves the original aspect ratio. We compute the camera and 3DMM parameters from the videos following the 3DMM fitting optimization used in INSTA [10]. We reserve a short clip from the end of each video as the testing frames, and use the remaining frames for training.

Following prior work [20], we use PSNR, SSIM (higher is preferable), and LPIPS (lower is preferable) to measure the image quality. As observed by Zhang et al. [44], LPIPS is a more effective metric for judging perceptual quality than PSNR and SSIM. When computing PSNR and SSIM, we weight the mean squared error map and the SSIM map with a foreground mask (eroded and smoothed), in order to focus on non-empty areas and prevent inaccurate foreground segmentation from dominating the metrics.
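
The mask-weighted PSNR can be illustrated with the short sketch below. Normalizing the weighted error by the sum of the mask, the \([0, 1]\) color range, and the function name `masked_psnr` are assumptions made for illustration; the erosion and smoothing of the mask are omitted.

```python
import numpy as np

def masked_psnr(pred, gt, mask, max_val=1.0):
    """PSNR computed from a squared error map weighted by a foreground mask.

    pred, gt: (H, W, 3) images with values in [0, max_val]
    mask:     (H, W) soft foreground mask in [0, 1]
    """
    err = ((pred - gt) ** 2).mean(axis=-1)            # per-pixel squared error
    mse = (mask * err).sum() / mask.sum()             # mask-weighted mean
    return 10.0 * np.log10(max_val ** 2 / mse)

gt = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.1)                        # uniform error of 0.1
mask = np.ones((4, 4))
psnr_full = masked_psnr(pred, gt, mask)               # MSE = 0.01, about 20 dB

# Large errors outside the mask do not affect the metric.
pred_bg, mask_bg = pred.copy(), mask.copy()
pred_bg[0, :] = 1.0
mask_bg[0, :] = 0.0
psnr_masked = masked_psnr(pred_bg, gt, mask_bg)
```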

To evaluate the computational cost, we measure the rendering speed in frames per second (FPS) on an RTX 3090 Ti and compare across different approaches using their available implementations. We also estimate the number of FLOPs (floating-point operations) of all methods as a theoretical measurement of the rendering speed. When estimating FLOPs, we fix the contribution of the ray-marching part to \(16\) points sampled along each camera ray. This simplifies the estimate, since the number of FLOPs spent on ray marching varies across cameras and scenes, and the same setting is applied to all the considered methods.

We compare our method with several prior works, including NeRFBlendshape [9], INSTA [10], PointAvatar [11], and MonoAvatar [7]. NeRFBlendshape [9] and INSTA [10] adopt hash encoding [8] into head avatars, leading to efficient rendering. PointAvatar [11] leverages point clouds to represent the head avatar. MonoAvatar [7] is based on a 3DMM-anchored NeRF, and produces high-quality renderings but is slow. We use the same camera and 3DMM parameters to train and test all methods.

From Tab. 1, ours and INSTA [10] are the only two methods that achieve real-time rendering (i.e., \(\ge 30\) mean FPS). However, INSTA is quantitatively inferior to our method by a large margin, and shows obvious artifacts in Fig. 3, especially for challenging expressions. PointAvatar [11] has the potential to run in real time with an optimized implementation thanks to its point cloud representation, but its renderings are overall blurrier than ours, leading to worse quantitative results. Although NeRFBlendshape [9] gives relatively good numbers in Tab. 1, it produces severe artifacts in dynamically changing regions (e.g., mouth and eyebrows in Fig. 3) for several medium and large expressions and also gives more floaters, resulting in implausible animations. We strongly encourage readers to view the supplementary videos for more comparisons. MonoAvatar [7] gives good rendering quality and animations, but is more than an order of magnitude slower and slightly blurrier on high-frequency details such as forehead wrinkles, presumably because the hash encoding in our model can better capture high-frequency content. Among the compared approaches, our method is the only one that achieves real-time rendering while ranking among the best in image quality.

We also compare the theoretical FLOPs of all methods in Tab. 1, where our method requires the least computation, mostly because of the smaller MLPs we use (i.e., \(2\) hidden layers for ours vs. \(\ge 5\) hidden layers for the others). Note that INSTA [10] is implemented with a highly optimized pure C++ and CUDA codebase, while other methods use Python (TensorFlow/PyTorch) with customized CUDA kernels. This implementation advantage lets INSTA run at a high FPS despite its relatively larger FLOP count.

Method | LPIPS | SSIM | PSNR | Mean FPS | GFLOPs |
---|---|---|---|---|---|
PointAvatar [11] | 0.117 | 0.728 | 21.12 | 5.0 | 933 |
INSTA [10] | 0.149 | 0.758 | 22.12 | 46.2 | 266 |
NeRFBlendshape [9] | 0.110 | 0.793 | 22.77 | 11.2 | 223 |
MonoAvatar [7] | 0.114 | 0.798 | 22.74 | 0.5 | 2385 |
Ours | 0.100 | 0.795 | 22.77 | 35.9 | 113 |

In this section we show the impact of the proposed design choices, in particular proving the importance of mesh-anchored hash table blendshapes and the proposed hierarchical \(k\)-NN search.

We hereby investigate alternative design choices and different hyper-parameters to justify the necessity of our mesh-anchored hash table blendshapes.

**Static Hash + 3DMM Param.** We first build a naive alternative approach to incorporate hash encoding into a 3DMM-anchored NeRF, by attaching a single hash table to each vertex. For a query point, we concatenate the 3DMM pose and expression codes (\(\boldsymbol{\theta}_i, \boldsymbol{\psi}_i\)) with the embedding pulled from the hash tables, and send them into the MLP. Note that there is no convolution running in UV space. As shown in Tab. 2 and Fig. 4, obvious rendering artifacts show up, and the rendering quality metrics drop significantly compared to our full model. This is presumably because the lightweight MLP, which is crucial for good efficiency, does not have enough capacity to process detailed expression-dependent information from compact 3DMM codes.

**Static Hash + UV CNN.** We then increase the model capacity by adding back the UV CNN branch, but use only one hash table per vertex, i.e., a single hash table without the blendshape formulation. As shown in Tab. 2 and Fig. 4, the rendered images show fewer artifacts than in the previous case, but still contain blurry textures and floaters compared to our full model. This demonstrates that the blendshape formulation of mesh-anchored hash tables is necessary in order to obtain good expression-dependent local embeddings, leading to superior rendering quality.

**Number of Hash Table Blendshapes.** Here we investigate how the number of hash table blendshapes per vertex influences the final rendering quality. As shown in Table 2, more blendshapes per vertex lead to higher rendering quality, with gains that saturate as the number of tables increases. In Fig. 4, increasing the number of blendshapes also gives better details, especially for eyelids and ears. Although adding more blendshapes may further improve quality, we use \(5\) blendshapes per vertex as our final setting to keep the model size and computation cost relatively small.

We evaluate our model with and without hierarchical \(k\)-NN search in terms of speed and quality. Comparing rendering speeds (w/: \(35.9\) FPS; w/o: \(26.4\) FPS), we can see that hierarchical \(k\)-NN search improves the frame rate by around \(36\%\), which is crucial for achieving real-time rendering. From Fig. 5, we empirically find that enabling hierarchical \(k\)-NN search does not lead to an observable drop in rendering quality, as long as a proper grid resolution is used (\(64\) in our case).
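As a quick check of the reported speed-up, the two frame rates quoted above can be converted into a relative gain and per-frame times. This is only arithmetic on the numbers in the text; the variable names are illustrative.

```python
# Frame rates reported in the text for a single avatar.
fps_with = 35.9     # hierarchical k-NN search enabled
fps_without = 26.4  # hierarchical k-NN search disabled

# Relative frame-rate gain and the corresponding per-frame times.
relative_gain = fps_with / fps_without - 1.0
ms_with = 1000.0 / fps_with
ms_without = 1000.0 / fps_without

print(f"frame-rate gain: {relative_gain:.1%}")
print(f"frame time: {ms_without:.1f} ms -> {ms_with:.1f} ms")
```

In other words, the search saves roughly 10 ms per frame, which is what moves the model past a 30 FPS budget.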

We also investigate the effect of using different 3D grid resolutions during hierarchical \(k\)-NN search. As shown in Fig. 5, we observe more artifacts around the mouth region when using smaller 3D grid resolutions (\(32\) and \(16\)). Therefore, we use a resolution of \(64\) in our final setting, which is a good trade-off between quality and speed.

Setting | LPIPS | SSIM | PSNR |
---|---|---|---|
Static Hash + 3DMM Param | 0.125 | 0.763 | 21.99 |
1 Blendshape (Static Hash + UV CNN) | 0.115 | 0.785 | 22.52 |
3 Blendshapes | 0.104 | 0.791 | 22.70 |
5 Blendshapes (Ours) | 0.100 | 0.795 | 22.77 |

We present a high-quality 3D neural volumetric head avatar that can be rendered efficiently while requiring only monocular RGB videos for construction. We propose mesh-anchored hash table blendshapes as our avatar representation, which enable significantly faster rendering by utilizing hash encoding and lightweight MLPs, while still maintaining superior controllability to support realistic facial animations and producing vivid expression-dependent details thanks to the local blendshape formulation of the hash tables. The experiments indicate that our approach runs in real time at a \(512 \times 512\) resolution, with rendering quality comparable to the state of the art and better results on challenging expressions than prior efficient approaches.

As a limitation, we observe floaters under camera viewpoints and expressions that are far from the training distribution, which is a common issue in approaches based on Instant NGP [8]. We also notice that performance around the mouth interior tends to be less stable because of the relatively poor tracking of this region in the training data. Fortunately, the fast rendering makes it feasible to adopt more expensive training strategies, such as regularization terms, adversarial losses, or joint face fitting refinement during training, which could potentially mitigate these issues and further improve rendering expressiveness and quality.

**Supplementary Materials**

In this supplementary material, we provide additional method details and more results, including hash encoding details (Sec. 6), warping field details (Sec. 7), network architectures (Sec. 8), training and testing details (Sec. 9), data statistics (Sec. 10), additional experiments (Sec. 11 and the supplementary webpage), and a discussion of limitations (Sec. 12).

As described in Sec. 3.1 and 3.2, we attach local hash table blendshapes to each 3DMM vertex. For each vertex, these blendshapes are linearly blended into a merged hash table using expression-dependent weights predicted by the U-Net running in UV space. The hyper-parameters of the hash tables are shown in Table 3.

Parameters | Values |
---|---|
Number of levels | \(2\) |
Hash table size | \(2^8\) |
Number of feature channels | \(4\) |
Coarsest resolution | \(32\) |
Finest resolution | \(64\) |
Number of blendshapes per-vertex | \(5\) |

Instead of attaching hash tables to all 3DMM vertices (FLAME [38] in our work), we select a subset of vertices to reduce the computation and model size, as well as to ensure a more uniform vertex distribution on the 3DMM surface. More specifically, we subsample the vertices using Poisson-disk sampling from MeshLab [45] on a template mesh without eyeballs, and manually add 10 iris vertices, resulting in \(1772\) vertices in total.
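To give a feel for the resulting model size, the hyper-parameters listed above can be multiplied out into a rough count of hash table entries. This is a back-of-the-envelope sketch, not a figure reported in the paper: it assumes every level stores a full table of \(2^8\) entries with \(4\) channels and that no parameters are shared across vertices or blendshapes.

```python
# Hyper-parameters quoted in the text above.
num_levels = 2
table_size = 2 ** 8          # entries per level
feature_channels = 4
blendshapes_per_vertex = 5
num_vertices = 1772          # subsampled FLAME vertices plus 10 iris vertices

# Assumption: each level holds a full, unshared table.
params_per_table = num_levels * table_size * feature_channels
params_per_vertex = params_per_table * blendshapes_per_vertex
total_params = params_per_vertex * num_vertices

print(params_per_table, params_per_vertex, total_params)
```

Under these assumptions the hash tables hold about 18 million learnable values, most of which are touched only sparsely per query because each point reads from a few nearby vertices.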

As described at the end of Sec.3.3, following prior works on deformable NeRF [7], [39], [46], we overfit 3D warping fields on training frames to alleviate the negative influence of misalignments between tracked 3DMM meshes and images due to the noise in 3DMM tracking and unmodeled per-frame contents such as hair movements. These warping fields are discarded during testing as in [39] and [7] since they are overfit to training frames.

More specifically, we first assign a learnable latent code \(\mathbf{e}_i\) to each training frame \(i\). Given a query point \(\mathbf{q}\), we apply positional encoding to its coordinates and concatenate the result with the latent code \(\mathbf{e}_i\), then feed them into an MLP \(\mathcal{F}_{\mathcal{E}}(\cdot)\) to obtain a rigid transformation consisting of three components: a \(3\)D rotation \(\mathbf{R} \in SO(3)\), a rotation center \(\mathbf{c}^{rot}\), and a \(3\)D translation \(\mathbf{t}\). We then compute the warped query point \(\mathbf{q}^{\prime}\) by applying the rigid transformation to the original query point as \[\begin{aligned} \mathbf{R}, \mathbf{c}^{rot}, \mathbf{t} &= \mathcal{F}_{\mathcal{E}} \left( \mathbf{q}, \mathbf{e}_i \right) \\ \mathbf{q}^{\prime} &= \mathbf{R} \left( \mathbf{q} + \mathbf{c}^{rot} \right) - \mathbf{c}^{rot} + \mathbf{t}. \end{aligned}\] In practice, we represent \(\mathbf{R}\) with a pure log-quaternion and directly regress it with the MLP \(\mathcal{F}_{\mathcal{E}}(\cdot)\). As described in Sec. 3.3, we denote this full warping procedure as \(\mathbf{q}^{\prime} = \mathcal{T}_i(\mathbf{q}) = \mathcal{F}_{\mathcal{E}}(\mathbf{q}, \mathbf{e}_i)\). The warped query point \(\mathbf{q}^{\prime}\) is then used to compute the density and color for volumetric rendering.
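The rigid warp above can be sketched in a few lines of plain Python. The MLP is left out: the function simply takes the three regressed quantities as arguments. The function names, the \((w, x, y, z)\) quaternion ordering, and the exponential-map convention \(\exp(\mathbf{v}) = (\cos\|\mathbf{v}\|, \sin\|\mathbf{v}\| \, \mathbf{v}/\|\mathbf{v}\|)\) are assumptions made for illustration and may differ from the actual implementation.

```python
import math

def log_quaternion_to_matrix(v):
    """Exponentiate a pure log-quaternion v (3 floats) into a 3x3 rotation matrix."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm < 1e-12:
        w, x, y, z = 1.0, 0.0, 0.0, 0.0  # identity rotation
    else:
        s = math.sin(norm) / norm
        w, x, y, z = math.cos(norm), v[0] * s, v[1] * s, v[2] * s
    # Standard unit-quaternion to rotation-matrix conversion.
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def warp_point(q, log_quat, center, translation):
    """Apply q' = R (q + c) - c + t, matching the equation in the text."""
    R = log_quaternion_to_matrix(log_quat)
    shifted = [q[i] + center[i] for i in range(3)]
    rotated = [sum(R[i][j] * shifted[j] for j in range(3)) for i in range(3)]
    return [rotated[i] - center[i] + translation[i] for i in range(3)]

# Example: a log-quaternion of norm pi/4 about z encodes a 90-degree rotation.
warped = warp_point([1.0, 0.0, 0.0], [0.0, 0.0, math.pi / 4],
                    [0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
```

Note that with this convention the rotation angle is twice the norm of the log-quaternion, so a zero vector yields the identity warp, which is a convenient starting point for a field that is enabled partway through training.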

Here we introduce the detailed network architectures of three main components described in Sec.3.2 and 3.3: the U-Net running in UV space, the MLP to predict densities and colors, and a warp field MLP predictor. Please refer to the main paper for details on how these components come together to form our avatar model.

As described in Sec. 3.2, the U-Net running in UV space takes the 3DMM vertex displacements as input and outputs expression-dependent weights (used to compute the weighted sum of hash tables) and per-vertex features. On the encoder side, we use downsampling residual blocks to extract a feature pyramid with \(\{8, 16, 32, 64, 128, 256\}\) channels per level, starting from an input resolution of 128 and downsampling to 64, 32, 16, 8, 4, and 2. On the decoder side, we use upsampling residual blocks (with transposed convolutions) and set the number of output channels for each level to 128, 64, 64, 64, 64, and 64. Finally, we use a \(1 \times 1\) convolution layer to obtain the weights map and the feature map. A leaky ReLU with slope \(0.2\) is applied after each convolutional layer. The input vertex displacement map has a resolution of \(128 \times 128\). The output expression-dependent weights map has a size of \(128\times128\times5\) (\(4\) channels predicted by the network, \(1\) channel set to a constant one) and the output feature map has a size of \(128\times128\times24\). The weights map and the feature map are then sampled back to the 3DMM vertices as described in Sec. 3.2.
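The role of the five weight channels can be illustrated with a small sketch of the per-vertex blending step. It is a simplified stand-in under stated assumptions: hash tables are plain nested lists, the function name `blend_hash_tables` is hypothetical, and the constant weight is assigned to the first table, which the text does not specify.

```python
def blend_hash_tables(tables, predicted_weights):
    """Linearly merge the K hash table blendshapes of one vertex.

    tables: K tables; each table is a list of entries, each entry a list of channels.
    predicted_weights: K - 1 network-predicted weights; one weight is fixed to 1.
    """
    # Assumption: the constant-one channel multiplies the first table.
    weights = [1.0] + list(predicted_weights)
    if len(weights) != len(tables):
        raise ValueError("expected one weight per hash table blendshape")
    num_entries = len(tables[0])
    num_channels = len(tables[0][0])
    merged = [[0.0] * num_channels for _ in range(num_entries)]
    for w, table in zip(weights, tables):
        for e in range(num_entries):
            for c in range(num_channels):
                merged[e][c] += w * table[e][c]
    return merged

# Toy example: 3 blendshapes, 1 entry, 2 channels.
toy_tables = [[[1.0, 2.0]], [[10.0, 20.0]], [[100.0, 200.0]]]
merged = blend_hash_tables(toy_tables, [0.5, -1.0])
```

Fixing one weight to a constant means one table acts like a mean component that is always present, while the remaining tables add expression-dependent offsets.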

As described in Sec. 3.3, for each query point, we use the tiny MLP to decode densities and colors from the summed hash table embedding \(\overline{\mathbf{h}}_{i}\), the nearest per-vertex feature \(\mathbf{f}_{ik^{\ast}}\), the positionally encoded tangent coordinates \(\mathbf{q}_{ik^{\ast}}\) of the query point with respect to the nearest vertex, and the positionally encoded camera view direction. The tiny MLP consists of two hidden layers, each of which is a fully connected layer with 64 neurons and ReLU activation. Please refer to Sec. 3.3 and Fig. 3 for more pipeline details. For the positional encoding, we use \(8\) frequency bands on the query point tangent coordinates, and \(4\) frequency bands on the camera view direction.
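For reference, a minimal sketch of a positional encoding with the band counts given above is shown below. It follows the sin/cos formulation of NeRF [47]; whether the implementation includes the factor of \(\pi\) or also appends the raw coordinates is not stated in the text, so both choices here are assumptions.

```python
import math

def positional_encoding(x, num_bands):
    """For each band l and coordinate v, emit sin(2^l * pi * v) and cos(2^l * pi * v)."""
    out = []
    for l in range(num_bands):
        freq = (2.0 ** l) * math.pi
        for v in x:
            out.append(math.sin(freq * v))
            out.append(math.cos(freq * v))
    return out

# 8 bands on 3D tangent coordinates, 4 bands on the 3D view direction.
tangent_code = positional_encoding([0.1, -0.2, 0.05], num_bands=8)
view_code = positional_encoding([0.0, 0.0, 1.0], num_bands=4)
print(len(tangent_code), len(view_code))
```

Under these assumptions the two encodings contribute 48 and 24 input values to the tiny MLP, respectively.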

The warp field MLP \(\mathcal{F}_{\mathcal{E}}\) described in Sec. 3.3 and Sec. 7 consists of a backbone and three output branches. The backbone contains \(5\) hidden layers, each with \(128\) neurons. We then append three branches, each a \(2\)-layer MLP of width \(128\), to regress the three outputs described in Sec. 7: the pure log-quaternion of the \(3\)D rotation \(\mathbf{R}\) (i.e., an element of SO(3)), the rotation center \(\mathbf{c}^{rot}\), and the \(3\)D translation \(\mathbf{t}\). ReLU activation is used in all layers except the output layers. We adopt the coarse-to-fine positional encoding strategy used in Nerfies [40] on the query point coordinates before feeding them into the MLP \(\mathcal{F}_{\mathcal{E}}\) for better training stability. We start with \(0\) frequency bands and increase to \(6\) linearly after \(80\)k training iterations.
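The coarse-to-fine schedule can be sketched with the cosine easing window from Nerfies [40], where band \(j\) is weighted by \(\frac{1}{2}\left(1 - \cos(\pi \, \mathrm{clamp}(\alpha - j, 0, 1))\right)\). The sketch assumes that \(\alpha\) ramps linearly from \(0\) to \(6\) and reaches \(6\) at iteration \(80\)k; this is one reading of the schedule described above, and the function name `band_weights` is hypothetical.

```python
import math

def band_weights(iteration, num_bands=6, ramp_iters=80_000):
    """Per-band weights of a Nerfies-style coarse-to-fine positional encoding."""
    # Assumption: alpha grows linearly from 0 to num_bands over ramp_iters.
    alpha = num_bands * min(max(iteration / ramp_iters, 0.0), 1.0)
    weights = []
    for j in range(num_bands):
        t = min(max(alpha - j, 0.0), 1.0)
        weights.append(0.5 * (1.0 - math.cos(math.pi * t)))
    return weights

for it in (0, 20_000, 40_000, 80_000):
    print(it, [round(w, 2) for w in band_weights(it)])
```

Each sin/cos pair of band \(j\) is multiplied by its weight before entering the MLP, so early training only sees low frequencies and higher bands fade in smoothly.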

To obtain a consistent 3D world space, we normalize the 3DMM meshes with their neck poses to align the head in 3D space. During training, we use a hierarchical sampling strategy as in [47], with \(32\) coarse and \(32\) fine sample points per ray. During testing, we compute a union occupancy grid over all training expressions and run ray marching only on the valid voxels to achieve efficient rendering. To ensure stable training, we enable the 3D warping fields after 5k iterations. We use the Adam optimizer with \(\beta_{1}=0.9\) and \(\beta_{2}=0.999\). In each mini-batch, we randomly sample \(256\) rays from each of \(8\) images (\(2048\) rays in total) and set the learning rates to: (1) \(10^{-4}\) for the warping field MLP, exponentially decayed to \(10^{-5}\) after \(400\)k iterations; (2) \(5 \times 10^{-4}\) for the other neural networks, exponentially decayed to \(5 \times 10^{-5}\) after \(400\)k iterations. We train the model for \(400\)k iterations for each subject.
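The learning-rate schedule can be sketched as a geometric interpolation between the start and end values. The exact decay form is an assumption (the text only states the endpoints and that the decay is exponential), and `exponential_decay` is a hypothetical helper name.

```python
def exponential_decay(step, lr_start, lr_end, total_steps=400_000):
    """Interpolate geometrically from lr_start (step 0) to lr_end (total_steps)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_start * (lr_end / lr_start) ** progress

rays_per_batch = 256 * 8  # 256 rays from each of 8 images

for step in (0, 200_000, 400_000):
    warp_lr = exponential_decay(step, 1e-4, 1e-5)   # warping field MLP
    other_lr = exponential_decay(step, 5e-4, 5e-5)  # all other networks
    print(step, f"{warp_lr:.2e}", f"{other_lr:.2e}")
```

With this form both learning rates drop by the same factor of ten over training, so their ratio stays fixed at five throughout.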

In Table 4, we show more detailed data statistics for the 10 subjects.

Subject | Train Frames | Test Frames | Resolution |
---|---|---|---|
subject0 | \(1560\) | \(434\) | (\(512\), \(402\)) |
subject1 | \(1480\) | \(740\) | (\(512\), \(422\)) |
subject2 | \(1440\) | \(603\) | (\(512\), \(380\)) |
subject3 | \(1360\) | \(564\) | (\(512\), \(368\)) |
subject4 | \(1450\) | \(304\) | (\(512\), \(398\)) |
subject5 | \(2655\) | \(595\) | (\(512\), \(372\)) |
subject6 | \(1818\) | \(696\) | (\(512\), \(452\)) |
subject7 | \(3912\) | \(817\) | (\(512\), \(512\)) |
subject8 | \(2656\) | \(898\) | (\(512\), \(344\)) |
subject9 | \(2049\) | \(351\) | (\(512\), \(512\)) |

In this section, we provide additional experimental results and comparisons with prior state-of-the-art methods. Please see Figs. 6 and 7 for qualitative results and the accompanying **supplementary webpage** for video results.

In Figs. 6 and 7, we show more image results in comparison with prior state-of-the-art approaches. PointAvatar [11] and INSTA [10] give overall inferior renderings compared to ours due to their limited capacity for capturing static (e.g., glasses and hair) and dynamic (e.g., expression-dependent deformations and wrinkles) avatar details. NeRFBlendshape [9] produces less stable results, leading to severe artifacts around the mouth and obvious floaters on avatar boundaries. MonoAvatar [7] stably generates high-quality renderings and animations, but is much slower than our method (\(0.5\) FPS vs. \(35.9\) FPS) and slightly smoother on some details, for example, hair, teeth, and wrinkles. Overall, our method achieves one of the best rendering qualities. Please refer to Sec. 12 for more discussion of our limitations.

In the accompanying **supplementary webpage**, we include video results on various subjects with side-by-side comparisons to prior state-of-the-art methods, including PointAvatar [11], INSTA [10], NeRFBlendshape [9], and MonoAvatar [7]. The videos
show that our method is able to produce high-quality renderings while maintaining real-time speed.

Here, we provide comparisons to more state-of-the-art methods that deliver relatively fast solutions for (partially) volumetric head avatars, including AvatarMAV [23] and LatentAvatar [33]. AvatarMAV [23] represents the head avatar with feature grid blendshapes to achieve fast training. LatentAvatar [33] learns a neural expression latent space instead of using 3DMM expression codes, and generates triplanes from this expression latent space. The triplane is rendered into a low-resolution feature map, which is then used to synthesize the output RGB images via a 2D CNN.

As shown in Table 5, our method achieves rendering quality comparable to these SOTA approaches while also supporting real-time rendering. Note that LatentAvatar [33] uses heavy CNNs to directly synthesize the output images, which leads to good sharpness (as reflected by LPIPS) but also to high-frequency details that are inconsistent over time and in 3D. Directly synthesizing images with a CNN is also orthogonal to our method and could be appended to our pipeline.

Method | LPIPS | SSIM | PSNR | Mean FPS |
---|---|---|---|---|
AvatarMAV [23] | 0.128 | 0.792 | 23.51 | 2 |
LatentAvatar [33] | 0.092 | 0.763 | 21.94 | 16 |
Ours | 0.100 | 0.795 | 22.77 | 35.9 |

Here, we investigate a new ablation setting, *No Hash + UV CNN*, where we discard all hash tables while keeping the other parts unchanged. In this way, our model decodes the neural radiance field entirely from the vertex-attached features as in MonoAvatar [7], but with a much smaller MLP for fast rendering. This gives \(0.164\) / \(0.759\) / \(21.57\) for LPIPS / SSIM / PSNR, which is substantially worse than our full model *Ours*. This indicates that the local hash tables are important for boosting the model capacity to achieve photorealistic rendering quality.

For a comprehensive system analysis, we visualize the resulting geometry (as normal maps) of our avatars and compare it with prior state-of-the-art approaches. Fig. 8 shows the normal map visualization. PointAvatar [11] gives smooth geometry estimates thanks to its relighting formulation, but its renderings are overall blurrier than those of other methods. INSTA [10] generates geometry close to the 3DMM meshes since it regularizes the NeRF depth to the rasterized 3DMM depth in the face region, which also leads to incorrect shapes for beards. Moreover, its rendered images also suffer from unsatisfying quality. Although NeRFBlendshape [9] gives relatively good renderings, its estimated geometry is very noisy, presumably because it does not leverage the 3DMM mesh as a shape prior. MonoAvatar [7] gives both decent geometry and renderings, but is more than an order of magnitude slower than real-time speed. Our method gives reasonable geometry that is slightly noisier than that of MonoAvatar, but supports real-time rendering while maintaining decent image quality.

Compared with the state-of-the-art high-quality MonoAvatar [7], our method faces some quality trade-offs. On the positive side, our method captures more high-frequency details such as hair, teeth, and wrinkles thanks to the high flexibility of hash table embeddings. However, this also introduces slightly more floaters than MonoAvatar [7] (Fig. 9), which is a common issue for methods based on Instant NGP [8]. Presumably for the same reason, as well as poor tracking, we also observe slightly less stable performance around the mouth interior and thin structures such as glasses frames. Further improving robustness and stability without hurting quality and speed is an interesting direction for future work.

[1]

“VRChat.” https://hello.vrchat.com. Accessed: 2023-02-05.

[2]

Z. Waggoner, *My avatar, my self: Identity in video role-playing games*. McFarland, 2009.

[3]

“XHolo virtual assistant.” https://www.digalix.com/en/virtual-assistant-augmented-reality-xholo. Accessed: 2023-03-01.

[4]

S. Orts-Escolano *et al.*, “Holoportation: Virtual 3D teleportation in real-time,” 2016, doi: 10.1145/2984511.2984517.

[5]

K. Guo *et al.*, “The relightables: Volumetric performance capture of humans with realistic relighting,” *TOG*, 2019, doi: 10.1145/3355089.3356571.

[6]

A. Meka *et al.*, “Deep relightable textures - volumetric performance capture with neural rendering,” in *ACM Transactions on Graphics (Proceedings SIGGRAPH
Asia)*, 2020, doi: 10.1145/3414685.3417814.

[7]

Z. Bai *et al.*, “Learning personalized high quality volumetric head avatars from monocular RGB videos,” 2023, pp. 16890–16900.

[8]

T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” *ACM Trans. Graph.*, vol. 41, no. 4, pp. 102:1–102:15, Jul. 2022, doi: 10.1145/3528223.3530127.

[9]

X. Gao, C. Zhong, J. Xiang, Y. Hong, Y. Guo, and J. Zhang, “Reconstructing personalized semantic facial NeRF models from monocular video,” *ACM Transactions on Graphics
(Proceedings of SIGGRAPH Asia)*, vol. 41, no. 6, 2022, doi: 10.1145/3550454.3555501.

[10]

W. Zielonka, T. Bolkart, and J. Thies, “Instant volumetric head avatars,” in *CVPR*, 2023, pp. 4574–4584.

[11]

Y. Zheng, W. Yifan, G. Wetzstein, M. J. Black, and O. Hilliges, “Pointavatar: Deformable point-based head avatars from videos,” in *CVPR*, 2023, pp. 21057–21067.

[12]

M. Zollhöfer *et al.*, “State of the art on monocular 3D face reconstruction, tracking, and applications,” 2018, vol. 37, pp. 523–550, doi: 10.1111/cgf.13382.

[13]

.
[14]

A. Tewari *et al.*, “Advances in neural rendering,” Wiley Online Library, 2022, vol. 41, pp. 703–735.

[15]

S. Lombardi, T. Simon, G. Schwartz, M. Zollhoefer, Y. Sheikh, and J. Saragih, “Mixture of volumetric primitives for efficient neural rendering,” *TOG*, vol. 40, no.
4, pp. 1–13, 2021.

[16]

T. Beeler *et al.*, “High-quality passive facial performance capture using anchor frames,” 2011, pp. 1–10.

[17]

C. Cao *et al.*, “Authentic volumetric avatars from a phone scan,” *TOG*, 2022.

[18]

A. Meka *et al.*, “Deep relightable textures: Volumetric performance capture with neural rendering,” *TOG*, vol. 39, no. 6, pp. 1–21, 2020, doi: 10.1145/3414685.3417814.

[19]

“MetaHuman - Unreal Engine.” https://www.unrealengine.com/en-US/metahuman. Accessed: 2022-10-17.

[20]

G. Gafni, J. Thies, M. Zollhofer, and M. Nießner, “Dynamic neural radiance fields for monocular 4d facial avatar reconstruction,” in *CVPR*, 2021, pp. 8649–8658.

[21]

S. Athar, Z. Xu, K. Sunkavalli, E. Shechtman, and Z. Shu, “RigNeRF: Fully controllable neural 3D portraits,” 2022, pp. 20364–20373, doi: 10.1109/CVPR52688.2022.01972.

[22]

Y. Zheng, V. F. Abrevaya, M. C. Bühler, X. Chen, M. J. Black, and O. Hilliges, “Im avatar: Implicit morphable head avatars from videos,” in *CVPR*, 2022, pp. 13545–13555.

[23]

Y. Xu, L. Wang, X. Zhao, H. Zhang, and Y. Liu, “Avatarmav: Fast 3d head avatar reconstruction using motion-aware neural voxels,” in *ACM SIGGRAPH 2023 Conference Proceedings*, 2023, pp. 1–10.

[24]

C. Chen, M. O’Toole, G. Bharaj, and P. Garrido, “Implicit neural head synthesis via controllable local deformation fields,” in *CVPR*, 2023, pp. 416–426.

[25]

L. Liu, J. Gu, K. Zaw Lin, T.-S. Chua, and C. Theobalt, “Neural sparse voxel fields,” *Advances in Neural Information Processing Systems*, vol. 33, pp. 15651–15663,
2020.

[26]

C. Sun, M. Sun, and H.-T. Chen, “Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction,” in *CVPR*, 2022.

[27]

A. Yu, R. Li, M. Tancik, H. Li, R. Ng, and A. Kanazawa, “Plenoctrees for real-time rendering of neural radiance fields,” in *Proceedings of the IEEE/CVF International Conference on Computer Vision*, 2021, pp. 5752–5761.

[28]

C. Reiser, S. Peng, Y. Liao, and A. Geiger, “Kilonerf: Speeding up neural radiance fields with thousands of tiny mlps,” in *Proceedings of the IEEE/CVF International Conference on Computer Vision*, 2021, pp. 14335–14345.

[29]

A. Chen, Z. Xu, A. Geiger, J. Yu, and H. Su, “Tensorf: Tensorial radiance fields,” in *Computer Vision – ECCV 2022*, Springer, 2022, pp. 333–350.

[30]

Z. Chen, T. Funkhouser, P. Hedman, and A. Tagliasacchi, “MobileNeRF: Exploiting the polygon rasterization pipeline for efficient neural field rendering on mobile
architectures,” *arXiv preprint arXiv:2208.00277*, 2022.

[31]

B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” *TOG*, vol. 42, no. 4, pp. 1–14, 2023.

[32]

X. Zhao, L. Wang, J. Sun, H. Zhang, J. Suo, and Y. Liu, “Havatar: High-fidelity head avatar via facial model conditioned neural radiance field,” *TOG*, vol. 43, no.
1, pp. 1–16, 2023.

[33]

Y. Xu *et al.*, “Latentavatar: Learning latent expression code for expressive neural head avatar,” 2023, pp. 1–10.

[34]

H. Kim *et al.*, “Deep video portraits,” *TOG*, vol. 37, no. 4, pp. 1–14, 2018, doi: 10.1145/3197517.3201283.

[35]

M. R. Koujan, M. C. Doukas, A. Roussos, and S. Zafeiriou, “Head2head: Video-based neural head synthesis,” in *2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)*, IEEE, 2020, pp. 16–23, doi: 10.1109/FG47880.2020.00048.

[36]

M. C. Doukas, M. R. Koujan, V. Sharmanska, A. Roussos, and S. Zafeiriou, “Head2head++: Deep facial attributes re-targeting,” *IEEE Transactions on Biometrics, Behavior,
and Identity Science*, vol. 3, no. 1, pp. 31–43, 2021, doi: 10.1109/TBIOM.2021.3049576.

[37]

L. Wang *et al.*, “Styleavatar: Real-time photo-realistic portrait avatar from a single video,” 2023, pp. 1–10.

[38]

T. Li, T. Bolkart, M. J. Black, H. Li, and J. Romero, “Learning a model of facial shape and expression from 4D scans,” *TOG*, vol. 36, no. 6, pp. 194:1–194:17, 2017, doi: 10.1145/3130800.3130813.

[39]

W. Jiang, K. Moo Yi, G. Samei, O. Tuzel, and A. Ranjan, “NeuMan: Neural human radiance field from a single video,” 2022.

[40]

K. Park *et al.*, “Nerfies: Deformable neural radiance fields,” 2021, pp. 5865–5874, doi: 10.1109/ICCV48922.2021.00581.

[41]

Z. Bai *et al.*, “Learning personalized high quality volumetric head avatars from monocular RGB videos,” 2023.

[42]

C. Lugaresi *et al.*, “Mediapipe: A framework for building perception pipelines,” *arXiv preprint arXiv:1906.08172*, 2019.

[43]

S. Lin, L. Yang, I. Saleemi, and S. Sengupta, “Robust high-resolution video matting with temporal guidance,” in *Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision*, 2022, pp. 238–247.

[44]

R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” 2018, pp. 586–595, doi: 10.1109/CVPR.2018.00068.

[45]

P. Cignoni, M. Callieri, M. Corsini, M. Dellepiane, F. Ganovelli, and G. Ranzuglia, “MeshLab: An open-source mesh processing tool,” in *Eurographics Italian Chapter Conference*, V. Scarano, R. De Chiara, and U. Erra, Eds., 2008, doi: 10.2312/LocalChapterEvents/ItalChap/ItalianChapConf2008/129-136.

[46]

C.-Y. Weng, B. Curless, P. P. Srinivasan, J. T. Barron, and I. Kemelmacher-Shlizerman, “Humannerf: Free-viewpoint rendering of moving people from monocular video,” in *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, 2022, pp. 16210–16220.

[47]

B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “NeRF: Representing scenes as neural radiance fields for view synthesis,”
*Communications of the ACM*, vol. 65, no. 1, pp. 99–106, 2021, doi: 10.1145/3503250.

\(^{*}\)Work was conducted while Ziqian Bai was an intern at Google.↩︎