Generative AI for Immersive Communication: The Next Frontier in Internet-of-Senses Through 6G


Over the past two decades, the Internet of Things (IoT) has been a transformative concept, and as we approach 2030, a new paradigm known as the Internet of Senses (IoS) is emerging. Unlike conventional virtual reality (VR), the IoS seeks to provide multi-sensory experiences, acknowledging that in our physical reality, our perception extends far beyond sight and sound; it encompasses a range of senses. This article explores the existing technologies driving immersive multi-sensory media, delving into their capabilities and potential applications. This exploration includes a comparative analysis between conventional immersive media streaming and a proposed use case that leverages semantic communication empowered by generative AI. The focal point of this analysis is the substantial reduction in bandwidth consumption, by 99.93%, in the proposed scheme. Through this comparison, we aim to underscore the practical applications of generative AI for immersive media while addressing the challenges and outlining future trajectories.

1 Introduction

The advent of the 5th generation (5G) of mobile networks and recent advancements in computing technologies have redefined the concept of the Internet, transitioning it from basic connectivity to a more advanced digital experience: from merely faster communication to immersive interaction with the digital realm. This concept has recently been introduced under the umbrella of the Metaverse and digital twins, and has opened up a wide range of applications, including virtual reality (VR), augmented reality (AR), holoportation, and tele-operation, among others. Within this realm, four main underpinnings have been identified as paradigms for linking the cyber and physical worlds, namely, connected intelligent machines, a digitized programmable world, a connected sustainable world, and the Internet of Senses (IoS) [1]. The IoS concept is set to revolutionize digital interactions by creating a fully immersive environment that transcends traditional boundaries. By integrating sensory experiences such as sight, sound, touch, smell, and taste into the digital realm, this technology promises a more engaging cyber world, where virtual experiences are as rich and multi-dimensional as the physical world.

Figure 1: Key concepts of IoS

Human beings experience the world through different senses, perceiving sensory signals that are integrated or segregated in the brain. If these senses, especially haptic feedback, are accurately represented so as to be coherent with the real world, they can positively influence actions and behaviors, such as reaction time and detection [2]. Within this context, IoS technology will allow individuals to experience a wide range of sensations remotely, revolutionizing various verticals, including industry, healthcare, networking, education, and tourism, to name a few. In order to reap the full potential of IoS technology, numerous challenges need to be tackled to achieve a fully immersive multisensory experience. These challenges pertain to the temporal synchronization of multiple media, addressing motion sickness, ensuring high throughput, and minimizing end-to-end (E2E) latency. The collection of data from various sensor modalities, such as visual, audio, and haptic, plays a vital role in crafting a multisensory experience, in which this data can be synchronized at either the source or the destination (i.e., end devices or edge servers). The failure of virtual experiences to truly replicate our senses confuses the human brain, leading to symptoms like nausea, dizziness, and migraines. To mitigate these drawbacks, it is crucial to enhance the realism of virtual sensations and reduce latency in VR/AR devices, thereby minimizing latency between different modalities and avoiding mismatch among them [3]. Furthermore, for accurate control over a distance of up to one mile and to prevent the occurrence of motion sickness, it is crucial to transmit the sensory information at extremely low E2E latency, ideally within 1-10 milliseconds [4].

With respect to the key performance indicators (KPIs) for reliable communication of immersive media in the IoS, it has been demonstrated that future 6G networks should realize an E2E latency on the order of 1 ms for high-quality video streaming and haptic signals, with data rate requirements ranging from tens of Mbps to 1 Tbps and a reliability of \(10^{-7}\) [5]. Also, while the requirements of taste and smell signals are less stringent than those of video and haptics, it is essential to realize perfect synchronization among signals from different senses to achieve the full potential of the IoS. Among various technologies, semantic communication emerges as a promising candidate to achieve ultra-low latency communication by communicating the meanings/semantics of transmitted messages instead of the whole signal, yielding faster and more bandwidth-efficient transmission.

As advanced AI systems, large language models (LLMs) have recently been deemed super-compressors capable of extracting the essential information to be communicated using a smaller message (a prompt) [6]. LLMs are deep neural networks (DNNs) with over a billion parameters, often reaching tens or even hundreds of billions, trained on extensive natural language datasets. This comprehensive parameterization unleashes a level of capability in generation, reasoning, and generalization that was previously unattainable in traditional DNN models [7]. While the messages recovered by an LLM will not be identical to the originals, they sufficiently represent their meanings and convey the intended messages.

Accordingly, LLMs are envisioned to evolve into the cognitive hub of the IoS, addressing intricate challenges like synchronization and compression through estimation from partial modalities and communication enabled by semantic understanding. Additionally, LLMs are poised to enhance machine control intelligence, thereby improving reliability in teleoperations, by managing the various data modalities pertinent to the user and environmental senses, as illustrated in Fig. 1.

In recent developments, LLMs have advanced to handle diverse modalities beyond text, encompassing audio, images, and video. The resulting multimodal LLMs (MLLMs) can harness multiple data modalities to emulate human-like perception, integrating visual and auditory senses, and beyond [8]. MLLMs enable the interpretation of, and response to, a broader spectrum of human communication, promoting more natural and intuitive interactions, including image-to-text understanding (e.g., BLIP-2), video-to-text comprehension (e.g., LLaMA-VID), and audio-text understanding (e.g., Qwen-Audio). More recently, the development of MLLMs has aimed at achieving any-to-any multi-modal comprehension and generation (e.g., Visual ChatGPT).

In this paper, we aim to set the scene for the integration of LLMs and IoS technology, developing a case study to demonstrate the benefits that can be obtained from exploiting the capabilities of LLMs to enhance the latency performance of immersive media communication. In particular, we conceptualize 360\(^{\circ}\) video streaming from an unmanned aerial vehicle (UAV) as a semantic communication task. Initially, we employ object detection and image-to-text captioning to extract semantic information (text) from the input 360\(^{\circ}\) frame. Subsequently, this generated textual information is transmitted to the edge server, where an LLM is utilized to produce code compatible with A-Frame, facilitating the display of the corresponding image through 3D virtual objects on the head-mounted display (HMD). Lastly, the generated code is sent to the receiver, allowing for the direct rendering of the 3D virtual content on the HMD.
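The end-to-end flow just described can be sketched as a short chain of stages. All function bodies below are hypothetical placeholders; the real system uses YOLOv7 for detection, an image-captioning model, and an LLM on the edge server.

```python
# Illustrative sketch of the semantic-communication pipeline described
# above. Every stage function is a placeholder standing in for the real
# components (YOLOv7 detection, image captioning, code-generating LLM).

def detect_objects(frame: bytes) -> list:
    """Placeholder for object detection on a 360-degree frame."""
    return ["tree", "river"]

def caption_frame(frame: bytes) -> str:
    """Placeholder for image-to-text captioning."""
    return "a drone view of a forest crossed by a river"

def generate_aframe_code(caption: str, objects: list) -> str:
    """Placeholder for the LLM that emits A-Frame markup for the HMD."""
    entities = "".join('<a-entity id="%s"></a-entity>' % o for o in objects)
    return "<a-scene><!-- %s -->%s</a-scene>" % (caption, entities)

def semantic_stream(frame: bytes) -> str:
    # Transmitter side: extract semantics instead of sending pixels.
    objects = detect_objects(frame)
    caption = caption_frame(frame)
    # Edge-server side: turn the semantics into renderable 3D code.
    return generate_aframe_code(caption, objects)

code = semantic_stream(b"\x00" * 1024)  # stand-in for one captured frame
```

Only the short A-Frame string crosses the bandwidth-constrained link, which is the source of the savings discussed later.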

The contributions of this paper are summarized as follows:

  • Conceptualize 360\(^{\circ}\) video streaming via UAV as a semantic communication framework.

  • Harness the power of an image-to-text captioning model and a GPT decoder-only LLM to generate A-Frame code suitable for display on the user’s HMD.

  • Benchmark the proposed framework in terms of bandwidth consumption and communication latency across various components of the semantic communication framework.

  • Assess the quality of the 3D objects generated by our system compared to the captured 360\(^{\circ}\) video images using reverse image-to-text captioning, followed by text comparison through the BERT model.

The remainder of this paper is organized as follows. Section 2 introduces the IoS and discusses its necessity. Section 3 provides an overview of the development of MLLMs and their applications to the IoS. Section 4 presents a case study with a proposed testbed, and Section 5 reports its implementation and experimental analysis. Finally, Section 6 highlights challenges and suggests directions for future research.

2 Definitions and Key Concepts of the IoS

In this section, we present the key concepts of the IoS concerning various interfaces and discuss the imperative nature of the IoS.

2.1 Immersive All-Sense Communication

To deliver a truly immersive experience, indistinguishable from reality, it is imperative to incorporate all human senses, including touch, taste, and scent, as well as brain-computer interfaces (BCIs), in addition to sight and sound. The human brain processes information from all senses to construct a comprehensive understanding of our environment. This necessity has given rise to the conceptualization of the IoS, a framework in which signals conveying information for all human senses are digitally streamed. This innovative concept aims to bridge the gap between physical and virtual reality, facilitating telepresence-style communication. Consequently, we categorize the various fundamental aspects of the IoS as the Internet of Touch, Internet of Taste, Internet of Smell, Internet of Sound, Internet of Sight, and BCIs. Concurrently, generative AI, and more specifically LLMs, emerges as a pivotal concept within the IoS for semantic communication and synchronization, achieved by generating multiple media simultaneously, as illustrated in Fig. 1.

Internet-of-Touch. Haptic sensation refers to the sense of touch, known as tactile sensation, and it enhances immersive multimedia by allowing individuals to feel physical sensations, such as interactions with objects and movements (kinesthetic sensation). In VR training or teleoperation, haptics replicate touch, which is crucial for tasks like surgery. Achieving optimal haptic experiences requires minimal response times and low latency in synchronization with other sensed media, such as audio and video. Haptic interfaces employ various technologies to deliver tactile sensations, ranging from simple vibration feedback to more complex systems providing force feedback, pressure sensitivity, or even localized temperature changes. Devices like haptic gloves, exoskeletons, or tactile feedback controllers enable users to touch, grasp, and interact with virtual objects in a natural and intuitive manner.

Internet-of-Taste. Gustatory perception involves the intricate process of detecting and interpreting flavors. While traditional VR primarily focuses on visual and auditory stimuli, incorporating taste into the virtual environment has the potential to enhance sensory engagement, leading to more realistic and immersive experiences. The technology underlying gustatory interfaces centers on the controlled stimulation of taste receptors. Various approaches are being explored, such as electrically stimulating taste buds [9] or delivering taste-related chemical compounds directly to the mouth [10]. However, it is crucial to note that replicating the sense of taste is the most complex, as it closely depends on other sensations. Presently, the technology is still at the laboratory demonstration stage.

Internet-of-Smell. Digital scent technology, involved in recognizing or generating scents, employs electrochemical sensors and machine learning for scent recognition. Scent synthesis, on the other hand, utilizes chemical or electrical stimulation. Digital noses, electronic devices that detect odors, are increasingly prevalent in tasks such as quality control and environmental monitoring. In the food industry, digital noses ensure product quality by detecting off-flavors and maintaining taste and quality standards. In the perfume industry, digital noses evaluate aroma intensity and longevity, monitoring changes over time. Beyond industry, olfactory interfaces in everyday life enhance emotional and cognitive functions, productivity, and relaxation in virtual environments, as smell influences our daily emotions by 75% [11]. This technology is particularly valuable in VR, contributing to enhanced realism in training, enriched culinary experiences, authentic atmospheres in tourism simulations, and therapeutic applications. The technology behind smell interfaces involves the emission and dispersion of scents in a controlled manner. Different approaches have been explored, including odor-releasing devices, cartridges, or even scent generators embedded within VR headsets. These devices release scents or chemical compounds in response to specific cues or triggers, such as visual events or audio cues, to enhance the user’s sensory experience.

Internet-of-Sight. XR devices, encompassing VR, AR, and mixed reality (MR) headsets, glasses, or smart contact lenses, can offer a highly immersive experience for viewing video content along with haptic and other sensations. These devices have the capability to create a profound sense of presence and transport the viewer to a virtual environment, enabling them to feel as if they are physically present in the content. In recent years, the use of 360\(^{\circ}\) video streaming has been on the rise, enabling viewers to experience immersive video content from multiple angles. This technology has gained popularity in various industries, including entertainment, sports, education, and robot teleoperation.

Internet-of-Audio. Spatial audio pertains to the creation and reproduction of audio in a manner that simulates the perception of sound originating from various directions and distances. This process involves positioning sounds in a three-dimensional space to align with the visual environment. Spatial audio is a crucial element in crafting immersive experiences, as synchronized spatial audio reproduction complements visual information, thereby enhancing user immersion and quality of experience (QoE) [12].

The brain as a user interface. BCIs enable direct communication and control by translating neural activity into machine-readable signals. In the context of the IoS, a brain is required to execute actions based on the perception of multiple senses. This can be either a human brain, utilizing a BCI for action, or a multimodal AI.

2.2 Why Do We Need the IoS?

The IoS holds significant potential in contributing to various technological advancements and enhancing user experiences in different domains. For example, in the entertainment domain, the heightened level of immersion can offer more realistic and engaging interactions, revolutionizing how users perceive and interact with digital content. Envisioning scenarios in movies, one not only witnesses but also smells the aftermath of an explosion, immersing oneself in the heat and vibrations of the scene. Furthermore, the IoS can contribute to advancements in healthcare by providing more accurate and real-time data for monitoring patients. For example, remote patient monitoring, telemedicine, and neuroimaging technologies can benefit from the IoS to improve diagnostics and treatment. At the business level, retail experiences can be enriched through multisensory interactions, and marketing strategies can achieve higher engagement by appealing to multiple senses. Also, with the IoS, the way humans interact with machines can become more intuitive and natural. Thought-controlled interfaces, allowing users to perform actions simply by thinking, have the potential to eliminate the need for traditional input devices and enhance the efficiency of human-machine interaction. Moreover, in hazardous situations and environments, workers can utilize telepresence technology enabled by the IoS to remotely control robots. This ensures safe operations in scenarios where the physical presence of humans could pose risks, such as handling dangerous materials or navigating challenging terrains.

3 Foundation Models for the IoS

In this section, we offer a concise overview of the evolution of foundation models towards MLLMs and their potential applications in the era of the IoS, specifically focusing on image and video transmission.

3.1 Advancement of Language Models

The progress in natural language processing (NLP) research has led to the development of models such as GPT-2, BART [13], and BERT [14]. These models have sparked a new race to construct more efficient models with large-scale architectures encompassing hundreds of billions of parameters. The most popular architecture is the decoder-only one, adopted by LLMs like GPT-3, Chinchilla, and LaMDA. Following the release of open-source LLMs like OPT and BLOOM, more efficient open-source models have recently been introduced, such as Vicuna, Phi-1/2, LLaMA, Falcon, Mistral, and Mixtral. The latter follows the mixture-of-experts (MoE) architecture and training process initially proposed in MegaBlocks [15]. Despite having fewer parameters, these models, fine-tuned on high-quality datasets, have demonstrated compelling performance on various NLP tasks, surpassing their larger counterparts. Furthermore, instruction tuning the foundation models on high-quality instruction datasets enables versatile capabilities such as chat and source-code generation. LLMs have also shown the unexpected capability of learning from context (i.e., prompts), referred to as in-context learning (ICL).

3.2 Multimodal Large Language Models

Extending foundation models to multimodal capabilities has garnered significant attention in recent years. Several approaches for aligning visual input with a pre-trained LLM for vision-language tasks have been explored in the literature [16]. Pioneering works such as VisualGPT and Frozen utilized pre-trained LLMs for tasks like image captioning and visual question answering. More advanced vision-language models (VLMs), such as Flamingo, BLIP-2, and LLaVA, follow a similar process by first extracting visual features from the input image using the CLIP ViT encoder. Then, they align the visual features with the pre-trained LLM using specific alignment techniques. For instance, LLaVA relies on a simple linear projection, Flamingo uses gated cross-attention, and BLIP-2 introduces the Q-Former module. These models are trained on large image-text pair datasets, where only the projection weights are updated while the encoder and the LLM remain frozen, mitigating training complexity and addressing catastrophic forgetting issues.

In this era of MLLMs, GPT-4 has demonstrated remarkable performance in vision-language tasks encompassing comprehension and generation. Nevertheless, in addition to its intricate nature, the technical details of GPT-4 remain undisclosed, and its source code is not publicly available, impeding direct modifications and enhancements [17]. To address these challenges, the MiniGPT-4 model was proposed. This model combines a vision encoder (ViT-G/14 and Q-Former) with the Vicuna LLM, utilizing only one projection layer to align visual features with the language model while keeping all other vision and language components frozen. The model is first trained on image-language datasets and then fine-tuned on a small set of high-quality image-description pairs (3,500) to improve the naturalness and usability of the generated language. The TinyGPT-V vision model, introduced by Yuan et al. [18], addresses computational complexity, requiring only a 24 GB GPU for training and an 8 GB GPU for inference. The architecture of TinyGPT-V closely resembles that of MiniGPT-4, incorporating a novel linear projection layer designed to align visual features with the Phi-2 language model, which has only 2.7 billion parameters. TinyGPT-V undergoes a sophisticated training and fine-tuning process in four stages, where both the weights of the linear projection layers and the normalization layers of the language model are updated. As the process progresses, instruction datasets are incorporated in the third stage, and multi-task learning is employed during the fourth stage.

The second step in developing LLMs is fine-tuning the model on instruction datasets to teach it to better understand human intentions and generate accurate responses. InstructBLIP [19] is built through instruction tuning of the pre-trained BLIP-2 model on 26 instruction datasets grouped into 11 tasks. During the instruction tuning process, the LLM and the image encoder are kept frozen, while only the Q-Former undergoes fine-tuning. Furthermore, the instructions are input to both the frozen LLM and the Q-Former. Notably, InstructBLIP exhibits exceptional performance across various vision-language tasks, showcasing remarkable generalization capabilities on unseen data. Moreover, when employed as the model initialization for individual downstream tasks, InstructBLIP models achieve state-of-the-art fine-tuning performance. InstructionGPT-4 [20] is a vision model fine-tuned on a small dataset comprising only 200 examples, approximately 6% of the instruction-following data used in the alignment dataset for MiniGPT-4. The study highlights that fine-tuning the vision model on a high-quality instruction dataset enables the generation of superior output compared to MiniGPT-4.

3.3 Potential Applications of MLLMs in Semantic Communication

The lossy compression of images and videos has always involved a tradeoff between distortion (\(D\)), representing the reconstructed quality, and the coding rate (\(R\)). Distortion quantifies the errors introduced by compression, measured between the original sample \({\boldsymbol{x}}\) and its reconstructed version \({\hat{\boldsymbol{x}}}\) as the p-norm distance \(||{\boldsymbol{x}} - \hat{\boldsymbol{x}}||^p_p\). The rate \(R\) denotes the amount of data, in bits or bits/second, required to represent the sample after compression. Compression aims to minimize distortion under rate constraints, typically formulated as the Lagrangian minimization \(\min_{\boldsymbol{\hat{x}}} (D + \lambda R)\), where \(\lambda\) is the Lagrangian parameter.
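The Lagrangian formulation above can be illustrated with a toy operating-point selection; the (R, D) pairs below are invented for illustration, not measured values.

```python
# Toy rate-distortion selection: choose the encoder operating point that
# minimizes the Lagrangian cost D + lambda * R. The (R, D) pairs are
# invented for illustration.

def select_operating_point(points, lam):
    """points: iterable of (rate, distortion) pairs."""
    return min(points, key=lambda p: p[1] + lam * p[0])

# Hypothetical encoder operating points (rate in kbits, MSE distortion).
points = [(100, 40.0), (200, 25.0), (400, 15.0), (800, 10.0)]

low_lam = select_operating_point(points, 0.001)   # rate is cheap
high_lam = select_operating_point(points, 0.1)    # rate is expensive
```

A small \(\lambda\) favors quality (the high-rate point wins), while a large \(\lambda\) penalizes rate and pushes the encoder toward coarser reconstructions.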

In a real-time video transmission system, end-to-end latency plays a crucial role in determining system performance. Within this context, two distinct scenarios can be distinguished: offline video streaming and live video streaming. In live video streaming, the end-to-end latency encompasses the delays introduced by all streaming components, including acquisition, coding, packaging, transmission, depackaging, decoding, and display. Moreover, all these components need to operate at a frequency at or above the video frame rate. In the offline scenario, by contrast, video encoding and packaging are performed offline, exempting these two components from real-time constraints and the delays they typically introduce.
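As a rough illustration of the live-streaming budget, the E2E latency is the sum of the per-component delays, and each component must finish within one frame interval to sustain the frame rate; all delay values below are assumed, not measured.

```python
# Live-streaming E2E latency is the sum of the per-component delays
# listed above; each component must complete within one frame interval
# to sustain the frame rate. All delay values are illustrative.

components_ms = {
    "acquisition": 8, "coding": 12, "packaging": 2, "transmission": 20,
    "depackaging": 2, "decoding": 8, "display": 8,
}
e2e_ms = sum(components_ms.values())   # total glass-to-glass delay

frame_rate = 30
frame_budget_ms = 1000 / frame_rate    # ~33.3 ms per frame at 30 fps
sustains_rate = all(d <= frame_budget_ms for d in components_ms.values())
```

In the offline scenario, the "coding" and "packaging" entries drop out of the real-time constraint, shrinking the live budget accordingly.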

Figure 2: Proposed architecture for GenAI-enabled immersive communication

The recent advances in LLMs and MLLMs represent a transformative shift in video streaming. In this section, we explore three use cases integrating LLMs and MLLMs into the video streaming framework. The first use case involves the application of an LLM to the lossless compression of images or videos, serving as an entropy encoder. Recent research from DeepMind [6] underscores the potent versatility of LLMs as general-purpose compressors, owing to their in-context learning capabilities. Experiments utilizing Chinchilla 70B, trained solely on natural language, revealed impressive compression ratios, achieving 43.4% on ImageNet patches. Notably, this rate outperforms domain-specific image compressors such as PNG (58.5%).
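Under the compression-ratio convention used here (compressed size divided by original size, lower is better), the reported figures compare as follows; the byte counts are illustrative stand-ins.

```python
# Compression ratio as used in the text: compressed size / original
# size (lower is better). Byte counts are illustrative.

def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    return compressed_bytes / original_bytes

original = 1_000_000  # hypothetical size of a batch of ImageNet patches
llm_bytes = 434_000   # 43.4% reported for Chinchilla 70B as compressor
png_bytes = 585_000   # 58.5% reported for PNG on the same data

llm_ratio = compression_ratio(original, llm_bytes)
png_ratio = compression_ratio(original, png_bytes)
```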

The second use case harnesses an MLLM shared between the transmitter and receiver in a lossy coding setting. The transmitter first generates an accurate description of the image or video content through the image captioning capability of the MLLM. Instead of transmitting the image or video, the text description (semantic information) is then sent to the receiver, requiring a significantly lower data rate. At the receiver, the generative capability of the MLLM is harnessed to reconstruct the image or video based on the received text description.

In the third use case, the MLLM is employed solely at the transmitter to leverage its code-generation capability, producing code that represents the image or video for transmission. This code, requiring a lower data rate, is then shared with the receiver, which can use it directly to render the image or video. The intricacies of this latter use case are expounded upon and experimentally explored in the subsequent sections of this paper.
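A back-of-the-envelope comparison of the payloads in the second and third use cases against transmitting a raw frame can be sketched as follows; the frame geometry, caption, and code snippet are assumptions for illustration only.

```python
# Rough payload comparison: one raw equirectangular frame versus the
# caption (use case 2) and generated code (use case 3). All sizes and
# strings are illustrative assumptions.

width, height, channels = 3840, 1920, 3   # assumed frame geometry
frame_bytes = width * height * channels   # uncompressed RGB frame

caption = "a drone view of a pine forest with a narrow river and a bridge"
aframe_code = "<a-scene><a-entity id='forest'></a-entity></a-scene>"

caption_bytes = len(caption.encode("utf-8"))
code_bytes = len(aframe_code.encode("utf-8"))

caption_saving = 1 - caption_bytes / frame_bytes
code_saving = 1 - code_bytes / frame_bytes
```

Even against a codec-compressed frame rather than raw pixels, the text and code payloads remain orders of magnitude smaller, which is the intuition behind the bandwidth reduction reported later in the paper.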

4 Case Study

4.1 Use Case Description

To comprehend the challenges at hand and explore potential solutions, let us immerse ourselves in the following scenario. Consider John, a surveillance teleoperator, confronted with immense challenges as he endeavors to remotely operate drones in first-person view (FPV). His task involves teleoperating a UAV beyond the visual line of sight, navigating through a dense forest using beyond-5G (B5G) networks. Navigating the forest terrain through FPV proves incredibly challenging, primarily for two significant reasons.

  • On the one hand, the environment is characterized by bandwidth limitations that degrade the video stream perceived by John through FPV. This deterioration can significantly impact the teleoperation of the UAV and may lead to potentially catastrophic consequences.

  • On the other hand, the UAV is equipped only with a 360\(^{\circ}\) camera, limiting John to receiving video and sound solely through the FPV. However, this does not entirely convey the environment in which the UAV is flying. Achieving complete immersion in controlling the UAV in such a challenging environment would necessitate additional senses beyond the traditional audiovisual scenes.

To tackle the aforementioned issues, we propose an architecture that harnesses semantic communication through generative AI (GenAI) to facilitate immersive communication. Specifically, the code-generation capability of GenAI is employed to replicate the video captured by the UAV, thereby minimizing bandwidth consumption. This approach provides the drone teleoperator with a secondary stream that can be utilized in case the primary video stream is interrupted. Moreover, GenAI will also facilitate the generation of additional senses beyond the conventional audiovisual scenes, thus paving the way for the IoS.

4.2 Proposed Architecture for GenAI-Enabled Immersive Communication

The proposed architecture empowers a VR user to visualize animated 3D digital objects crafted using WebXR code generated by an LLM. The code for the 3D objects is generated based on feedback from the UAV’s mounted 360\(^{\circ}\) camera, capturing omnidirectional frames of the environment. It is noteworthy that the 3D objects are rendered while the VR user teleoperates the UAV. Simultaneously, the traditional method of transmitting 360\(^{\circ}\) videos is employed as a baseline for comparison, considering factors such as delay, bandwidth consumption, and video quality achieved by our approach.

The architecture is grounded in an edge-to-cloud continuum environment, as illustrated in Fig. 2. The individual components of the end-to-end video streaming architecture are elaborated upon in detail below.

VR user. The VR user is an individual teleoperator, managing one or multiple UAVs in FPV mode using an HMD and its joysticks. The viewing interface is a WebXR-rendered webpage hosted on a web server operating on an edge cloud located in close proximity to the VR user. This web server serves two pages: one displaying the 360\(^{\circ}\) view and another presenting 3D objects from the environment.

Concurrently with the 3D view, the VR headset and other wearables receive a multimedia description file containing values related to temperature and vibration. These values are derived from the environmental image and UAV movements, estimated using the LLM. The haptic feedback and potential heat dissipation wearables replicate the estimated environment concurrently with the view, minimizing synchronization latency as much as possible. Notably, since the generated code represents an animation, it is not necessary to update the view at a high frequency. However, the virtual camera in the spherical projection moves according to the drone’s position, continuously received by the user from the MQTT broker.
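The camera-follows-drone behavior can be sketched as a small update step applied to each telemetry message arriving from the broker; the JSON field names below are assumptions, not the system's actual message schema.

```python
import json

# Apply one drone-pose message from the MQTT broker to the virtual
# camera of the spherical projection. The payload field names ("lat",
# "lon", "alt") are assumed, not the actual schema.

def apply_position(camera: dict, payload: bytes) -> dict:
    msg = json.loads(payload)
    camera.update(lat=msg["lat"], lon=msg["lon"], alt=msg["alt"])
    return camera

camera = {"lat": 0.0, "lon": 0.0, "alt": 0.0}
payload = json.dumps({"lat": 60.186, "lon": 24.828, "alt": 35.0}).encode()
camera = apply_position(camera, payload)
```

In the deployed system this update would run inside the MQTT client's message callback, decoupling pose updates from the (much less frequent) scene-code updates.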

Unmanned Aerial Vehicle. The UAV functions as the real-world data capture device, employing its mounted 360\(^{\circ}\) camera to provide live feedback in the form of a 360\(^{\circ}\) video to the VR user. Simultaneously, the UAV performs object detection using YOLOv7, trained on the MS COCO dataset [21], and subsequent captioning using Inception-v3 [22] on one frame out of every 30. The resulting object annotations and sensory data, including position, are transmitted to the cloud, specifically to an LLM, for additional contextualization and detailed description. Furthermore, the UAV independently streams its position data, comprising altitude, latitude, and longitude, through MAVLink telemetry messages to a drone API. This information is then relayed to the HMD via the MQTT broker. It is important to highlight that we do not host an LLM on the UAV due to its substantial size, computing demands, and energy consumption; doing so would significantly reduce flight times.
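The one-frame-in-thirty sampling policy amounts to a simple stride over the frame index, as the sketch below shows; the helper name is ours, not part of the implementation.

```python
# The UAV captions only one frame out of every 30, keeping the onboard
# load compatible with its small companion computer. The policy is a
# plain stride over the captured frame sequence.

def frames_to_process(total_frames: int, stride: int = 30) -> list:
    """Indices of the frames routed through detection and captioning."""
    return list(range(0, total_frames, stride))

# A 25 s clip at 30 fps has 750 frames but only 25 semantic updates,
# i.e., roughly one scene description per second.
indices = frames_to_process(750)
```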

Cloud server. The cloud server primarily hosts HTTP APIs connected to two LLMs: GPT-3.5 and GPT-4. The first LLM (GPT-3.5) is responsible for providing more context (enhanced image captioning) from an image caption and its object annotations received from the UAV. Subsequently, the second LLM (GPT-4) is prompted with the description generated by the first LLM as an instruction and is tasked with generating HTML code using the A-Frame framework [23] to produce immersive 3D WebXR content representing the image description in a 3D space.

It is worth noting that we opt not to feed captions from the UAV directly to the second LLM. Instead, we utilize the first LLM to fuse captions, annotations, and UAV sensor data, resulting in captions that are more detailed than standard ones [24]. Furthermore, by leveraging the prompt history stored in the LLM memory, we further enhance the accuracy of the descriptions through the LLM’s in-context learning capability. Thereafter, we utilize a second LLM for code generation, ensuring that it does not impact the memory context of the first LLM. This multi-agent approach also enhances response accuracy, with potential improvements exceeding 6% for GPT-3.5.
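The two-agent prompting scheme can be sketched as two prompt builders; the templates and the way telemetry is injected are our assumptions, and the actual GPT-3.5/GPT-4 API calls are omitted.

```python
# Sketch of the two-agent prompting scheme: LLM-1 enriches the caption
# with detections and telemetry, LLM-2 turns the enriched description
# into A-Frame code. Prompt templates are illustrative assumptions.

def build_context_prompt(caption: str, objects: list, telemetry: dict) -> str:
    """Prompt for the first LLM (caption enrichment)."""
    return (
        "Enrich this scene description using the detections and telemetry.\n"
        f"Caption: {caption}\n"
        f"Objects: {', '.join(objects)}\n"
        f"Altitude: {telemetry['alt']} m"
    )

def build_code_prompt(rich_description: str) -> str:
    """Prompt for the second LLM (A-Frame code generation)."""
    return (
        "Generate A-Frame HTML that renders this scene as 3D WebXR "
        "content:\n" + rich_description
    )

prompt1 = build_context_prompt(
    "a forest with a river", ["tree", "river"], {"alt": 35.0})
prompt2 = build_code_prompt("a dense pine forest crossed by a narrow river")
```

Keeping the two prompts in separate conversations is what preserves the first LLM's caption history for in-context learning while the second LLM's context holds only code-generation turns.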

The edge cloud, located in close proximity to the drone, plays a crucial role in three fundamental computations: video streaming and transcoding, message transmission through a publish/subscribe broker, and web serving. In the traditional method of streaming 360\(^{\circ}\) videos, a media server is utilized to receive an RTSP video stream of equirectangular projected frames. Subsequently, these frames are transcoded using an AVC/H.264 encoder for re-streaming through WebRTC, as illustrated in [25].

Table 1: Testbed’s parameters and values.
Parameter Value
Wi-Fi (upload/download) 100 Mbps / 200 Mbps
5G (upload/download) 50 Mbps / 200 Mbps
Ethernet connection 900 Mbps / 800 Mbps
Edge server CPU 8 cores @ 2.5 GHz
Edge server memory 16 GB
Onboard computer memory 8 GB
Onboard CPU 4 cores @ 1.5 GHz
Distance UAV to edge server 300 m
Distance VR HMD to edge server 100 m
VR headset Oculus Quest 2

In contrast, our proposed architecture, centered on generative-AI-driven semantic communication, employs WebXR code generated using an LLM, specifically GPT-4, to represent virtual 3D objects and multimodal descriptions. This data is then transmitted to the user through an MQTT broker and stored in the edge server to construct a digital twin (DT) of the environment. An important consideration is that we refrain from hosting the LLMs at the edge due to their large size and computing requirements.

Envisioned components. Ideally, all processes would run near the end user and the uav, reducing delays and eliminating the need for a separate cloud server. In our specific case, however, this has not been implemented because of the limited power of the edge server. Such limitations are inherent in edge devices, rendering them insufficient for running an llm with 70 billion parameters. Consequently, the proposed solution involves creating a fine-tuned version of the llm that is suitable for hosting on the edge server. The procedure for developing this enhanced llm is detailed in the workflow of our proposed architecture and further explained below.

In the existing workflow, the generated code is stored in the edge server within the dt component. To create a fine-tuned model, supervised fine-tuning is required using both the prompt and the corresponding output of the llm. This data must undergo thorough cleaning to eliminate redundancy and errors. Errors can be identified and corrected using another llm, which then augments our dataset by generating similar data. Once we have this refined dataset of prompt pairs, it can be utilized to develop a quantized fine-tuned model that is capable of running directly on the edge server. This approach aims to further minimize communication latency.
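The dataset-preparation step described above can be sketched as follows; the JSONL field names follow a common fine-tuning convention and are assumptions, and the llm-based error-correction pass is omitted:

```python
import hashlib
import json

# Sketch of building the supervised fine-tuning set from the
# prompt/output pairs stored in the dt component: exact duplicates are
# removed before serializing to JSONL. Field names are hypothetical.

def dedup_pairs(pairs):
    """Drop exact-duplicate (prompt, completion) pairs."""
    seen, clean = set(), []
    for prompt, completion in pairs:
        key = hashlib.sha256((prompt + "\x00" + completion).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            clean.append({"prompt": prompt, "completion": completion})
    return clean

def to_jsonl(records):
    """One JSON object per line, as most fine-tuning tools expect."""
    return "\n".join(json.dumps(r) for r in records)

pairs = [
    ("render a red building", "<a-box color='red'></a-box>"),
    ("render a red building", "<a-box color='red'></a-box>"),  # duplicate
    ("render a blue sky", "<a-sky color='blue'></a-sky>"),
]
dataset = dedup_pairs(pairs)  # two unique records remain
```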

5 Experimental results↩︎

In this section, we present the experimental test based on the architecture depicted in Fig. 2, along with measurements, results analysis, and validation.

5.1 Experimental setup↩︎

Table 2: Description of Videos and Their Durations
Video Video Description Duration
1 Thailand Stitched 360\(^{\circ}\) footage 25s
2 Pebbly Beach 2mins
3 Bavarian Alps 2.05mins
4 Crystal Shower Falls 2mins
5 London on Tower Bridge 30s
6 London Park Ducks and Swans 1.05mins
7 View On Low Waterfall with Nice City 10s
8 Doi Suthep Temple 25s
9 Ayutthaya uav Footage 35s
10 uav video of Aalto University Finland 2mins

The experiment entailed a flight test conducted in proximity to Aalto University. Equirectangular videos, coupled with authentic footage captured by the 360\(^{\circ}\) camera affixed to the uav, were streamed to the vr user situated at Aalto University. This streaming process was carried out via both the conventional approach and our novel method. Throughout the experiment, the uav predominantly maneuvered at various altitudes while adhering to a maximum speed of 5m/s.

For the experiments, we utilized 9 video sequences, all at 4K resolution and streamed by an onboard computer, as detailed in Table 2. We subjected these 9 videos to tests and measurements to assess bandwidth consumption and latency. Furthermore, we employed an additional video for validation purposes, evaluating description similarity results. Notably, this 10th custom video, recorded by our team, was evaluated during a flight test conducted with the uav at Aalto University.

Following the global architecture, the testbed comprises a uav equipped with a 5G modem for communication, an edge hmd with a Wi-Fi connection (chosen due to the operator’s indoor location), an edge cloud connected through fiber, and a cloud server connected via fiber. All components are located in Finland within distances less than 1 km from each other, except for the cloud server situated in the USA. Table 1 provides a detailed description of the hardware used.

The network latencies of the testbed are illustrated in Fig. 3. This figure provides a visual representation of the network latency and connection types among communicating devices within our testbed, encompassing edge to cloud, uav to cloud, uav to edge server, and hmd to edge connections. These latency values delineate the spatial distribution of the devices relative to each other and their respective network connection types.

The highest latency, averaging 48ms, is observed between the uav and the cloud server. This primarily stems from the uav’s mobility and the resultant disruption of the 5G communication link due to frequent handovers at high altitudes. Conversely, latency is slightly lower between the uav and the edge server, owing to the closer proximity of the edge server to the uav. Notably, significantly lower delays are observed between the hmd and the edge server, attributed to the stationary nature of the hmd compared to the uav, as well as its proximity to the edge server. The lowest latency, averaging 2ms, is noted between the edge server and the cloud server, which can be attributed to their direct fiber connection.

Figure 3: Network latency between components in the architecture

5.2 Prompts and output↩︎

At first, the first llm is prompted by annotations and objects from the uav and generates the following description for the video taken during our experiment: The image depicts a large red building with a flat roof, surrounded by snow-covered trees and a snow-covered ground. There are two people in the foreground, one of them is holding a camera, and the other appears to be flying a drone.

Thereafter, the second llm is tasked with generating code to render the description in a 3d manner, based on the previous description provided by the first llm, as shown in the following prompt. Notably, the prompt emphasizes the exclusion of external models such as gltf and glb:

Generate A-Frame elements starting with ’a-’ to accomplish the following instruction while meeting the conditions below.


- Do not use ``a-assets`` or ``a-light``.
- Avoid using scripts.
- Do not use ``gltf`` or ``glb`` models.
- Do not use external model links.
- Provide animation.
- Use high-quality detailed models.
- If animation setting is requested, use the animation component instead of the ``<a-animation>`` element.
- If the background setting is requested, use the ``<a-sky>`` element instead of the background component.
- Provide the result in one code block.


You are an assistant that teaches me Primitive Element tags for A-Frame version 1.4.0 and later. Create a ’Description from the first llm’.

As an output, the rendered result is shown in Fig. 4, which presents both the 3d content generated from the html code created by the llm and the captured 360\(^{\circ}\) frames from the camera. The view undergoes transformation onto a spherical projection to align with the user’s fov.

Figure 4: Generated 3d view against real captured image


Figure 5: Comparison of bandwidth requirements and latencies. (a) Average download/upload bitrate for video and generated description/code, with standard deviation. (b) Average latency of both traditional and proposed systems for 360\(^{\circ}\) video streaming.

5.3 Experimental measurements↩︎

To measure bandwidth consumption during the upload phase of traditional video streaming, we recorded the bitrate using the FFmpeg [26] command at the uav. Simultaneously, we utilized the webrtc statistics api at the hmd level. For our proposed method, we calculated the average size of the description sent from the uav and the size of the received llm-generated code at the hmd. In both the traditional and the proposed systems, we analyzed various delays, including the e2e traditional video streaming latency (L). This latency comprises the rtsp video stream from the uav to the edge server, the webrtc stream from the edge server to the hmd, and the frame rendering delay at the hmd, as depicted in Equation 1. \[\text{E2EL}_{\text{Traditional}} = \text{L}_{\text{RTSP}} + \text{L}_{\text{WebRTC}} + \text{L}_{\text{Rendering}} \label{eq:trad}\tag{1}\] The constituent latencies in this case were measured at the edge server for the rtsp streaming, and at the hmd for the webrtc streaming and rendering. In our method, the e2e latency mainly encompasses the text-prompt-to-3d WebXR code generation latency of the two llms used, the code transmission through mqtt, and the WebXR code rendering. Considering that we can achieve real-time 30 fps object detection [27] using the onboard computer and that captioning is applied to only one frame out of 30, we consider the object detection latency negligible. The e2e latency (\(\text{L}_{\text{Our Method}}\)) can be expressed as shown in Equation 2. \[\text{E2EL}_{\text{Our Method}} = \text{L}_{\text{Text to Code}} + \text{L}_{\text{MQTT}} + \text{L}_{\text{Code Rendering}} \label{eq:our95method95latency}\tag{2}\]

The constituent latencies were measured by capturing timestamps from the sending device to the moment the response is generated and dispatched back to the sender, thus representing the round-trip latency. To approximate the one-way latency, this round-trip latency was halved. Additionally, we gauged the latency involved in transmitting uav positions and synchronizing camera movement by leveraging the TIMESYNC protocol. It is noteworthy that all measurements presented herein reflect the average latency across the transmitted packet count.
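The round-trip-based measurement described above reduces to a simple computation, sketched here with illustrative timestamp values:

```python
# Sketch of the latency measurement: each packet carries a send
# timestamp, the responder echoes it back, and the one-way latency is
# approximated as half the round trip, averaged over all packets.

def one_way_latencies(send_times, recv_times):
    """Halve each round-trip time to approximate one-way latency."""
    return [(r - s) / 2 for s, r in zip(send_times, recv_times)]

def average_latency(send_times, recv_times):
    """Average one-way latency across the transmitted packet count."""
    lat = one_way_latencies(send_times, recv_times)
    return sum(lat) / len(lat)

# Example: round trips of 96 ms and 104 ms give a 50 ms one-way average.
avg = average_latency([0.0, 1.0], [0.096, 1.104])
```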

5.4 Results and analysis↩︎

To analyze our system, we measured both upload and download bandwidth consumption, as well as the latency required to stream equirectangular frames of the test videos under consideration. Subsequently, we compared these metrics with those associated with traditional video streaming, focusing on our method, which involves generating virtual 3d objects based on llm through semantic annotations, as illustrated in Fig. 5.

In the case of traditional video streaming, the measured uplink and downlink bandwidth shown in Fig. 5 (a) represent the average size of data streamed from the uav to the user. Using our streaming method involves the uplink handling of semantic annotations and captioning descriptions sent by the drone. From the downlink perspective, it represents the size of the generated code to produce a 3d virtual animation mimicking the real environment for the vr user.

We observe that in traditional video streaming, the uplink and downlink bitrates are nearly identical, with the downlink being slightly lower because webrtc adapts to the available network bandwidth and latency. The variation in bandwidth requirements across videos is attributed to the complexity of the frames, which depends on the videos’ contents. A similar variation is present in the uplink annotation streaming of our method, again due to the differing descriptions and detected objects across the videos’ frames. The downlink, on the other hand, consistently has the same size, as the llm output code is restricted by the prompt to a predetermined number of generated tokens.

In summary, the bandwidth analysis reveals that our proposed method requires only a few kilobits per second (kbps), with a maximum of 13.9kbps in the uplink across the 9 videos, whereas traditional video streaming demands 5.9Mbps. In the downlink, our method needs 4kbps, contrasting with 5.8Mbps in traditional streaming, resulting in a reduction of approximately 99.93%.
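The reported saving follows directly from the measured averages; as a quick check:

```python
# Reproducing the bandwidth-reduction figures from the measured
# averages: 5.8 Mbps (traditional downlink) vs. 4 kbps of generated
# code, and 5.9 Mbps (traditional uplink) vs. the 13.9 kbps worst-case
# annotation uplink.

def reduction_pct(traditional_kbps, proposed_kbps):
    """Percentage bandwidth reduction of the proposed scheme."""
    return (1 - proposed_kbps / traditional_kbps) * 100

downlink = reduction_pct(5800, 4)     # approx. 99.93 %
uplink   = reduction_pct(5900, 13.9)  # approx. 99.76 % in the worst case
```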

As observed in the latency analysis depicted in Fig. 5 (b), our method exhibits a latency approximately 14 times higher, with an average of 13.66 seconds, compared to 980ms in traditional streaming. This increase is primarily attributed to the prompt-to-token latency of the large-sized llms, as well as network latencies, given that the llms are situated in the cloud. We have suggested an approach to reduce these delays by creating a smaller version of the llms. It is worth highlighting that decoding and rendering the code for animated 3d objects takes significantly less time than processing captured images, with an average duration of 10ms compared to 100ms per 30 frames. Since we continuously update the virtual camera view position according to the streamed uav positions with a delay of 40ms, uav control is not affected, as static objects will already be presented.

5.5 Validation↩︎

To validate our proposed system, we compared the descriptions generated by a vlm, namely GPT-4, for the virtual 3d generated frames and for the equirectangular frames of the 10 videos. Since direct comparison between frames is not feasible using traditional metrics such as psnr, we employed average semantic comparison through bert, which is widely used to measure the degree of semantic textual similarity, as depicted in Fig. 6. Note that the maximum bert similarity score is 1. Results generally show a good representation of code-based descriptions, with a maximum matching score of 83% observed for the 8th video, representing a Buddhist temple, and a minimum score of 43% for the 6th video, depicting a park with animals and a lake. The lower-quality representation in videos 3, 5, and 6 can be attributed to their complexity and rich set of objects. The WebXR framework used, namely A-Frame [28], generates 3d representations using basic geometric objects, making it challenging to recreate such complex scenes. Additionally, the utilized llms are not fine-tuned to be experts in the WebXR framework. It is worth noting that we also explored Code LLaMa, designed specifically for coding tasks; however, its generated code was considerably weaker than that of GPT-4.
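At its core, this bert-based comparison reduces to a cosine similarity between sentence embeddings of the two descriptions. The sketch below shows only that final step; the embedding stage (a bert encoder applied to each description) is outside the sketch, and the vectors used are placeholders:

```python
import math

# Cosine similarity between two description embeddings, as used in
# BERT-based semantic textual similarity. The vectors here are
# placeholders standing in for real BERT sentence embeddings.

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical embeddings yield the maximum similarity score of 1.
score = cosine_similarity([0.2, 0.5, 0.1], [0.2, 0.5, 0.1])
```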

Figure 6: Description similarity between generated frames compared to real frames

6 Challenges & Open Research Directions↩︎

6.1 Multi-user & Scalability↩︎

Scalability for applications such as the proposed use case is quite challenging, since an llm's response accuracy might degrade by up to 3% under 50 simultaneous prompts [29]. A scalable 6G network is needed to accommodate a large number of immersive users while dynamically adjusting its resources and services to varying demands without compromising performance, reliability, or user experience.

6.2 Latency & Real-time Processing↩︎

In order to realize a fully immersive experience through ios, the utilized llms should be capable of processing and interpreting vast amounts of sensory data in real-time, facilitating seamless human-machine interactions. Additionally, they need to be optimized for edge computing architectures to ensure that data processing is as close to the source as possible. The challenge in achieving real-time processing in 6G lies in minimizing latency to the extent that the delay is imperceptible to humans or sensitive systems, which requires major advancements in network infrastructure, edge computing capabilities, and llms.

6.3 Edge Computation Limitations↩︎

Deploying llms on ues or small edge servers presents challenges due to the computational demands of these models. llms require substantial processing power and memory, whereas mobile devices have limited resources compared to desktop computers or servers. Consequently, running llms on ues may lead to slower inference times and reduced overall performance. Additionally, edge-deployable models are typically constrained to fewer than 7 billion parameters, which frequently results in decreased response quality, with distortion being a common occurrence in tasks such as image generation [30].

6.4 Energy consumption↩︎

llms are computationally intensive and can consume a significant amount of power during inference. Given the limited battery capacity of mobile devices, running llms for extended periods can quickly drain the battery. This limitation significantly impacts the practicality and usability of llms on mobile devices, especially when offline or in situations without immediate access to power sources.

6.5 Integration & Interoperability↩︎

The seamless interoperability of ios and llms among a vast array of devices, technologies, and protocols constitutes a main challenge for future 6G networks. This integration will require a sophisticated orchestration of network components to ensure that the high-speed, low-latency, accuracy, and reliability are not compromised. This necessitates the development of adaptive network architectures that are capable of handling the diverse demands of sensory-data processing and ai interactions within a large number of users.

7 Conclusion↩︎

In this paper, we have laid a foundational framework for integrating llms with the ios within the context of 6G. We have defined the principles of ios and proposed promising use cases to demonstrate the potential of llms in realizing low-latency immersive media communication. Within the developed use cases, we have adopted llms as compressors, and we have presented a use case with a real testbed leveraging GenAI for ios. The measurement methods and analysis of the proposed system have been highlighted and benchmarked against the traditional way of streaming immersive video. The results showed that llms are very effective compressors, leading to significant bandwidth savings; however, their response latency is high for real-time applications. To address these latency issues, an approach aimed at fine-tuning the llm and hosting it near the user has been designed and presented. As future work, we aim to explore the use of the fine-tuned llm on the uav as an alternative to traditional captioning and object detection.

Nassim Sehad received the Bachelor of Science (B.Sc.) degree in telecommunication in 2018 and the Master's degree in networks and telecommunication in September 2020, both from the University of Sciences and Technology Houari Boumediene (U.S.T.H.B), Algiers, Algeria. From 2020 to September 2021, he was an assistant researcher at the MOSA!C laboratory at Aalto University, Finland. Since 2021, he has been a doctoral student at the Department of Information and Communications Engineering (DICE), Aalto University, Finland. His main research interests are multi-sensory multimedia, IoT, cloud computing, networks, and AI.

Lina Bariah received the Ph.D. degree in communications engineering from Khalifa University, Abu Dhabi, UAE, in 2018. She is currently a Lead AI Scientist at Open Innovation AI, an Adjunct Professor at Khalifa University, and an Adjunct Research Professor at Western University, Canada. She was a Visiting Researcher with the Department of Systems and Computer Engineering, Carleton University, Ottawa, ON, Canada, in 2019, and an affiliate research fellow at the James Watt School of Engineering, University of Glasgow, UK. She was a Senior Researcher at the Technology Innovation Institute. Dr. Bariah is a senior member of the IEEE.

Wassim Hamidouche is a Principal Researcher at the Technology Innovation Institute (TII) in Abu Dhabi, UAE. He also holds the position of Associate Professor at INSA Rennes and is a member of the Institute of Electronics and Telecommunications of Rennes (IETR), UMR CNRS 6164. He earned his Ph.D. degree in signal and image processing from the University of Poitiers, France, in 2010. From 2011 to 2012, he worked as a Research Engineer at the Canon Research Centre in Rennes, France. Additionally, he served as a researcher at the IRT b<>com research institute in Rennes from 2017 to 2022. He has over 180 papers published in the field of image processing and computer vision. His research interests encompass video coding, the design of software and hardware circuits and systems for video coding standards, image quality assessment, and multimedia security.

Hamed Hellaoui received the Ph.D. degree in Computer Science from the École Nationale Supérieure d’Informatique, Algeria, in 2021, and the Ph.D. degree in Communications and Networking from Aalto University, Finland, in 2022. He is currently a Senior Research Specialist at Nokia, Finland. He has been actively contributing to Nokia’s Home Programs on Research and Standardization related to 5GA/6G, as well as to several EU-funded projects. His research interests span 5G and 6G communications, UAV, IoT, and machine learning.

Riku Jäntti (M’02, SM’07) is a Full Professor of Communications Engineering at the Aalto University School of Electrical Engineering, Finland. He received his M.Sc. (with distinction) in Electrical Engineering in 1997 and D.Sc. (with distinction) in Automation and Systems Technology in 2001, both from Helsinki University of Technology (TKK). Prior to joining Aalto in August 2006, he was professor pro tem at the Department of Computer Science, University of Vaasa. Prof. Jäntti is a senior member of the IEEE and has been an IEEE VTS Distinguished Lecturer (Class 2016). His research interests include machine-type communications, disaggregated radio access networks, backscatter communications, quantum communications, and radio frequency inference.

Merouane Debbah is a researcher, educator, and technology entrepreneur. Over his career, he has founded several public and industrial research centers and start-ups, and is now a Professor at Khalifa University of Science and Technology in Abu Dhabi and founding Director of the KU 6G Research Center. He is a frequent keynote speaker at international events in the fields of telecommunications and AI. His research lies at the interface of fundamental mathematics, algorithms, statistics, and information and communication sciences, with a special focus on random matrix theory and learning algorithms. In the communication field, he has been at the heart of the development of small cells (4G), Massive MIMO (5G), and Large Intelligent Surfaces (6G). In the AI field, he is known for his work on Large Language Models, distributed AI systems for networks, and semantic communications. He has received multiple prestigious distinctions, prizes, and best paper awards (more than 35) for his contributions to both fields, and is ranked as the best scientist in France in the field of Electronics and Electrical Engineering. He is an IEEE Fellow, a WWRF Fellow, a Eurasip Fellow, an AAIA Fellow, an Institut Louis Bachelier Fellow, and a Membre émérite SEE.


C. Lab, “10 hot consumer trends 2030,” 2019. [Online].
M. Melo et al., “Do multisensory stimuli benefit the virtual reality experience? a systematic review,” IEEE Transactions on Visualization and Computer Graphics, vol. 28, no. 2, pp. 1428–1442, 2022.
K. R. Pyun et al., “Materials and devices for immersive virtual reality,” Nature Reviews Materials, vol. 7, no. 11, pp. 841–843, 2022.
G. Fettweis et al., “The tactile internet-itu-t technology watch report,” Int. Telecom. Union (ITU), Geneva, 2014.
I. F. Akyildiz et al., “Mulsemedia communication research challenges for metaverse in 6G wireless systems,” arXiv preprint arXiv:2306.16359, 2023.
G. Delétang et al., “Language modeling is compression,” arXiv preprint arXiv:2309.10668, 2023.
T. Brown et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 1877–1901. [Online].
D. Zhang, Y. Yu, C. Li, J. Dong, D. Su, C. Chu, and D. Yu, “Mm-llms: Recent advances in multimodal large language models,” arXiv e-prints, pp. arXiv–2401, 2024.
N. Ranasinghe et al., “Tongue mounted interface for digitally actuating the sense of taste,” in 2012 16th International Symposium on Wearable Computers. IEEE, 2012, pp. 80–87.
D. Maynes-Aminzade, “Edible bits: Seamless interfaces between people, data and food,” in Conference on Human Factors in Computing Systems (CHI’05), Extended Abstracts. Citeseer, 2005, pp. 2207–2210.
D. Panagiotakopoulos et al., “Digital scent technology: Toward the internet of senses and the metaverse,” IT Professional, vol. 24, no. 3, pp. 52–59, 2022.
Y. E. Choi, “A survey of 3d audio reproduction techniques for interactive virtual reality applications,” IEEE Access, vol. 7, pp. 26 298–26 316, 2019.
M. Lewis et al., “Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” arXiv preprint arXiv:1910.13461, 2019.
J. Devlin et al., “Bert: Pre-training of deep bidirectional transformers for language understanding,” 2018. [Online].
T. Gale et al., “MegaBlocks: Efficient sparse training with mixture-of-experts,” Proceedings of Machine Learning and Systems, vol. 5, 2023.
J. Xing et al., “A survey of efficient fine-tuning methods for vision-language models—prompt and adapter,” Computers & Graphics, 2024.
Z. Yang et al., “The dawn of lmms: Preliminary explorations with gpt-4v (ision),” arXiv preprint arXiv:2309.17421, vol. 9, no. 1, 2023.
Z. Yuan et al., “Tinygpt-v: Efficient multimodal large language model via small backbones,” 2023.
W. Dai et al., “Instructblip: Towards general-purpose vision-language models with instruction tuning,” 2023.
L. Wei et al., “Instructiongpt-4: A 200-instruction paradigm for fine-tuning minigpt-4,” arXiv preprint arXiv:2308.12067, 2023.
C.-Y. Wang et al., “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
S. Degadwala et al., “Image captioning using inception v3 transfer learning model,” in 2021 6th International Conference on Communication and Electronics Systems (ICCES). IEEE, 2021, pp. 1103–1108.
R. Baruah and R. Baruah, “Building vr for the web with a-frame,” AR and VR Using the WebXR API: Learn to Create Immersive Content with WebGL, Three. js, and A-Frame, pp. 253–287, 2021.
N. Rotstein, D. Bensaı̈d, S. Brody, R. Ganz, and R. Kimmel, “Fusecap: Leveraging large language models for enriched fused image captions,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 5689–5700.
T. Taleb et al., “Vr-based immersive service management in b5g mobile systems: A uav command and control use case,” IEEE Internet of Things Journal, vol. 10, no. 6, pp. 5349–5363, 2022.
J. Newmarch and J. Newmarch, “Ffmpeg/libav,” Linux sound programming, pp. 227–234, 2017.
P. Ganesh, Y. Chen, Y. Yang, D. Chen, and M. Winslett, “Yolo-ret: Towards high accuracy real-time object detection on edge gpus,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 3267–3277.
T. Matahari, “Webxr asset management in developing virtual reality learning media,” Indonesian Journal of Computing, Engineering and Design (IJoCED), vol. 4, no. 1, pp. 38–46, 2022.
A. Maatouk et al., “Teleqna: A benchmark dataset to assess large language models telecommunications knowledge,” arXiv preprint arXiv:2310.15051, 2023.
R. Zhong et al., “Mobile edge generation: A new era to 6g,” arXiv preprint arXiv:2401.08662, 2023.