LLM-ABR: Designing Adaptive Bitrate Algorithms via Large Language Models


Abstract

We present LLM-ABR, the first system that utilizes the generative capabilities of large language models (LLMs) to autonomously design adaptive bitrate (ABR) algorithms tailored for diverse network characteristics. Operating within a reinforcement learning framework, LLM-ABR empowers LLMs to design key components such as states and neural network architectures. We evaluate LLM-ABR across diverse network settings, including broadband, satellite, 4G, and 5G. LLM-ABR consistently outperforms default ABR algorithms.


1 Introduction↩︎

Large language models (LLMs) have demonstrated remarkable capabilities in generating high-quality text and code [1][4]. Researchers across various disciplines have started using LLMs for their research endeavors. As networking researchers, we aim to harness the power of LLMs to develop innovative networking algorithms.

In this paper, we explore the application of LLMs within the context of networking, specifically targeting the design of adaptive bitrate (ABR) algorithms for video streaming. ABR is a critical component in streaming technology, responsible for dynamically adjusting the quality of video content in response to changing network conditions to provide an optimal viewer experience. The traditional approach to designing ABR algorithms involves a combination of heuristic methods [5], [6], machine learning-based methods [7], [8], and empirical testing [9], which can be both time-consuming and complex.

One way to leverage LLMs is to design prompts that directly generate alternative algorithms. However, after spending significant effort, we find that while LLMs have good common sense, it is very challenging for them to directly generate high-quality algorithms for a given target scenario (e.g., broadband, 4G, or 5G networks). This could be attributed to the insufficient data available for training LLMs on this particular task.

Instead of depending on LLMs to generate an effective final algorithm in one shot, we use LLMs to generate a collection of candidate algorithms featuring diverse designs. Although it is challenging for LLMs to rank these algorithms and pick the best one, LLMs are capable of generating functional code, by combining existing variables or leveraging more advanced techniques such as signal processing filters. Consequently, we can assess these various algorithms within a network simulator and determine the most effective one.

As illustrated in Figure 1, our approach begins by inputting prompts along with the source code of an existing algorithm into LLMs to generate a large number of new designs. The prompts are carefully crafted to inspire LLMs to suggest code that is both high in quality and rich in diversity. In the second step, we apply two pre-checks to filter the designs generated by LLMs. The first check, the compilation check, verifies if the generated code can be compiled successfully through trial execution. Meanwhile, we have observed that LLMs occasionally produce code that fails to perform normalization, leading to excessively large inputs for neural networks. To mitigate this issue, we add an additional normalization check to ensure inputs are appropriately scaled. In the third step, we evaluate the remaining LLM-generated designs in a simulator under certain scenarios and select the one with the best video Quality of Experience (QoE). To minimize the computational resources required for evaluation, we further introduce an early stopping mechanism to filter out designs that are estimated not to perform well.

Figure 1: Overall architecture of LLM-ABR. We leverage LLMs to design states and network architectures by generating code blocks. These code blocks are filtered through two pre-checks and subsequently stored in a candidate pool. The states or network architectures within this pool are then evaluated. We also develop a filtering model to reduce the computational load during evaluation, through an early-stopping mechanism.

Our paper details the process of applying LLMs to ABR design, outlining the steps taken to prepare the data, train the models proposed by LLMs, and efficiently evaluate their performance. We also discuss the broader implications of this work, suggesting that what we have accomplished with ABR can serve as a blueprint for employing LLMs in the design and optimization of other networking algorithms. By demonstrating the feasibility and benefits of this novel application of LLMs, we hope to pave the way for further integration of these powerful models into the field of networking research, potentially transforming how algorithms are crafted and refined in the future.

The rest of the paper is organized as follows. In Section 2, we motivate the use of LLMs for designing ABR algorithms. We present our approach in Section 3 and evaluate its performance in Section 4. We report our insights in Section 5. We survey related work in Section 6 and conclude in Section 7. We will open source the code and the models to the community.

2 Motivation↩︎

In this section, we motivate our use of LLMs for designing networking algorithms.

2.0.0.1 LLMs can generate code:

Trained on vast text corpora, LLMs have demonstrated exceptional capabilities in generating code following human instructions. These models are adept at translating user requests into code snippets [2], crafting specific functions [3], and even constructing entire projects from scratch [4]. More recent applications include the development of heuristic greedy algorithms for NP-hard problems [10] and the creation of reward functions for use in robotics [11]. We envision that LLMs can design innovative network algorithms tailored to different network conditions via code generation. To accommodate diverse network environments, previous works such as Oboe [9] and Configanator [12] focus on auto-tuning the hyperparameters of known algorithms. However, LLMs can go beyond the capability of existing works by not only searching across hyperparameter values but also proposing new algorithms via code generation.

2.0.0.2 Different networks need different designs:

Network technology is rapidly evolving, necessitating the development of distinct algorithms tailored to various conditions. Taking ABR as an example, it was originally developed for 3G and broadband networks [5], [7]. With the development of 4G networks, the increased dynamics of these networks led to the creation of ABR algorithms employing diverse techniques, including Markov models [13] and ARIMA models [14]. The progression to 5G networks further accelerated the specialization of ABR algorithms [15][17]. Furthermore, the rise of Low-Earth-Orbit (LEO) satellite networks has also prompted the need for ABR algorithms designed specifically for satellite links [18]. The creation of new algorithms for emerging network types requires a significant investment of expertise and labor. In our study, we utilize LLMs to automate the design of novel algorithms for diverse network conditions, including broadband, LEO networks, 4G, and 5G. We employ LLMs to generate code for new ABR algorithms and subsequently test their efficacy. Our later evaluation, as shown in Table 7 and Table 8, demonstrates that an algorithm’s effectiveness can vary significantly across different network types, highlighting the necessity of developing algorithms tailored to each network type.

2.0.0.3 LLMs can generate code samples but not the final algorithm:

Generating code for new algorithms via LLMs is a non-trivial task. Our initial attempts to employ LLMs for the creation of a good algorithm through zero-shot or few-shot learning were not successful. We build upon the well-known ABR algorithm, Pensieve [7], and prompt LLMs to improve part of the code to enhance performance. We utilized GPT-3.5 and GPT-4 to produce 130 samples each. However, around 80% of the samples generated by GPT-3.5 and GPT-4 could not even outperform a simple heuristic that only selects the lowest bitrate, and none of them outperformed the default Pensieve. This is likely because, after being trained on extensive online data, LLMs have developed the common sense that a high data rate is typically beneficial and should be preferred. However, an excessively high data rate can lead to video stalling and performance degradation, especially in low-bandwidth scenarios such as LEO satellite networks. In other words, LLMs are unable to handle specialized scenarios due to a lack of domain-specific data during training. Furthermore, we observe that LLMs, despite their ability to generate code, struggle to assess the potential performance of the generated designs in specific scenarios. To address these challenges, we implement three strategies to better exploit LLMs: (1) We prompt LLMs to generate a wide variety of designs, which we evaluate in a network simulator, rather than directly producing targeted code; (2) The prompts provided to LLMs are carefully crafted to ensure the generation of code that is both diverse and of high quality; (3) We develop several pre-checks and an early stopping mechanism to reduce the computational demands of evaluation.

3 Our Approach↩︎

Our approach involves leveraging LLMs to generate candidate model designs (new input states and network architectures as code blocks), performing pre-checks to filter these candidates prior to training, assessing candidate performance through actual model training, and applying an early-stop mechanism through a filtering model.

3.1 Generating Designs Using LLMs↩︎

We build upon the well-known ABR algorithm Pensieve [7], and generate alternative algorithm designs that achieve better performance with the aid of LLMs. Throughout this paper, Pensieve serves as a primary example to illustrate our methodology. However, it is important to acknowledge that our framework is not limited to this instance alone but is also applicable to a broader range of RL-based algorithms within the field of networking. Pensieve is based on a reinforcement learning (RL) framework, as shown in Figure 2. After downloading chunk \(t\), it constructs a state \(s_t = (\vec{x_t}, \vec{\tau_t}, \vec{n_t}, b_t, c_t, l_t)\) to capture its surrounding network environment. Here, the variables \(\vec{x_t}\), \(\vec{\tau_t}\), \(\vec{n_t}\) represent the past throughput observations, past download times, and the next chunk’s sizes in various bitrates, respectively. Meanwhile, \(b_t\) and \(l_t\) indicate the latest buffer size and the last selected bitrate. While Pensieve initially describes \(c_t\) as the count of remaining chunks, this feature has been replaced with the percentage of remaining chunks during implementation, a modification we have also adopted. This state \(s_t\) is fed into an actor-critic network. The actor network \(\pi_{\theta}(s_t, a_t) \in [0, 1]\), parameterized by \(\theta\), calculates the likelihood of selecting action \(a_t\) given \(s_t\), with \(a_t\) indicating the bitrate to adopt in the next chunk. The critic network \(V^{\pi_{\theta}}(s_t)\) estimates the expected cumulative reward obtainable from \(s_t\), which is beneficial for training but not applied during the inference phase.

Figure 2: The original architecture of Pensieve [7]. It consists of two components: the state on the left and the actor-critic network on the right. We utilize LLMs to generate new designs for both the state and the network architecture.

In the implementation, the construction of state \(s_t\) and the neural networks \(\{\pi_{\theta}, V^{\pi_{\theta}}\}\) are realized through specific code blocks. We utilize the notation \(\mathcal{S}\) to denote the set of any possible strings, \(S_{state}\in \mathcal{S}\) to represent the original code text for state creation, and \(S_{network} \in \mathcal{S}\) to represent the original code text for the actor-critic network. The execution of these code texts is performed by a code interpreter \(\mathcal{I}\), which is Python in this study. Specifically, after downloading chunk \(t\), an observation variable \(O_t\) is formed, which consists of past download times, past chunk sizes, past bitrates, past buffer sizes, the number of chunks left, and the size options of the next chunk. The code \(S_{state}\) is used to extract \(s_t = (\vec{x_t}, \vec{\tau_t}, \vec{n_t}, b_t, c_t, l_t)\) from \(O_t\), which is represented as \(\mathcal{I}(S_{state}, O_t) \equiv s_t\). Subsequently, the neural network code is executed, expressed as \(\mathcal{I}(S_{network}, s_t) \equiv \{\pi_{\theta}, V^{\pi_{\theta}} \}\).
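For concreteness, a minimal sketch of this interpretation step is shown below. It assumes, as in the code released in the appendix, that the state code defines a function named `state_func` and that the observation \(O_t\) is passed as keyword arguments:

```python
def interpret(state_code: str, observation: dict):
    """Minimal sketch of the interpreter I: executing the code text S_state yields
    a state-construction function, and applying it to the observation O_t gives
    the state, i.e., I(S_state, O_t) = s_t."""
    namespace = {}
    exec(state_code, namespace)                    # run the LLM-generated code text
    return namespace["state_func"](**observation)  # apply it to O_t
```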

LLMs serve as a tool that maps from \(\mathcal{S}\) to \(\mathcal{S}\), denoted as \(\mathcal{F}: \mathcal{S}\rightarrow\mathcal{S}\), with \(\mathcal{F}\) representing an LLM. We prompt LLMs to map the original state code and network architecture code to refined versions. Formally, the improved state code, \(\widehat{S_{state}}\), is obtained by \(\widehat{S_{state}} = \mathcal{F}(S_{state})\), and the new state is acquired by \(\widehat{s_t} \equiv \mathcal{I}(\widehat{S_{state}}, O_t)\). The improvement of network architecture is similarly represented as \(\widehat{S_{network}} = \mathcal{F}(S_{network})\), and \(\{\widehat{\pi_{\theta}}, \widehat{V^{\pi_{\theta}}} \}\equiv\mathcal{I}(\widehat{S_{network}}, s_t)\). Notably, the improvement of states and network architectures can be combined. By harnessing the generative power of LLMs, we are now able to generate innovative designs for state and network architectures programmatically.
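As an illustration, the mapping \(\mathcal{F}\) can be realized with a single chat-completion call. The sketch below uses the OpenAI Python SDK; the model name, temperature, placeholder-substitution step, and code-extraction logic are assumptions for illustration rather than our exact setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

def propose_state_code(prompt_template: str, default_state_code: str,
                       model: str = "gpt-4") -> str:
    """Sketch of F: S -> S. The prompt embeds the default state code S_state and
    the reply contains a candidate improved version."""
    prompt = prompt_template.replace("**The default state code**", default_state_code)
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # a higher temperature encourages diverse designs
    )
    text = reply.choices[0].message.content
    # keep only the fenced Python block from the "Code:" part of the reply, if any
    return text.split("```python")[-1].split("```")[0] if "```python" in text else text
```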

It is important to note that this approach fundamentally differs from Automated Machine Learning (AutoML) [19] and Neural Architecture Search (NAS) [20], which still require pre-defined building blocks for features or networks. For instance, AutoML requires human experts to specify potential operators among features (e.g., addition, subtraction, and multiplication), and then explores all possibilities to generate new features. In contrast, LLMs are uniquely different in their capacity to not only interpret the semantic meanings of variables without any pre-definitions, but also to create innovative designs leveraging a variety of techniques. For example, we observe that LLMs propose to use linear models or sophisticated filters to estimate buffer size trends, an approach that is both insightful and proves effective upon evaluation. These changes are automatically generated by LLMs without human intervention.

We have developed two prompts for LLMs to generate novel code for states and network architectures, respectively. We rewrite Pensieve’s original state creation code into a single function, which is then included in the prompt for creating new state code. Similar methodology is applied to network architectures as well. Through our experimentation, we identify several prompt engineering strategies that enhance performance: (1) Chain-of-Thought (CoT) [21]. We instruct LLMs to conduct an initial review of the existing code, generate ideas, and select the optimal idea before generating the code. This process mimics human reasoning and leads to greater diversity in code generation. (2) Use of meaningful variable names and detailed code annotations. This enables LLMs to form a better understanding of the problem. (3) We observe that LLMs occasionally generate state designs with improper normalization, which results in poor performance. We mitigate this issue by explicitly requesting normalization in our state creation prompt. We provide an outline of our prompts below and refer the reader to the Appendix 8.3 for complete prompts.

You are trying to design the state code of a reinforcement learning algorithm for adaptive bit rate.

**Describe the background of ABR.** **Describe the input variables.**

```python
**The default state code**
```

Requirements:
- Keep the function name.
- Keep the function’s input variables.
- Keep the function’s output variables.
- Keep ‘import numpy’. You can also import scipy, pandas if needed.
- Normalize the input properly when you try to add new states.

Using the following format to output:

Analysis and ideas: <Try to analyze the current code, the problem, propose ideas, and choose the best ideas>

Code:
```python
<Your improved state design code here>
```

You are trying to design a network function of a reinforcement learning algorithm for adaptive bit rate.

**Describe the background of ABR.** **Describe the input variables.**

A network function example is:
```python
**The default network architecture code**
```

Requirements:
- Keep the function name.
- Keep the function’s input variables.
- Keep the function’s output variables.
- Include "import tensorflow.compat.v1 as tf" and "import tflearn".

Using the following format to output:

Analysis and ideas: <Try to analyze the current code, the problem, propose ideas, and choose the best ideas>

Code:
```python
<Your improved network design code here>
```

3.2 Filtering and Evaluating Designs↩︎

The states and network architectures generated by LLMs do not always meet expectations. For instance, LLMs might produce code that does not compile or run due to syntax errors, or fail to properly normalize neural network inputs. To catch these common errors, we implement two pre-checks: a compilation check and a normalization check. Despite these measures, it remains necessary to evaluate the quality of the designs that pass these initial checks. Training every design in a network simulator to assess performance is possible but impractical, given the time and resources required for training RL models. Therefore, we introduce an early stopping mechanism, which learns a classification model to filter out designs with low potential for success early in their training process, allowing only the most promising designs to complete training. Next, we detail the design of the pre-checks and the early stopping mechanism.

The compilation check is a trial run of the code generated by LLMs. Any code that triggers an exception is immediately excluded from further consideration. Following this, a normalization check is performed on the generated state code, recognizing the importance of normalization in preparing inputs for neural networks. For instance, we observe that LLMs occasionally use chunk sizes in bytes as state features, resulting in abnormally high input values (e.g., over one million) that prevent the neural network from converging. Explicitly specifying the normalization requirement in the LLM prompt still does not fully resolve this problem. To filter out state designs with improperly normalized features, we test the state code with random inputs \(O_t\) and examine whether the resulting state \(\widehat{s_t}\) contains any values above a predefined threshold \(T\), where \(T\) is set to 100 in our study. State designs that fail this normalization check are rejected. We note that the normalization check is applied only to the state code, not the code for the neural network architecture.
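A minimal sketch of the two pre-checks is shown below; it reuses the interpret() helper sketched in Section 3.1 and assumes a hypothetical iterable of randomly generated observations \(O_t\):

```python
import numpy as np

def passes_prechecks(state_code: str, random_observations, threshold: float = 100.0) -> bool:
    """Sketch of the compilation and normalization checks. Any exception fails the
    compilation (trial-run) check; any output value whose magnitude exceeds T = 100
    fails the normalization check."""
    try:
        for obs in random_observations:          # random inputs O_t
            state = interpret(state_code, obs)
            values = np.concatenate([np.ravel(v) for v in
                                     state["normal_states"] + state["time_series_states"]])
            if np.any(np.abs(values) > threshold):
                return False                      # improperly normalized feature
    except Exception:
        return False                              # failed the trial run
    return True
```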

After the compilation and normalization checks, we assess the proposed design by actually training it in a network simulator. However, RL training is particularly time-consuming as numerous epochs are required for reward convergence. Thus, we introduce an early stopping mechanism using rewards collected from the first \(K\) runs of the modified code. To implement our early stopping mechanism, we leverage the training rewards observed in the first \(K\) training episodes from our simulated ABR environment to train a 1D-CNN for binary classification, i.e., to classify whether the state design is likely to be a top performer in comparison to the previously observed state designs. We utilize the classification prediction of our model to early stop suboptimal state designs. Ideally, our classifier would early stop all state designs except for the top performers (e.g., positive outliers such as the top 1% of state designs). However, we observe poor classification performance when assigning only the top 1% of state designs with a positive label due to the large class imbalance between positive and negative classes. To address the large class imbalance, we employ a variant of label smoothing [22]. Label smoothing tackles class imbalances and improves model generalization by introducing uncertainty into class labels; specifically, label smoothing represents each class label as a probability distribution as opposed to a deterministic hard label. In contrast to traditional label smoothing, we introduce uncertainty by relaxing what constitutes the positive class; that is, we assign a positive label to the top \(20\%\) of state designs rather than the top \(1\%\).
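A sketch of such a filtering model is given below, using tf.keras for brevity; the layer sizes and kernel widths are illustrative assumptions, and the labeling helper implements the relaxed positive class (top 20% of final rewards) described above:

```python
import numpy as np
import tensorflow as tf

def build_reward_filter(history_len):
    """Illustrative 1D-CNN over the first K training rewards, used as a binary
    classifier that predicts whether a state design is worth fully training."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(history_len, 1)),
        tf.keras.layers.Conv1D(32, kernel_size=64, strides=8, activation="relu"),
        tf.keras.layers.Conv1D(32, kernel_size=16, strides=4, activation="relu"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

def relaxed_labels(final_rewards, positive_quantile=0.80):
    """Our label-smoothing variant: the top 20% of designs by final reward are
    treated as positive, rather than only the top 1%."""
    threshold = np.quantile(final_rewards, positive_quantile)
    return (np.asarray(final_rewards) >= threshold).astype(np.float32)
```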

Furthermore, to improve our early stopping performance, we adjust the positive threshold of our classifier to minimize the error at the top \(1\%\) while maximizing the number of state designs filtered in the bottom \(99\%\). In other words, we train our classifier with label smoothing to ensure that the filter is directionally correct, and adjust the positive threshold to maximize its filter rate while reducing the probability of early stopping the top-performing state designs. Our results demonstrate that while utilizing only \(400\) state designs for training, our model achieves an \(87\%\) true negative rate when classifying \(1800\) previously unseen state designs (i.e., we early stop approximately \(1550\) of the estimated \(1782\) suboptimal state designs), reducing the number of training episodes required by nearly \(228\) million on average for our training setup (see Figure 6). However, among the top \(1\%\), our model exhibits a \(12\%\) false negative rate. It is important to note that no false negatives are present among the top five state designs throughout our evaluations. We compare our early stopping model to alternative methods and report quantitative results in Section 4.8.

4 Evaluation↩︎

4.1 Datasets↩︎

Table 1: Datasets used in our study. FCC is a public dataset for broadband networks, whereas the Starlink, 4G, and 5G traces are manually collected by us. In the table, “Train Count” and “Test Count” represent the number of traces in the training and test splits, and “Train Hours” and “Test Hours” denote the duration of the traces. “Throughput” refers to the average throughput, measured in Mbps. The table also details the training epochs and the model evaluation intervals (in epochs) on each dataset.
Dataset Train Count Train Hours Test Count Test Hours Throughput Train Epochs Eval Interval
FCC 85 10.0 290 25.7 1.3 40,000 500
Starlink 13 0.9 12 0.8 1.6 4,000 100
4G 119 10.0 121 10.0 19.8 40,000 500
5G 117 10.0 119 10.0 30.2 40,000 500

To evaluate the state and network architecture designs proposed by LLMs, we perform a trace-driven assessment using realistic traces. In total, four distinct trace datasets are utilized, as detailed in Table 1. Each dataset characterizes a representative category of network environments:

  • FCC: This dataset is a measurement of the U.S. broadband network [23]. We split the FCC dataset into training and test sets following the methodology in [24]. To improve training efficiency, the training set is further subsampled to a duration of 10 hours, while the test set is kept unchanged.

  • Starlink: Given the growing interest in low-earth-orbit (LEO) satellite networks [25][27], we gather throughput traces from a stationary Starlink RV terminal on the U.S. west coast in January 2023 using iperf, which generates TCP streams. Although Starlink’s throughput is sufficient for streaming high-resolution video during off-peak hours, its bandwidth decreases significantly during peak times due to shared usage of satellite links. To simulate network congestion, we adjust the throughput in Starlink traces to one-eighth of the original value.

  • 4G and 5G: We measure the downlink throughput using iperf from an Azure server located in the central U.S. to a local client over the internet, where the wireless hop uses either 4G or 5G networks. We test three mobility scenarios: stationary indoor, walking, and driving.

4.2 Experiment Settings↩︎

Following the settings in Pensieve [7], we use a video consisting of 48 chunks, with each chunk lasting 4 seconds. For the FCC and Starlink datasets, we adopt the original bitrate levels {300, 750, 1200, 1850, 2850, 4300} kbps. In contrast, the 4G and 5G datasets have higher bandwidth, and thus we select a higher bitrate ladder {1850, 2850, 4300, 12000, 24000, 53000} kbps, which aligns with YouTube’s recommended video encoding settings [28]. The same QoE function as Pensieve is adopted.
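For reference, the sketch below shows the per-chunk form of Pensieve's linear QoE that we adopt, assuming the penalty coefficients from the original implementation (a rebuffering penalty of 4.3 and a unit smoothness penalty, with bitrates expressed in Mbps):

```python
def chunk_reward(bitrate_kbps, last_bitrate_kbps, rebuffer_seconds,
                 rebuffer_penalty=4.3, smooth_penalty=1.0):
    """Per-chunk reward: video quality minus rebuffering and smoothness penalties."""
    quality = bitrate_kbps / 1000.0
    smoothness = abs(bitrate_kbps - last_bitrate_kbps) / 1000.0
    return quality - rebuffer_penalty * rebuffer_seconds - smooth_penalty * smoothness
```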

We determine the total training epochs based on dataset size, setting 40,000 training epochs for the FCC, 4G, and 5G datasets, and 4,000 epochs for the Starlink dataset. To mitigate the impact of random noise and accurately assess the performance of the LLM-generated states and network architectures, we conduct five separate training sessions for each design using different seeds. For FCC, 4G, and 5G, we assess model performance on the test set at 500-epoch intervals throughout training, while a 100-epoch interval is used for the Starlink dataset to account for its smaller size. The performance score for each training session is calculated by averaging the rewards of the last 10 model checkpoints. Then, the median of these performance scores across the five sessions is reported as the overall performance of each state or network architecture.
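The scoring rule can be summarized by the short sketch below (assuming per-checkpoint test rewards have already been collected for each of the five seeds):

```python
import numpy as np

def design_score(rewards_per_seed):
    """For each training seed, average the test rewards of the last 10 checkpoints;
    the design's overall score is the median of these per-seed averages."""
    per_seed_scores = [np.mean(rewards[-10:]) for rewards in rewards_per_seed]
    return float(np.median(per_seed_scores))
```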

4.3 Designing States↩︎

Table 2: Ratios of samples that successfully pass the compilation check and the normalization check.
All Compiled Well Normalized
GPT-3.5 3,000 1,237 (41.2%) 822 (27.4%)
GPT-4 3,000 2,059 (68.6%) 1,505 (50.2%)

Figure 3: Performance of the best GPT-generated states compared against the default designs in simulation. The best state is selected based on the highest median score from five trials as described in Section 4.2. The scores are further smoothed by averaging over 10 consecutive epochs for better visualization.

Table 3: Results of the best states identified by GPT-3.5 and GPT-4 across four network scenarios in simulation. The score is the median score of five trials as described in Section 4.2.
Dataset Method Score Impr.
FCC Default 1.070
FCC w/ GPT-3.5 1.089 1.7%
FCC w/ GPT-4 1.090 1.9%
Starlink Default 0.308
Starlink w/ GPT-3.5 0.472 52.9%
Starlink w/ GPT-4 0.482 56.3%
4G Default 11.705
4G w/ GPT-3.5 13.226 13.0%
4G w/ GPT-4 14.973 27.9%
5G Default 27.848
5G w/ GPT-3.5 28.447 2.2%
5G w/ GPT-4 28.636 2.8%

We utilize GPT-3.5 and GPT-4 to generate 3,000 states each, evaluating only those that pass the compilation and normalization checks. As shown in Table 2, 68.6% of the state code generated by GPT-4 is “compilable,” i.e., runs in Python without errors, compared with 41.2% for GPT-3.5. Meanwhile, 50.2% of the states generated by GPT-4 pass the normalization check, while only 27.4% of the states from GPT-3.5 have well-normalized features. This demonstrates GPT-4’s superior ability to produce proper code blocks.

The alternative state designs suggested by GPT can be non-trivial. Upon review, we find that GPT introduces not only basic features such as bitrate variance and the exponential moving average of throughput, but also imports new Python packages to implement advanced functions. Notably, several states employ the linear regression model from the statsmodels package to predict future throughput as a new feature. Another example uses the Savitzky-Golay filter [29] from the scipy package to analyze buffer size trends.
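To illustrate the flavor of these designs, the snippet below shows how a Savitzky-Golay filter from scipy can smooth the recent buffer-size history and expose its trend; the window length and polynomial order are assumptions for illustration, not the exact generated code:

```python
import numpy as np
from scipy.signal import savgol_filter

def buffer_trend(buffer_size_second_list, window=5, polyorder=2):
    """Smooth the last few buffer-size samples and return the change over the
    window; a positive value indicates the playback buffer is growing."""
    recent = np.asarray(buffer_size_second_list[-window:], dtype=float)
    smoothed = savgol_filter(recent, window_length=window, polyorder=polyorder)
    return float(smoothed[-1] - smoothed[0])
```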

In Table 3 and Figure 3, we compare the performance of the best state generated by GPT-3.5 and GPT-4 against the default state in each network scenario. Each state design is trained five times, with the median performance score reported (§4.2) to ensure reliable results. Across all network scenarios, we find that both GPT-3.5 and GPT-4 consistently outperform the default state design, with GPT-4 achieving greater overall improvement. The most pronounced improvement is observed on the Starlink traces, where GPT-3.5 and GPT-4 enhance performance by 52.9% and 56.3% over the default, respectively (Table 3). The default state exhibits an overfitting issue on these traces, resulting in a reward decline after the 3,000th epoch (Figure 3). In contrast, the best generated state successfully avoids overfitting and continues to improve throughout the training. On the FCC traces, although the performance boost is modest, Figure 3 shows that the best generated state from GPT-3.5 and GPT-4 can both reliably exceed the default state’s performance after the 30,000th epoch. On 4G and 5G traces, the default state is stuck at low rewards due to its preference toward lower bitrates. This issue is effectively addressed by the GPT-generated states (Figure 3). In Section 5, we will shed light on the best states generated for each network scenario and provide insights into state design.

4.4 Emulation Results of New States↩︎

Figure 4: Starlink emulation results of the best state designs. The best generated state designs from both GPT-3.5 and GPT-4 and the default Pensieve design are included in this figure. This figure shows the observed average QoE score, video bitrate, and rebuffering ratio across 65 trials.

Table 4: Emulation results of the best states identified by GPT-3.5 and GPT-4 for the Starlink, 4G, and 5G datasets.
Dataset Method Score
Starlink Default -0.0482
Starlink w/ GPT-3.5 0.0899
Starlink w/ GPT-4 0.0759
4G Default 4.976
4G w/ GPT-3.5 8.010
4G w/ GPT-4 9.233
5G Default 17.26
5G w/ GPT-3.5 17.43
5G w/ GPT-4 21.55

To further assess the performance of the states identified in Section 4.3, we conducted emulation experiments using the Starlink, 4G, and 5G datasets. The dash.js framework was used to render videos on a webpage, and a Chrome browser was launched via the Selenium package for video playback. The network conditions were emulated using the Mahimahi emulator. The results of these experiments are presented in Table 4. We observe some variance between the emulation outcomes and the simulation results shown in Table 3. Nevertheless, the optimal states identified using GPT-3.5 and GPT-4 significantly surpass the performance of the default state. Additionally, the Cumulative Distribution Function (CDF) plot of the Starlink emulation experiment, depicted in Figure 4, illustrates that the optimal state from GPT-4 tends to favor lower bitrates to minimize rebuffering, whereas the optimal state from GPT-3.5 achieves a balance by maintaining a reasonable bitrate and reducing rebuffering to some extent. These experiments validate that the proposed states are effective not only in simulation but also in emulation environments.

4.5 Designing Network Architectures↩︎

Due to budget constraints, our investigation into network architecture design is limited to GPT-3.5 only. We employ GPT-3.5 to produce 3,000 network architectures, and apply the compilation check to filter out invalid designs (the normalization check does not apply here). 760 architectures pass the compilation check and are further evaluated in various network scenarios. The most effective architectures, along with the default one, are presented in Table 5 and Figure 5. The performance improvements from GPT-3.5 range from 1.4% to 50.0% across different network scenarios. The largest gains are observed with Starlink traces again, due to the aforementioned overfitting issue in the default design (Figure 5). For 4G and 5G traces, although the overall improvements are modest (2.6% and 3.0%), the new network architecture consistently outperforms the baseline across all epochs, as indicated in Figure 5. Conversely, the marginal improvement on FCC traces is likely not statistically significant. Overall, we discover that modifying the network architecture tends to yield less improvement than updating the state design. An in-depth analysis of the most effective network architecture for each trace set will be presented in Section 5.

Figure 5: Performance of the best GPT-generated network architectures compared against the default in simulation. The best network architecture is selected based on the highest median score from five trials as described in Section 4.2. The scores are further smoothed by averaging over 10 consecutive epochs for better visualization.

Table 5: Results of the best network architectures identified by GPT-3.5 across four network scenarios in simulation. The score is the median score of five trials as described in Section 4.2.
Dataset Method Score Impr.
FCC Default 1.070
FCC w/ GPT-3.5 1.085 1.4%
Starlink Default 0.308
Starlink w/ GPT-3.5 0.462 50.0%
4G Default 11.705
4G w/ GPT-3.5 12.007 2.6%
5G Default 27.848
5G w/ GPT-3.5 28.688 3.0%

4.6 Costs of LLMs↩︎

GPT-3.5 and GPT-4 were each used to generate 3,000 alternative state designs. The generation process with GPT-3.5 involved approximately 5,151,000 input tokens and 1,470,000 output tokens, amounting to a cost of $4.8. Generating states using GPT-4 required the same number of input tokens (5,151,000) but produced more output tokens (1,850,000). Given the higher API cost of GPT-4, this process translated to a total cost of $265.5. Additionally, GPT-3.5 was utilized to create 3,000 network architectures, incurring a cost of $2.8 with 2,823,000 input tokens and 930,000 output tokens.
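These totals are consistent with per-million-token API pricing of roughly $0.5 (input) / $1.5 (output) for GPT-3.5 and $30 / $60 for GPT-4; the small helper below reproduces the figures under that assumed pricing:

```python
def api_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Cost in dollars given token counts and assumed per-million-token prices."""
    return input_tokens / 1e6 * price_in_per_m + output_tokens / 1e6 * price_out_per_m

print(api_cost(5_151_000, 1_470_000, 0.5, 1.5))    # GPT-3.5 states:        ~$4.8
print(api_cost(5_151_000, 1_850_000, 30.0, 60.0))  # GPT-4 states:          ~$265.5
print(api_cost(2_823_000, 930_000, 0.5, 1.5))      # GPT-3.5 architectures: ~$2.8
```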

4.7 Combining States and Network Architectures↩︎

Table 6: Results of combining states and network architectures proposed by GPT-3.5. We show the maximum improvement of three approaches over the default configuration: designing states only, designing network architectures only, and a combination of both.
Dataset State Network Arch Combination
FCC 1.7% 1.4% 2.2%
Starlink 52.9% 50.0% 61.1%
4G 13.0% 2.6% 16.5%
5G 2.2% 3.0% 3.1%

Next, we explore the potential for performance improvement by integrating the states from Section 4.3 with the network architectures from Section 4.5. Based on the evaluation results, we select the top 30 states and the top 30 network architectures from GPT-3.5, and generate 900 unique combinations. Each combination of state and network architecture is trained five times, and the best result is shown in Table 6. We observe that this combination approach consistently leads to enhanced performance across varying network scenarios. More importantly, the optimal combination does not always involve pairing the top-ranked state with the top-ranked architecture. Specifically, the best performance is achieved on the FCC traces using the 2nd-ranked state and 3rd-ranked network architecture; on the Starlink traces with the 10th state and 3rd architecture; on the 4G traces with the 6th state and 19th architecture; and on the 5G traces with the 2nd state and 17th architecture.

4.8 Early Stopping Modeling↩︎

In this section, we compare five different early stopping mechanisms. We consider state designs in the \(99\)th percentile of final training rewards to be worthwhile to fully train, i.e., state designs that achieve the top \(1\%\) of training rewards are assigned a positive label while all others are assigned a negative label and should be stopped early. The methods tested are as follows:

  • Reward Only: The first 10k training rewards are embedded via a 1D-CNN, and the extracted features are then utilized for classification.

  • Text Only: The generated state design is embedded with OpenAI’s text-embedding-ada-002 model, and the text embedding is fed to a fully connected network for classification.

  • Text plus Reward: Fuses the features extracted from the first 10k training rewards with the generated state design’s text embedding for downstream classification.

  • Heuristic Max: Early stopping is based on the maximum reward obtained during the first 10k training epochs.

  • Heuristic Last: Leverages the final reward achieved prior to the early stopping point (episode 10k) for classification.

To evaluate each method, we first collect approximately \(2000\) state designs with their corresponding training metrics from each of our network environments: FCC, 4G, 5G, and Starlink. The collected state designs are then split into train and test splits. To match our target environment, the target labels are assigned strictly based on reward percentiles observed in the training split. The positive threshold of the ML models and the decision threshold of the heuristics are adjusted to minimize the error at the \(99\)th percentile. We then perform a five-fold cross validation on each network environment, training with \(20\%\) of the collected state designs which corresponds to \(400\) training samples. For each training split size, we report the average false negative rate and true negative rate across all validation folds and network environments (see Figure 6).

Based on the reported results, Reward Only appears to offer the best trade-off in terms of early stopping errors and resource savings. Specifically, utilizing only \(400\) state design samples, Reward Only is able to early stop \(87\%\) of the suboptimal state designs with a false early-stop rate of \(12\%\) on average. In our case, this translates to computational savings on the order of hundreds of millions of training epochs. Although the error rate is non-zero, the resource savings are immense. Furthermore, if a near-zero error rate is required, the decision threshold can be adjusted, more state samples can be collected, or the early stopping criterion can be relaxed (e.g., the \(95\)th percentile becomes the target) in exchange for reduced resource savings. In contrast, the next closest competitor is Heuristic Max, which achieves an \(89\%\) true early stopping rate with an \(18\%\) error rate. One explanation for the relatively low false negative rate obtained by the Reward Only method is the strong correlation between the early reward observations and the final reward obtained. The 1D-CNN of the Reward Only approach is able to exploit the temporal correlation of early rewards and capture the relationship between the reward observations and the final reward better than alternative methods when a relatively small number of state samples are available. Similarly, Text plus Reward incorporates the temporal reward features; however, the text embeddings add little signal and appear to disrupt the model, leading to a regression in the false negative rate.

Figure 6: Comparison between different early stopping classifiers.

4.9 Cross-dataset Evaluation↩︎

In this section, we explore the research question: does an optimal state design for one network type maintain its efficacy across different types? To this end, optimal state designs are trained across different networks, and the results are documented in Table 7. Our findings suggest that an optimal state design typically underperforms when applied to a different scenario. For instance, the optimal state for the Starlink traces causes performance drops of 79.3% and 33.6% on the 4G and 5G traces, respectively. The optimal state for Starlink prioritizes improved normalization of existing features without adding new ones, potentially conflicting with the principles for 4G and 5G, which call for adding complex new features to enrich the model with more information and prevent it from always selecting lower bitrates. Notably, the 5G traces prove to be particularly challenging, with the optimal states for FCC, Starlink, and 4G all demonstrating a tendency to select only low bitrates, resulting in the same 33.6% performance decrease.

Table 7: The best states identified within each dataset (represented by rows) are evaluated in other datasets (represented by columns). We show the improvement compared with the default state.
FCC Starlink 4G 5G
FCC 1.9% 18.7% 1.1% -33.6%
Starlink 0.1% 56.3% -79.3% -33.6%
4G -0.5% 19.0% 27.9% -33.6%
5G -1.8% 26.4% -35.9% 2.8%
Table 8: The best network architectures identified within each dataset (represented by rows) are evaluated in other datasets (represented by columns). We show the improvement compared with the default network architecture.
FCC Starlink 4G 5G
FCC 1.4% 28.2% -37.8% -0.4%
Starlink -2.2% 50.0% -35.3% -33.6%
4G -2.2% -11.2% 2.6% -60.8%
5G -22.4% -53.8% -3.9% 3.0%

Additionally, the cross-dataset evaluation of network architectures, as shown in Table 8, supports a similar conclusion: the most effective network architecture within one trace set generally fails to achieve comparable success in another. This highlights a key message of our work: distinct states and network architectures should be designed to match the unique characteristics of each network type.

4.10 Designing States for Specific Scenes↩︎

Table 9: Results of designing states for specific scenarios within the 4G dataset. We show the improvement of the best GPT-3.5 state over the default state for the overall dataset and for each scenario.
Dataset Method Score Impr.
4G (Overall) Default 11.705
4G (Overall) w/ GPT-3.5 13.226 13.0%
4G (Stationary) Default 7.528
4G (Stationary) w/ GPT-3.5 14.232 89.1%
4G (Walking) Default 14.011
4G (Walking) w/ GPT-3.5 23.255 66.0%
4G (Driving) Default 12.657
4G (Driving) w/ GPT-3.5 14.841 17.3%

In the previous sections, we highlight the importance of designing specific states for different network types, such as broadband, LEO satellite networks, 4G, and 5G. To advance our research, we explore the potential benefits of creating scenario-specific states within a single network type. Our 4G dataset is collected from three scenarios: stationary indoor settings, walking, and driving. For each scenario, we sample 10 hours for training and another 10 hours for testing. Then, we evaluate the state designs proposed by GPT-3.5. The outcomes of this evaluation are presented in Table 9. The findings indicate that designing scenario-specific states significantly outperforms the use of a universal state, with improvements of 89.1%, 66.0%, and 17.3% for the stationary, walking, and driving scenarios, respectively. In contrast, the best universal 4G state from GPT-3.5 yields a 13.0% improvement. Despite the clear benefits of designing scenario-specific states, it is important to note that the scenario must first be detected by an auxiliary method so that the corresponding state can be selected.

5 Insights from Code Samples↩︎

We conduct a detailed analysis of the code samples that perform well across different networks and report the insights below. The top performing samples are released in Appendix 8.1 and 8.2.

5.1 Insights from Optimal States↩︎

In this section, we delve into the optimal states generated for each network scenario. It is noteworthy that the best state designs vary across networks, highlighting the need for customizing the input states of RL-based ABR algorithms to specific scenarios.

FCC:  On the FCC traces, we observe that the optimal states proposed by both GPT-3.5 and GPT-4 update the normalization strategy for certain features. These features are originally normalized within the range of \([0, 1]\), while they are remapped in the optimal states to a new range of \([-1, 1]\). For instance, the feature that captures the remaining percentage of chunks, initially ranging from \([0, 1]\), is adjusted by both GPTs to span \([-1, 1]\). Other features such as the bitrate and the buffer size are similarly normalized to \([-1, 1]\). These findings indicate that adopting an appropriate normalization strategy is crucial for enhancing performance.

Starlink:  The Starlink dataset is smaller than the other datasets. The optimal state design from GPT-3.5 removes two variables—the download times of previous chunks and the size options of the next chunk. In contrast, the best state from GPT-4 opts to apply more aggressive normalization (by increasing the normalizing factor) and smooths both throughput and download times. This implies that for models trained on smaller datasets, meticulous feature selection and processing are essential for successful generalization.

4G:  On the 4G traces, the default state tends to prefer low bitrates, resulting in low rewards. Consequently, the optimal states introduce new features to enable the selection of higher bitrates if the video playback has sufficient buffer. Specifically, GPT-3.5 applies a linear model to predict the download time and incorporates the trends of throughput and download time into the state. On the other hand, GPT-4 includes the trend of playback buffer size, potentially signaling the model to increase the bitrate as the buffer accumulates.

5G:  The best states for the 5G traces are similar to those for 4G. GPT-3.5 introduces a predicted throughput feature, while GPT-4 adds the buffer size difference as a feature. These additions allow the model to make more informed bitrate decisions and achieve higher rewards.

To summarize, our study identifies key principles for designing states for RL-based ABR algorithms. First, selecting an appropriate normalization strategy is crucial; for example, scaling features to the range \([-1, 1]\) is empirically preferable to the range \([0, 1]\) for Pensieve. Second, in scenarios where the training data is limited, it is advisable to carefully engineer features, e.g., by removing features of lesser importance, or applying more aggressive feature normalization and smoothing. This approach likely mitigates the risks of overfitting and strengthens generalization to unseen environments. Lastly, for ABR models exhibiting anomalous behaviors (e.g., a preference for lower bitrates despite adequate buffering), integrating additional indicative features into the state (e.g., the buffer size trend) can refine model behavior and performance.

5.2 Insights from Optimal Network Architectures↩︎

The optimal network architectures for each dataset are summarized as follows:

FCC:  The number of hidden neurons is increased from 128 to 256. Additionally, the activation function is changed from ReLU to Leaky ReLU.

Starlink:  The number of hidden neurons is increased from 128 to 256. Meanwhile, a Recurrent Neural Network (RNN) is employed to process time series features, instead of the 1D convolutional neural network (CNN).

4G:  The number of hidden neurons is increased from 128 to 256, and a Long Short-Term Memory (LSTM) is employed to replace the 1D CNN.

5G:  The number of hidden neurons is increased from 128 to 256. Meanwhile, the actor and critic networks are modified to share the same hidden layer but keep separate final layers.

These empirical findings indicate a universal preference for 256 hidden neurons over 128 in Pensieve. Furthermore, different network scenarios may benefit from different time series processors, such as RNN or LSTM.
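To make these changes concrete, the sketch below shows how the time-series branch of Pensieve's network could be widened to 256 units and switched from a 1D CNN to an LSTM; the tensor shapes and layer parameters are assumptions for illustration, not the exact GPT-generated architecture:

```python
import tensorflow.compat.v1 as tf
import tflearn

tf.disable_v2_behavior()

def time_series_branch(inputs, use_lstm=True, hidden_units=256):
    """Process a throughput/delay history of shape [batch, history_window, 1] with
    either a default-style 1D CNN or the LSTM variant favored on the 4G traces."""
    if use_lstm:
        branch = tflearn.lstm(inputs, hidden_units)
    else:
        branch = tflearn.conv_1d(inputs, 128, 4, activation='relu')
        branch = tflearn.flatten(branch)
    return tflearn.fully_connected(branch, hidden_units, activation='relu')
```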

5.3 Universal Designs↩︎

Our investigation in Section 4.9 reveals that the most effective state and network architecture designs for a specific network scenario often fail to perform consistently across varying scenarios. Nevertheless, the potential for universal designs still remains. They can perform well across a broad range of scenarios, despite possibly not reaching the highest performance in every scenario. This section explores these universal designs.

To account for the scale difference in rewards across network scenarios, we score each state design using its percentile of performance among all states within each trace set. For example, a state with a percentile score of 80 on the FCC traces indicates that it outperforms 80% of the state designs in that trace set. Then, we average the percentile scores of each design across four network scenarios to gauge its overall efficacy, referred to as average percentile. The same methodology is applied to assess network architectures.
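A compact sketch of this metric is shown below (assuming each design's score and the full list of scores are available for every trace set):

```python
import numpy as np
from scipy.stats import percentileofscore

def average_percentile(design_scores, all_scores):
    """design_scores: dataset -> this design's score; all_scores: dataset -> scores
    of every design. Returns the percentile averaged over the four trace sets."""
    percentiles = [percentileofscore(all_scores[d], design_scores[d])
                   for d in design_scores]
    return float(np.mean(percentiles))
```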

The default state definition achieves an average percentile of 83.7. Out of all state designs, 65 states outperform the default, with the best universal state design achieving an average percentile of 95.0, i.e., surpassing 95% of all states. The strategy adopted in this state is not sophisticated: normalizing the percentage of remaining chunks into \([-1, 1]\) instead of \([0, 1]\), and adjusting the normalizing constants for download time and throughput to 5 seconds and 3 Mbps, respectively. No features are added or removed.

As for network architectures, the default design achieves an average percentile of 68.1, with 22 alternative architectures showing better performance. The best architecture, with an average percentile of 83.5, enhances the default by doubling the hidden neurons from 128 to 256, switching from a 1D CNN to an LSTM for time series analysis, and widening the final hidden layer from 128 to 256 neurons.

It is important to note that while these designs are broadly effective, they may not deliver the best performance for every network type. They are, however, valuable guidelines for those seeking a universal design for diverse network scenarios. Specifically, the best universal state design improves performance by 1.0% on FCC traces, 22.4% on Starlink, and 0.6% on 4G, but sees a 33.1% decrease on 5G networks. Similarly, the best universal network architecture offers a 0.3% improvement on FCC traces, 37.9% on Starlink, and 1.0% on 4G, but suffers a 33.6% decline on 5G networks. These results motivate future research to pursue a universally superior design that is effective across all network types.

6 Related Work↩︎

We classify the related work into (i) ABR algorithms, (ii) LLMs for RL, and (iii) AutoML for RL.

6.1 Adaptive Bitrate Algorithms↩︎

To accommodate network fluctuation, adaptive bitrate streaming (ABR) has become the standard for video streaming. ABR works by estimating a user’s current bandwidth and selecting the most appropriate video quality from multiple pre-encoded bitrate ladders to optimize the QoE, which typically consists of video quality, smoothness and rebuffering.

Various ABR algorithms have been developed. Buffer-based algorithms [6] adapt bitrates based on the playback buffer’s occupancy level. Throughput-based algorithms [5] estimate the available network throughput to select bitrates. Hybrid algorithms [30] combine multiple data sources, such as buffer occupancy and throughput estimates, for bitrate selection. There have also been machine learning-based algorithms [8], [31], which use historical data to predict future network conditions and optimize bitrate selections.

Reinforcement learning (RL) has been applied to ABR. RL optimizes its control policy according to the performance of previous bitrate selections and shows potential to outperform heuristics. Pensieve [7] is a widely recognized ABR algorithm employing the actor-critic RL algorithm [32]. It can handle network variation and unexpected scenarios. QARC [33] is based on Q-learning and adds a convolutional neural network (CNN) layer and a recurrent neural network (RNN) to the model.

6.2 LLMs for Reinforcement Learning↩︎

Recent pioneering research has explored the use of large language models (LLMs) in reinforcement learning, as LLMs encode extensive knowledge that can aid in policy learning. ELLM [34] utilizes LLMs to shape exploration in RL by suggesting exploratory goals. LLM-MCTS [35] employs an LLM as a common-sense world model and heuristic policy, assisting Monte Carlo Tree Search in state and action selection. LMA3 [36] leverages a pretrained LLM to represent, generate, and learn abstract goals. AMRL [37] introduces a modular framework that utilizes LLMs to provide knowledge. LMPriors [38] incorporates prior knowledge from LLMs into RL through reward shaping. These methods are limited to text-based tasks [35], [36] or simple games [34], [37], [38], while our method addresses video streaming, a complex real-world challenge. Besides, LaGR-SEQ [39] uses LLMs to propose solutions to tasks partially completed by an RL agent, targeting applications such as cube-stacking and robotics; however, it does not generate directly executable code as our approach does. Eureka [11] employs LLMs to design rewards for robotics in simulation; by comparison, the reward in our task is well-defined, motivating us to focus on state and network architecture design instead.

6.3 AutoML for Reinforcement Learning↩︎

Automatic Machine Learning (AutoML) focuses on automatically handling various tasks that require machine learning expertise. AutoML has been used in RL for optimizing hyperparameters [40][42] and for searching over reward designs [43], [44], network architectures [45], and loss functions [46]. These prior works require domain knowledge to design the search space, specifying the concrete elements (e.g., hyperparameters and operators) for AutoML to search over. In contrast, our method does not require a pre-defined search space, since LLMs can propose state and network designs based on their internal knowledge. Moreover, our method directly generates code using LLMs, going a step further than AutoML and proving more useful for non-experts.

7 Conclusion↩︎

In this paper, we explore the application of Large Language Models (LLMs) in the development of adaptive bitrate (ABR) algorithms tailored for diverse network environments. Our findings reveal that LLMs are adept at generating functional code snippets, and by employing various filtering techniques, we can efficiently pinpoint promising code solutions. We conduct an in-depth analysis of code variants that demonstrate superior performance across different network conditions. This analysis holds significant value for the future creation of ABR algorithms.

While our current focus is on the design of ABR algorithms, the methodology we introduce has the potential for broader implications. LLMs possess the versatility to tackle a multitude of networking challenges, ranging from the optimization of routing protocols to the fortification of security frameworks. The capability of LLMs to process and generate expert-like code paves the way for transforming intricate technical specifications and objectives into executable algorithmic strategies. This is an avenue we aim to investigate more extensively in the future.

8 Appendix↩︎

We present the top performing states designed by LLMs in Appendix 8.1, and the top performing network architectures in Appendix 8.2. The prompts we utilized to generate the states and network architectures are presented in Appendix 8.3.

8.1 Top Performing States↩︎

import numpy as np

def state_func(
    bit_rate_kbps_list,
    buffer_size_second_list,
    delay_second_list,
    video_chunk_size_bytes_list,
    next_chunk_bytes_sizes,
    video_chunk_remain_num,
    total_chunk_num,
    all_bit_rate_kbps,
):
    # Normal state 1: The normed last bit rate
    normed_last_bit_rate = bit_rate_kbps_list[-1] / float(np.max(all_bit_rate_kbps))
    # Normal state 2: The normed last buffer size second (buffered video second)
    buffer_norm_factor = 10.
    normed_last_buffer_size = buffer_size_second_list[-1] / buffer_norm_factor
    # Normal state 3: The percentage of the remaining video chunks
    remaining_chunk_percentage = 2.0 * float(video_chunk_remain_num / total_chunk_num) - 1.0
    # Normal state 4: Average throughput over the history window
    history_window = 8
    avg_throughput_MBps = np.mean([video_chunk_size_bytes_list[-i] / 1000. / 1000. / delay_second_list[-i] for i in range(1, history_window+1)])
    # Normal state 5: Variation in throughput over the history window
    throughput_variation = np.std([video_chunk_size_bytes_list[-i] / 1000. / 1000. / delay_second_list[-i] for i in range(1, history_window+1)])
    # Normal state 6: Buffer occupancy rate
    buffer_occ_rate = 2.0 * (1.0 - buffer_size_second_list[-1] / delay_second_list[-1]) - 1.0
    # Finally, the normal states
    normal_states = [
        [normed_last_bit_rate],
        [normed_last_buffer_size],
        [remaining_chunk_percentage],
        [avg_throughput_MBps],
        [throughput_variation],
        [buffer_occ_rate]
    ]
    # Time series state 1: Estimated throughput in the near history
    throughput_MBps_list = [(video_chunk_size_bytes_list[-i] / 1000. / 1000. / delay_second_list[-i]) for i in range(1, history_window+1)]
    # Time series state 2: The normed download time (delay) in the near history
    delay_norm_factor = 10.
    normed_delay_list = [(delay_second_list[-i] / delay_norm_factor) for i in range(1, history_window+1)]
    # Time series state 3: Treat next chunk sizes in MB as time series states
    next_chunk_bytes_MB = [x / 1000. / 1000. for x in next_chunk_bytes_sizes]
    # Finally, the time series states
    time_series_states = [
        throughput_MBps_list,
        normed_delay_list,
        next_chunk_bytes_MB
    ]
    # Return the states
    return {
        "normal_states": normal_states,
        "time_series_states": time_series_states,
    }
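
# ---- State design 2 (LLM-generated) ----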
import numpy as np

def state_func(
    bit_rate_kbps_list,
    buffer_size_second_list,
    delay_second_list,
    video_chunk_size_bytes_list,
    next_chunk_bytes_sizes,
    video_chunk_remain_num,
    total_chunk_num,
    all_bit_rate_kbps,
):
    # Constants
    buffer_norm_factor = 10.
    delay_norm_factor = 10.
    history_window = 8
    # Min-Max normalization for the last bit rate and buffer size
    normed_last_bit_rate = (bit_rate_kbps_list[-1] - min(all_bit_rate_kbps)) / (max(all_bit_rate_kbps) - min(all_bit_rate_kbps)) * 2 - 1
    normed_last_buffer_size = (buffer_size_second_list[-1] / buffer_norm_factor - 0.5) * 2
    # Normal state: Percentage of the remaining video chunks
    remaining_chunk_percentage = video_chunk_remain_num / total_chunk_num * 2 - 1
    # Calculate change rate for the buffer size
    if len(buffer_size_second_list) > 1:
        buffer_change_rate = buffer_size_second_list[-1] - buffer_size_second_list[-2]
    else:
        buffer_change_rate = 0
    normed_buffer_change_rate = buffer_change_rate / buffer_norm_factor * 2  # Assuming change rate won't exceed buffer_norm_factor
    # Normal states
    normal_states = [
        [normed_last_bit_rate],
        [normed_last_buffer_size],
        [remaining_chunk_percentage],
        [normed_buffer_change_rate],
    ]
    # Time series states
    throughput_MBps_list = []
    normed_delay_list = []
    for i in range(history_window):
        # Throughput calculations
        history_chunk_size_bytes = video_chunk_size_bytes_list[-(history_window - i)]
        history_delay_second = delay_second_list[-(history_window - i)]
        throughput = history_chunk_size_bytes / 1000. / 1000. / history_delay_second
        throughput_MBps_list.append(throughput)
        # Normalized delays
        delay = delay_second_list[-(history_window - i)]
        normed_delay = (delay / delay_norm_factor - 0.5) * 2
        normed_delay_list.append(normed_delay)
    # Apply exponential moving average to time series states
    alpha = 0.5  # Smoothing factor
    smoothed_throughput_MBps_list = np.convolve(throughput_MBps_list, [1-alpha, alpha], 'valid').tolist()
    smoothed_normed_delay_list = np.convolve(normed_delay_list, [1-alpha, alpha], 'valid').tolist()
    # Next chunk sizes normalization
    max_chunk_size_MB = max(next_chunk_bytes_sizes) / 1000. / 1000.
    next_chunk_bytes_MB = [(x / 1000. / 1000. / max_chunk_size_MB * 2 - 1) for x in next_chunk_bytes_sizes]
    # Time series states
    time_series_states = [
        smoothed_throughput_MBps_list,
        smoothed_normed_delay_list,
        next_chunk_bytes_MB,
    ]
    # Return the states
    return {
        "normal_states": normal_states,
        "time_series_states": time_series_states,
    }
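
# ---- State design 3 (LLM-generated) ----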
import numpy as np

def state_func(
    bit_rate_kbps_list,
    buffer_size_second_list,
    delay_second_list,
    video_chunk_size_bytes_list,
    next_chunk_bytes_sizes,
    video_chunk_remain_num,
    total_chunk_num,
    all_bit_rate_kbps,
):
    # Current state features
    normed_last_bit_rate = bit_rate_kbps_list[-1] / float(np.max(all_bit_rate_kbps))
    buffer_norm_factor = 10.
    normed_last_buffer_size = buffer_size_second_list[-1] / buffer_norm_factor
    remaining_chunk_percentage = float(video_chunk_remain_num / total_chunk_num)
    # New state features
    # Normalized bitrate change over the last few chunks
    bit_rate_change = (bit_rate_kbps_list[-1] - bit_rate_kbps_list[-2]) / float(np.max(all_bit_rate_kbps))
    # Network throughput in near history
    network_throughput_MBps = []
    for i in range(len(bit_rate_kbps_list)):
        throughput = video_chunk_size_bytes_list[i] / 1000. / 1000. / delay_second_list[i]
        network_throughput_MBps.append(throughput)
    # Normalized new state features
    normed_bit_rate_change = bit_rate_change / float(np.max(all_bit_rate_kbps))
    normed_network_throughput = [x / (1000.0 * np.max(all_bit_rate_kbps)) for x in network_throughput_MBps]
    normal_states = [
        [normed_last_bit_rate, normed_last_buffer_size, remaining_chunk_percentage, normed_bit_rate_change]
    ]
    time_series_states = [
        normed_network_throughput
    ]
    return {
        "normal_states": normal_states,
        "time_series_states": time_series_states,
    }
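
# ---- State design 4 (LLM-generated) ----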
import numpy as np

def state_func(
    bit_rate_kbps_list,
    buffer_size_second_list,
    delay_second_list,
    video_chunk_size_bytes_list,
    next_chunk_bytes_sizes,
    video_chunk_remain_num,
    total_chunk_num,
    all_bit_rate_kbps,
):
    # Normalization and state calculation constants
    buffer_norm_factor = 60.  # Increase to 60 seconds or consider a logarithmic approach
    throughput_norm_factor = np.max(all_bit_rate_kbps) / 1000.  # Normalize by max bitrate in Mbps
    delay_norm_factor = 5.  # Increase the normalization factor for delay
    alpha = 0.5  # Exponential averaging factor
    history_window = 8
    # normal states
    normed_last_bit_rate = bit_rate_kbps_list[-1] / float(np.max(all_bit_rate_kbps))
    normed_last_buffer_size = np.log(buffer_size_second_list[-1] + 1) / np.log(buffer_norm_factor)  # Logarithmic scaling
    remaining_chunk_percentage = float(video_chunk_remain_num / total_chunk_num)
    normal_states = [
        [normed_last_bit_rate],
        [normed_last_buffer_size],
        [remaining_chunk_percentage],
    ]
    # time series states
    throughput_MBps_list = [video_chunk_size_bytes_list[-1] / 1000. / 1000. / delay_second_list[-1] / throughput_norm_factor]
    normed_delay_list = [delay_second_list[-1] / delay_norm_factor]
    for i in range(1, history_window):
        # Exponentially averaged throughput and delay
        exp_avg_throughput = alpha * (video_chunk_size_bytes_list[-(i+1)] / 1000. / 1000. / delay_second_list[-(i+1)]) \
                            + (1 - alpha) * throughput_MBps_list[-1]
        throughput_MBps_list.append(exp_avg_throughput / throughput_norm_factor)
        exp_avg_delay = alpha * (delay_second_list[-(i+1)] / delay_norm_factor) \
                        + (1 - alpha) * normed_delay_list[-1]
        normed_delay_list.append(exp_avg_delay)
    # Reverse lists to align with historical ordering
    throughput_MBps_list.reverse()
    normed_delay_list.reverse()
    # Normalizing next chunk size values using max possible chunk size for normalization
    max_next_chunk_size_MB = np.max(next_chunk_bytes_sizes) / 1000. / 1000.
    next_chunk_bytes_MB = [x / 1000. / 1000. / max_next_chunk_size_MB for x in next_chunk_bytes_sizes]
    time_series_states = [
        throughput_MBps_list,
        normed_delay_list,
        next_chunk_bytes_MB,
    ]
    # Return the states
    return {
        "normal_states": normal_states,
        "time_series_states": time_series_states,
    }
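
# ---- State design 5 (LLM-generated) ----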
import numpy as np
from scipy import stats

def state_func(
    bit_rate_kbps_list,
    buffer_size_second_list,
    delay_second_list,
    video_chunk_size_bytes_list,
    next_chunk_bytes_sizes,
    video_chunk_remain_num,
    total_chunk_num,
    all_bit_rate_kbps,
):
    # Current code remains unchanged for normal states
    normed_last_bit_rate = bit_rate_kbps_list[-1] / float(np.max(all_bit_rate_kbps))
    buffer_norm_factor = 10.
    normed_last_buffer_size = buffer_size_second_list[-1] / buffer_norm_factor
    remaining_chunk_percentage = float(video_chunk_remain_num / total_chunk_num)
    normal_states = [
        [normed_last_bit_rate],
        [normed_last_buffer_size],
        [remaining_chunk_percentage],
    ]
    # Time series state 1: Estimated throughput in near history
    history_window = 8
    throughput_MBps_list = []
    for i in range(history_window):
        history_chunk_size_bytes = video_chunk_size_bytes_list[-(history_window - i)]
        history_delay_second = delay_second_list[-(history_window - i)]
        throughput_MBps_list.append(history_chunk_size_bytes / 1000. / 1000. / history_delay_second)
    # Time series state 2: The normed download time (delay) in near history
    delay_norm_factor = 10.
    normed_delay_list = [x / delay_norm_factor for x in delay_second_list]
    # Time series state 3: Treat next chunk sizes in MB as ts states, too.
    next_chunk_bytes_MB = [x / 1000. / 1000. for x in next_chunk_bytes_sizes]
    # New normal state: Quality of Experience (QoE)
    qoe = (normed_last_bit_rate * normed_last_buffer_size) / np.mean(normed_delay_list)
    # New time series state 4: Variability in throughput and delay
    throughput_variance = np.var(throughput_MBps_list)
    delay_variance = np.var(normed_delay_list)
    # New time series state 5: Trend of throughput and delay
    throughput_trend = stats.linregress(np.arange(history_window), throughput_MBps_list).slope
    delay_trend = stats.linregress(np.arange(history_window), normed_delay_list).slope
    time_series_states = [
        throughput_MBps_list,
        normed_delay_list,
        next_chunk_bytes_MB,
        [qoe],
        [throughput_variance, delay_variance],
        [throughput_trend, delay_trend]
    ]
    return {
        "normal_states": normal_states,
        "time_series_states": time_series_states,
    }
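
# ---- State design 6 (LLM-generated) ----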
import numpy as np
# scipy.stats could be used for z-score standardization if we were allowed to import it.

def state_func(
    bit_rate_kbps_list,
    buffer_size_second_list,
    delay_second_list,
    video_chunk_size_bytes_list,
    next_chunk_bytes_sizes,
    video_chunk_remain_num,
    total_chunk_num,
    all_bit_rate_kbps,
):
    # Constants for normalization
    buffer_norm_factor = 10.
    delay_norm_factor = 10.
    max_bit_rate = np.max(all_bit_rate_kbps)
    history_window = min(8, len(bit_rate_kbps_list))  # min to handle shorter lists
    # Normal states
    normed_last_bit_rate = bit_rate_kbps_list[-1] / float(max_bit_rate)
    normed_last_buffer_size = buffer_size_second_list[-1] / buffer_norm_factor
    remaining_chunk_percentage = float(video_chunk_remain_num) / total_chunk_num
    # Bit rate change variance (smoothness metric)
    bit_rate_changes = np.diff(bit_rate_kbps_list[-history_window:]) / max_bit_rate
    smoothness_metric = np.var(bit_rate_changes)
    # Future chunk size ratio
    last_chunk_size_MB = video_chunk_size_bytes_list[-1] / (1000. * 1000.)
    future_chunk_size_ratio = [x / (last_chunk_size_MB * 1000. * 1000.) for x in next_chunk_bytes_sizes]
    # Normal states list
    normal_states = [
        [normed_last_bit_rate],
        [normed_last_buffer_size],
        [remaining_chunk_percentage],
        [smoothness_metric],
    ]
    # Time series states
    # Throughput in near history (standardized)
    throughput_MBps_list = np.array([
        video_chunk_size_bytes_list[-(history_window - i)] / (delay_second_list[-(history_window - i)] * 1000. * 1000.)
        for i in range(history_window)
    ])
    norm_throughput_MBps_list = (throughput_MBps_list - np.mean(throughput_MBps_list)) / np.std(throughput_MBps_list)
    # Download time (delay) in near history (standardized)
    normed_delay_list = (np.array(delay_second_list[-history_window:]) / delay_norm_factor - 1)
    # Exponential Moving Average of throughput and delay
    ema_throughput = np.average(throughput_MBps_list, weights=np.exp(np.arange(history_window)))
    ema_delay = np.average(normed_delay_list, weights=np.exp(np.arange(history_window)))
    # Buffer size trend
    buffer_size_trend = np.diff(buffer_size_second_list[-(history_window + 1):]) / buffer_norm_factor
    # Time series states list
    time_series_states = [
        norm_throughput_MBps_list.tolist(),
        normed_delay_list.tolist(),
        future_chunk_size_ratio,
        [ema_throughput],
        [ema_delay],
        buffer_size_trend.tolist(),
    ]
    # Return the states
    return {
        "normal_states": normal_states,
        "time_series_states": time_series_states,
    }
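
# ---- State design 7 (LLM-generated) ----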
import numpy as np

def state_func(
    bit_rate_kbps_list,
    buffer_size_second_list,
    delay_second_list,
    video_chunk_size_bytes_list,
    next_chunk_bytes_sizes,
    video_chunk_remain_num,
    total_chunk_num,
    all_bit_rate_kbps,
):
    # existing normal states
    normed_last_bit_rate = bit_rate_kbps_list[-1] / float(np.max(all_bit_rate_kbps))
    buffer_norm_factor = 10.
    normed_last_buffer_size = buffer_size_second_list[-1] / buffer_norm_factor
    remaining_chunk_percentage = float(video_chunk_remain_num / total_chunk_num)
    normal_states = [
        [normed_last_bit_rate],
        [normed_last_buffer_size],
        [remaining_chunk_percentage],
    ]
    # existing time series states
    history_window = 8
    throughput_MBps_list = []
    for i in range(history_window):
        history_chunk_size_bytes = video_chunk_size_bytes_list[-(history_window - i)]
        history_delay_second = delay_second_list[-(history_window - i)]
        throughput_MBps_list.append(history_chunk_size_bytes / 1000. / 1000. / history_delay_second)
    delay_norm_factor = 10.
    normed_delay_list = [x / delay_norm_factor for x in delay_second_list]
    next_chunk_bytes_MB = [x / 1000. / 1000. for x in next_chunk_bytes_sizes]
    time_series_states = [
        throughput_MBps_list,
        normed_delay_list,
        next_chunk_bytes_MB,
    ]
    # new features: variance in throughput
    throughput_variance = np.var(throughput_MBps_list)
    normal_states.append([throughput_variance])
    # new features: prediction of future throughput
    # Utilize a prediction model to estimate future throughput and normalize the result
    predicted_throughput = [0.2, 0.3, 0.4]  # Placeholder for predicted values
    predicted_throughput_normed = [x / float(np.max(throughput_MBps_list)) for x in predicted_throughput]
    normal_states.append(predicted_throughput_normed)
    # new features: statistical features of network conditions
    # Incorporate statistical features such as mean and standard deviation of delay and buffer size
    mean_delay = np.mean(delay_second_list)
    std_buffer_size = np.std(buffer_size_second_list)
    normal_states.append([mean_delay, std_buffer_size])
    return {
        "normal_states": normal_states,
        "time_series_states": time_series_states,
    }
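
# ---- State design 8 (LLM-generated) ----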
import numpy as np

def state_func(
    bit_rate_kbps_list,
    buffer_size_second_list,
    delay_second_list,
    video_chunk_size_bytes_list,
    next_chunk_bytes_sizes,
    video_chunk_remain_num,
    total_chunk_num,
    all_bit_rate_kbps,
):
    # Constants for normalization
    buffer_norm_factor = 10.0
    delay_norm_factor = 10.0
    bitrate_norm_factor = np.max(all_bit_rate_kbps)
    size_norm_factor = 1000.0 * 1000.0  # for converting bytes to MB
    # Normalized last bit rate
    normed_last_bit_rate = bit_rate_kbps_list[-1] / bitrate_norm_factor
    # Normalized last buffer size (clipped to max 10 seconds)
    normed_last_buffer_size = np.clip(buffer_size_second_list[-1], 0, buffer_norm_factor) / buffer_norm_factor
    # Percentage of remaining video chunks
    remaining_chunk_percentage = video_chunk_remain_num / total_chunk_num
    # Normal state list
    normal_states = [
        [normed_last_bit_rate],
        [normed_last_buffer_size],
        [remaining_chunk_percentage],
    ]
    # Historical states for bit rates and buffer size differences
    history_window = 8  # for time series state
    normed_bitrate_history = [br / bitrate_norm_factor for br in bit_rate_kbps_list[-history_window:]]
    buffer_size_diffs = np.diff(buffer_size_second_list[-history_window - 1:]) / buffer_norm_factor
    buffer_size_diffs = np.clip(buffer_size_diffs, -1, 1).tolist()  # clip to ensure it stays in range [-1, 1]
    # Estimated throughput in near history normalized
    throughput_MBps_list = [(video_chunk_size_bytes_list[-(history_window - i)] / size_norm_factor) / delay_second_list[-(history_window - i)] for i in range(history_window)]
    # The normed download time (delay) in near history
    normed_delay_list = [(x / delay_norm_factor) for x in delay_second_list[-history_window:]]
    # Throughput stability (variance)
    throughput_variance = np.var(throughput_MBps_list) / np.var(all_bit_rate_kbps)
    # Sizes for the next chunk normalized
    next_chunk_sizes_norm = [size / size_norm_factor for size in next_chunk_bytes_sizes]
    # Time series states list
    time_series_states = [
        normed_bitrate_history,
        buffer_size_diffs,
        throughput_MBps_list,
        normed_delay_list,
        next_chunk_sizes_norm,
        [throughput_variance],  # included as a single-element list for consistency
    ]
    return {
        "normal_states": normal_states,
        "time_series_states": time_series_states,
    }

8.2 Top Performing Network Architectures↩︎

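# ---- Network architecture 1 (LLM-generated) ----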
import tensorflow.compat.v1 as tf
import tflearn

def network_func(normal_input_list, ts_input_list, action_dim):
    with tf.variable_scope('actor'):
        normal_features = [
            tflearn.fully_connected(normal_input, 256, activation='leaky_relu')
            for normal_input in normal_input_list
        ]
        ts_features = [
            tflearn.flatten(tflearn.conv_1d(
                tf.expand_dims(ts_input, axis=1), 
                256, 1, activation='leaky_relu'
            ))
            for ts_input in ts_input_list
        ]
        merged_features = tflearn.merge(normal_features + ts_features, "concat")
        pi_features = tflearn.fully_connected(merged_features, 256, activation='leaky_relu')
        pi = tflearn.fully_connected(pi_features, action_dim, activation='softmax')

    with tf.variable_scope('critic'):
        normal_features = [
            tflearn.fully_connected(normal_input, 256, activation='leaky_relu')
            for normal_input in normal_input_list
        ]
        ts_features = [
            tflearn.flatten(tflearn.conv_1d(
                tf.expand_dims(ts_input, axis=1), 
                256, 1, activation='leaky_relu'
            ))
            for ts_input in ts_input_list
        ]
        merged_features = tflearn.merge(normal_features + ts_features, "concat")
        value_features = tflearn.fully_connected(merged_features, 256, activation='leaky_relu')
        value = tflearn.fully_connected(value_features, 1, activation='linear')

    return pi, value
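
# ---- Network architecture 2 (LLM-generated) ----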
import tensorflow.compat.v1 as tf
import tflearn

def network_func(normal_input_list, ts_input_list, action_dim):
    with tf.variable_scope('actor'):
        normal_features = [
            tflearn.fully_connected(normal_input, 256, activation='relu')
            for normal_input in normal_input_list
        ]
        ts_features = [
            tflearn.simple_rnn(tf.expand_dims(ts_input, axis=2), 128, return_seq=False)
            for ts_input in ts_input_list
        ]
        merged_features = tflearn.merge(normal_features + ts_features, "concat")
        pi_features = tflearn.fully_connected(merged_features, 256, activation='relu')
        pi = tflearn.fully_connected(pi_features, action_dim, activation='softmax')

    with tf.variable_scope('critic'):
        normal_features = [
            tflearn.fully_connected(normal_input, 256, activation='relu')
            for normal_input in normal_input_list
        ]
        ts_features = [
            tflearn.simple_rnn(tf.expand_dims(ts_input, axis=2), 128, return_seq=False)
            for ts_input in ts_input_list
        ]
        merged_features = tflearn.merge(normal_features + ts_features, "concat")
        value_features = tflearn.fully_connected(merged_features, 256, activation='relu')
        value = tflearn.fully_connected(value_features, 1, activation='linear')

    return pi, value
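
# ---- Network architecture 3 (LLM-generated) ----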
import tensorflow.compat.v1 as tf
import tflearn

def network_func(normal_input_list, ts_input_list, action_dim):
    with tf.variable_scope('actor'):
        # Process normal input features using fully connected layers
        normal_fc = tf.concat(normal_input_list, axis=1)
        normal_fc = tflearn.fully_connected(normal_fc, 256, activation='relu')
        
        # Process time series input features using LSTM cells
        ts_input = tf.concat(ts_input_list, axis=1)
        ts_input = tf.expand_dims(ts_input, -1)  # Add an extra dimension for input to LSTM
        lstm_cell = tf.nn.rnn_cell.LSTMCell(128)
        lstm_output, _ = tf.nn.dynamic_rnn(lstm_cell, ts_input, dtype=tf.float32)
        ts_lstm = tflearn.flatten(lstm_output)
        
        # Merge processed features and pass through actor network
        merged_features = tf.concat([normal_fc, ts_lstm], axis=1)
        pi_features = tflearn.fully_connected(merged_features, 128, activation='relu')
        pi = tflearn.fully_connected(pi_features, action_dim, activation='softmax')

    with tf.variable_scope('critic'):
        # Process normal input features using fully connected layers
        normal_fc_critic = tf.concat(normal_input_list, axis=1)
        normal_fc_critic = tflearn.fully_connected(normal_fc_critic, 256, activation='relu')
        
        # Process time series input features using LSTM cells
        lstm_output_critic, _ = tf.nn.dynamic_rnn(lstm_cell, ts_input, dtype=tf.float32)  # Reuse LSTM cell
        ts_lstm_critic = tflearn.flatten(lstm_output_critic)
        
        # Merge processed features and pass through critic network
        merged_features_critic = tf.concat([normal_fc_critic, ts_lstm_critic], axis=1)
        value_features = tflearn.fully_connected(merged_features_critic, 128, activation='relu')
        value = tflearn.fully_connected(value_features, 1, activation='linear')

    return pi, value
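
# ---- Network architecture 4 (LLM-generated) ----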
import tensorflow.compat.v1 as tf
import tflearn

def network_func(normal_input_list, ts_input_list, action_dim):
    with tf.variable_scope('actor_critic'):
        normal_features = [
            tflearn.fully_connected(normal_input, 256, activation='relu')
            for normal_input in normal_input_list
        ]
        ts_features = [
            tflearn.flatten(tflearn.conv_1d(
                tf.expand_dims(ts_input, axis=1), 
                256, 1, activation='relu'
            ))
            for ts_input in ts_input_list
        ]
        merged_features = tflearn.merge(normal_features + ts_features, "concat")
        hidden_layer = tflearn.fully_connected(merged_features, 512, activation='relu')
        
        pi = tflearn.fully_connected(hidden_layer, action_dim, activation='softmax')
        value = tflearn.fully_connected(hidden_layer, 1, activation='linear')

    return pi, value

8.3 Prompts↩︎

You are trying to design the state code of a reinforcement learning algorithm for adaptive bit rate.

Adaptive Bit Rate (ABR) is a streaming technology used in multimedia applications, particularly in video streaming, to optimize the delivery of content over networks with varying bandwidth conditions. The main goal of ABR is to provide a smooth and uninterrupted viewing experience for users by dynamically adjusting the quality of the video based on the available network conditions.

The input variables include:

bit_rate_kbps_list:
Historical bit rates in kbps, candidate values can be [300., 750., 1200., 1850., 2850., 4300.]
bit_rate_kbps_list[-1] is the most recent bit rate.
For example, [300, 750, ..., 1200, 1850, 4300, 4300] means the last two chunks we downloaded is 4300kbps.

buffer_size_second_list:
Historical buffer size (buffered video length) in second
buffer_size_second_list[-1] is the most recent buffer size.

delay_second_list:
Historical delay (download time of the chunk) in second
delay_second_list[-1] is the download time of the most recent downloaded chunk.

video_chunk_size_bytes_list:
Historical downloaded video chunk sizes in bytes
video_chunk_size_bytes_list[-1] is the size of the most recent downloaded chunk.
Thus video_chunk_size_bytes_list[-1] / delay_second_list[-1] is the download throughput of the most recent chunk

next_chunk_bytes_sizes:
The sizes of the next one chunk in different bit rate levels.
For example, this can be [181801, 450283, 668286, 1034108, 1728879, 2354772],
which means the next chunk will be 181801 bytes if we select 300kbps (all_bit_rate_kbps[0]); is 450283 bytes if we select 750kbps(all_bit_rate_kbps[1])
We always have len(next_chunk_bytes_sizes) = len(all_bit_rate_kbps)

video_chunk_remain_num:
How many remaining video chunks there are. It is a single number.

total_chunk_num:
A single number. total_chunk_num is 48 in most cases.

all_bit_rate_kbps:
all_bit_rate_kbps=[300., 750., 1200., 1850., 2850., 4300.] in most cases


```python
import numpy as np

def state_func(
    # historical bit rates in kbps, candidate values can be [300., 750., 1200., 1850., 2850., 4300.]
    # bit_rate_kbps_list[-1] is the most recent bit rate.
    # For example, [300, 750, ..., 1200, 1850, 4300, 4300] means the last two chunks we downloaded is 4300kbps.
    bit_rate_kbps_list,
    # historical buffer size (buffered video length) in second
    # buffer_size_second_list[-1] is the most recent buffer size.
    buffer_size_second_list,
    # historical delay (download time of the chunk) in second
    # delay_second_list[-1] is the download time of the most recent downloaded chunk.
    delay_second_list,
    # historical downloaded video chunk sizes in bytes
    # video_chunk_size_bytes_list[-1] is the size of the most recent downloaded chunk.
    # Thus video_chunk_size_bytes_list[-1] / delay_second_list[-1] is the download throughput of the most recent chunk
    video_chunk_size_bytes_list,
    # The sizes of the next one chunk in different bit rate levels.
    # For example, this can be [181801, 450283, 668286, 1034108, 1728879, 2354772],
    # which means the next chunk will be 181801 bytes if we select 300kbps (all_bit_rate_kbps[0]); is 450283 bytes if we select 750kbps(all_bit_rate_kbps[1])
    # We always have len(next_chunk_bytes_sizes) = len(all_bit_rate_kbps)
    next_chunk_bytes_sizes,
    # How many remaining video chunks there are. It is a single number
    video_chunk_remain_num,
    # A single number. total_chunk_num is 48 in most cases.
    total_chunk_num,
    # all_bit_rate_kbps=[300., 750., 1200., 1850., 2850., 4300.] in most cases
    all_bit_rate_kbps,
):
    # normal state 1: The normed last bit rate
    normed_last_bit_rate = bit_rate_kbps_list[-1] / float(np.max(all_bit_rate_kbps))
    # normal state 2: The normed last buffer size second (buffered video second)
    buffer_norm_factor = 10.
    normed_last_buffer_size = buffer_size_second_list[-1] / buffer_norm_factor # in 10-second
    # normal state 3: The percentage of the remaining video chunks.
    remaining_chunk_percentage = float(video_chunk_remain_num / total_chunk_num)
    # Finally, the normal states. Each entry in normal_states should be a list.
    normal_states = [
        [normed_last_bit_rate],
        [normed_last_buffer_size],
        [remaining_chunk_percentage],
    ]

    # time series states
    # use 8 as the time series length for time series state 1 and 2
    history_window = 8
    # time series state 1: Estimated throughput in near history
    # use the unit mega byte per second (it is equiv to kilo byte / ms)
    throughput_MBps_list = []
    for i in range(history_window):
        history_chunk_size_bytes = video_chunk_size_bytes_list[-(history_window - i)]
        history_delay_second = delay_second_list[-(history_window - i)]
        throughput_MBps_list.append(history_chunk_size_bytes / 1000. / 1000. / history_delay_second)
    # time series state 2: The normed download time (delay) in near history
    delay_norm_factor = 10.
    normed_delay_list = [x / delay_norm_factor for x in delay_second_list]
    # time series state 3: Treat next chunk sizes in MB as ts states, too. We use Mega Byte since Byte is too large for NN.
    next_chunk_bytes_MB = [x / 1000. / 1000. for x in next_chunk_bytes_sizes]
    # Finally, the time series states. Each entry in timeseries_states should be a list.
    time_series_states = [
        throughput_MBps_list,
        normed_delay_list,
        next_chunk_bytes_MB,
    ]

    # Return the states
    return {
        "normal_states": normal_states,
        "time_series_states": time_series_states,
    }
``` 

Try to improve the state design for me. Remember:

1. Please keep the function name `state_func`
2. Please keep the function's input variables. Do not add any new inputs or remove any inputs. But you can decide how to use the existing inputs in the parameter list.
3. Please keep the function's output variables. The outputs should always be {"normal_states": normal_states, "time_series_states": time_series_states}, while normal_states and time_series_states are list of list. Specifically every element in normal_states would be a list of values that will be sent to a fully connected network; and every element in time_series_states will be treated as time series and sent to a 1D conv network.
4. Please keep `import numpy`. You can also import scipy, pandas if needed.
5. Please normalize the input properly when you try to add new states. It is always better to have the output states within the range [-1, 1]. Do not directly put kbps, bytes, or kbps / second in the output.

Using the following format to output:

Analysis and ideas:
<try to analyze the current code, the problem, propose ideas, and choose the best ideas>

Code:
```python
<Your improved state design code here>
```

You are trying to design a network function of a reinforcement learning algorithm for adaptive bit rate.

Adaptive Bit Rate (ABR) is a streaming technology used in multimedia applications, particularly in video streaming, to optimize the delivery of content over networks with varying bandwidth conditions. The main goal of ABR is to provide a smooth and uninterrupted viewing experience for users by dynamically adjusting the quality of the video based on the available network conditions.

We have 3 inputs: normal_input_list, ts_input_list, and action_dim.

Their meanings are:
  - normal_input_list[0]: normed last bit rate. It is a tf.placeholder and has the shape of `[None, 1]`.
  - normal_input_list[1]: normed last buffer size. It is a tf.placeholder and has the shape of `[None, 1]`.
  - normal_input_list[2]: normed remaining chunk percentage. It is a tf.placeholder and has the shape of `[None, 1]`.
  - ts_input_list[0]: normed throughput MBps. It is the normed throughput numbers in history. It is a tf.placeholder and has the shape of `[None, history_window_n]`.
  - ts_input_list[1]: normed download time list. It is download time of each chunks. It is a tf.placeholder and has the shape of `[None, history_window_n]`.
  - ts_input_list[2]: next chunk sizes in MB. It is the chunk sizes of the next chunk in different bitrates. It is a tf.placeholder and has the shape of `[None, bit_rate_n]`.


A network function example is:
```python
import tensorflow.compat.v1 as tf
import tflearn

def network_func(normal_input_list, ts_input_list, action_dim):
    with tf.variable_scope('actor'):
        normal_features = [
            tflearn.fully_connected(normal_input, 128, activation='relu')
            for normal_input in normal_input_list
        ]
        ts_features = [
            tflearn.flatten(tflearn.conv_1d(
                tf.expand_dims(ts_input, axis=1), 
                128, 1, activation='relu'
            ))
            for ts_input in ts_input_list
        ]
        merged_features = tflearn.merge(normal_features + ts_features, "concat")
        pi_features = tflearn.fully_connected(merged_features, 128, activation='relu')
        pi = tflearn.fully_connected(pi_features, action_dim, activation='softmax')

    with tf.variable_scope('critic'):
        normal_features = [
            tflearn.fully_connected(normal_input, 128, activation='relu')
            for normal_input in normal_input_list
        ]
        ts_features = [
            tflearn.flatten(tflearn.conv_1d(
                tf.expand_dims(ts_input, axis=1), 
                128, 1, activation='relu'
            ))
            for ts_input in ts_input_list
        ]
        merged_features = tflearn.merge(normal_features + ts_features, "concat")
        value_features = tflearn.fully_connected(merged_features, 128, activation='relu')
        value = tflearn.fully_connected(value_features, 1, activation='linear')

    return pi, value
```

The example uses fully connected network to process all states in normal_input_list, and uses time series network to process the states in ts_input_list.

Please notice that:
- You MUST include "import tensorflow.compat.v1 as tf" and "import tflearn" in the beginning of your code. Use tflearn or tensorflow v1 to program the network.
- You MUST use the same function name: `network_func` in your output.
- You MUST use the two outputs `pi, value` in your returning. We strictly follow the actor-critic structure but you can design the intermediate network structure.
- The two outputs: `pi` should be exactly the same dimension with `action_dim`, which can be viewed as a classification. `value` should be exactly a single value which is used to evaluate the states.
- DO NOT change the input: normal_input_list, ts_input_list, action_dim.
- You can change the hidden_num, the intermediate network structure, or use other tensorflow v1 functions to program the network.

Using the following format to output:

Analysis and ideas:
<try to analyze the current code, the problem, propose ideas, and choose the best ideas>

Code:
```python
<Your improved network design code here>
```

References↩︎

[1]
B. Mann et al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020.
[2]
S. Zhou, U. Alon, F. F. Xu, Z. Wang, Z. Jiang, and G. Neubig, “Docprompting: Generating code by retrieving the docs,” arXiv preprint arXiv:2207.05987, 2022.
[3]
M. Chen et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
[4]
S. Hong et al., “Metagpt: Meta programming for multi-agent collaborative framework,” arXiv preprint arXiv:2308.00352, 2023.
[5]
X. Yin, A. Jindal, V. Sekar, and B. Sinopoli, “A control-theoretic approach for dynamic adaptive video streaming over HTTP,” in Proceedings of the 2015 ACM conference on special interest group on data communication, 2015, pp. 325–338.
[6]
K. Spiteri, R. Urgaonkar, and R. K. Sitaraman, “BOLA: Near-optimal bitrate adaptation for online videos,” IEEE/ACM transactions on networking, vol. 28, no. 4, pp. 1698–1711, 2020.
[7]
H. Mao, R. Netravali, and M. Alizadeh, “Neural adaptive video streaming with pensieve,” in Proceedings of the conference of the ACM special interest group on data communication, 2017, pp. 197–210.
[8]
F. Y. Yan et al., “Learning in situ: a randomized experiment in video streaming,” in USENIX symposium on networked systems design and implementation (NSDI), Feb. 2020.
[9]
Z. Akhtar et al., “Oboe: Auto-tuning video ABR algorithms to network conditions,” in Proceedings of the 2018 conference of the ACM special interest group on data communication, 2018, pp. 44–58.
[10]
B. Romera-Paredes et al., “Mathematical discoveries from program search with large language models,” Nature, vol. 625, no. 7995, pp. 468–475, 2024.
[11]
Y. J. Ma et al., “Eureka: Human-level reward design via coding large language models,” in The twelfth international conference on learning representations, 2023.
[12]
U. Naseer and T. A. Benson, “Configanator: A data-driven approach to improving CDN performance,” in 19th USENIX symposium on networked systems design and implementation (NSDI 22), 2022, pp. 1135–1158.
[13]
C. Qiao, G. Li, Q. Ma, J. Wang, and Y. Liu, “Trace-driven optimization on bitrate adaptation for mobile video streaming,” IEEE Transactions on Mobile Computing, vol. 21, no. 6, pp. 2243–2256, 2020.
[14]
D. Kumar, S. Aishwarya, A. Srinivasan, and L. A. Raj, “Adaptive video streaming over HTTP using stochastic bitrate prediction in 4G wireless networks,” in 2016 ITU kaleidoscope: ICTs for a sustainable world (ITU WT), 2016, pp. 1–8.
[15]
M. F. Tuysuz and M. E. Aydin, “QoE-based mobility-aware collaborative video streaming on the edge of 5G,” IEEE Transactions on Industrial Informatics, vol. 16, no. 11, pp. 7115–7125, 2020.
[16]
A.-T. Tran, N.-N. Dao, and S. Cho, “Bitrate adaptation for video streaming services in edge caching systems,” IEEE Access, vol. 8, pp. 135844–135852, 2020.
[17]
E. Ramadan, A. Narayanan, U. K. Dayalan, R. A. Fezeu, F. Qian, and Z.-L. Zhang, “Case for 5G-aware video streaming applications,” in Proceedings of the 1st workshop on 5g measurements, modeling, and use cases, 2021, pp. 27–34.
[18]
J. Zhao and J. Pan, “QoE-driven joint decision-making for multipath adaptive video streaming,” in GLOBECOM 2023-2023 IEEE global communications conference, 2023, pp. 128–133.
[19]
X. He, K. Zhao, and X. Chu, “AutoML: A survey of the state-of-the-art,” Knowledge-based systems, vol. 212, p. 106622, 2021.
[20]
T. Elsken, J. H. Metzen, and F. Hutter, “Neural architecture search: A survey,” Journal of Machine Learning Research, vol. 20, no. 55, pp. 1–21, 2019.
[21]
J. Wei et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022.
[22]
R. Müller, S. Kornblith, and G. E. Hinton, “When does label smoothing help?” Advances in neural information processing systems, vol. 32, 2019.
[23]
FCC, “Measuring Broadband America,” https://www.fcc.gov/general/measuring-broadband-america, 2024. [Accessed 10-03-2024].
[24]
Z. Xia, Y. Zhou, F. Y. Yan, and J. Jiang, “Genet: Automatic curriculum generation for learning adaptation in networking,” in Proceedings of the ACM SIGCOMM 2022 conference, 2022, pp. 397–413.
[25]
F. Michel, M. Trevisan, D. Giordano, and O. Bonaventure, “A first look at starlink performance,” in Proceedings of the 22nd ACM internet measurement conference, 2022, pp. 130–136.
[26]
Z. Lai, H. Li, and J. Li, “Starperf: Characterizing network performance for emerging mega-constellations,” in 2020 IEEE 28th international conference on network protocols (ICNP), 2020, pp. 1–11.
[27]
M. M. Kassem, A. Raman, D. Perino, and N. Sastry, “A browser-side view of starlink connectivity,” in Proceedings of the 22nd ACM internet measurement conference, 2022, pp. 151–158.
[28]
Google, “YouTube recommended upload encoding settings - YouTube Help,” https://support.google.com/youtube/answer/1722171?hl=en, 2024. [Accessed 10-03-2024].
[29]
A. Savitzky and M. J. Golay, “Smoothing and differentiation of data by simplified least squares procedures.” Analytical chemistry, vol. 36, no. 8, pp. 1627–1639, 1964.
[30]
T. Huang, C. Zhou, R.-X. Zhang, C. Wu, X. Yao, and L. Sun, “Stick: A harmonious fusion of buffer-based and learning-based approach for adaptive streaming,” in IEEE INFOCOM 2020-IEEE conference on computer communications, 2020, pp. 1967–1976.
[31]
Y. Sun et al., “CS2P: Improving video bitrate selection and adaptation with data-driven throughput prediction,” in Proceedings of the 2016 ACM SIGCOMM conference, 2016, pp. 272–285.
[32]
I. Grondman, L. Busoniu, G. A. Lopes, and R. Babuska, “A survey of actor-critic reinforcement learning: Standard and natural policy gradients,” IEEE Transactions on Systems, Man, and Cybernetics, part C (applications and reviews), vol. 42, no. 6, pp. 1291–1307, 2012.
[33]
T. Huang, R.-X. Zhang, C. Zhou, and L. Sun, “QARC: Video quality aware rate control for real-time video streaming via deep reinforcement learning,” in Proceedings of ACM Multimedia, 2018.
[34]
Y. Du et al., “Guiding pretraining in reinforcement learning with large language models,” in International conference on machine learning, 2023, pp. 8657–8677.
[35]
Z. Zhao, W. S. Lee, and D. Hsu, “Large language models as commonsense knowledge for large-scale task planning,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[36]
C. Colas, L. Teodorescu, P.-Y. Oudeyer, X. Yuan, and M.-A. Côté, “Augmenting autotelic agents with large language models,” in Conference on lifelong learning agents, 2023, pp. 205–226.
[37]
L. Wolf and M. Musolesi, “Augmented modular reinforcement learning based on heterogeneous knowledge,” arXiv preprint arXiv:2306.01158, 2023.
[38]
K. Choi, C. Cundy, S. Srivastava, and S. Ermon, “LMPriors: Pre-trained language models as task-specific priors,” in NeurIPS 2022 foundation models for decision making workshop, 2022.
[39]
T. G. Karimpanal et al., “LaGR-SEQ: Language-guided reinforcement learning with sample-efficient querying,” arXiv preprint arXiv:2308.13542, 2023.
[40]
L. Espeholt et al., “Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures,” in International conference on machine learning, 2018, pp. 1407–1416.
[41]
S. Paul, V. Kurin, and S. Whiteson, “Fast efficient hyperparameter tuning for policy gradients,” arXiv preprint arXiv:1902.06583, 2019.
[42]
Z. Xu, H. P. van Hasselt, M. Hessel, J. Oh, S. Singh, and D. Silver, “Meta-gradient reinforcement learning with an objective discovered online,” Advances in Neural Information Processing Systems, vol. 33, pp. 15254–15264, 2020.
[43]
A. Faust, A. Francis, and D. Mehta, “Evolving rewards to automate reinforcement learning,” arXiv preprint arXiv:1905.07628, 2019.
[44]
V. Veeriah et al., “Discovery of useful questions as auxiliary tasks,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[45]
J. K. Franke, G. Koehler, A. Biedenkapp, and F. Hutter, “Sample-efficient automated deep reinforcement learning,” in International conference on learning representations, 2020.
[46]
T. He et al., “Reinforcement learning with automated auxiliary loss search,” Advances in Neural Information Processing Systems, vol. 35, pp. 1820–1834, 2022.

  1. † Lili Qiu is the corresponding author.↩︎

  2. ‡ Aashish Gottipati and Kenuo Xu contributed to this work during their internships at Microsoft Research.↩︎