Learning to Control Camera Exposure via Reinforcement Learning

Kyunghyun Lee*
LG AI Research


Ukcheol Shin*


Byeong-Uk Lee


Adjusting camera exposure in arbitrary lighting conditions is the first step to ensure the functionality of computer vision applications. Poorly adjusted camera exposure often leads to critical failure and performance degradation. Traditional camera exposure control methods require multiple convergence steps and time-consuming processes, making them unsuitable for dynamic lighting conditions. In this paper, we propose a new camera exposure control framework that rapidly controls camera exposure while performing real-time processing by exploiting deep reinforcement learning. The proposed framework consists of four contributions: 1) a simplified training ground to simulate the real world’s diverse and dynamic lighting changes, 2) flickering and image attribute-aware reward design, along with lightweight state design for real-time processing, 3) a static-to-dynamic lighting curriculum to gradually improve the agent’s exposure-adjusting capability, and 4) domain randomization techniques to alleviate the limitation of the training ground and achieve seamless generalization in the wild. As a result, our proposed method rapidly reaches a desired exposure level within five steps with real-time processing (1 ms). Also, the acquired images are well-exposed and show superiority in various computer vision tasks, such as feature extraction and object detection.

1 Introduction

Camera exposure control is the task of adjusting the exposure level by controlling exposure time, gain, and aperture to achieve a desired level of brightness and image quality for a given scene. Poorly adjusted exposure parameters result in over-exposed, under-exposed, blurry, or noisy images, which can cause performance degradation in image-based applications and, in the worst cases, even life-threatening accidents. Therefore, finding a proper camera exposure is the first step to ensure the functionality of computer vision applications, such as object detection [1], [2], semantic segmentation [3], [4], depth estimation [5], [6], and visual odometry [7], [8].

There are several essential requirements in camera exposure control. First, rapid convergence must be guaranteed to maintain an appropriate exposure level under dynamically changing light. Second, the exposure control loop is one of the lowest-level loops in the camera system, so the algorithm must be lightweight enough for on-board operation. Finally, the quality of the converged image should not be sacrificed to meet these requirements.

Further, the number of simultaneously controlled parameters is also important because it affects the convergence time and the final quality of the converged image. One-by-one control methods [9]-[11] adjust exposure parameters one at a time to achieve a desired exposure level, rather than controlling them jointly. However, the converged parameters are often not optimal, ending up as pairs such as [long exposure time, low gain] or [short exposure time, high gain]. As a result, the values cause undesirable image artifacts, such as motion blur due to long exposure time or severe noise due to high gain.

Joint exposure parameter control [12]-[16] often needs multiple search steps over a wide search space to find an optimal combination. As a result, these methods cause a flickering effect and converge slowly. Also, the recent methods have high computational complexity due to their optimization algorithms [12], [14], image assessment metrics [11]-[14], and GPU inference [15].

In this paper, we propose a new joint exposure parameter control method that exploits reinforcement learning to achieve instant convergence and real-time processing. The proposed framework consists of four contributions:

  • A simplified training ground to simulate the real world’s diverse and dynamic lighting changes.

  • Flickering and image attribute-aware reward design, along with lightweight and intuitive state design for real-time processing.

  • A static-to-dynamic lighting curriculum to gradually improve the agent’s exposure-adjusting capability.

  • Domain randomization techniques to alleviate the limitation of the training ground and achieve seamless generalization in the wild without additional training.

The proposed method is thoroughly validated in three different environments: light-controlled darkroom, exposure control dataset [13], and real-world environments. We demonstrate that our proposed method rapidly adjusts camera exposure within five steps with real-time processing of 1 ms. Also, the images acquired from our method are well-exposed and show superiority in numerous computer vision tasks, such as feature extraction and object detection.

2 Related Work

Figure 1: Training framework overview. Our DRL agent is trained with the SAC algorithm in the light-controlled darkroom environment. For each episode, a lighting condition is assigned by the current curriculum level. The lighting condition can be fixed at random brightness or dynamically changed within each episode, depending on the level. Given the lighting condition, the agent takes a vectorized intensity history for a randomly selected RoI patch as a state. Afterward, the agent estimates the exposure time and gain differences that maximize a reward function. With this framework, the trained agent generalizes to a real environment without additional training.

2.1 Optimization-based Exposure Control

One branch of camera exposure control exploits white-box and black-box optimization to find optimal parameters for the desired exposure level. Camera built-in Auto-Exposure (AE) control methods [9], [10] adjust exposure parameters (i.e., exposure time and gain) based on differentiable optimization by using the equation between Exposure Value (EV) and exposure parameters [17]. They control exposure parameters one-by-one to achieve pre-defined image brightness. These built-in AE methods provide real-time processing ability but result in non-optimum solutions (e.g., long exposure time and low gain) and limited scalability. The former limitation causes motion blur or severe image noise due to long exposure time or high gain, respectively. The latter limitation means the methods cannot be extended to maximize other image attributes, such as image gradient or entropy.

Recent AE algorithms are designed to maximize desirable image attributes for computer vision applications, such as image gradient [11], [18], [19], entropy [12], [14], noise level [13], and optical flow [20]. However, these algorithms mainly focus on image metrics for better quality, not on the control method. Therefore, they adopt a heuristic control algorithm [11] or black-box optimization methods, such as Bayesian optimization [12], [14] and Nelder-Mead optimization [13]. These black-box optimizations and attribute assessment metrics often require multiple explorations that cause a flickering effect, multiple steps to converge, or heavy computation time. Unlike these previous methods, the proposed method provides rapid convergence, real-time processing, and potential scalability by exploiting Deep Reinforcement Learning (DRL).

2.2 Data-driven Exposure Control

Another emerging branch of AE utilizes a neural network to predict appropriate exposure parameters. The method of [15] uses an exposure parameter estimation network that predicts the optimal exposure time and gain for each given image. The neural network, consisting of a few convolution and linear layers, is trained with Ground Truth (GT) exposure parameters in a supervised manner. However, generating the GT labels requires a time-consuming and complicated process that collects multiple images with varying exposure parameters for every scene. The heavy computation caused by the convolution layers is another drawback. Unlike this method, our method trains a neural network by maximizing a reward function without relying on specific GT data or generation processes.

3 DRL for Automatic Exposure Control

Applying DRL to the Automatic Exposure (AE) control task while achieving rapid convergence and real-time processing presents several challenges. Our proposed method provides effective answers to the following questions.

  1. Environment: what is the most effective form of training environment to learn camera exposure control?

  2. Reward: what aspects does the agent need to maximize?

  3. State: where is the bottleneck for real-time processing?

  4. Generalization: how can the agent generalize seamlessly to lighting conditions in the wild?

3.1 Training Environment

DRL requires a large number of samples and interaction with the environment to train the agent. Also, the environment needs to provide a diverse and wide range of problems for the agent to solve. The optimal form of exposure control is to instantly adjust exposure parameters for a variety of lighting conditions, from static lighting conditions to dramatic lighting changes. To this end, the training environment for camera exposure control must provide diverse lighting change scenarios to the agent.

Numerous options exist, such as a simulation, a real-world environment with natural sunlight, and a controlled real-world environment. The simulation can provide various images and lighting conditions but has the limitations of an imperfect exposure parameter implementation and a sim-to-real domain gap. On the other hand, the real-world environment with natural sunlight has no domain gap, but the lighting conditions change very slowly. Therefore, we construct a controlled real-world environment in a darkroom with controllable LEDs to adjust lighting conditions.

The constructed light-controlled darkroom is shown in Fig. 1. The environment has one machine vision camera, a random target object, a light controller, and an LED bar. The environment provides a random lighting scenario from dark to bright light conditions, and the agent adjusts exposure parameters to capture a high-quality image of the target object in the suggested scenario. The detailed sensor specification can be found in the supplementary material.

3.2 State, Action, and Reward Design

3.2.1 State Design: Vectorized Intensity History

The widely adopted state design in related fields is utilizing a feature map from a pre-trained network [15], [19]. However, we found the CNN feature is not effective in the exposure control task due to its disadvantages: 1) unclear relation between CNN feature and exposure level, 2) generalization problem of CNN feature, and 3) heavy computation for on-board devices due to multiple convolution layers.

Therefore, instead of using the CNN feature or other complicated features as a state, we utilize vectorized image intensity as a straightforward and lightweight state representation for the exposure control task. As shown in Fig. 1, we first vectorize the image intensity map from a Region-of-Interest (RoI) patch of a gray-scale image \(\mathbb{R}^{H_{RoI}\times W_{RoI}}\) to the intensity vector \(\mathbb{R}^{S}\). The RoI patch can be an entire image area or specific regions decided by a domain randomization process. After that, we stack the intensity vectors of consecutive frames to embed the previous state history, as follows: \[\begin{align} v_t &= f(I_t^{RoI}), \quad v_t \in \mathbb{R}^{S}, \\ s_t &= \text{concat}(v_{t-n}, \ldots, v_{t-1}, v_t), \end{align}\] where \(I_t\) is a normalized image at time step \(t\), \(f(\cdot)\) is an averaging process through the x-axis of a gray-scale image, defined as \(\frac{1}{H_{RoI}}\sum_x\text{gray}(I_t^{RoI})\), \(v_t\) represents the vectorized intensity that has a dimension of \(S\), and \(s_t\) is a state vector. To ensure a fixed and reasonable state length, we resize the given RoI patch to have \(S=128\) and stack previous states with \(n=3\), so the state vector has \((n+1)\,S = 512\) entries. The proposed state design is effective, straightforward, and computationally efficient compared to the CNN feature, as described in Sec. 4.1.
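The state construction can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the averaging is read as a mean over the \(H_{RoI}\) rows (one value per column), the resize step is replaced by nearest-neighbour resampling of the averaged vector, and the helper names `intensity_vector` and `stack_state`, as well as the padding rule at the start of an episode, are hypothetical.

```python
from typing import List

Image = List[List[float]]  # gray-scale RoI as rows x columns, values in [0, 1]

S = 128        # length of one intensity vector (value from the text)
N_HISTORY = 3  # number of previous vectors stacked with the current one


def intensity_vector(roi: Image, size: int = S) -> List[float]:
    """Average each column over all rows, then resample to `size` entries."""
    height, width = len(roi), len(roi[0])
    column_means = [sum(row[x] for row in roi) / height for x in range(width)]
    # Assumption: nearest-neighbour resampling stands in for the resize step.
    return [column_means[min(width - 1, int(i * width / size))] for i in range(size)]


def stack_state(history: List[List[float]], n: int = N_HISTORY) -> List[float]:
    """Concatenate v_{t-n}, ..., v_t, repeating the oldest vector as padding."""
    recent = history[-(n + 1):]
    padded = [recent[0]] * (n + 1 - len(recent)) + recent
    return [value for vector in padded for value in vector]
```

With \(S=128\) and \(n=3\), `stack_state` returns a flat list of 512 values, regardless of the input resolution.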

3.2.2 Action Design: Relative and Continuous Action

We have two controllable parameters, exposure time and gain, to adjust camera exposure. However, there are several options for the action design, such as 1) discrete vs. continuous action space, and 2) absolute vs. relative action range. A discrete action space quantizes the action range into a few action values. It simplifies the training process, but it leaves an approximation gap to the optimal values. The choice between an absolute and a relative action range concerns how an action maps to camera parameters. In the absolute action range, an action value is directly matched to a specific value of the camera parameters. In the relative action range, an action value indicates the amount of change in the camera parameters.

Among the options, we select a continuous-relative action space. Our goal is rapid convergence with a minimum number of exploration steps, but a discrete action space needs multiple steps and often does not converge, depending on its quantization level. Also, we empirically found that the absolute action range often induces a flickering effect and unstable convergence, as described in Sec. 4.1.
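A relative, continuous action can be illustrated as a bounded increment applied to the current parameters. The sketch below is only an illustration: the upper bounds (100 ms, 40 dB) come from the training settings in the appendix, while the lower bounds, the per-step scales, and the names `ExposureParams` and `apply_relative_action` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ExposureParams:
    exposure_ms: float
    gain_db: float


# Upper bounds follow the appendix; the lower bounds and the maximum
# per-step changes are hypothetical values chosen for illustration.
EXPOSURE_RANGE_MS = (0.1, 100.0)
GAIN_RANGE_DB = (0.0, 40.0)
MAX_EXPOSURE_STEP_MS = 10.0
MAX_GAIN_STEP_DB = 4.0


def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def apply_relative_action(params: ExposureParams,
                          action: Tuple[float, float]) -> ExposureParams:
    """Interpret an action in [-1, 1]^2 as a change of exposure time and gain."""
    d_exposure = clamp(action[0], -1.0, 1.0) * MAX_EXPOSURE_STEP_MS
    d_gain = clamp(action[1], -1.0, 1.0) * MAX_GAIN_STEP_DB
    return ExposureParams(
        exposure_ms=clamp(params.exposure_ms + d_exposure, *EXPOSURE_RANGE_MS),
        gain_db=clamp(params.gain_db + d_gain, *GAIN_RANGE_DB),
    )
```

An absolute action would instead map the action value directly onto the parameter range, so a small error in the policy output translates into a different exposure at every step.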

3.2.3 Reward Design: Flickering and Image Attribute

The desirable behavior of exposure control is to maximize image attributes, such as sharp edges, moderate brightness, and low noise, and to maintain these attributes during exposure parameter transitions. Therefore, we designed the reward function from three perspectives: 1) a moderate brightness level to provide clear visibility and edge information, 2) a smoothed exposure transition to ensure stable convergence and prevent flickering, and 3) a low noise level to provide a clear image and avoid excessively high gain values. The designed reward functions are as follows: \[\begin{align} \mathcal{R}_{mean} &= \textstyle\frac{1}{P}\textstyle\sum_{xy}|I_t^{RoI} - M|^{p_m}, \\ \mathcal{R}_{flk} &= \textstyle\frac{1}{P}\textstyle\sum_{xy}||I_{t}^{RoI} - I_{t-1}^{RoI}||, \\ \mathcal{R}_{noise} &= \textstyle\frac{1}{P}\textstyle\sum_{xy} sobel(I_{t}^{RoI}), \\ \mathcal{R}_{total} &= w_m\mathcal{R}_{mean} + w_f\mathcal{R}_{flk} + w_n\mathcal{R}_{noise}, \end{align}\] where \(P\) is the number of pixels, \(M=0.5\) indicates mid-tone brightness, \(p_m=0.5\) is a parameter for non-linearity, and \(sobel\) is a gradient operator. Also, we set \(w_m=1.5\), \(w_f=-1.0\), \(w_n=-0.1\) in practice. The proposed reward design is a primitive and basic form for camera exposure control; however, it can be easily extended by incorporating modern image assessment metrics [13], [14], [20].
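The three reward terms can be written out directly. The sketch below transcribes the formulas literally and should be read with caution: the Sobel term is assumed to be the gradient magnitude from the standard 3x3 kernels with border pixels set to zero, and the brightness term is not re-signed, so its sign convention should be confirmed against the authors' implementation. All function names are hypothetical.

```python
from typing import List

Image = List[List[float]]  # gray-scale RoI with values in [0, 1]

M, P_M = 0.5, 0.5                          # mid-tone target and exponent
W_MEAN, W_FLK, W_NOISE = 1.5, -1.0, -0.1   # weights from the text


def mean_over_pixels(values: Image) -> float:
    return sum(map(sum, values)) / (len(values) * len(values[0]))


def r_mean(img: Image) -> float:
    """Mean of |I - M|^p_m over all pixels, as printed in the formula."""
    return mean_over_pixels([[abs(p - M) ** P_M for p in row] for row in img])


def r_flk(img: Image, prev: Image) -> float:
    """Mean absolute difference between consecutive frames."""
    return mean_over_pixels(
        [[abs(a - b) for a, b in zip(ra, rb)] for ra, rb in zip(img, prev)]
    )


def sobel_magnitude(img: Image) -> Image:
    """Gradient magnitude from 3x3 Sobel kernels; border pixels stay 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out


def r_noise(img: Image) -> float:
    return mean_over_pixels(sobel_magnitude(img))


def total_reward(img: Image, prev: Image) -> float:
    return W_MEAN * r_mean(img) + W_FLK * r_flk(img, prev) + W_NOISE * r_noise(img)
```

All three terms need only per-pixel arithmetic and one 3x3 filter, which keeps the reward cheap to evaluate during training.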

3.3 Static-to-dynamic Lighting Curriculum

In the wild, the agent must be able to control exposure parameters for a variety of lighting change scenarios. However, training on every scenario simultaneously results in an unstable training process and poor generalization. Therefore, we propose a static-to-dynamic curriculum strategy that starts with a simple control task and gradually introduces dynamic and dramatic lighting change scenarios. In the end, the trained models possess a comprehensive exposure control capability for diverse lighting conditions.

We divide the difficulty of lighting conditions into three levels: easy, normal, and hard. The easy level has static lighting conditions with moderate brightness. The normal level also has a fixed brightness, but it is darker or brighter than in the easy level. Lastly, in the hard level, the LED brightness dynamically changes from dark to bright, or the opposite way, during each scenario. The probability of each level is gradually updated according to the number of elapsed training episodes \(t_e\). The probability set \([p_{e}, p_{n}, p_{h}]\) starts from \([1,0,0]\), passes through \([0,1,0]\), and ends with \([p_{e}^f, p_{n}^f, p_{h}^f] = [0.1, 0.4, 0.5]\). In summary, the probability for each difficulty level is updated as follows: \[\begin{align} p_{e} &= \begin{cases} 1, & t_e < T_{e} \\ 1-\frac{(t_e - T_{e})}{(T_{n} - T_{e})}, & T_{e} \leq t_e < T_{n} \\ p_{e}^f, & T_{n} \leq t_e \end{cases} \\ p_{n} &= \begin{cases} 0, & t_e < T_{e} \\ \frac{(t_e - T_{e})}{(T_{n} - T_{e})}, & T_{e} \leq t_e < T_{n} \\ p_{n}^f, & T_{n} \leq t_e \end{cases} \\ p_{h} &= \begin{cases} 0, & t_e < T_{n} \\ p_{h}^f, & T_{n} \leq t_e \end{cases} \end{align}\] We use \(T_e=25{,}000\) and \(T_n=45{,}000\) in practice.
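The schedule can be sketched as a small function of the episode counter. The code follows the verbal description, in which the probability set moves from \([1,0,0]\) through \([0,1,0]\) to the final set, so the easy-level probability ramps down while the normal-level probability ramps up. The names `level_probabilities` and `sample_level` are hypothetical.

```python
import random
from typing import Tuple

T_E, T_N = 25_000, 45_000   # curriculum thresholds from the text
P_FINAL = (0.1, 0.4, 0.5)   # final [easy, normal, hard] probabilities


def level_probabilities(t_e: int) -> Tuple[float, float, float]:
    """Return (p_easy, p_normal, p_hard) for training episode t_e."""
    if t_e < T_E:
        return (1.0, 0.0, 0.0)
    if t_e < T_N:
        ramp = (t_e - T_E) / (T_N - T_E)  # goes from 0 to 1 on [T_E, T_N)
        return (1.0 - ramp, ramp, 0.0)
    return P_FINAL


def sample_level(t_e: int, rng: random.Random) -> str:
    """Draw the difficulty level of the next episode."""
    levels = ("easy", "normal", "hard")
    return rng.choices(levels, weights=level_probabilities(t_e))[0]
```

For example, `level_probabilities(35_000)` lies halfway along the ramp and gives an even split between the easy and normal levels.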

3.4 Spatial Domain Randomization

In the wild, the agent encounters various surrounding environments and object contexts, such as office, road, tunnel, and mountain. Although the light-controlled darkroom can provide various lighting scenarios, it is difficult to contain diverse environments and contexts because it only has a few target objects with a fixed background. Therefore, without proper randomization techniques, the agent may overfit to perform exposure control for only a few target objects, resulting in generalization failure in the wild.

The main idea is to provide as much diverse image structure and context information as possible by augmenting the images from the darkroom environment. Specifically, we spatially augment the images with random flipping, cropping, rotating, and resizing but do not change color and brightness information. Each augmentation and its parameters are randomly selected at the beginning of each training episode and fixed during the episode. With the proposed domain randomization technique, the trained agent generalizes to the real world without any fine-tuning.
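The per-episode sampling can be sketched as follows. This is an illustration only: rotation is restricted to multiples of 90 degrees, cropping is centred, resizing is omitted, and the parameter ranges and the names `SpatialAugmentation`, `sample_augmentation`, and `apply_augmentation` are hypothetical. The transforms only move pixels, so intensity values are left untouched.

```python
import random
from dataclasses import dataclass
from typing import List

Image = List[List[float]]


@dataclass(frozen=True)
class SpatialAugmentation:
    flip_horizontal: bool
    quarter_turns: int     # clockwise rotation in multiples of 90 degrees
    crop_fraction: float   # side length of the centred crop, relative to input


def sample_augmentation(rng: random.Random) -> SpatialAugmentation:
    """Sample once at the start of an episode and reuse for every frame."""
    return SpatialAugmentation(
        flip_horizontal=rng.random() < 0.5,
        quarter_turns=rng.randrange(4),
        crop_fraction=rng.uniform(0.5, 1.0),
    )


def apply_augmentation(img: Image, aug: SpatialAugmentation) -> Image:
    """Crop, flip, and rotate without altering any pixel value."""
    h, w = len(img), len(img[0])
    ch = max(1, int(h * aug.crop_fraction))
    cw = max(1, int(w * aug.crop_fraction))
    top, left = (h - ch) // 2, (w - cw) // 2
    out = [row[left:left + cw] for row in img[top:top + ch]]
    if aug.flip_horizontal:
        out = [row[::-1] for row in out]
    for _ in range(aug.quarter_turns):
        out = [list(col) for col in zip(*out[::-1])]  # rotate 90 degrees clockwise
    return out
```

Keeping one `SpatialAugmentation` per episode means consecutive frames stay spatially aligned, so a frame-difference signal such as the flicker term still measures exposure changes rather than augmentation changes.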

3.5 Policy Optimization

As our action space is continuous, we use the SAC [21] algorithm. We excluded on-policy algorithms like PPO [22] because they are widely known to be less sample efficient than the off-policy algorithms like SAC, TD3, and DDPG. We tested TD3 [23] and DDPG [24] as well, but SAC showed the best result.

The SAC algorithm is an actor-critic algorithm with a critic \(Q_\theta\) and an actor \(\pi_\phi\). The objective functions to update the critic are as follows: \[\begin{align} & J_Q(\theta) = \mathbb{E}_{(s_t, a_t)\sim \mathcal{D}} \left[ \frac{1}{2}\left( Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t) \right)^2\right], \\ & \hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1}\sim p}[V_{\bar{\theta}}(s_{t+1})], \end{align}\] where \(\mathcal{D}\) is a replay buffer, \(r(s_t, a_t)\) is a reward function, \(\gamma\) is a discount factor, and \(V_{\bar{\theta}}\) is the soft state value computed with the target-network parameters \(\bar{\theta}\). The objective function for updating the actor is as follows: \[\begin{align} J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}[\mathbb{E}_{a_t\sim\pi_{\phi}}[\alpha \log(\pi_\phi(a_t|s_t))-Q_\theta(s_t, a_t)] ], \end{align}\] with \(\alpha\) defined as a temperature parameter.
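For a single sampled transition, the two objectives reduce to a few scalar operations. The sketch below uses the standard SAC soft value \(V(s) = Q(s,a) - \alpha\log\pi(a|s)\) with \(a\sim\pi\), which is not spelled out in the text above; the default values of `GAMMA` and `ALPHA` are common choices assumed here, and the function names are hypothetical.

```python
GAMMA = 0.99  # discount factor (assumed common default)
ALPHA = 0.2   # temperature (assumed; often tuned automatically)


def soft_value(q_next: float, log_prob_next: float, alpha: float = ALPHA) -> float:
    """Single-sample estimate of V(s') = Q(s', a') - alpha * log pi(a'|s')."""
    return q_next - alpha * log_prob_next


def critic_target(reward: float, q_next: float, log_prob_next: float,
                  gamma: float = GAMMA, alpha: float = ALPHA) -> float:
    """Soft Bellman target: r + gamma * V(s')."""
    return reward + gamma * soft_value(q_next, log_prob_next, alpha)


def critic_loss(q_pred: float, target: float) -> float:
    """Per-sample critic objective: half the squared Bellman error."""
    return 0.5 * (q_pred - target) ** 2


def actor_loss(q_pred: float, log_prob: float, alpha: float = ALPHA) -> float:
    """Per-sample actor objective: alpha * log pi(a|s) - Q(s, a)."""
    return alpha * log_prob - q_pred
```

In practice these quantities are averaged over a batch sampled from the replay buffer and minimized with a gradient-based optimizer.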

4 Experiments

In this section, we validate our proposed method in three different environments: light-controlled darkroom, exposure control dataset [13], and real-world environments. Throughout the experiments, we provide the validation result of DRL design components, ablation study of reward and training strategy, convergent step comparison, comparison with built-in AE for object detection and feature extraction, and computational time analysis.

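Table 1: Self-evaluation of DRL-AE framework in the light-controlled darkroom. DR and CL indicate domain randomization and curriculum learning. "-" indicates the agent doesn’t converge. The best performance in each block is highlighted in bold.

| Framework Component | Methods | Reward per Frame | Frames to Converge |
|---|---|---|---|
| RL algorithm | DDPG [24] | 1.11 | - |
| | TD3 [23] | 1.03 | - |
| | SAC [21] | 1.61 | 5 |
| State | CNN | 0.85 | - |
| | Vector | 1.61 | 5 |
| Action | Absolute | 0.65 | - |
| | Relative | 1.61 | 5 |
| Reward terms (\(R_{flk}\), \(R_{noise}\)) | | 1.41 | - |
| | | 1.44 | 16 |
| | | 1.35 | 9 |
| | \(R_{flk}\) and \(R_{noise}\) | 1.61 | 5 |
| Training strategy (DR, CL) | | - | - |
| | | 1.64 | 15 |
| | DR and CL | 1.61 | 5 |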


4.1 Self-evaluation in Light-controlled Darkroom

We first validate our DRL design components and their variants in the light-controlled darkroom. We use the reward per frame and the average number of frames to converge as evaluation metrics to measure image quality and convergence speed, respectively. Here, we regard the agent as converged when the difference between the current and previous images is less than a certain threshold. The testing scenarios include various lighting conditions, such as fixed lighting, progressive light changes, and dynamic light changes. The results are shown in Tab. 1.
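The convergence criterion can be sketched as a frame-difference test. The difference measure (mean absolute pixel difference) and the threshold value are assumptions, since the text only states that a threshold is used; `frames_to_converge` is a hypothetical helper.

```python
from typing import List, Optional

Image = List[List[float]]  # gray-scale frame with values in [0, 1]


def mean_abs_diff(a: Image, b: Image) -> float:
    """Mean absolute pixel difference between two frames of equal size."""
    total = sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def frames_to_converge(frames: List[Image], threshold: float = 0.01) -> Optional[int]:
    """Index of the first frame that differs from its predecessor by less
    than `threshold`, or None if the sequence never settles."""
    for t in range(1, len(frames)):
        if mean_abs_diff(frames[t], frames[t - 1]) < threshold:
            return t
    return None
```

A sequence that keeps oscillating never satisfies the test, which corresponds to the "-" entries in Tab. 1.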

We found that SAC [21] shows the best result among the tested off-policy RL methods. The other algorithms reach a reward per frame of slightly above 1.0, but they usually oscillate instead of converging. For the state design, the CNN feature is not desirable for the exposure control task due to its intensity-agnostic property. Also, the absolute action space seems to make the overall learning process difficult because the agent needs to estimate the optimum values directly.

The reward function and the training strategies play an important role in stable and rapid convergence and in image attribute preservation. \(R_{noise}\) suppresses high noise levels and regularizes gain parameter control, leading to better convergence. Also, \(R_{flk}\) makes the agent preserve the image attributes during the exposure transition. CL gives the agent a comprehensive exposure control capability for the test set’s various lighting conditions. Additionally, DR allows the agent to converge quickly for arbitrary context by increasing its generalization ability.

Figure 2: Convergent step comparison in exposure control dataset [13]. Within three frames, our method already reaches a well-exposed image (a) with minimum exploration (b). On the other hand, Shin et al. [13] searches local areas with multiple steps (about 30 frames) to converge.

Figure 3: Real-world generalization. We compare our method with the camera’s built-in exposure control algorithm in real-world scenarios. The camera lens is initially occluded and suddenly uncovered in the first frame. Our agent converges to a well-exposed image within 3-5 frames. In contrast, the built-in AE algorithm is still in the middle of adjusting the exposure parameters and is far from a well-exposed image, especially in the indoor case. Note that our agent is only trained in the light-controlled darkroom, and this is the zero-shot inference result in the wild.

4.2 Convergent Step Comparison

Exposure Control Dataset [13]. The dataset provides multiple images with many different pairs of exposure and gain values, which are captured from the real world. The dataset consists of several locations, including indoor and outdoor places, with a wide range of exposure and gain values. Outdoor images can have the exposure time from \(100\mu s\) to \(7450 \mu s\), at intervals of \(150 \mu s\), and the gain from \(0 dB\) to \(20 dB\) with \(2 dB\) interval. Similarly, indoor images have exposure time from \(4 ms\) to \(67 ms\) with \(3 ms\) intervals and gain from \(0 dB\) to \(24 dB\) with \(2 dB\) intervals.
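The size of the parameter grid implied by these ranges can be checked with a short calculation. This assumes that both endpoints are included and that every exposure-gain combination is present, which the text does not state explicitly; the helper `grid` is hypothetical.

```python
from typing import List


def grid(start: int, stop: int, step: int) -> List[int]:
    """Inclusive range of evenly spaced integer settings."""
    return list(range(start, stop + 1, step))


outdoor_exposure_us = grid(100, 7450, 150)  # 50 exposure times
outdoor_gain_db = grid(0, 20, 2)            # 11 gain values
indoor_exposure_ms = grid(4, 67, 3)         # 22 exposure times
indoor_gain_db = grid(0, 24, 2)             # 13 gain values

outdoor_pairs = len(outdoor_exposure_us) * len(outdoor_gain_db)  # 550
indoor_pairs = len(indoor_exposure_ms) * len(indoor_gain_db)     # 286
```

Under these assumptions, each outdoor scene offers up to 550 exposure-gain pairs and each indoor scene up to 286, which is the discrete search space that the compared controllers move through.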

We evaluate our method against [13]. The final converged points are slightly different because of the difference between the proposed reward function and the assessment metric of [13]. Nevertheless, both algorithms converge to a comparable point, as shown in Fig. 2. Our method takes only three frames to converge, whereas the Nelder-Mead optimization method in [13] takes at least 30 timesteps to converge completely. Therefore, the latter is hard to use in real scenarios, although it may find a more optimal point.

Real-world Indoor and Outdoor Environment. We evaluate our method with the camera’s built-in AE control algorithm in real-world scenarios. The purposes of this experiment are twofold: 1) comparing convergence speed with the built-in AE algorithm, and 2) testing zero-shot generalization performance in the wild.

Before each run, we cover the camera lens long enough for the exposure to converge in the dark, then quickly uncover it to test the convergence speed under a sudden lighting change. Fig. 3 shows the first five images captured during each optimization. Our method converges to a well-exposed image within 3-5 frames in both indoor and outdoor scenes. However, the built-in AE algorithm takes much longer to converge: 30 frames for indoors and 10 frames for outdoors.

Also, we found that our agent shows satisfactory zero-shot generalization performance, even though it is only trained in the light-controlled darkroom with limited object context. We believe our state design (i.e., vectorized intensity history) and spatial domain randomization bring this result by removing the potential domain gap issue of the CNN feature and augmenting object context as much as possible.

Figure 4: SIFT [25] feature extraction result. Captured images from the proposed algorithm and built-in AE are processed to detect SIFT features. The images were simultaneously captured in real-time from two separate cameras equipped on a driving vehicle. Our method can provide plenty of SIFT features over the image plane. On average, our method detects 38% more features across a total of 5355 images.

4.3 Real-time Driving Env: Feature Extraction

In this experiment, two cameras are attached to the top of a moving car, and images are captured simultaneously. One camera is controlled by our algorithm, and the other by the built-in AE algorithm. We tested the algorithms on real-world driving scenarios, including campus and urban roads. Our algorithm runs on a laptop equipped with an i7-7700HQ@2.80GHz CPU. Given an image, the agent predicts exposure time and gain commands in real-time. The estimated actions are transmitted to the attached camera. After acquiring the driving sequences, we extract SIFT [25] features from each captured image. Please note that the images of DRL-AE and built-in AE have slightly different views due to the difference in the installed position.

Fig. 4 shows the comparison of feature extraction results. From the total of 5355 image pairs, our method produces 1,306 SIFT features on average with a 1,157 median value. On the other hand, the built-in AE method only results in 946 features on average, with a 711 median value. Therefore, 38% more features are detected on average, and the difference is up to 62% for the median value. The number of detected features and feature repeatability during exposure transition are critical for Visual Odometry (VO) and SLAM tasks. So, we believe our method can be valuable for VO, SLAM, and visual tracking tasks as well.

Figure 5: Object detection result. Captured images from the proposed algorithm and built-in AE are processed to detect target objects. The experiment used the same image sequence as the SIFT experiment. We utilize YOLO-v5 [26] for car and pedestrian detection. On average, our method detects 5% more objects compared to the built-in AE algorithm.

4.4 Real-time Driving Env: Object Detection

Similar to the feature extraction experiment, the captured images are processed with YOLO-v5 [26]. The images are taken from campus and urban road scenes, so we only take into account cars and pedestrians. Fig. 5 shows the comparison of object detection results from DRL-AE and built-in AE methods. Recent detection models, including YOLO-v5, adopt modern augmentation methods to make the model robust to image brightness changes. Therefore, YOLO-v5 tends to detect objects well even in poorly exposed images. Nevertheless, our algorithm detects 5% more objects in terms of the total number of detected objects. Furthermore, the objects in our images are detected much earlier than with the built-in AE method. This is highly critical for autonomous vehicles that are driving at high speed. Earlier detected objects can prevent human injury and potential accidents.

Figure 6: RoI-aware camera exposure control. Our agent is able to control camera exposure for a specific RoI or entire image. Given a RoI box, the agent adjusts the camera parameters to maximize the image attribute for a specific RoI area. It allows the camera to capture the detailed context for the regions of interest.

4.5 Real-world Env: RoI-aware Exposure Control

Our DRL-AE framework can also take arbitrary input sizes because the framework resizes the image before the vectorized intensity processing. Also, our domain randomization strategy produces a random size of the Region of Interest (RoI) patch by using crop, flip, resize, and rotation in the training stage. Therefore, our agent is able to control camera exposure for a specific RoI or entire image. Fig. 6 shows the RoI-aware exposure control results.

The agent adjusts the camera parameters to maximize the image attribute for a specific RoI area. It allows the camera to capture the detailed context for the regions of interest. We expect that DRL-AE can be combined with object detection, object tracking, and human gaze and attention. As a result, the combination can lead to adaptive exposure control schemes, such as attention-aware or detection-aware exposure control.

4.6 Computation Time Analysis

Our agent has a simple Multi-Layer Perceptron (MLP) architecture with two hidden layers of 256 units. Also, our method does not require complex matrix computation like convolutions. Therefore, our agent can run on a CPU device in real-time. We measure the inference time of the agent on the Ryzen 5950x CPU. Tab. 2 shows the computation time results, compared with Shin et al. [13]. The network’s inference takes 1 ms regardless of image resolution because the input image is resized to a fixed resolution. Also, even including other operations, such as image resizing and RoI cropping, it takes at most 6 ms, which is still in the real-time range. Therefore, it can run at roughly 170-1000 Hz on a CPU device.
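To give a sense of scale, the actor can be sketched as a plain MLP forward pass. Only the two hidden layers of 256 units come from the text; the input size of 512 follows from stacking four intensity vectors of length 128, while the ReLU activations, the tanh output squashing, the random initialization, and the deterministic two-value output (a SAC actor would normally also output a spread) are assumptions. All names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)

STATE_DIM = 4 * 128  # (n + 1) stacked intensity vectors of length S
HIDDEN = 256         # hidden width from the text
ACTION_DIM = 2       # exposure time change and gain change


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x: List[float], layer: Layer) -> List[float]:
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def make_actor(seed: int = 0) -> List[Layer]:
    rng = random.Random(seed)
    sizes = [STATE_DIM, HIDDEN, HIDDEN, ACTION_DIM]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def act(actor: List[Layer], state: List[float]) -> List[float]:
    """Forward pass: ReLU hidden layers, tanh-squashed output in [-1, 1]."""
    h = state
    for layer in actor[:-1]:
        h = [max(0.0, v) for v in linear(h, layer)]
    return [math.tanh(v) for v in linear(h, actor[-1])]


def parameter_count(actor: List[Layer]) -> int:
    return sum(len(w) * len(w[0]) + len(b) for w, b in actor)
```

Under these assumptions the actor has 197,634 parameters, so one forward pass costs about two hundred thousand multiply-adds, which is consistent with millisecond-level inference on a CPU.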

Table 2: Processing time analysis. Our algorithm does not use any complex metric or computation; the agent consists of two MLP layers with 256 hidden units, and the vectorized intensity history does not need complex operations. Therefore, our agent can be run on a CPU device in real-time. Here, we measure the processing time on the Ryzen 5950x CPU.

| Method | Image Size | Processing Time (ms) |
|---|---|---|
| Shin et al. [13] | 1600 x 1200 | 108.7 |
| | 800 x 600 | 18.2 |
| Ours | 1600 x 1200 | 1.0 |
| | 800 x 600 | 1.0 |

5 Conclusion & Future Work

Conclusion. In this paper, we proposed a novel joint exposure parameter control framework that exploits Deep Reinforcement Learning (DRL) to achieve instant exposure convergence and real-time processing. The proposed framework, named DRL-AE, effectively solves the challenges when applying DRL to the exposure control task, such as 1) training environment to provide diverse lighting change scenarios, 2) flickering and image attribute-aware reward design, 3) lightweight state design by using vectorized intensity history, and 4) domain generalization via spatial domain randomization strategy.

The proposed method is thoroughly validated in three different environments: light-controlled darkroom, exposure control dataset [13], and real-world environments. We demonstrate that our proposed method adjusts camera exposure within five steps with real-time processing of 1 ms on a CPU device. Also, our method shows satisfactory generalization performance in the wild. The images acquired from our method are well-exposed and show superiority in numerous computer vision tasks, such as feature extraction and object detection. To the best of our knowledge, our approach is the first solution that applies DRL to control camera exposure. We hope our paper encourages active research on advanced camera exposure control algorithms to achieve robust visual perception ability.

Future Work. This paper shows that DRL can be used in the field of camera exposure control. There are lots of open research topics, such as motion-aware AE control, advanced reward function, aperture control, hardware generalization over various cameras, and further domain generalization in the real world. In the future, we plan to extend the current darkroom environment to generate object or camera motion, allowing the agent to consider motion blur for exposure parameter control. Controlling camera aperture by using a mechanical aperture control module is another research direction.


This research was supported by a grant (P0026022) from R&D Program funded by Ministry of Trade, Industry and Energy of Korean government.

6 Appendix↩︎

In this supplementary material, we provide

  • Implementation and training details

  • Further discussion of RL component design

  • Discussion and future work

7 Implementation Details↩︎

Network Architecture. We adopt simple Multi-Layer Perceptron (MLP) layers as our DRL agent architectures (i.e., actor and critic). Tab. [tbl:architecture] shows the architecture details from the input to the output layers. Both the actor and critic networks consist of one input layer, two intermediate layers, and one output layer. The dimensions of each network’s input and output layers are determined by the dimensions of the state and action vectors. Given the state vector, the actor network estimates the next-step actions (i.e., exposure time and gain differences). The critic network receives a state and an action as input and estimates a Q-value. This Q-value is then used in the soft actor-critic training process [21].
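The actor and critic structure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the hidden width (`HIDDEN`), the state dimension (`STATE_DIM`), the ReLU activations, and the reading of "one input layer, two intermediate layers, and one output layer" as four fully connected layers are all hypothetical choices. The sketch also returns a deterministic, tanh-squashed action, whereas soft actor-critic trains a stochastic policy.

```python
import math
import random

STATE_DIM = 12    # hypothetical: depends on the stacked intensity vector length
ACTION_DIM = 2    # exposure time difference and gain difference (from the text)
HIDDEN = 16       # hypothetical hidden width


def init_layer(n_in, n_out, rng):
    """Create one fully connected layer: small random weights, zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def init_mlp(sizes, seed=0):
    """Build an MLP as a list of (weights, biases) pairs."""
    rng = random.Random(seed)
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def forward(mlp, x):
    """Apply every layer; ReLU follows all layers except the last one."""
    for i, (weights, biases) in enumerate(mlp):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(mlp) - 1:
            x = [max(0.0, v) for v in x]
    return x


# Input layer, two intermediate layers, output layer (four linear layers).
actor = init_mlp([STATE_DIM, HIDDEN, HIDDEN, HIDDEN, ACTION_DIM], seed=0)
critic = init_mlp([STATE_DIM + ACTION_DIM, HIDDEN, HIDDEN, HIDDEN, 1], seed=1)


def act(state):
    """Return a deterministic action with each component in [-1, 1]."""
    return [math.tanh(v) for v in forward(actor, state)]


def q_value(state, action):
    """Estimate a scalar Q-value from the concatenated state and action."""
    return forward(critic, list(state) + list(action))[0]
```

The critic's input size is the sum of the state and action dimensions, which matches the statement that the layer sizes follow from the state and action vectors.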

Training Settings. We train the agent on a machine equipped with a Ryzen 5950X CPU and an NVIDIA 3080Ti GPU. We use the Adam optimizer [27] with an initial learning rate of \(3\cdot10^{-4}\). The agent is trained with a batch size of 256 for 500k timesteps in a light-controlled darkroom environment. We set a maximum exposure time of 100 \(ms\) and a maximum gain of 40 \(dB\) as the control bounds of the machine vision camera. In the darkroom environment with controlled LED lighting, the agent stores the exposure transitions of each episode in the replay buffer. The actor and critic networks are optimized using transition batches sampled from the buffer. The maximum episode length is set to 200 steps. Each episode usually takes about 20 seconds, including training and image acquisition time. The agent is validated every 2000 timesteps. Training usually takes about 18 hours in total under these conditions. For the other hyperparameters, we follow the default settings described in [21] and summarize them in Tab. [tbl:hyperparameters].
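As a rough consistency check on these settings, the following sketch derives the episode count and wall-clock time implied by the stated numbers. It assumes every episode runs for the full 200 steps, so the episode count is a lower bound. The function name `training_budget` is hypothetical, and the results are plain arithmetic, not measurements.

```python
TOTAL_TIMESTEPS = 500_000     # total environment steps (from the text)
MAX_EPISODE_STEPS = 200       # maximum episode length (from the text)
SECONDS_PER_EPISODE = 20      # approximate, incl. training and image acquisition
VALIDATION_INTERVAL = 2_000   # the agent is validated every 2000 timesteps


def training_budget():
    """Return (episodes, hours spent in episodes, number of validation runs).

    Assumes all episodes reach the maximum length; shorter episodes would
    increase the episode count.
    """
    episodes = TOTAL_TIMESTEPS // MAX_EPISODE_STEPS
    hours = episodes * SECONDS_PER_EPISODE / 3600
    validations = TOTAL_TIMESTEPS // VALIDATION_INTERVAL
    return episodes, hours, validations


episodes, hours, validations = training_budget()
print(f"{episodes} episodes, about {hours:.1f} h, {validations} validation runs")
```

Under these assumptions the episodes alone account for roughly 14 hours. The gap to the reported 18 hours would then be left for the 250 validation runs and other overhead, although the text does not break the total down.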


Light-controlled Darkroom. We build the darkroom environment so that we can freely control the lighting. Our aim is to provide the RL agent with a wide range of lighting conditions within a much shorter training time than real sunlight conditions would allow. The darkroom is made of aluminum profiles and black acrylic plates of 5 mm thickness. Fig. 7 and Fig. 8 show the constructed environment and the specification of each component. We use a global shutter machine vision camera from Teledyne FLIR, which has a 3.2MP Sony IMX265 sensor. For the light controller unit, we wrote a program for the STM32F446RE Nucleo board, which has a 180 MHz ARM Cortex-M4 CPU and 512 KB of flash memory. Our LED bars are based on WS2812B LEDs, with 144 LEDs per meter. We use two LED bars in the darkroom environment. The light controller communicates with the RL gym environment running on the RL server through serial communication.

Figure 7: Light-controlled darkroom.

Figure 8: Hardware specification used in the darkroom.

8 Design Philosophy for RL Component↩︎

In this section, we describe the underlying philosophy of designing the RL components for the exposure control task.

8.1 State Design↩︎

CNN Feature. Perhaps the most naïve state design is to use a CNN model to extract a feature map from the image and use that feature map as the state. However, a CNN-based state design has three disadvantages. First, the extracted feature map has no clear relationship with the camera exposure level. CNN backbones (e.g., ImageNet, VGG, ResNet) are usually trained in a brightness-agnostic manner because of their data augmentation strategies. Therefore, the extracted features usually carry semantic information rather than brightness information. Second, a CNN introduces additional domain gap problems in real-world inference. Third, state extraction with a CNN model is computationally heavy, introduces additional learnable parameters, and requires a large amount of system memory to store the image states in the replay buffer. As a result, a CNN state brings undesirable properties for deep reinforcement learning, such as a smaller replay buffer, limited sample diversity, longer training time, generalization problems, and an unclear representation of brightness.

Intensity Value. Therefore, instead of using the CNN model, we utilize the averaged intensity values along the x-axis. The primary purpose of auto-exposure control is to ensure a proper image brightness level by adjusting camera exposure settings. Therefore, image intensity is the primary cue and a straightforward and effective representation for auto-exposure control. Also, we reduce the dimension by averaging intensity value along the x-axis rather than utilizing entire images to ensure high-level memory efficiency with minimum computation burden. Lastly, we stack 1-dimensional averaged intensity values of 3 frames and define the stacked intensity values as a state. We empirically found that frame stacking has a significant impact on making better decisions. By stacking consecutive frames, the agent is able to implicitly observe the lighting condition change and camera exposure change over a short period.
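The state construction described above can be sketched as follows. Several details are assumptions made only for illustration: "averaging along the x-axis" is read as taking the mean of each image row (one value per row), intensities are normalized to [0, 1], and the history is padded with copies of the first frame at the start of an episode. The names `row_mean_intensity` and `IntensityState` are hypothetical.

```python
from collections import deque

NUM_STACKED_FRAMES = 3  # number of stacked frames (from the text)


def row_mean_intensity(image):
    """Average a grayscale image along the x-axis.

    `image` is a list of rows with pixel values in [0, 255]. The result has
    one value per row, normalized to [0, 1] (normalization is an assumption).
    """
    return [sum(row) / (len(row) * 255.0) for row in image]


class IntensityState:
    """Keep the averaged intensity vectors of the most recent frames."""

    def __init__(self, num_frames=NUM_STACKED_FRAMES):
        self.history = deque(maxlen=num_frames)

    def update(self, image):
        """Add a new frame and return the flattened, stacked state vector."""
        vec = row_mean_intensity(image)
        if not self.history:
            # Assumption: at the first frame, fill the history with copies.
            self.history.extend([vec] * self.history.maxlen)
        else:
            self.history.append(vec)  # the oldest frame is dropped automatically
        return [v for frame in self.history for v in frame]


# Usage: a dark 2x4 frame followed by a bright one.
state_builder = IntensityState()
dark = [[0] * 4 for _ in range(2)]
bright = [[255] * 4 for _ in range(2)]
first_state = state_builder.update(dark)
second_state = state_builder.update(bright)
```

Because the stacked vector contains both older and newer intensity profiles, a brightness change between frames is visible in the state itself, which is the implicit observation of lighting and exposure change mentioned above.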

8.2 Action Design↩︎

Discrete vs. Continuous Action Space. We initially considered designing the system to obtain quantized exposure and gain values using a discrete action space. However, due to the sensitive nature of the parameter optimization, which often causes oscillation, it was difficult to fully account for the changes even with a higher level of quantization. Therefore, we formulate the auto exposure control problem as a continuous action control task.

Absolute vs. Relative Action Range. The agent can output either absolute or relative action values. In the former case, the agent estimates the desired absolute values (e.g., an exposure time of 10 \(ms\) and a gain of 4 \(dB\)). In the latter case, the output is a relative difference in action values (e.g., \(\pm\)10 \(ms\), \(\pm\)2 \(dB\)). Relative control is more stable but has the disadvantage of slower convergence. On the other hand, absolute control can reach the desired value in one step but is more likely to be unstable. In practice, we found that absolute control had difficulty learning good policies. Therefore, we make the agent estimate relative actions for the exposure control task.
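A relative action update with bounded camera parameters might look like the sketch below. The upper bounds come from the training settings (100 ms, 40 dB). The lower bounds, the per-step scaling (taken here from the illustrative ±10 ms and ±2 dB values), and the assumption that the policy outputs values in [-1, 1] are hypothetical.

```python
MAX_EXPOSURE_MS = 100.0       # control bound from the training settings
MAX_GAIN_DB = 40.0            # control bound from the training settings
MIN_EXPOSURE_MS = 0.1         # hypothetical lower bound
MIN_GAIN_DB = 0.0             # hypothetical lower bound
MAX_EXPOSURE_STEP_MS = 10.0   # hypothetical per-step change for action = 1
MAX_GAIN_STEP_DB = 2.0        # hypothetical per-step change for action = 1


def clip(value, low, high):
    """Clamp `value` to the closed interval [low, high]."""
    return max(low, min(high, value))


def apply_relative_action(exposure_ms, gain_db, action):
    """Apply a relative action (two values in [-1, 1]) to the current settings.

    Returns the new (exposure_ms, gain_db), clipped to the control bounds.
    """
    d_exposure, d_gain = (clip(a, -1.0, 1.0) for a in action)
    new_exposure = clip(exposure_ms + d_exposure * MAX_EXPOSURE_STEP_MS,
                        MIN_EXPOSURE_MS, MAX_EXPOSURE_MS)
    new_gain = clip(gain_db + d_gain * MAX_GAIN_STEP_DB,
                    MIN_GAIN_DB, MAX_GAIN_DB)
    return new_exposure, new_gain
```

Since each step changes the parameters by a bounded amount, a single poor action cannot move the exposure far from its current value. This reflects the stability advantage of relative control, at the cost of needing several steps to cover a large change.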

8.3 Reward Design↩︎

Designing reward functions is crucial in deep reinforcement learning, as it determines the desired objective of the learning process. In this context, we will briefly mention the rationale behind the three reward functions presented in the main text. A common desired goal for auto exposure control is to acquire a high-quality image that has moderate brightness, low-level noise, and sharp edge information. Also, the convergence of the control process should be fast but stable.

Therefore, the proposed reward functions are designed for these desired objectives. First, the mean reward term \(\mathcal{R}_{mean}\) helps to ensure that the image has a median brightness. Instead of designing it linearly, we made it decrease more steeply around the median by adding non-linearity with \(p_m\) to maintain the center. The second flickering term \(\mathcal{R}_{flk}\) suppresses the image flickering effect caused by the action’s vibration and ensures smooth exposure transition while preserving image attributes. Lastly, the noise term \(\mathcal{R}_{noise}\) reduces the overall image noise caused by excessively high gain, encouraging a balanced control between the exposure and gain parameters.
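To make the roles of the three terms concrete, the sketch below gives one possible form for each. It does not reproduce the formulas from the main text. The functional forms, the target brightness of 0.5, the default exponent, and the weights `w_flk` and `w_noise` are all assumptions chosen to show the qualitative behavior described above.

```python
def mean_reward(mean_intensity, target=0.5, p_m=0.5):
    """Illustrative brightness term: 1 at the target, 0 at the farthest value.

    With p_m < 1 the reward drops steeply near the target, which is one way
    to realize the non-linearity described in the text (assumed form).
    """
    deviation = abs(mean_intensity - target) / max(target, 1.0 - target)
    return 1.0 - deviation ** p_m


def flicker_penalty(mean_intensity, prev_mean_intensity):
    """Illustrative flicker term: penalize frame-to-frame brightness jumps."""
    return -abs(mean_intensity - prev_mean_intensity)


def noise_penalty(gain_db, max_gain_db=40.0):
    """Illustrative noise term: penalize high gain relative to its maximum."""
    return -gain_db / max_gain_db


def total_reward(mean_intensity, prev_mean_intensity, gain_db,
                 w_flk=1.0, w_noise=0.1):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (mean_reward(mean_intensity)
            + w_flk * flicker_penalty(mean_intensity, prev_mean_intensity)
            + w_noise * noise_penalty(gain_db))
```

In this sketch, a well-exposed and stable frame captured at zero gain receives the maximum reward of 1, while brightness deviation, frame-to-frame change, and high gain each reduce it.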

9 Discussion and Future Work↩︎

Figure 9: Impact of AE control on post-processing methods.

Camera AE vs. ISP. There are two main processes in camera image processing: Automatic Exposure (AE) control and Image Signal Processing (ISP). AE control covers automatic gain, exposure time, and aperture control. The roles of AE control and ISP are quite different. The former aims to capture high-quality images while rapidly adjusting the exposure level within hardware limitations. The latter then enhances the quality of the acquired image for its purpose by using various ISP tools, such as demosaicking, deblurring, denoising, color space correction, gamma correction, tone mapping, HDR, and more. Therefore, AE control (hardware level) and ISP (software level) are complementary rather than competing. The better the image obtained in the former stage, the better the quality achieved in the latter stage. We evaluate our method in combination with a conventional contrast enhancement and tone mapping method (e.g., Photoshop) (e) and with Zero-DCE [28] (f). As shown in Fig. 9, the exposure control result strongly affects the final output ((b) vs. (e), (c) vs. (f)).

Figure 10: Example image from Unreal-based camera simulator [29]. (a) Interface overview, (b) Auto-Exposure, (c) Over-exposed, (d) Under-exposed.

Sim2Real via Camera Simulator. From the perspective of object, motion, and lighting diversity, a simulated camera model could be beneficial to learning camera exposure control with deep reinforcement learning. Modern photorealistic simulations, such as Unreal, Unity, and Blender, provide similar quality images to the real world and support partial functionality for auto-exposure control.

However, introducing simulation causes two domain gap issues: the gap 1) between actual and simulated environments and 2) between simulated and real camera acquisition models. The former leads to large performance differences between models trained on simulated data and models trained on real data, as reported in the domain adaptation literature. As for the latter, the simulated camera model provides an incomplete image acquisition model. As shown in Fig. 10-(c), when the camera captures an image with high gain or ISO, the image should contain severe noise. However, the simulation does not support this functionality. Because of these issues, we decided to use the darkroom environment to investigate the feasibility of deep reinforcement learning for automatic exposure control.

Future Work: DRL-AE in Simulation. Although the simulation has some disadvantages, it has many advantages, such as faster interaction speed, easy parallelization, and diverse controllable parameters. Therefore, in future work, we also plan to study Sim2Real based camera exposure control and compare the Sim2Real model with the real-world model trained with this paper’s method.

Future Work: Motion-aware AE Control. Accounting for motion blur is another future direction. As the proposed darkroom environment has only a fixed target object, it is difficult to consider the motion blur that frequently occurs in the real world. In future work, we plan to extend the current darkroom environment to generate object motion, allowing the agent to account for motion blur in its exposure parameter control.

Future Work: Various Reward Functions. The reward design proposed in this paper is a basic form for camera exposure control. However, it can easily be extended by incorporating modern image assessment metrics [13], [14], [20]. We could also use human preference, network inference results (e.g., detection confidence), or the number of detected features as a reward function.

Future Work: Aperture Control. The machine vision camera used in the experiments has a fixed aperture size, which cannot be controlled by software. However, the aperture is also a parameter that affects the camera exposure level and depth of field. Therefore, we plan to control the aperture size by using a mechanical aperture control module.



  1. *Both authors contributed equally to this work.↩︎

  2. Further visualization and video demos are available at https://sites.google.com/view/drl-ae.↩︎