March 30, 2024
Generative Artificial Intelligence (AI) has pioneered new methodological paradigms in architectural design, significantly expanding the innovative potential and efficiency of the design process. This paper explores the extensive applications of generative AI technologies in architectural design, a trend that has benefited from the rapid development of deep generative models. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) have been extensively applied in this field, significantly advancing design innovation and efficiency. With continual technological advancements, state-of-the-art Diffusion Models and 3D Generative Models are being progressively integrated into architectural design, offering designers a more diversified set of creative tools and methodologies. This article further provides a comprehensive review of the basic principles of generative AI and large-scale models and highlights their applications in the generation of 2D images, videos, and 3D models. In addition, by reviewing the latest literature since 2020, this paper scrutinizes the impact of generative AI technologies at different stages of architectural design, from generating initial architectural 3D forms to producing final architectural imagery. The marked trend of research growth indicates an increasing inclination within the architectural design community towards embracing generative AI, thereby catalyzing a shared enthusiasm for research. These research cases and methodologies have not only proven to enhance efficiency and innovation significantly but have also posed challenges to the conventional boundaries of architectural creativity. Finally, we point out new directions for design innovation and articulate fresh trajectories for applying generative AI in the architectural domain.
This article provides the first comprehensive literature review about generative AI for architectural design, and we believe this work can facilitate more research work on this significant topic in architecture.
Keywords: Generative AI, Architectural Design, Diffusion Models, 3D Generative Models, Large-scale models.
Nowadays, generative artificial intelligence (AI) techniques are increasingly expanding their influence and revolutionizing architectural design. Here, generative AI refers to artificial intelligence technologies dedicated to content generation, such as text, images, music, and videos. Generative AI benefits from the rapid development of deep generative models, including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Diffusion Models (DMs). GANs and VAEs are traditional generative models and have been widely explored in architectural design, as illustrated in Figure 1. In this paper, we focus on the recent progress of generative AI, especially the revolutionary diffusion models. DMs have achieved state-of-the-art performance in various content generation tasks such as text-to-image and text-to-3D generation.
Architectural design may encompass multiple themes and scopes, with each project having distinct design requirements and individual styles, leading to diversity and complexity in design approaches. In this work, we adopt 6 main steps in the architectural design process for the literature review: 1) architectural preliminary 3D forms design, 2) architectural layout design, 3) architectural structural system design, 4) detailed and optimization design of architectural 3D forms, 5) architectural facade design, and 6) architectural imagery expression. After exploring the research papers from 2020 to 2023, we observed a significant increase in the number of research papers on architectural design using generative AI. The number of research papers using generative AI technology in different architectural design steps reveals the development trends within each subfield, as illustrated in Figure 2 (a). Most research is concentrated in the area of architectural layout design. Research in preliminary 3D form design of architecture and architectural imagery expression has rapidly increased in the past two years. More research needs to be done on architectural structural system design, detailed and optimization design of architectural 3D forms, and architectural facade design.
This sustained growth trend clearly demonstrates that generative AI in architectural design is expanding at an unprecedented rate, and it also reflects the high level of attention and increasing investment in generative AI technologies from the architectural design and computer science communities. The most used generative AI techniques are illustrated in Figure 2 (b). In computer science, many studies focus on GANs and VAEs, while research on DDPM, LDM, and GPT is in its initial stages. The situation is the same in architecture.
Leveraging recent generative AI models in architectural design could significantly improve design efficiency and provide architects with new design processes and ideas, expanding the possibilities of architectural design and revolutionizing the entire design process. However, the use of advanced generative models in architectural design has not been explored extensively. The primary reasons hindering their use may lie in two aspects: professional barriers and the issue of training data.
In terms of professional barriers, deep learning and architectural design are highly specialized fields requiring extensive professional knowledge and experience. The aim of this study is to narrow the professional barriers between architecture and computer science, and assist architectural designers in bridging Generative AI technologies with applications, promoting interdisciplinary research, and delineating future research directions. This review systematically analyzes and summarizes case studies and research outcomes of Generative AI applications in architectural design, and showcases the possibilities and potential of the intersection between computer science and architecture. This interdisciplinary perspective encourages collaboration among experts from different fields to address complex issues in architectural design, thus advancing scientific research and technological innovation.
In terms of the issue of training data, deep learning models require high-quality training data to analyze and verify their generalization ability. However, data in the field of architecture is usually unstructured. The search and organization of architectural training data pose a significant challenge, making progress difficult right from the initial stages of model training. In addition, high-performance Graphics Processing Units (GPUs) are required to train deep learning models on millions of data samples, especially models dealing with complex images and datasets. The scarcity of high-performance GPUs and the difficulty of mastering GPU programming skills may prevent architects from exploring recent diffusion models and large foundation models.
This article first introduces the development and application directions of generative AI models, then elaborates on the methods of applying generative AI in the architectural design process, and finally, forecasts the potential application development of generative AI in the architectural field.
In section 2, the article offers an in-depth introduction to the principles and evolution of various generative AI models, with a focus on Diffusion Models (DMs), 3D Generative Models, and Foundation Models. In section 2.1, the article elaborates on the principles and development of Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). In section 2.2, the discourse on Diffusion Models elaborates on the working mechanisms and developmental trajectories of DDPM and LDM. In section 2.3, the segment on 3D Generative Models zeroes in on 3D shape representation, encompassing Voxels, Point Clouds, Meshes, and Implicit functions. Within implicit functions, the paper details Occupancy Fields, Signed Distance Functions (SDF), Unsigned Distance Functions (UDF), and Neural Radiance Fields (NeRF), explaining their respective operational principles. In section 2.4, the Foundation Models section comprehensively describes the progress and achievements of Large Language Models (LLMs) and Large Vision Models. In section 2.5, the paper discusses the applications and developments of these models in image generation, video generation, and 3D model generation.
In section 3, this paper delves into the application development of generative AI models in architectural design. Given the complexity of the architectural design process, this article delineates the architectural design process into six steps, as presented in introduction. In each step, the article summarizes and discusses the current application methods of generative AI models in these six domains. By analyzing these research papers, the study demonstrates how generative AI can facilitate innovation in architectural design, improve design efficiency, and optimize architectural solutions. Throughout this summarization process, literature retrieval was conducted using databases such as Cumincad and Web of Science, supplemented by searches on Litmaps. To ensure the targeted and accurate nature of the search, specific search queries were set for each design process.
In Section 4, this article explores the potential applications of generative AI technology in generating architectural design images, architectural design videos, architectural design 3D models, and human-centric architectural design. In section 4.1, it anticipates applications of architectural design image generation for producing floor plans, facade images, and architectural images. In section 4.2, regarding architectural design video generation, it foresees applications such as generating videos from a single architectural image, generating videos from architectural images, and style transfer for specific video content. In section 4.3, regarding architectural design 3D model generation, it envisions possibilities in generating 3D models from images and text prompts, transferring styles to 3D models, and generating and editing detailed styles for 3D models. In section 4.4, it elaborates on the potential of generative AI in enhancing the human-centric architectural design process.
Generative AI models are currently experiencing rapid development, with new methods continually emerging. The evolution of deep learning-based approaches, particularly Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Diffusion Models (DMs), has significantly advanced image generation techniques. VAEs played a pioneering role in deep learning-based generative models; they employ an encoder-decoder architecture integrated with probabilistic graphical models to learn latent representations for image generation [6]. GANs represent a milestone in image generation: with a generator and a discriminator, GANs engage in an adversarial training process that prompts the generator to generate images progressively resembling the distribution of real data [7], [8]. Moreover, diffusion models stand out as the most revolutionary technology to emerge in recent years, with remarkable image generation quality [9], [10].
Generative Adversarial Network (GAN) [11] comprises a generator \(G\) and a discriminator \(D\), as illustrated in Figure 3. The \(G\) is responsible for generating samples from noise \(z\), while the \(D\) determines the authenticity of the generated samples \(G(z)\) against the ground truth image \(\bar{x}\). Ideally: \[\label{equ:GAN_D} D(\bar{x})=1, \quad D(G(z))=0\tag{1}\] This adversarial nature enables the model to maintain a dynamic equilibrium between generation and discrimination, propelling the learning and optimization of the entire system. Despite its advantages, GAN still faces challenges, such as mode collapse during training.
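The adversarial objective in Equation (1) can be sketched as a pair of binary cross-entropy losses. This is a minimal illustration of the standard (non-saturating) GAN objective, not any specific paper's implementation:

```python
import math

def d_loss(d_real, d_fake):
    # Discriminator objective: push D(x) -> 1 on real data and D(G(z)) -> 0 on fakes
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def g_loss(d_fake):
    # Generator objective (non-saturating form): push D(G(z)) -> 1
    return -math.log(d_fake)
```

A confident discriminator (e.g. `d_real = 0.9`, `d_fake = 0.1`) yields a small `d_loss`, while a fooled discriminator raises it; training alternates between minimizing the two losses, which produces the dynamic equilibrium described above.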
Conditional image generation is a technique that controls the generation process by introducing conditional information, so as to generate images that match given conditions such as text, labels, or hand-drawn sketches. By introducing additional input conditions, the generator can produce images with specific properties based on the conditional information. To address the limited controllability of GAN models, Conditional GAN (CGAN) [12] was introduced, which uses additional auxiliary information as a condition to fine-tune both the \(G\) and \(D\). The \(G\) of CGAN receives conditional information besides random noise. By providing conditional information to the \(G\), CGAN can more precisely control the generated results. Additionally, variants such as pix2pix [13] and StyleGAN [7] have been developed.
In image generation, diffusion models outperform GANs and VAEs [14], [15]. Most diffusion models currently used are based on Denoising Diffusion Probabilistic Models (DDPM) [15], which simplify the diffusion model through variational inference. As shown in Figure 3, diffusion models contain both a forward diffusion process and a reverse denoising (inference) process. The forward process follows the concept of a Markov chain and turns the input image into Gaussian noise. Given a data sample \(x_0\), Gaussian noise is progressively added to the data sample over \(T\) steps in the forward process, producing the noisy samples \(x_t\), where the timestep \(t=\{1, \ldots, T\}\). As \(t\) increases, the distinguishable features of \(x_0\) gradually diminish. Eventually, when \(T \rightarrow \infty\), \(x_T\) is equivalent to a Gaussian distribution with isotropic covariance. In addition, the inference process can be understood as a sequence of denoising autoencoders with the same weights \(\epsilon_\theta\left(x_t, t\right)\) (\(\epsilon_\theta\) is typically implemented as a U-Net [16]), which are trained to forecast denoised images from their corresponding inputs \(x_t\).
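The forward process admits a well-known closed form, \(x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon\). The sketch below illustrates this; the linear \(\beta\) schedule is one common choice and is used here purely for illustration:

```python
import math

def linear_alphas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for a linear beta schedule (a common choice)."""
    alphas_bar, prod = [], 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        prod *= 1.0 - beta
        alphas_bar.append(prod)
    return alphas_bar

def q_sample(x0, t, alphas_bar, eps):
    """Closed-form forward step: x_t = sqrt(ab_t) * x_0 + sqrt(1 - ab_t) * eps."""
    ab = alphas_bar[t]
    return [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
```

As \(t\) grows, \(\bar{\alpha}_t\) decays towards zero, so `q_sample` outputs approach pure Gaussian noise, matching the description of \(x_T\) above.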
Different from DDPM, the Latent Diffusion Model (LDM) [9] does not operate directly on images but in a latent space, a strategy called perceptual compression. LDM reduces the dimensionality of the data by projecting it into a low-dimensional, efficient latent space, in which high-frequency, imperceptible details are abstracted away. The framework of LDM is illustrated in Figure 4. After the image \(x\) is compressed by the encoder \(\mathcal{E}\) into the latent representation \(z\), the diffusion process is performed in the latent representation space. LDM has a diffusion process similar to that of DDPM. Finally, LDM infers the data sample \(z\) from the noise \(z_T\), and the decoder \(\mathcal{D}\) restores \(z\) to the original pixel space to obtain the result image \(\widetilde{x}\).
Specifically, given an image \(x \in \mathbb{R}^{H \times W \times 3}\) with height \(H\) and width \(W\) in \(RGB\) space, LDM first utilizes an encoder \(\mathcal{E}\) to encode the image \(x\) into a latent representation space: \[\label{equ:1} z=\mathcal{E}(x)\tag{2}\] where \(z \in \mathbb{R}^{h \times w \times c}\) with height \(h\) and width \(w\); the constant \(c\) represents the number of channels. Then \(\mathcal{D}\) recovers the image from the latent representation space: \[\label{equ:2} \tilde{x}=\mathcal{D}(z)=\mathcal{D}(\mathcal{E}(x))\tag{3}\]
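Equations (2) and (3) can be illustrated with toy stand-ins for \(\mathcal{E}\) and \(\mathcal{D}\). In LDM these are a learned VAE-style encoder and decoder; the average-pooling and nearest-neighbour upsampling below are purely illustrative assumptions that only demonstrate the dimensionality reduction:

```python
def encode(x, factor=2):
    """Toy stand-in for the encoder E: average-pool an HxW grid down by `factor`."""
    h, w = len(x) // factor, len(x[0]) // factor
    return [[sum(x[i * factor + di][j * factor + dj]
                 for di in range(factor) for dj in range(factor)) / factor ** 2
             for j in range(w)] for i in range(h)]

def decode(z, factor=2):
    """Toy stand-in for the decoder D: nearest-neighbour upsample back to pixel space."""
    return [[z[i // factor][j // factor] for j in range(len(z[0]) * factor)]
            for i in range(len(z) * factor)]
```

A 4x4 "image" becomes a 2x2 latent, so the diffusion steps run on a quarter of the data; the decoder maps the denoised latent back to the original resolution.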
To accelerate the generation speed, the Latent Consistency Model (LCM) [17] was proposed to optimize the step of denoising inference.
In the field of three-dimensional shape modeling, implicit functions are commonly represented in three ways: as an Occupancy Field, a Signed Distance Function (SDF), or an Unsigned Distance Function (UDF), alongside the recently emerging Neural Radiance Fields (NeRF).
Representation in 3D visual problems can generally be divided into four categories: voxel-based, point cloud-based, mesh-based, and implicit representation-based.
Voxel. As shown in Figure 5 (a), the voxel format describes a 3D object as a matrix of volume occupancy, where the size of the matrix is fixed. Researchers [18] adopted the voxel representation for generating 3D shapes. The voxel format requires high resolution to describe fine-grained details, so as the shape resolution increases, the computational cost explodes. The reconstruction results of voxel-based research are limited in resolution and provide neither topological guarantees nor sharp features.
Point Cloud. As shown in Figure 5 (b), point clouds are a lightweight 3D representation composed of \((x, y, z)\) coordinate values and are a natural way to represent shapes. PointNet [19] extracts global shape features using a symmetric max-pooling operation and is widely used as an encoder for point-based generative networks [20]. However, point clouds do not represent topology and are unsuitable for generating watertight surfaces.
Mesh. As shown in Figure 5 (c), meshes are widely used and are constructed from vertices and faces. The method in [21] deforms a pre-defined template, restricted to a fixed topology, using graph convolutions. Recently, meshes have been used to represent shapes in deep learning techniques [22]. Although meshes are more suitable for describing the topological structure of objects, they usually require advanced preprocessing steps.
Implicit. As shown in Figure 5 (d), implicit representation describes a surface as the zero level set of a volume function \(\psi : R^3 \to R\), whose values can be adjusted. A 3D shape can be represented as the level set of a deep network that maps 3D coordinates to a signed distance function [23] or an occupancy field [24]. Implicit representation creates a lightweight, continuous shape representation with no resolution limits.
Occupancy Field is one of the deep learning-based implicit function methods [24]. It assigns a binary value to each point in three-dimensional space, indicating whether the point is occupied by an object. This approach utilizes neural networks to learn the representation of occupancy fields, facilitating highly detailed three-dimensional reconstruction. The advantage of the Occupancy Field lies in its dynamic modeling of object occupancy in scenes, making it suitable for handling complex three-dimensional environments.
SDF. Building upon the Occupancy Field, the Signed Distance Function (SDF) has become a crucial direction in implicit function representation within deep learning. SDF assigns a signed distance value to each point, indicating the shortest distance from the point to the object’s surface. Positive values signify points outside the object, while negative values indicate points inside the object. As shown in Figure 6, DeepSDF [23] provides an end-to-end approach for continuous SDF learning, enabling precise modeling of irregular shapes and local geometry.
UDF. UDF and SDF are two distinct yet interrelated implicit function representation approaches. UDF assigns an unsigned distance value to each point, representing the distance to the nearest surface without considering surface direction. UDF is particularly useful for capturing more intuitive surface distance information without involving directional aspects. Zhao et al. [26] contribute significantly by jointly exploring the learning of both signed and unsigned distance functions. This approach aims to enrich the expressiveness of implicit functions, simultaneously capturing intricate details through both signed and unsigned distance information.
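The three implicit representations discussed above can be compared on a simple analytic shape. The sphere below is an illustrative assumption; a learned model such as DeepSDF replaces these closed-form functions with a neural network:

```python
import math

def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return math.dist(p, center) - radius

def sphere_udf(p, center=(0.0, 0.0, 0.0), radius=1.0):
    """Unsigned distance: magnitude only, with no inside/outside sign."""
    return abs(sphere_sdf(p, center, radius))

def occupancy(p, center=(0.0, 0.0, 0.0), radius=1.0):
    """Occupancy field: 1 if the point lies inside the object, else 0."""
    return 1 if sphere_sdf(p, center, radius) < 0 else 0
```

The sign carried by the SDF is exactly what the UDF discards, which is why UDFs extend naturally to open, non-watertight surfaces where "inside" is undefined.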
NeRF. Neural Radiance Fields (NeRF) [25] have revolutionized the fields of computer vision and graphics by introducing a novel approach to scene representation. As shown in Figure 7, at the heart of NeRF lies the concept of representing a scene as a continuous function capturing radiance information at every point. The fundamental equation driving NeRF is the rendering equation, which mathematically formulates the observed radiance along a viewing ray. The NeRF formulation is expressed as:
\[C(\mathbf{p}) = \int T(\mathbf{p}_t) \cdot \sigma(\mathbf{p}_t) \cdot L(\mathbf{p}_t, -\mathbf{d}) \, d\mathbf{p}_t\]
where \(C(\mathbf{p})\) represents the observed color at point \(\mathbf{p}\), \(\mathbf{p}_t\) represents points along the viewing ray, \(T(\mathbf{p}_t)\) is the transmittance function, \(\sigma(\mathbf{p}_t)\) represents volume density, and \(L(\mathbf{p}_t, -\mathbf{d})\) represents emitted radiance. NeRF introduces an implicit representation, enabling the encoding of detailed and continuous volumetric information. This allows for high-fidelity reconstruction and rendering of scenes with fine-scale structures, surpassing the limitations of explicit representations. Recently, 3D Gaussian Splatting [27] was introduced, projecting 3D information onto a 2D domain using Gaussian kernels, and has achieved better performance than NeRF.
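In practice the rendering integral above is evaluated by numerical quadrature along each ray. The sketch below assumes scalar (grayscale) radiance and a uniform step size `delta` for simplicity:

```python
import math

def render_ray(sigmas, colors, delta):
    """Quadrature of the NeRF volume rendering integral along one ray.
    sigmas: densities at the samples; colors: emitted radiance; delta: step size."""
    C, T = 0.0, 1.0   # accumulated color and transmittance
    for sigma, c in zip(sigmas, colors):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        C += T * alpha * c                      # contribution weighted by transmittance
        T *= 1.0 - alpha                        # light remaining after this segment
    return C
```

An empty ray (all densities zero) contributes no color, while a fully opaque first sample dominates the result, mirroring the role of the transmittance term \(T(\mathbf{p}_t)\).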
In computer science, foundation models, also called large-scale models, are deep learning models with numerous parameters and intricate structures, applied particularly to natural language processing and computer vision tasks. These models demand substantial computational resources for training but exhibit exceptional performance across diverse tasks. The evolution from basic neural networks to sophisticated diffusion models, as depicted in Figure 8, illustrates the continuous quest for more robust and adaptable AI systems.
Transformer. The Transformer model has achieved remarkable success in natural language processing (NLP). It consists of several components: an encoder, a decoder, positional encoding, and the final linear and softmax layers. Both the encoder and decoder are composed of multiple identical layers, each containing attention sublayers and feedforward network sublayers. Additionally, positional encoding injects positional information into the text embeddings, indicating the position of words within the sequence. Notably, the Transformer has paved the way for two prominent Transformer-based models: Bidirectional Encoder Representations from Transformers (BERT) [28] and Generative Pre-trained Transformer (GPT) [29]. The main difference is that BERT is based on bidirectional pre-training and fine-tuning, while GPT is based on autoregressive pre-training and prompting.
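The sinusoidal positional encoding mentioned above can be computed directly from the position and model dimension; for brevity, the sketch assumes `d_model` is even:

```python
import math

def positional_encoding(pos, d_model):
    # Sinusoidal encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d)),
    #                      PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]
```

Each position receives a unique, deterministic vector, so the otherwise order-agnostic attention layers can distinguish word order.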
GPT. GPT aims to pre-train models using large-scale unsupervised learning to facilitate understanding and generation of natural language. The training process involves two primary stages: Initially, a language model is trained in an unsupervised manner on extensive corpora without task-specific labels or annotations. Subsequently, supervised fine-tuning occurs during the second stage, catering to specific application domains and tasks.
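GPT-style generation at inference time reduces to a simple loop: predict the next token from the sequence so far, append it, and repeat. The bigram lookup table below is a hypothetical toy stand-in for the trained network:

```python
def autoregressive_generate(next_token, prompt, max_new=5):
    """Sketch of autoregressive decoding: repeatedly predict and append the next token."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(next_token(seq))
    return seq

# Hypothetical toy "model": a bigram table standing in for the neural network
bigram = {"the": "model", "model": "generates", "generates": "text"}

def toy_next_token(seq):
    return bigram.get(seq[-1], "<eos>")
```

A real GPT replaces `toy_next_token` with a Transformer decoder that conditions on the entire sequence, but the surrounding loop is the same.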
BERT. BERT has emerged as a breakthrough approach, achieving state-of-the-art performance across diverse language tasks. BERT’s training methodology comprises two key stages: pre-training and fine-tuning. Pre-training involves the utilization of extensive text corpora to train the language model. The primary objective of pre-training is to endow the BERT model with robust language understanding capabilities, enabling it to effectively tackle various natural language processing tasks. Subsequently, fine-tuning utilizes the pre-trained BERT model in conjunction with smaller labeled datasets to refine the model parameters. This process facilitates the customization of the model to specific tasks, thereby enhancing its suitability and performance for targeted applications.
In recent years, LLMs have witnessed explosive and rapid growth. Basic language models refer to models that are only pre-trained on large-scale text corpora, without any fine-tuning. Examples of such models include LaMDA[30] and OpenAI’s GPT-3[31].
In computer vision, pretrained vision-language models like CLIP[32] have demonstrated powerful zero-shot generalization performance across various downstream visual tasks. These models are typically trained on hundreds of millions to billions of image-text pairs collected from the web. In addition, some research efforts also focus on large-scale base models conditioned on visual input prompts. For example, SAM[33] can perform category-agnostic segmentation from given images and visual prompts (such as boxes, points, or masks).
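CLIP-style zero-shot classification boils down to a nearest-neighbour search in the joint embedding space: the predicted class is the text prompt whose embedding is most similar to the image embedding. The toy 2D vectors below are assumptions for illustration; real CLIP embeddings are high-dimensional:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def zero_shot_classify(image_emb, text_embs):
    """Return the index of the text prompt closest to the image in the joint space."""
    scores = [cosine(image_emb, t) for t in text_embs]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because the class set is expressed only as text prompts, new categories can be added at inference time without retraining, which is the source of CLIP's zero-shot generalization.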
The current generative models based on the diffusion model present unprecedented understanding and creative capabilities. Stable Diffusion [9] uses the CLIP [32] text encoder and can adjust the model through text prompts. Its diffusion process starts with random noise and gradually denoises until a complete data sample is generated. DALL-E 3 [34] utilizes the diffusion model with massive data to generate impressive results. Midjourney excels at adapting to actual artistic styles to create images with any combination of effects the user desires.
In this section, we introduce widely used applications of generative AI, including image generation (Section 2.5.1), video generation (Section 2.5.2), and 3D model generation (Section 2.5.3). Furthermore, we present results from the discussed models in Figure 9 as illustrative references.
Synthesizing high-quality images from text descriptions is a challenging problem in computer vision. StackGAN [35] proposed a two-stage model to address this issue. In the first stage, StackGAN generates the primitive shape and colors of the object based on the given text description, yielding an initial low-resolution image. In the second stage, StackGAN takes the low-resolution result and the text prompt as inputs and generates a high-resolution image with photo-realistic details; it can rectify defects in the first-stage results and add exhaustive details during the refinement process. GLIDE [36] extends the core concepts of the diffusion model by adding additional text information to enhance the training process, ultimately generating text-conditioned images. With the release of LDM [9], Stable Diffusion models built on LDM and trained on massive data have also sprung up. These works cover areas such as image editing and more powerful 3D generation, further advancing image generation and bringing it closer to human needs.
Image-to-image translation can convert the content in an image from one image domain to another, that is, cross-domain conversion between images.
Sketch. The objective of sketch-to-image generation is to ensure that the generated image maintains consistency in both appearance and context with the provided hand-drawn sketch. Pix2Pix[13] stands out as a classic GAN model capable of handling diverse image translation tasks, including the transformation of sketches into fully realized images. In addition, SketchyGAN[37] focuses on the sketch-to-image generation task and aims to achieve more diversity and realism. Currently, ControlNet[38] can control diffusion models by adding extra conditions. The sketch-to-image generation tasks are applied in both photo-realistic and anime-cartoon styles[39], [40].
Layout. Layout typically encompasses details such as the position, size, and relative relationships of individual objects. Layout2Im[41] is designed to take a coarse spatial layout, consisting of bounding boxes and object categories, for generating a set of realistic images. These images accurately depict the specified objects in their intended locations. To enhance the global attention in context, He et al.[42] introduced the Context Feature Conversion Module to ensure that the generated feature encoding for objects remains aware of other coexisting objects in the scene. As for diffusion models, GLIGEN[43] facilitates grounded text-to-image generation in open worlds using prompts and bounding boxes as condition inputs.
Scene Graph. The scene graph was first utilized for image generation in 2018 [44], enabling explicit reasoning about objects and their relationships. Thereafter, Sortino et al. [45] proposed a model that can satisfy semantic constraints defined by a scene graph and model relations between visual objects in the scene by taking into account a user-provided partial rendering of the desired target. Currently, SceneGenie [46] combines scene graphs with advanced diffusion models to generate high-quality images, enforcing geometric constraints in the sampling process using the bounding box and segmentation information predicted from a scene graph.
Since text prompts provide only a sequence of discrete tokens, text-to-video generation is more difficult than tasks such as image retrieval and image captioning. The Video Diffusion Model [47] is the first work to use the diffusion model for video generation tasks. It proposes a 3D U-Net that can be applied to variable sequence lengths, so it can be jointly trained on video and image modeling goals, making it suitable for video generation tasks. Additionally, Make-A-Video [48] is based on a pre-trained text-to-image model and adds one-dimensional convolution and attention layers in the time dimension to transform it into a text-to-video model. By learning the connection between text and vision through the T2I model, single-modal video data is utilized to learn the generation of temporal dynamic content. Furthermore, the controllability and consistency of video generation models have also garnered increased attention from researchers. PIKA [49] has been proposed to support dynamic transformations of elements in the scene based on prompts, without causing the overall image to collapse. DynamiCrafter [50] utilizes pre-trained video diffusion priors to add animation effects to static images based on textual prompts. This tool supports high-resolution models, providing better dynamic effects, higher resolution, and stronger consistency.
Recent advancements in text-to-3D synthesis have demonstrated remarkable progress, with researchers employing various sophisticated strategies to bridge the gap between natural language descriptions and the creation of detailed 3D content. The pioneering work DreamFusion[51] harnesses a pre-trained 2D text-to-image diffusion model to generate 3D models without large-scale labeled 3D datasets or specialized denoising architectures. Magic3D[52] improves upon DreamFusion’s[51] limitations by implementing a two-stage coarse-to-fine approach, accelerating the optimization process through a sparse 3D representation before refining it into high-resolution textured meshes via a differentiable renderer.
Recent 3D reconstruction techniques particularly focus on generating and reconstructing three-dimensional objects and scenes from a single image or a few images. NeRF [53] represents a state-of-the-art technique in which complex scene representations are modeled as continuous neural radiance fields optimized with sparse input views. CLIP-NeRF [54], leveraging the joint language-image embedding space of the CLIP model, proposes a unified framework that allows manipulating NeRF in a user-friendly way, using either a short text prompt or an exemplar image. DreamCraft3D [55] introduces a hierarchical process for 3D content creation that employs bootstrapped score distillation sampling from a view-dependent diffusion model. This two-step method refines textures through personalized diffusion models trained on augmented scene renderings, thereby delivering high-fidelity, coherent 3D objects. Magic123 [56] offers a two-stage solution for generating high-quality textured 3D meshes from unposed wild images: it optimizes a neural radiance field for coarse geometry and fine-tunes details using differentiable mesh representations guided by both 2D and 3D diffusion priors.
This study delineates the architectural design process into six main steps to facilitate a convenient understanding of the process and essence of architectural design. The output of each step is generated based on the project’s objective conditions and the architects’ subjective intentions. Objective conditions (O) include factors such as site area, building height restrictions, and construction standards that must be adhered to by all architects. Subjective intentions (S) refer to the individual architect’s design concept, architectural style, and other subjective preferences. This study explores how generative AI can assist with preliminary design, layout design, structural design, 3D form design, facade design, and imagery expression based on objective conditions and subjective intentions. It also presents a statistical analysis of generative AI models used in each architectural step and the tasks they accomplish.
To begin with, creating a preliminary 3D architectural model involves considering objective factors such as the building’s type and function, site conditions, and the surrounding environment, as well as subjective factors such as design concepts and morphological intentions. This process can be expressed by Equation (4).
\[F_{\text{P-3D}} = \left\{ y_{\text{P-3D}} \,\middle|\, y_{\text{P-3D}} \in \bigcap_{i=1}^{4} f_{\text{P-3D}}(o_{\text{P-3D}}^i) \cap f_{\text{P-3D}}(S_{\text{P-3D}}) \right\}\]
Where \(y_{\text{P-3D}}\) is the generated preliminary 3D model of the architecture and \(F_{\text{P-3D}}\) is the collection of all the options. \(O_{\text{P-3D}}\) refers to the objective conditions of the preliminary design, which include the design tasks (\(o_{\text{P-3D}}^1\)), such as building functions, building area, building height restrictions, and the number of occupants; site conditions (\(o_{\text{P-3D}}^2\)), such as the red line of the site and the shape of its boundaries; surroundings conditions (\(o_{\text{P-3D}}^3\)), such as nearby traffic arteries and neighboring buildings; and environmental performance (\(o_{\text{P-3D}}^4\)), such as daylighting and the wind and thermal environment. \(S_{\text{P-3D}}\) refers to the subjective intentions of the preliminary design.
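Equation (4) is, in essence, a set intersection: a candidate form survives only if it satisfies every objective condition and the subjective intention. A minimal sketch, with hypothetical candidate forms and constraint checks standing in for real design data:

```python
# Equation (4) as code: the feasible set is the intersection of the candidate
# sets passed by each objective condition o^i and by the subjective intention S.
# Candidates, checks, and thresholds here are hypothetical stand-ins.

def feasible_forms(candidates, objective_checks, subjective_check):
    """Keep candidates satisfying every objective condition and the subjective intent."""
    return {
        y for y in candidates
        if all(check(y) for check in objective_checks) and subjective_check(y)
    }

# Each candidate form is a (footprint_m2, height_m) tuple.
candidates = {(800, 20), (1200, 35), (600, 60), (900, 28)}
objective_checks = [
    lambda y: y[0] <= 1000,  # o^2: site red line caps the footprint
    lambda y: y[1] <= 30,    # o^1: building height restriction
]
subjective_check = lambda y: y[1] >= 25  # S: designer prefers a taller massing
```

Under these illustrative constraints only the (900, 28) candidate remains feasible, mirroring how the intersection in Equation (4) narrows the design space.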
Data Transformation Approach | Paper & Methodology |
---|---|
\(parameters\) to \(F_{\text{P-3D}}\) | VAE [57]; GAN, VAE [58]; 3D-DDPM [59]; GANs [60]; 3D-GAN, CPCGAN [61] |
\(classify\) \(F_{\text{P-3D}}\) | VAE [62]; 3D-AAE [63] |
\(S_{\text{P-3D\_text}}\) to \(F_{\text{P-3D}}\) | CVAE [64] |
\(R_{\text{P-3D\_sketch}}\) to \(F_{\text{P-3D}}\) | VAE, GAN [65] |
\(R_{\text{P-3D\_2d}}\) to \(R_{\text{P-3D\_2d}}\) | pix2pix [66], [67]; DCGAN [68]; pix2pix, CycleGAN [69], [70] |
(\(o_{\text{P-3D}}^2\) + \(o_{\text{P-3D}}^3\)) to \(F_{\text{P-3D}}\) | pix2pix [71]; ESGAN [72] |
(\(S_{\text{P-3D}}\) + \(o_{\text{P-3D}}^3\)) to \(F_{\text{P-3D}}\) | cGAN [73]; GAN [74] |
\(F_{\text{P-3D}}\) to \(F_{\text{P-3D}}\) | TreeGAN [75]; DDPM [76] |
(\(F_{\text{P-3D}}\) + \(o_{\text{P-3D}}^2\)) to \(o_{\text{P-3D}}^4\) | VAE [77]; pix2pix, cycleGAN [78] |
To elucidate the specific architectural design process, this paper takes the Bo-DAA apartment project in Seoul, South Korea, as an example. The project requirements include multiple residential units and shared public spaces, encompassing a communal workspace, lounge, shared kitchen, laundry room, and pet bathing area (\(o_{\text{P-3D}}^1\)). The site is a regular rectangle with flat terrain (\(o_{\text{P-3D}}^2\)), located in an urban setting surrounded by multi-story residential buildings (\(o_{\text{P-3D}}^3\)). To enhance resident comfort, the design considered lighting and views for each residential unit (\(o_{\text{P-3D}}^4\)). Based on these requirements, the architect chose "Book House" as the design concept (\(S_{\text{P-3D}}\)), creating a preliminary 3D form (\(F_{\text{P-3D}}\)) that gradually tapers from bottom to top. This design provides excellent lighting and views for each residential unit level. This process is illustrated in Figure 10.
The applications of generative AI in this process fall into four main categories, as shown in Table 1: generating \(F_{\text{P-3D}}\) based on parameters, or classifying it; generating \(F_{\text{P-3D}}\) based on 2D images or 1D text (usually from \(o_{\text{P-3D}}^1\), \(o_{\text{P-3D}}^2\), \(o_{\text{P-3D}}^3\), \(S_{\text{P-3D}}\), \(S_{\text{P-3D\_text}}\), \(R_{\text{P-3D\_2d}}\), and \(R_{\text{P-3D\_sketch}}\)); generating or redesigning \(F_{\text{P-3D}}\) based on 3D model data (usually from \(F_{\text{P-3D}}\)); and generating environmental performance evaluations (usually \(o_{\text{P-3D}}^4\)) based on 3D model data (usually from \(F_{\text{P-3D}}\)).
Firstly, generative AI can generate preliminary 3D forms from input parameters or conduct classification analysis on preliminary 3D models. Variational Autoencoders (VAE) play a pivotal role in reconstructing and generating detailed 3D models (\(F_{\text{P-3D}}\)) from a set of input parameters (\(parameters\) to \(F_{\text{P-3D}}\)) [57]. Building upon this, Generative Adversarial Networks (GAN) refine the process by training on the point coordinate data of 3D models, utilizing category parameters for more precise reconstructions (\(parameters\) to \(F_{\text{P-3D}}\)) [60]. This approach also facilitates the creation of innovative architectural 3D forms through input interpolation (\(parameters\) to \(F_{\text{P-3D}}\)) [58]. Similarly, diffusion probabilistic models trained on Taihu stone and architectural 3D models enable the discovery of transitional forms between two distinct 3D models by employing interpolation as the input mechanism (\(parameters\) to \(F_{\text{P-3D}}\)) [59]. The Structure GAN model, focusing on point cloud data, enables the generation of 3D models from specific input parameters such as length, width, and height (\(parameters\) to \(F_{\text{P-3D}}\)) [61]. In a further enhancement to the modeling process, VAE is also utilized for the in-depth training of 3D models (\(F_{\text{P-3D}}\)), allowing a comprehensive classification and analysis of the models’ distribution within the latent space and paving the way for more nuanced model creation (\(classify\) \(F_{\text{P-3D}}\)) [62]. Other generative AI techniques, such as the 3D Adversarial Autoencoder model, are employed to train on and generate point cloud representations, facilitating the reconstruction and classification of architectural forms (\(classify\) \(F_{\text{P-3D}}\)) [63].
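The interpolation technique mentioned for [58] and [59] can be sketched in a few lines: two latent codes (or parameter vectors) are blended linearly, and each intermediate point is decoded into a form. `decode` below is a placeholder for a trained VAE/GAN decoder, not any specific model from these papers:

```python
# Sketch of latent-space interpolation between two forms. The identity decoder
# is used only to keep the sketch self-contained and runnable.

def lerp(z_a, z_b, t):
    """Linear interpolation between two latent vectors at ratio t in [0, 1]."""
    return [a + t * (b - a) for a, b in zip(z_a, z_b)]

def transitional_forms(z_a, z_b, decode, steps=5):
    """Decode evenly spaced points on the segment between z_a and z_b."""
    return [decode(lerp(z_a, z_b, i / (steps - 1))) for i in range(steps)]

forms = transitional_forms([0.0, 0.0], [1.0, 2.0], decode=lambda z: z, steps=3)
```

With a real decoder, the intermediate codes yield the transitional 3D forms described above, e.g. hybrids between a Taihu stone and a building massing.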
Secondly, 1D text data or 2D image data can serve as the generation conditions for generative AI to produce preliminary 3D forms. Variational Autoencoders (VAE) are applied to train and generate 3D voxel models guided by textual labels (\(S_{\text{P-3D\_text}}\) to \(F_{\text{P-3D}}\)) [64]. The integration of VAE and GAN models likewise facilitates the generation of architectural 3D forms from sketches (\(R_{\text{P-3D\_sketch}}\) to \(F_{\text{P-3D}}\)) [65]. Training on 3D data is considerably more difficult than training on 2D image data. To address this challenge, researchers have transformed 3D forms into 2D representations, such as grayscale images enriched with elevation data. This approach simplifies the training process, enhancing efficiency for architectural forms in specific regions and facilitating the generation of 3D models influenced by the surrounding environment (\(R_{\text{P-3D\_2d}}\) to \(R_{\text{P-3D\_2d}}\)) [67]–[70]. Moreover, converting 3D models into 2D images for reconstruction, and then reverting these 2D images back to 3D forms, significantly reduces both training duration and cost while ensuring accurate restoration of the original 3D models (\(R_{\text{P-3D\_2d}}\) to \(R_{\text{P-3D\_2d}}\)) [66]. In other training strategies, researchers incorporate parameters such as the design site’s scope (\(o_{\text{P-3D}}^2\)) and the characteristics of the immediate environment (\(o_{\text{P-3D}}^3\)) as generative conditions, enabling the creation of preliminary 3D models that adhere to predefined rule settings ((\(o_{\text{P-3D}}^2\) + \(o_{\text{P-3D}}^3\)) to \(F_{\text{P-3D}}\)) [71], [72].
Furthermore, researchers can create architectural 3D models from design concept sketches (\(S_{\text{P-3D}}\) to \(F_{\text{P-3D}}\)) [73], and even from a single concept sketch in conjunction with environmental data ((\(S_{\text{P-3D}}\) + \(o_{\text{P-3D}}^3\)) to \(F_{\text{P-3D}}\)) [74].
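The 3D-to-2D workaround described above, encoding a massing as a grayscale heightmap that ordinary 2D image models can be trained on, can be sketched as follows; the 3x3 elevation grid is an illustrative stand-in, and lifting the image back to voxels is the inverse step:

```python
# A building massing stored as a heightmap (pixel value = number of storeys)
# is cheap to train on with 2D generative models and trivially lifted back to
# a voxel model afterwards.

def heightmap_to_voxels(heightmap, max_height):
    """Lift a 2D elevation image into a boolean voxel grid indexed [z][row][col]."""
    return [
        [[h > z for h in row] for row in heightmap]
        for z in range(max_height)
    ]

heightmap = [
    [3, 3, 0],
    [3, 2, 0],
    [1, 1, 1],
]
voxels = heightmap_to_voxels(heightmap, max_height=3)
occupied = sum(cell for layer in voxels for row in layer for cell in row)
```

The total occupied-voxel count equals the sum of the heightmap values, which is why no information about an extruded massing is lost in this 2D encoding (overhangs, however, cannot be represented).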
The third category uses 3D models as the basis for generative AI creation, or redesigns 3D models that were themselves generated by generative AI. TreeGAN is used to train point cloud models of churches, leveraging these models for diverse redesign applications (\(F_{\text{P-3D}}\) to \(F_{\text{P-3D}}\)) [75]. Additionally, diffusion probabilistic models are trained on 3D models and introduce noise into them to create novel forms (\(F_{\text{P-3D}}\) to \(F_{\text{P-3D}}\)) [76]. Lastly, generative AI is utilized to conduct site and architectural environmental performance evaluations based on 3D models, generating images for assessments such as view analysis, sunlight exposure, and daylighting rates ((\(F_{\text{P-3D}}\) + \(o_{\text{P-3D}}^2\)) to \(o_{\text{P-3D}}^4\)) [77], [78].
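The noise-injection idea behind the diffusion-based redesign above can be illustrated by the forward (noising) step alone; the schedule value and point cloud below are illustrative, and the learned reverse process that would denoise toward a novel form is omitted:

```python
import random

# Forward step of a diffusion process applied to a 3D point cloud:
# x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, eps ~ N(0, I).

def noise_point_cloud(points, alpha_bar, rng):
    """Return the point cloud perturbed to the diffusion time with cumulative alpha_bar."""
    keep = alpha_bar ** 0.5
    spread = (1.0 - alpha_bar) ** 0.5
    return [
        tuple(keep * c + spread * rng.gauss(0.0, 1.0) for c in point)
        for point in points
    ]

rng = random.Random(0)
cloud = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
noised = noise_point_cloud(cloud, alpha_bar=0.9, rng=rng)
```

At `alpha_bar = 1.0` the cloud is returned unchanged; as `alpha_bar` decreases the points drift toward pure Gaussian noise, which is the spectrum a trained denoiser exploits to generate variants of the original form.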
Architectural plan design, the second phase in the architectural design process, involves creating horizontal section views at specific site elevations. Guided by objective conditions and subjective decisions, this step arranges spatial elements such as walls, windows, and doors into a 2D plan. This process can be expressed by Equation (5).
\[F_{\text{Plan}} = \left\{ y_{\text{Plan}} \,\middle|\, y_{\text{Plan}} \in \bigcap_{i=1}^{3} f_{\text{Plan}}(o_{\text{Plan}}^i) \cap \bigcap_{j=1}^{2} f_{\text{Plan}}(s_{\text{Plan}}^j) \right\}\]
Where \(y_{\text{Plan}}\) is the generated architectural plan design and \(F_{\text{Plan}}\) is the collection of all the options. \(O_{\text{Plan}}\) refers to the objective conditions of the architectural plan design, which include the preliminary architectural 3D form design (\(o_{\text{Plan}}^1\)), the result of the prior design phase; spatial requirements and standards (\(o_{\text{Plan}}^2\)), such as space area and quantity needs; and spatial environmental performance evaluations (\(o_{\text{Plan}}^3\)), such as room daylighting ratio, ventilation rate, etc. \(S_{\text{Plan}}\) refers to the subjective intentions of the architectural plan design, which include the functional space layout (\(s_{\text{Plan}}^1\)), indicating the size and layout of functional spaces, and spatial sequences (\(s_{\text{Plan}}^2\)), such as bubble diagrams and sequence schematics.
By accumulating the plan design results of each layer, the overall plan design outcome is obtained, represented as Equation (6): \[R_{\text{Plan}} = \sum_{i=1}^{n} F_{\text{Plan}}^i\]
Using the Bo-DAA apartment project as an example, architects first create a preliminary 3D model (\(o_{\text{Plan}}^1\)) to outline each floor’s plan based on the model’s elevation contours. They then design functional spaces (\(s_{\text{Plan}}^1\)) according to spatial requirements (\(o_{\text{Plan}}^2\)), such as evacuation distances and space area needs, positioning public areas on the lower floors and residential units above. Spatial sequences (\(s_{\text{Plan}}^2\)) are structured using corridors and atriums to align with the layout. Environmental evaluations (\(o_{\text{Plan}}^3\)) are also conducted to ensure spatial performance. This leads to a comprehensive architectural plan (\(R_{\text{Plan}}\)) that meets all established constraints. This process is shown in Figure 11.
Data Transformation Approach | Paper & Methodology |
---|---|
\(o_{\text{Plan}}^1\) to \(s_{\text{Plan}}^1\) to \(F_{\text{Plan}}\) | GANs [79]–[83] |
\(s_{\text{Plan}}^1\) to \(F_{\text{Plan}}\) | pix2pix [84] |
\(s_{\text{Plan}}^2\) to \(F_{\text{Plan}}\) | Graph2Plan [85]; pix2pix [86]; CycleGAN [87] |
\(F_{\text{Plan}}\) to \(F_{\text{Plan}}\) | pix2pix [88]; GANs [89] |
\(o_{\text{Plan}}^3\) to \(F_{\text{Plan}}\) | pix2pix [90] |
(\(s_{\text{Plan}}^1 + o_{\text{P-3D}}^4\)) to \(s_{\text{Plan}}^1\) | Genetic-Algorithm, FCN [91] |
\(s_{\text{Plan}}^1\) to \(s_{\text{Plan}}^1\) | GNN, VAE [92] |
\(o_{\text{Plan}}^2\) to \(s_{\text{Plan}}^1\) | CoGAN [93] |
(\(s_{\text{Plan}}^1 + o_{\text{P-3D}}^3\)) to \(s_{\text{Plan}}^1\) | GANs [94]–[96]; CNN, pix2pixHD [94] |
(\(s_{\text{Plan}}^1 + o_{\text{Plan}}^2\)) to \(s_{\text{Plan}}^1\) | Transformer [97] |
\(s_{\text{Plan}}^2\) to \(s_{\text{Plan}}^1\) | GANs [98]–[105]; Transformer [106]; DM [107] |
\(o_{\text{P-3D}}^2\) to \(s_{\text{Plan}}^1\) | pix2pix [104], [108], [109]; GauGAN [110] |
\(o_{\text{Plan}}^1\) to \(s_{\text{Plan}}^1\) | GANs [111]; StyleGAN, Graph2Plan, RPLAN [112]; pix2pix [113] |
\(R_{\text{Plan}}\) to \(s_{\text{Plan}}^2\) | EdgeGAN [114]; cGAN [115] |
\(o_{\text{Plan}}^3\) to \(s_{\text{Plan}}^2\) | DCGAN [116]; VQ-VAE, GPT [117] |
\(o_{\text{Plan}}^1\) to \(s_{\text{Plan}}^2\) | DDPM [118] |
\(F_{\text{Plan}}\) to \(o_{\text{Plan}}^3\) | cGAN [119] |
\(s_{\text{Plan}}^1\) to \(o_{\text{Plan}}^3\) | pix2pix [120], [121] |
The applications of generative AI in plan design fall into four main categories, as shown in Table 2: generating the floor plan \(F_{\text{Plan}}\) based on 2D images (usually from \(o_{\text{Plan}}^1\), \(o_{\text{Plan}}^3\), \(s_{\text{Plan}}^1\), \(s_{\text{Plan}}^2\), and \(F_{\text{Plan}}\)); generating the functional space layout \(s_{\text{Plan}}^1\) based on 2D images (usually from \(s_{\text{Plan}}^1\), \(s_{\text{Plan}}^2\), \(o_{\text{Plan}}^1\), \(o_{\text{Plan}}^2\), \(o_{\text{P-3D}}^2\), \(o_{\text{P-3D}}^3\), and \(o_{\text{P-3D}}^4\)); generating spatial sequences \(s_{\text{Plan}}^2\) based on 2D images (usually from \(R_{\text{Plan}}\), \(o_{\text{Plan}}^1\), and \(o_{\text{Plan}}^3\)); and generating spatial environmental performance evaluations \(o_{\text{Plan}}^3\) based on 2D images (usually from \(F_{\text{Plan}}\) and \(s_{\text{Plan}}^1\)).
First, in terms of generating architectural floor plans, researchers can create functional space layout diagrams from preliminary design range schematics or site range schematics and then generate the final architectural plan from these layouts, progressing from \(o_{\text{Plan}}^1\) to \(s_{\text{Plan}}^1\) and finally to \(F_{\text{Plan}}\) [79]–[83]. Architectural floor plans can also be generated directly from functional space layout diagrams (\(s_{\text{Plan}}^1\) to \(F_{\text{Plan}}\)) [84]. Additionally, researchers utilize generative models to convert planar spatial bubble diagrams or spatial sequence diagrams into latent spatial vectors, which are then used to generate architectural floor plans (\(s_{\text{Plan}}^2\) to \(F_{\text{Plan}}\)) [85]–[87]. Moreover, GAN models can refine architectural floor plans to obtain floor plans with furniture (\(F_{\text{Plan}}\) to \(F_{\text{Plan}}\)) [88]. Some reconstruction and generation processes are achieved by training on architectural floor plans (\(F_{\text{Plan}}\) to \(F_{\text{Plan}}\)) [89]. Floor plans can also be produced based on spatial environmental evaluations, such as lighting and wind conditions (\(o_{\text{Plan}}^3\) to \(F_{\text{Plan}}\)) [90].
Secondly, generative AI is not limited to producing architectural floor plans; it also plays various roles in the generation of functional space layouts (\(s_{\text{Plan}}^1\)). For instance, it can combine neural networks and genetic algorithms to enhance functional layouts based on wind environment performance evaluations ((\(s_{\text{Plan}}^1 + o_{\text{P-3D}}^4\)) to \(s_{\text{Plan}}^1\)) [91]. Moreover, generative AI can reconstruct and produce matching functional layout diagrams based on the implicit information within functional space layout maps (\(s_{\text{Plan}}^1\) to \(s_{\text{Plan}}^1\)) [92]. Furthermore, it can generate viable functional layouts according to spatial requirements (\(o_{\text{Plan}}^2\) to \(s_{\text{Plan}}^1\)) [93], and it can predict and generate functional space layouts based on surrounding environmental performance evaluations ((\(s_{\text{Plan}}^1 + o_{\text{P-3D}}^3\)) to \(s_{\text{Plan}}^1\)) [94]–[96], [122]. Similarly, generative AI can complement or augment incomplete functional space layouts based on specific demands ((\(s_{\text{Plan}}^1 + o_{\text{Plan}}^2\)) to \(s_{\text{Plan}}^1\)) [97], [123], and it can use spatial sequences to generate functional space layout diagrams (\(s_{\text{Plan}}^2\) to \(s_{\text{Plan}}^1\)) [98]–[107]. In addition, it can generate functional space layout diagrams based on the designated red-line boundary of a design site (\(o_{\text{P-3D}}^2\) to \(s_{\text{Plan}}^1\)) [104], [108]–[110], and it can use plan design boundaries as conditions to generate functional space layout diagrams (\(o_{\text{Plan}}^1\) to \(s_{\text{Plan}}^1\)) [111]–[113].
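What the bubble-diagram-to-layout models in this category learn implicitly can be made concrete with a toy check: a bubble diagram is an adjacency graph over functional spaces, and a generated layout should realize its edges. The grid layout and required pairs below are purely illustrative:

```python
# A bubble diagram as an adjacency graph; a layout is a toy grid of room
# labels. We report which required adjacencies the layout satisfies.

def satisfied_adjacencies(grid, required_pairs):
    """Return the required room pairs that share a grid edge in the layout."""
    neighbors = set()
    for r, row in enumerate(grid):
        for c, room in enumerate(row):
            if c + 1 < len(row):
                neighbors.add(frozenset((room, row[c + 1])))
            if r + 1 < len(grid):
                neighbors.add(frozenset((room, grid[r + 1][c])))
    return [pair for pair in required_pairs if frozenset(pair) in neighbors]

grid = [
    ["living", "living", "kitchen"],
    ["hall",   "bed",    "bath"],
]
required = [("living", "kitchen"), ("bed", "bath"), ("kitchen", "bed")]
met = satisfied_adjacencies(grid, required)
```

Here two of the three required adjacencies are realized; a generative layout model is, in effect, trained to maximize exactly this kind of satisfaction while also respecting areas and boundaries.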
Thirdly, generative AI demonstrates strong performance in the generation and prediction of spatial sequences (\(s_{\text{Plan}}^2\)). Specifically, it is capable of identifying and reconstructing wall layout sequences from floor plans (\(R_{\text{Plan}}\) to \(s_{\text{Plan}}^2\)) [114], [115]. Additionally, it can construct spatial sequence bubble diagrams directly from these floor plans (\(R_{\text{Plan}}\) to \(s_{\text{Plan}}^2\)) [124]. Moreover, generative AI can employ isovists to predict spatial sequences (\(o_{\text{Plan}}^3\) to \(s_{\text{Plan}}^2\)) [116], [117]. Lastly, it can produce these diagrams conditioned on specific plan design boundary ranges (\(o_{\text{Plan}}^1\) to \(s_{\text{Plan}}^2\)) [118].
Lastly, generative AI can predict spatial environmental performance evaluations from floor plans (\(F_{\text{Plan}}\) to \(o_{\text{Plan}}^3\)) [119], such as light exposure and isovist ranges. It can also predict indoor brightness [120] and daylight penetration [121] from functional space layout diagrams (\(s_{\text{Plan}}^1\) to \(o_{\text{Plan}}^3\)).
Data Transformation Approach | Paper & Methodology |
---|---|
\(o_{\text{str}}^2\) to \((x_{\text{str}}, y_{\text{str}})\) | GANs [125]–[127] |
(\(l_{\text{str\_text}} + o_{\text{str}}^2\)) to \((x_{\text{str}}, y_{\text{str}})\) | GANs [128]–[131] |
\((x_{\text{str}}, y_{\text{str}})\) to \((x_{\text{str}}, y_{\text{str}})\) | GANs [132] |
(\(s_{\text{Plan}}^1 + o_{\text{str}}^2\)) to \((x_{\text{str}}, y_{\text{str}})\) | pix2pixHD [133] |
(\((x_{\text{str}}, y_{\text{str}}) + d_{\text{str}}\)) to (\((x_{\text{str}}, y_{\text{str}}) + d_{\text{str}}\)) | StructGAN-KNWL [134] |
The third phase in the architectural design process, architectural structure system design, involves architects developing the building’s framework and support mechanisms. This process can be expressed by Equation (7). \[F_{\text{str}} = \left\{ y_{\text{str}} \,\middle|\, y_{\text{str}} \in \bigcap_{i=1}^{3} f_{\text{str}}(o_{\text{str}}^i) \cap \bigcap_{j=1}^{2} f_{\text{str}}(s_{\text{str}}^j) \right\}\]
Where \(y_{\text{str}}\) is the generated structure system and \(F_{\text{str}}\) is the collection of all the options. \(O_{\text{str}}\) refers to the objective conditions of the structure system design, which include the structural load distribution (\(o_{\text{str}}^1\)), referring to the schematic of the building’s structural load distribution; the architectural plan design (\(o_{\text{str}}^2\)), the second step of the design process; and the preliminary 3D form of the building (\(o_{\text{str}}^3\)), the result of the first step in the design process. \(S_{\text{str}}\) refers to the subjective decisions of the structure system design, which include structural materials (\(s_{\text{str}}^1\)), typically encompassing parameters characterizing the materials and texture images, and structural aesthetic principles (\(s_{\text{str}}^2\)), usually involving conceptual diagrams and 3D models of the structural form. The design outcome \(y_{\text{str}}\) encapsulates various structural information, such as structural load capacity (\(l_{\text{str}}\)), structural dimensions (\(d_{\text{str}}\)), and structural layout (\(x_{\text{str}}, y_{\text{str}}\)). This structural information is determined by a set of objective conditions (\(O_{\text{str}}\)) and a set of subjective decisions (\(S_{\text{str}}\)).
Using the Bo-DAA apartment project as an illustration, the architect utilized the preliminary 3D model (\(o_{\text{str}}^3\)) and the architectural plan (\(o_{\text{str}}^2\)) to define the building’s spatial form and structural load distribution (\(o_{\text{str}}^1\)). Opting for a frame structure, reinforced concrete was chosen as the construction material (\(s_{\text{str}}^1\)), embodying modern Brutalism (\(s_{\text{str}}^2\)). This approach ensured that the final structure (\(R_{\text{str}}\)) adhered to both the aesthetic and the functional constraints. This process is represented in Figure 12.
The applications of generative AI in structural system design primarily involve the prediction of structural layout (\((x_{\text{str}}, y_{\text{str}})\)) and structural dimensions (\(d_{\text{str}}\)).
In the realm of generating architectural structure layout images, generative AI is capable of recognizing architectural floor plans (\(o_{\text{str}}^2\)) and leveraging this recognition to generate detailed images of the structural layout (\(o_{\text{str}}^2\) to \((x_{\text{str}}, y_{\text{str}})\)) [125]–[127]. Moreover, this technology is adept at creating structural layout diagrams that correspond to floor plans based on specified structural load capacities ((\(l_{\text{str\_text}} + o_{\text{str}}^2\)) to \((x_{\text{str}}, y_{\text{str}})\)) [128]–[131]. Additionally, generative AI can refine and enhance existing structural layouts, optimizing the layout within the same structural space (\((x_{\text{str}}, y_{\text{str}})\) to \((x_{\text{str}}, y_{\text{str}})\)) [132]. Furthermore, generative AI combines functional space layouts (\(s_{\text{Plan}}^1\)) and architectural floor plans (\(o_{\text{str}}^2\)) to create corresponding architectural structure layout diagrams ((\(s_{\text{Plan}}^1 + o_{\text{str}}^2\)) to \((x_{\text{str}}, y_{\text{str}})\)) [133].
In terms of predicting and generating structural dimensions, generative AI can forecast and create more appropriate structural sizes and layouts based on the existing layout and dimensions, thereby optimizing them ((\((x_{\text{str}}, y_{\text{str}}) + d_{\text{str}}\)) to (\((x_{\text{str}}, y_{\text{str}}) + d_{\text{str}}\))) [134]. It can also generate dimensions and layouts that meet load requirements based on the structural layout (\(x_{\text{str}}, y_{\text{str}}\)) and load capacity (\(l_{\text{str}}\)) ((\((x_{\text{str}}, y_{\text{str}}) + l_{\text{str}}\)) to (\((x_{\text{str}}, y_{\text{str}}) + d_{\text{str}}\))).
The fourth phase of architectural design focuses on refining and optimizing 3D models to more closely represent the building’s characteristics based on the initial model. This step enhances detail and form, and the process can be expressed by Equation (8).
\[F_{\text{D-3D}} = \left\{ y_{\text{D-3D}} \,\middle|\, y_{\text{D-3D}} \in \bigcap_{i=1}^{4} f_{\text{D-3D}}(o_{\text{D-3D}}^i) \cap \bigcap_{j=1}^{2} f_{\text{D-3D}}(s_{\text{D-3D}}^j) \right\}\]
Where \(y_{\text{D-3D}}\) is the generated refined 3D model of the architecture and \(F_{\text{D-3D}}\) is the collection of all the options. \(O_{\text{D-3D}}\) refers to the objective conditions of this phase, which include the indicator requirements (\(o_{\text{D-3D}}^1\)) for refining the architectural 3D form; the preliminary architectural 3D form design (\(o_{\text{D-3D}}^2\)), the result of the first step in the design process; the architectural floor plan design (\(o_{\text{D-3D}}^3\)), the outcome of the second step; and the architectural structural system design (\(o_{\text{D-3D}}^4\)), the result of the third step. \(S_{\text{D-3D}}\) refers to the subjective decisions of the refined 3D model, which include aesthetic principles (\(s_{\text{D-3D}}^1\)), used by architects to control the overall form and proportions of a building, and the design style (\(s_{\text{D-3D}}^2\)), a manifestation of a period’s or region’s specific characteristics and expression methods, reflected through elements such as the form, structure, materials, color, and decoration of a building.
Using the Bo-DAA apartment project, the architectural form index (\(o_{\text{D-3D}}^1\)), including key metrics such as floor area ratio and height, is first established. Next, the preliminary 3D form (\(o_{\text{D-3D}}^2\)) shapes a tapered volume. In the floor plan phase (\(o_{\text{D-3D}}^3\)), refinements such as a sixth-floor setback for public spaces are made. The structural system design (\(o_{\text{D-3D}}^4\)) keeps these modifications within structural principles. Aesthetic principles (\(s_{\text{D-3D}}^1\)) and design styles (\(s_{\text{D-3D}}^2\)) are woven throughout, culminating in a refined 3D form (\(R_{\text{D-3D}}\)) that harmonizes constraints with aesthetics. This process is illustrated in Figure 13.
Data Transformation Approach | Paper & Methodology |
---|---|
\(generate\) \(F_{\text{D-3D}}\) | 3D-GAN [135] |
\(classify\) \(F_{\text{D-3D}}\) | VAE [136] |
\(parameters\) to \(F_{\text{D-3D}}\) | DCGAN, StyleGAN [137]; 3D-GAN [138] |
\(s_{\text{D-3D\_text}}^2\) to \(F_{\text{D-3D}}\) | 3D-GAN [139] |
\(F_{\text{D-3D\_2d}}\) to \(F_{\text{D-3D\_2d}}\) | StyleGAN, pix2pix [140]; pix2pix, CycleGAN [141] |
\(o_{\text{D-3D}}^3\) to \(F_{\text{D-3D}}\) | StyleGAN [142] |
(\(o_{\text{P-3D}}^1\) + \(s_{\text{Plan}}^2\)) to \(F_{\text{D-3D}}\) | GCN [142]; cGAN, GNN [143] |
The applications of generative AI in architectural 3D form refinement and optimization design fall into two main categories, as shown in Table 4: using parameters or 1D text to generate \(F_{\text{D-3D}}\) or to conduct classification analysis (usually from \(s_{\text{D-3D\_text}}^2\)); and generating \(F_{\text{D-3D}}\), represented by 2D images or 3D models, based on 2D images (usually from \(F_{\text{D-3D\_2d}}\), \(o_{\text{D-3D}}^3\), and \(s_{\text{Plan}}^2\)).
In terms of using parameters or 1D text to generate refined architectural 3D models, researchers have trained voxel-based models (\(generate\) \(F_{\text{D-3D}}\)) [135] to generate such refined models. Additionally, generative AI has been employed to train Signed Distance Function (SDF) voxels, coupled with clustering analysis of shallow vector representations of 3D models (\(classify\) \(F_{\text{D-3D}}\)) [136]. Furthermore, 2D images containing 3D voxel information can be generated from input RGB channel values (\(parameters\) to \(F_{\text{D-3D\_2d}}\)) [138], and new forms of 3D elements can be generated through interpolation (\(parameters\) to \(F_{\text{D-3D}}\)) [137]. Voxelized and point cloud representations of refined 3D model components (\(F_{\text{D-3D}}\)) can also be trained and generated according to the textual labels of architectural components (\(s_{\text{D-3D\_text}}^2\) to \(F_{\text{D-3D}}\)) [139].
In terms of using 2D images to generate refined architectural 3D models, researchers converted refined architectural 3D models (\(F_{\text{D-3D}}\)) into sectional images and trained Generative Adversarial Networks (GANs) on paired sectional diagrams to learn the connections between adjacent sections. By inputting a single sectional image into the model to reconstruct a new one, and then using the newly generated image as input to the next iteration, the 3D model can be reconstructed section by section (\(F_{\text{D-3D\_2d}}\) to \(F_{\text{D-3D\_2d}}\)) [140], [141]. Concurrently, generative AI can generate refined 3D models from architectural floor plans (\(o_{\text{D-3D}}^3\) to \(F_{\text{D-3D}}\)) [142], or from spatial sequence matrices and spatial requirements ((\(o_{\text{P-3D}}^1\) + \(s_{\text{Plan}}^2\)) to \(F_{\text{D-3D}}\)) [143], [144]. In an innovative approach, generative AI can learn the 3D models of architectural components (\(F_{\text{D-3D}}\)) and combine them to create refined architectural 3D models; for instance, architectural 3D model components can be pixelated into 2D images for training.
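The iterative section-stacking reconstruction of [140], [141] reduces to a simple feedback loop: predict the next section from the current one and stack the results into a volume. `predict_next_section` below is a stand-in for the trained generator, and the tapering toy generator is purely illustrative:

```python
# Rebuild a 3D model slice by slice: each generated section is fed back in as
# the input for the next prediction.

def reconstruct_volume(first_section, predict_next_section, num_sections):
    """Generate sections iteratively and stack them into a 3D volume."""
    sections = [first_section]
    for _ in range(num_sections - 1):
        sections.append(predict_next_section(sections[-1]))
    return sections

# Toy stand-in generator: each section loses one cell, mimicking a massing
# that tapers from bottom to top.
def shrink(section):
    return section[:-1] if len(section) > 1 else section

volume = reconstruct_volume([1, 1, 1, 1], shrink, num_sections=4)
```

Because each prediction conditions only on the previous slice, errors can accumulate over iterations, which is why the cited works train on paired adjacent sections to keep consecutive slices coherent.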
The fifth step in architectural design focuses on facade design, aiming to create a building exterior that reflects its style and environmental compatibility while incorporating cultural and symbolic elements. This process can be expressed by Equation (9).
\[F_{\text{Fac}} = \left\{ y_{\text{Fac}} \,\middle|\, y_{\text{Fac}} \in \bigcap_{i=1}^{4} f_{\text{Fac}}(o_{\text{Fac}}^i) \cap \bigcap_{j=1}^{2} f_{\text{Fac}}(s_{\text{Fac}}^j) \right\}\]
Data Transformation Approach | Paper & Methodology |
---|---|
\((a_{w} + p_{w} + a_{win} + p_{win} + s_{\text{Fac}}^1)\) to \(F_{\text{Fac}}\) | GANs [145]–[151]; DM, CycleGAN [152] |
\((a_{w} + p_{w} + a_{win} + p_{win} + s_{\text{Fac}}^1)\) to \(R_{\text{Fac}}\) | pix2pix [153] |
\(F_{\text{Fac}}\) to \(F_{\text{Fac}}\) | CycleGAN [154]; StyleGAN2 [155] |
(\(F_{\text{Fac}} + s_{\text{Fac}}^2\)) to \(F_{\text{Fac}}\) | GANs [156] |
\((a_{w} + p_{w} + a_{win} + p_{win} + s_{\text{Fac}}^1)\) to \((a_{w} + p_{w} + a_{win} + p_{win} + s_{\text{Fac}}^1)\) | GAN [157] |
Where \(y_{\text{Fac}}\) is the generated architectural facade and \(F_{\text{Fac}}\) is the collection of all the options. \(O_{\text{Fac}}\) refers to the objective conditions of the architectural facade, which include performance evaluations of the facade (\(o_{\text{Fac}}^1\)), such as daylighting, heat insulation, and thermal retention; the architectural plan design (\(o_{\text{Fac}}^2\)), the result of the second step in the design process; the architectural structural system design (\(o_{\text{Fac}}^3\)), the outcome of the third step; and the architectural 3D form refinement and optimization design (\(o_{\text{Fac}}^4\)), the result of the fourth step. \(S_{\text{Fac}}\) refers to the subjective decisions of the facade design, which include facade component elements (\(s_{\text{Fac}}^1\)), the specific facade component styles employed by the architect, reflecting the designer’s style and concept, and the materials and style of the facade (\(s_{\text{Fac}}^2\)); different materials bring various textures and colors to the building, exhibiting unique architectural characteristics and styles.
Subsequently, the final facade design outcome is achieved by summing the facade design results from each direction. This process can be expressed by Equation (10).
\[R_{\text{Fac}} = \sum_{i=1}^{4} F_{\text{Fac}}^i\]
Each direction’s facade design outcome \(y_{\text{Fac}}\) encapsulates various facade information, such as the area (\(a_{w}\)) and position (\(p_{w}\)) of the wall surface, the area (\(a_{win}\)) and position (\(p_{win}\)) of the window surface, and the adoption of a specific style for the facade components (\(c_{\text{Fac}}\)). This information is derived from the set of objective conditions \(O_{\text{Fac}}\) and the set of subjective decisions \(S_{\text{Fac}}\).
In the Bo-DAA apartment project, architects use the architectural plan (\(o_{\text{Fac}}^2\)) to define windows and walls, incorporating glass curtain walls on the ground floor. The structural design (\(o_{\text{Fac}}^3\)) guides facade structuring, ensuring alignment with the building’s structure. The refined 3D model (\(o_{\text{Fac}}^4\)) influences the facade’s shape, with residential windows designed to complement the building’s form. Facade performance is enhanced through simulations (\(o_{\text{Fac}}^1\)). Material selection (\(s_{\text{Fac}}^2\)) favors exposed concrete, echoing Brutalist aesthetics (\(s_{\text{Fac}}^1\)), resulting in a minimalist, sculptural facade (\(R_{\text{Fac}}\)). This process is illustrated in Figure 14.
The applications of generative AI in architectural facade design fall into two main categories, as shown in Table 5: generating \(F_{\text{Fac}}\) based on 2D images (usually from \(s_{\text{Fac}}^2\), \(F_{\text{Fac}}\), or a semantic segmentation map of the facade); and generating semantic segmentation maps of facades based on 2D images.
In generating architectural facades, generative AI first facilitates the generation of facade images by utilizing architectural facade semantic segmentation maps, which annotate the precise location and form of facade elements such as walls, window panes, and other components. This process generates facade images under the constraints of a given wall area (\(a_{w}\)) and position (\(p_{w}\)), window area (\(a_{win}\)) and position (\(p_{win}\)), and component elements (\(s_{\text{Fac}}^1\)), represented as (\(a_{w} + p_{w} + a_{win} + p_{win} + s_{\text{Fac}}^1\)) to \(F_{\text{Fac}}\) [145]–[152]. Furthermore, complete facade and roof images for all four directions of a building can be generated from semantic segmentation images (\(a_{w} + p_{w} + a_{win} + p_{win} + s_{\text{Fac}}^1\) to \(R_{\text{Fac}}\)) [153]. Additionally, generative AI proves instrumental in training on architectural facade images for both reconstruction and novel generation (\(F_{\text{Fac}}\) to \(F_{\text{Fac}}\)) [154]. Its utility is further demonstrated in applying style transfer to architectural facades, either by incorporating style images (\(F_{\text{Fac}} + s_{\text{Fac}}^2\) to \(F_{\text{Fac}}\)) [156], [158] or by facilitating style transfer between facade images of diverse architectural styles (\(F_{\text{Fac}}\) to \(F_{\text{Fac}}\)) [155].
In generating semantic segmentation maps for architectural facades, generative AI can be employed for the reconstruction and generation of facade semantic segmentation maps, such as rebuilding the occluded parts of a semantic segmentation map based on the unobstructed parts (\(a_{w} + p_{w} + a_{win} + p_{win} + s_{\text{Fac}}^1\) to \(a_{w} + p_{w} + a_{win} + p_{win} + s_{\text{Fac}}^1\)) [157].
Architectural image expression synthesizes design elements into 2D images, reflecting the architect’s vision and design process. This process can be expressed by Equation (11).
| Data Transformation Approach | Paper & Methodology |
|---|---|
| \(parameter\) to \(F_{\text{Img}}\) | GANs [159] |
| \(s_{Image\_text}^1\) to \(F_{\text{Img}}\) | GANs [160]–[162]; DMs [4], [163]–[166]; GANs, DMs [5] |
| (\(s_{Image\_text}^1 + F_{\text{Img}}\)) to \(F_{\text{Img}}\) | DMs [167]–[170]; GANs [171], [172]; GANs, DMs [173]; GANs, CLIP [174], [175] |
| (\(o_{\text{Img}}^3 + F_{\text{Img}}\)) to \(F_{\text{Img}}\) | GANs [176] |
| \(s_{Image\_mask}^1\) to \(F_{\text{Img}}\) | GANs [177]; CycleGAN [178] |
| \(s_{\text{Img}}^2\) to \(F_{\text{Img}}\) | GANs [179]–[182] |
| \(F_{\text{Img}}\) to \(F_{\text{Img}}\) | GANs [183]–[186] |
| \(s_{\text{Img}}^2\) to \(s_{\text{Img}}^2\) | GAN [187] |
| \(F_{\text{Img}}\) to \(s_{\text{Img}}^1\) | VAE [188] |
\[F_{\text{Img}} = \Bigg\{ y_{\text{Img}} \, \Bigg| \, y_{\text{Img}} \in \bigcap_{i=1}^{4} f_{\text{Img}}(o_{\text{Img}}^i) \cap \bigcap_{j=1}^{2} f_{\text{Img}}(s_{\text{Img}}^j) \Bigg\}\]
Where \(y_{\text{Img}}\) is a generated architectural image and \(F_{\text{Img}}\) is the collection of all options. \(O_{\text{Img}}\) refers to the objective conditions of the architectural image expression, which include: architectural plan design (\(o_{\text{Img}}^1\)), the result of the second step in the design process; architectural structural system design (\(o_{\text{Img}}^2\)), the result of the third step in the design process; the refined 3D form of the architecture (\(o_{\text{Img}}^3\)), the result of the fourth step in the design process; and architectural facade design (\(o_{\text{Img}}^4\)), the result of the fifth step in the design process. \(S_{\text{Img}}\) refers to the subjective decisions of the architectural image expression, which include aesthetic principles (\(s_{\text{Img}}^1\)) and image style (\(s_{\text{Img}}^2\)), the principles architects use to control the composition and style of architectural images.
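The set-intersection formulation above can be read operationally as filtering a pool of candidate images through every objective and subjective predicate. The sketch below is a toy reading of that idea with invented candidates and predicates (the dictionary fields and checks are assumptions for illustration only).

```python
# Toy sketch: the intersection in Equation (11) read as set filtering.
# Each predicate stands in for one objective condition o_Img^i or
# subjective decision s_Img^j; candidates and checks are invented.
def satisfies_all(candidate, objective_checks, subjective_checks):
    return all(check(candidate) for check in objective_checks + subjective_checks)

candidates = [
    {"id": 1, "floors": 3, "style": "brutalist"},
    {"id": 2, "floors": 9, "style": "brutalist"},
    {"id": 3, "floors": 3, "style": "baroque"},
]
objective = [lambda c: c["floors"] <= 5]            # e.g. plan/structure limits
subjective = [lambda c: c["style"] == "brutalist"]  # e.g. intended style

F_img = [c for c in candidates if satisfies_all(c, objective, subjective)]
print([c["id"] for c in F_img])  # → [1]
```

Only candidates lying in the intersection of all condition sets survive, which is exactly what \(F_{\text{Img}}\) denotes.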
The applications of generative AI in architectural image expression fall into three main categories, as shown in Table 6: generating architectural images \(F_{\text{Img}}\) from 1D text (usually from \(parameter\) or \(s_{Image\_text}^1\)); generating architectural images \(F_{\text{Img}}\) from 2D images (usually from \(F_{\text{Img}}\), \(s_{Image\_mask}^1\), or \(s_{\text{Img}}^2\)); and generating images of different architectural styles or semantic images (\(s_{\text{Img}}^1\), \(s_{\text{Img}}^2\)) from 2D images (usually from \(s_{\text{Img}}^2\) or \(F_{\text{Img}}\)).
In generating architectural images based on 1D text, researchers employ linear interpolation techniques to create architectural images from varying perspectives (\(parameter\) to \(F_{\text{Img}}\)) [159]. Moreover, the direct generation of architectural images from textual prompts simplifies and streamlines the process (\(s_{Image\_text}^1\) to \(F_{\text{Img}}\)) [4], [5], [161], [163]–[165]. This approach is also effective for generating architectural interior images, as demonstrated by the use of Stable Diffusion for interior renderings (\(s_{Image\_text}^1\) to \(F_{\text{Img}}\)) [166].
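The interpolation idea behind [159] can be sketched without any model: walking between two latent codes yields a smooth sequence of intermediate generations. Spherical interpolation (slerp) is a common choice for GAN latents because it follows the geometry of the Gaussian prior; the vectors below are random stand-ins, not outputs of any cited system.

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between two latent vectors, as commonly
    used to walk a GAN latent space between two generated views."""
    z0n, z1n = z0 / np.linalg.norm(z0), z1 / np.linalg.norm(z1)
    omega = np.arccos(np.clip(np.dot(z0n, z1n), -1.0, 1.0))
    if np.isclose(omega, 0.0):           # nearly parallel: fall back to lerp
        return (1 - t) * z0 + t * z1
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

rng = np.random.default_rng(0)
z_a, z_b = rng.normal(size=512), rng.normal(size=512)   # two latent codes
path = [slerp(z_a, z_b, t) for t in np.linspace(0, 1, 5)]
print(np.allclose(path[0], z_a), np.allclose(path[-1], z_b))  # → True True
```

Decoding each point on `path` with a trained generator would produce the gradual perspective transition described in the text.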
In generating architectural images based on 2D images, several researchers have trained generative AI models on architectural images and their paired textual prompts, facilitating the creation of architectural images from those prompts (\(s_{Image\_text}^1 + F_{\text{Img}}\) to \(F_{\text{Img}}\)) [160], [167]–[175]. Additionally, researchers utilize generative AI models conditioned on the refined 3D form together with architectural images to generate new architectural images (\(o_{\text{Img}}^3 + F_{\text{Img}}\) to \(F_{\text{Img}}\)) [162], [176]. Furthermore, architectural images can be generated directly from image semantic labels (masks) or textual descriptions; generating from semantic labels offers precise control over the content of the generated images (\(s_{Image\_mask}^1\) to \(F_{\text{Img}}\)) [177], [178]. Researchers have also explored the transformation of architectural images across different styles, such as generating architectural images from sketches or line drawings (\(s_{\text{Img}}^2\) to \(F_{\text{Img}}\)) [179]–[182]. By leveraging generative AI models, architectural images can undergo style blending, where images are generated from two input images, enhancing the versatility of architectural visualization (\(F_{\text{Img}}\) to \(F_{\text{Img}}\)) [186]. GAN models have been employed to generate comfortable underground space renderings from virtual 3D space images (\(F_{\text{Img}}\) to \(F_{\text{Img}}\)) [183] and to create interior decoration images from 360-degree panoramic interior images (\(F_{\text{Img}}\) to \(F_{\text{Img}}\)) [184]. Moreover, using StyleGAN2 to generate architectural facade and floor plan images (\(F_{\text{Img}}\) to \(F_{\text{Img}}\)) [185] serves as a basis for establishing 3D architectural models.
In generating images of different architectural styles or semantic images from 2D images, generative AI can be instrumental in the reconstruction and generation of architectural line drawings (\(s_{\text{Img}}^2\) to \(s_{\text{Img}}^2\)) [187], and it is capable of producing semantic style images that correspond to architectural images (\(F_{\text{Img}}\) to \(s_{\text{Img}}^1\)) [188].
In this section, we illustrate potential future research directions for applying generative AI in architectural design using the latest emerging techniques of image, video, and 3D form generation (Section 2).
Researchers have applied various generative AI image generation techniques to the design and generation of architectural plan images. As technology advances, architects can gradually incorporate more conditional constraints into the generation of floor plans, allowing generative AI to take over parts of the architect’s thought process. Architects can supply text data to the generative models; such text encompasses client design requirements and architectural design standards (\(o_{\text{Plan}}^2\)), including building type, occupancy, spatial needs, dimensions of architectural spaces, evacuation route settings and dimensions, and fire safety layout standards. Architects can also supply image data to the generative models, such as site plans (\(o_{\text{P-3D}}^2\)), which define the specific land use of architectural projects, nearby buildings and natural features (\(o_{\text{P-3D}}^3\)), as well as floor layout diagrams (\(s_{\text{Plan}}^1\)) or spatial sequence diagrams (\(s_{\text{Plan}}^2\)).
Based on the aforementioned method, some generative AI models hold developmental potential in architectural floor plan generation. "Scene Graph" is a data structure capable of intricately describing the elements within a scene and their interrelations, consisting of nodes and edges. This structure is particularly suited for depicting the connectivity within architectural floor plans. By integrating diffusion models, SceneGenie[46] can accurately generate architectural floor plans using Scene Graphs. Furthermore, technologies such as Stable Diffusion[9] and Imagen[189] allow for further refinement in the generation process of architectural floor plans through text prompts and layout controls.
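The scene-graph structure described above can be sketched concretely: rooms as nodes, connectivity as labeled edges. The room names, attributes, and relations below are invented examples; a model such as SceneGenie would condition floor-plan generation on a graph of this kind.

```python
# Minimal scene-graph sketch for a floor plan: nodes are rooms, edges are
# labeled adjacency/door relations. All names and values are illustrative.
scene_graph = {
    "nodes": {
        "living":  {"type": "living_room", "area": 25.0},
        "kitchen": {"type": "kitchen",     "area": 10.0},
        "bedroom": {"type": "bedroom",     "area": 14.0},
    },
    "edges": [
        ("living", "kitchen", "adjacent"),
        ("living", "bedroom", "door"),
    ],
}

def neighbors(graph, room):
    """Rooms connected to `room`, with the relation label on each edge."""
    out = []
    for a, b, rel in graph["edges"]:
        if a == room:
            out.append((b, rel))
        elif b == room:
            out.append((a, rel))
    return out

print(neighbors(scene_graph, "living"))
# → [('kitchen', 'adjacent'), ('bedroom', 'door')]
```

Encoding connectivity explicitly like this is what makes scene graphs well suited to expressing the room-adjacency constraints of a floor plan.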
As shown in Figure 15, existing generative models, such as Stable Diffusion[9] and Imagen[189], can generate complete architectural designs based on textual input. However, the generated images often fail to meet professional standards and may lack rational layout adherence to designers’ intentions. Nonetheless, with the advancement of conditional image generation, it is now possible to incorporate additional constraints such as bounding boxes to control the generation process of diffusion models. This integration holds promise for aligning with layout considerations in architectural design.
Generative AI has been applied to facade generation based on semantic segmentation, textual descriptions, and facade image style transfer. These advancements have made the facade generation process more efficient. As generative AI technology continues to advance, researchers can develop more efficient and superior facade generation models. For instance, architects can provide generative AI models with conditions such as facade sketches, facade mask segmentation images, and descriptive terms for facade generation. These conditions can assist architects in generating corresponding high-quality facade images, streamlining the facade design process, and enhancing design efficiency.
The key to applying generative models to architectural design lies in integrating professional architectural data with computational data. As illustrated in Figure 16, layout and segmentation masks can often represent the facade information in architecture in 2D image generation. The architectural constraints can serve as hyperparameter inputs to guide the image generation process by the generative model.
The various methods of generative AI in image generation have also shown unique potential in creating architectural facade images, such as GLIGEN [43] and MultiDiffusion [190]. Moreover, with the development of generative AI technology, ControlNet[38] can precisely control the content generated by diffusion models by adding extra conditions. It is applicable to the style transfer of architectural facades and can enrich facade designs with detailed elements such as brick textures, window decorations, or door designs. In addition, ControlNet can be used to adjust specific elements in facade drawings, for instance, altering window shapes, door sizes, or facade colors, thereby enhancing the personalization and creativity of the design. Simultaneously, analyzing the style and characteristics of surrounding buildings ensures that the new facade design harmonizes with its environment, maintaining consistency in the scene.
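The idea of editing only selected facade elements while preserving the rest can be illustrated, in a deliberately simplified form, as mask-constrained compositing: the generator is allowed to change pixels only inside a designated region. This toy sketch is not ControlNet itself (which injects conditions into the diffusion process); it just shows the spatial-constraint principle with stand-in arrays.

```python
import numpy as np

# Toy illustration of spatially constrained editing (not ControlNet itself):
# a binary mask selects which facade regions a generator may change;
# everything outside the mask is kept from the original image.
def masked_edit(original, generated, mask):
    """Composite: generated content inside the mask, original elsewhere."""
    return np.where(mask, generated, original)

original = np.zeros((4, 4))           # stand-in facade image
generated = np.ones((4, 4))           # stand-in model output
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                 # e.g. the window region to redesign

edited = masked_edit(original, generated, mask)
print(int(edited.sum()))              # → 4 (only the 4 masked pixels changed)
```

Conditional diffusion pipelines apply the same kind of region selection, but at each denoising step and in latent space rather than as a single pixel composite.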
The text-to-image generation method is capable of producing creative architectural concept designs (\(F_{\text{Img}}\)) based on brief descriptions or a set of parameters (\(s_{Image\_text}^1, s_{Image\_text}^2\)). The image-to-image generation method enables the generation of architectural images possessing consistent features or styles. This offers the potential to explore architectural forms and spatial layouts yet to be conceived by human designers. Automatically generated architectural concepts can serve as design inspiration, helping designers break traditional mindsets and explore new design spaces. Simultaneously, diffusion probabilistic models can generate realistic architectural rendering images suitable for Virtual Reality (VR) or Augmented Reality (AR) applications, providing designers and clients with immersive design evaluation experiences. This advances higher quality and interactive architectural visualization technologies, making the design review and modification process more intuitive and efficient.
Stable Diffusion[9], DALLE-3[34], and GLIDE[36] have been significantly applied in the domain of architectural image generation, demonstrating robust capabilities in image synthesis. ControlNet[38], with its exceptional controllability, has increasingly been utilized by architects to generate architectural images and style transfer, substantially enriching design creativity and enhancing design efficiency. Similarly, GLIGEN[43] and SceneGenie[46] have shown potential in the control of image content, which also holds significant value in the generation and creation of architectural imagery.
The application of generative AI-based video generation in architectural design has multiple development directions. Through generative AI technology, performance videos can be produced from a single architectural rendering (\(F_{\text{Img}}\)) along with relevant textual descriptions (\(s_{Image\_text}^1\), \(s_{Image\_text}^2\)). Future advancements include compiling multiple images of a structure from various angles to craft a continuous video narrative. Such an approach diversifies presentation techniques and streamlines the design process, yielding significant time and cost savings.
In the field of architectural video generation, Make-A-Video[48], DynamiCrafter[50], and PIKA[49] each showcase their strengths, bringing innovative presentation methods to the forefront. Make-A-Video transforms textual descriptions into detailed dynamic videos, enhancing the visual impact and augmenting audience engagement, enabling designers to depict the architectural transformations over time through text effortlessly. DynamiCrafter employs text-prompted technology to infuse static images with dynamic elements, such as flowing water and drifting clouds, with high-resolution support ensuring the preservation of details and realism. PIKA, conversely, demonstrates unique advantages in dynamic scene transformation, supporting text-driven element changes, allowing designers to maintain scene integrity while presenting dynamic details, thereby offering a rich and dynamic visual experience.
With advancements in diffusion models, current generative models can now produce high-quality effect videos. As shown in Figure 17, the first and second rows depict effect demonstration videos generated from input images using PIKA[49], where the buildings undergo minor movements and scaling while maintaining consistency with the surrounding environment. DynamiCrafter[50] can generate rotating buildings, as demonstrated in the third row, where the model predicts architectural styles from different angles and ensures consistent generation. From GANs to diffusion models, mature image-to-image style transfer models have been implemented. The application of these models ensures that the generated videos exhibit the desired presentation effects, greatly expanding the application scenarios for videos.
In the outlook for future technologies, the application of generative AI for partial style transfer in architectural video content paves the way to new frontiers in architectural visual presentation. This technology enables designers to replicate an overall style and, more importantly, precisely select which parts of the video should undergo style transformation. Deep learning-based neural style transfer algorithms have proven their efficacy in applying style transfer to images and video content. These algorithms achieve style transformation by learning specific features of a target style image and applying these features to the original content. This implies that distinct artistic styles or visual effects can be applied to selected video portions in architectural videos. Local video style transfer opens up novel possibilities in the architectural domain, allowing designers and researchers to explore and present architectural and urban spaces in ways never before possible. By precisely controlling the scope of style transfer application, unique visual effects can be created, thereby enhancing architectural videos’ expressiveness and communicative value.
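Neural style transfer, as referenced above, typically represents "style" through Gram matrices of convolutional feature maps: channel-by-channel correlations that capture texture statistics independently of spatial layout. The sketch below computes that statistic with NumPy on a random stand-in feature map (no real network is involved); restricting the computation to a masked subregion is what enables the local transfer described in the text.

```python
import numpy as np

def gram_matrix(features):
    """Style statistic used in neural style transfer: channel-by-channel
    correlations of a (C, H, W) feature map, normalized by H*W."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (h * w)

rng = np.random.default_rng(1)
fmap = rng.normal(size=(8, 16, 16))   # stand-in for a CNN feature map
g = gram_matrix(fmap)
print(g.shape, bool(np.allclose(g, g.T)))  # → (8, 8) True
```

A style loss then penalizes the difference between the Gram matrices of the generated frame and the target style image; applying it only within selected spatial masks restricts the transfer to chosen parts of each video frame.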
PIKA[49] showcases significant advantages in style transfer applications for architectural video content, offering robust support for visual presentation and research within the architectural realm. This technology enables designers and researchers to perform precise and flexible style customization for architectural videos, facilitating style transfers tailored to specific video content. Notably, PIKA allows for the style transfer of specific elements or areas within a video instead of a uniform transformation across the entire content. This capability of localized style transfer enables the accentuation of certain architectural features or details, such as presenting a segment of classical architecture in a modern or abstract artistic style, thereby enhancing the video’s appeal and expressiveness. Furthermore, PIKA excels in maintaining video content’s coherence and visual consistency. By finely controlling the extent and scope of the style transfer, PIKA ensures that the video retains its original structure and narrative while integrating new artistic styles, resulting in an aesthetically pleasing and authentic final product. Additionally, PIKA’s style transfer technology is not confined to traditional artistic styles but is also adaptable to various complex and innovative visual effects, providing a vast canvas for creative expression in architectural video content. Whether emulating the architectural style of a specific historical period or venturing into unprecedented visual effects, PIKA is equipped to support such endeavors.
Generating 3D building forms using architectural images, such as site information (\(o_{\text{P-3D}}^2\)), or text prompts, such as design requirements (\(o_{\text{P-3D}}^1\)), as input can improve modeling efficiency.
In architectural 3D modeling, technologies such as DreamFusion [51], Magic3D [52], CLIP-NeRF [54], and DreamCraft3D[55] have emerged as revolutionary architectural design and visualization tools. They empower architects and designers to directly generate detailed and high-fidelity 3D architectural models from textual descriptions or 2D images, significantly expanding the possibilities for architectural creativity and enhancing work efficiency. Specifically, as shown in Figure 18, DreamFusion[51] and Magic3D[52] allow designers to swiftly create architectural 3D model prototypes through simple text descriptions, accelerating the transition from concept to visualization. Designers can easily modify textual descriptions and employ these tools for iterative design, exploring various architectural styles and forms to optimize design schemes. Moreover, CLIP-NeRF[54] and DreamCraft3D[55] enable designers to extract 3D information from existing architectural images, facilitating the precise reconstruction of historical buildings or current sites for restoration, research, or further development. Additionally, designers can create unique visual effects in 3D models by transforming and fusing image styles, further enhancing the artistic appeal and attractiveness of architectural representations.
With the advancements in generative AI for 3D model generation, generative AI can produce architectural 3D models with specific styles and textures based on the input of preliminary architectural 3D models (\(S_{\text{P-3D}}\)) (\(F_{\text{P-3D}} + S_{\text{P-3D}}\) to \(F_{\text{D-3D}}\)). If this technology enables the modification and editing of 3D models based on highly personalized design requirements and allows designers to make real-time adjustments, it will significantly enhance the efficiency of architectural creation and enrich the avenues for architectural design.
GaussianEditor [191] and Magic123[56] demonstrate their applications and advantages in generating and editing detail styles for architectural 3D models by offering designers greater creative freedom and control over editing. As shown in Figure 19, GaussianEditor’s Gaussian semantic tracing and Hierarchical Gaussian Splatting enable more precise and intuitive editing of architectural details. At the same time, Magic123’s two-stage approach facilitates the transformation from complex real-world images to detailed 3D models, as shown in Figure 20. The development of these technologies heralds a future in architectural design and visualization characterized by a richer diversity and higher customization of 3D architectural models.
As society evolves and technology advances, the challenges faced by architectural design become increasingly complex, requiring consideration of more factors. Traditional design methods demand extensive time from designers to meet requirements and adjust designs. Moreover, user needs are becoming more diverse, making it a significant issue to better reflect human requirements in design, which necessitates more intelligent tools for realization. At the same time, the rapid development of AI technology, especially generative AI, offers the possibility for more intelligent architectural design. Based on these needs and visions, future generative AI will not only assist in the architectural design process but will also, based on human-centric design principles, receive multimodal inputs, including text, images, and sound, and through intelligent processing, quickly understand design requirements and adjust design schemes, thereby generating designs that align with the architect’s vision. Such architectural design AI large models will be similar to existing co-pilot models but with further enhanced functionality and intelligence.
Realizing this large model requires training the AI model on a vast amount of architectural design data and user feedback to enable it to understand complex design requirements. It also necessitates multimodal input processing, developing technologies capable of handling various types of inputs, such as text, images, and sound, to increase the model’s application flexibility. In addition, developing intelligent interaction interfaces is essential; user-friendly interfaces allow architects to communicate intuitively with the AI model, state their needs, and receive feedback. Finally, the model should provide customized output designs, generating multiple design options based on the input requirements and data for architects to choose and modify.
However, realizing this architectural design AI large model faces numerous challenges: 1) data collection and processing: high-quality training data is critical to the performance of AI models, and efficiently collecting and processing a vast amount of architectural design data is a significant challenge; 2) fusion of multimodal inputs: effectively integrating information from different modalities to improve the model’s accuracy and application scope requires further technological breakthroughs; 3) user interaction optimization: designing an interface that aligns with architects’ habits and enables accessible communication with the AI model is crucial for the technology’s implementation; 4) design quality: ensuring that AI-generated designs meet practical needs while being innovative and personalized is critical for technological development. By addressing these challenges, the future may see the realization of generative AI models that truly aid architectural design, improving design efficiency and quality and achieving human-centric architectural design optimization.
The field of generative models has witnessed unparalleled advancements, particularly in image generation, video generation, and 3D content creation. These developments span across various applications, including text-to-image, image-to-image, text-to-3D, and image-to-3D transformations, demonstrating a significant leap in the capability to synthesize realistic, high-fidelity content from minimal inputs. The rapid advancement of generative models marks a transformative phase in artificial intelligence, where synthesizing realistic, diverse, and semantically consistent content across images, videos, and 3D models is becoming increasingly feasible. This progress paves new avenues for creative expression and lays the groundwork for future innovations in digital architectural design process. As the field continues to evolve, further exploration of model efficiency, controllability, and domain-specific applications will be crucial in harnessing the full potential of generative AI models for a broad spectrum of architectural design.
In conclusion, the integration of generative AI into architectural design represents a significant leap forward in the realm of digital architecture. This advanced technology has shown exceptional capability in generating high-quality, high-resolution images and designs, offering innovative ideas and enhancing the creative process across various facets of architectural design. As we look to the future, it is clear that the continued exploration and integration of Generative AI in architectural design will play a pivotal role in shaping the next generation of digital architecture. This technological evolution not only simplifies and accelerates the design process but also opens up new avenues for creativity, enabling architects to push the boundaries of traditional design and explore new, innovative design spaces.
Corresponding author: zhang.ye@tju.edu.cn