April 18, 2025
Mobile GUI agents show promise in automating tasks but face generalization challenges in diverse real-world scenarios. Traditional approaches using pre-training or fine-tuning with massive datasets struggle with the diversity of mobile applications and user-specific tasks. We propose enhancing mobile GUI agent capabilities through human demonstrations, focusing on improving performance in unseen scenarios rather than pursuing universal generalization through larger datasets. To realize this paradigm, we introduce LearnGUI, the first comprehensive dataset specifically designed for studying demonstration-based learning in mobile GUI agents. It comprises 2,252 offline tasks and 101 online tasks with high-quality human demonstrations. We further develop LearnAct, a sophisticated multi-agent framework that automatically extracts knowledge from demonstrations to enhance task completion. This framework integrates three specialized agents: DemoParser for knowledge extraction, KnowSeeker for relevant knowledge retrieval, and ActExecutor for demonstration-enhanced task execution. Our experimental results show significant performance gains in both offline and online evaluations. In offline assessments, a single demonstration improves model performance, increasing Gemini-1.5-Pro’s accuracy from 19.3% to 51.7%. In online evaluations, our framework enhances UI-TARS-7B-SFT’s task success rate from 18.1% to 32.8%. The LearnAct framework and the LearnGUI benchmark establish demonstration-based learning as a promising direction for more adaptable, personalized, and deployable mobile GUI agents. The project resources are available at https://lgy0404.github.io/LearnAct.
Figure 1: A toy example for demonstration learning on mobile GUI Agent. We build a benchmark named LearnGUI for demonstration learning on Mobile GUI Agent, which provides different few-shot task combinations and offers multi-dimensional metrics including task similarity, UI similarity, and action similarity between support tasks and query tasks.
Mobile device automation has evolved significantly over time, from simple rule-based scripts to sophisticated AI-powered agents [1]–[4]. Traditional automation approaches like Robotic Process Automation (RPA) [5] and rule-based shortcuts [6], [7] relied on predefined scripts to execute repetitive tasks, but they struggled with dynamic interfaces, required frequent maintenance when apps updated, and lacked understanding of complex user intentions.
More recently, mobile Graphical User Interface (GUI) agents have emerged as a transformative technology with the potential to revolutionize how humans interact with mobile devices. These agents leverage Large Language Models (LLMs) to autonomously complete human tasks through environmental interaction [8]–[16]. They perceive the state of the mobile device by observing screens (through screenshots or application UI trees) and generate actions (such as CLICK, TYPE, SWIPE, PRESS_BACK, PRESS_HOME, and PRESS_ENTER) that are executed via the phone user interface [1]–[4]. By harnessing the powerful perception and reasoning capabilities of LLMs, mobile GUI agents have the potential to fundamentally change how people interact with their mobile devices, bringing to life the "J.A.R.V.I.S." effect seen in science fiction.
Despite these promising advances, mobile GUI agents continue to face significant challenges in real-world deployment scenarios. The immense diversity of mobile applications and user interfaces creates pervasive long-tail scenarios where current agents struggle to perform effectively. The prevailing approaches to building modern mobile GUI agents rely on either the inherent capabilities of general-purpose LLMs [8]–[14], [17] or fine-tuning with large volumes of data [18]–[21]. However, these methods face fundamental limitations when confronted with diverse real-world usage scenarios. As of 2025, billions of users interact with 1.68 million applications on Google Play alone [4], each with unique task requirements and UI layouts [2], [3]. Pre-training or fine-tuning datasets cannot feasibly cover this immense variety, leading to poor performance in unseen scenarios and hindering the widespread adoption of mobile GUI agents [22], as illustrated in Figure [fig:overall] (left side). Traditional approaches simply cannot cover the entire spectrum of possible interactions and user-specific requirements across this heterogeneous landscape.
To address these limitations, we propose a novel paradigm that enhances mobile GUI agent capabilities through few-shot demonstration learning. Unlike traditional approaches that either lack flexibility or require massive datasets, our demonstration-based approach achieves both robustness and personalization by learning from a small number of user-provided examples. We recognize that mobile users have unique, repetitive tasks with inherent variability—such as smart home control with dynamic configurations, health monitoring with personalized parameters, or enterprise software with company-specific layouts. These scenarios combine stable patterns with variable elements, creating a "personalization gap" that pre-trained models cannot bridge. By leveraging user-specific demonstrations, our approach enables personalized assistants that learn both consistent patterns and adaptation strategies, acquiring task-specific knowledge impossible to cover in general training datasets. This personalization allows mobile GUI agents to overcome performance bottlenecks and provide truly helpful automation for the tasks users most want to delegate.
To fill the gap in high-quality demonstration data, we introduce LearnGUI, the first dataset specifically designed to research and evaluate mobile GUI agents’ ability to learn from few-shot demonstrations. Built upon AMEX [23] and AndroidWorld [24], LearnGUI comprises 2,252 offline few-shot tasks and 101 online tasks with high-quality human demonstrations. This dataset enables systematic research into demonstration-based learning for mobile GUI agents. A toy example for LearnGUI is shown in Figure 1.
Furthermore, we present LearnAct, a multi-agent framework that automatically understands human demonstrations, generates instructional knowledge, and uses this knowledge to assist mobile GUI agents in reasoning about unseen scenarios. LearnAct consists of three specialized agents: (1) DemoParser, a knowledge generation agent that extracts usable knowledge from demonstration trajectories to form a knowledge base; (2) KnowSeeker, a knowledge retrieval agent that searches the knowledge base for demonstration knowledge relevant to the current task; and (3) ActExecutor, a task execution agent that combines user instructions, real-time GUI environment, and retrieved demonstration knowledge to perform tasks effectively.
Our experimental results decisively validate the effectiveness of demonstration-based learning for mobile GUI agents, as shown in Figure [fig:overall] (right side). In offline evaluations, a single demonstration dramatically improves model performance across diverse scenarios, with the most striking results seen in Gemini-1.5-Pro [25], whose accuracy increases from 19.3% to 51.7% (a 198.9% relative improvement). Performance gains are particularly pronounced in complex applications, with accuracy in CityMapper increasing from 14.1% to 69.4% and in To-Do apps from 17.4% to 69.2%. For real-world online evaluations, our framework demonstrates exceptional effectiveness: Qwen2-VL-7B [26] with LearnAct achieves significant performance gains, while UI-TARS-7B-SFT [27]’s task success rate improves from 18.1% to 32.8% (+14.7%). These findings offer a practical pathway to developing more adaptable and personalized mobile GUI agents.
In summary, our contributions are as follows:
We develop LearnGUI, the first dataset designed for studying demonstration-based learning in mobile GUI agents, comprising 2,252 offline and 101 online tasks with high-quality human demonstrations.
We design and implement LearnAct, a sophisticated multi-agent framework that systematically extracts, retrieves, and leverages knowledge from human demonstrations. This framework includes three specialized components: DemoParser (knowledge extraction), KnowSeeker (knowledge retrieval), and ActExecutor (task execution).
Our evaluations demonstrate unprecedented performance gains: a single demonstration improves Gemini-1.5-Pro [25]’s accuracy by 198.9% in offline tests, while enhancing UI-TARS-7B-SFT [27]’s online task success rate from 18.1% to 32.8%, advancing mobile GUI agents toward greater adaptability and practical deployability.
Dataset | # Inst. | # Apps | # Step | Env. | HL | LL | GT | FS |
---|---|---|---|---|---|---|---|---|
PixelHelp [28] | 187 | 4 | 4.2 | |||||
MoTIF [29] | 276 | 125 | 4.5 | |||||
UIBert [30] | 16,660 | - | 1 | |||||
UGIF [31] | 523 | 12 | 6.3 | |||||
AITW [32] | 30,378 | 357 | 6.5 | |||||
AITZ [33] | 2,504 | 70 | 7.5 | |||||
AndroidControl [22] | 15,283 | 833 | 4.8 | |||||
AMEX [23] | 2,946 | 110 | 12.8 | |||||
MobileAgentBench [34] | 100 | 10 | - | |||||
AppAgent [35] | 50 | 10 | - | |||||
LlamaTouch [36] | 496 | 57 | 7.01 | |||||
AndroidWorld [24] | 116 | 20 | - | |||||
AndroidLab [37] | 138 | 9 | 8.5 | |||||
LearnGUI (Ours) | 2,353 | 73 | 13.2 |
Mobile GUI Datasets and Environments. The development of mobile GUI agents relies heavily on high-quality datasets for training and evaluation. Table 1 compares LearnGUI with existing mobile GUI datasets and benchmarks. These resources can be broadly categorized into static datasets and dynamic benchmarking environments. Static datasets [22], [23], [28]–[34] typically provide natural language task descriptions, UI states (screenshots and/or application UI trees), and corresponding user actions (CLICK, SWIPE, TYPE, and other standardized interactions). These datasets vary in scale, ranging from hundreds to tens of thousands of instructions across different applications. Recent work like AppAgent [35] has explored demonstration-based learning, but without ground truth annotations or systematic analysis, providing only 50 tasks across 10 applications with high-level instructions. Notably, the average task length varies significantly across datasets, with AMEX [23] featuring substantially longer sequences (12.8 steps on average) compared to AndroidControl [22] (4.8 steps) and AITW [32] (6.5 steps). Benchmarking environments, on the other hand, typically select a limited number of tasks and applications to provide dynamic testing environments [4]. These frameworks evaluate agent performance through metrics such as task completion rates, critical state achievements, and execution time. Examples include LlamaTouch [36], AndroidWorld [24], and AndroidLab [37], which offer interactive environments but lack few-shot demonstration capabilities.

We present the first systematic study of demonstration-based learning for mobile GUI agents through LearnGUI, which distinguishes itself through three key innovations. First, it is designed to evaluate few-shot learning capabilities with a comprehensive collection of 2,252 offline tasks and 101 online tasks. Built upon AMEX [23] and AndroidWorld [24], which feature longer, more complex tasks ideal for out-of-distribution and demonstration-based learning scenarios, LearnGUI provides a unified framework for both offline and online evaluation. Second, while the original AMEX [23] dataset contains 2,946 independent tasks unsuitable for few-shot evaluation, we conducted detailed analyses to transform and enhance this resource. Specifically, we made three key modifications: (1) Action Space Standardization, refining the original action space by removing inconsistent TASK_IMPOSSIBLE actions, enhancing TASK_COMPLETE to support information retrieval tasks, and standardizing formats for consistency; (2) K-shot Task Combinations, constructing systematic task groupings by recovering application context, computing instruction similarity within applications, and creating k-shot combinations with similar tasks as support demonstrations; and (3) Similarity Measurement, computing UI and action similarity through descriptive representations, enabling analysis of how different similarity types affect learning efficacy. Third, regarding online evaluation, AndroidWorld [24] originally provides 116 dynamically constructed tasks without human demonstration trajectories. We collected 101 high-quality human demonstrations based on AndroidWorld’s environment and dynamic instructions, forming LearnGUI-Online for evaluating the few-shot capabilities of mobile GUI agents in real-time scenarios. By addressing the limitations of existing datasets, LearnGUI enables systematic research into few-shot learning for mobile GUI agents with varying k-shot configurations and controlled similarity conditions between support and query tasks.
Mobile GUI Agents. Mobile GUI agents are intelligent systems that leverage large language models to understand, plan, and execute tasks on mobile devices by integrating natural language processing, multimodal perception, and action execution capabilities [1], [2]. Recent developments in this field have explored various approaches to enhance agent performance and generalizability. One prominent category of work focuses on designing effective prompting strategies to guide pre-trained LLMs without additional training [38]–[40]. By crafting prompts that incorporate task descriptions, interface states, and action histories, researchers can direct model behavior toward specific automation goals [8], [10], [11], [41]. These approaches leverage the inherent capabilities of general-purpose LLMs but often struggle with complex tasks. A second category involves adapting LLMs specifically for mobile automation through fine-tuning techniques [19]–[21], [42]–[45]. These methods train models on GUI-specific data to enhance their understanding of and interaction with graphical interfaces. While improving performance over pre-training approaches, these fine-tuned models require substantial training data and still face generalization challenges. Despite the progress made by both approaches, a fundamental limitation persists: the inability to generalize effectively to out-of-distribution scenarios. These methods both struggle with unseen applications, novel UI layouts, or unexpected task variations. These limitations stem from the impossibility of covering all potential real-world scenarios during training, creating significant bottlenecks in mobile GUI agent development. To address these critical challenges, we introduce LearnAct, a sophisticated multi-agent framework that learns and reasons from screenshots without requiring UI tree information. The framework extracts, retrieves, and utilizes demonstration knowledge through three specialized components, enabling effective adaptation to new scenarios with minimal demonstrations.
Mobile GUI tasks require agents to interact with digital environments by executing actions to fulfill user instructions. These tasks can be formally described as a Partially Observable Markov Decision Process (POMDP), defined as \(\mathcal{M} = (\mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{T}, \mathcal{R})\), where \(\mathcal{S}\) is the state space (current state of the mobile device), \(\mathcal{O}\) is the observation space (instructions, screenshots, UI trees, etc.), \(\mathcal{A}\) is the action space (e.g., click, type, swipe), \(\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}\) is the state transition function, and \(\mathcal{R}: \mathcal{S} \times \mathcal{A} \rightarrow [0,1]\) is the reward function. For example, a user might request the agent to "find the cheapest hotel in Paris for next weekend." The agent must perceive the current screen—either through an image or a UI tree—and execute a sequence of actions to complete the given task.
The key innovation in our approach is the integration of human demonstration knowledge into this POMDP framework. By incorporating demonstration knowledge \(\mathcal{D}\) into the decision process, we enhance the agent’s ability to handle out-of-distribution scenarios. This knowledge influences the agent’s policy \(\pi: \mathcal{O} \times \mathcal{D} \rightarrow \mathcal{A}\), which maps observations and relevant demonstration knowledge to actions, providing valuable examples of successful interaction patterns.
To study the impact of demonstration-based learning on mobile GUI agents, we need a dataset that provides various k-shot demonstrations with controlled similarity relationships between support and query tasks. This allows us to systematically investigate how demonstration quantity and task similarity affect agent performance. While cross-application knowledge transfer remains an interesting research direction, we focus on within-application task learning, as this represents the most practical use case where users would provide demonstrations for applications they frequently use.
Our dataset design specifically enables research on three key dimensions:
Unified comprehensive evaluation framework: LearnGUI provides a standardized platform for studying few-shot demonstration learning in mobile GUI agents, featuring a unified action space and evaluation protocols that reflect real-world use cases
K-shot demonstration learning: The dataset systematically explores how varying quantities of demonstrations (k=1, 2, or 3) affect agent performance, enabling research on the optimal number of examples needed
Multi-dimensional similarity analysis: LearnGUI enables investigation of how different types of similarity between demonstration and query tasks influence learning efficacy and generalization capabilities
This comprehensive approach allows for a nuanced analysis of how mobile GUI agents can leverage human demonstrations to improve task performance, especially in scenarios not covered by their training data.
The LearnGUI dataset consists of two components: LearnGUI-Offline for systematic evaluation of few-shot learning capabilities across varying similarity conditions, and LearnGUI-Online for real-time assessment in an interactive environment. Both components share a unified action space to ensure consistent evaluation, as detailed in Table 2.
Action | Definition |
---|---|
CLICK[x, y] | Click at coordinates (x, y). |
TYPE[text] | Type the specified text. |
SWIPE[direction] | Swipe in the specified direction. |
PRESS_HOME | Go to the home screen. |
PRESS_BACK | Go back to the previous app screen. |
PRESS_ENTER | Press the enter button. |
TASK_COMPLETE[answer] | Mark the task as complete. Provide answer inside brackets if required. |
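To make the unified action space concrete, the following minimal Python sketch (our own illustration rather than the released LearnGUI tooling; the class and function names are ours) shows one way to represent and parse actions written in the bracketed form above, e.g. CLICK[540, 1200] or TASK_COMPLETE[answer].

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Action types from the unified LearnGUI action space (Table 2).
ACTION_TYPES = {
    "CLICK", "TYPE", "SWIPE", "PRESS_HOME", "PRESS_BACK",
    "PRESS_ENTER", "TASK_COMPLETE",
}

@dataclass
class Action:
    kind: str                       # one of ACTION_TYPES
    argument: Optional[str] = None  # raw bracket content, if any

    @property
    def coordinates(self) -> Optional[Tuple[int, int]]:
        """Return (x, y) for CLICK actions, else None."""
        if self.kind == "CLICK" and self.argument:
            x, y = (int(v) for v in self.argument.split(","))
            return x, y
        return None

def parse_action(text: str) -> Action:
    """Parse strings like 'CLICK[540, 1200]' or 'PRESS_BACK'."""
    match = re.fullmatch(r"(\w+)(?:\[(.*)\])?", text.strip())
    if not match or match.group(1) not in ACTION_TYPES:
        raise ValueError(f"Unknown action: {text}")
    return Action(kind=match.group(1), argument=match.group(2))

# Example: parse_action("CLICK[540, 1200]").coordinates -> (540, 1200)
```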
We built LearnGUI-Offline by restructuring and enhancing the AMEX dataset [23], which contains 2,946 independent mobile tasks. To transform this resource for few-shot learning evaluation, we made several key modifications:
Action Space Standardization. We refined the original action space to better align with real-world scenarios. First, we removed TASK_IMPOSSIBLE actions due to inconsistent labeling in the original dataset, which included errors such as tasks being incorrectly marked as impossible. Second, we enhanced TASK_COMPLETE to TASK_COMPLETE[answer] for information retrieval tasks, since many mobile tasks require returning specific information rather than just a completion status. This aligns with both the AMEX [23] and AndroidWorld [24] paradigms.
K-shot Task Combinations. We constructed systematic k-shot task combinations through a multi-step process. We began by recovering the application context for each task through instruction and screenshot analysis, as the original dataset lacked explicit app labels. Next, we computed instruction similarity between tasks within the same application using the all-MiniLM-L6-v2 model. Finally, we created k-shot combinations (k=1,2,3) for each query task by selecting the k most similar tasks within the same application as support demonstrations, ensuring that the average similarity exceeded a minimum threshold of 0.6. This process yielded 2,252 tasks with valid k-shot combinations.
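As a concrete reference, the grouping step can be sketched as follows. This is a simplified reconstruction of the procedure described above, assuming each task record carries an app label and an instruction; it uses the stated all-MiniLM-L6-v2 encoder and the 0.6 average-similarity threshold.

```python
from sentence_transformers import SentenceTransformer, util

def build_k_shot_combinations(tasks, k=1, min_avg_sim=0.6):
    """tasks: list of dicts with 'app' and 'instruction' keys.
    Returns {query_index: [support_indices]} for valid combinations."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode([t["instruction"] for t in tasks],
                              convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(embeddings, embeddings)  # pairwise instruction similarity

    combos = {}
    for qi, query in enumerate(tasks):
        # Candidate supports: other tasks from the same application.
        candidates = [si for si, t in enumerate(tasks)
                      if si != qi and t["app"] == query["app"]]
        if len(candidates) < k:
            continue
        ranked = sorted(candidates, key=lambda si: float(sims[qi, si]), reverse=True)
        support = ranked[:k]
        avg_sim = sum(float(sims[qi, si]) for si in support) / k
        if avg_sim >= min_avg_sim:  # keep only sufficiently similar combinations
            combos[qi] = support
    return combos
```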
Similarity Measurement. To enable multi-dimensional similarity analysis, we computed metrics across three key dimensions. For Instruction Similarity, we utilized the scores calculated during the K-shot Task Combinations process. For UI Similarity, we merged the UI trees from all steps of each task and calculated similarity using TF-IDF vectorization and cosine similarity, capturing the visual and structural similarity of interfaces. For Action Similarity, following the DemoParser approach detailed in Section 4.1, we generated descriptive representations of each action and computed embedding-based cosine similarity between task pairs.
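The UI-similarity measure admits a direct implementation with standard tooling. The sketch below is a simplified reconstruction assuming each task's merged UI trees have already been serialized to a single text string.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def ui_similarity(ui_text_a: str, ui_text_b: str) -> float:
    """UI similarity between two tasks, each represented by the text of its
    merged UI trees (all steps concatenated)."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform([ui_text_a, ui_text_b])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

# Example: ui_similarity(merged_tree_task1, merged_tree_task2) -> value in [0, 1]
```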
For evaluating mobile GUI agents in real-time interactive scenarios, we developed LearnGUI-Online based on the AndroidWorld environment [24]. While AndroidWorld provides 116 dynamically constructed task templates, it lacks human demonstration trajectories essential for few-shot learning evaluation.
We identified 101 tasks suitable for human completion, excluding 15 tasks that proved challenging for human users. We then collected high-quality human demonstrations for these tasks. For tasks with dynamic elements, we generated specific instances and recorded corresponding demonstrations.
The resulting LearnGUI-Online dataset provides a realistic testbed for evaluating few-shot learning capabilities in mobile GUI agents under authentic conditions.
Split | K-shot | Tasks | Apps | Step actions | Avg InsSim | Avg UISim | Avg ActSim | UISHActSH | UISHActSL | UISLActSH | UISLActSL |
---|---|---|---|---|---|---|---|---|---|---|---|
Offline-Train | 1-shot | 2,001 | 44 | 26,184 | 0.845 | 0.901 | 0.858 | 364 | 400 | 403 | 834 |
Offline-Train | 2-shot | 2,001 | 44 | 26,184 | 0.818 | 0.898 | 0.845 | 216 | 360 | 358 | 1,067 |
Offline-Train | 3-shot | 2,001 | 44 | 26,184 | 0.798 | 0.895 | 0.836 | 152 | 346 | 310 | 1,193 |
Offline-Test | 1-shot | 251 | 9 | 3,469 | 0.798 | 0.868 | 0.867 | 37 | 49 | 56 | 109 |
Offline-Test | 2-shot | 251 | 9 | 3,469 | 0.767 | 0.855 | 0.853 | 15 | 42 | 55 | 139 |
Offline-Test | 3-shot | 251 | 9 | 3,469 | 0.745 | 0.847 | 0.847 | 10 | 36 | 49 | 156 |
Online-Test | 1-shot | 101 | 20 | 1,423 | - | - | - | - | - | - | - |
Table 1 presents the comprehensive statistics of the LearnGUI dataset in comparison with existing datasets. With 2,353 instructions across 73 applications and an average of 13.2 steps per task, LearnGUI offers rich data for studying demonstration-based learning in mobile GUI agents. The dataset provides various k-shot combinations (k=1,2,3) for each task, along with multi-dimensional similarity metrics across instruction, UI, and action dimensions. This design enables systematic analysis of how different types and quantities of demonstrations affect learning outcomes. The similarity distribution reflects the natural variation in mobile tasks within applications, with a meaningful spread across similarity levels that allows for a detailed investigation of knowledge transfer under different conditions. A detailed visualization of these similarity distributions is provided in Appendix 8.
Figure 2: Joint distribution of UI similarity and action similarity in LearnGUI-Offline. The scatter plot shows the relationship between UI and action similarity measures across task pairs. The quadrant divisions represent our categorization of tasks into four profiles: UISHActSH, UISHActSL, UISLActSH, and UISLActSL, enabling analysis of how different similarity combinations affect learning transfer.
We divided LearnGUI-Offline into training and testing splits to enable systematic evaluation of few-shot learning capabilities. Table 3 presents the detailed statistics of these splits, including the distribution of tasks across different similarity profiles.
The training set contains 2,001 tasks for each k-shot configuration (1, 2, and 3), spanning 44 applications with an average of 13.1 steps per task. The test set includes 251 tasks per k-shot configuration across 9 applications. Both splits maintain the same action space and similarity measurement methodology.
Based on empirical analysis, we established threshold values of 0.9447 for UI similarity and 0.9015 for action similarity to classify tasks into high (SH) and low (SL) similarity categories, enabling systematic analysis of how different similarity types affect learning from demonstrations.
As shown in Figure 2, we classify tasks into four categories based on UI and action similarity:
UISHActSH: High UI similarity and high action similarity. For example, in a smart home app, two tasks that both involve adjusting the brightness of different lights in the living room would navigate through similar UI screens.
UISHActSL: High UI similarity but low action similarity. For instance, in a smart home app, turning on all lights with a single button press versus adjusting each light’s color temperature.
UISLActSH: Low UI similarity but high action similarity. For example, setting a schedule for lights versus setting a schedule for the thermostat—different UI screens but similar action patterns.
UISLActSL: Low UI similarity and low action similarity. For instance, checking security camera footage versus creating a scene that coordinates multiple devices.
This categorization enables a detailed analysis of how different types of similarity affect learning efficacy. For instance, we can investigate whether UI similarity or action similarity has a greater impact on successful knowledge transfer from demonstrations.
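As a concrete illustration, the threshold-based categorization described above reduces to a simple rule; the sketch below uses the reported thresholds, and the handling of values exactly at the boundary is our assumption.

```python
UI_THRESHOLD, ACT_THRESHOLD = 0.9447, 0.9015  # empirical thresholds reported above

def similarity_profile(ui_sim: float, act_sim: float) -> str:
    """Map a (UI similarity, action similarity) pair to one of the four profiles."""
    ui = "UISH" if ui_sim >= UI_THRESHOLD else "UISL"
    act = "ActSH" if act_sim >= ACT_THRESHOLD else "ActSL"
    return ui + act

# e.g. similarity_profile(0.96, 0.87) -> "UISHActSL"
```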
Additionally, the LearnGUI-Online test set contains 101 tasks across 20 applications. Unlike the offline dataset, these tasks are evaluated in real time through direct interaction with the mobile environment.
The comprehensive structure of LearnGUI, with its carefully designed splits and similarity classifications, provides a resource for studying how mobile GUI agents can learn from demonstrations under varying conditions of task similarity and demonstration quantity.
Figure 3: Illustration of the overall framework of LearnAct. Architecture diagram showing the three main components (DemoParser, KnowSeeker, ActExecutor) and their interconnections within the LearnAct system, including data flow from human demonstrations to execution.
Building on the insights from our LearnGUI dataset, we introduce LearnAct, a novel framework designed to break through the limitations of traditional training approaches for mobile GUI agents. Rather than pursuing universal generalization through extensive training data, LearnAct establishes demonstration-based learning as a paradigm for developing more adaptable, personalized, and practically deployable mobile GUI agents. As illustrated in Figure 3, LearnAct is a sophisticated multi-agent framework that automatically understands human demonstrations, generates instructional knowledge, and leverages this knowledge to assist mobile GUI agents in reasoning about unseen scenarios. The LearnAct framework consists of three specialized components, each addressing a critical aspect of demonstration-based learning: (1) DemoParser (Section 4.1), a knowledge generation agent that extracts usable knowledge from demonstration trajectories to form a knowledge base; (2) KnowSeeker (Section 4.2), a knowledge retrieval agent that searches the knowledge base for demonstration knowledge relevant to the current task; and (3) ActExecutor (Section 4.3), a task execution agent that combines user instructions, real-time GUI environment, and retrieved demonstration knowledge to perform tasks effectively.
Figure 4: Pipeline of DemoParser Agent. DemoParser takes task instructions with the corresponding actions and screenshots as input, outputs low-level action descriptions, and builds a knowledge base. This process transforms high-level user instructions into precise operation sequences while building a reusable domain knowledge base to improve mobile interface interaction automation efficiency.
The DemoParser transforms raw human demonstrations into structured demonstration knowledge. It takes as input a raw action sequence (consisting of coordinates-based clicks, swipes, and text inputs) along with corresponding screenshots and task instructions. It then utilizes a vision-language model to generate semantically descriptive action descriptions that capture the essence of each demonstration step (e.g., “On Search Page, click the search box, to enter keywords”). Building on these descriptions, it constructs a structured knowledge base that records both the high-level action semantics and the contexts in which they occur, as shown in Figure 4.
Formally, DemoParser implements a knowledge generation function \(G: \mathcal{I} \times \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{K}\), where \(\mathcal{I}\) represents the space of instructions, \(\mathcal{S}\) is the space of screenshot sequences, \(\mathcal{A}\) is the space of action sequences, and \(\mathcal{K}\) is the knowledge space. For each demonstration trajectory \((i, s, a) \in \mathcal{I} \times \mathcal{S} \times \mathcal{A}\), DemoParser generates a knowledge entry \(k \in \mathcal{K}\) that encapsulates the demonstration in a semantically descriptive format, converting raw coordinate-based actions (e.g., CLICK[123,456]) into meaningful operation descriptions (e.g., "click search box").
The knowledge generation process is decomposed into a sequence of description generation steps for each action in the demonstration trajectory. Let \(d_j\) represent the description for action \(a_j\), which is generated using a context-aware description function \(\delta: \mathcal{I} \times \mathcal{A}_j \times \mathcal{V}_j \times \mathcal{H}_{j-1} \rightarrow \mathcal{D}\), where \(\mathcal{V}_j\) is the visual representation of action \(a_j\) execution and \(\mathcal{H}_{j-1} = \{d_1, d_2, \ldots, d_{j-1}\}\) is the history of previous action descriptions.
Algorithm 11 in Appendix 9.3 outlines the knowledge generation process. For each demonstration, DemoParser preserves the original task instruction and action sequence while generating semantically descriptive action descriptions. These descriptions provide crucial context about the purpose and significance of each action in the demonstration, enabling more effective knowledge transfer to new scenarios.
For intermediate actions, DemoParser analyzes a visual representation of the action execution, showing before-action and after-action screenshots with the action visualized (e.g., click locations highlighted). The framework combines this visual input with the task instruction, action history, and current action to generate a description that follows a standardized format: "[On/In] [Screen Name], [Action Details], to [Purpose]". For example: "On Home Screen, tap ‘Settings’ icon, to access device configuration." For terminal actions, DemoParser processes the final screenshot, task instruction, and complete action history to generate a conclusion in the format: "[On/In] [Screen], complete task, [Reason/Answer]"
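To make the description-generation step concrete, the sketch below shows one possible shape for a knowledge entry and for the per-step description call. The `vlm` callable and the prompt wording are placeholders we introduce for illustration; the actual prompt templates are provided in the appendices.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class KnowledgeEntry:
    """One parsed demonstration: instruction, raw actions, and descriptions."""
    instruction: str
    actions: List[str]                       # e.g. "CLICK[123,456]"
    descriptions: List[str] = field(default_factory=list)

DESC_FORMAT = "[On/In] [Screen Name], [Action Details], to [Purpose]"

def describe_step(vlm: Callable, instruction: str, action: str,
                  history: List[str], before_after_image) -> str:
    """Generate one descriptive action description d_j from the instruction,
    the current action a_j, prior descriptions H_{j-1}, and a visualization
    V_j of the action's before/after screenshots (placeholder prompt)."""
    prompt = (
        f"Task instruction: {instruction}\n"
        f"Previous steps: {history}\n"
        f"Current raw action: {action}\n"
        f"Describe this step in the format: {DESC_FORMAT}"
    )
    return vlm(prompt, images=[before_after_image])
```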
A distinctive feature of DemoParser is its memory mechanism, which captures critical information observed during task execution that may be necessary for future steps. The model identifies and annotates task-relevant information that is directly related to the user’s instruction, will likely be needed in subsequent steps, and has not been previously recorded. These memory annotations are included in the action descriptions when appropriate: "[On/In] [Screen], [Action], to [Purpose]. [Memory: important information for future steps]". For example, in a shopping task, a memory annotation might capture: "[Memory: iPhone 13 Pro costs $999 with 128GB storage]". The detailed prompt for this memory mechanism is provided in Appendix 9.1.
This memory mechanism is particularly valuable for complex tasks requiring information retention across multiple steps, such as comparing prices, remembering account details, or tracking status changes. By transforming raw demonstrations into structured, semantically descriptive knowledge with memory capabilities, DemoParser enables effective knowledge transfer from human demonstrations to automated task execution.
Figure 5: Pipeline of KnowSeeker Agent. The KnowSeeker Agent converts demo trajectories from the knowledge base into a vector database. When executing user tasks, KnowSeeker retrieves the top-k relevant demos from the vector database for subsequent use. This approach enables efficient retrieval of similar demonstrations to assist with new task execution.
KnowSeeker is the retrieval component of the LearnAct framework that identifies demonstration knowledge most relevant to the current task context. As depicted in Figure 5, this agent serves as the bridge between the knowledge base generated by DemoParser and the execution environment of ActExecutor. While DemoParser focuses on transforming demonstrations into structured knowledge, KnowSeeker specializes in efficiently accessing and selecting the most applicable knowledge for a specific task, addressing the critical challenge of knowledge relevance in few-shot learning scenarios.
Formally, KnowSeeker implements a retrieval function \(R: \mathcal{I} \times \mathcal{K} \rightarrow \mathcal{K}^{(s)}\), where \(\mathcal{I}\) is the instruction space, \(\mathcal{K}\) is the knowledge base, and \(\mathcal{K}^{(s)} \subset \mathcal{K}\) is a subset of knowledge entries determined to be relevant for the given instruction. This retrieval process is crucial for effective knowledge utilization, as it filters the potentially vast knowledge base to focus exclusively on demonstrations that offer valuable insights for the current task.
The core of KnowSeeker’s retrieval mechanism relies on semantic similarity measurement between the current task instruction and the instructions associated with demonstrations in the knowledge base. This similarity-based retrieval can be formally defined as:
\[R(i, K) = \{k_j \in K \mid sim(i, i_j) \geq \tau_s \}_{j=1}^{top-k}\]
where \(i\) is the current instruction, \(i_j\) is the instruction associated with knowledge entry \(k_j\), \(sim(\cdot, \cdot)\) is a similarity function, \(\tau_s\) is a similarity threshold, and \(top-k\) indicates selection of the \(k\) most similar entries.
To implement this similarity measurement efficiently, KnowSeeker employs a two-phase approach:
Embedding Generation: Instructions are transformed into dense vector representations using a pre-trained sentence transformer model. Specifically, we utilize the all-MiniLM-L6-v2 model, which offers an optimal balance between computational efficiency and semantic representational power. This model has been fine-tuned on diverse natural language understanding tasks, making it particularly well-suited for capturing the semantic essence of mobile GUI task instructions.
Similarity Computation: The cosine similarity between embedding vectors is calculated to quantify the semantic relationship between instructions. For instructions \(i\) and \(i_j\) with corresponding embeddings \(e_i\) and \(e_j\), the similarity is computed as:
\[sim(i, i_j) = \frac{e_i \cdot e_j}{||e_i|| \cdot ||e_j||}\]
To optimize retrieval efficiency, KnowSeeker pre-computes embeddings for all instructions in the knowledge base during initialization. This approach transforms the potentially expensive operation of computing pairwise similarities during runtime into a more manageable vector comparison task. The pre-computation process is described as:
\[E = \{e_j = f_{embed}(i_j) \mid k_j \in K\}\]
where \(f_{embed}\) is the embedding function implemented by the sentence transformer model.
During task execution, when presented with a new instruction \(i\), KnowSeeker:
1. computes the embedding \(e_i = f_{embed}(i)\);
2. calculates similarities \(S = \{sim(e_i, e_j) \mid e_j \in E\}\);
3. selects the top-\(k\) knowledge entries based on similarity scores.
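These three steps map directly onto a small retrieval index. The sketch below is our simplified rendering, assuming the same all-MiniLM-L6-v2 encoder used during dataset construction and cosine similarity over normalized embeddings.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

class KnowSeekerIndex:
    """Pre-computes embeddings for all knowledge-base instructions and
    retrieves the top-k most similar entries for a new instruction."""

    def __init__(self, knowledge_entries):
        self.entries = knowledge_entries  # each entry carries its instruction
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        # E = { e_j = f_embed(i_j) }, computed once at initialization.
        self.embeddings = self.model.encode(
            [e.instruction for e in knowledge_entries], normalize_embeddings=True)

    def retrieve(self, instruction: str, top_k: int = 1, tau: float = 0.0):
        e_i = self.model.encode([instruction], normalize_embeddings=True)[0]
        sims = self.embeddings @ e_i          # cosine similarity (unit vectors)
        ranked = np.argsort(-sims)[:top_k]    # indices of the k most similar entries
        return [self.entries[j] for j in ranked if sims[j] >= tau]
```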
This approach ensures that knowledge retrieval scales efficiently with the size of the knowledge base, enabling rapid identification of relevant demonstrations even as the framework’s experiential knowledge grows over time. By systematically identifying the most relevant demonstration knowledge, KnowSeeker enables ActExecutor to perform tasks more effectively, particularly in unfamiliar scenarios.
Figure 6: Pipeline of ActExecutor Agent. The ActExecutor Agent integrates the user instruction, the current screenshots, and the demonstration knowledge retrieved by the KnowSeeker Agent into a comprehensive prompt, and executes the resulting actions step by step in the mobile environment. This approach enables demonstration-guided execution of new tasks.
ActExecutor is the execution component of the LearnAct framework that translates retrieved demonstration knowledge into effective actions in the target environment. As illustrated in Figure 6, this agent represents the culmination of the LearnAct pipeline, integrating user instructions, real-time GUI observations, and demonstration knowledge to navigate even unfamiliar mobile applications successfully. While DemoParser creates structured knowledge and KnowSeeker retrieves relevant demonstrations, ActExecutor applies this knowledge to solve practical tasks, addressing the critical challenge of knowledge utilization in few-shot learning scenarios.
ActExecutor implements the POMDP framework introduced earlier, with the critical enhancement of incorporating demonstration knowledge into the decision-making process. The execution process can be formally described as a sequential decision-making loop that iteratively selects actions \(a_t \in \mathcal{A}\) based on current observations \(o_t \in \mathcal{O}\) and demonstration knowledge \(\mathcal{D}\), following policy \(\pi: \mathcal{O} \times \mathcal{D} \rightarrow \mathcal{A}\).
The ActExecutor policy \(\pi\) is implemented through a large vision-language model that processes a carefully constructed prompt integrating all available information sources. This prompt-based policy can be expressed as:
\[\pi(o_t, \mathcal{D}) = f_{LLM}(P(i, o_t, h_{t-1}, \mathcal{D}))\]
where \(i\) is the user instruction, \(o_t\) is the current observation (screenshot), \(h_{t-1}\) is the action history up to time \(t-1\), \(\mathcal{D}\) is the retrieved demonstration knowledge, \(P\) is a prompt construction function, and \(f_{LLM}\) is the LLM-based decision function.
Algorithm 12 in Appendix 9.3 outlines the execution process. For each task, ActExecutor processes the user instruction and screenshot observations through a sequence of perception, decision, and action phases until the task is completed or a maximum step limit is reached.
The execution process integrates three key phases:
Perception Phase: ActExecutor perceives the current state of the mobile device through screenshot observations \(o_t\). These observations provide the visual context essential for understanding the available interaction options and current application state.
Decision Phase: The agent constructs a comprehensive prompt that integrates the user instruction \(i\), current observation \(o_t\), action history \(h\), and retrieved demonstrations \(\mathcal{D}\). This prompt is processed by a large vision-language model using templates detailed in Appendix 9.2, resulting in a selected action from the predefined action space described in Table 2.
Action Phase: The selected action \(a_t\) is executed in the mobile environment, generating a state transition according to the transition function \(\mathcal{T}\) of the POMDP. Additionally, the agent generates a description \(d_t\) of the executed action using a process similar to DemoParser’s description generation, which serves as part of the action history for subsequent steps.
The prompt construction function \(P\) plays a critical role in ActExecutor’s effectiveness. It integrates the agent’s role definition, demonstration examples, task and observation context, action history, and the action space definition into a comprehensive prompt that guides the model’s decision-making.
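Putting the three phases and the prompt construction together, the execution loop can be sketched as follows. The environment interface (get_screenshot, execute), the build_prompt helper, and the vlm callable are illustrative stand-ins, and parse_action refers to the action-parsing sketch given earlier; the real prompt templates are detailed in Appendix 9.2.

```python
def build_prompt(instruction, history, demos):
    """P(i, o_t, h_{t-1}, D): role definition, retrieved demonstrations, task
    context, action history, and the action space (Table 2). The screenshot
    o_t is attached to the model call as an image rather than as text."""
    return (
        "You are a mobile GUI agent. Follow the demonstrations when they apply.\n"
        f"Demonstrations:\n{demos}\n"
        f"Task: {instruction}\n"
        f"Action history: {history}\n"
        "Respond with one action: CLICK[x, y], TYPE[text], SWIPE[direction], "
        "PRESS_HOME, PRESS_BACK, PRESS_ENTER, or TASK_COMPLETE[answer]."
    )

def run_task(instruction, env, knowseeker, vlm, max_steps=30):
    """Demonstration-guided execution loop: perceive, decide, act."""
    demos = knowseeker.retrieve(instruction, top_k=1)   # retrieved knowledge D
    history = []                                        # action descriptions h_{t-1}
    for _ in range(max_steps):
        screenshot = env.get_screenshot()               # perception: observation o_t
        prompt = build_prompt(instruction, history, demos)
        action = parse_action(vlm(prompt, images=[screenshot]))  # decision: a_t = pi(o_t, D)
        if action.kind == "TASK_COMPLETE":
            return action.argument                      # answer, if the task asked for one
        env.execute(action)                             # action: state transition T(s_t, a_t)
        history.append(str(action))
    return None  # step budget exhausted
```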
This approach enables ActExecutor to leverage demonstrations as exemplars that guide its decision-making process. When faced with a novel UI state, the agent identifies analogous situations from demonstrations and adapts the demonstrated actions to the current context. This capability is particularly valuable for handling out-of-distribution scenarios where the agent lacks direct experience.
By closing the loop between demonstration knowledge and task execution, ActExecutor completes the LearnAct framework’s end-to-end pipeline for demonstration-based learning. The combination of knowledge generation (DemoParser), knowledge retrieval (KnowSeeker), and knowledge-guided execution (ActExecutor) enables effective few-shot learning for mobile GUI agents, addressing the fundamental challenge of generalization to unseen scenarios with minimal examples.
We conducted comprehensive evaluations of the LearnAct framework through both offline and online experiments. The offline experiments were performed on the LearnGUI-Offline dataset to evaluate step-by-step task execution capabilities, while the online experiments utilized the LearnGUI-Online platform to assess end-to-end task completion in real-world interactive scenarios. We evaluated a diverse set of models, including both commercial (e.g., Gemini-1.5-Pro [25]) and open-source models (e.g., UI-TARS-7B-SFT [27], Qwen2-VL-7B [26]), to demonstrate the broad applicability of our approach across different model architectures and capabilities.
The diverse similarity profiles in LearnGUI provide a unique opportunity to evaluate mobile GUI agents’ capabilities. Our experiments have two primary goals: (1) to evaluate the feasibility and effectiveness of enhancing mobile agents through few-shot demonstrations as a means to overcome the limitations of traditional pre-training or fine-tuning approaches; and (2) to investigate how different factors such as demonstration quantity (k=1,2,3) and various similarity aspects (instruction, UI, and action) influence the effectiveness of demonstration-based learning.
Implementation Details. We conducted experiments with three foundation models: Gemini-1.5-Pro [25], UI-TARS-7B-SFT [27], and Qwen2-VL-7B [26]. For all models, we set the temperature to zero to obtain deterministic responses. For Qwen2-VL-7B [26] and UI-TARS-7B-SFT [27], we employed parameter-efficient fine-tuning using LoRA with rank 64, alpha 128, and dropout probability 0.1. We targeted all modules while freezing the vision encoder to ensure computational efficiency. Training used a learning rate of 1e-5 with cosine scheduling, batch size of 1, gradient accumulation over 8 steps, a warmup ratio of 0.001, and was conducted for 1 epoch. All fine-tuning experiments were conducted on 8 NVIDIA L40S GPUs. For offline experiments, Gemini-1.5-Pro [25] was evaluated directly on the LearnGUI-Offline test set without additional training. UI-TARS-7B-SFT [27] and Qwen2-VL-7B [26] were fine-tuned on the LearnGUI-Offline training set before evaluation. For online experiments, we deployed all models except Gemini-1.5-Pro [25] (which showed limited task completion capabilities in preliminary tests despite accuracy improvements) to the LearnGUI-Online environment, using 1-shot demonstration retrieval for all LearnAct-enhanced models.
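For reference, the stated fine-tuning hyperparameters map onto a standard PEFT/Transformers configuration roughly as follows; this is a sketch of the settings listed above, not our exact training script, and the output directory name is a placeholder.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter configuration (rank 64, alpha 128, dropout 0.1, all modules targeted;
# the vision encoder is kept frozen separately when preparing the model).
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules="all-linear",
)

# Optimization settings as stated: lr 1e-5 with cosine scheduling, batch size 1,
# gradient accumulation over 8 steps, warmup ratio 0.001, one epoch.
training_args = TrainingArguments(
    output_dir="learnact-sft",   # placeholder path
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    warmup_ratio=0.001,
    num_train_epochs=1,
)
```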
Baselines. To rigorously evaluate our approach, we compared LearnAct against several baselines. These include: (1) SPHINX-GUI Agent, the original agent developed for the AMEX dataset [23], providing a reference point for task execution on similar data; (2) Zero-shot inference versions of all models (Gemini-1.5-Pro [25], UI-TARS-7B-SFT [27], and Qwen2-VL-7B [26]) within the LearnAct framework but without demonstration knowledge, maintaining identical execution environments for fair comparison; and (3) For online evaluation, we additionally compared against GPT-4o, Gemini-Pro-1.5, Claude Computer-Use, and Aguvis to benchmark against current advanced systems.
Evaluation Metrics. For offline evaluation, we adopted mainstream evaluation protocols widely used in recent mobile GUI agent research, such as UI-TARS [27] and OS-ATLAS [46]. Specifically, we measured step accuracy, which consists of two components: action type accuracy and action match accuracy. Action type accuracy measures the percentage of steps where the predicted action type (CLICK, TYPE, SWIPE, etc.) matches the ground truth. Action match accuracy measures the percentage of steps where both the action type and its parameters are correct, following standard evaluation criteria. For CLICK actions, coordinates are considered correct if they fall within 14% of the screen width from the ground truth. For TYPE actions, the content is correct if the F1 score between prediction and ground truth exceeds 0.5. For SWIPE actions, the direction must precisely match the ground truth. For other actions (e.g., PRESS_BACK), an exact match is required. For TASK_COMPLETE actions, we only verify the action type and ignore the answer field. For online evaluation, we measured the task success rate (SR), which represents the percentage of tasks completed successfully in the real-time interactive environment.
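A simplified reading of these matching criteria is sketched below, reusing the Action representation from the earlier action-space sketch; the token-level F1 and distance checks are our simplification of the full evaluation code.

```python
def f1_score(pred: str, gold: str) -> float:
    """Token-level F1 between predicted and ground-truth text."""
    p, g = pred.split(), gold.split()
    common = sum(min(p.count(tok), g.count(tok)) for tok in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def step_match(pred, gold, screen_width: int) -> bool:
    """Action match accuracy: the action type must agree and its parameters
    must satisfy the per-type criteria described above."""
    if pred.kind != gold.kind:
        return False
    if pred.kind == "CLICK":
        (px, py), (gx, gy) = pred.coordinates, gold.coordinates
        distance = ((px - gx) ** 2 + (py - gy) ** 2) ** 0.5
        return distance <= 0.14 * screen_width          # within 14% of screen width
    if pred.kind == "TYPE":
        return f1_score(pred.argument, gold.argument) > 0.5
    if pred.kind == "SWIPE":
        return pred.argument == gold.argument           # direction must match exactly
    if pred.kind == "TASK_COMPLETE":
        return True                                     # answer field is ignored
    return True                                         # PRESS_* actions: type match suffices
```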
Models | Method | Supports | Average | Gmail | Booking | Music | SHEIN | NBC | CityMapper | ToDo | Signal | Yelp |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SPHINX-GUI Agent[23] | AMEX | 0-shot | 67.2 | 45.9 | 64.5 | 74.4 | 71.8 | 70.3 | 67.4 | 79.3 | 64.9 | 66.3 |
gemini-1.5-pro | Baseline | 0-shot | 19.3 | 20.1 | 16.4 | 24.5 | 10.2 | 35.6 | 14.1 | 17.4 | 27.9 | 15.2 |
gemini-1.5-pro | LearnAct | 1-shot | 51.7 [+32.4] | 55.5 | 47.1 | 60.0 | 35.7 | 56.4 | 54.7 | 60.6 | 63.1 | 54.6 |
gemini-1.5-pro | LearnAct | 2-shot | 55.6 [+36.3] | 57.5 | 53.2 | 55.3 | 39.6 | 56.1 | 58.2 | 68.1 | 69.7 | 60.0 |
gemini-1.5-pro | LearnAct | 3-shot | 57.7 [+38.4] | 58.4 | 56.6 | 54.6 | 43.9 | 53.9 | 69.4 | 69.2 | 70.5 | 57.6 |
UI-TARS-7B-SFT | Baseline | 0-shot | 77.5 | 68.1 | 81.0 | 81.1 | 72.9 | 80.9 | 70.6 | 66.0 | 92.6 | 82.4 |
UI-TARS-7B-SFT | LearnAct | 1-shot | 82.8 [+5.3] | 79.9 | 82.9 | 86.6 | 75.7 | 86.3 | 79.4 | 84.0 | 89.3 | 83.0 |
UI-TARS-7B-SFT | LearnAct | 2-shot | 81.9 [+4.4] | 80.1 | 80.7 | 86.2 | 76.1 | 87.2 | 80.0 | 83.7 | 84.4 | 84.2 |
UI-TARS-7B-SFT | LearnAct | 3-shot | 82.1 [+4.6] | 79.9 | 80.9 | 86.2 | 75.7 | 86.9 | 81.2 | 85.8 | 84.4 | 84.2 |
Qwen2-VL-7B | Baseline | 0-shot | 71.8 | 60.8 | 73.9 | 76.0 | 65.5 | 75.5 | 62.9 | 78.7 | 82.8 | 69.1 |
Qwen2-VL-7B | LearnAct | 1-shot | 77.3 [+5.5] | 75.0 | 77.5 | 77.8 | 69.8 | 83.5 | 72.9 | 78.0 | 83.6 | 78.8 |
Qwen2-VL-7B | LearnAct | 2-shot | 78.5 [+6.7] | 75.0 | 78.0 | 77.8 | 73.3 | 86.0 | 73.5 | 81.9 | 87.7 | 77.6 |
Qwen2-VL-7B | LearnAct | 3-shot | 79.4 [+7.6] | 75.0 | 78.8 | 78.6 | 72.6 | 87.8 | 77.1 | 82.6 | 87.7 | 80.6 |
Table 4 presents the performance comparison of different models on the LearnGUI-Offline dataset. The results demonstrate the substantial improvements achieved by the LearnAct framework across all tested models. Gemini-1.5-Pro [25] shows the most dramatic improvement, with performance increasing from 19.3% to 51.7% (+32.4%) with just a single demonstration, and further improving to 57.7% (+38.4%) with three demonstrations. This represents a 198.9% relative improvement, highlighting the powerful potential of demonstration-based learning even for advanced foundation models. UI-TARS-7B-SFT [27], despite already having strong zero-shot performance (77.5%), still achieves significant gains with LearnAct, reaching 82.8% (+5.3%) with a single demonstration. This indicates that even models specifically fine-tuned for GUI tasks can benefit from demonstration knowledge. Qwen2-VL-7B [26] demonstrates consistent improvement from 71.8% to 77.3% (+5.5%) with one demonstration, and to 79.4% (+7.6%) with three demonstrations, confirming that the benefits of LearnAct generalize across models with different architectures and capabilities.
The results also reveal interesting patterns regarding the impact of demonstration quantity. For Gemini-1.5-Pro [25], performance scales monotonically with the number of demonstrations, suggesting that less specialized foundation models can benefit substantially from additional examples. In contrast, UI-TARS-7B-SFT [27] achieves its peak performance with just one demonstration, indicating that models already fine-tuned for GUI tasks may efficiently extract necessary information from minimal demonstrations.
Application-specific results highlight LearnAct’s consistent improvement across diverse scenarios, with particularly notable gains in complex applications like CityMapper (from 14.1% to 69.4% for Gemini-1.5-Pro [25]) and To-Do apps (from 17.4% to 69.2%). This suggests that demonstration-based learning is especially valuable for navigating applications with complex interactions and non-standard interfaces.
Models | Supports | UISHActSH (type) | UISHActSH (match) | UISHActSL (type) | UISHActSL (match) | UISLActSH (type) | UISLActSH (match) | UISLActSL (type) | UISLActSL (match) |
---|---|---|---|---|---|---|---|---|---|
gemini-1.5-pro | 1-shot | 79.5 [+12.8] | 50.2 [+35.6] | 78.1 [+12.3] | 47.8 [+33.2] | 77.5 [+9.2] | 52.3 [+30.5] | 77.9 [+14.1] | 44.2 [+29.3] |
gemini-1.5-pro | 2-shot | 77.7 [+13.0] | 53.9 [+37.3] | 73.2 [+10.8] | 49.9 [+34.7] | 80.0 [+9.0] | 56.5 [+34.8] | 77.2 [+12.9] | 48.9 [+34.4] |
gemini-1.5-pro | 3-shot | 72.3 [+15.8] | 53.5 [+39.6] | 72.8 [+12.9] | 49.5 [+34.6] | 78.7 [+10.4] | 60.0 [+38.4] | 79.2 [+12.8] | 51.6 [+36.3] |
Qwen2-VL-7B | 1-shot | 86.0 [+5.3] | 72.2 [+6.3] | 85.4 [+4.9] | 69.6 [+5.5] | 86.0 [+2.0] | 76.2 [+5.4] | 82.9 [+1.3] | 69.4 [+4.3] |
Qwen2-VL-7B | 2-shot | 85.0 [+67.4] | 75.6 [+9.3] | 84.0 [+67.2] | 71.2 [+5.7] | 86.9 [+73.3] | 76.8 [+6.3] | 84.0 [+68.5] | 70.5 [+5.5] |
Qwen2-VL-7B | 3-shot | 80.2 [+5.0] | 70.3 [+7.9] | 82.9 [+4.7] | 70.2 [+5.7] | 85.6 [+1.9] | 77.5 [+8.4] | 85.6 [+3.4] | 72.8 [+6.6] |
UI-TARS-7B-SFT | 1-shot | 88.1 [+1.9] | 77.8 [+6.6] | 87.2 [+2.1] | 75.3 [+6.4] | 87.7 [+0.3] | 80.1 [+5.9] | 85.0 [-0.2] | 75.0 [+2.8] |
UI-TARS-7B-SFT | 2-shot | 85.5 [+2.1] | 76.7 [+8.3] | 85.7 [+1.6] | 75.9 [+4.9] | 87.3 [-0.4] | 79.1 [+5.9] | 84.9 [-0.8] | 74.1 [+2.1] |
UI-TARS-7B-SFT | 3-shot | 87.1 [+7.9] | 78.2 [+13.9] | 85.5 [+2.6] | 75.4 [+4.9] | 86.0 [-0.9] | 78.9 [+6.8] | 85.5 [-0.9] | 75.2 [+2.7] |
To further understand the factors influencing LearnAct’s effectiveness, we analyzed performance across different similarity profiles, as shown in Table 5. Several important insights emerge: Gemini-1.5-Pro [25] shows substantial improvements across all similarity combinations, with the largest gains in action match accuracy (ranging from +29.3% to +39.6%). This indicates that demonstration knowledge significantly enhances the model’s ability to execute precise actions regardless of similarity conditions. UI-TARS-7B-SFT [27] exhibits the most pronounced improvements in UISHActSH scenarios (+13.9% with 3-shot), suggesting that the model can extract maximum value from demonstrations when both UI and action patterns are similar to the target task. Qwen2-VL-7B [26] shows notably large improvements in action type accuracy for 2-shot settings (e.g., +67.4% for UISHActSH), potentially indicating a threshold effect where multiple demonstrations trigger significant pattern recognition improvements.
Interestingly, while UI similarity generally correlates with higher performance gains, we observe that action similarity also plays a crucial role. For instance, Gemini-1.5-Pro [25] achieves its highest match accuracy in UISLActSH scenarios (+38.4% with 3-shot), suggesting that action similarity can sometimes compensate for UI differences. This finding highlights the importance of considering both UI and action similarity when designing demonstration-based learning approaches for mobile GUI agents.
These results validate our hypothesized framework design, demonstrating that LearnAct successfully leverages demonstration similarity to enhance performance across varying conditions, with the most substantial benefits observed when demonstrations can provide both perceptual and procedural knowledge relevant to the target task.
Input | Models | # Params | LearnGUI-Online SR |
---|---|---|---|
Image + AXTree | GPT-4o[47] | - | 34.5 |
Image + AXTree | Gemini-Pro-1.5[25] | - | 22.8 |
Image | Claude Computer-Use[48] | - | 27.9 |
Image | Aguvis[21] | 72B | 26.1 |
Image | Qwen2-VL-7B \(+\) 0-shot | 7B | 9.9 |
Image | Qwen2-VL-7B \(+\) LearnAct | 7B | 21.1 [+11.2] |
Image | UI-TARS-7B-SFT \(+\) 0-shot | 7B | 18.1 |
Image | UI-TARS-7B-SFT \(+\) LearnAct | 7B | 32.8 [+14.7] |
While offline evaluations provide valuable insights into step-by-step execution capabilities, real-world deployment requires successful end-to-end task completion. Table 6 presents the results of our online evaluation on the LearnGUI-Online benchmark, which reveals several important findings. The LearnAct framework substantially improves performance for both evaluated models, with Qwen2-VL-7B [26] improving from 9.9% to 21.1% (+11.2%) and UI-TARS-7B-SFT [27] from 18.1% to 32.8% (+14.7%). These significant gains demonstrate that the benefits of demonstration-based learning translate effectively to real-world interactive scenarios. Qwen2-VL-7B [26] with LearnAct achieves a 21.1% success rate, showing meaningful improvements over its baseline performance. This suggests that high-quality, relevant demonstrations are highly effective for enhancing model capabilities. UI-TARS-7B-SFT [27] with LearnAct achieves a 32.8% success rate, approaching the performance of GPT-4o (34.5%) despite using a much smaller model. This indicates that demonstration-based learning can help bridge the gap between smaller specialized models and large foundation models. Detailed visualizations of these performance comparisons are provided in Appendix 10.1. To provide concrete examples of how LearnAct performs in real-world scenarios, we present three detailed case studies in Appendix 10.2.
The most striking finding is the effectiveness of our demonstration-based learning approach. The LearnAct framework provides significant performance improvements through its demonstration mechanism, with gains of up to 14.7% in task success rate. This demonstrates the power of high-quality demonstrations for enhancing model performance, highlighting the importance of relevant examples over simply increasing model size.
These results confirm that the LearnAct framework provides a practical pathway to developing effective mobile GUI agents, making it particularly valuable for application-specific customization and personalization scenarios.
Ablation Setting | Average | Gmail | Booking | Music | SHEIN | NBC | CityMapper | ToDo | Signal | Yelp |
---|---|---|---|---|---|---|---|---|---|---|
Baseline (no demonstrations) | 19.3 | 20.1 | 16.4 | 24.5 | 10.2 | 35.6 | 14.1 | 17.4 | 27.9 | 15.2 |
KnowSeeker only (w/o DemoParser) | 40.6 | 47.7 | 31.3 | 55.4 | 29.1 | 47.0 | 43.0 | 58.2 | 48.8 | 50.7 |
DemoParser only (w/o KnowSeeker) | 41.6 | 46.9 | 34.1 | 52.7 | 27.9 | 51.9 | 45.3 | 51.4 | 61.1 | 51.8 |
Full LearnAct (DemoParser + KnowSeeker) | 51.7 | 55.5 | 47.1 | 60.0 | 35.7 | 56.4 | 54.7 | 60.6 | 63.1 | 54.6 |
To understand the contribution of each component in the LearnAct framework, we conducted ablation experiments on the LearnGUI-Offline dataset using Gemini-1.5-Pro [25]. As shown in Table 7, we systematically evaluated the impact of removing either the DemoParser or KnowSeeker component while keeping all other settings constant.
The results reveal several important insights. Both components are essential, as removing either component leads to substantial performance degradation compared to the full LearnAct framework. The complete framework achieves 51.7% accuracy, while removing DemoParser reduces performance to 40.6% (-11.1%) and removing KnowSeeker reduces it to 41.6% (-10.1%). Regarding DemoParser’s contribution, comparing "KnowSeeker only" (40.6%) to the baseline (19.3%), we observe that even without action descriptions, relevant demonstrations improve performance by 21.3%. However, the addition of DemoParser’s action descriptions further enhances performance by 11.1%, confirming the value of structured knowledge extraction. For KnowSeeker’s contribution, the "DemoParser only" configuration (41.6%) also substantially outperforms the baseline, indicating that detailed action descriptions are valuable even with randomly selected demonstrations. However, KnowSeeker’s retrieval of relevant demonstrations provides an additional 10.1% improvement, highlighting the importance of demonstration relevance.
The performance variations across applications are particularly informative. For instance, in the Signal application, DemoParser appears more important (61.1% vs. 48.8% for KnowSeeker only), suggesting that detailed action descriptions are crucial for applications with complex interaction patterns. Conversely, for the ToDo application, KnowSeeker seems more valuable (58.2% vs. 51.4% for DemoParser only), indicating that demonstration relevance may be more critical for applications with varied task types.
These findings validate our multi-agent framework design, confirming that both knowledge extraction (DemoParser) and relevant demonstration retrieval (KnowSeeker) play complementary and essential roles in enabling effective demonstration-based learning for mobile GUI agents.
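To make the role of demonstration relevance concrete, the sketch below shows one way instruction-level retrieval of this kind could be implemented: demonstration instructions are embedded with an off-the-shelf sentence encoder and ranked by cosine similarity to the query instruction. The encoder choice (`all-MiniLM-L6-v2`), the `retrieve_demonstrations` helper, and the example instructions are illustrative assumptions, not KnowSeeker's actual implementation, which may also draw on the UI- and action-level similarities discussed above.

```python
# Illustrative sketch of similarity-based demonstration retrieval; the encoder,
# helper name, and example instructions are assumptions, not KnowSeeker itself.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf encoder

def retrieve_demonstrations(query_instruction, demo_instructions, k=1):
    """Return indices of the k demonstrations whose instructions are most
    similar to the query instruction (cosine similarity of unit embeddings)."""
    demo_emb = encoder.encode(demo_instructions, normalize_embeddings=True)
    query_emb = encoder.encode([query_instruction], normalize_embeddings=True)
    scores = demo_emb @ query_emb.T          # cosine similarity (vectors are normalized)
    return np.argsort(-scores.ravel())[:k]

# Hypothetical demonstration pool and query task.
demos = [
    "Delete the 9 am event titled 'Team sync' in Simple Calendar Pro",
    "Add a new expense of 12 dollars for lunch in the expense app",
]
print(retrieve_demonstrations("In Simple Calendar Pro, delete the event 'Standup'", demos))
```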
Our experimental results demonstrate that demonstration-based learning significantly enhances mobile GUI agents’ capabilities. The substantial performance improvements across all evaluated models validate our core hypothesis that demonstration-based learning effectively addresses generalization challenges. Even advanced foundation models like Gemini-1.5-Pro [25] show dramatic improvements (198.9% relative improvement). Our multi-dimensional similarity analysis reveals that both UI similarity and action similarity influence learning efficacy, with action similarity sometimes compensating for UI differences.
Data Collection and Dataset Expansion. While our approach shows promising results, several limitations and future directions warrant consideration. First, regarding data collection, the LearnGUI dataset, comprising 2,252 offline tasks and 101 online tasks, represents a significant step forward but remains limited in scale and diversity compared to the vast range of mobile applications and user interactions. Future work should expand the dataset to include a broader range of applications, particularly those with complex interaction patterns and specialized domains.
K-shot Learning Analysis. Second, our current investigation of k-shot learning is limited to k=1, 2, and 3 demonstrations. While these configurations provide valuable insights, a more comprehensive analysis of how demonstration quantity affects performance would be beneficial. Future research could explore the relationship between the number of demonstrations and performance gains, potentially identifying optimal demonstration counts for different scenarios and model architectures.
Enhanced Learning and Execution Strategies. Third, our learning and execution strategies could be enhanced to better leverage the relationship between support tasks and query tasks. While our current approach effectively retrieves relevant demonstrations, more sophisticated methods could be developed to extract and transfer knowledge more efficiently. For instance, techniques for abstracting common patterns across demonstrations, identifying critical decision points, and adapting demonstrated strategies to novel scenarios could further improve performance.
Agent Self-Learning. A promising direction for future research is to enable agents to learn from their own successful executions. Currently, our framework relies exclusively on human demonstrations, but agents could potentially learn from their own successful task completions. By incorporating these successful agent executions into the knowledge base, we could enable a form of "self-learning" where agents continuously improve their capabilities through their own experiences.
By addressing these limitations and pursuing these research directions, demonstration-based learning can evolve into a robust paradigm for developing adaptable, personalized, and practically deployable mobile GUI agents that effectively address the diverse needs of real-world users. The insights gained from our multi-dimensional similarity analysis provide valuable guidance for future research in this domain, suggesting that both UI similarity and action similarity play crucial roles in successful knowledge transfer.
This paper introduces a novel demonstration-based learning paradigm that fundamentally addresses the generalization challenges faced by mobile GUI agents. Rather than pursuing universal coverage through ever-larger datasets, our approach leverages human demonstrations to enhance agent performance in unseen scenarios. We developed LearnGUI, the first comprehensive dataset for studying demonstration-based learning in mobile GUI agents, comprising 2,252 offline tasks and 101 online tasks with high-quality human demonstrations. We further designed LearnAct, a sophisticated multi-agent framework with three specialized components: DemoParser for knowledge extraction, KnowSeeker for relevant knowledge retrieval, and ActExecutor for demonstration-enhanced task execution. Our experimental results demonstrate remarkable performance gains, with a single demonstration increasing Gemini-1.5-Pro [25]’s accuracy from 19.3% to 51.7% in offline tests and enhancing UI-TARS-7B-SFT [27]’s online task success rate from 18.1% to 32.8%. These findings establish demonstration-based learning as a promising direction for developing more adaptable, personalized, and practically deployable mobile GUI agents.
Figure 7 illustrates the distribution of similarity scores across different dimensions in the LearnGUI-Offline dataset, enabling systematic analysis of how different types of similarity between demonstration and query tasks affect learning efficacy.
Figure 7: Distribution of instruction, UI, and action similarity scores in LearnGUI-Offline. The histograms show the distribution of similarity scores across three dimensions: instruction similarity (top), UI similarity (middle), and action similarity (bottom). These distributions enable systematic analysis of how different types of similarity between demonstration and query tasks affect learning efficacy.
This section provides detailed descriptions of the components of our LearnAct framework, corresponding to the methods presented in Section 4 of the paper.
We provide the prompt templates used by DemoParser to generate semantically descriptive action descriptions from demonstration data. These carefully designed prompts guide the vision-language model to produce structured knowledge that captures the essence of human demonstrations, as shown in Figures 8 and 9.
Figure 8: Prompt template for intermediate action descriptions. The template guides DemoParser to generate standardized descriptions for intermediate actions, including detailed rules for memory annotations that capture important information observed during task execution.
Figure 9: Prompt templates for terminal action descriptions. The templates provide specific formats for both standard task completion and information retrieval tasks, ensuring consistent output structure across different task types.
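As a complement to the figures, the following is a minimal sketch of how a template of this kind might be applied programmatically to a demonstration trajectory. The prompt wording, the `DemoStep` structure, and the `vlm` callable are placeholders for illustration only; the actual templates are those shown in Figures 8 and 9.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DemoStep:
    screenshot_path: str   # screen observed at this step of the demonstration
    raw_action: str        # low-level recorded action, e.g. 'CLICK(x=540, y=1210)'

# Hypothetical VLM interface: takes a text prompt plus an image path, returns text.
VLM = Callable[[str, str], str]

INTERMEDIATE_TEMPLATE = (
    "You are annotating a human demonstration of the task: {instruction}\n"
    "The action taken on the attached screen was: {raw_action}\n"
    "Describe in one sentence what this action accomplishes, and add a memory "
    "note for any on-screen information worth remembering."
)  # placeholder wording, not the template in Figure 8

def parse_demonstration(instruction: str, steps: list[DemoStep], vlm: VLM) -> list[str]:
    """Turn a raw demonstration trajectory into semantic action descriptions."""
    knowledge = []
    for step in steps:
        prompt = INTERMEDIATE_TEMPLATE.format(
            instruction=instruction, raw_action=step.raw_action
        )
        knowledge.append(vlm(prompt, step.screenshot_path))
    return knowledge
```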
We provide the prompt templates used by ActExecutor to make decisions based on current observations, action history, and demonstration knowledge. These prompts guide the vision-language model to select appropriate actions for task execution, as shown in Figure 10.
Figure 10: Task execution prompt template. This comprehensive prompt directs ActExecutor to generate actions based on current observations, action history, and retrieved demonstrations, with explicit formatting requirements to ensure consistent action outputs.
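Similarly, the sketch below illustrates how a single ActExecutor decision step might assemble its inputs and constrain the output to one action. The prompt wording, the `FINISH` token, and the regular-expression guard are illustrative assumptions; the actual template is the one shown in Figure 10, and the action names follow the paper's action space.

```python
import re
from typing import Callable, Optional

# Hypothetical VLM interface: (prompt, screenshot_path) -> model output text.
VLM = Callable[[str, str], str]

EXECUTION_TEMPLATE = (
    "Task: {instruction}\n"
    "Relevant demonstration knowledge:\n{demo_knowledge}\n"
    "Actions taken so far:\n{history}\n"
    "Given the attached screenshot, output exactly one action in the form "
    "CLICK(x, y), TYPE(text), SWIPE(direction), PRESS_BACK, PRESS_HOME, "
    "PRESS_ENTER, or FINISH."
)  # placeholder wording, not the template in Figure 10

def execute_step(instruction: str, demo_knowledge: list[str], history: list[str],
                 screenshot_path: str, vlm: VLM) -> Optional[str]:
    """Build the execution prompt, query the model, and extract a single action."""
    prompt = EXECUTION_TEMPLATE.format(
        instruction=instruction,
        demo_knowledge="\n".join(demo_knowledge) or "(none)",
        history="\n".join(history) or "(none)",
    )
    output = vlm(prompt, screenshot_path)
    # Keep only the first span that looks like an action, as a simple guard
    # against extra commentary in the model output.
    match = re.search(
        r"(CLICK\(.*?\)|TYPE\(.*?\)|SWIPE\(.*?\)|PRESS_BACK|PRESS_HOME|PRESS_ENTER|FINISH)",
        output,
    )
    return match.group(1) if match else None
```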
We provide the detailed algorithms for the DemoParser and ActExecutor components of our LearnAct framework, which are the core computational processes enabling knowledge extraction and task execution.
Figure 11: DemoParser Knowledge Generation Process
Figure 12: ActExecutor Task Execution Process
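For readers who prefer code to flowcharts, the sketch below summarizes the two processes at a high level: a knowledge-generation loop over demonstration steps and an observe-decide-act execution loop. The `describe_step`, `retrieve`, and `decide` callables, the `env` interface, the `FINISH` token, and the step budget are placeholders under stated assumptions; the precise control flow is the one given in Figures 11 and 12.

```python
from typing import Callable

MAX_STEPS = 30  # illustrative step budget, not a value reported in the paper

def build_knowledge_base(demos: list[dict], describe_step: Callable) -> dict:
    """Sketch of Figure 11: convert each demonstration into semantic knowledge,
    indexed by its task instruction."""
    knowledge_base = {}
    for demo in demos:
        descriptions = [describe_step(demo["instruction"], step) for step in demo["steps"]]
        knowledge_base[demo["instruction"]] = descriptions
    return knowledge_base

def run_task(instruction: str, env, knowledge_base: dict,
             retrieve: Callable, decide: Callable) -> list[str]:
    """Sketch of Figure 12: retrieve relevant knowledge, then loop
    observe -> decide -> act until the agent finishes or the budget runs out."""
    demo_knowledge = retrieve(instruction, knowledge_base)
    history: list[str] = []
    observation = env.reset(instruction)
    for _ in range(MAX_STEPS):
        action = decide(instruction, demo_knowledge, history, observation)
        if action is None or action == "FINISH":
            break
        observation = env.step(action)
        history.append(action)
    return history
```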
This section provides additional experimental results and analyses that supplement the findings presented in Section 5 of the paper.
Figures 13 and 14 provide detailed comparisons of model performance with and without LearnAct enhancement in online evaluation scenarios.
Figure 13: Detailed performance comparison of Qwen2-VL-7B with and without LearnAct on LearnGUI-Online. The figure shows the task success rates of Qwen2-VL-7B baseline versus Qwen2-VL-7B enhanced with LearnAct across different task dimensions in the LearnGUI-Online benchmark.
Figure 14: Detailed performance comparison of UI-TARS-7B-SFT with and without LearnAct on LearnGUI-Online. The figure presents a comprehensive breakdown of task success rates for UI-TARS-7B-SFT baseline versus UI-TARS-7B-SFT enhanced with LearnAct across multiple task dimensions in the LearnGUI-Online benchmark.
We present three detailed case studies from our online experiments to provide concrete examples of how LearnAct leverages demonstration knowledge to solve tasks in unseen mobile applications. These case studies highlight different scenarios where demonstration knowledge proves particularly beneficial for task execution.
Figure 15: UI-TARS-7B-SFT with LearnAct vs. Baseline in NotesRecipeIngredientCount Task. Task template: "What quantity of {ingredient} do I need for the recipe ‘{title}’ in the Joplin app? Express your answer in the format <amount> <unit> without using abbreviations."
Figure 16: UI-TARS-7B-SFT with LearnAct vs. Baseline in SimpleCalendarDeleteOneEvent Task. Task template: "In Simple Calendar Pro, delete the calendar event on {year}-{month}-{day} at {hour}h with the title ‘{event_title}’"
Figure 17: Qwen2-VL-7B with LearnAct vs. Baseline in ExpenseDeleteMultiple Task. Task template: "Delete the following expenses from arduia pro expense: {expenses}."
\(^\dagger\) Equal contribution. \(^\ddag\) Project lead, corresponding author.