Ontology-Aware RAG for Improved Question-Answering in
Cybersecurity Education

Chengshuai Zhao1 Garima Agrawal1, Tharindu Kumarage1, Zhen Tan1, Yuli Deng1,
Ying-Chih Chen2, Huan Liu1


Abstract

Integrating AI into education has the potential to transform the teaching of science and technology courses, particularly in the field of cybersecurity. AI-driven question-answering (QA) systems can actively manage uncertainty in cybersecurity problem-solving, offering interactive, inquiry-based learning experiences. Large language models (LLMs) have gained prominence in AI-driven QA systems, offering advanced language understanding and user engagement. However, they face challenges like hallucinations and limited domain-specific knowledge, which reduce their reliability in educational settings. To address these challenges, we propose CyberRAG, an ontology-aware retrieval-augmented generation (RAG) approach for developing a reliable and safe QA system in cybersecurity education. CyberRAG employs a two-step approach: first, it augments the domain-specific knowledge by retrieving validated cybersecurity documents from a knowledge base to enhance the relevance and accuracy of the response. Second, it mitigates hallucinations and misuse by integrating a knowledge graph ontology to validate the final answer. Experiments on publicly available cybersecurity datasets show that CyberRAG delivers accurate, reliable responses aligned with domain knowledge, demonstrating the potential of AI tools to enhance education.

1 Introduction

The use of AI in education has the potential to transform the way science and technology courses are taught. In scientific learning, students are expected to engage in problem-solving and exploration, yet traditional classroom methods often focus on the passive acquisition of established knowledge. This approach limits opportunities for students to experience the process of knowledge creation, leading to lower cognitive engagement. Cybersecurity is a problem-based learning domain where students must master complex tools, develop defense techniques, and uncover new threats, which necessitates a re-imagination of traditional education practices  [1].

Prior research highlights that managing uncertainty is a crucial component of the learning process, as students often struggle with acquiring new skills, applying diverse methodologies, and forming new understandings [2]. Educators can effectively manage this uncertainty by increasing it through the introduction of authentic, ambiguous challenges to stimulate critical thinking, maintaining it to encourage deeper exploration and problem-solving, and reducing it by identifying optimal solutions to help students integrate new insights with existing knowledge [3]. AI-driven question-answering (QA) systems can help manage this uncertainty in technical problem-solving by supporting self-paced learning, significantly enhancing cognitive engagement  [4].

In recent years, large language models (LLMs) have become central to AI-driven technologies. While LLM-powered QA systems hold great promise for enhancing learning, they also face challenges such as hallucination and limited domain knowledge, which can undermine their effectiveness [5]–[8]. In cybersecurity education, where precision is critical, ensuring the accuracy of AI-generated content is essential. For example, in tasks such as identifying vulnerabilities or interpreting security policies, inaccurate AI responses could lead to misinformation, compromising the learning experience and, potentially, real-world security [9]. A promising solution to address this challenge is the retrieval-augmented generation (RAG) approach [10], where the model generates responses by retrieving information from a validated knowledge base, thereby enhancing the accuracy and reliability.

Although the RAG approach helps reduce hallucinations and address domain knowledge issues to some extent, the reliability of LLM-generated answers remains a concern for achieving educational goals. Students may ask questions that fall outside the scope of the augmented cybersecurity knowledge base. In such cases, LLMs rely on their own parametric knowledge to generate responses, which can expose the QA system to risks of misinformation or misuse [9], [11], [12]. In an educational setting, it is also crucial to prevent students from manipulating the AI system for unintended purposes. There is a strong need to provide a validation system to ensure the accuracy and safety of LLM-generated responses. One potential solution is reinforcement learning from human feedback (RLHF) [13]. However, this method requires verification by cybersecurity experts, making it labor-intensive, costly, and time-consuming. Preferably, an automatic validation approach is needed. Domain-specific knowledge graphs, which structure expert knowledge and capture the interactions between key entities in alignment with domain rules [14], offer a promising direction. The knowledge graph ontology encodes these rules [15]. By leveraging this ontology, LLM responses can be validated by fact-checking against the predefined rules, ensuring greater accuracy and reliability without the need for constant human oversight.

In this paper, we propose CyberRAG, an ontology-aware RAG approach for developing a reliable QA system in cybersecurity education, comprising two key components. First, we utilize RAG methods to retrieve validated cybersecurity documents from a knowledge base, enhancing both the accuracy and relevance of the answers. Through finely crafted prompts, CyberRAG improves QA performance by leveraging both cybersecurity content and the natural language processing capabilities of LLMs. Additionally, we introduce an ontology-based validation approach that uses a cybersecurity knowledge graph ontology to automatically verify LLM-generated responses, thereby preventing potential risks of misuse and hallucination. We conduct comprehensive experiments on publicly available datasets to demonstrate the effectiveness of CyberRAG in delivering reliable and accurate answers. Our research explores the potential of integrating AI into education, emphasizing its transformative impact on traditional methods. This approach extends beyond cybersecurity and can also be applied to other educational subjects. Our contributions can be summarized as follows:

  • We propose CyberRAG, a novel and reliable computational framework aimed at creating an interactive and secure environment for cybersecurity education.

  • We design finely crafted prompts that enable CyberRAG to produce accurate and relevant answers by integrating a cybersecurity knowledge base with the vast prior knowledge present in LLMs.

  • We explore the use of knowledge graph ontology to validate LLM-generated responses, ensuring accuracy and preventing potential misuse of QA systems in education.

2 Related Work

2.1 RAG in Education

Generative models have the potential to transform traditional education by enabling personalized learning and automating content creation, making education more adaptive and accessible [16]. The integration of LLMs in education has further enhanced conversational capabilities, supporting self-paced learning through AI-driven question-answering tools and bots [17], [18]. However, the use of LLMs also presents challenges, particularly the risks of hallucinations and inaccurate responses, which are critical concerns in educational contexts and must be carefully managed to ensure responsible implementation [19], [20]. To address these issues, retrieval-augmented generation (RAG) methods combine LLMs with real-time knowledge base retrieval to improve response accuracy, ensuring that content is current and reliable in educational settings [21]–[23].

2.2 LLM-generated Answer Validation

However, most LLM approaches are currently limited to handling simple queries, as generating complex answers from extensive knowledge articles and course materials sourced from diverse knowledge bases remains challenging, often leading to issues with the correctness of the answers [24]. Jeong et al.  [25] proposed an adaptive QA approach to handle complex questions, while Li et al.  [26] developed an evaluation framework for grounded and modular assessment of RAG responses.

Various domains are using knowledge graphs (KGs) to enhance LLMs’ reasoning and retrieval capabilities [27]–[29]. An ontology within a KG, which defines the rules of a specific domain, can also be used to validate LLM responses by checking the correctness of relationships between key entities [5]. This capability is particularly valuable in educational settings, where ensuring the accuracy of LLM-generated content is crucial. While some student questions may extend beyond the scope of the augmented course material or knowledge base, ontology-aware validation offers a promising solution. However, research in this area remains limited. Our ontology-aware methods mark a significant advancement in this field.

2.3 AI for Cybersecurity Education

Learning cybersecurity is crucial for national security, safeguarding critical infrastructure, and ensuring defense [30]–[32]. However, mastering this complex field requires deep knowledge of concepts, tools, and attack-defense simulations [33]–[35]. While AI shows promising potential in enhancing cognitive learning, research on its application in cybersecurity education remains limited [36]–[39]. Previous efforts include structured flow graphs from CTF texts for vulnerability analysis [40] and knowledge graphs for guiding student projects [41]–[43]. Additionally, a semi-automated approach has been used to create knowledge graphs from unstructured course material, improving student learning [44]. The AISecKG ontology [45] further advanced this line of work by generating a cybersecurity KG and the CyberQ dataset [46] for LLM-based question-answering.

However, a significant gap remains: the lack of a self-paced learning cybersecurity QA system. To address this, we leverage the CyberQ dataset to retrieve structured, authenticated cybersecurity content using RAG and validate LLM outputs with the AISecKG ontology, particularly for queries beyond the scope of the CyberQ dataset. Our approach enhances the accuracy, reliability, and safety of AI-driven education tools, effectively addressing this critical research gap.

Figure 1: Overview of CyberRAG Framework.

3 Preliminaries

In this section, we will outline the fundamental concepts underlying our proposed CyberRAG framework.

3.1 Problem Formulation

We consider cybersecurity problem-based learning to be a question-answering (QA) problem. Specifically, students raise a series of questions \(Q = \{q_1, q_2,..., q_{|Q|}\}\), and the model is expected to output corresponding answers \(A = \{a_1, a_2,..., a_{|A|}\}\), where \(|Q| = |A|\) in our scenario.

3.2 RAG System

We employ a retrieval-augmented generation (RAG) system as a base framework to solve our problem because of its impressive capacity for hallucination mitigation. A classic RAG approach entails two parts:

  • A retriever \(\mathcal{R}\) that can effectively retrieve reference documents \(D = \{d_1, d_2,..., d_{|D|}\}\) from the knowledge base \(K\) (e.g., knowledge graph, QA database, and text materials) based on query-document relevance.

  • A generation model (e.g., large language model) \(\mathcal{G}\) that provides answers using the documents and human instructions \(I = \{i_1, i_2,..., i_{|I|}\}\).
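To make this two-part structure concrete, the sketch below shows a minimal, hypothetical Python interface for composing a retriever \(\mathcal{R}\) and a generator \(\mathcal{G}\) into a single answering function; the type aliases and function name are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List

# Minimal sketch of the two-part RAG pipeline: a retriever R and a generator G.
# Both are placeholders here; the concrete components used by CyberRAG are
# described in Sections 4.2 and 4.3.
Retriever = Callable[[str, int], List[str]]   # R: (query, k) -> top-k documents D
Generator = Callable[[List[str], str], str]   # G: (documents D, question q) -> answer a

def rag_answer(question: str, retrieve: Retriever, generate: Generator, k: int = 3) -> str:
    documents = retrieve(question, k)      # D = R(Q, K): fetch reference documents
    return generate(documents, question)   # A = G(D, Q, I): answer with LLM + instructions
```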

3.3 Knowledge Graph Ontology

In the context of our CyberRAG framework, we incorporate a knowledge graph ontology to validate the LLM-generated responses. A knowledge graph (KG) is a structured representation of knowledge in which entities (a.k.a. nodes) are connected by relationships (a.k.a. edges). Formally, we define a knowledge graph as a tuple \(G = (E, R)\), where \(E = \{e_1, e_2, ..., e_{|E|}\}\) represents the set of entities, and \(R = \{r_1, r_2, ..., r_{|R|}\}\) represents the set of relationships between these entities. Each relationship \(r \in R\) can be viewed as a directed edge connecting two entities \(e_i\) and \(e_j\) in the graph, thereby forming a triplet \((e_i, r, e_j)\).

To further formalize the structure and semantics of the domain knowledge represented in our system, we utilize an ontology \(O\). An ontology is a formal specification of a set of concepts within a domain and the relationships between those concepts. Formally, we define an ontology \(O\) as a tuple \(O = (C, R_C, H_C)\), where \(C = \{c_1, c_2, ..., c_{|C|}\}\) is a set of concepts, \(R_C = \{r_1^C, r_2^C, ..., r_{|R_C|}^C\}\) is a set of conceptual relationships, and \(H_C \subseteq C \times C\) is a hierarchical structure (e.g., a taxonomy) that organizes these concepts.
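As a toy illustration, these structures can be written directly as Python data; the specific entity types and edges below are illustrative only (they reuse examples from the AISecKG ontology introduced in Section 4.4) and do not represent the full schema.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

# Knowledge graph G = (E, R): entities connected by directed, labeled edges,
# stored as (e_i, r, e_j) triples. The instances are illustrative only.
kg_triples: Set[Triple] = {
    ("attacker", "can_exploit", "feature"),
    ("security team", "can_analyze", "feature"),
}

# Ontology O = (C, R_C, H_C): concepts, conceptual relationships, and a
# hierarchy H_C ⊆ C × C that organizes the concepts (e.g., a taxonomy).
concepts: Set[str] = {"attacker", "security team", "feature", "role", "concept"}
concept_relations: Set[str] = {"can_exploit", "can_analyze"}
hierarchy: Set[Tuple[str, str]] = {("attacker", "role"), ("feature", "concept")}
```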

4 Method

In this section, we first give an overview of CyberRAG. The detailed design of each component is then discussed in the subsequent subsections.

4.1 CyberRAG

Our proposed framework, named CyberRAG, comprises two key components: a retrieval-augmented generation (RAG) system and an ontology-based answer validation module. By leveraging the RAG system with carefully crafted prompts, CyberRAG effectively addresses students’ questions by utilizing cybersecurity knowledge materials and the natural language understanding capabilities of LLMs. The ontology-based validation approach ensures that the responses provided to students are both reliable and secure. An overview of CyberRAG is presented in Fig. 1.

4.2 Cybersecurity Knowledge Retrieval

To retrieve cybersecurity material from the knowledge base, the first step is to design an effective retriever. We utilize a dual-encoder dense retriever: a question encoder \(\mathcal{E}_Q\), which projects student queries into a shared semantic latent space, and a document encoder \(\mathcal{E}_D\), which embeds knowledge base documents into corresponding vector representations. These encoders are optimized to maintain semantic proximity between queries and relevant documents, ensuring that encoded representations are well-aligned in the latent space, thereby facilitating accurate retrieval. Note that the question encoder and the document encoder can be the same in some settings. The retriever then retrieves the top \(k\) documents according to query-document similarity.

Specifically, given a query \(q_i \in Q\) and a document \(d_j \in D\), they are encoded by query encoder \(\mathcal{E}_Q\) and document encoder \(\mathcal{E}_D\), respectively. \[h_{q_i} = \mathcal{E}_Q(q_i)\] \[h_{d_j} = \mathcal{E}_D(d_j)\] where \(h_{q_i}\) and \(h_{d_j}\) are latent representations of query \(q_i\) and document \(d_j\).

Then, the semantic relevance between the question and the document is defined by the cosine similarity between their latent vectors: \[sim(h_{q_i}, h_{d_j}) = \frac{h_{q_i} \cdot h_{d_j}}{\|h_{q_i}\| \|h_{d_j}\|}\] where \(h_{q_i} \cdot h_{d_j}\) is the inner product of the two vectors, and \(\|h_{q_i}\|\) and \(\|h_{d_j}\|\) are the magnitudes (i.e., Euclidean norms) of \(h_{q_i}\) and \(h_{d_j}\), respectively.

Based on the similarities, the top-\(k\) relevant course documents can be retrieved. \[\label{eq:retriever} D = \mathcal{R}(Q, K)\tag{1}\]
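For illustration, a minimal dense-retrieval sketch is given below. The sentence-transformers encoder checkpoint is an assumption for the example (the actual system uses Contriever; see Section 5.2), and a single encoder plays the role of both \(\mathcal{E}_Q\) and \(\mathcal{E}_D\), as permitted above.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# One encoder serves as both E_Q and E_D. The checkpoint name is an
# illustrative assumption, not the retriever used in the paper.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
    h_q = encoder.encode([query])[0]                 # h_q = E_Q(q)
    h_d = encoder.encode(documents)                  # h_d = E_D(d) for every document
    sims = h_d @ h_q / (np.linalg.norm(h_d, axis=1) * np.linalg.norm(h_q))  # cosine similarity
    top_k = np.argsort(-sims)[:k]                    # indices of the k most similar documents
    return [documents[i] for i in top_k]
```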

4.3 Cybersecurity Answer Generation

After retrieving relevant cybersecurity documents, the next step is to prompt the generative model (i.e., LLMs) to generate answers. The most challenging aspect is designing a prompt that maximizes the effectiveness of both the retrieved information and the generative model. An ideal prompt should guide the model to summarize the information when the retrieved documents are highly relevant and encourage the model to generate new responses by leveraging its own knowledge when the retrieved information is insufficient or incomplete.

We design the prompt as illustrated in the Answer Generation Prompt. The proposed prompt consists of three key elements:

  • Documents: The documents relevant to the student’s query are retrieved by the retriever from the knowledge base, using the method outlined in the previous section.

  • Questions: These are queries raised by students, which the generative model is tasked with answering.

  • Instructions: These are rules and prompts designed by domain experts. The first two instructions guide the generative model to answer students’ questions by referencing the retrieved documents. The third instruction encourages the model to generate responses based on its own knowledge when the documents are insufficient. Thus the proposed framework can address students’ queries, whether or not they are covered by the knowledge base.

The final prompt is given as input to the generative model to generate the answer to the student’s query: \[\label{eq:answer} A = \mathcal{G}(D, Q, I)\tag{2}\]

Answer Generation Prompt

DOCUMENT:
Factors such as potential impact, exploitability, and affected systems are considered when determining the severity level of a vulnerability. Factors such as the ease of exploitation, potential impact, and affected systems or data are typically considered when determining the severity level of a vulnerability.

The severity level of a vulnerability is typically determined by analyzing its potential consequences, ease of exploitation, and the systems it affects.

QUESTION:
What criteria are used to determine the severity level of a vulnerability?

INSTRUCTIONS:
1. Answer the user’s QUESTION using the DOCUMENT text above.
2. Keep your answer grounded in the facts of the DOCUMENT.
3. If the DOCUMENT doesn’t contain the facts to answer the QUESTION, give a response based on your knowledge.
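A minimal sketch of how such a prompt could be assembled in code is shown below; the function and constant names are hypothetical, but the instruction wording follows the prompt above. The resulting string is then passed to the generative model \(\mathcal{G}\) as in Eq. (2).

```python
ANSWER_INSTRUCTIONS = [
    "Answer the user's QUESTION using the DOCUMENT text above.",
    "Keep your answer grounded in the facts of the DOCUMENT.",
    "If the DOCUMENT doesn't contain the facts to answer the QUESTION, "
    "give a response based on your knowledge.",
]

def build_answer_prompt(documents: list[str], question: str) -> str:
    """Assemble the DOCUMENT / QUESTION / INSTRUCTIONS prompt used in Eq. (2)."""
    parts = ["DOCUMENT:"] + documents
    parts += ["", "QUESTION:", question, "", "INSTRUCTIONS:"]
    parts += [f"{i + 1}. {rule}" for i, rule in enumerate(ANSWER_INSTRUCTIONS)]
    return "\n".join(parts)
```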

4.4 Ontology-based Answer Validation

For ontology validation, we used AISecKG [47], a cybersecurity education ontology that defines relationships between concepts, applications, and roles within the cybersecurity domain. AISecKG organizes these into three broad categories with 12 entity types. Concepts include features, functions, data, attacks, vulnerabilities, and techniques, while applications cover tools, systems, and apps. Roles consist of users, attackers, and security teams. The ontology defines nine core relationships between these entities, represented by 68 unique edges. For example, tuples such as (‘attacker’, ‘can_exploit’, ‘feature’) and (‘security team’, ‘can_analyze’, ‘feature’) illustrate entities and these relationships. These triples represent the fundamental domain rules that govern cybersecurity information at a schema level. By leveraging these ontology-based triples along with the natural language understanding capabilities of LLMs, an automatic answer validation system can be developed for the cybersecurity domain.
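These schema-level rules can be stored as a small set of typed triples and rendered into the ONTOLOGY section of the validation prompt. In the sketch below, only the two edges quoted above are taken from AISecKG, and the helper function is a hypothetical illustration.

```python
# Schema-level ontology edges: (head entity type, relation, tail entity type).
# Only the two example edges from the text are listed; the full AISecKG
# ontology defines 68 such edges over 12 entity types.
ONTOLOGY_EDGES = [
    ("attacker", "can_exploit", "feature"),
    ("security team", "can_analyze", "feature"),
]

def ontology_as_prompt_text(edges) -> str:
    """Render ontology edges as the ONTOLOGY block of the validation prompt."""
    return ",\n".join(f"({h}, {r}, {t})" for h, r, t in edges)
```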

Specifically, given the question \(Q\) and the corresponding response \(A\) provided by the generative language model, the validation model \(\mathcal{V}\) takes the QA context, the ontology rules denoted by \(O = \{o_1, o_2,..., o_{|O|}\}\), and the validation instruction \(I'\) as input, and then produces the validation result \(R\).

\[\label{eq:validation} R = \mathcal{V}(Q, A, O, I')\tag{3}\] Note that \(R \in [0, 1]\), and a higher score indicates the response aligns with the ontology well. The complete validation prompt is shown in the Ontology Validation Prompt.

Ontology Validation Prompt

QUESTION:
What criteria are used to determine the severity level of a vulnerability?

ANSWER:
Factors like potential impact, exploitability, and affected systems are considered when determining the severity level of a vulnerability.

ONTOLOGY:
(attacker, can_harm, system),
(system, can_expose, vulnerability),
(function, can_analyze, vulnerability) ...

INSTRUCTIONS:
Please judge if the QUESTION and ANSWER align well with the ONTOLOGY.
The QUESTION and ANSWER align well with the ONTOLOGY if they are in the same knowledge domain as the ONTOLOGY and the ANSWER follows the relationships defined in the ONTOLOGY.
The output format is a tuple: (your judgment: Pass/Not Pass, confidence score)

By setting an appropriate threshold \(\sigma\), the pipeline can effectively filter out incorrect answers and identify potential misuse behavior. The CyberRAG algorithm is outlined in Algorithm 2.
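As a hedged sketch of this filtering step, the code below assumes a callable `validator_llm` that returns text in the "(Pass/Not Pass, confidence score)" format requested by the prompt; the threshold value and the parsing logic are illustrative, not specified by the paper.

```python
import re

SIGMA = 0.5  # threshold sigma; the concrete value is not fixed by the paper

def validate(question: str, answer: str, ontology_text: str, validator_llm) -> float:
    """Return a validation score R in [0, 1] from the ontology validation model V."""
    prompt = (
        f"QUESTION: {question}\nANSWER: {answer}\nONTOLOGY:\n{ontology_text}\n"
        "Please judge if the QUESTION and ANSWER align well with the ONTOLOGY.\n"
        "The output format is a tuple: (your judgment: Pass/Not Pass, confidence score)"
    )
    reply = validator_llm(prompt)                                   # e.g. "(Pass, 0.92)"
    match = re.search(r"\(\s*(Pass|Not Pass)\s*,\s*([01](?:\.\d+)?)\s*\)", reply)
    if match is None or match.group(1) == "Not Pass":
        return 0.0                                                  # unparsable or failed -> reject
    return float(match.group(2))

def filter_answer(question, answer, ontology_text, validator_llm):
    """Release the answer only if its validation score clears the threshold sigma."""
    score = validate(question, answer, ontology_text, validator_llm)
    return answer if score >= SIGMA else None
```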

Figure 2: CyberRAG Algorithm

5 Experiment and Discussion

5.1 Dataset

In this work, we use CyberQ  [46], an open-source cybersecurity dataset containing around 4,000 open-ended questions and answers on topics such as cybersecurity concepts, tool usage, setup instructions, attack analysis, and defense techniques. The questions vary in complexity, answer length, and vocabulary. The dataset was developed using facts from AISecKG [47], a cybersecurity knowledge graph, and a three-step LLM prompting method. The Zero-shot (ZS) method generated 1,027 QA pairs for simple WH-questions on cybersecurity entities. The Few-shot (FS) method produced 332 medium-complexity QA pairs related to setup and tools. The Ontology-Driven approach generated 2,171 high-complexity QA pairs covering attack and defense scenarios. The dataset includes 30 very short, 1,061 short, and 2,439 long questions. These cybersecurity questions and answers are comprehensive, challenging, and not straightforward, making them ideal for our goal of developing an interactive QA system to teach cybersecurity to students.

Table 1: Performance across different datasets and scenarios.
Dataset Scenario BERTScore \(\uparrow\) METEOR \(\uparrow\) ROUGE-1 \(\uparrow\) ROUGE-2 \(\uparrow\)
ZS Zero Shot 0.8571 \(\pm\) 2.0e-4 0.2645 \(\pm\) 1.4e-3 0.1326 \(\pm\) 7.0e-4 0.0626 \(\pm\) 7.0e-4
ZS Out of KB 0.8720 \(\pm\) 1.0e-4 0.3489 \(\pm\) 1.3e-3 0.2174 \(\pm\) 6.0e-4 0.1015 \(\pm\) 7.0e-4
ZS In KB 0.9294 \(\pm\) 1.0e-4 0.7861 \(\pm\) 5.0e-4 0.6490 \(\pm\) 6.0e-4 0.5977 \(\pm\) 6.0e-4
FS Zero Shot 0.8654 \(\pm\) 1.0e-4 0.3055 \(\pm\) 1.0e-3 0.2164 \(\pm\) 1.3e-3 0.0954 \(\pm\) 1.1e-3
FS Out of KB 0.8758 \(\pm\) 2.0e-4 0.3746 \(\pm\) 9.0e-4 0.3130 \(\pm\) 1.2e-3 0.1484 \(\pm\) 8.0e-4
FS In KB 0.9461 \(\pm\) 1.0e-4 0.8588 \(\pm\) 7.0e-4 0.7882 \(\pm\) 5.0e-4 0.7195 \(\pm\) 8.0e-4
OD Zero Shot 0.8601 \(\pm\) 1.0e-4 0.2866 \(\pm\) 8.0e-4 0.1524 \(\pm\) 5.0e-4 0.0703 \(\pm\) 5.0e-4
OD Out of KB 0.8783 \(\pm\) 1.0e-4 0.3912 \(\pm\) 6.0e-4 0.2587 \(\pm\) 5.0e-4 0.1234 \(\pm\) 3.0e-4
OD In KB 0.9331 \(\pm\) 1.0e-4 0.7861 \(\pm\) 4.0e-4 0.6407 \(\pm\) 5.0e-4 0.5931 \(\pm\) 5.0e-4

5.2 Experiment and Parameter Settings

The knowledge base for our experiments is the CyberQ dataset [46], and the cybersecurity knowledge graph ontology utilized in this work is AISecKG [47]. In future work, the knowledge base can be expanded to include other PDF documents containing cybersecurity-related coursework material. For brevity, we will refer to the CyberQ dataset as our KB and AISecKG as our ontology. We conduct the main experiments in two scenarios: within the knowledge base (In-KB) and outside the knowledge base (Out-of-KB). For the In-KB scenario, students’ queries can be answered using existing documents in the knowledge base, while the Out-of-KB scenario involves queries for which relevant documents are not present in the knowledge base.

We consider BERTScore [48], METEOR [49], ROUGE-1 [50], and ROUGE-2 [50] as metrics to evaluate the performance of the generated response. It is important to note that the ground truth answers in the CyberQ dataset are annotated by human cybersecurity experts. Unless otherwise specified, each experiment is run 10 times, with average scores and standard deviations recorded.
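These metrics can be computed, for example, with the Hugging Face `evaluate` library; this tooling choice is our assumption, and the sample prediction/reference texts below are illustrative.

```python
import evaluate

# Score a generated answer against an expert-annotated reference.
bertscore = evaluate.load("bertscore")
meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")

predictions = ["Severity is judged by potential impact, exploitability, and affected systems."]
references = ["Factors such as potential impact, exploitability, and affected systems are "
              "considered when determining the severity level of a vulnerability."]

bs = bertscore.compute(predictions=predictions, references=references, lang="en")
mt = meteor.compute(predictions=predictions, references=references)
rg = rouge.compute(predictions=predictions, references=references)
print(bs["f1"][0], mt["meteor"], rg["rouge1"], rg["rouge2"])
```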

We use the Contriever [51] as the retriever for document retrieval, employing cosine similarity as the semantic distance metric to identify and retrieve the most relevant documents from the knowledge base. For the generative language model, we utilize LLaMA3-8B-Instruct [52], with an input text length of 256 and max-length padding. During the answer generation process, the number of beams for beam search is set to 4. The ontology-based validation model is formulated using another instance of LLaMA3-8B-Instruct [52], with an input text length of 512 and max-length padding. For all other parameters not explicitly mentioned, default settings are applied.
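A hedged sketch of this setup is shown below. The Hugging Face checkpoint names are assumptions about where Contriever and LLaMA3-8B-Instruct are hosted (the latter is gated), and the output length is illustrative; only the input length of 256, max-length padding, and 4 beams come from the settings above.

```python
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

retriever_tok = AutoTokenizer.from_pretrained("facebook/contriever")
retriever = AutoModel.from_pretrained("facebook/contriever")

def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pooled Contriever embeddings used for cosine-similarity retrieval."""
    batch = retriever_tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = retriever(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

gen_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
gen_tok.pad_token = gen_tok.eos_token          # Llama tokenizers ship without a pad token
generator = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def generate_answer(prompt: str) -> str:
    inputs = gen_tok(prompt, padding="max_length", truncation=True,
                     max_length=256, return_tensors="pt")   # input length 256, max-length padding
    output = generator.generate(**inputs, num_beams=4,      # beam search with 4 beams
                                max_new_tokens=256)          # output length is an assumption
    return gen_tok.decode(output[0], skip_special_tokens=True)
```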

5.3 Research Questions

To assess our approach, we structure our evaluation around the following research questions (RQs):

  • RQ1: How does the integration of a knowledge base enhance the performance of the CyberRAG framework in generating accurate and relevant answers for both In-KB and Out-of-KB queries?

  • RQ2: What is the impact of the knowledge base and the generative language model on the overall performance of CyberRAG, and how does their ablation affect the system’s ability to answer students’ questions?

  • RQ3: How does the integration of a knowledge base influence the quality and reliability of answers generated by the CyberRAG framework, and what are the key benefits observed from the retrieval process?

  • RQ4: How effective is the ontology-based validation model in ensuring the accuracy, relevance, and scope of answers generated by CyberRAG, and can it detect and prevent potential misuse?

5.4 Quantitative Results

To address RQ1, we design a comparative experiment. As shown in Table 1, a⃝ CyberRAG consistently achieves strong performance across all datasets and scenarios. For instance, In-KB question-answering on the FS dataset yields a BERTScore of 0.9461, a METEOR of 0.8588, a ROUGE-1 of 0.7882, and a ROUGE-2 of 0.7195. These results demonstrate that the answers generated by CyberRAG not only align well with the semantic meaning of the ground truth but also closely match it at the word level. b⃝ Compared with the Out-of-KB setting, the model shows relatively higher performance for In-KB questions. For example, in the ZS subset for In-KB questions, the model achieves a BERTScore of 0.9294, and for more complex questions in the OD subset, it reaches 0.9331. In contrast, for Out-of-KB questions, the BERTScore is 0.8720 in ZS and 0.8783 for the OD subset. This indicates that the integration of the knowledge base provides a solid reference for generating answers to open-ended questions in cybersecurity, enhancing the model’s performance across both simple and complex queries. c⃝ CyberRAG provides semantically meaningful and high-quality answers for questions outside the knowledge base. For example, in the OD dataset under the Out-of-KB setting, the BERTScore is 0.8783, while the METEOR, ROUGE-1, and ROUGE-2 are 0.3912, 0.2587, and 0.1234, respectively. This suggests that while the generated answers may not exactly match the ground truth in wording, they are highly similar to expert-certified answers in meaning. d⃝ Furthermore, the results show a very low standard deviation (std) in both the In-KB and Out-of-KB settings, with std values consistently below 0.0011. This demonstrates that CyberRAG is both stable and reliable in producing accurate answers. Overall, these findings suggest that our proposed method is promising for answering cybersecurity-related questions.

5.5 Ablation Study

Figure 3: Case Study. The answer validation case study (left) elaborates on how the validation model prevents misuse behaviors. The CyberRAG case study (right) showcases the data flow details.

We conducted an ablation study to investigate RQ2, considering two scenarios: (i) CyberRAG built solely on a question-answering dataset, such as CyberQ, without utilizing the generative language model. In this case, the QA system is unable to answer questions beyond the scope of the training dataset, which is undesirable because it limits the breadth of cybersecurity education. (ii) CyberRAG without the augmented knowledge base. In this case, the model essentially reverts to functioning as a generative language model, answering questions in a zero-shot manner. As shown in Table 1, the results under the zero-shot setting are considerably inferior. Compared to the In-KB scenario, the model under the zero-shot setting lacks access to ground-truth information for reference, leading to a significant drop in performance. When comparing the zero-shot results with the Out-of-KB scenario, the latter demonstrates better performance. This is because, although Out-of-KB questions cannot be answered directly using documents from the knowledge base, the retrieved documents still provide relevant, supplementary information that may help in formulating the answers. In conclusion, both the knowledge base and the generative language model play significant roles in the performance of QA systems for cybersecurity education.

5.6 Retrieval Analysis

To investigate RQ3, we conduct a retrieval analysis. First, we collect the documents retrieved from the knowledge base and then compare the response produced by CyberRAG with the ground truth to explore how CyberRAG benefits from the retrieval augmentation process. We consider Faithfulness, Answer Relevancy, Context Precision, Context Recall, and Context Entity Recall as metrics and use the RAGAS [53] package to evaluate our method.
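A hedged sketch of RAGAS-style scoring is given below; the API shown corresponds to ragas 0.1.x and changes across versions, RAGAS itself calls an LLM judge (so an API key must be configured), and the single example row is illustrative rather than taken from CyberQ.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

# Build a one-row evaluation dataset: question, generated answer,
# retrieved contexts, and the expert-annotated ground truth.
data = Dataset.from_dict({
    "question": ["What criteria are used to determine the severity level of a vulnerability?"],
    "answer": ["Severity depends on potential impact, exploitability, and affected systems."],
    "contexts": [["Factors such as potential impact, exploitability, and affected systems "
                  "are considered when determining the severity level of a vulnerability."]],
    "ground_truth": ["Impact, exploitability, and affected systems determine severity."],
})
scores = evaluate(data, metrics=[faithfulness, answer_relevancy,
                                 context_precision, context_recall])
print(scores)
```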

Table 2: RAGAS evaluation for CyberRAG.
Metric ZS FS OD
Faithfulness \(\uparrow\) \(0.8134\) \(0.8914\) \(0.7598\)
Answer Relevancy \(\uparrow\) \(0.9825\) \(0.9860\) \(0.9828\)
Context Precision \(\uparrow\) \(0.9890\) \(1.0000\) \(0.9960\)
Context Recall \(\uparrow\) \(0.9909\) \(0.9974\) \(0.9946\)
Context Entity Recall \(\uparrow\) \(0.9392\) \(0.9512\) \(0.9671\)

As shown in Table 2, our proposed method, CyberRAG, achieves satisfactory results on all metrics, especially Answer Relevancy, Context Precision, Context Recall, and Context Entity Recall. These results indicate that the retrieved documents are highly relevant and accurate for answer generation, and that the generated responses adhere to the reference documents with minimal hallucination. It is also worth mentioning that there is still some room for improvement on the Faithfulness metric. The lower score can be attributed to the generative model introducing additional knowledge of its own to enrich the answers, which is otherwise beneficial for Out-of-KB question-answering.

5.7 Answer Validation Analysis with Case Study

We designed an answer validation analysis to explore RQ4. We simulate a case study with an out-of-domain query such as “How to make money in the stock market?”, which can be considered an unintended or misuse case. As shown in Fig. 3, although the generative language model produces a corresponding answer, it is filtered out by the validation model because it fails to adhere to the cybersecurity knowledge graph ontology. Conversely, when a relevant cybersecurity question is posed, it smoothly passes the validation test, highlighting the critical role of the ontology-based validation model in ensuring accuracy and preventing potential misuse. Fig. 3 also illustrates the data flow in the case study, demonstrating the transparency and reliability of CyberRAG.

6 Conclusion

The rapid advancement of AI is transforming education, particularly in technical fields like cybersecurity. AI-driven QA systems enhance cognitive engagement by managing uncertainty in problem-based learning. Our proposed CyberRAG introduces a novel, ontology-aware retrieval-augmented generation approach to create a reliable QA system for cybersecurity education. By leveraging domain knowledge and an ontology-based validation model, CyberRAG ensures relevance, accuracy, and safety of responses. Comprehensive experiments demonstrate its dependability in real-world scenarios, fostering a more interactive and secure learning environment. This research highlights the potential of AI to transform educational practices, not only in cybersecurity but across various subjects. As AI in education evolves, future research will explore innovative methods like virtual environments to further enhance students’ practical experiences using LLM agents.

References

[1]
Shivapurkar, M.; Bhatia, S.; and Ahmed, I. 2020. Problem-based learning for cybersecurity education. In Journal of The Colloquium for Information Systems Security Education, volume 7, 6–6.
[2]
Jordan, M. E. 2015. Variation in students’ propensities for managing uncertainty. Learning and Individual differences, 38: 99–106.
[3]
Chen, Y.-C.; Benus, M. J.; and Hernandez, J. 2019. Managing uncertainty in scientific argumentation. Science Education, 103(5): 1235–1276.
[4]
Means, A. J. 2021. Hypermodernity, automated uncertainty, and education policy trajectories. Critical Studies in Education, 62(3): 371–386.
[5]
Agrawal, G.; Kumarage, T.; Alghamdi, Z.; and Liu, H. 2024. Can Knowledge Graphs Reduce Hallucinations in LLMs? : A Survey. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 3947–3960. Mexico City, Mexico: Association for Computational Linguistics.
[6]
Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; and Liu, T. 2023. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. CoRR, abs/2311.05232.
[7]
Zhang, Y.; Li, Y.; Cui, L.; Cai, D.; Liu, L.; Fu, T.; Huang, X.; Zhao, E.; Zhang, Y.; Chen, Y.; Wang, L.; Luu, A. T.; Bi, W.; Shi, F.; and Shi, S. 2023. Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models. CoRR, abs/2309.01219.
[8]
Xu, Z.; Jain, S.; and Kankanhalli, M. S. 2024. Hallucination is Inevitable: An Innate Limitation of Large Language Models. CoRR, abs/2401.11817.
[9]
Kumarage, T.; Agrawal, G.; Sheth, P.; Moraffah, R.; Chadha, A.; Garland, J.; and Liu, H. 2024. A survey of ai-generated text forensic systems: Detection, attribution, and characterization. arXiv preprint arXiv:2403.01152.
[10]
Lewis, P. S. H.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; Riedel, S.; and Kiela, D. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
[11]
Tan, Z.; Zhao, C.; Moraffah, R.; Li, Y.; Kong, Y.; Chen, T.; and Liu, H. 2024. The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative. CoRR, abs/2402.14859.
[12]
Tan, Z.; Zhao, C.; Moraffah, R.; Li, Y.; Wang, S.; Li, J.; Chen, T.; and Liu, H. 2024. "Glue pizza and eat rocks" - Exploiting Vulnerabilities in Retrieval-Augmented Generative Models. CoRR, abs/2406.19417.
[13]
Christiano, P. F.; Leike, J.; Brown, T. B.; Martic, M.; Legg, S.; and Amodei, D. 2017. Deep Reinforcement Learning from Human Preferences. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 4299–4307.
[14]
Abu-Salih, B. 2021. Domain-specific knowledge graphs: A survey. Journal of Network and Computer Applications, 185: 103076.
[15]
Kejriwal, M. 2019. Domain-specific knowledge graph construction. Springer.
[16]
George, A. S. 2023. The Potential of Generative AI to Reform Graduate Education. Partners Universal International Research Journal, 2(4): 36–50.
[17]
Moore, S.; Tong, R.; Singh, A.; Liu, Z.; Hu, X.; Lu, Y.; Liang, J.; Cao, C.; Khosravi, H.; Denny, P.; et al. 2023. Empowering education with llms-the next-gen interface and content generation. In International Conference on Artificial Intelligence in Education, 32–37. Springer.
[18]
Upadhyay, A.; Farahmand, E.; Muñoz, I.; Akber Khan, M.; and Witte, N. 2024. Influence of LLMs on Learning and Teaching in Higher Education. Available at SSRN 4716855.
[19]
Yan, L.; Sha, L.; Zhao, L.; Li, Y.; Martinez-Maldonado, R.; Chen, G.; Li, X.; Jin, Y.; and Gašević, D. 2024. Practical and ethical challenges of large language models in education: A systematic scoping review. British Journal of Educational Technology, 55(1): 90–112.
[20]
Li, Q.; Fu, L.; Zhang, W.; Chen, X.; Yu, J.; Xia, W.; Zhang, W.; Tang, R.; and Yu, Y. 2023. Adapting large language models for education: Foundational capabilities, potentials, and challenges. arXiv preprint arXiv:2401.08664.
[21]
Dakshit, S. 2024. Faculty Perspectives on the Potential of RAG in Computer Science Higher Education. arXiv preprint arXiv:2408.01462.
[22]
Liu, C.; Hoang, L.; Stolman, A.; and Wu, B. 2024. HiTA: A RAG-Based Educational Platform that Centers Educators in the Instructional Loop. In International Conference on Artificial Intelligence in Education, 405–412. Springer.
[23]
Modran, H.; Bogdan, I. C.; Ursuțiu, D.; Samoila, C.; and Modran, P. L. 2024. LLM Intelligent Agent Tutoring in Higher Education Courses using a RAG Approach.
[24]
Elmessiry, A.; and Elmessiry, M. 2024. Navigating the Evolution of Artificial Intelligence: Towards Education-Specific Retrieval Augmented Generative AI (ES-RAG-AI). In INTED2024 Proceedings, 7692–7697. IATED.
[25]
Jeong, S.; Baek, J.; Cho, S.; Hwang, S. J.; and Park, J. C. 2024. Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity. arXiv preprint arXiv:2403.14403.
[26]
Li, X.; Liu, M.; and Gao, S. 2024. GRAMMAR: Grounded and Modular Evaluation of Domain-Specific Retrieval-Augmented Language Models. arXiv preprint arXiv:2404.19232.
[27]
Hussien, M. M.; Melo, A. N.; Ballardini, A. L.; Maldonado, C. S.; Izquierdo, R.; and Sotelo, M. Á. 2024. RAG-based Explainable Prediction of Road Users Behaviors for Automated Driving using Knowledge Graphs and Large Language Models. arXiv preprint arXiv:2405.00449.
[28]
Agrawal, G.; Kumarage, T.; Alghamdi, Z.; and Liu, H. 2024. Mindful-RAG: A Study of Points of Failure in Retrieval Augmented Generation. arXiv preprint arXiv:2407.12216.
[29]
De Santis, A.; Balduini, M.; De Santis, F.; Proia, A.; Leo, A.; Brambilla, M.; and Della Valle, E. 2024. Integrating Large Language Models and Knowledge Graphs for Extraction and Validation of Textual Test Data. arXiv preprint arXiv:2408.01700.
[30]
AlDaajeh, S.; Saleous, H.; Alrabaee, S.; Barka, E.; Breitinger, F.; and Choo, K.-K. R. 2022. The role of national cybersecurity strategies on the improvement of cybersecurity education. Computers & Security, 119: 102754.
[31]
Newhouse, W.; Keith, S.; Scribner, B.; and Witte, G. 2017. National initiative for cybersecurity education (NICE) cybersecurity workforce framework. NIST special publication, 800(2017): 181.
[32]
Rahman, N. A. A.; Sairi, I. H.; Zizi, N. A. M.; and Khalid, F. 2020. The importance of cybersecurity education in school. International Journal of Information and Education Technology, 10(5): 378–382.
[33]
Cheung, R. S.; Cohen, J. P.; Lo, H. Z.; and Elia, F. 2011. Challenge based learning in cybersecurity education. In Proceedings of the International Conference on Security and Management (SAM), volume 1. The Steering Committee of The World Congress in Computer Science, Computer….
[34]
Schneider, F. B. 2013. Cybersecurity education in universities. IEEE Security & Privacy, 11(4): 3–4.
[35]
Ai, L.; Kumarage, T.; Bhattacharjee, A.; Liu, Z.; Hui, Z.; Davinroy, M.; Cook, J.; Cassani, L.; Trapeznikov, K.; Kirchner, M.; et al. 2024. Defending Against Social Engineering Attacks in the Age of LLMs. arXiv preprint arXiv:2406.12263.
[36]
Laato, S.; Farooq, A.; Tenhunen, H.; Pitkamaki, T.; Hakkala, A.; and Airola, A. 2020. Ai in cybersecurity education-a systematic literature review of studies on cybersecurity moocs. In 2020 IEEE 20th International Conference on Advanced Learning Technologies (ICALT), 6–10. IEEE.
[37]
Grover, S.; Broll, B.; and Babb, D. 2023. Cybersecurity education in the age of ai: Integrating ai learning into cybersecurity high school curricula. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1, 980–986.
[38]
Wei-Kocsis, J.; Sabounchi, M.; Mendis, G. J.; Fernando, P.; Yang, B.; and Zhang, T. 2023. Cybersecurity Education in the Age of Artificial Intelligence: A Novel Proactive and Collaborative Learning Paradigm. IEEE Transactions on Education.
[39]
Ferrari, E. P.; Wong, A.; and Khmelevsky, Y. 2024. Cybersecurity Education within a Computing Science Program-A Literature Review. In Proceedings of the 26th Western Canadian Conference on Computing Education, 1–5.
[40]
Pal, K. K.; Kashihara, K.; Banerjee, P.; Mishra, S.; Wang, R.; and Baral, C. 2021. Constructing flow graphs from procedural cybersecurity texts. arXiv preprint arXiv:2105.14357.
[41]
Deng, Y.; Lu, D.; Huang, D.; Chung, C.-J.; and Lin, F. 2019. Knowledge graph based learning guidance for cybersecurity hands-on labs. In Proceedings of the ACM conference on global computing education, 194–200.
[42]
Deng, Y.; Zeng, Z.; Jha, K.; and Huang, D. 2021. Problem-based cybersecurity lab with knowledge graph as guidance. Journal of Artificial Intelligence and Technology.
[43]
Deng, Y.; Zeng, Z.; and Huang, D. 2021. Neocyberkg: Enhancing cybersecurity laboratories with a machine learning-enabled knowledge graph. In Proceedings of the 26th ACM Conference on Innovation and Technology in Computer Science Education V. 1, 310–316.
[44]
Agrawal, G.; Deng, Y.; Park, J.; Liu, H.; and Chen, Y.-C. 2022. Building knowledge graphs from unstructured texts: Applications and impact analyses in cybersecurity education. Information, 13(11): 526.
[45]
Agrawal, G.; Pal, K.; Deng, Y.; Liu, H.; and Baral, C. 2023. AISecKG: Knowledge Graph Dataset for Cybersecurity Education. In Proceedings of the AAAI 2023 Spring Symposium on Challenges Requiring the Combination of Machine Learning and Knowledge Engineering (AAAI-MAKE 2023), Hyatt Regency, San Francisco Airport, California, USA, March 27-29, 2023, volume 3433 of CEUR Workshop Proceedings. CEUR-WS.org.
[46]
Agrawal, G.; Pal, K.; Deng, Y.; Liu, H.; and Chen, Y.-C. 2024. CyberQ: Generating Questions and Answers for Cybersecurity Education Using Knowledge Graph-Augmented LLMs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 23164–23172.
[47]
Agrawal, G. 2023. AISecKG: Knowledge Graph Dataset for Cybersecurity Education. In AAAI-MAKE 2023: Challenges Requiring the Combination of Machine Learning and Knowledge Engineering.
[48]
Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi, Y. 2020. BERTScore: Evaluating Text Generation with BERT. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
[49]
Banerjee, S.; and Lavie, A. 2005. An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005, Ann Arbor, Michigan, USA, June 29, 2005, 65–72. Association for Computational Linguistics.
[50]
Lin, C.-Y. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, 74–81.
[51]
Izacard, G.; Caron, M.; Hosseini, L.; Riedel, S.; Bojanowski, P.; Joulin, A.; and Grave, E. 2022. Unsupervised Dense Information Retrieval with Contrastive Learning. Trans. Mach. Learn. Res., 2022.
[52]
Dubey, A.; Jauhri, A.; Pandey, A.; and et al. 2024. The Llama 3 Herd of Models. CoRR, abs/2407.21783.
[53]
ES, S.; James, J.; Anke, L. E.; and Schockaert, S. 2024. RAGAs: Automated Evaluation of Retrieval Augmented Generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2024 - System Demonstrations, St. Julians, Malta, March 17-22, 2024, 150–158. Association for Computational Linguistics.