Jihui Nie · Dehui Du (✉) · Jiangnan Zhao
Abstract
Intelligent Cyber-Physical Systems (ICPS) represent a specialized form of Cyber-Physical System (CPS) that incorporates intelligent components, notably Convolutional Neural Networks (CNNs) and Deep Reinforcement Learning (DRL), to undertake multifaceted tasks encompassing perception, decision-making, and control. The utilization of DRL for decision-making facilitates dynamic interaction with the environment, generating control actions aimed at maximizing cumulative rewards. Nevertheless, the inherent uncertainty of the operational environment and the intricate nature of ICPS necessitate exploration within complex and dynamic state spaces during the learning phase. DRL confronts challenges in terms of efficiency, generalization capabilities, and data scarcity during the decision-making process. In response to these challenges, we propose an innovative abstract modeling approach grounded in spatial-temporal value semantics, capturing the evolution in the distribution of semantic value across time and space. A semantics-based abstraction is introduced to construct an abstract Markov Decision Process (MDP) for the DRL learning process. Furthermore, optimization techniques for abstraction are delineated, aiming to refine the abstract model and mitigate semantic gaps between abstract and concrete states. The efficacy of the abstract modeling is assessed through the evaluation and analysis of the abstract MDP model using PRISM. A series of experiments are conducted, involving diverse scenarios such as lane-keeping, adaptive cruise control, and intersection crossroad assistance, to demonstrate the effectiveness of our abstraction approach.
Keywords:
Markov Decision Process · Spatio-temporal Value Semantics · Deep Reinforcement Learning · Abstract Modeling · PRISM.
1 Introduction
The Cyber-Physical System (CPS) integrates computing, networking, and physical environments, expertly coordinated by computer and communication components with a joint monitoring mechanism[42]. The evolution of the Intelligent Cyber-Physical System (ICPS) as a mainstream paradigm is marked by the integration of AI-enabled components such as controllers or sensors[35]. Notably, the utilization of Deep Reinforcement Learning (DRL) in decision-making, as emphasized by Brunke et al.[8], is particularly promising due to its innate ability for dynamic interaction within the environment.
However, the efficacy of DRL encounters a formidable challenge in the form of an expansive state space, leading to prolonged algorithmic convergence and heightened complexities in the formal verification of the learning process. Furthermore, the inherent black-box nature of DRL poses challenges when applied to safety-critical ICPS scenarios, such as Autonomous Driving Systems (ADSs) and robots. To surmount the challenge posed by the vast state space in ICPS tasks, strategic compression becomes imperative. Dense DRL is a promising field to address these issues[14]. An effective abstraction modeling approach in this context involves leveraging prior knowledge for generalization from concrete to abstract states[25].
The existing abstraction modeling methods can be roughly divided into three categories. In the first category, the focus lies on abstracting similar states, thereby addressing the challenge of sparse rewards in DRL[13, 29, 26]. Strategic games, such as Go[37], utilize hierarchical organization based on the significance of empty points on the board's corners and edges. This approach, inspired by consistent patterns in Go, allows the amalgamation of game elements into a unified abstract state, offering a potential solution for the slow convergence issue. Real-time strategy games[28] similarly represent states through collections of game elements and positions. The second category involves temporal abstraction, extending decision-making over time[21, 5, 17]. This concept extends to natural language processing, where units of state are often identified as words or phrases through corpus analysis[7, 33]. This offers a promising avenue for effectively compressing the state space, thereby facilitating improved algorithmic convergence in DRL. In the realm of ICPS, the third category focuses on state-action abstraction, a method extensively applied in addressing the intricate challenges unique to these systems. Notably, empirical evidence establishes a polynomial relationship between the number of ICPS samples and the state space's size[1], emphasizing the nuanced nature of the challenge. For instance, the operation of ADS unfolds within open and dynamic environments characterized by inherent complexity and unpredictability. This complexity manifests in two key aspects. Firstly, the expansive state space explored by DRL incurs substantial exploration costs due to its high dimensions, contributing to discernible scalability limitations, as extensively discussed in existing literature[41]. Secondly, the state-action abstraction approach, while effective in enhancing abstract efficiency and addressing gaps in the abstract model, tends to overlook the consistency of the constructed model in the formal method, leading to potential distortions in the model's representation.
In the delineated classification, it becomes evident that traditional approaches to abstraction typically concentrate on minimizing model size while concurrently preserving model accuracy. However, when confronted with the complexities of ICPS, characterized by abundant randomness and uncertainty in the state space, these models frequently fall short of fulfilling the requisites for formal verification. Establishing a verifiable abstract model for ICPS is an imperative undertaking, underscoring the necessity to meticulously tailor the granularity of abstraction to the specific requirements and intricacies of the application context. Striking a delicate balance between mitigating state explosion and sustaining optimal model performance is crucial, as expounded in the work by Schmidt et al.[36].
The main contributions in our work include:
- 1.
We propose the spatio-temporal value metric to measure the similarity between states. It enables effective semantics-preserving abstraction.
- 2.
We propose a novel abstraction modeling approach to construct the abstract MDP model based on spatio-temporal value metric. Moreover, we optimize the abstract model to make the abstract state space closer to the original state space.
- 3.
We use the compression ratio and mean absolute error to measure the simplicity and accuracy of the model. We innovatively use PRISM to check for semantic gaps between the abstract and real models. We use the abstract MDP to build a dense DRL framework that generates joint policies in online environments.
Paper organization. The remainder of the paper is organized as follows. In Section 2, we provide a review of pertinent background and preliminary information. In Section 3, we present the overarching framework of our approach. In Section 4, we conduct an assessment of our approach through several case studies related to ADS. Finally, in Section 5, we review related work, and our conclusions are summarized in Section 6.
2 Preliminaries
2.1 Deep Reinforcement Learning
DRL is an amalgamation of reinforcement learning (RL) and deep learning, training agents to achieve goals in simulated or real environments through a sequence of decision-making actions. By learning through continuous trial and error, DRL enables agents to explore environments and discover optimal strategies. At each time step, a DRL agent receives state information from the environment, which is used as input to a deep neural network (DNN). Subsequently, the agent selects an action from a set of possible actions and receives real-time rewards, aiming to maximize cumulative rewards. The process of DRL can be formalized using MDPs.
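As a rough illustration of this interaction loop, the following sketch pairs a toy stand-in environment with a random policy in place of the DNN; the environment dynamics, names, and reward shape are illustrative assumptions, not taken from our experiments:

```python
import random

class ToyEnv:
    """A stub environment with a one-dimensional state, standing in for an ICPS simulator."""
    def reset(self):
        self.state = 0.0
        return self.state

    def step(self, action):
        # The state drifts with the chosen action; the reward favors staying near zero.
        self.state += action + random.uniform(-0.1, 0.1)
        reward = -abs(self.state)
        done = abs(self.state) > 5.0
        return self.state, reward, done

def policy(state):
    # Placeholder for the DNN policy: a random action in [-1, 1].
    return random.uniform(-1.0, 1.0)

env = ToyEnv()
state, total_reward = env.reset(), 0.0
for t in range(100):                         # one episode of at most 100 time steps
    action = policy(state)                   # the agent selects an action from the state
    state, reward, done = env.step(action)   # the environment returns next state and reward
    total_reward += reward                   # cumulative reward the agent tries to maximize
    if done:
        break
print(total_reward)
```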
2.2 Markov Decision Process
Definition 1 (Markov Decision Process)
A Markov Decision Process is denoted by a tuple $M = (S, s_0, A, P, R, \gamma)$, where $S$ represents a finite non-empty state set, $s_0 \in S$ denotes the initial system state, $A$ stands for a finite non-empty action set, and $P: S \times A \times S \to [0, 1]$ is the transition probability function such that for $s \in S$ and $a \in A$, $\sum_{s' \in S} P(s' \mid s, a) = 1$. $R(s, a)$ represents the reward assigned to the current state-action pair and $\gamma \in [0, 1]$ is the discount factor. The discount factor determines the importance of immediate rewards relative to future rewards. A policy for the MDP, denoted as $\pi: S \to A$, maps states to actions.
The MDP describes the evolution of the initial system state over discrete time steps. In DRL, the interaction between the policy and the environment, resulting in state transitions and immediate rewards, forms the foundation of the learning process. The evaluation and optimization of the policy involve the state value function and the action value function.
Definition 2 (State Value Function $V^{\pi}$)
The state value function $V^{\pi}(s)$ represents the expected return achievable under a policy $\pi$ from state $s$. The Bellman expectation equation for $V^{\pi}$ in recursive form is:
$V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s)\Big[R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a)\, V^{\pi}(s')\Big] \qquad (1)$
Definition 3 (Action Value Function $Q^{\pi}$)
The action value function $Q^{\pi}(s,a)$ represents the expected return achievable from taking action $a$ in state $s$ and following policy $\pi$. The Bellman expectation equation for $Q^{\pi}$ in recursive form is:
$Q^{\pi}(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a) \sum_{a' \in A} \pi(a' \mid s')\, Q^{\pi}(s',a') \qquad (2)$
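For concreteness, both Bellman expectation equations can be evaluated iteratively on a small tabular MDP. The sketch below uses toy transition and reward tables and a uniform random policy (all assumed for illustration) to compute $V^{\pi}$ via Eq. 1 and then $Q^{\pi}$ via Eq. 2:

```python
import numpy as np

# Toy MDP: 3 states, 2 actions. P[s, a, s'] are transition probabilities, R[s, a] rewards.
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
              [[0.0, 0.5, 0.5], [0.3, 0.0, 0.7]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0],
              [0.0, 0.0]])
gamma = 0.9
pi = np.full((3, 2), 0.5)            # uniform random policy pi(a|s)

V = np.zeros(3)
for _ in range(500):                 # iterate the Bellman expectation equation (Eq. 1)
    V = np.sum(pi * (R + gamma * P @ V), axis=1)

Q = R + gamma * P @ V                # action values from Eq. 2, using the converged V
print(V, Q)
```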
With the rapid development in the field of DRL, many online, model-free learning algorithms have been proposed to fulfill various requirements[8], such as Deep Deterministic Policy Gradient (DDPG), Twin Delayed Deep Deterministic Policy Gradient (TD3), Advantage Actor-Critic (A2C), Proximal Policy Optimization (PPO), and more. The use of DRL controllers in place of traditional controllers holds great promise in large-scale systems with complex dynamics.
2.3 Abstract Markov Decision Process
Definition 4 (Abstract Markov Decision Process)
Let $M = (S, s_0, A, P, R, \gamma)$ denote the true MDP, and $\hat{M} = (\hat{S}, \hat{s}_0, \hat{A}, \hat{P}, \hat{R}, \gamma)$ denote the abstract MDP. $\phi: S \to \hat{S}$ is the state abstraction function, where $\hat{s} = \phi(s)$ represents the abstract state, and its inverse image is denoted as $\phi^{-1}(\hat{s}) = \{ s \in S \mid \phi(s) = \hat{s} \}$, where $\phi^{-1}(\hat{s}) \subseteq S$ is the basic state set corresponding to the abstraction function $\phi$. $\psi: A \to \hat{A}$ is the action abstraction function, where $\hat{a} = \psi(a)$ represents the abstract action, and its inverse image is denoted as $\psi^{-1}(\hat{a}) = \{ a \in A \mid \psi(a) = \hat{a} \}$, where $\psi^{-1}(\hat{a}) \subseteq A$ is the basic action set corresponding to the abstraction function $\psi$.
State transition and reward functions are defined as follows:
$\hat{P}(\hat{s}' \mid \hat{s}, \hat{a}) = \sum_{s \in \phi^{-1}(\hat{s})} w(s) \sum_{a \in \psi^{-1}(\hat{a})} v(a) \sum_{s' \in \phi^{-1}(\hat{s}')} P(s' \mid s, a) \qquad (3)$
$\hat{R}(\hat{s}, \hat{a}) = \sum_{s \in \phi^{-1}(\hat{s})} w(s) \sum_{a \in \psi^{-1}(\hat{a})} v(a)\, R(s, a) \qquad (4)$
where $w(s)$, with $\sum_{s \in \phi^{-1}(\hat{s})} w(s) = 1$, represents the weight function for states, and $v(a)$, with $\sum_{a \in \psi^{-1}(\hat{a})} v(a) = 1$, represents the weight function for actions. $\hat{R}(\hat{s}, \hat{a})$ represents the immediate reward obtained in abstract state $\hat{s}$ after taking the abstract action $\hat{a}$, and $\hat{P}(\hat{s}' \mid \hat{s}, \hat{a})$ represents the probability of transitioning from abstract state $\hat{s}$ to $\hat{s}'$ after taking the abstract action $\hat{a}$. The abstract policy $\hat{\pi}$ is generated based on the abstract MDP.
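The following sketch shows how Eqs. 3 and 4 might be evaluated for tabular data; the uniform weight functions and the array layout are illustrative assumptions rather than the construction used later in the paper:

```python
import numpy as np

def abstract_mdp(P, R, phi, psi):
    """Aggregate a concrete tabular MDP (P[s, a, s'], R[s, a]) into an abstract one.

    phi maps each concrete state to an abstract state; psi maps each concrete action
    to an abstract action. Uniform weights over the preimages are assumed here,
    although Eqs. 3-4 allow any weighting that sums to one.
    """
    nS, nA = len(set(phi)), len(set(psi))
    P_abs = np.zeros((nS, nA, nS))
    R_abs = np.zeros((nS, nA))
    weight = np.zeros((nS, nA))
    for s in range(P.shape[0]):
        for a in range(P.shape[1]):
            R_abs[phi[s], psi[a]] += R[s, a]
            weight[phi[s], psi[a]] += 1.0
            for s2 in range(P.shape[2]):
                P_abs[phi[s], psi[a], phi[s2]] += P[s, a, s2]
    R_abs /= weight                       # normalize: uniform weights over each preimage
    P_abs /= weight[:, :, None]
    return P_abs, R_abs

# Example: 4 concrete states folded into 2 abstract states, 2 actions into 1.
P = np.random.dirichlet(np.ones(4), size=(4, 2))    # random row-stochastic transitions
R = np.random.rand(4, 2)
P_hat, R_hat = abstract_mdp(P, R, phi=np.array([0, 0, 1, 1]), psi=np.array([0, 0]))
```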
2.4 PRISM
PRISM serves as an open-source probabilistic model checker for the formal modeling and analysis of probabilistic systems[22, 32]. Widely applied in diverse application domains, PRISM has been instrumental in analyzing systems ranging from communication and multimedia protocols to randomized distributed algorithms, security protocols, biological systems, and beyond. PRISM is proficient in constructing and scrutinizing a variety of probabilistic models, encompassing discrete-time and continuous-time Markov chains (DTMCs and CTMCs), MDPs, and probabilistic automata (PA). Additionally, PRISM supports extensions of these models that incorporate cost and reward considerations. It facilitates automated analysis of a broad spectrum of quantitative properties inherent in these models. For instance, users can inquire about the probability of a system shutdown within 4 hours due to a failure, the worst-case probability of a protocol terminating in error across all potential initial configurations, the anticipated size of a message queue after 30 minutes, or the worst-case expected time for an algorithm to conclude. The property specification language integrates temporal logics such as PCTL, CSL, LTL, and PCTL*, alongside extensions for quantitative specifications, costs, and rewards.
3 Spatio-temporal Value Semantics based Abstraction for DRL
For ensuring the safety and reliability of ICPS, the design and optimization of controllers play a pivotal role. However, the hybrid behavior of the system and the uncertainties in the environment contribute to a vast state space for ICPS, rendering the design and optimization of controllers using DRL a complex task. To address this complexity, we propose an abstraction modeling approach based on spatio-temporal value semantics, aiming to efficiently abstract the state space and actions of ICPS, thereby facilitating the optimization of controller design through DRL. The key aspect of state abstraction revolves around ensuring semantic consistency, which involves measuring the similarity between different states and determining their belongingness to the same abstract state. To achieve this, we introduce a novel measurement approach termed spatio-temporal value semantics. Fig. 2 illustrates the process of abstracting DRL based on spatio-temporal value semantics.
3.1 Action Abstraction
The controller in ICPS usually selects a specific value from a continuous range as the concrete value of an action. Taking ADS as an example, the acceleration typically lies in a bounded range, and the accelerator and brake of the vehicle drive it to the expected acceleration. However, the triggering action of the MDP is usually a concrete value in this range. Constructing an MDP over this infinite number of concrete actions in ICPS is clearly unrealistic.
To address this issue, we propose a technique for abstracting continuous action spaces. The fundamental idea is to finely segment the continuous action space so that executing an abstract action yields successor states and rewards analogous to those of the corresponding concrete action in the true MDP. Inspired by[18], we introduce the interval action box as follows:
Definition 5 (Interval Box)
For a continuous action space $A \subseteq \mathbb{R}^n$ of dimension $n$, each variable in each dimension has its own effective range, i.e., the variable in the $i$-th dimension ($1 \le i \le n$) lies in the range $[l_i, u_i]$. The interval box method divides this range uniformly into unit intervals of length $\delta_i$, which yields a partition of the continuous action space $A$. For an action $a = (a_1, \dots, a_n) \in A$, based on the interval box abstraction, the abstract action $\hat{a} = (\hat{a}_1, \dots, \hat{a}_n)$ of the abstract action space is obtained, where $\hat{a}_i$ is the unit interval containing $a_i$ and $\delta_i$ is the abstraction granularity of the $i$-th dimension.
According to Definition 5, the abstraction granularity must be adjusted in the abstract MDP so that the successor state and reward obtained by performing the abstract action remain close to those obtained by executing the actual action in the true MDP. Each dimension's level of granularity needs to be specified for the specific environment. Different cases call for varying levels of granularity: finer granularity stays closer to the original action, whereas excessively fine granularity amplifies the effect of data jitter.
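A minimal sketch of interval-box discretization is given below; the per-dimension bounds, the granularities, and the choice of the interval center as the representative abstract action are assumptions for illustration:

```python
import numpy as np

def to_interval_box(action, lower, upper, granularity):
    """Map a continuous action to the index and center of its interval box."""
    action = np.clip(action, lower, upper)
    idx = np.floor((action - lower) / granularity).astype(int)
    n_cells = np.floor((upper - lower) / granularity).astype(int)
    idx = np.minimum(idx, n_cells - 1)              # keep the upper boundary inside the last cell
    center = lower + (idx + 0.5) * granularity      # representative abstract action
    return idx, center

# Example: a 2-D action (acceleration, steering) with a different granularity per dimension.
lower, upper = np.array([-5.0, -0.5]), np.array([5.0, 0.5])
granularity = np.array([0.5, 0.05])
idx, abstract_action = to_interval_box(np.array([1.23, -0.07]), lower, upper, granularity)
print(idx, abstract_action)
```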
3.2 Semantics-based State Abstraction
Existing state abstraction approaches struggle to handle the high dimensionality and continuous state space of ICPS. While model-irrelevance abstraction is an ideal solution in theory, it is challenging to implement in practice. To address these issues, we propose a semantic-based abstraction approach to introduce value information, spatio-temporal information, and probability information into the abstraction process, which enables the evaluation of the similarity between states. We first introduce the semantic-based abstraction model.
Definition 6 (Semantic-based Abstraction MDP)
An abstraction model is denoted by a tuple $\hat{M} = (\hat{S}, \hat{A}, \hat{P}, \hat{S}_0, \mathcal{L})$, where $\hat{S}$ is the set of abstract states, $\hat{A}$ is the set of abstract actions, $\hat{P}$ represents the transition distribution, $\hat{S}_0$ is the initial state set, and $\mathcal{L}$ represents the mapping to the semantic space.
It is worth noting that the core of semantic abstraction lies in measuring the similarity of states in the abstract MDP. Similar states must satisfy two conditions: (1) the available action sets of similar states should be similar, and (2) the multi-step state transition models and multi-step rewards of similar states should be similar.
3.2.1 Semantic Interval Abstraction
To satisfy the specified conditions, we implement Semantic Interval Abstraction. Consider Adaptive Cruise Control (ACC) as an example, where the goal is to maintain a safe distance from the lead vehicle. The concrete state of the ego vehicle, represented as a multidimensional vector $s = (v, a, x, y, d, \dots)$, includes parameters such as the vehicle speed $v$, acceleration $a$, spatial coordinates $(x, y)$, and relative distance $d$.
Based on existing causal discovery algorithms, we employ the PC (Peter-Clark) algorithm[39] and the FCI (Fast Causal Inference) algorithm[40] to construct causal graphs on the autonomous driving dataset. The union of these two causal graphs is taken as the final causal graph. By examining the causal relationships within the graph, we identify the relationships among different dimensions of the state. Through abstraction based on these causal relationships, we determine the semantic values.
Due to the continuous nature of each dimension, the combination of specific dimensions results in a vast state space. In the context of ICPS, information from the state vector exhibits approximate similarity within short time intervals. Thus, the concrete state undergoes semantic abstraction, yielding a condensed representation $(v_{rel}, \theta_{rel}, d_{rel})$, where $v_{rel}$, $\theta_{rel}$, and $d_{rel}$ represent the relative velocity, relative angle, and relative distance, respectively. In summary, semantic interval abstraction maps multiple dimensions of a concrete state into several semantic values, achieving abstraction of state dimensions.
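As an illustration, condensing a raw ACC state into the three semantic values might look as follows; the field names and the geometric conventions are assumptions, since the concrete state layout is environment-specific:

```python
import math

def semantic_values(ego, lead):
    """Condense two raw vehicle states into (relative velocity, relative angle, relative distance).

    ego and lead are dicts with keys 'x', 'y', 'v', 'heading' -- an assumed state layout.
    """
    dx, dy = lead["x"] - ego["x"], lead["y"] - ego["y"]
    d_rel = math.hypot(dx, dy)                                   # relative distance
    v_rel = lead["v"] - ego["v"]                                 # relative velocity
    theta_rel = math.atan2(dy, dx) - ego["heading"]              # bearing to the lead vehicle
    theta_rel = (theta_rel + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi]
    return v_rel, theta_rel, d_rel

ego = {"x": 0.0, "y": 0.0, "v": 25.0, "heading": 0.0}
lead = {"x": 40.0, "y": 1.5, "v": 22.0, "heading": 0.0}
print(semantic_values(ego, lead))
```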
Since the dimensions of the semantic value may have different scales, the data needs to be normalized to a uniform scale. Therefore, we divide the $k$-dimensional semantic space into segments, with the $i$-th dimension partitioned into $m_i$ intervals, i.e., $\{I^i_1, I^i_2, \dots, I^i_{m_i}\}$. In this context, $I^i_j = [b^i_j, u^i_j]$ represents the $j$-th interval on the $i$-th dimension, and $b^i_j$ and $u^i_j$ are the lower and upper bounds of this interval. Consequently, the spatial partitioning problem is transformed into an optimization problem, specifically formulated as follows:
$\text{find } \{I^i_j \mid 1 \le i \le k,\ 1 \le j \le m_i\} \quad \text{s.t.} \quad l^i_{\min} \le u^i_j - b^i_j \le l^i_{\max}, \quad \lvert S(I^i_j) \rvert \ge N^i_{\min}, \quad \mathrm{err}(I^i_j) \le \epsilon^i_{\max}, \quad \tfrac{1}{m_i}\sum_{j=1}^{m_i} \mathrm{err}(I^i_j) \le \epsilon^i \qquad (5)$
where $l^i_{\min}$ and $l^i_{\max}$ are the minimum and maximum lengths of intervals on the $i$-th semantic dimension, $S(I^i_j)$ represents the set of concrete states with semantic values falling within the interval $I^i_j$, $N^i_{\min}$ is the minimum number of concrete states in an $i$-th dimension interval, $\mathrm{err}(\cdot)$ denotes the abstraction error of an interval, and $\epsilon^i$ and $\epsilon^i_{\max}$ are the predefined expected error and maximum error for abstraction on the $i$-th dimension. Eq. 5 ensures that each interval contains enough concrete states while maintaining low abstraction errors.
Alg. 1 has been devised to orchestrate the transformation of a given concrete state set into an intervalized abstract space. This transformation takes into account semantic values and adheres to specific constraints. The essential parameters guiding this process are the semantic value set, the maximum and minimum interval lengths, the minimum number of concrete states per interval, the expected error threshold, and the limit on the reduction level.
The term REDUCTION_LEVEL[38] serves as an indicator of the state compression ratio. The optimization iterations terminate when the abstraction effect meets the predefined requirements. A reasonable range is to compress the number of states to 10% to 30% of the original number of states.
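Alg. 1 itself is not reproduced here; the sketch below shows one greedy way to intervalize a single semantic dimension under minimum-count and error constraints in the spirit of Eq. 5 (the error surrogate, half the interval width, and all thresholds are assumptions):

```python
import numpy as np

def intervalize(values, l_min, l_max, n_min, eps_max):
    """Greedily grow intervals along one semantic dimension (a rough stand-in for Alg. 1).

    An interval is closed just before its width would exceed min(l_max, 2 * eps_max),
    provided it already spans at least l_min and holds at least n_min concrete states.
    Half the interval width is used as an assumed surrogate for the abstraction error.
    """
    values = np.sort(values)
    bounds, start_idx = [], 0
    for i in range(1, len(values)):
        width = values[i] - values[start_idx]
        count = i - start_idx
        if width > min(l_max, 2 * eps_max) and count >= n_min and width >= l_min:
            bounds.append((values[start_idx], values[i - 1]))
            start_idx = i
    bounds.append((values[start_idx], values[-1]))   # close the final interval
    return bounds

data = np.random.normal(loc=0.0, scale=1.0, size=2000)
intervals = intervalize(data, l_min=0.05, l_max=0.5, n_min=30, eps_max=0.2)
```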
3.2.2 Semantic-based (ε, d)-Abstraction
While the semantic interval abstraction can ensure model accuracy, abstraction simplicity is related to the granularity of abstraction. To address this, we propose (ε, d)-abstraction grounded in spatio-temporal value semantics.
Definition 7 (Spatio-temporal Value Semantics)
For a concrete state $s$ and spatio-temporal value semantics $\mathcal{L}$, $\mathcal{L}(s) = (\ell_1(s), \ell_2(s), \dots, \ell_m(s))$. Here, $\ell_i(s)$ represents a semantic value, and the function $\mathcal{L}$ maps states to their corresponding semantic values. The multidimensional mapping captures various aspects of the state $s$, including the state value function $V(s)$, the action value function $Q(s,a)$, the reward function $R(s,a)$, and the transition function $P(\cdot \mid s,a)$, among others. The semantic mapping translates state and environmental information into the semantic space, thus depicting the spatio-temporal value of the state.
Spatio-temporal value semantics encapsulate state evolution information across time and space, reflecting the state's distribution and evolution. It considers factors like state value function and transition function, providing insights into both current and future states. Using spatio-temporal value semantics, we assess semantic equivalence between abstract states. When the semantic distance is below a specified threshold, states are considered equivalent, simplifying model abstraction with enhanced precision.
Definition 8 ((ε, d)-Abstraction)
(ε, d)-Abstraction is denoted by a mapping $\phi: S \to \hat{S}$ that satisfies the following condition:
$\forall s_1, s_2 \in S: \quad \phi(s_1) = \phi(s_2) \implies d(s_1, s_2) \le \epsilon \qquad (6)$
$\phi$ represents the abstract mapping function that maps the original state space $S$ into an abstract state space $\hat{S}$. The mapping function transforms a true MDP into an abstract MDP. Let $2^S$ denote the power set of $S$, and $\phi^{-1}: \hat{S} \to 2^S$ represents the inverse mapping function. The core of state abstraction is measuring the similarity of state abstractions and performing nearest-neighbor abstractions based on state similarity. $d$ represents the state distance metric, and $\epsilon$ is the abstraction threshold.
Grounded in the state value function (Eq. 1) and action value function (Eq. 2) of the MDP, when two states possess akin transition models and rewards, their expected cumulative rewards will be similar. This observation provides a simplification approach for Semantic-based (ε, d)-Abstraction: the reward function and transition probabilities can compose the Spatio-temporal Value Metric of that state. Thus, in the process of abstraction based on value semantics, we aim to keep the optimal value function of the abstract MDP as close as possible to that of the true MDP, ensuring semantic equivalence.
Definition 9 (Spatio-temporal Value Metric)
For $s_1, s_2 \in S$,
$d(s_1, s_2) = \max_{a \in A} \big( c_R \,\lvert R(s_1, a) - R(s_2, a) \rvert + c_T \, D\!\left(P(\cdot \mid s_1, a),\, P(\cdot \mid s_2, a)\right) \big) \qquad (7)$
where $c_R$ and $c_T$ are positive constants and $D$ represents a measure function of the similarity between two probability distributions. $c_R$ and $c_T$ are weights for measuring rewards and transition probability density functions, respectively, and are taken to be sufficiently large positive constants; $s_1$ is considered equivalent to $s_2$ if $d(s_1, s_2) = 0$. The spatio-temporal value metric satisfies mutual simulatability and uniqueness, i.e., $d(s_1, s_2) = 0$ implies $\phi(s_1) = \phi(s_2)$.
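Under a tabular reading of Definition 9, the metric can be computed from per-action reward gaps plus a distribution distance between the two states' transition rows. The sketch below uses the 1-Wasserstein distance from SciPy as the distribution measure $D$; the weights and the worst-case aggregation over actions are assumptions:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def st_value_metric(s1, s2, P, R, c_R=1.0, c_T=1.0):
    """Spatio-temporal value distance between two tabular states (a sketch of Def. 9).

    P[s, a, s'] and R[s, a] are the (learned) transition and reward tables; c_R and c_T
    weight the reward gap and the transition-distribution distance.
    """
    support = np.arange(P.shape[2])
    d = 0.0
    for a in range(P.shape[1]):
        reward_gap = abs(R[s1, a] - R[s2, a])
        trans_gap = wasserstein_distance(support, support, P[s1, a], P[s2, a])
        d = max(d, c_R * reward_gap + c_T * trans_gap)    # worst case over actions
    return d

P = np.random.dirichlet(np.ones(3), size=(3, 2))          # toy 3-state, 2-action tables
R = np.random.rand(3, 2)
print(st_value_metric(0, 1, P, R))
```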
The spatio-temporal value metric is employed in the Semantic-based (ε, d)-Abstraction algorithm, as delineated in Alg. 2. The procedure outlined in Alg. 2 takes the intervalized abstract space as its input. Key steps in the algorithmic execution involve the determination of the number of clusters based on the optimal state number determination function (Line 3), initializing cluster centroids randomly (Line 4), and iteratively assigning data points to the nearest centroids while updating centroids until convergence (Lines 5-14). The ultimate outcome comprises the Semantic-based (ε, d)-Abstraction space and the Abstraction Model, where the centroids encapsulate the final representation of the abstract space (Line 15).
The time complexity of Alg. 2 is determined by its iterative clustering, which involves data point assignments to centroids and centroid updates until convergence. The assignment step has a complexity of $O(n \cdot K \cdot m)$ per iteration, where $n$ is the number of data points, $K$ the number of clusters, and $m$ the dimensionality. Centroid updates have a complexity of $O(n \cdot m)$ per iteration. The algorithm's effectiveness is influenced by the initial centroid positions and the fixed number of clusters, requiring adjustments based on dataset characteristics for optimal semantic-based (ε, d)-abstraction.
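Alg. 2 is likewise not reproduced here; the following sketch illustrates the same k-means-style loop with a pluggable state distance. A Euclidean stand-in is used in the example, since the spatio-temporal value metric additionally needs the learned reward and transition tables; the cluster count and convergence test are assumptions:

```python
import numpy as np

def kmeans_abstraction(points, k, dist, n_iter=100, seed=0):
    """K-means-style clustering with a pluggable state distance (in the spirit of Alg. 2)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid under the chosen metric.
        labels = np.array([np.argmin([dist(p, c) for c in centroids]) for p in points])
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Intervalized semantic states as 3-D points; Euclidean distance stands in for the
# spatio-temporal value metric, which additionally needs the learned P and R tables.
states = np.random.rand(500, 3)
labels, centers = kmeans_abstraction(states, k=8, dist=lambda a, b: np.linalg.norm(a - b))
```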
3.3 Construction of Abstract MDP
In this subsection, we introduce a comprehensive methodology for constructing an MDP, essential for formulating decision-making frameworks in stochastic environments. Fig.2 illustrates the procedural details of the approach. Initiated by gathering detailed trajectory data comprising observed states and actions over time, this approach tackles the inherent complexity and high dimensionality of the raw data. Central to our approach is the systematic abstraction of this data, necessary for effective MDP modeling. The methodology encompasses two primary abstraction processes: action abstraction through interval box and state abstraction through spatio-temporal value semantics. Action abstraction involves discretizing the continuous and diverse real-world actions into distinct intervals, each representing a group of similar actions. This process simplifies the action space, enhancing tractability and computational feasibility. Simultaneously, state abstraction condenses the state space by Alg.1 and Alg.2, thereby capturing their essential characteristics. These abstractions are pivotal in reducing complexity, allowing for more efficient computation and analysis.
The next critical step in our methodology is the statistical computation of transition probabilities, derived from the frequency of state transitions in the trajectory data. These probabilities reflect the likelihood of moving from one state to another given a specific action, mirroring the dynamics of real-world scenarios. When calculating transition probabilities, we employ Hoeffding's inequality to reduce errors, enhancing the accuracy of our probability estimates. After abstracting states and actions and incorporating these probabilities, we formulate the initial MDP model. This model may include a reward system based on state transitions. However, recognizing that the initial model may not fully align with real-world contexts, we engage in iterative refinement. This involves adjusting abstractions, recalculating probabilities, and redefining states and actions to enhance the model's empirical alignment with observed data.
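A sketch of the frequency-based estimation together with a Hoeffding-style confidence half-width is shown below; the confidence level δ and the trajectory encoding are assumptions:

```python
import math
from collections import Counter, defaultdict

def estimate_transitions(trajectories, delta=0.05):
    """Estimate P(s'|s, a) by frequency counting and attach a Hoeffding half-width.

    trajectories: iterable of (s, a, s_next) triples over abstract states/actions.
    With n samples of a pair (s, a), Hoeffding's inequality gives
    |p_hat - p| <= sqrt(ln(2 / delta) / (2 * n)) with probability at least 1 - delta.
    """
    counts = defaultdict(Counter)
    for s, a, s_next in trajectories:
        counts[(s, a)][s_next] += 1
    model = {}
    for (s, a), succ in counts.items():
        n = sum(succ.values())
        eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))    # Hoeffding error bound
        model[(s, a)] = {s2: (c / n, eps) for s2, c in succ.items()}
    return model

traj = [(0, 0, 1), (0, 0, 1), (0, 0, 2), (1, 0, 2)]
print(estimate_transitions(traj))
```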
4 Implementation and Evaluation
4.1 Case Study
We conduct experiments in three representative ADS scenarios, encompassing diverse driving environments with varying control specifications.
Lane Keeping Assist (LKA) is an advanced driving assistance module[24] crucial for automated driving. LKA evaluates the lateral offset and relative yaw angle to adjust the front wheel steering angle. Its objective is to minimize the lateral deviation and yaw angle, aligning them close to zero, and ensuring the vehicle stays within the lane.
Adaptive Cruise Control (ACC) is an intelligent module[24] that adjusts the car's speed based on the distance from the preceding vehicle. It manages the vehicle's acceleration to keep the relative distance greater than a prescribed safe distance. ACC targets the user-set cruise speed while adapting to the preceding vehicle's movement. The safe distance dynamically adjusts according to the relative velocity.
Intersection Crossroad Assistance (ICA) enhances safety in complex intersections[24], integrating LKA and ACC features. ICA determines the optimal speed and direction, exhibiting randomness for adaptive control and versatility for flexible adaptation. The scenario comprises intersecting roads, one intelligent vehicle, and multiple environmental vehicles; the goal is to traverse the intersection successfully with a left, straight, or right turn while avoiding deviations or collisions.
4.2 Research Questions
To assess the effectiveness of the abstract model based on spatio-temporal value semantics, we investigate the following research questions:
Research Question 1 (RQ1): How does the performance of the abstract model, grounded in spatio-temporal value semantics, fare in terms of both simplicity and accuracy?
Research Question 2 (RQ2): Does the abstract MDP model exhibit decision-making performance that approximates that of the true model? Moreover, is there a semantic equivalence between the abstract and true models?
Research Question 3 (RQ3): Can abstract models effectively guide the learning process in DRL, specifically addressing issues of low data utilization and poor generalization, consequently leading to accelerated training?
4.3 Experiment Setup
4.3.1 Metrics for Comparison
Euclidean Metric is used to measure the straight-line distance between two states in Euclidean space. For states $s_1$ and $s_2$ in $\mathbb{R}^n$, the Euclidean metric is defined as:
$d_E(s_1, s_2) = \sqrt{\sum_{i=1}^{n} (s_{1,i} - s_{2,i})^2} \qquad (8)$
Multi-Step Metric: For any $s_1, s_2 \in S$, the multi-step metric is defined as:
$d_{MS}(s_1, s_2) = c_A \,\mathbb{1}\!\left[A(s_1) \neq A(s_2)\right] + c_R \max_{a} \lvert R(s_1,a) - R(s_2,a)\rvert + c_T \max_{a} D\!\left(P(\cdot \mid s_1,a),\, P(\cdot \mid s_2,a)\right) \qquad (9)$
where $c_A$, $c_R$ and $c_T$ are positive constants and $A(s)$ is the set of available actions for each state $s$. The indicator function $\mathbb{1}[A(s_1) \neq A(s_2)]$ equals 0 if $A(s_1) = A(s_2)$, and 1 otherwise. $c_A$ is a sufficiently large constant such that $d_{MS}(s_1, s_2) = 0$ implies that $A(s_1)$ equals $A(s_2)$[15].
4.3.2 DRL Setup
During the data collection phase, we employ curiosity-driven TD3 with Random Network Distillation (RND)[9] to derive control strategies for LKA and ACC. Additionally, curiosity-driven DQN is utilized to generate control strategies for ICA, exploring the case environment and gathering system trajectories.
In each scenario, we simulate the curiosity-driven RL controller 1000 times to accumulate experience. This experience is then partitioned into a modeling set and a validation set at a fixed ratio. The former is utilized for constructing the abstract MDP, while the latter is employed to scrutinize the semantic gaps between the abstract MDP and the concrete MDP. The hyperparameter configurations for the deep learning networks in the three cases are detailed in Table 1.
| Case Study | Algorithm | Activation function | Learning rate (Critic / Actor) | Discount factor | ε | Soft τ |
|---|---|---|---|---|---|---|
| LKA | TD3 | ReLU | 1.00E-03 / 1.00E-03 | 0.95 | / | 1.00E-02 |
| ACC | TD3 | ReLU | 1.00E-04 / 1.00E-04 | 0.95 | / | 1.00E-02 |
| ICA | DQN | ReLU | 2.00E-03 | 0.98 | 0.01 | / |
The rewards for LKA, ACC, and ICA are specified in terms of the following quantities:
- $r_t$ is the reward at time step $t$.
- $d_{lat}$ represents the lateral offset of the vehicle's current position from the lane center. A smaller $d_{lat}$ leads to a higher reward.
- $\theta$ represents the angle between the current direction of the vehicle and the lane center direction. A smaller $\theta$ leads to a higher reward.
- $v_t$ represents the velocity of the vehicle at time step $t$.
- $d_{target}$ represents the distance from the target at time $t$.
4.3.3 Abstraction Setup
The hyper-parameters for the abstraction process are shown in Table 2. We define the average semantic error $\epsilon$ to be 0.005 and the maximum semantic error $\epsilon_{\max}$ to be 0.01. These values represent 0.25% of the overall value range. Simultaneously, we set the REDUCTION_LEVEL to 0.5%. The number of clusters $K$ is determined through the elbow method, the average silhouette method, and the gap statistic method[20]. We conduct a comparative analysis of abstractions using the traditional Euclidean distance method, the Multi-Step distance abstraction based on the state-of-the-art method proposed by Guo et al.[15], and the abstraction employing the spatio-temporal value metric.
| Case study | Semantics | $\epsilon$ | $\epsilon_{\max}$ | REDUCTION_LEVEL |
|---|---|---|---|---|
| LKA | | 0.001 | 0.005 | 1% |
| | | 0.001 | 0.005 | 0.1% |
| ACC | | 0.010 | 0.050 | 0.5% |
| ICA | | 0.010 | 0.050 | 0.5% |
| | | 0.001 | 0.005 | 1% |
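As a sketch of how the number of abstract states might be selected with the silhouette criterion, the snippet below uses scikit-learn; the candidate range and the random stand-in data are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

states = np.random.rand(1000, 3)                 # placeholder intervalized semantic states
scores = {}
for k in range(2, 15):                           # candidate numbers of abstract states
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(states)
    scores[k] = silhouette_score(states, labels)
best_k = max(scores, key=scores.get)             # K with the highest silhouette score
print(best_k)
```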
4.3.4 Two Indices
To address RQ1, we evaluated the abstraction models with a focus on simplicity and accuracy using two indices: Compression Ratio (CR) and Mean Absolute Error (MAE). The formulas for CR and MAE are as follows:
$CR = \dfrac{N_{abs}}{N_{con}} \times 100\% \qquad (10)$
$MAE = \dfrac{1}{N} \sum_{i=1}^{N} \lvert \hat{y}_i - y_i \rvert \qquad (11)$
where $N_{abs}$ represents the number of abstract states, $N_{con}$ represents the number of original concrete states, $\hat{y}_i$ is the prediction output of the abstract model, $y_i$ is the output of the true model, and MAE measures the average deviation from the reference value. $N$ denotes the number of abstract states generated in a single experiment. CR assesses the simplicity of the abstraction model, indicating the quality of the abstraction effect, while MAE reveals the accuracy of the abstraction model in preserving the original semantic information.
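Both indices are straightforward to compute; a minimal sketch (the example inputs are illustrative, with the CR call reusing the ACC/Silhouette figures from Tab. 3):

```python
import numpy as np

def compression_ratio(n_abstract, n_concrete):
    """CR in percent (Eq. 10): share of abstract states relative to concrete states."""
    return n_abstract / n_concrete * 100.0

def mean_absolute_error(pred, true):
    """MAE (Eq. 11): average absolute deviation of the abstract model's outputs."""
    pred, true = np.asarray(pred, dtype=float), np.asarray(true, dtype=float)
    return float(np.mean(np.abs(pred - true)))

print(compression_ratio(1200, 11858))        # about 10.12, as in the ACC/Silhouette row
print(mean_absolute_error([46.9, 60.1], [48.5, 59.94]))
```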
4.3.5 Abstract MDP-guided Training for DRL
To investigate the potential guiding impact of an abstract MDP on DRL, we systematically modify the environmental variables within the three ICPS scenarios. We aim to foster a more secure DRL learning process under the guidance of the abstract MDP. We introduce the influence of the abstract MDP on action selection within the DRL framework, with the overarching objective of expediting the convergence of DRL and establishing a joint strategy that is both safe and generalized.
The influence of the abstract MDP on action selection is incorporated into the DRL by leveraging the abstract MDP's action output as a variable influencing the output of the neural network (NN). This integration is expressed mathematically as:
$a = \alpha \cdot a_{NN} + \beta \cdot a_{MDP} \qquad (12)$
where $a$ represents the joint action output, $a_{NN}$ denotes the action output from the NN, and $a_{MDP}$ signifies the action output from the abstract MDP. The coefficients $\alpha$ and $\beta$ are set to 0.5 based on existing empirical values, reflecting the joint contributions of the NN and the abstract MDP in the synthesized action.
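The combination in Eq. 12 is a simple convex blend; a sketch for continuous actions is given below (discrete actions would instead need, e.g., a weighted vote, which is an assumption beyond what the paper states):

```python
import numpy as np

def joint_action(a_nn, a_mdp, alpha=0.5, beta=0.5):
    """Blend the NN's action with the abstract MDP's suggested action (Eq. 12)."""
    return alpha * np.asarray(a_nn, dtype=float) + beta * np.asarray(a_mdp, dtype=float)

# Example: steering and acceleration proposed by TD3 vs. by the abstract MDP policy.
print(joint_action([0.10, 1.8], [0.05, 1.2]))      # -> [0.075, 1.5]
```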
4.4 Experimental Results and Analysis
RQ1: The Simplicity and Accuracy of Abstract MDP. Our investigation into the performance of abstract models, guided by spatio-temporal value semantics, is revealed through a combination of visual and quantitative data analyses. Fig. 3 offers a striking visual narrative of this examination: it shows how we transform detailed raw data into a coherent, abstract grid. Within this grid, distinct hues represent individual abstract states, effectively simplifying the complex state space without compromising its comprehensive nature. This approach exemplifies how abstract modeling can achieve a harmonious blend of simplicity in design with precision in data representation.
The spatio-temporal value metric's efficacy in capturing the essence of the state space without overcomplication is quantitatively reinforced by Tab. 3. Here, the CR and MAE serve as the principal indices for evaluation. The CR index, reflecting the proportionate reduction in state-space size, alongside the MAE index, indicating the average error magnitude, collectively substantiates the spatio-temporal value metric's superior performance across various ICPS. Notably, in the ACC scenario, the spatio-temporal approach achieves CRs of 10.12% and 13.02%, underscoring a substantial simplification of the model. Concurrently, the approach's consistently lower MAE values across all scenarios, compared to Euclidean and Multi-Step approaches, highlight its enhanced accuracy.
Response to RQ1: The spatio-temporal value metric simplifies the state space without sacrificing semantic accuracy, ensuring the abstract model's effectiveness in algorithmic analysis and decision-making. The model's simplicity is achieved without a concomitant increase in error, demonstrating an optimal balance between the two desired attributes of computational efficiency and semantic precision. These findings articulate the spatio-temporal value metric's pivotal role in generating abstract models that are not only operationally feasible but also semantically representative of the real-world systems they aim to emulate.
RQ2: Semantic Equivalence in Abstract MDP. To address RQ2, we leveraged the PRISM model checker for encoding the abstraction model, ensuring the retention of critical state information such as rewards, transition probabilities, and metrics pertaining to lane adherence and collision incidents. We crafted specific properties that resonate with rewards and safety considerations, thereby facilitating a comprehensive evaluation of the extent to which the abstract model parallels the true MDP in semantic terms. This approach enabled a thorough exploration of the decision-making implications inherent in the abstract model, shedding light on its strengths and constraints.
In Tab. 4, we present the encoded abstraction models for diverse scenarios using PRISM, emphasizing properties that encapsulate semantic information. Through PRISM, the semantic equivalence between the abstract and the true models was quantitatively assessed. For instance, in scenarios such as a four-way intersection, properties such as the minimum expected cumulative reward within 60 time steps, the maximum probability of lane departure within 60 time steps, and the maximum probability of collision involvement within 60 time steps were analyzed to measure decision-making effects.
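For reference, a sketch of how an abstract MDP might be serialized to PRISM's modelling language for such checks is shown below; the dictionary layout, file name, and reward structure are assumptions, and properties such as Rmin=?[C<=60] are then evaluated with the PRISM tool:

```python
def to_prism(transitions, rewards, n_states, path="abstract_mdp.nm"):
    """Write an abstract MDP as a PRISM module (the layout here is an assumption).

    transitions: {(s, a): {s_next: prob}}; rewards: {(s, a): r} over abstract indices.
    """
    lines = ["mdp", "", "module abstract_model", f"  s : [0..{n_states - 1}] init 0;"]
    for (s, a), succ in sorted(transitions.items()):
        total = sum(succ.values())               # renormalize so every command sums to 1
        updates = " + ".join(f"{p / total}:(s'={s2})" for s2, p in sorted(succ.items()))
        lines.append(f"  [a{a}] s={s} -> {updates};")
    lines += ["endmodule", "", "rewards"]
    lines += [f"  [a{a}] s={s} : {r};" for (s, a), r in sorted(rewards.items())]
    lines += ["endrewards", ""]
    with open(path, "w") as f:
        f.write("\n".join(lines))

to_prism({(0, 0): {1: 0.8, 2: 0.2}, (1, 0): {2: 1.0}, (2, 0): {2: 1.0}},
         {(0, 0): 1.5, (1, 0): 0.5, (2, 0): 0.0}, n_states=3)
```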
Our analysis in the sphere of MDP modeling, particularly through semantic gap analysis via PRISM, highlights the pronounced superiority of the spatio-temporal value approach over the Euclidean and Multi-Step approaches. This conclusion is drawn from a systematic juxtaposition across varied scenarios like LKA, ACC, and ICA. The spatio-temporal value approach consistently exhibited lower discrepancies in predicting minimum expected cumulative rewards and maximum probabilities of specific events, thereby indicating a higher fidelity in mimicking real-world dynamics. This aspect is especially pronounced in intricate scenarios like ACC and ICA, where the approach's accuracy in capturing semantic nuances suggests its enhanced capability in semantic property representation. The proficiency of the spatio-temporal value approach in bridging semantic gaps accentuates its robustness and reliability as a tool in stochastic decision-making frameworks, where fidelity to real-world conditions is paramount.
Response to RQ2: The findings corroborate that abstraction based on spatio-temporal values not only ensures semantic alignment with the true model but also offers substantial insights for refining training strategies. This aligns with the overarching goal of achieving a harmonious balance between model abstraction and real-world decision-making accuracy.
| Case Study | Total States | Cluster Method | Metric | Abstract States | CR | MAE |
|---|---|---|---|---|---|---|
| ACC | 11858 | | | 2000 | 16.87% | 17.08 |
| | | Elbow | Euclidean | 2931 | 24.38% | 191.696 |
| | | | Multi-Step | 1930 | 16.28% | 85.008 |
| | | | Spatio-temporal | 1930 | 16.28% | 43.43 |
| | | Silhouette | Euclidean | 2170 | 18.30% | 169.578 |
| | | | Multi-Step | 1200 | 10.12% | 107.043 |
| | | | Spatio-temporal | 1200 | 10.12% | 46.796 |
| | | Gap | Euclidean | 2210 | 18.64% | 168.003 |
| | | | Multi-Step | 1260 | 10.63% | 109.632 |
| | | | Spatio-temporal | 1260 | 10.63% | 48.189 |
| | | Canopy | Euclidean | 1765 | 14.88% | 199.788 |
| | | | Multi-Step | 1544 | 13.02% | 84.018 |
| | | | Spatio-temporal | 1544 | 13.02% | 46.419 |
| LKA | 14124 | | | 2400 | 16.99% | 6.892 |
| | | Elbow | Euclidean | 3804 | 26.93% | 51.344 |
| | | | Multi-Step | 1870 | 13.24% | 28.151 |
| | | | Spatio-temporal | 1870 | 13.24% | 11.698 |
| | | Silhouette | Euclidean | 2760 | 19.54% | 28.229 |
| | | | Multi-Step | 1530 | 10.83% | 37.016 |
| | | | Spatio-temporal | 1530 | 10.83% | 16.347 |
| | | Gap | Euclidean | 2880 | 20.39% | 19.134 |
| | | | Multi-Step | 2060 | 14.59% | 37.2 |
| | | | Spatio-temporal | 2060 | 14.59% | 16.533 |
| | | Canopy | Euclidean | 2995 | 21.21% | 51.327 |
| | | | Multi-Step | 2572 | 18.21% | 19.469 |
| | | | Spatio-temporal | 2572 | 18.21% | 16.346 |
| ICA | 20110 | | | 3200 | 15.91% | 21.883 |
| | | Elbow | Euclidean | 4622 | 22.98% | 109.249 |
| | | | Multi-Step | 2760 | 13.72% | 103.912 |
| | | | Spatio-temporal | 2760 | 13.72% | 77.111 |
| | | Silhouette | Euclidean | 3570 | 17.75% | 112.909 |
| | | | Multi-Step | 2100 | 10.44% | 106.304 |
| | | | Spatio-temporal | 2100 | 10.44% | 76.271 |
| | | Gap | Euclidean | 4230 | 21.03% | 113.167 |
| | | | Multi-Step | 2300 | 11.44% | 107.561 |
| | | | Spatio-temporal | 2300 | 11.44% | 76.991 |
| | | Canopy | Euclidean | 4534 | 22.55% | 113.183 |
| | | | Multi-Step | 3492 | 17.36% | 108.197 |
| | | | Spatio-temporal | 3492 | 17.36% | 78.431 |
RQ3: Abstract MDP-Guided Learning Process. Fig. 4 features three line graphs, each corresponding to a distinct RL scenario: (a) ACC, (b) LKA, and (c) ICA. Each graph portrays the performance of two models across numerous episodes: the conventional model (denoted as TD3 or DQN) and the abstract MDP-guided model (referred to as MDP-guided TD3 or MDP-guided DQN). The vertical axis represents the obtained reward, while the horizontal axis indicates the episode count.
An examination of Fig.4 underscores that, across all considered scenarios, the abstract MDP-Guided models (depicted in orange) consistently surpass the performance of the traditional models (depicted in blue). Of particular significance is the discernibly accelerated escalation in reward exhibited by the abstract MDP-Guided models, indicative of more efficient learning dynamics. This acceleration is particularly pronounced in the initial episodes, where the abstract MDP-Guided models attain higher rewards at a faster pace in comparison to their traditional counterparts. Moreover, our proposed approach demonstrates superior guidance, especially in intricate scenarios.
The discerned trends across all three scenarios affirm the pivotal role of abstract models in significantly enhancing the efficiency of the RL process. By adeptly simplifying the inherent complexity of the environment and channeling learning efforts towards critical aspects, abstract models expedite the policy learning process. This, in turn, results in a more expeditious and effective training regimen, thereby effectively addressing the posed research question (RQ3).
| Case Study | Metric | Properties | Verification Result | Real Value | Error |
|---|---|---|---|---|---|
| LKA | Euclidean | Rmin=?[C<=51] | 44.43 | 48.50 | 4.07 |
| | | Pmax=?[F<=51;isOutOfLane=1] | 0.13% | 0.00% | 0.13 |
| | Multi-Step | Rmin=?[C<=51] | 44.72 | 48.50 | 3.78 |
| | | Pmax=?[F<=51;isOutOfLane=1] | 0.13% | 0.00% | 0.13 |
| | Spatio-temporal Value | Rmin=?[C<=51] | 46.92 | 48.50 | 1.58 |
| | | Pmax=?[F<=51;isOutOfLane=1] | 0.10% | 0.00% | 0.10 |
| ACC | Euclidean | Rmin=?[C<=51] | 57.53 | 59.94 | 2.41 |
| | | Pmax=?[F<=51;isOutOfLane=1] | 0.7% | 0.00% | 0.7 |
| | | Pmax=?[F<=51;isCrashed=1] | 0.06% | 0.00% | 0.06 |
| | Multi-Step | Rmin=?[C<=51] | 55.98 | 59.94 | 3.96 |
| | | Pmax=?[F<=51;isOutOfLane=1] | 1.00% | 0.00% | 1.00 |
| | | Pmax=?[F<=51;isCrashed=1] | 0.06% | 0.00% | 0.06 |
| | Spatio-temporal Value | Rmin=?[C<=51] | 60.33 | 59.94 | -0.39 |
| | | Pmax=?[F<=51;isOutOfLane=1] | 0.01% | 0.00% | 0.01 |
| | | Pmax=?[F<=51;isCrashed=1] | 0.19% | 0.00% | 0.19 |
| ICA | Euclidean | Rmin=?[C<=60] | 8.38 | 9.36 | 0.98 |
| | | Pmax=?[F<=60;isCrashed=1] | 18.73% | 20.80% | 2.07 |
| | | Pmax=?[F<=60;reachDest=1] | 0.17% | 4.60% | 4.43 |
| | Multi-Step | Rmin=?[C<=60] | 8.99 | 9.36 | 0.37 |
| | | Pmax=?[F<=60;isCrashed=1] | 17.33% | 20.80% | 3.47 |
| | | Pmax=?[F<=60;reachDest=1] | 3.25% | 4.60% | 1.35 |
| | Spatio-temporal Value | Rmin=?[C<=60] | 9.38 | 9.36 | -0.02 |
| | | Pmax=?[F<=60;isCrashed=1] | 19.35% | 20.80% | 1.45 |
| | | Pmax=?[F<=60;reachDest=1] | 4.50% | 4.60% | 0.10 |
Response to RQ3: Fig.4 strongly supports the pivotal role of abstract models in guiding the DRL process. Offering a structured approach to the state space and incorporating domain knowledge through abstraction, these models enhance data utilization and generalization. This leads to accelerated convergence towards higher rewards, signifying a more rapid training process. Furthermore, the improved performance in initial episodes suggests that abstract models effectively address challenges related to data scarcity, leveraging abstracted information to guide the learning algorithm toward profitable strategies early in the training process.
5 Related Work
Action and State Abstraction in MDPs. The innovative concept of MDP action abstraction proves to be a strategic solution for alleviating computational burdens and enhancing problem-solving efficiency by compressing the action space while maintaining solution quality. Chen and Xu[12] pioneered a method grounded in the discrete Fourier transform for action abstraction in MDPs with deterministic uncertainty. Extending this paradigm to MDPs with continuous action spaces, Omidshafiei[31] applied the Fourier transform. Additionally, Banihashemi et al.[6] contribute a comprehensive framework that leverages the nondeterministic situation calculus and the ConGolog programming language to abstract agent actions in nondeterministic domains, facilitating strategic reasoning and synthesis.
Simultaneously, function approximation is employed to abstract states, which involves the intricate process of mapping the concrete space to a lower-dimensional counterpart optimized through RL objectives. Studies integrating NNs and Hierarchical Reinforcement Learning (HRL), such as feudal HRL[4, 30] and option-critic with NNs[16, 15], actively address the nuanced realms of state and temporal abstraction. Abel[2] introduces transitive and PAC state abstractions, triumphing in sample complexity reduction despite potential performance drawbacks. Misra[27] introduces HOMER, a pioneering sample-efficient exploration and DRL algorithm tailored for rich observation environments, guaranteeing provable efficiency and computational effectiveness in specific scenarios. However, these methods, while effective in specific scenarios, may lack semantic preservation in the process of abstraction.
Abstract MDPs for DRL. The realm of MDP abstraction, meticulously preserving transition and reward structures[23], emerges as a linchpin for augmenting RL efficiency and generalization. Existing options, such as option-bisimulation[11, 10], grapple with computational intricacies. Abel et al.[3] pioneer state-abstraction-option classes with insightful suboptimality bounds. Van der Pol et al.[34] introduce MDP homomorphic networks, harnessing symmetries for enhanced convergence. Guo[15] proposes a Multi-Step metric for state-temporal abstraction. Junges[19] takes advantage of the inherent hierarchy of the Markov decision process and divides the state space into a macro level and sub-levels; the unresolved sub-levels are treated as uncertainties for constrained and progressive analysis, mitigating the state-space explosion problem. Feng[14] unfolds the potential of edited MDPs, efficiently learning safety-critical states from naturalistic driving data, thus showcasing accelerated testing and training of safety-critical autonomous systems. The existing work on abstract MDPs demonstrates notable contributions but faces challenges in terms of verification, safety, and generalization across diverse environments. Strengthening these aspects is essential for advancing the reliability and applicability of abstraction techniques in real-world scenarios.
6 Conclusion and Future Work
Our approach to abstract modeling using spatio-temporal value semantics represents a substantial leap in ensuring the dependability of machine learning systems, particularly in decision-making processes for ICPS employing DRL. This method is characterized by its universal application across a variety of ICPS scenarios, demonstrating its robustness and versatility. However, it's crucial to recognize certain limitations, such as the learning process efficiency for the abstract model, which we aim to enhance in our future work.
Expanding upon this, the next phase of our research will focus on refining the learning algorithms to accelerate the training phase without compromising the modelβs integrity. Additionally, moving beyond experimental validations, we plan to incorporate formal theorem-based evaluations to establish the equivalence between our abstract models and their respective true MDPs. This shift towards a more rigorous theoretical framework, such as bisimulation, will allow for a deeper understanding and validation of the modelsβ accuracy and reliability.
Moreover, our future endeavors aim to broaden the application of our abstraction technique to more complex ICPS domains. This extension includes optimizing the efficiency of state exploration in DRL training, a crucial aspect for effective navigation in intricate and dynamic environments. Additionally, we intend to explore reachability and robustness aspects within these systems, ensuring that our models not only make predictions but also respond adaptively to real-world scenarios. Through these comprehensive research efforts, our overarching objective is to make significant contributions to the field of abstract modeling, enhancing its effectiveness and reliability in safety-critical and dynamic environments. Addressing these challenges, ensuring model safety, and establishing a cohesive framework remain pivotal for the successful and robust application of abstraction techniques in practical scenarios.
References
- [1]Abel, D.: A theory of abstraction in reinforcement learning. arXiv preprint arXiv:2203.00397 (2022)
- [2]Abel, D., Arumugam, D., Lehnert, L., Littman, M.: State abstractions for lifelong reinforcement learning. In: International Conference on Machine Learning. pp. 10β19. PMLR (2018)
- [3]Abel, D., Umbanhowar, N., Khetarpal, K., Arumugam, D., Precup, D., Littman, M.: Value preserving state-action abstractions. In: International Conference on Artificial Intelligence and Statistics. pp. 1639β1650. PMLR (2020)
- [4]Andreas, J., Klein, D., Levine, S.: Modular multitask reinforcement learning with policy sketches. In: International conference on machine learning. pp. 166β175. PMLR (2017)
- [5]Bacon, P.L., Harb, J., Precup, D.: The option-critic architecture. In: Proceedings of the AAAI conference on artificial intelligence. vol.31 (2017)
- [6]Banihashemi, B., De Giacomo, G., Lesperance, Y.: Abstraction of nondeterministic situation calculus action theories. In: Elkind, E. (ed.) Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23. pp. 3112–3122. International Joint Conferences on Artificial Intelligence Organization (8 2023). https://doi.org/10.24963/ijcai.2023/347, main track
- [7]Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguistics 5, 135β146 (2017). https://doi.org/10.1162/tacl_a_00051, https://doi.org/10.1162/tacl_a_00051
- [8]Brunke, L., Greeff, M., Hall, A.W., Yuan, Z., Zhou, S., Panerati, J., Schoellig, A.P.: Safe learning in robotics: From learning-based control to safe reinforcement learning. Annual Review of Control, Robotics, and Autonomous Systems 5, 411β444 (2022)
- [9]Burda, Y., Edwards, H., Storkey, A.J., Klimov, O.: Exploration by random network distillation. In: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net (2019), https://openreview.net/forum?id=H1lJJnR5Ym
- [10]Castro, P., Precup, D.: Using bisimulation for policy transfer in mdps. In: Proceedings of the AAAI conference on artificial intelligence. vol.24, pp. 1065β1070 (2010)
- [11]Castro, P.S.: On planning, prediction and knowledge transfer in Fully and Partially Observable Markov Decision Processes. McGill University (Canada) (2011)
- [12]Chen, Y., Xu, J.: Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. The Journal of Machine Learning Research 17(1), 882β938 (2016)
- [13]Dockhorn, A., Kruse, R.: State and action abstraction for search and reinforcement learning algorithms. In: Artificial Intelligence in Control and Decision-making Systems: Dedicated to Professor Janusz Kacprzyk, pp. 181β198. Springer (2023)
- [14]Feng, S., Sun, H., Yan, X., Zhu, H., Zou, Z., Shen, S., Liu, H.X.: Dense reinforcement learning for safety validation of autonomous vehicles. Nature 615(7953), 620β627 (2023)
- [15]Guo, S., Yan, Q., Su, X., Hu, X., Chen, F.: State-temporal compression in reinforcement learning with the reward-restricted geodesic metric. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(9), 5572β5589 (2021)
- [16]Jiang, N., Kulesza, A., Singh, S.: Abstraction selection in model-based reinforcement learning. In: International Conference on Machine Learning. pp. 179β188. PMLR (2015)
- [17]Jiang, Y., Gu, S.S., Murphy, K.P., Finn, C.: Language as an abstraction for hierarchical deep reinforcement learning. Advances in Neural Information Processing Systems 32 (2019)
- [18]Jin, P., Tian, J., Zhi, D., Wen, X., Zhang, M.: Trainify: A cegar-driven training and verification framework for safe deep reinforcement learning. In: Shoham, S., Vizel, Y. (eds.) Computer Aided Verification - 34th International Conference, CAV 2022, Haifa, Israel, August 7-10, 2022, Proceedings, Part I. Lecture Notes in Computer Science, vol. 13371, pp. 193β218. Springer (2022). https://doi.org/10.1007/978-3-031-13185-1_10, https://doi.org/10.1007/978-3-031-13185-1_10
- [19]Junges, S., Spaan, M.T.: Abstraction-refinement for hierarchical probabilistic models. In: International Conference on Computer Aided Verification. pp. 102β123. Springer (2022)
- [20]Kassambara, A.: Practical guide to cluster analysis in R: Unsupervised machine learning, vol.1. Sthda (2017)
- [21]Kulkarni, T.D., Narasimhan, K., Saeedi, A., Tenenbaum, J.: Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. Advances in neural information processing systems 29 (2016)
- [22]Kwiatkowska, M., Norman, G., Parker, D.: Probabilistic symbolic model checking with prism: A hybrid approach. International journal on software tools for technology transfer 6, 128β142 (2004)
- [23]Lavaei, A., Soudjani, S., Frazzoli, E., Zamani, M.: Constructing mdp abstractions using data with formal guarantees. IEEE Control Systems Letters 7, 460β465 (2022)
- [24]Leurent, E., et al.: An environment for autonomous driving decision-making (2018)
- [25]Li, L., Walsh, T.J., Littman, M.L.: Towards a unified theory of state abstraction for mdps. In: AI&M (2006)
- [26]Li, Y., Gao, T., Yang, J., Xu, H., Wu, Y.: Phasic self-imitative reduction for sparse-reward goal-conditioned reinforcement learning. In: International Conference on Machine Learning. pp. 12765β12781. PMLR (2022)
- [27]Misra, D., Henaff, M., Krishnamurthy, A., Langford, J.: Kinematic state abstraction and provably efficient rich-observation reinforcement learning. In: International conference on machine learning. pp. 6961β6971. PMLR (2020)
- [28]Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M.A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D.: Human-level control through deep reinforcement learning. Nat. 518(7540), 529β533 (2015). https://doi.org/10.1038/nature14236, https://doi.org/10.1038/nature14236
- [29]Nashed, S.B., Svegliato, J., Bhatia, A., Russell, S., Zilberstein, S.: Selecting the partial state abstractions of mdps: A metareasoning approach with deep reinforcement learning. In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 11665β11670. IEEE (2022)
- [30]Oh, J., Singh, S., Lee, H., Kohli, P.: Zero-shot task generalization with multi-task deep reinforcement learning. In: International Conference on Machine Learning. pp. 2661β2670. PMLR (2017)
- [31]Omidshafiei, S., Agha-Mohammadi, A.A., Amato, C., How, J.P.: Decentralized control of partially observable markov decision processes using belief space macro-actions. In: 2015 IEEE international conference on robotics and automation (ICRA). pp. 5962β5969. IEEE (2015)
- [32]Parker, D.A.: Implementation of symbolic model checking for probabilistic systems. Ph.D. thesis, University of Birmingham (2003)
- [33]Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Moschitti, A., Pang, B., Daelemans, W. (eds.) Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL. pp. 1532β1543. ACL (2014). https://doi.org/10.3115/v1/d14-1162, https://doi.org/10.3115/v1/d14-1162
- [34]Van der Pol, E., Worrall, D., van Hoof, H., Oliehoek, F., Welling, M.: Mdp homomorphic networks: Group symmetries in reinforcement learning. Advances in Neural Information Processing Systems 33, 4199–4210 (2020)
- [35]Radanliev, P., Roure, D.D., Kleek, M.V., Santos, O., Ani, U.: Artificial intelligence in cyber physical systems. AI Soc. 36(3), 783β796 (2021). https://doi.org/10.1007/s00146-020-01049-0, https://doi.org/10.1007/s00146-020-01049-0
- [36]Schmidt, L.M., Brosig, J., Plinge, A., Eskofier, B.M., Mutschler, C.: An introduction to multi-agent reinforcement learning and review of its application to autonomous mobility. In: 25th IEEE International Conference on Intelligent Transportation Systems, ITSC 2022, Macau, China, October 8-12, 2022. pp. 1342β1349. IEEE (2022). https://doi.org/10.1109/ITSC55140.2022.9922205, https://doi.org/10.1109/ITSC55140.2022.9922205
- [37]Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T.P., Leach, M., Kavukcuoglu, K., Graepel, T., Hassabis, D.: Mastering the game of go with deep neural networks and tree search. Nat. 529(7587), 484–489 (2016). https://doi.org/10.1038/nature16961, https://doi.org/10.1038/nature16961
- [38]Song, J., Xie, X., Ma, L.: Siege: A semantics-guided safety enhancement framework for ai-enabled cyber-physical systems. IEEE Transactions on Software Engineering (2023)
- [39]Spirtes, P., Glymour, C., Scheines, R.: Reply to humphreys and freedmanβs review of causation, prediction, and search. The British journal for the philosophy of science 48(4), 555β568 (1997)
- [40]Spirtes, P.L., Meek, C., Richardson, T.S.: Causal inference in the presence of latent variables and selection bias. arXiv preprint arXiv:1302.4983 (2013)
- [41]Yang, W., Xu, C., Pan, M., Ma, X., Lu, J.: Improving verification accuracy of CPS by modeling and calibrating interaction uncertainty. ACM Trans. Internet Techn. 18(2), 20:1β20:37 (2018). https://doi.org/10.1145/3093894, https://doi.org/10.1145/3093894
- [42]Zhang, M., Du, D., Zhang, M., Zhang, L., Wang, Y., Zhou, W.: A meta-modeling approach for autonomous driving scenario based on STTD. Int. J. Softw. Informatics 11(3), 315β333 (2021). https://doi.org/10.21655/IJSI.1673-7288.00262, https://doi.org/10.21655/ijsi.1673-7288.00262