We study behavior change-based visual risk object identification (Visual-ROI), a framework for detecting potential hazards in intelligent driving systems. Existing methods often show significant limitations in spatial accuracy and temporal consistency, stemming from an incomplete understanding of scene affordance; for example, they frequently flag vehicles that have no influence on the ego vehicle as risk objects. Furthermore, existing behavior change-based methods are inefficient because they perform causal inference in the perspective image space.
We propose a new framework with a Bird's Eye View (BEV) representation to overcome the above challenges. Specifically, we use potential fields as scene affordance: repulsive forces derived from road infrastructure and traffic participants, and attractive forces derived from target destinations. In this work, we compute the potential fields by assigning different energy levels according to the semantic labels obtained from BEV semantic segmentation. We conduct thorough experiments and ablation studies, comparing the proposed method with state-of-the-art algorithms on both synthetic and real-world datasets.
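To make the potential field construction above concrete, the following is a minimal sketch of how repulsive and attractive potentials could be assembled over a BEV semantic grid. It is an illustration under assumptions, not the paper's implementation: the class IDs, energy levels, influence radius, quadratic attraction, and function names are all hypothetical.

```python
# Minimal sketch (not the paper's implementation): build a scalar potential map
# from a BEV semantic grid. Class IDs and energy levels below are hypothetical.
import numpy as np
from scipy.ndimage import distance_transform_edt

# Hypothetical BEV semantic classes.
FREE, ROAD_EDGE, VEHICLE, PEDESTRIAN = 0, 1, 2, 3

# Hypothetical energy level per semantic label (higher = stronger repulsion).
REPULSIVE_ENERGY = {ROAD_EDGE: 1.0, VEHICLE: 2.0, PEDESTRIAN: 3.0}

def repulsive_potential(bev_seg: np.ndarray, influence_radius: float = 10.0) -> np.ndarray:
    """Sum per-class repulsive fields that decay with distance to the nearest labeled cell."""
    field = np.zeros(bev_seg.shape, dtype=np.float32)
    for label, energy in REPULSIVE_ENERGY.items():
        mask = bev_seg == label
        if not mask.any():
            continue
        # Distance (in cells) from every cell to the nearest cell carrying this label.
        dist = distance_transform_edt(~mask)
        field += energy * np.clip(1.0 - dist / influence_radius, 0.0, 1.0).astype(np.float32)
    return field

def attractive_potential(shape, goal_rc, gain: float = 0.01) -> np.ndarray:
    """Quadratic attraction toward a target destination cell (row, col)."""
    rows, cols = np.indices(shape)
    dist_sq = (rows - goal_rc[0]) ** 2 + (cols - goal_rc[1]) ** 2
    return (gain * dist_sq).astype(np.float32)

# Usage: a 200x200 BEV grid with one vehicle and a goal ahead of the ego vehicle.
bev = np.zeros((200, 200), dtype=np.int64)
bev[90:110, 120:140] = VEHICLE
potential = repulsive_potential(bev) + attractive_potential(bev.shape, goal_rc=(20, 100))
print(potential.shape, float(potential.min()), float(potential.max()))
```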
Dataset | Driver Behavior: Go | Driver Behavior: Stop | Driver Behavior: Positive% | Traffic Participants: Non-Risk | Traffic Participants: Risk | Traffic Participants: Positive% |
---|---|---|---|---|---|---|
RiskBench [9] | 75,633 | 44,877 | 37.2% | 158,369 | 17,434 | 9.9% |
nuScenes [11] | 29,446 | 4,554 | 13.4% | 15,280 | 1,040 | 6.3% |
We evaluate nine baselines within our framework. These baselines take a sequence of images as input and output a risk score for each road user (e.g., vehicle or pedestrian). A road user is considered a risk object if its risk score exceeds a predefined threshold. Note that all of the following baselines are Visual-ROI methods.
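As a concrete illustration of this decision rule, here is a minimal sketch. The data layout (a mapping from track IDs to scores) and the 0.5 threshold are illustrative assumptions rather than any baseline's actual interface.

```python
# Sketch of the decision rule described above: a road user is flagged as a risk
# object when its predicted risk score exceeds a threshold. The dict layout and
# the default threshold are illustrative assumptions.
from typing import Dict, List

def identify_risk_objects(risk_scores: Dict[int, float], threshold: float = 0.5) -> List[int]:
    """Return the track IDs whose risk score exceeds the threshold."""
    return [track_id for track_id, score in risk_scores.items() if score > threshold]

# Usage with hypothetical per-frame scores for three tracked road users.
frame_scores = {7: 0.12, 11: 0.83, 23: 0.47}
print(identify_risk_objects(frame_scores))  # -> [11]
```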
We evaluate the performance of Visual-ROI models with two types of metrics: Spatial Accuracy and Temporal Consistency. Spatial Accuracy metrics include the Optimal F1 Score (OT-F1), reported alongside its precision (OT-P) and recall (OT-R), and the Optimal F1 Score in T Seconds (OT-F1-T), which measure the model's ability to identify risks accurately. Temporal Consistency metrics consist of the Progressive Increasing Cost (PIC) and the Weighted Multi-Object Tracking Accuracy (wMOTA), which assess how consistently and accurately the model identifies risks over time.
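As a reference for the spatial accuracy metrics, the sketch below computes an optimal-threshold F1 score by sweeping candidate thresholds over per-object risk scores and ground-truth labels. This follows the usual reading of an optimal threshold; the uniform threshold grid and the score/label layout are assumptions, and the benchmark's official OT-F1 implementation may differ in details.

```python
# Hedged sketch of an optimal-threshold F1 (OT-F1) computation: sweep candidate
# thresholds and keep the best F1. The threshold grid is an assumption.
import numpy as np

def optimal_f1(scores: np.ndarray, labels: np.ndarray, num_thresholds: int = 100):
    """Return (best_f1, best_threshold) over a uniform threshold grid in [0, 1]."""
    best_f1, best_t = 0.0, 0.0
    for t in np.linspace(0.0, 1.0, num_thresholds):
        pred = scores > t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
        if f1 > best_f1:
            best_f1, best_t = f1, float(t)
    return best_f1, best_t

# Usage with hypothetical risk scores and ground-truth risk labels.
scores = np.array([0.1, 0.9, 0.4, 0.8, 0.2])
labels = np.array([0, 1, 0, 1, 0])
print(optimal_f1(scores, labels))
```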
Method | OT-P (%)↑ | OT-R (%)↑ | OT-F1 (%)↑ | PIC (%)↓ | wMOTA (%)↑ | Avg. Inference Time (sec)↓ |
---|---|---|---|---|---|---|
FF [5] | 22.2 | 27.9 | 24.7 | 39.3 | 55.0 | 0.027 |
DSA [6] | 54.7 | 19.7 | 29.0 | 29.8 | 53.3 | 0.269 |
RRL [7] | 49.4 | 15.4 | 23.5 | 28.9 | 52.3 | 0.280 |
BP [8] | 24.2 | 35.1 | 28.7 | 39.0 | 57.5 | 0.119 |
BCP [3] | 38.6 | 43.7 | 41.0 | 29.3 | 63.2 | 0.431 |
TP+BCP | 47.4 | 51.7 | 49.5 | 28.0 | 67.2 | 0.437 |
BS+BCP | 56.8 | 60.7 | 58.7 | 24.0 | 72.5 | 0.049 |
OFDE | 50.8 | 56.7 | 53.6 | 26.7 | 65.4 | 0.062 |
OADE | 52.7 | 57.9 | 55.2 | 25.7 | 66.9 | 0.061 |
PF+BCP | 60.2 | 62.4 | 61.3 | 23.0 | 74.8 | 0.049 |
Method | BEV-SEG | \( F_r \) | \( F_a \) | OT-F1 (%)↑ | PIC ↓ | wMOTA (%)↑ |
---|---|---|---|---|---|---|
1 |  |  |  | 41.0 | 29.3 | 63.2 |
2 | ✓ |  |  | 49.5 | 28.0 | 67.2 |
3 | ✓ | ✓ |  | 58.7 | 24.0 | 72.5 |
4 | ✓ |  | ✓ | 59.0 | 24.4 | 72.9 |
5 | ✓ | ✓ | ✓ | 61.3 | 23.0 | 74.8 |
Method | 1s (%)↑ | 2s (%)↑ | 3s (%)↑ | Overall (%)↑ |
---|---|---|---|---|
FF [5] | 28.7 | 24.4 | 21.5 | 24.7 |
DSA [6] | 36.8 | 31.6 | 29.7 | 29.0 |
RRL [7] | 35.0 | 32.2 | 31.9 | 23.5 |
BP [8] | 33.8 | 32.8 | 30.8 | 28.7 |
BCP [3] | 49.3 | 47.2 | 44.2 | 41.0 |
TP+BCP | 52.8 | 49.8 | 46.9 | 49.5 |
BS+BCP | 60.7 | 58.8 | 56.5 | 58.7 |
OFDE | 56.4 | 53.0 | 50.3 | 53.6 |
OADE | 57.9 | 55.0 | 52.7 | 55.2 |
PF+BCP | 62.5 | 61.0 | 59.3 | 61.3 |
We use the nuScenes dataset [11] as the testbed to evaluate the effectiveness of our method in real-world environments. We label risk objects manually following the protocol described in BCP [3]. In the testing phase, we use YOLOv8 [12] to generate object bounding boxes.
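For reference, the following is a minimal sketch of generating bounding boxes with the Ultralytics YOLOv8 API [12]. The checkpoint name, confidence threshold, and image path are illustrative; the paper's exact detection settings are not specified here.

```python
# Minimal sketch of generating object bounding boxes with Ultralytics YOLOv8 [12].
# Checkpoint, confidence threshold, and image path below are illustrative choices.
from ultralytics import YOLO

model = YOLO("yolov8x.pt")  # any YOLOv8 detection checkpoint can be substituted

results = model("nuscenes_frame.jpg", conf=0.25, verbose=False)
for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()    # bounding box in pixel coordinates
    cls_name = model.names[int(box.cls[0])]  # e.g., "car", "person"
    score = float(box.conf[0])               # detection confidence
    print(f"{cls_name}: ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}) conf={score:.2f}")
```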
Method | OT-P (%)↑ | OT-R (%)↑ | OT-F1 (%)↑ | PIC (%)↓ | wMOTA (%)↑ |
---|---|---|---|---|---|
BP [8] | 21.1 | 38.0 | 27.1 | 19.0 | 58.3 |
BCP [3] | 50.8 | 50.6 | 50.7 | 15.2 | 69.0 |
BS+BCP | 39.2 | 56.2 | 46.2 | 10.7 | 65.3 |
PF+BCP | 45.5 | 73.2 | 56.1 | 8.9 | 76.2 |
```bibtex
@article{pao2024PFBCP,
  title         = {{Potential Field as Scene Affordance for Behavior Change-Based Visual Risk Object Identification}},
  author        = {Pang-Yuan Pao and Shu-Wei Lu and Ze-Yan Lu and Yi-Ting Chen},
  year          = {2024},
  eprint        = {2409.15846},
  archivePrefix = {arXiv}
}
```
If you have any questions, please contact Pang-Yuan Pao.
[1] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, "CvT: Introducing Convolutions to Vision Transformers," in ICCV, 2021.
[2] O. Khatib, "Real-Time Obstacle Avoidance for Manipulators and Mobile Robots," in ICRA, 1985.
[3] C. Li, S. H. Chan, and Y.-T. Chen, "Who Make Drivers Stop? Towards Driver-centric Risk Assessment: Risk Object Identification via Causal Inference," in IROS, 2020.
[4] P. Gupta, A. Biswas, H. Admoni, and D. Held, "Object Importance Estimation using Counterfactual Reasoning for Intelligent Driving," IEEE Robotics and Automation Letters, vol. 9, no. 4, pp. 3648-3655, 2024.
[5] P. Hu, A. Huang, J. Dolan, D. Held, and D. Ramanan, "Safe Local Motion Planning with Self-Supervised Freespace Forecasting," in CVPR, 2021.
[6] F.-H. Chan, Y.-T. Chen, Y. Xiang, and M. Sun, "Anticipating Accidents in Dashcam Videos," in ACCV, 2016.
[7] K.-H. Zeng, S.-H. Chou, F.-H. Chan, J. C. Niebles, and M. Sun, "Agent-centric Risk Assessment: Accident Anticipation and Risky Region Localization," in CVPR, 2017.
[8] C. Li, Y. Meng, S. H. Chan, and Y.-T. Chen, "Learning 3D-Aware Egocentric Spatial-Temporal Interaction via Graph Convolutional Networks," in ICRA, 2020.
[9] C.-H. Kung, C.-C. Yang, P.-Y. Pao, S.-W. Lu, P.-L. Chen, H.-C. Lu, and Y.-T. Chen, "RiskBench: A Scenario-based Benchmark for Risk Identification," in ICRA, 2024.
[10] A. Milan, L. Leal-Taixe, I. Reid, S. Roth, and K. Schindler, "MOT16: A Benchmark for Multi-Object Tracking," arXiv, vol. abs/1603.00831, 2016.
[11] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuScenes: A Multimodal Dataset for Autonomous Driving," in CVPR, 2020.
[12] G. Jocher, A. Chaurasia, and J. Qiu, "Ultralytics YOLO," 2023. Available: https://github.com/ultralytics/ultralytics
[13] N. Hanselmann, K. Renz, K. Chitta, A. Bhattacharyya, and A. Geiger, "KING: Generating Safety-Critical Driving Scenarios for Robust Imitation via Kinematics Gradients," in ECCV, 2022.