In this paper, we study task-oriented human grasp synthesis, a new task that requires synthesizing human grasps with awareness of both the task and the surrounding context.
At the core of our method are task-aware contact maps. Unlike traditional contact maps, which reason only about the object itself and its relation to the hand, our enhanced maps take scene and task information into account.
This comprehensive map is critical for hand-object interaction and leads to accurate grasping poses that align with the task. We propose a two-stage pipeline composed of two diffusion models: in the first stage, ContactDiffuser constructs a task-aware contact map informed by the scene and the task; in the second stage, GraspDiffuser uses this contact map to predict task-oriented grasping poses.
To validate our approach, we introduce a new dataset for task-oriented grasp synthesis. Our experiments demonstrate the superior performance of our approach, which surpasses existing methods in both grasp quality and task performance.
To this end, we propose a two-stage diffusion-based framework, driven by the proposed task-aware contact map, to synthesize task-oriented human grasps.
A task-aware contact map incorporates crucial information about the context and the task. It builds on existing object-centric contact maps by integrating contextual and task information through a distance map, which represents the proximity between an object and its environment.
The distance map is computed as the shortest distance from each point on the target object to its surroundings. Integrating this surrounding information into the object representation enhances scene and task understanding.
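To make this concrete, here is a minimal sketch of the distance-map computation, assuming point clouds sampled from the object and the scene; the function and variable names are illustrative, not the paper's actual API.

```python
import numpy as np
from scipy.spatial import cKDTree

def distance_map(object_points: np.ndarray, scene_points: np.ndarray) -> np.ndarray:
    """Per-point shortest distance from the object surface to its surroundings.

    object_points: (N, 3) points sampled on the target object.
    scene_points:  (M, 3) points sampled from the surrounding scene.
    """
    tree = cKDTree(scene_points)          # spatial index over the scene
    dists, _ = tree.query(object_points)  # nearest scene point per object point
    return dists                          # (N,) distance map

# A task-aware contact map can then pair each object point's contact value
# with its scene proximity, e.g. by channel-wise stacking (illustrative):
# task_aware_map = np.stack([contact_map, distance_map(obj_pts, scene_pts)], axis=-1)
```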
The two-stage diffusion-based framework consists of:

- ContactDiffuser, a diffusion model that generates the task-aware contact map conditioned on the scene and the task;
- GraspDiffuser, a diffusion model that predicts the task-oriented grasping pose conditioned on the generated contact map.

A schematic sketch of this two-stage sampling procedure is given below.
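In this sketch, `denoise_step`, the conditioning signature, and the 61-D MANO parameterization (48 pose + 10 shape + 3 translation) are assumptions rather than the paper's exact interfaces; it only illustrates how the two reverse-diffusion loops chain together.

```python
import numpy as np

def synthesize_grasp(contact_diffuser, grasp_diffuser, object_points,
                     dist_map, task_embedding, num_steps=100):
    """Schematic two-stage sampling: contact map first, then grasp pose."""
    # Stage 1: ContactDiffuser denoises a task-aware contact map
    # conditioned on the object geometry, distance map, and task.
    contact_map = np.random.randn(len(object_points))
    for t in reversed(range(num_steps)):
        contact_map = contact_diffuser.denoise_step(
            contact_map, t, object_points, dist_map, task_embedding)

    # Stage 2: GraspDiffuser denoises a grasp pose (e.g. 61-D MANO
    # parameters: 48 pose + 10 shape + 3 translation, an assumed
    # parameterization) conditioned on the generated contact map.
    grasp_pose = np.random.randn(61)
    for t in reversed(range(num_steps)):
        grasp_pose = grasp_diffuser.denoise_step(
            grasp_pose, t, object_points, contact_map)
    return grasp_pose
```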
We evaluate the synthesized human grasps based on their physical plausibility, stability, and diversity, following prior works [1,2,3,4]. We further propose a new metric, the Task Score (TS), to evaluate the quality of task-oriented human grasp synthesis.
Penetration Volume (PV): We voxelize the hand and object meshes into 1 mm cubes and compute the volume of the overlapping voxels.
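A hedged sketch of this voxel-overlap computation using trimesh (the paper's exact implementation may differ):

```python
import trimesh

def penetration_volume(hand_mesh: trimesh.Trimesh,
                       object_mesh: trimesh.Trimesh,
                       pitch: float = 0.001) -> float:
    """Overlap volume between two meshes voxelized at `pitch` (1 mm)."""
    hand_vox = hand_mesh.voxelized(pitch).fill()   # filled 1 mm voxel grid
    obj_vox = object_mesh.voxelized(pitch).fill()
    # Hand voxel centers that fall inside the object's filled grid overlap.
    inside = obj_vox.is_filled(hand_vox.points)
    return float(inside.sum()) * pitch ** 3        # m^3; multiply by 1e6 for cm^3
```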
Simulation Displacement (SD): We simulate the object under the predicted grasp in PyBullet [14] for one second and then compute the displacement of the object's center of mass.
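A minimal PyBullet sketch of this metric, assuming the object and a fixed hand are available as URDFs (the asset names and fixed-hand setup are hypothetical):

```python
import numpy as np
import pybullet as p

def simulation_displacement(object_urdf: str, hand_urdf: str) -> float:
    """Center-of-mass displacement of the object after 1 s of simulation."""
    p.connect(p.DIRECT)                       # headless physics server
    p.setGravity(0, 0, -9.8)
    obj = p.loadURDF(object_urdf)
    p.loadURDF(hand_urdf, useFixedBase=True)  # static hand holding the object
    start, _ = p.getBasePositionAndOrientation(obj)
    for _ in range(240):                      # 1 s at the default 240 Hz step
        p.stepSimulation()
    end, _ = p.getBasePositionAndOrientation(obj)
    p.disconnect()
    return float(np.linalg.norm(np.array(end) - np.array(start)))
```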
Contact Ratio (CR): The percentage of predicted grasps that make contact with the object.
Qualified Ratio (QR): This metric jointly considers penetration volume and simulation displacement. Note that a higher penetration volume generally leads to a lower simulation displacement, so neither criterion alone is satisfactory. We set thresholds of 3 cm³ (3 × 10⁻⁶ m³) for penetration volume and 2 cm for simulation displacement, and report the percentage of predicted grasps that satisfy both criteria.
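A one-function sketch of QR with the thresholds above (PV in cm³, SD in cm):

```python
import numpy as np

def qualified_ratio(pv: np.ndarray, sd: np.ndarray,
                    pv_thresh: float = 3.0, sd_thresh: float = 2.0) -> float:
    """Percentage of grasps with PV < 3 cm^3 and SD < 2 cm."""
    return 100.0 * np.mean((pv < pv_thresh) & (sd < sd_thresh))
```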
Diversity Score (DS): Following Flex [4], we compute the average pairwise L2 distance among the predicted grasps.
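A sketch of this score over flattened grasp parameter vectors (the exact parameterization follows the Flex [4] protocol and is assumed here):

```python
import numpy as np
from scipy.spatial.distance import pdist

def diversity_score(grasp_params: np.ndarray) -> float:
    """`grasp_params` is (num_grasps, param_dim); returns mean pairwise L2."""
    return float(pdist(grasp_params, metric="euclidean").mean())
```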
Obstacle Penetration Percentage (OPP): The percentage of human grasp vertices that penetrate obstacles, computed in both the initial and goal scenes.
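A sketch using trimesh containment tests; `hand_vertices` is a (V, 3) array and the obstacle meshes are assumed watertight (names are illustrative):

```python
import numpy as np
import trimesh

def obstacle_penetration_percentage(hand_vertices: np.ndarray,
                                    obstacle_meshes: list) -> float:
    """Percentage of hand vertices lying inside any obstacle mesh."""
    inside = np.zeros(len(hand_vertices), dtype=bool)
    for mesh in obstacle_meshes:
        inside |= mesh.contains(hand_vertices)  # vertices inside this obstacle
    return 100.0 * inside.mean()
```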
Task Score (TS): A proper metric for task-oriented human grasp synthesis should account for physical plausibility, stability, and collision avoidance in both the initial and goal scenes. We therefore define TS = QR × (1 − Init OPP) × (1 − Goal OPP).
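Since TS is a pure composition of the earlier quantities, the sketch is a one-liner; the example value reproduces the "Ours" Placing row of the table below.

```python
def task_score(qr: float, init_opp: float, goal_opp: float) -> float:
    """All inputs as fractions in [0, 1].

    >>> round(task_score(0.6529, 0.0727, 0.0579), 3)  # "Ours", Placing
    0.57
    """
    return qr * (1.0 - init_opp) * (1.0 - goal_opp)
```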
Quantitative comparison on the Placing, Stacking, and Shelving tasks:

| Task | Method | PV (cm³)↓ | SD (cm)↓ | CR (%)↑ | QR (%)↑ | DS↑ | Init OPP (%)↓ | Goal OPP (%)↓ | TS↑ |
|---|---|---|---|---|---|---|---|---|---|
| Placing | GraspTTA [1] | 1.85 | 2.60 | 98.86 | 58.57 | 75.23 | 21.67 | 17.91 | 0.376 |
| Placing | ContactGen [2] | 1.40 | 3.85 | 92.64 | 46.84 | 90.14 | 5.56 | 17.26 | 0.366 |
| Placing | SceneDiffuser [3] | 1.37 | 3.19 | 97.52 | 53.12 | 68.43 | 20.00 | 16.91 | 0.353 |
| Placing | Flex [4] | 2.50 | 1.62 | 99.94 | 59.10 | 33.59 | 6.74 | 5.61 | 0.520 |
| Placing | Ours | 2.40 | 1.42 | 99.31 | 65.29 | 43.12 | 7.27 | 5.79 | 0.570 |
| Stacking | GraspTTA [1] | 4.30 | 0.28 | 100.00 | 35.00 | 0.30 | 26.04 | 8.32 | 0.237 |
| Stacking | ContactGen [2] | 0.66 | 1.87 | 96.77 | 76.43 | 64.54 | 8.42 | 9.75 | 0.631 |
| Stacking | SceneDiffuser [3] | 0.53 | 1.64 | 94.89 | 77.72 | 63.06 | 25.30 | 9.81 | 0.523 |
| Stacking | Flex [4] | 0.00 | 10.65 | 0.00 | 0.00 | 107.87 | 0.00 | 0.00 | 0.000 |
| Stacking | Ours | 1.09 | 1.03 | 94.97 | 84.31 | 48.59 | 14.94 | 4.51 | 0.684 |
| Shelving | GraspTTA [1] | 1.78 | 2.56 | 99.13 | 58.94 | 75.04 | 15.46 | 13.43 | 0.431 |
| Shelving | ContactGen [2] | 1.43 | 3.90 | 93.13 | 46.32 | 89.79 | 6.42 | 13.17 | 0.376 |
| Shelving | SceneDiffuser [3] | 1.38 | 3.31 | 96.43 | 51.48 | 67.90 | 14.52 | 13.59 | 0.380 |
| Shelving | Flex [4] | 2.81 | 1.54 | 99.90 | 57.26 | 28.72 | 4.39 | 4.47 | 0.522 |
| Shelving | Ours | 2.12 | 1.62 | 99.47 | 67.49 | 52.81 | 8.72 | 10.31 | 0.552 |
@article{liu2024tohgs,
  author  = {An-Lun Liu and Yu-Wei Chao and Yi-Ting Chen},
  title   = {Task-Oriented Human Grasp Synthesis via Context- and Task-Aware Diffusers},
  journal = {},
  year    = {2024},
}
[1] Jiang et al. Hand-Object Contact Consistency Reasoning for Human Grasps Generation. ICCV 2021
[2] Liu et al. ContactGen: Generative Contact Modeling for Grasp Generation. ICCV 2023
[3] Huang et al. Diffusion-based Generation, Optimization, and Planning in 3D Scenes. CVPR 2023
[4] Tendulkar et al. FLEX: Full-Body Grasping Without Full-Body Grasps. CVPR 2023
[14] Coumans and Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. 2016