Task-Oriented Human Grasp Synthesis via Context- and Task-Aware Diffusers

¹National Yang-Ming Chiao-Tung University, ²NVIDIA Research

In this work, we propose task-oriented human grasp synthesis, a new problem that aims to generate human grasps that take environmental context and task objectives into account.

Abstract

In this paper, we study task-oriented human grasp synthesis, a new problem of synthesizing human grasps that are aware of both the task and its context.

At the core of our method are task-aware contact maps. Unlike traditional contact maps, which only reason about the object itself and its relation to the hand, our enhanced maps take scene and task information into account.

This comprehensive map is critical in hand-object interaction, leading to accurate grasping poses that align with the task. We propose a two-stage pipeline composed of two diffusion models. In the first stage, ContactDiffuser constructs a task-aware contact map informed by the scene and task. In the second stage, GraspDiffuser uses this contact map to predict task-oriented grasping poses.

To validate our approach, we introduce a new dataset for task-oriented grasp synthesis. Our experiments show that our approach surpasses existing methods on both grasp quality and task performance.

Task-oriented Human Grasps Dataset

We design three daily tasks (Placing, Stacking, and Shelving) for evaluating task-oriented human grasp synthesis. We employ PyBullet as our physics simulator to generate diverse task configurations. Each task configuration consists of the initial and goal positions of the target object, the positions of the obstacles, and the human grasp.

Two-Stage Diffusion-Based Framework

We propose a new two-stage diffusion-based framework, driven by the proposed task-aware contact map, to address the synthesis of task-oriented human grasps.

A task-aware contact map incorporates information about the context and the task. It builds upon existing object-centric contact maps by integrating contextual and task information through a distance map, which represents the proximity between an object and its environment.

The distance map is computed as the shortest distance between the target object and its surroundings. It serves two key purposes:

  1. Collision avoidance: It prioritizes grasping points farther from obstacles.
  2. Environmental context: It provides insights into object-environment interactions and object states or goals.

This approach integrates surrounding information into the object representation, enhancing scene and task understanding.
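
The distance map described above can be illustrated with a small sketch. It assumes that the object and its surroundings are both given as point clouds and that the map stores, for every object point, the Euclidean distance to the nearest scene point. The function name `distance_map` and the brute-force nearest-neighbor search are illustrative choices, not details taken from the paper.

```python
import numpy as np

def distance_map(object_points: np.ndarray, scene_points: np.ndarray) -> np.ndarray:
    """Per-point shortest Euclidean distance from the object to its surroundings.

    object_points: (N, 3) points sampled on the target object.
    scene_points:  (M, 3) points sampled on the surrounding scene (assumed input).
    Returns an (N,) array; larger values mean more free space near that point.
    """
    # Pairwise differences via broadcasting: (N, 1, 3) - (1, M, 3) -> (N, M, 3).
    diff = object_points[:, None, :] - scene_points[None, :, :]
    pairwise = np.linalg.norm(diff, axis=-1)   # (N, M) distances
    return pairwise.min(axis=1)                # nearest scene point per object point

# Toy example: three object points, with the nearest scene point at the origin.
obj = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
scene = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]])
d = distance_map(obj, scene)
```

Under this reading, object points with larger values are farther from obstacles. For large point clouds a spatial index would replace the brute-force search.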

The two-stage diffusion-based framework consists of:

  1. ContactDiffuser: Predicts a task-aware contact map for an object, given the point clouds of the initial and goal scenes along with the initial and goal distance maps.
  2. GraspDiffuser: Synthesizes human grasps from the predicted task-aware contact map and the object's point cloud.
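
The data flow between the two stages can be sketched as follows. This is only an illustration under strong assumptions: it uses a generic DDPM-style ancestral sampler with a linear noise schedule as a stand-in, an untrained placeholder in place of each learned noise predictor, and arbitrary array sizes (`N_POINTS`, `HAND_DIM`). None of these names, sizes, or sampler details come from the paper.

```python
import numpy as np

def ddpm_sample(eps_model, cond, shape, steps=50, seed=0):
    """Generic DDPM ancestral sampling, conditioned on `cond` (stand-in sampler)."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)      # linear noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)              # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = eps_model(x, t, cond)             # predicted noise at step t
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    return x

def untrained_eps(x, t, cond):
    # Hypothetical placeholder for a trained, condition-aware noise predictor.
    return np.zeros_like(x)

rng = np.random.default_rng(42)
N_POINTS, HAND_DIM = 64, 32                     # arbitrary illustrative sizes
object_points = rng.random((N_POINTS, 3))

# Stage 1: scenes and distance maps -> task-aware contact map on the object.
stage1_cond = {
    "init_scene": rng.random((256, 3)),
    "goal_scene": rng.random((256, 3)),
    "init_distance_map": rng.random(N_POINTS),
    "goal_distance_map": rng.random(N_POINTS),
}
contact_map = ddpm_sample(untrained_eps, stage1_cond, (N_POINTS,))

# Stage 2: predicted contact map and object point cloud -> grasp parameters.
stage2_cond = {"contact_map": contact_map, "object_points": object_points}
grasp_params = ddpm_sample(untrained_eps, stage2_cond, (HAND_DIM,), seed=1)
```

The sketch shows only the interface: the output of the first sampler becomes part of the conditioning of the second, mirroring how GraspDiffuser consumes the map predicted by ContactDiffuser.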

ContactDiffuser animation

GraspDiffuser animation

Experiments

We evaluate the synthesized human grasps based on their physical plausibility, stability, and diversity, following prior works [1,2,3,4]. In addition, we propose a new metric, Task Score (TS), to evaluate the quality of task-oriented human grasp synthesis.

Metrics

Penetration Volume (PV): We calculate the penetration volume by voxelizing the meshes into 1 mm cubes and measuring the overlap of these voxels.
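
A minimal sketch of the voxel-overlap idea is given below. To stay self-contained it replaces the voxelized meshes with two axis-aligned boxes sampled on a 1 mm grid, and it reports the result in cm^3. The helper names `box_occupancy` and `penetration_volume_cm3` are hypothetical, and real meshes would need a mesh voxelizer instead of the box test.

```python
import numpy as np

VOXEL_MM = 1.0  # edge length of each cube, as in the PV definition

def box_occupancy(lo, hi, grid_shape):
    """Boolean occupancy grid of an axis-aligned box, sampled at voxel centers (mm)."""
    axes = [(np.arange(n) + 0.5) * VOXEL_MM for n in grid_shape]
    cx, cy, cz = np.meshgrid(*axes, indexing="ij")
    return ((cx >= lo[0]) & (cx < hi[0]) &
            (cy >= lo[1]) & (cy < hi[1]) &
            (cz >= lo[2]) & (cz < hi[2]))

def penetration_volume_cm3(occ_a, occ_b):
    """Volume shared by two occupancy grids defined on the same voxel grid."""
    shared_voxels = np.logical_and(occ_a, occ_b).sum()
    return shared_voxels * VOXEL_MM ** 3 / 1000.0   # mm^3 -> cm^3

grid = (20, 20, 20)
hand_box = box_occupancy((0, 0, 0), (10, 10, 10), grid)      # stand-in for the hand
object_box = box_occupancy((5, 0, 0), (15, 10, 10), grid)    # stand-in for the object
pv = penetration_volume_cm3(hand_box, object_box)
```

Here the two boxes share a 5 mm × 10 mm × 10 mm region, that is 500 voxels, so `pv` is 0.5 cm^3.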

Simulation Displacement (SD): We simulate the object and the predicted grasp in PyBullet [14] for 1 second and then compute the displacement of the object's center of mass.

Contact Ratio (CR): The percentage of predicted grasps that are in contact with the object.

Qualified Ratio (QR): This metric jointly considers penetration volume and simulation displacement. Note that a higher penetration volume generally leads to a lower simulation displacement, so a low displacement alone does not indicate a good grasp. We set thresholds of 3 × 10^-6 cm^3 for penetration volume and 2 cm for simulation displacement, and report the percentage of predicted grasps that satisfy both criteria.
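
The joint criterion can be sketched in a few lines. The thresholds are passed as arguments so that the caller supplies them in the same units as the measurements. The sample values below are made up for illustration, and the strict inequality is an assumption.

```python
import numpy as np

def qualified_ratio(pv, sd, pv_max, sd_max):
    """Percentage of grasps whose PV and SD are both below their thresholds."""
    pv = np.asarray(pv, dtype=float)
    sd = np.asarray(sd, dtype=float)
    qualified = (pv < pv_max) & (sd < sd_max)   # both criteria must hold
    return 100.0 * qualified.mean()

# Illustrative measurements for four grasps (not data from the paper).
pv_values = [1.0, 4.0, 2.0, 0.5]
sd_values = [1.0, 1.0, 3.0, 0.5]
qr = qualified_ratio(pv_values, sd_values, pv_max=3.0, sd_max=2.0)
```

The second grasp fails on penetration and the third on displacement, so two of four grasps qualify and `qr` is 50.0.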

Diversity Score (DS): Following Flex [4], we compute the average pairwise L2 distance between predicted grasps to evaluate their diversity.
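
As a rough illustration, the average pairwise L2 distance can be computed as below. The sketch assumes each grasp has been flattened into a fixed-length vector. Which grasp representation is used for this metric is not specified here, so the toy 2-D vectors are purely illustrative.

```python
import numpy as np

def diversity_score(grasps):
    """Mean L2 distance over all unordered pairs of grasp vectors."""
    g = np.asarray(grasps, dtype=float)              # (K, D), one row per grasp
    diff = g[:, None, :] - g[None, :, :]             # (K, K, D) pairwise differences
    dist = np.linalg.norm(diff, axis=-1)             # (K, K) distance matrix
    upper = np.triu_indices(len(g), k=1)             # each pair counted once
    return dist[upper].mean()

toy_grasps = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
ds = diversity_score(toy_grasps)
```

For the three toy vectors the pairwise distances are 5, 10, and 5, so `ds` is 20/3.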

Obstacle Penetration Percentage (OPP): We compute the percentage of human grasp vertices that penetrate obstacles in the initial and goal scenes.
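
A simplified sketch of this metric follows. It assumes obstacles are approximated by axis-aligned boxes so that the inside test stays trivial. Obstacles given as meshes would need a point-in-mesh query instead. The function name and the toy vertices are hypothetical.

```python
import numpy as np

def obstacle_penetration_percentage(hand_vertices, boxes):
    """Percentage of hand vertices lying inside any axis-aligned obstacle box."""
    v = np.asarray(hand_vertices, dtype=float)       # (V, 3) hand mesh vertices
    inside_any = np.zeros(len(v), dtype=bool)
    for lo, hi in boxes:
        lo, hi = np.asarray(lo, dtype=float), np.asarray(hi, dtype=float)
        inside_any |= np.all((v >= lo) & (v <= hi), axis=1)
    return 100.0 * inside_any.mean()

# Toy hand with four vertices, one of which is inside the single obstacle.
vertices = [[0.5, 0.5, 0.5], [2.0, 2.0, 2.0], [-1.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
obstacles = [((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))]
opp = obstacle_penetration_percentage(vertices, obstacles)
```

The same function would be evaluated once with the obstacles of the initial scene and once with those of the goal scene to obtain Init OPP and Goal OPP.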

Task Score (TS): A suitable metric for task-oriented human grasp synthesis should account for the physical plausibility and stability of the grasp as well as collision avoidance in both the initial and goal scenes. We therefore propose a new metric, defined as TS = QR × (1 - Init OPP) × (1 - Goal OPP), with QR and OPP expressed as fractions rather than percentages.
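
The formula can be checked with a short worked example. The sketch assumes the percentages reported in Table 1 are converted to fractions before multiplying. The function name `task_score` is illustrative.

```python
def task_score(qr_pct, init_opp_pct, goal_opp_pct):
    """TS = QR x (1 - Init OPP) x (1 - Goal OPP), with inputs given in percent."""
    qr = qr_pct / 100.0
    init_opp = init_opp_pct / 100.0
    goal_opp = goal_opp_pct / 100.0
    return qr * (1.0 - init_opp) * (1.0 - goal_opp)

# Values from the "Ours" row of the Placing task in Table 1.
ts_placing = task_score(65.29, 7.27, 5.79)
```

For this row the product is approximately 0.570, which agrees with the reported Task Score up to rounding. A grasp set that is fully qualified and never penetrates an obstacle would score 1, and any obstacle penetration in either scene scales the score down.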

Table 1

Task      Method             PV↓    SD↓     CR(%)↑    QR(%)↑   DS↑      Init OPP(%)↓   Goal OPP(%)↓   TS↑
Placing   GraspTTA [1]       1.85   2.60    98.86     58.57    75.23    21.67          17.91          0.376
Placing   ContactGen [2]     1.40   3.85    92.64     46.84    90.14    5.56           17.26          0.366
Placing   SceneDiffuser [3]  1.37   3.19    97.52     53.12    68.43    20.00          16.91          0.353
Placing   Flex [4]           2.50   1.62    99.94     59.10    33.59    6.74           5.61           0.520
Placing   Ours               2.40   1.42    99.31     65.29    43.12    7.27           5.79           0.570
Stacking  GraspTTA [1]       4.30   0.28    100.00    35.00    0.30     26.04          8.32           0.237
Stacking  ContactGen [2]     0.66   1.87    96.77     76.43    64.54    8.42           9.75           0.631
Stacking  SceneDiffuser [3]  0.53   1.64    94.89     77.72    63.06    25.30          9.81           0.523
Stacking  Flex [4]           0      10.65   0         0.00     107.87   0              0              0
Stacking  Ours               1.09   1.03    94.97     84.31    48.59    14.94          4.51           0.684
Shelving  GraspTTA [1]       1.78   2.56    99.13     58.94    75.04    15.46          13.43          0.431
Shelving  ContactGen [2]     1.43   3.90    93.13     46.32    89.79    6.42           13.17          0.376
Shelving  SceneDiffuser [3]  1.38   3.31    96.43     51.48    67.90    14.52          13.59          0.380
Shelving  Flex [4]           2.81   1.54    99.90     57.26    28.72    4.39           4.47           0.522
Shelving  Ours               2.12   1.62    99.47     67.49    52.81    8.72           10.31          0.552
Task-oriented Human Grasp Synthesis Evaluation. PV: Penetration Volume, SD: Simulation Displacement, CR: Contact Ratio, QR: Qualified Ratio, DS: Diversity Score, OPP: Obstacle Penetration Percentage, and TS: Task Score. The table shows that the proposed method synthesizes favorable task-oriented human grasps compared to strong baselines, achieving the highest TS on all three tasks. Flex [4] struggles to synthesize stable grasps in Stacking due to its inability to handle small bricks.

Qualitative Results

Task-oriented Human Grasp Prediction

Contact Map Prediction

BibTeX

@article{liu2024tohgs,
  author  = {An-Lun Liu and Yu-Wei Chao and Yi-Ting Chen},
  title   = {Task-Oriented Human Grasp Synthesis via Context- and Task-Aware Diffusers},
  journal = {},
  year    = {2024},
}

[1] Jiang et al. Hand-Object Contact Consistency Reasoning for Human Grasps Generation. ICCV 2021

[2] Liu et al. ContactGen: Generative Contact Modeling for Grasp Generation. ICCV 2023

[3] Huang et al. Diffusion-based Generation, Optimization, and Planning in 3D Scenes. CVPR 2023

[4] Tendulkar et al. FLEX: Full-Body Grasping Without Full-Body Grasps. CVPR 2023