DMFDepth: Monocular Depth in Dynamic Scenes with Temporal Information & Beyond

Anonymous Anonymous
Institution Name
Conference name and year

The video shows an example of depth maps and absolute-relative error maps comparing DMFDepth and ManyDepth on the improved KITTI ground-truth dataset.

Abstract

Self-supervised cost-volume-based monocular depth estimation has received considerable attention, since its spatio-temporal awareness yields high-quality depth without any labeling cost. However, most recent works focus only on alleviating the network's sensitivity to dynamic scenes and pay less attention to exploiting the temporal multi-view cue. This work proposes a Semantic-aware Inconsistency Mask, generated from the difference between multi-view and monocular depths on moving objects, to alleviate the static-view assumption. We also present the Multi-scale Feature Fusion module, which extracts and aggregates features from multiple scales to improve both the cost volume and the texture feature. Additionally, we design the Local-adaptive Cross-cue module to model the relation between the multi-view and mono-view cues. Finally, we propose a future-awareness distillation loss by adding a reference multi-view depth network that receives an additional frame from the future. We achieve state-of-the-art performance on the KITTI benchmark with more stable metrics.

Motivation


We observe that using an additional future frame improves depth performance in terms of absolute-relative error.


The intermediate result of the Inconsistency Mask on Cityscapes. The raw Inconsistency Mask is generated from the difference between the Consistency Depth Dc and the Inconsistency Depth Di. Our Semantic-aware Inconsistency Mask, which uses the foreground mask Mfg, selects dynamic objects better than Multi-Scale-DynDepth's filtering with the ground-plane mask Mg.


We model the local intra-relation (third row) and the global intra-relation (fourth row) at a point on the edge of an object (green dots). Observing that the attention maps between the feature and the cost volume (second row) are similar, we are motivated to align the cost-volume cue with the attention map from the feature, and vice versa. However, aligning with global attention would misguide the intermediate feature, since the feature and the cost volume have different characteristics.

Method


We introduce the Semantic-aware Inconsistency Mask, generated from the inconsistency between multi-view and monocular depths, filtering out noise with a foreground mask derived from a pseudo segmentation map.
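A minimal sketch of how such a mask could be computed; the function name, the relative-difference formulation, and the threshold value are assumptions for illustration, not the paper's exact recipe:

```python
import numpy as np

def semantic_aware_inconsistency_mask(depth_mv, depth_mono, fg_mask, thresh=0.2):
    """Hypothetical sketch of a semantic-aware inconsistency mask.

    depth_mv   : (H, W) multi-view (cost-volume) depth prediction
    depth_mono : (H, W) monocular depth prediction
    fg_mask    : (H, W) boolean foreground mask from a pseudo segmentation map
    thresh     : relative-difference threshold (assumed value)
    """
    # Relative inconsistency between the two depth cues; large values
    # indicate pixels where the static-scene assumption likely breaks.
    rel_diff = np.abs(depth_mv - depth_mono) / np.maximum(depth_mono, 1e-6)
    raw_mask = rel_diff > thresh
    # Keep only inconsistent pixels that also lie on foreground objects,
    # filtering out spurious inconsistency on the static background.
    return raw_mask & fg_mask
```

The foreground intersection is what distinguishes this from plain depth-difference filtering: background noise in the raw mask is suppressed before it can corrupt the photometric loss.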

We design the Multi-scale Feature Fusion module, which extracts and aggregates features from multiple scales to improve the cost volume and the texture feature.
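The aggregation step could look like the following sketch, where coarse-scale features are upsampled to the finest resolution and summed; nearest-neighbour upsampling and summation are simplifying assumptions (a learned fusion would use convolutions):

```python
import numpy as np

def multi_scale_fuse(feats):
    """Hypothetical sketch of multi-scale feature fusion.

    feats : list of (C, H_s, W_s) arrays ordered fine-to-coarse,
            each scale half the resolution of the previous one.
    """
    C, H, W = feats[0].shape
    fused = np.zeros((C, H, W))
    for f in feats:
        scale = H // f.shape[1]
        # Nearest-neighbour upsampling to the finest resolution.
        up = f.repeat(scale, axis=1).repeat(scale, axis=2)
        fused += up
    return fused
```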

We propose the Local-adaptive Cross-cue module to combine the multi-view and mono-view cues effectively.
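A toy sketch of the cross-cue idea: attention weights are computed from one cue (e.g. the texture feature) inside a small local window and used to aggregate the other cue (e.g. the cost volume), keeping the interaction local as the figure caption motivates. The window size, the dot-product similarity, and the function name are all assumptions:

```python
import numpy as np

def local_cross_cue(query_feat, value_feat, win=3):
    """Hypothetical local cross-cue aggregation.

    query_feat, value_feat : (C, H, W) arrays; attention is computed in
    the query cue and applied to the value cue within a win x win window.
    """
    C, H, W = query_feat.shape
    r = win // 2
    out = np.zeros_like(value_feat)
    # Edge padding so every pixel has a full local window.
    qp = np.pad(query_feat, ((0, 0), (r, r), (r, r)), mode="edge")
    vp = np.pad(value_feat, ((0, 0), (r, r), (r, r)), mode="edge")
    for i in range(H):
        for j in range(W):
            q = query_feat[:, i, j]                          # (C,)
            k = qp[:, i:i + win, j:j + win].reshape(C, -1)   # (C, win*win)
            v = vp[:, i:i + win, j:j + win].reshape(C, -1)
            logits = q @ k / np.sqrt(C)       # similarity in the query cue
            w = np.exp(logits - logits.max())  # stable softmax
            w /= w.sum()
            out[:, i, j] = v @ w               # aggregate the other cue
    return out
```

Restricting the softmax to a local window is what avoids the failure mode described above, where aligning with a global attention map misguides the intermediate feature.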

We introduce the Future-awareness Distillation Loss, which uses an oracle reference multi-view depth network to further regularize the target network and achieve better performance.
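In sketch form, the student depth is pulled toward the depth of an oracle multi-view network that additionally sees a future frame; the log-depth L1 distance below is an assumed choice of distillation distance, not necessarily the one used in the paper:

```python
import numpy as np

def future_aware_distillation_loss(student_depth, oracle_depth):
    """Hypothetical distillation loss between the target (student) network
    and an oracle multi-view network that receives an extra future frame.
    Uses an L1 distance in log-depth space (assumed formulation)."""
    return np.mean(np.abs(np.log(student_depth) - np.log(oracle_depth)))
```

At test time only the student runs, so the future frame is needed during training alone.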

Results


Depth performance on the KITTI Eigen split.


Depth performance on the improved KITTI ground truth.


Qualitative results on the KITTI dataset.

BibTeX

BibTex Code Here