There’s plenty of multimodality literature in Vision and under the multimodality tag, likely including some I neglected to add above.

Multimodal datasets are listed under Multimodal Datasets.

Surveys

Evaluation

Implementation (Code)