\[
\text{Attention}_l = \text{softmax}\!\left(\frac{Q_l K_l^\top}{\sqrt{d_k}}\right) V_l, \quad l \in \{\text{Act}, \text{Scene}, \text{Dialogue}\}
\]

To align modalities, the loss encourages matching pairs (text‑image, text‑audio) to have higher cosine similarity than mismatched pairs: