Conditioning sequence

How does the AR model know which span of text goes with which embedding?

At every step, the AR Transformer (which generates audio tokens autoregressively) can see:

The audio tokens generated so far (= context)

The conditioning sequence (= all text tokens + speaker embedding)
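The two inputs above can be sketched as a small function. This is only an illustration: `model_view`, the token ids, and the embedding values are made up for the example, and it says nothing about how a real model feeds the conditioning in (as a prefix, via cross-attention, or otherwise).

```python
def model_view(text_tokens, speaker_embedding, audio_tokens, step):
    """Return what the model can see when predicting audio token `step` (0-based)."""
    # Conditioning sequence: the full text plus the speaker embedding.
    # It is identical at every decoding step.
    conditioning = {"text": list(text_tokens), "speaker": list(speaker_embedding)}
    # Context: only the audio tokens produced before this step.
    context = list(audio_tokens[:step])
    return conditioning, context


text_tokens = ["你", "好"]
speaker_embedding = [0.1, -0.3, 0.7]  # made-up values
audio_tokens = [11, 42, 7, 7, 93]     # made-up audio token ids

cond0, ctx0 = model_view(text_tokens, speaker_embedding, audio_tokens, step=0)
cond3, ctx3 = model_view(text_tokens, speaker_embedding, audio_tokens, step=3)

print(ctx0)            # []
print(ctx3)            # [11, 42, 7]
print(cond0 == cond3)  # True
```

The context grows by one token per step, while the conditioning sequence stays the same, so the whole text is always available to look at.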

The mapping from "which span of text" to "which span of audio tokens" has no explicit alignment labels; the model learns it on its own.

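Why is this possible?

Because the supervision given to the model during training is: "Given this text t1…tN and this speaker embedding, the target is to produce this audio token sequence y1…yM."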
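Written as a formula, this supervision is usually a next-token objective. The notes do not state the exact loss, so take the following as an assumption: a standard teacher-forced cross-entropy, with s standing for the speaker embedding and θ for the model parameters (both symbols introduced here).

```latex
\mathcal{L}(\theta) = -\sum_{m=1}^{M} \log p_\theta\!\left(y_m \mid y_1,\dots,y_{m-1},\; t_1,\dots,t_N,\; s\right)
```

Nothing in this objective says which t_n belongs to which y_m; the only signal is whether each audio token was predicted correctly given the full text.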

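The model automatically learns which audio tokens align with which text tokens.

For example, generating the audio for "你" means outputting y1 to y8, then "好" means outputting y9 to y17, and so on.

The AR model relies on the attention mechanism to dynamically choose which text token to focus on at the current step. In practice this is a soft weighting over all text tokens rather than a hard choice.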
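A toy sketch of that selection, using plain scaled dot-product attention over the text tokens. The key and query vectors here are hand-picked to make the effect visible; in a trained model they come from learned projections, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_over_text(query, text_keys):
    """Attention weights of one decoding step over all text tokens."""
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in text_keys
    ]
    return softmax(scores)


# One key vector per text token: index 0 for "你", index 1 for "好" (made-up values).
text_keys = [[4.0, 0.0], [0.0, 4.0]]

# Hypothetical decoder queries: one from an early step, one from a later step.
query_early = [3.0, 0.0]
query_late = [0.0, 3.0]

w_early = attention_over_text(query_early, text_keys)
w_late = attention_over_text(query_late, text_keys)

print(w_early.index(max(w_early)))  # 0
print(w_late.index(max(w_late)))    # 1
```

The weights always sum to 1 across the text tokens, and which token receives most of the weight shifts as decoding moves forward. That shift is the learned alignment.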