Sometimes people see an object, a place, or a reference and think of a tune that goes with it. That sound or song becomes an easy way to remember a place’s culture, and we hope to amplify that memory by building a language model that generates sounds.

Tools and References:

Linearly Mapping from Image to Text Space
CLIP - OpenAI
AudioCraft - Meta AI
PyTorch