A Text-Steerable Instrument for Sketching Procedural Soundscapes via Language Models

Prabal Gupta

Proceedings of the International Conference on New Interfaces for Musical Expression

Abstract

We present a real-time musical interface that converts natural-language scene descriptions into evolving procedural soundscapes. A performer types a prompt such as "warm jazz cafe at midnight" and steers it through direct parameter adjustments—stepping brightness down, switching a rhythm style—each producing a predictable, audible shift without re-prompting. Where GPU-bound text-to-audio systems synthesize monolithic waveforms, our instrument generates human-readable configurations over a categorical schema, enabling fine-grained performer control; most valid combinations are designed to sound musically coherent. Three interchangeable backends—embedding retrieval for sub-second CPU-only use, hosted LLMs via API, and a fine-tuned 270M local model—all emit the same schema. A live generator architecture continuously emits audio while resolving new instructions in the background, crossfading seamlessly when ready; even when an LLM takes 5-12 seconds to respond, the audience hears uninterrupted sound—reframing text-to-music as an ongoing performable stream rather than a one-shot generation. We evaluate text-audio semantic alignment using LAION-CLAP on held-out prompts as a technical proxy, finding that retrieval-based configuration outperforms random valid configurations on this metric, while noting that LAION-CLAP also informed retrieval-map construction. We report performance observations, informal listener feedback, and release materials for the SDK, dataset artifacts, model, and audiovisual performance interface.

Citation

Prabal Gupta. 2026. A Text-Steerable Instrument for Sketching Procedural Soundscapes via Language Models. Proceedings of the International Conference on New Interfaces for Musical Expression. DOI: 10.5281/zenodo.20784374

BibTeX Entry

@inproceedings{nime2026_120,
 abstract = {We present a real-time musical interface that converts natural-language scene descriptions into evolving procedural soundscapes. A performer types a prompt such as "warm jazz cafe at midnight" and steers it through direct parameter adjustments—stepping brightness down, switching a rhythm style—each producing a predictable, audible shift without re-prompting. Where GPU-bound text-to-audio systems synthesize monolithic waveforms, our instrument generates human-readable configurations over a categorical schema, enabling fine-grained performer control; most valid combinations are designed to sound musically coherent. Three interchangeable backends—embedding retrieval for sub-second CPU-only use, hosted LLMs via API, and a fine-tuned 270M local model—all emit the same schema. A live generator architecture continuously emits audio while resolving new instructions in the background, crossfading seamlessly when ready; even when an LLM takes 5-12 seconds to respond, the audience hears uninterrupted sound—reframing text-to-music as an ongoing performable stream rather than a one-shot generation. We evaluate text-audio semantic alignment using LAION-CLAP on held-out prompts as a technical proxy, finding that retrieval-based configuration outperforms random valid configurations on this metric, while noting that LAION-CLAP also informed retrieval-map construction. We report performance observations, informal listener feedback, and release materials for the SDK, dataset artifacts, model, and audiovisual performance interface.},
 address = {London, United Kingdom},
 articleno = {120},
 author = {Prabal Gupta},
 booktitle = {Proceedings of the International Conference on New Interfaces for Musical Expression},
 doi = {10.5281/zenodo.20784374},
 editor = {Benedict Gaster and João Tragtenberg and Anna Xambó and Tom Mitchell},
 issn = {2220-4806},
 month = {June},
 numpages = {4},
 pages = {983--986},
 presentation-video = {https://youtu.be/w57o9Ox-qdI},
 title = {A Text-Steerable Instrument for Sketching Procedural Soundscapes via Language Models},
 track = {paper},
 url = {http://nime.org/proceedings/2026/nime2026_120.pdf},
 year = {2026}
}