A Text-Steerable Instrument for Sketching Procedural Soundscapes via Language Models
Prabal Gupta
Proceedings of the International Conference on New Interfaces for Musical Expression
- Year: 2026
- Location: London, United Kingdom
- Track: paper
- Pages: 983–986
- Article Number: 120
- DOI: 10.5281/zenodo.20784374 (Link to paper and supplementary files)
- PDF Link: http://nime.org/proceedings/2026/nime2026_120.pdf
- Presentation/Demo Video: https://youtu.be/w57o9Ox-qdI
Abstract
We present a real-time musical interface that converts natural-language scene descriptions into evolving procedural soundscapes. A performer types a prompt such as "warm jazz cafe at midnight" and steers it through direct parameter adjustments—stepping brightness down, switching a rhythm style—each producing a predictable, audible shift without re-prompting. Where GPU-bound text-to-audio systems synthesize monolithic waveforms, our instrument generates human-readable configurations over a categorical schema, enabling fine-grained performer control; most valid combinations are designed to sound musically coherent. Three interchangeable backends—embedding retrieval for sub-second CPU-only use, hosted LLMs via API, and a fine-tuned 270M local model—all emit the same schema. A live generator architecture continuously emits audio while resolving new instructions in the background, crossfading seamlessly when ready; even when an LLM takes 5-12 seconds to respond, the audience hears uninterrupted sound—reframing text-to-music as an ongoing performable stream rather than a one-shot generation. We evaluate text-audio semantic alignment using LAION-CLAP on held-out prompts as a technical proxy, finding that retrieval-based configuration outperforms random valid configurations on this metric, while noting that LAION-CLAP also informed retrieval-map construction. We report performance observations, informal listener feedback, and release materials for the SDK, dataset artifacts, model, and audiovisual performance interface.
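The live generator idea in the abstract (keep emitting audio while a slow backend resolves a new configuration, then crossfade) can be illustrated with a minimal sketch. This is not the paper's implementation or SDK: the `LiveGenerator` class, the `render` function, the `freq_hz` and `gain` keys, and the equal-power crossfade are all assumptions made for illustration, and the single sine wave stands in for the procedural synthesis engine.

```python
import math
from concurrent.futures import ThreadPoolExecutor

SAMPLE_RATE = 8000   # assumed toy sample rate
BLOCK = 256          # samples per audio block


def render(config, start):
    """Toy renderer (hypothetical): one sine wave shaped by the config."""
    freq, gain = config["freq_hz"], config["gain"]
    return [gain * math.sin(2 * math.pi * freq * (start + i) / SAMPLE_RATE)
            for i in range(BLOCK)]


class LiveGenerator:
    """Emits audio blocks continuously; new configs resolve in the background."""

    def __init__(self, config, resolver, fade_blocks=4):
        self.current = config          # config that is audible right now
        self.resolver = resolver       # slow call, e.g. an LLM backend (assumed)
        self.fade_blocks = fade_blocks
        self.pending = None            # Future for a config still being resolved
        self.target = None             # resolved config we are fading towards
        self.fade_step = 0
        self.pos = 0
        self.pool = ThreadPoolExecutor(max_workers=1)

    def submit(self, prompt):
        """Start resolving a prompt without blocking audio output."""
        self.pending = self.pool.submit(self.resolver, prompt)

    def next_block(self):
        # Pick up a finished config only when no fade is already running.
        if self.pending is not None and self.target is None and self.pending.done():
            future, self.pending = self.pending, None
            try:
                self.target, self.fade_step = future.result(), 0
            except Exception:
                self.target = None     # resolver failed: keep the current sound
        out = render(self.current, self.pos)
        if self.target is not None:
            new = render(self.target, self.pos)
            total = self.fade_blocks * BLOCK
            for i in range(BLOCK):
                # Equal-power crossfade: t runs from 0 to 1 across the fade.
                t = (self.fade_step * BLOCK + i) / total
                out[i] = (math.cos(t * math.pi / 2) * out[i]
                          + math.sin(t * math.pi / 2) * new[i])
            self.fade_step += 1
            if self.fade_step == self.fade_blocks:
                self.current, self.target = self.target, None
        self.pos += BLOCK
        return out
```

In this sketch the audio loop would call `next_block()` at a fixed rate, while `submit()` can be called at any time; a resolver that takes several seconds only delays the start of the crossfade and never pauses output.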
Citation
Prabal Gupta. 2026. A Text-Steerable Instrument for Sketching Procedural Soundscapes via Language Models. Proceedings of the International Conference on New Interfaces for Musical Expression. DOI: 10.5281/zenodo.20784374
BibTeX Entry
@inproceedings{nime2026_120,
abstract = {We present a real-time musical interface that converts natural-language scene descriptions into evolving procedural soundscapes. A performer types a prompt such as "warm jazz cafe at midnight" and steers it through direct parameter adjustments—stepping brightness down, switching a rhythm style—each producing a predictable, audible shift without re-prompting. Where GPU-bound text-to-audio systems synthesize monolithic waveforms, our instrument generates human-readable configurations over a categorical schema, enabling fine-grained performer control; most valid combinations are designed to sound musically coherent. Three interchangeable backends—embedding retrieval for sub-second CPU-only use, hosted LLMs via API, and a fine-tuned 270M local model—all emit the same schema. A live generator architecture continuously emits audio while resolving new instructions in the background, crossfading seamlessly when ready; even when an LLM takes 5-12 seconds to respond, the audience hears uninterrupted sound—reframing text-to-music as an ongoing performable stream rather than a one-shot generation. We evaluate text-audio semantic alignment using LAION-CLAP on held-out prompts as a technical proxy, finding that retrieval-based configuration outperforms random valid configurations on this metric, while noting that LAION-CLAP also informed retrieval-map construction. We report performance observations, informal listener feedback, and release materials for the SDK, dataset artifacts, model, and audiovisual performance interface.},
address = {London, United Kingdom},
articleno = {120},
author = {Prabal Gupta},
booktitle = {Proceedings of the International Conference on New Interfaces for Musical Expression},
doi = {10.5281/zenodo.20784374},
editor = {Benedict Gaster and João Tragtenberg and Anna Xambó and Tom Mitchell},
issn = {2220-4806},
month = {June},
note = {},
numpages = {4},
pages = {983--986},
presentation-video = {https://youtu.be/w57o9Ox-qdI},
title = {A Text-Steerable Instrument for Sketching Procedural Soundscapes via Language Models},
track = {paper},
url = {http://nime.org/proceedings/2026/nime2026_120.pdf},
year = {2026}
}