In the opening moments of Jean Sibelius' Violin Concerto, the young soloist plays delicately, almost languidly. The orchestra responds in kind, muting the repeated string motif to a whisper. As the piece progresses, soloist and orchestra alternately perform the main motifs in increasing measures of power and virtuosity, which inexorably lead toward the movement's stirring resolution. The soloist looks relieved as she crosses the stage to shake the conductor's hand.
This violinist, like most other music students, can benefit enormously from interacting with large ensembles to hone her performing skills. However, the demand far exceeds the number and capabilities of existing orchestras, ensuring most of these students won't have access to this experience. Our soloist is no exception. The previous paragraph describes her interaction with Chris Raphael's Music Plus One system: a machine learning-driven alternative to working with orchestras that retains much of the expressivity and interactivity that make concerto performance such a rewarding and educational experience. The following paper details the approach; for videos, see http://www.music.informatics.indiana.edu/papers/icml10/.
Automatic music accompaniment has been actively researched since the 1980s, starting with the work of such pioneers as Barry Vercoe and Roger Dannenberg. The problem can be broken into three subparts:¹ tracking the playing of a human soloist, matching it to a known musical score, and synthesizing an appropriate accompaniment to that solo part in real time. Solutions usually involve ingenious pattern-matching mechanisms for dealing with expressive, incorrectly played, or missing notes in the soloist's performance, while using the output of the pattern match to drive the scheduling of accompaniment events. However, as Raphael notes, it is impossible to accomplish score following by reaction alone. The system must incorporate a predictive component that attempts to align upcoming notes of the accompaniment with imminent attacks of the human player. Failing to solve this problem can have disastrous consequences for the performance.
The proposed approach starts with a hidden Markov model-based score follower tasked with estimating the start times of the notes played by the soloist and matching them to positions in the score. The model treats the sequence of frame-wise signal features, which characterize transient and pitch information in the audio input, as its observed output, and defines the state graph of the Markov chain as a sequence of note sub-graphs modeling the soloist's performance. In a process akin to echo cancellation, the contribution of the accompanist to the audio signal is explicitly modeled so that the system does not end up following itself.
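To make the idea of score following with a hidden Markov model concrete, here is a minimal sketch in Python. It is a deliberate simplification and not Raphael's implementation: it assumes one state per score note (rather than a sub-graph per note), a single scalar pitch estimate per audio frame (rather than transient and pitch features), and no model of the accompaniment's contribution to the signal. The function name `follow_score` and the parameters `stay_prob` and `pitch_sigma` are hypothetical.

```python
import math


def follow_score(score_pitches, observed_pitches, stay_prob=0.8, pitch_sigma=0.5):
    """Forward-filter a left-to-right HMM and return the most likely
    score note index after each audio frame.

    Assumptions (illustrative only): one state per note, pitches given as
    MIDI note numbers, Gaussian emission around the notated pitch.
    """
    n = len(score_pitches)
    # belief[i] = P(soloist is on note i | frames seen so far).
    belief = [1.0] + [0.0] * (n - 1)
    positions = []
    for obs in observed_pitches:
        # Transition step: stay on the current note or advance to the next.
        predicted = [0.0] * n
        for i, p in enumerate(belief):
            if i + 1 < n:
                predicted[i] += p * stay_prob
                predicted[i + 1] += p * (1.0 - stay_prob)
            else:
                predicted[i] += p  # the final note has nowhere to advance
        # Emission step: weight each state by how well it explains the frame.
        weighted = [
            p * math.exp(-0.5 * ((obs - pitch) / pitch_sigma) ** 2)
            for p, pitch in zip(predicted, score_pitches)
        ]
        total = sum(weighted)
        # If no state explains the frame at all, fall back on the prediction.
        belief = [w / total for w in weighted] if total > 1e-300 else predicted
        positions.append(max(range(n), key=belief.__getitem__))
    return positions


# Three notated pitches (D4, E4, F4) and seven slightly detuned frames.
frames = [62.1, 61.9, 62.0, 64.2, 63.9, 65.1, 65.0]
print(follow_score([62, 64, 65], frames))
```

The frame at which the reported index changes serves as a crude onset estimate. Because the filter only uses frames heard so far, that estimate can only be made after the note has begun, which is the latency issue the predictive model is meant to address.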
The estimated sequence of note onset times could be used to adaptively control the playback rate of the orchestral recording and match the player's position in the score. However, it is not possible to accurately estimate the soloist's timing without a certain amount of latency, which would cause the orchestra to consistently lag behind. The author's solution is to use a Gaussian graphical model to predict the timing of the next orchestra event based on previous observations of both solo and orchestra note occurrences. In this context, the orchestra's playback rate, modulated using the well-established phase vocoder method, is continuously re-estimated as new events are observed. This formulation is robust to missing notes, since pending orchestra notes are conditioned only on those events that have actually been observed. Crucially, Raphael uses information from rehearsals to adapt the predictive model to the soloist's interpretation style, thus mimicking the process by which human performers learn to anticipate each other's actions through practice.
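The flavor of this prediction step can be illustrated with a small linear-Gaussian tracker. The sketch below is a generic Kalman-style filter over onset time and local tempo, offered as an assumption-laden stand-in for the Gaussian graphical model rather than a reproduction of it: it ignores orchestra events, learns nothing from rehearsals, and uses hand-picked noise variances. The class `TempoTracker`, the helper `track`, and all parameter names are hypothetical.

```python
class TempoTracker:
    """Tracks (onset time t, seconds-per-beat s) with a 2-state Kalman filter."""

    def __init__(self, start_time=0.0, sec_per_beat=0.5):
        self.t = start_time
        self.s = sec_per_beat
        # Covariance entries: var(t), cov(t, s), var(s). Values are guesses.
        self.ptt, self.pts, self.pss = 0.01, 0.0, 0.01

    def predict(self, beats, q_time=1e-4, q_tempo=1e-4):
        """Advance the state by `beats` score beats; return the predicted onset."""
        # Covariance update for t' = t + beats * s, s' = s, plus process noise.
        ptt = self.ptt + 2 * beats * self.pts + beats * beats * self.pss + q_time
        pts = self.pts + beats * self.pss
        pss = self.pss + q_tempo
        self.t += beats * self.s
        self.ptt, self.pts, self.pss = ptt, pts, pss
        return self.t

    def update(self, observed_time, r=1e-4):
        """Condition on an observed onset time with measurement variance r."""
        innovation = observed_time - self.t
        denom = self.ptt + r
        k_t, k_s = self.ptt / denom, self.pts / denom  # Kalman gains
        self.t += k_t * innovation
        self.s += k_s * innovation
        # Posterior covariance (I - K H) P, with H = [1, 0].
        ptt = (1 - k_t) * self.ptt
        pts = (1 - k_t) * self.pts
        pss = self.pss - k_s * self.pts
        self.ptt, self.pts, self.pss = ptt, pts, pss


def track(onsets, beats_per_note=1.0):
    """Run the tracker over detected onsets; None marks an undetected note."""
    tracker = TempoTracker()
    for onset in onsets:
        tracker.predict(beats_per_note)
        if onset is not None:  # skip the update when the note was missed
            tracker.update(onset)
    return tracker


# The soloist plays at 0.6 s per beat; the fifth note is never detected.
onsets = [0.6 * k for k in range(1, 9)]
onsets[4] = None
result = track(onsets)
print(round(result.s, 3), round(result.predict(1.0), 3))
```

When a note is missed, the tracker simply predicts through the gap and conditions on whatever is observed next, which mirrors in miniature the robustness to missing notes described above. Calling `predict` before the next note is heard gives the accompaniment a target time to aim for instead of a detection to react to.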
The system's architecture is constructed using common staples of the machine learning literature such as hidden Markov models and Gaussian graphical models. Yet these elements and others are combined using a healthy dose of musically meaningful insights, as well as the engineering acumen necessary to make the system work robustly in real time, as emphatically demonstrated by the accompanying videos. The result is an effective, albeit limited, model of a human accompanist that has been extensively tested by student performers at one of the country's premier conservatories, the Jacobs School of Music at Indiana University.
The system is an important milestone toward machine musicianship. Through the use of machine learning, computers are acquiring new skills once thought to be uniquely human. If music communicates human emotions, isn't it futile to teach computers to play music? The beauty of Music Plus One is that it follows and amplifies the emotions of the human player. In that sense, it is very much like a traditional musical instrument, albeit a highly sophisticated one. Music Plus One relies on a predetermined musical score. Perhaps the next step would be to create an automatic rhythm section that reacts to a soloist the way jazz musicians instantly react to each other's improvisations. It would require a new level of machine musicianship, and would constitute a major challenge for machine learning, one that will increase our understanding and appreciation of the human mind's ability to create and improvise.
©2011 ACM 0001-0782/11/0300 $10.00