Stephen McAdams & Kai Siedenburg
Perception of musical timbre

Perception and cognition of musical timbre¹

Running title: Musical timbre

Stephen McAdams
Schulich School of Music
McGill University
555 Sherbrooke Street West
Montreal, QC
Canada H3A [email protected]

Kai Siedenburg
Department of Medical Physics and Acoustics
University of Oldenburg
Küpkersweg 74
26129 Oldenburg, [email protected]

in P. J. Rentfrow & D. J. Levitin (Eds.), Foundations of Music Psychology: Theory and Research, pp. 71-120, Cambridge, MA, MIT Press, 2019.

ABSTRACT

This chapter explains timbre as a perceptual property of a specific fused auditory event. Timbre refers to a complex set of auditory attributes that carry musical qualities, collectively contribute to sound source recognition and identification, and complement other auditory attributes of music. Psychophysics of timbre and covariance with other musical parameters will be discussed. This chapter will also explore absolute and relative perception of timbre, memory for timbre, and timbre as a structuring force in music perception. This chapter also includes discussion of current and future applications of timbre research for music information retrieval and musical machine learning.

KEYWORDS: Timbre, Music perception, Music cognition, Hearing, Psychophysics, Neuroscience

¹ This chapter is an updated and expanded version of McAdams (2013).

Timbre refers to a complex set of auditory attributes that carry musical qualities, and that collectively contribute to sound source recognition and identification. It complements other auditory attributes of musical sounds such as pitch, loudness, duration, and spatial position. Timbral attributes arise from an event produced by a single acoustic or electroacoustic sound source or from events produced by several sound sources that are perceptually fused or blended into a single auditory image. Timbre is thus a perceptual property of a specific fused auditory event.

A major property of timbre is that it covaries with many other musical parameters in all acoustic and some electroacoustic instruments. For example, a specific oboe played with a given fingering (pitch) at a given playing effort (dynamic) with a particular articulation and embouchure configuration produces a note that has a distinct timbre. The timbre will change if any of these parameters are changed. Therefore, an instrument such as an oboe does not have “a timbre”; it has a constrained universe of timbres that covary with the other musical parameters. For example, the timbres of clarinet sounds are vastly different in the lower (chalumeau) register than in the higher (clarion) register, and a trombone player can make the sound darker by playing a bit softer. There may, however, be certain acoustic invariants that are common across all of the events producible by an instrument that signal its identity. Timbre is thus a rather vague word that implies a multiplicity of perceptual qualities, some of which were addressed as early as the late 19th century by Helmholtz (1885/1954) followed by the seminal work of Stumpf (1926) and Schumann (1929). 
The vast majority of contemporary research has been conducted over the last 45 years or so, starting with the pioneering work of Plomp (1970) and Wessel (1973).

We now understand timbre to have two broad characteristics that contribute to the perception of music (see reviews in Hajda, Kendall, Carterette & Harshberger, 1997; Handel, 1995; McAdams, 1993; Risset, 2004):

1) It is a multitudinous set of perceptual attributes, some of which are discrete or categorical (e.g., the “blatt” at the beginning of a sforzando trombone sound or the pinched offset of a harpsichord sound), others of which are continuously varying (e.g., attack sharpness, brightness, noisiness, richness, roughness). In this sense, timbre varies continuously as do the other auditory attributes of pitch, loudness and spatial position. Just as sounds can be higher or lower and louder or softer and to the left or right, up or down, near or far, they can also be more or less bright, rough, sharp in attack, rich, nasal, inharmonic, and a plethora of other qualities.

2) It is one of the primary perceptual vehicles for the recognition, identification, and tracking over time of a sounding object (tenor voice, violin, tubular bells), and thus is involved in the absolute categorization of a sounding object.

Understanding timbre perception thus involves a wide range of issues connecting the physics of sound sources to relevant aspects of perception and cognition: determining the properties of vibrating objects and of the acoustical waves emanating from them, developing techniques for quantitatively analyzing and characterizing sound waves, formalizing models of how the acoustic signal is analyzed and coded neurally by the auditory system, characterizing the perceptual representation of the sounds used by listeners to compare sounds in an abstract way or to categorize or identify their physical source, and understanding the role timbre can play in perceiving musical patterns and forms and in shaping musical performance expressively. 
More theoretical approaches to timbre have also included considerations of the musical implications of timbre as a set of form-bearing dimensions in music (cf. McAdams, 1989). This chapter will focus on some of these issues in detail: the psychophysics of timbre, timbre as a vehicle for source identity, the perception of timbre relations, memory for timbre, the role of timbre in musical grouping, and timbre as a structuring force in music perception, including the effect of sound blending on the perception of timbre, timbre’s role in the grouping of events into auditory streams and musical patterns, the role of timbre in the building and release of musical tension, and implicit learning of timbral grammars. We also include a brief survey of neuroscientific studies of timbre. We will conclude by examining a number of issues that have not been extensively studied yet: issues concerning the role of quantifying timbral characteristics in music information retrieval systems, control of timbral variation by instrumentalists and sound synthesis control devices to achieve musical expressiveness, the link between timbre perception and orchestration and electroacoustic music composition, and finally, consideration of timbre’s status as a primary or secondary parameter in musical structure.

Psychophysics of timbre

One of the main approaches to timbre perception attempts to quantify the ways in which people perceive sounds to differ. The primary conception of timbre from the time of Helmholtz in the mid-1800s until the 1970s was in terms of its relation to spectral shape. Early research on the perceptual nature of timbre focused on preconceived aspects such as the relative weights of different frequencies present in a given sound, or its “sound color” (Slawson, 1985). For example, both a voice singing a constant middle C while varying the vowel being sung and a brass player holding a given note while varying the embouchure and mouth cavity shape would vary the shape of the sound spectrum (cf. McAdams, Depalle & Clarke, 2004b). 
Helmholtz (1885/1954) invented ingenious resonating devices for controlling spectral shape to explore these spectral aspects of timbre. He used air jets blowing out of tubes across tuned water jugs and varied the air speed stimulating each jug and the tuning of the jugs to vary the timbre. Licklider (1951) discussed various aspects of complex sounds but only concluded that “Until careful scientific work has been done on the subject, it can hardly be possible to say more about timbre than that it is a ‘multidimensional’ dimension” (p. 1019). Plomp (1970) explored the notion of timbre’s multidimensionality, but the real advances in understanding the perceptual representation of timbre had to wait for the development of signal generation and processing techniques, and of multidimensional data analysis techniques in the 1950s and 1960s. Wessel (1973) was the first to apply these to timbre perception and in particular began to emphasize the importance of time-varying aspects of sound for timbre perception. From this approach, the notion of timbre dimensions in a timbre space was developed.

Timbre space

Multidimensional scaling (MDS) makes no preconceptions about the physical or perceptual structure of the data it is being used to analyze. For timbre, listeners typically rate on a scale varying from very similar to very dissimilar all pairs from a given set of sounds. The sounds are usually equalized in terms of pitch, loudness, and duration and are presented from the same spatial location so that only the timbre varies in order to focus listeners’ attention on this set of attributes. The dissimilarity ratings are then fit to a distance model (or spatial map) in which sounds with similar timbres are closer together and those with dissimilar timbres are

farther apart. The analysis approach is presented in Figure 1. The graphic representation of the distance model is called a “timbre space.” Such techniques have been applied to synthetic sounds (Caclin, McAdams, Smith & Winsberg, 2005; Miller & Carterette, 1975; Plomp, 1976, chap. 6), resynthesized or simulated instrument sounds (Grey, 1977; Kendall, Carterette & Hajda, 1999; Krumhansl, 1989; McAdams, Winsberg, Donnadieu, De Soete & Krimphoff, 1995; Wessel, 1979), recorded instrument sounds (Elliott, Hamilton & Theunissen, 2013; Iverson & Krumhansl, 1993; Lakatos, 2000; Wessel, 1973), and dyads of recorded instrument sounds (Kendall & Carterette, 1991; Tardieu & McAdams, 2012).

[Insert Figure 1 about here]

The basic MDS algorithm (Kruskal, 1964a,b) is expressed in terms of continuous dimensions that are shared among the timbres, the underlying assumption being that all listeners use the same perceptual dimensions to compare the timbres. The model distances are fit to the empirically derived proximity data (usually dissimilarity ratings or confusion ratings among sounds). More complex algorithms also include dimensions or features that are specific to individual timbres, called “specificities” (EXSCAL, Winsberg & Carroll, 1989) and different perceptual weights accorded to the dimensions and specificities by individual listeners or latent classes of listeners (INDSCAL, Carroll & Chang, 1970; CLASCAL, Winsberg & De Soete, 1993; McAdams et al., 1995). The equation defining distance in the more general CLASCAL model is:

$$\delta_{ijc} = \left[ \sum_{d=1}^{D} w_{cd} \, (x_{id} - x_{jd})^2 + v_c \, (s_i + s_j) \right]^{1/2} \qquad \text{(Eq. 1)}$$

where $\delta_{ijc}$ is the distance between sounds $i$ and $j$ for latent class $c$, $x_{id}$ is the coordinate of sound $i$ on dimension $d$, $D$ is the total number of dimensions, $w_{cd}$ is the weight on dimension $d$ for class $c$, $s_i$ is the specificity on sound $i$, and $v_c$ is the weight on the whole set of specificities for class $c$. The basic algorithm doesn’t model weights or specificities and only has one class of listeners. EXSCAL has specificities, but no weights. INDSCAL has no specificities, but has weights on each dimension for each listener. Finally, the CONSCAL algorithm allows for continuous mapping functions between audio descriptors and the position of sounds along a perceptual dimension to be modeled for each listener using spline functions, with the constraint that the position along the perceptual dimension respect the ordering along the physical dimension (Winsberg & De Soete, 1997). This technique allows one to determine the auditory transform of each physical parameter for each listener. Examples of the use of these different analysis algorithms include: Kruskal’s technique by Plomp (1976), INDSCAL by Wessel (1973), Miller and Carterette (1975), Plomp (1976), and Grey (1977), EXSCAL by Krumhansl (1989), CLASCAL by McAdams et al. (1995), and CONSCAL by Caclin et al. (2005). Descriptions of how to use the CLASCAL and CONSCAL algorithms in the context of timbre research are provided in McAdams et al. (1995) and Caclin et al. (2005), respectively. One of the difficulties of this approach is that the number of ratings that each listener has to make increases quadratically with the number of sounds to be compared. Elliott et al. (2013) used de Leeuw and Mair’s (2009) SMACOF algorithm, which can perform multi-way constrained MDS in which multiple similarity ratings from different listeners are used for each pair of stimuli; a given listener only rates a subset of a large set of stimulus pairs.

Specificities are often found for complex acoustic and synthesized sounds. 
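
To make the roles of the terms in Eq. 1 concrete, the following is a minimal Python sketch of the model distance for a single pair of sounds and a single latent class, together with the number of distinct pairs a listener would have to rate. It is only an illustration under stated assumptions: the function names, argument names, and numerical values are hypothetical and are not taken from any published CLASCAL implementation, and the sketch evaluates the distance for given parameters rather than fitting the model to dissimilarity ratings.

```python
import math


def clascal_distance(x_i, x_j, s_i, s_j, w_c, v_c):
    """Model distance between sounds i and j for one latent class (Eq. 1).

    x_i, x_j : coordinates of the two sounds on the D shared dimensions
    s_i, s_j : specificities of the two sounds (non-negative)
    w_c      : weights of the latent class on each of the D dimensions
    v_c      : weight of the latent class on the whole set of specificities
    (All names are hypothetical and chosen for this illustration.)
    """
    # Weighted squared differences on the shared dimensions.
    shared = sum(w * (a - b) ** 2 for w, a, b in zip(w_c, x_i, x_j))
    # Contribution of the features specific to each sound.
    specific = v_c * (s_i + s_j)
    return math.sqrt(shared + specific)


def n_pairs(n_sounds):
    """Number of distinct pairs among n_sounds: n(n - 1) / 2."""
    return n_sounds * (n_sounds - 1) // 2


# Made-up values: two sounds in a two-dimensional space.
d_basic = clascal_distance((0.0, 0.0), (3.0, 4.0), 0.0, 0.0, (1.0, 1.0), 1.0)
d_spec = clascal_distance((0.0, 0.0), (3.0, 4.0), 0.0, 11.0, (1.0, 1.0), 1.0)
print(d_basic, d_spec, n_pairs(18))
```

With unit weights and zero specificities the expression reduces to the ordinary Euclidean distance of the basic model. A nonzero value of `s_i` or `s_j` increases the distance between that sound and every other sound without changing its position on the shared dimensions, which is how the specificities enter the model.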
They are considered to represent the presence of a unique feature that distinguishes a sound from all others in a given context. For example, in a set of brass, woodwind, and string sounds, a harpsichord has a feature shared with no other sound: the return of the hopper which creates a slight “thump” and quickly damps the sound at the end. Or in a set of sounds with fairly smooth spectral envelopes such as brass instruments, the jagged spectral envelope of the clarinet due to the attenuation of the even harmonics at lower harmonic ranks yields a perceptual feature (often described as “hollowness”) that is specific to that instrument. Such features might appear as specificities in the distance models derived with the EXSCAL and CLASCAL algorithms (Krumhansl, 1989; McAdams et al., 1995), and the strength of each feature is represented by the square root of the specificity value in equation 1.

As an example, the timbre space reported by McAdams et al. (1995) is shown in Figure 2. It is based on dissimilarity ratings by 84 listeners including nonmusicians, music students and professional musicians. Listeners were presented digital simulations of instrument sounds and chimæric sounds combining features of different instruments (such as the vibrone with both vibraphone-like and trombone-like featur