ISMIR 2018 round-up

At the end of September, FeedForward was pleased to sponsor ISMIR 2018 (International Society for Music Information Retrieval) and the team spent an excellent week in Paris. Here’s a round-up of the key things we took away from the week.

What is music information retrieval?

For anyone not familiar with the field, here’s a brief introduction…

Music tends to exist in one of two forms:

  1. Symbolic: the music is represented using symbols e.g. notes on a score, MIDI, piano roll.

  2. Audio: a performance was recorded (or created on production software) and now exists in an audio file e.g. .wav

There are many scenarios in the real world where we want to categorise, manipulate or create music. In order to do this, we need to extract useful information from the music to work with, and this is what MIR seeks to do.

Some key applications for MIR:

  • Recommender systems: used for search, playlisting, recommendations etc.

  • Source separation & instrument recognition: extracting original stems as recorded from full tracks.

  • Automatic music transcription: converting audio into symbolic representation.

  • Automatic categorisation: categorising music by genre or other tags.

  • Music/sound generation: generating new full tracks, new parts of an existing track, new sounds, vocal synthesis etc.

Key takeaways

Deep learning continues to match / outperform traditional approaches

One of the main takeaways from this year’s ISMIR is that, in every area of MIR, there were researchers using deep learning techniques. Whilst the field of MIR is decades old and deep learning has only been making a significant appearance over the last five years or so, basic deep learning models are already competitive with or outperforming the existing traditional state of the art models.

We particularly noted that VAEs (Variational Autoencoders) were being used for a range of applications. This is an architecture that we believe has significant possibilities for creative applications, and it was great to see the research results speaking for themselves.

Availability of data is a major challenge

[Image: Music separation with DNNs: making it work]

If there’s one thing that can’t be said enough about deep learning, it’s that the quality of the output depends on the availability of data. Whereas deep learning researchers working with images often have access to datasets with millions, and sometimes billions, of images, the availability of suitably large datasets in music & audio is a real challenge.

This was particularly noticeable in the source separation tutorial at the beginning of the week. The most useful datasets for training deep learning systems for source separation contain both the full track and its stems (a stem can be used as the “target” part to separate from the full track), and the largest datasets that fulfil these criteria contain just 100 and 150 tracks.
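To make the role of stems concrete, here is a minimal sketch of how a (full track, target stem) training pair can be built. It assumes the full track is simply the sum of its stems, which real mixes only approximate once mastering is applied; the function name, stem names and random "audio" are hypothetical stand-ins, not taken from any dataset mentioned at the tutorial.

```python
import numpy as np


def make_training_pair(stems, target_name):
    """Return (mixture, target) for one track.

    Assumption: the mixture is the plain sum of all stems.
    """
    mixture = np.sum(list(stems.values()), axis=0)
    target = stems[target_name]
    return mixture, target


# Hypothetical one-second stems at 8 kHz; random noise stands in for audio.
rng = np.random.default_rng(0)
stems = {
    name: 0.1 * rng.standard_normal(8000)
    for name in ("vocals", "drums", "bass", "other")
}

mixture, target = make_training_pair(stems, "vocals")

# Everything that is not the target is what a separation model must remove.
accompaniment = mixture - target
```

A separation model is then trained to map the mixture to the target, which is why datasets without stems are of limited use for this task.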

More on source separation

It was interesting to see Wave-U-Net adapt the U-Net architecture for source separation, tackling the source separation problem in the time domain, whilst another piece of work proposed using stacked hourglass networks.

Recommendation systems are increasingly important


At the tutorial on recommendation systems, co-presented by Fabien Gouyon, Chief Scientist at Pandora, it was clear that successful recommendation is increasingly important for content-based businesses and that MIR R&D has the potential to disrupt many parts of the music industry landscape.

Changes in the way we consume music are creating new data points for understanding users and, as the music industry moves away from a “Discover & own” model towards an “Access” model, users expect a personally meaningful experience, driven by context-aware recommendations.

At the moment, voice interfaces mostly enable “Command & fetch” (e.g. ‘Play me Pharrell Williams “Happy”’) but it is expected that they will evolve and that users will come to expect context-aware search (e.g. ‘Play me something that will make me happy’). To achieve this technically, the strength of multi-modal systems that combine collaborative filtering and deep learning methods was emphasised; this is the approach that sits behind our FIGARO.AI framework.
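As a rough illustration of what combining the two signals can look like, the sketch below blends a collaborative-filtering score with a content-based similarity score. It is a toy example under stated assumptions, not the method presented in the tutorial and not how FIGARO.AI works: the function names, the weighting parameter alpha and all the vectors are hypothetical, and the audio embeddings are hard-coded where a real system would take them from a trained neural network.

```python
import numpy as np


def cosine_similarity(query, items):
    """Cosine similarity between one query vector and each row of items."""
    q = query / np.linalg.norm(query)
    m = items / np.linalg.norm(items, axis=1, keepdims=True)
    return m @ q


def hybrid_scores(user_factors, item_factors, user_audio_profile,
                  item_audio_embeddings, alpha=0.5):
    """Blend collaborative-filtering and content-based scores.

    alpha (hypothetical parameter) weights the collaborative-filtering part.
    """
    # Collaborative filtering: dot product of latent factors learned from play data.
    cf = item_factors @ user_factors
    # Content: similarity between the user's audio profile and each track's embedding.
    content = cosine_similarity(user_audio_profile, item_audio_embeddings)
    return alpha * cf + (1 - alpha) * content


# Toy data: track 2 is brand new, so it has no play history (all-zero factors).
user_factors = np.array([1.0, 0.0])
item_factors = np.array([[0.9, 0.1], [0.1, 0.9], [0.0, 0.0]])
user_audio_profile = np.array([0.0, 1.0])
item_audio_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

scores = hybrid_scores(user_factors, item_factors,
                       user_audio_profile, item_audio_embeddings)
ranking = np.argsort(-scores)  # track indices, best first
```

In this toy setup the new track still receives a usable score from its audio content alone, which is one reason such hybrids are attractive when play data is sparse.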


The tutorial addressed many other topics that are key considerations for recommendation systems, including how to address the challenges of sequence-aware recommendation. It was proposed that a system with multiple strategies for predicting which track will fulfil user expectation is an effective way for a system to adapt to user-specific intent.

It was made clear that recommendation systems are not only important for music discovery & consumption, but also have a role to play in creative music making. With an increasing number of new (academic) interfaces for sample browsing, we can expect the way that producers find sounds in the future to become more intuitive, and driven by MIR.

User anonymity


We particularly liked this paper from the National Institute of Advanced Industrial Science and Technology (AIST), Japan. It highlighted that streaming services are able to predict demographic features such as age and gender with high accuracy using a user’s play log. To counteract this, it proposed a service which preserves anonymity by suggesting songs to camouflage a user’s demographic identity. It was good to see academic work addressing the growing public concern around user data ownership & corporate manipulation.

Awarded papers

  1. Learning to Listen, Read, and Follow: Score Following as a Reinforcement Learning Game (Dorfer, Henkel & Widmer): Uses reinforcement learning to train a score following model. The model is trained on pixels (score image inputs) and spectrograms.

  2. End-to-end learning for music audio tagging at scale (Pons, Nieto, Prockup, Schmidt, Ehmann & Serra): Compares model architectures for different dataset sizes. Shows that non-handcrafted features (raw waveforms) work best with DNNs when enough training data is available.

  3. Bridging audio analysis, perception and synthesis with perceptually-regularized variational timbre spaces (Esling, Chemla-Romeu-Santos & Bitton): Combines deep learning with available (high-level) metadata on audio to produce a semantically meaningful neural synthesizer.

Other works of interest

Stefan Lattner presented work that he collaborated on in Linz on the use of gated autoencoders for generative composition models that learn the transformations of musical material within the data.

Keynote speakers, WiMIR & industry meetup

It is worth highlighting these three elements of ISMIR 2018. The keynote speakers, Patrick Flandrin & Rebecca Fiebrink, added excellent context to the scientific program, respectively presenting the history of representing sound and the use of machine learning in the creative process.

The WiMIR initiative continues to offer practical action to support women in the field and the high attendance of this session is testament to the support and interest from the whole community.

ISMIR 2018 also saw the first “Industry Meetup”, which we were pleased to be part of. The crossover between conference participants and company representatives demonstrates how increasingly important MIR research is within the commercial music industry.

We’re already looking forward to 2019 in Delft!