Mozart Ex Machina

In this project, I prototype some new ideas for machine learning algorithms. Artificial neural networks (ANNs) currently have serious problems with reliability, reproducibility, and stability, which collectively erode trust in their results. The hope with these prototypes is to add transparency and robustness to ANNs, thereby restoring confidence in their widespread use.

We start small, making sure these ideas work on the simplest of ANNs. I've chosen to focus on the classification and automatic generation of Mozart's music. I'll be saving my notes and observations in this blog in the hope that others can make use of them.

Tools and Resources

Dataset

The first thing I need for this project is a dataset. Since I've chosen Mozart's music as the focus, I'll need a selection of his pieces. All of them are in the public domain, so I am free to reuse them however I like. I've taken the pieces from http://www.piano-midi.de/mozart.htm, where they are stored as MIDI files (more on this file format later). The website's content is licensed under the CC BY-SA Germany License, which requires that I acknowledge the original copyright owner (Bernd Krueger) and make any content built on this data available under the same license. This does not apply to the pieces themselves, as they are in the public domain.

The dataset contains 21 movements, three each from seven pieces. These pieces are numbered 311, 330, 331, 332, 333, 545, and 570. Their file names follow the pattern mz_311_1.mid, indicating composer, piece, and movement.

Machine learning package

Next, I'll need software to program my ANNs. I've gone with scikit-learn, a Python package designed for straightforward machine learning. It's open-source, which is always good in research because it keeps the programs accessible; it also means the code won't become useless if a company goes bankrupt. The package depends on NumPy, SciPy, and matplotlib, so I need those installed as well, along with a Python interpreter.

Music analysis toolkit

While scikit-learn will give me the machine learning tools I need, I also need software to analyze the music itself. For that, I'm using music21, an open-source Python toolkit from MIT. It parses and manipulates MIDI files, letting me dissect the pieces into their features.

To get the essential components of a MIDI file, one can run the following code snippet.


      import music21

      # Parse the MIDI file into music21's score format
      midi_data = music21.converter.parse('mz_311_1.mid')
    

This extracts the information from the MIDI file and puts it into music21's format. From there, we can pick out specific measures: midi_data.measure(144).show('text'). This returns all information found in measure 144 of the piece, including instruments played, clef, key, tempo, meter, and the list of notes.

music21 stores information as streams, which contain all the other objects relevant to this project. The data extracted from the MIDI files in the dataset is stored as scores. Each score is divided into separate parts, usually Piano Left and Piano Right. These parts in turn comprise the measures that make up the piece, as well as information that holds throughout the piece, such as the clef, key, meter, and instrument. (Some of this information may change in certain measures.) Finally, within each measure is a list of notes, rests, and tempos.
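
To make this hierarchy concrete, here is a minimal sketch (using the midi_data parsed earlier) that walks from the score down to the individual notes:


      # Walk the stream hierarchy: score -> parts -> measures -> notes
      for part in midi_data.parts:
          print(part.partName)                  # e.g. 'Piano Right'
          for measure in part.getElementsByClass('Measure'):
              for note in measure.notes:
                  print(measure.number, note.offset, note.duration.quarterLength)
    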

While individual measures can be called using midi_data.measure(#), individual parts have to be indexed like lists: midi_data.parts[0]. The following code snippet, which continues from the previous one, returns relevant information about the piece.


      midi_data.parts[0].clef                                                        # returns the clef
      midi_data.parts[0].getInstrument()                                             # returns the instrument
      midi_data.parts[0].measure(1).notes.show('text')                               # chords, notes, and rests, with offsets from the start of the measure
      midi_data.parts[0].measure(1).getElementsByClass('KeySignature').show('text')  # key
      midi_data.parts[0].measure(1).getElementsByClass('TimeSignature').show('text') # meter
      midi_data.parts[0].measure(1).getElementsByClass('MetronomeMark').show('text') # tempo(s)
    

To extract notes from `midi_data`, we can recurse into each part and measure.


      # Recurse through every part and measure, collecting all notes and chords
      data = midi_data.recurse().notes
    

We now have the master list of notes for the entire song. The length of this list is the number of entries: `N=len(data)`. Each entry is one note or chord, complete with several relevant pieces of information. We are interested in three in particular: offset, pitch, and duration.
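
For instance, these three attributes can be read directly off a single entry (a sketch; if `data[0]` happens to be a chord, `.pitch` is replaced by `.pitches`):


      print(data[0].offset)                    # offset within its containing measure
      print(data[0].duration.quarterLength)    # duration in quarter notes
      print(data[0].pitch.ps)                  # pitch space value, e.g. 60.0 for middle C
    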

Before we extract these data points, we need to separate chords into individual notes. If `data[i]` is a chord, its notes are stored as a list, and we can step through each note in the chord individually.


    def read_chord(chord,group_offset):
      # Flatten a note or chord into [pitch, duration, offset] triples,
      # with offsets measured relative to group_offset.
      output=[]
      if isinstance(chord, music21.note.Note):
          output.append(chord.pitch.ps)                   # pitch space value
          output.append(chord.duration.quarterLength)     # duration in quarter notes
          output.append(chord.offset - group_offset)      # offset relative to the group
      elif isinstance(chord, music21.chord.Chord):
          for noteChord in chord.notes:
              output.append(noteChord.pitch.ps)
              output.append(noteChord.duration.quarterLength)
              # Notes inside a chord carry no offset of their own,
              # so use the chord's offset for each of them.
              output.append(chord.offset - group_offset)
      return output
    
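
As a usage sketch, applying it to the first entry of `data`, measured relative to its own offset:


      print(read_chord(data[0], data[0].offset))   # a flat [pitch, duration, offset] list
    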

Feature space

To feed data into an ANN, we need to divide each datum into features. For this project, each datum is a selection of music. Since the music is stored as a sequence of chords, each with a set of notes, an offset, a duration, and so on, the feature space should be precisely these descriptors. That is, for each chord in the selection, there is a feature for the set of notes, the offset, and the length of the chord, as well as the instrument played and other important details.

The size of each datum can be as small as a single chord or as long as the entire song; however, it must be the same for every datum.

Notes, together with their durations and offsets, are easily described by numbers. Chords, however, are not easily reduced to a numerical format; representing them well is a problem rich enough for its own project. To sidestep this, we divide each chord into its constituent notes and take as each datum a sequence of n notes.

Thus, we are using 3*n features: for each note in the datum, there is its pitch space, its quarter length, and its offset relative to the first note in the sequence.
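
As an illustration, a hypothetical datum for n = 2 would be laid out as follows (the values are invented for the example):


      datum = [60.0, 1.0, 0.0,    # note 1: pitch space, quarter length, relative offset
               64.0, 0.5, 1.0]    # note 2: pitch space, quarter length, relative offset
    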

Classifier: Name That Tune

Let's begin with an easy task for scikit-learn to handle: classifying a measure as belonging to a specific song. This ANN will be called Name That Tune, or NTT for short.

NTT takes a single datum and returns the song title. I have 21 songs composed by Mozart as a dataset, so the available classes are these 21 songs. Note that each 'song' is in fact a movement in a larger piece; ideally, this structure could also be encoded into NTT.

To preprocess the dataset for this portion of the project, we need to split up each song, label each portion, and feed it into one of scikit-learn's classifiers. The following function parses an input song and splits its collection of notes into sequences of n notes using the `read_chord()` function from earlier.


    def read_noteGroups(file,n):
      # Parse a MIDI file and return overlapping groups of n entries,
      # each flattened to exactly 3*n features by read_chord().
      midi_data=music21.converter.parse(file)
      data=midi_data.recurse().notes
      N=len(data)
      output=[]
      for i in range(N-n):
          group_offset=data[i].offset     # offsets within a group are relative to its first entry
          group=[]
          for j in range(n):
              group.extend(read_chord(data[i+j],group_offset))
          output.append(group[0:3*n])     # chords can contribute extra notes; truncate to a fixed size
      return output
    

The data is read into Python using the following code.


      import music21

      X=[]; y=[]; i=0
      for ind in ['311','330','331','332','333','545','570']:
          for j in ['_1','_2','_3']:
              i+=1                      # class label: one per movement
              pitches_output=read_noteGroups('mozart/mz_'+ind+j+'.mid',20)
              for pitch_out in pitches_output:
                  X.append(pitch_out)   # feature vector (60 features)
                  y.append(i)           # matching class label
    

We'll use scikit-learn's RandomForestClassifier, a random forest model for classifying samples into qualitative groups. A random forest is a collection of decision trees, with each node of a tree a binary split based on a single variable extracted from the feature space. (That is to say, this variable could be a function of many features.) As the name suggests, each tree is random in the sense that, at each node, a random subset of the features is considered, and the best split among that subset is then chosen based on the training data.

For example, consider a computer vision classifier for gender. A decision node on a tree in the random forest may use hair length as its variable. If long hair correlates strongly with the female class, then this decision node will send samples with long hair into the female category and those with short hair into the male category. The tree may continue to grow, adding more decision nodes to further refine its categorization.

There are four main parameters to tweak for this model:

    n_estimators, the number of trees in the forest;
    max_features, the number of features considered when splitting a node;
    max_depth, the maximum depth allowed for each tree;
    min_samples_split, the minimum number of samples required to split a node.

Setting max_depth to None and min_samples_split to 2 results in fully developed trees (all samples split, trees grown as large as it takes). Standard practice is to set max_features to 'sqrt' (the square root of the total number of features) for classification problems like ours.

To use RandomForestClassifier, start by initializing the classifier, then feed it data. The data is stored in X, a two-dimensional array: each row (first dimension) of X is a datum, and each element in that row (second dimension) is a feature of that datum. The classifications are stored in y, a 1D array, so that the i-th element of y classifies the i-th row of X.
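
As a quick sanity check on these shapes (assuming X and y were built by the loop above, with n = 20):


      print(len(X), len(X[0]))   # number of data and features per datum (60 here)
      print(len(y))              # one label per datum; equals len(X)
    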

Data is first split into training and testing sets.


      from sklearn.model_selection import train_test_split

      # Hold out a test set (25% by default) for evaluating the classifier
      X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0)
    

We now have datasets to plug into our `RandomForestClassifier()`, which we now need to construct.


    from sklearn.ensemble import RandomForestClassifier

    # 100 fully grown trees, each split considering about sqrt(60) ~ 7 features
    clf = RandomForestClassifier(n_estimators=100, max_features='sqrt', max_depth=None, min_samples_split=2)
    clf.fit(X_train,y_train)
    

With the classifier trained, all that remains is to test it against the test dataset. This gives the fraction of the held-out data that the trained classifier gets right. A score closer to 1 indicates the classifier is good at classifying these note sequences, while a score closer to 0 indicates it cannot correctly associate note sequences with particular songs. A classifier making random guesses scores about 1/m, where m is the number of possible classes, so a classifier must beat this baseline to be considered useful. In our case, that number is 1/21, or about 4.8%.


    from sklearn.metrics import accuracy_score

    # accuracy_score expects (y_true, y_pred), though accuracy is symmetric
    print(accuracy_score(y_test, clf.predict(X_test)))
    
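
To sanity-check the 1/m baseline empirically, one can compare against scikit-learn's DummyClassifier, which guesses labels without looking at the features; a minimal sketch:


    from sklearn.dummy import DummyClassifier

    # 'uniform' picks classes at random, so accuracy should hover around 1/21
    dummy = DummyClassifier(strategy='uniform', random_state=0)
    dummy.fit(X_train, y_train)
    print(accuracy_score(y_test, dummy.predict(X_test)))
    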

I chose to take sequences of 20 notes, resulting in several thousand sequences of 60 features each. The resulting score was 87.3%. I found the choice of max_features particularly critical to the accuracy: even with longer note sequences, if the classifier could not consider more features, it was as if the sequences were significantly shorter.

Mistakes I've made along the way

One of the biggest blunders I made was my first attempt to annotate the data and encode the feature space. I originally tried to split each song into measures, keeping chords whole. The problem was that the number of notes and chords differs from measure to measure, so what is the relevant feature space? I looked only at the collection of notes in a given measure, ignoring duration and relative offset, and assigned a bit to each note: 1 if the note was present in the measure, 0 otherwise. The size of the feature space would then be the number of unique notes across the collection of songs. This turned out to be too large, so I also threw out octaves, leaving me with 12 features per measure, one per pitch class. This was not effective, as it discarded or trivialized much of the information.
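
For the record, the abandoned encoding looked roughly like the following sketch (a reconstruction, not the original code; measure_pitchClasses is a hypothetical name):


    def measure_pitchClasses(measure):
      # Flawed encoding: one bit per pitch class, octaves and rhythm discarded
      bits = [0]*12
      for entry in measure.notes:
          if isinstance(entry, music21.note.Note):
              bits[entry.pitch.pitchClass] = 1
          elif isinstance(entry, music21.chord.Chord):
              for p in entry.pitches:
                  bits[p.pitchClass] = 1
      return bits
    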

It also took me a while to realize the importance of the max_features parameter when initializing the classifier. I originally thought I only needed to increase the size of the note sequences, n, to capture enough unique information about them. But without a corresponding increase in max_features, the classifier couldn't take advantage of the larger sequences. Only when I switched to a max_features proportional to n did I see a marked improvement in the classifier's accuracy.
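
A simple way to see this effect is to sweep max_features while holding everything else fixed; a sketch (the candidate values are illustrative, for the 60-feature case):


    for mf in ['sqrt', 20, 40, 60]:
        clf = RandomForestClassifier(n_estimators=100, max_features=mf, max_depth=None, min_samples_split=2)
        clf.fit(X_train, y_train)
        print(mf, accuracy_score(y_test, clf.predict(X_test)))
    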