November 1999

The Academic College of Tel Aviv – Yaffo
Department of Computer Science
Final Project in Artificial Neural Networks:

Speedy Composer
(Artificial Neural Network Melody Composer)

URL: https://www.speedysoftware.com/composer/


Presented by:
Instructor:

Introduction:

The aim of this project is to research and implement an Artificial Neural Network Melody Composer – a computer program that uses Artificial Neural Networks to compose melodies. The networks are trained on given melodies, most of them melodies of famous Hebrew songs. Then the networks are used to compose new music in a style similar to that of the original melodies.

The goal is to train Artificial Neural Networks on notes in the following way: the network receives input about the current chord, the next chord, and the history of the notes in the melody up to the current note, and is trained to select the next note in the melody. When composing, the program works note by note, composing each note and then moving on to the next. There is no need to use recurrent networks – the history is fed to the input units explicitly.

The main purpose is to compose melodies which sound pleasant and interesting to a human listener. For this reason, I decided to focus on melodies composed according to a given chord progression, supplied by the user or taken from a known song. I also experimented with composing melodies without selecting chords in advance, but in that case the melodies sounded less interesting and aimless. So I decided to focus on composing melodies according to given chords.

In this project, I focused solely on composing single-note melodies, for a few reasons:

  1. Most of the known songs and melodies I'm familiar with are single-note melodies.
  2. Single-note melodies are interesting enough.
  3. This area has not been researched much, and success with single-note melodies is satisfying enough.

I didn't want to restrict the melodies to a given pace, so the training set contains both waltz melodies (pace 3/4) and melodies with pace 2/4 or 4/4.

The reason for choosing to compose music with Artificial Neural Networks is that I believe there are many "unwritten rules" in the songs we know and like, which are difficult to put a finger on, but which an Artificial Neural Network can find and implement.

Background:

In the book "Music and Connectionism", by Peter M. Todd & D. Gareth Loy, there are a few articles regarding composing music. most of the articles refer to compositions by Artificial Neural Networks. I will mention a few of them:

In Dan Gang's site, there are a few articles about musical Artificial Neural Network projects, developed by Gang in cooperation with other computer science and music experts. There are several projects with different targets, especially regarding composing music or harmonising (adding chords to) a given melody. The writers have made impressive achievements in these areas, and succeeded in building Artificial Neural Networks which generalize well. They usually represented the input with an exponential decay method similar to Todd's, in addition to other units which represent the pace. In the majority of cases they used feed-forward Artificial Neural Networks with one hidden layer. Dan Gang also gives courses on this subject. I learned a lot from their experience and applied it in my project.

There is a whole area of research and software dealing with fractal music, or music based on the appearance of prime numbers on the number line. These programs create music by using fractal equations and converting the sequences of numbers they produce into sequences of notes, with similar methods for sequences of prime numbers. Many people use these methods, and I estimate that at least 80% of the work on automatic composition of music focuses on them. In particular, there are a lot of websites that deal with this kind of composing. There is a certain regularity in fractal music, and in fractals in general, but personally I think this music is usually not interesting enough. I think other methods, including Artificial Neural Networks, can achieve better results.

I have not yet encountered commercial programs for composing music (except for fractal music, as mentioned above), and most of the existing material was developed and written for academic or research purposes. Not many people have tried to develop programs for automatic music composition, and only a few of them did so with Artificial Neural Networks. For this reason, I found a special challenge in this project, which is done in an area in which there is still a lot to discover.

The network architecture and output:

I have considered a few training algorithms and network architectures, including one-layer perceptrons, and feed-forward, cascade-forward and Elman back-propagation networks with up to 2 hidden layers. I did a lot of testing, including training feed-forward and cascade-forward networks with 1 or 2 hidden layers on the songs, and testing the networks on the same songs. Then I compared the results of the trained networks – usually the best results were achieved with feed-forward networks with 2 hidden layers, 20 to 40 units in the first hidden layer, and 15 to 20 units in the second hidden layer. The results also depend on randomness, so networks with identical structure do not always lead to the same results.
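For illustration, here is a minimal MATLAB sketch of how such a network can be created with the Neural Network Toolbox. The unit counts are typical values from above; the variable names P and T for the input and target matrices, and the logsig transfer functions, are assumptions and not necessarily the choices made in train_all.m:

    % Minimal sketch: a feed-forward network with 2 hidden layers.
    % P holds one input column per note, T the matching target columns.
    n_hidden1 = 30;                       % 20 to 40 units in the first hidden layer
    n_hidden2 = 18;                       % 15 to 20 units in the second hidden layer
    n_outputs = size(T, 1);               % 12 pitch class units or 7 interval units

    % newff(PR, [S1 S2 S3], {TF1 TF2 TF3}, BTF): PR holds the minimum and
    % maximum of each input unit, BTF is the training function (TRAINRP was
    % eventually selected; see the training algorithm section below).
    net = newff(minmax(P), [n_hidden1 n_hidden2 n_outputs], ...
                {'logsig', 'logsig', 'logsig'}, 'trainrp');
    net = train(net, P, T);               % train on the prepared data
    Y = sim(net, P);                      % simulate the trained network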

I compared training one network with both the pitch class units (12 units) and the interval units (7 units) as output units, against training two separate networks on each set of units (with the same input set, and the same number of units in the input and hidden layers). It appeared that the separate networks perform better than the combined network, even when the training time of the combined network equals the sum of the training times of the two separate networks. So I decided to use pairs of separate networks – each pair has one pitch class network (12 output units) and one interval network (7 output units), both trained on the same training set and input representation, and identical in the number of units in the input and hidden layers.

The pitch class and interval networks do not decide when a new note will begin. I have trained a separate note_begin network for this purpose. This network receives input in the same representation, and has to decide if the next note will begin now or later (it's also possible to do this in other ways, for example with a few output units, each representing a note length). So this network has only one output unit, which should be 1 for a new note and 0 for no new note. In the beginning, I added this output unit to the other networks, and trained one network for all 3 purposes. But I found out that, since in many cases no new note begins, this confuses the other networks, which then tend to repeat the last note too often. So I decided to separate the networks, and train the other 2 networks only on notes where a note actually begins.

Currently, there are 3 separate networks, each of them with a different task.

The first network is the note_begin network; it has one output unit, which decides if the next note will begin now or later. Currently, this network's output is treated in a deterministic way rather than a random way – any value above 0.5 is rounded to 1, and any value below 0.5 is rounded to 0. This network is optional, and it's also possible to use the pace of a given melody or template.

The other two networks are the pitch class and interval networks. Both networks are trained on the same data, which contains only the time steps where a note actually begins. When composing, these networks are used only after Speedy Composer has already decided that a new note begins.

The notes are chosen randomly – the output units are taken as probabilities for selecting the next note, with 12 output units for the 12 pitch classes, and 7 units for the intervals (the interval ranges are -12 to -6, -5 to -3, -2 to -1, 0, +1 to +2, +3 to +5, and +6 to +12). All probabilities lower than X_max times the maximum value are discarded (treated as 0), where X_max is a number between 0 and 1, with a typical value of 0.25 (X_max does not have to be the same number for each unit – it's possible to give different numbers for each unit to manipulate the results). Note that if X_max is close to 1, only the unit with the maximum value is taken, resulting in a more deterministic result; and if X_max is close to 0, all values are taken. The purpose of X_max is to force the next note to be selected from the more probable choices.

For each possible note, the probability of its pitch class is multiplied by the probability of its interval. This can sometimes lead to all probabilities being 0 – in this case, the last note is taken again.

All probabilities are multiplied by an array called pitch_probability_array, with one entry for each possible absolute pitch (from C4 to C7), whose purpose is to make pitches closer to the edges less probable. Here are the values of pitch_probability_array (see the graph in Appendix 5):

C4 :    0
C#4:    0.0008
D4 :    0.0063
D#4:    0.0210
E4 :    0.0490
F4 :    0.0939
F#4:    0.1585
G4 :    0.2447
G#4:    0.3536
A4 :    0.4850
A#4:    0.6381
B4 :    0.8107
C5 :    1.0000
C#5:    1.0000
D5 :    1.0000
D#5:    1.0000
E5 :    1.0000
F5 :    1.0000
F#5:    1.0000
G5 :    1.0000
G#5:    1.0000
A5 :    1.0000
A#5:    1.0000
B5 :    1.0000
C6 :    1.0000
C#6:    0.8107
D6 :    0.6381
D#6:    0.4850
E6 :    0.3536
F6 :    0.2447
F#6:    0.1585
G6 :    0.0939
G#6:    0.0490
A6 :    0.0210
A#6:    0.0063
B6 :    0.0008
C7 :    0
See compose.m for the function by which this array is calculated.
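To make the selection mechanism concrete, here is a minimal MATLAB sketch of the process described above. The variable names are assumptions, the X_max threshold is applied here to each network's outputs separately, and the normalization before the random draw is implied rather than stated – see compose.m for the actual implementation:

    % Assumed inputs:
    %   pitch_out    - 12 outputs of the pitch class network (sim result)
    %   interval_out - 7 outputs of the interval network (sim result)
    %   last_pitch   - absolute pitch of the previous note (1 = C4 .. 37 = C7)
    %   pitch_probability_array - the 37 values listed above
    X_max = 0.25;

    % Discard outputs below X_max times the maximum output of each network.
    pitch_out(pitch_out < X_max * max(pitch_out)) = 0;
    interval_out(interval_out < X_max * max(interval_out)) = 0;

    % The 7 interval ranges, as lower and upper bounds in semitones.
    low  = [-12 -5 -2 0 1 3 6];
    high = [-6 -3 -1 0 2 5 12];

    p = zeros(37, 1);                   % probability of each pitch C4..C7
    for pitch = 1:37
        pc = mod(pitch - 1, 12) + 1;    % pitch class of this pitch
        iv = pitch - last_pitch;        % interval from the last note
        k = find(iv >= low & iv <= high);  % which interval range it is in
        if ~isempty(k)
            p(pitch) = pitch_out(pc) * interval_out(k) * ...
                       pitch_probability_array(pitch);
        end
    end

    if sum(p) == 0
        next_pitch = last_pitch;        % all probabilities 0: repeat last note
    else
        p = p / sum(p);                 % normalize, then draw randomly
        next_pitch = min(find(cumsum(p) >= rand));
    end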

Representing the input for the networks:

There has been a lot of research about different ways of representing the input for Artificial Neural Networks, and many articles have been written about it. In particular, I considered Peter M. Todd's and Michael C. Mozer's representations (described in their articles in the book "Music and Connectionism"), and chose Todd's representation of pitch with exponential decay, with a few changes (in a decay unit, the new value is added to the previous value after decay; the decay factor must be less than 1 and not negative). I divided the notes into 12 pitch classes (C, C#, D, D#, E, F, F#, G, G#, A, A#, B), and each pitch class is represented by one decay unit in each input decay set. I tested the networks with a variety of decay set numbers and decay factors, and found that the best performance was usually achieved with 3 input decay sets for pitch class and 3 input decay sets for the interval between notes, with decay factors of about 0.9 to 0.95, 0.4 to 0.7 and 0 respectively (a typical network can have, for example, decay factors of 0.9, 0.6 and 0). Notice that 0 means no decay – this input set contains only the last value given.
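For example, one pitch class decay set can be updated like this in MATLAB (a sketch only; melody_pc is an assumed vector of pitch classes 1..12, one per note):

    % Sketch: one pitch class decay set of 12 units, updated once per note.
    d = 0.9;                            % decay factor of this set (0 <= d < 1)
    decay_set = zeros(12, 1);           % all units start at 0
    for i = 1:length(melody_pc)
        decay_set = d * decay_set;      % decay the previous values
        decay_set(melody_pc(i)) = decay_set(melody_pc(i)) + 1;
    end
    % With d = 0 the set holds only the last note. The unit values stay in
    % the range 0 to 1/(1-d), since 1 + d + d^2 + ... = 1/(1-d).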

The idea of having more than one decay set is to let the network have "long term memory", "short term memory" and "last note memory" at the same time, and it resulted in networks that learn well. But the performance of the networks also depends on chance, and sometimes networks with fewer input decay sets performed better. When comparing networks and input representations with each other, all networks were trained for the same period of time (usually in the range of 10 to 40 minutes, depending on the size of the network and the training set) and on the same number of notes, consisting of either the entire data set or a randomly chosen part of it.

More input units represent the chords – each chord is represented by 12 pitch-class units. For example, the chord "C Major" is represented by setting the pitch classes C, E and G to 1, and all other units to 0. The networks include one input set for the next chord, and one or two input sets for the average of the chords starting at the current note and counting n notes, where n is 1 to 6. A typical network contains, for example, two such sets with n = 1 and n = 4. Notice that n = 1 means the current chord (no average is taken).
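In MATLAB terms, this representation looks roughly as follows (a sketch; the `chords` matrix holding one chord column per note is an assumed name):

    % Sketch: the chord "C Major" as 12 pitch class units
    % (order: C C# D D# E F F# G G# A A# B).
    c_major = zeros(12, 1);
    c_major([1 5 8]) = 1;               % C, E and G are 1, all other units 0

    % Averaging the chords of the next n notes (here n = 4):
    n = 4;
    chords = repmat(c_major, 1, n);     % assumed: one chord column per note
    avg_chord = mean(chords, 2);        % with n = 1 this is just the current chord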

More input units represent the position of the note relative to the chords (a new chord usually marks the beginning of a bar, so no explicit information about bars is given), in this way: one unit represents whether this note is at the beginning of a beat or in the middle of it (the resolution of all songs is half a beat). Another set of n units holds 1 wherever a chord starts at the relevant beat (n is always an odd number). The middle unit always represents the current beat, and the rest represent the neighbouring beats. For example, if n=5, this set represents 2 beats backwards and 2 beats forward. I also compared networks trained with this number of beats ranging from 0 to 8, and usually the best performance was around 2 beats.

(see file train_all.m for the implementation of preparing the input and output, and training the networks).

Here are a few examples, where the number of beats is 2 and the number of units is 5:
0 0 1 0 0 – chord begins at current beat, with pace 3/4 or 4/4.
1 0 1 0 1 – chord begins at current beat, with pace 2/4.
0 1 0 1 0 – middle of bar, pace 2/4.
1 0 0 0 1 – middle of bar, pace 4/4.
0 1 0 0 1 – one beat after the chord, pace 3/4.
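A minimal MATLAB sketch of how this set of units can be built (chord_begin and beat are assumed names, with one entry per beat):

    % Sketch: chord neighbourhood units for the current beat, 2 beats in
    % each direction (5 units). chord_begin holds 1 where a chord begins.
    chord_begin = [1 0 0 1 0 0 1 0 0];  % pace 3/4: a chord every 3 beats
    beat = 4;                           % current beat (a chord begins here)
    n_beats = 2;
    units = zeros(1, 2 * n_beats + 1);
    for k = -n_beats : n_beats
        b = beat + k;                   % index of the neighbour beat
        if b >= 1 & b <= length(chord_begin)
            units(k + n_beats + 1) = chord_begin(b);
        end
    end
    % units is now 0 0 1 0 0 - the first example above.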

All input unit values range from 0 to 1 (most of them receive only the values 0 or 1), except that in the decay units the value is added to the previous value after decay. So for a decay unit with decay factor d, the range is 0 to 1 + d + d^2 + ... = 1/(1-d).

The input representation lets the networks be independent of the absolute pitch height – the same melody, transposed by a full octave, leads to exactly the same input. But transposing a melody to a different key does affect the input. Currently, all the melodies used for training are in the same key (C Major or A minor), and the networks are trained to compose melodies in this key.

It would be interesting to train the networks on melodies in more than one key – maybe even on melodies in all 12 possible keys (each melody can appear in more than one key in the training set). I haven't tried this yet, for a few reasons. First, it would lead to a much larger training set (12 times larger). Second, it could lead to melodies which skip from one key to another (this could be overcome by adding input units that represent the melody's key). And third, I didn't want to confuse the networks too much.

In the beginning, I used networks with fewer input units and less data – only 2 sets of pitch class decay units (12 units each) and 1 set of interval decay units (7 units), 31 units in total in the input layer. I used these networks to compare different numbers of units in the hidden layers. Later I added more decay sets, and compared different decay factors for the networks. As the project advanced, I added more information about the chords, the pace etc., and this led to better performance of the networks.

In the future, it would be a good idea to give the networks more data. For example, currently the networks don't receive any data about the length of the note being composed. This data could help the networks learn better. Other possible data is, for example, extending the chord neighbourhood units to half beats (this would double the number of such units), giving information about where the notes begin, and so on.


Number of units     Number of sets     Description                                          Total units
(range / typical)   (range / typical)                                                       (typical)
-----------------   -----------------  ---------------------------------------------------  -----------
7 – 12 / 12         0 – 1 / 1          Next chord in melody. Each unit represents a         12
                                       pitch class.
7 – 12 / 12         0 – 1 / 1          Current chord in melody. Each unit represents a      12
                                       pitch class.
7 – 12 / 12         0 – 2 / 0 or 1     Average of the chords for the next n beats           0 or 12
                                       (typically 2 or 3 beats). Each unit represents a
                                       pitch class.
7 – 12 / 12         0 – 3 / 3          Decay sets for pitch class. Typical decay values     36
                                       are 0.9, 0.6 and 0. Each unit represents a pitch
                                       class.
3 – 25 / 7          0 – 3 / 3          Decay sets for interval. Typical decay values are    21
                                       0.9, 0.6 and 0. Each unit represents an interval
                                       range.
1 – 9 / 5           0 – 1 / 1          Represents where the chords begin in the             5
                                       neighborhood of the current note. Each unit
                                       represents a beat. Chords never start at half
                                       beats.
1 / 1               0 – 1 / 1          Indicates if the current note starts at a full       1
                                       beat or a half beat.
1 – 12 / -          0 – 1 / 0          Represents where the notes begin in the              0
                                       neighborhood of the current note. Each unit
                                       represents half a beat. Currently not used.
                                                                                   Total:   87 or 99

The training set:

There are 63 melodies in the training set; 20 of them are waltz (3/4) melodies, and the rest (43 melodies) have a pace of 4/4 or 2/4. Most of the melodies are melodies of known Hebrew songs, plus a few more songs I added. Most of the melodies (except 12 of the waltzes) appear twice in the training set, once at normal speed and once at half speed – so that the half-speed melodies, when played at normal speed, also contain 1/16 notes. The total number of melodies in the training set is 114 (after duplication). All the melodies in the training set were transposed to the keys of C Major or A minor, to improve the networks' learning and generalization.

I entered the songs into the computer from a few books of popular Hebrew songs, using a shareware program called NoteWorthy Composer. All songs contain one track of melody and one track of chords. They are translated into MIDI format, and then into text format using midinote.exe (a freeware program written by Guenter Nagler), and then read by a C++ program I wrote (makedata.cpp), which validates the data and writes it in my own simple format – an array of 32-bit integers, each of them representing a note value, chord value, note_begin, chord_begin and some extra information.

The total number of 1/8 notes in the training set is 46365 (17433 separate notes with 6988 separate chords), but each melody in the training set is repeated twice, and only the second repetition is taken. This is to provide some history of previous notes, even for the first note. I have also decided to include repeated notes only once: a note is skipped if the same note, with identical notes in the preceding 12 beats (24 * 1/8 notes), has already appeared in the same song. This is because many songs contain parts that repeat more than once. This brings the total number of 1/8 notes in the training set to 20734 (7982 separate notes with 3129 separate chords).

I have trained each kind of network a few times on the same input, each time using slightly different values for the input representation, the decay factors and the number of units in each layer. Then I compared the results of the trained networks, and selected the networks with the best performance.

The note_begin network is trained on 8000 notes, randomly chosen from the 20734 notes in the training set, such that in 4000 of them a new note begins and in the remaining 4000 a note continues. I have trained 6 note_begin networks on randomly chosen data.

I have trained the pitch class and interval networks on 2 training sets. One is made of all the notes in the training set, taken only where a note begins. This training set consists of a total of 7982 notes, and I have selected 6 pairs of networks trained on this set, out of 38 trained pairs.

The second training set is made only of the notes where the chords are the Major chords C, F, G or G7. This training set consists of a total of 3321 notes, and I have 6 pairs of networks trained on this set. The networks trained on this set are used to compose melodies that contain only the above chords.

The training algorithm:

I have considered a few training algorithms in MATLAB, including TRAINGD (gradient descent back-propagation), TRAINGDX (gradient descent with momentum and adaptive learning rate back-propagation), TRAINELM (Elman recurrent network), TRAINLM (Levenberg-Marquardt back-propagation), TRAINBFG (BFGS quasi-Newton back-propagation), TRAINOSS (one-step secant back-propagation), and TRAINRP (RPROP back-propagation). I selected TRAINRP, which gave the best results. Some algorithms, for example TRAINLM, did not work due to the large amount of memory required (because of my large training set). The performance of networks trained with TRAINGD, TRAINGDX etc. was not as good as the performance of networks trained with TRAINRP.
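Such a comparison can be run in MATLAB roughly as follows (a sketch only; P and T are assumed input and target matrices, and the epoch and time limits here are assumptions, not the values used in the project):

    % Sketch: training the same architecture with different training
    % functions and comparing the mean squared error on the training data.
    funcs = {'traingd', 'traingdx', 'trainrp'};
    for i = 1:length(funcs)
        net = newff(minmax(P), [30 18 size(T, 1)], ...
                    {'logsig', 'logsig', 'logsig'}, funcs{i});
        net.trainParam.epochs = 500;    % assumed stopping criteria
        net.trainParam.time = 20 * 60;  % stop after 20 minutes of training
        net = train(net, P, T);
        err(i) = mean(mean((sim(net, P) - T) .^ 2));
    end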

TRAINRP is a network training function that updates weight and bias values according to the resilient back-propagation algorithm (RPROP). It eliminates the harmful effect of having a small slope at the extreme ends of sigmoid squashing transfer functions. RPROP is generally much faster than the standard steepest descent algorithm. It also has the nice property that it requires only a modest increase in memory requirements. We do need to store the update values for each weight and bias, which is equivalent to storage of the gradient.

Multilayer networks typically use sigmoid transfer functions in the hidden layers. These functions are often called squashing functions, since they compress an infinite input range into a finite output range. Sigmoid functions are characterized by the fact that their slope must approach zero as the input gets large. This causes a problem when using steepest descent to train a multilayer network with sigmoid functions, since the gradient can have a very small magnitude, and therefore cause small changes in the weights and biases, even though the weights and biases are far from their optimal values. The purpose of the resilient back-propagation (RPROP) training algorithm is to eliminate these harmful effects of the magnitudes of the partial derivatives. Only the sign of the derivative is used to determine the direction of the weight update; the magnitude of the derivative has no effect on the weight update. The size of the weight change is determined by a separate update value. The update value for each weight and bias is increased by a factor delt_inc whenever the derivative of the performance function with respect to that weight has the same sign for two successive iterations. The update value is decreased by a factor delt_dec whenever the derivative with respect to that weight changes sign from the previous iteration. If the derivative is zero, then the update value remains the same. Whenever the weights are oscillating, the weight change will be reduced. If the weight continues to change in the same direction for several iterations, then the magnitude of the weight change will be increased.

A complete description of the RPROP algorithm is given in Riedmiller and Braun's article "A direct adaptive method for faster back-propagation learning: The RPROP algorithm". For more information about each training function, see MATLAB's Neural Network Toolbox User's Guide (available in PDF format).

Comparing the networks to each other:

After training many networks, with different input representations and different numbers of units in the input and hidden layers, I wanted to compare the networks with each other. I usually compared the networks by simulating them on the same data used for training. It would be interesting to also test the networks on melodies not in the training set.

I used 3 methods for comparing the networks to each other (always comparing networks with the same number of output units, trained on the same number of notes, and for the same training time).

The first method compares the average squared error of each output unit.

The second method is rounding the output units to 0 or 1 (values above 0.5 are rounded up, and values below 0.5 are rounded down). Then I compared the percentage of notes for which the rounded output units match the expected values for all output units.

The third method is used only with the pitch class and interval networks. In these networks, one output unit should be 1 and the rest should be 0. This method is usually called "winner takes all" – comparing the percentage of notes for which the output unit with the highest value is the output unit that is expected to be 1. This usually leads to a higher percentage of success, because the units are compared relative to each other. In many cases, the output unit with the highest value is the expected output unit, but its output is still lower than 0.5.
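As a hedged illustration, here is a minimal MATLAB sketch of the second and third methods, assuming Y is the simulated output and T the target matrix, with one column per note:

    % Method two: all rounded output units must match the target.
    Y = sim(net, P);
    success2 = sum(all(round(Y) == T)) / size(T, 2) * 100;

    % Method three ("winner takes all"): the highest output unit must be
    % the unit expected to be 1.
    [tmp, winner] = max(Y);             % index of the highest output unit
    [tmp, expected] = max(T);           % index of the expected unit
    success3 = sum(winner == expected) / size(T, 2) * 100;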

(see file check_train_all.m for the implementation of testing the nets).

Network               Output units   Notes in training set   Best success,    Best success,
                                                             second method    third method
-------------------   ------------   ---------------------   -------------    -------------
note_begin network    1              8000                    89.1%            -
interval network      7              7982                    33.4%            57.6%
pitch class network   12             7982                    28.6%            55.5%

Composing new melodies:

It’s possible either to let Speedy Composer decide when the new notes will begin (using the note_begin network), or to use the pace of a given song to compose new melodies. I’m using both methods to compose.

The chords for the composed melody are taken from a given song or template.

The composition is done note by note – after composing each note, the program updates the input units with the composed note's data, and then moves on to the next note. This continues until the end of the melody. The pitch class and interval networks are used only when a new note begins.
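In outline, the composition loop looks like this (a sketch only – build_input, select_note and update_history are hypothetical helper names standing for the steps described in the previous sections, and n_steps, begin_net, pitch_net and interval_net are assumed names; see compose.m for the actual implementation):

    % Sketch of the note-by-note composition loop.
    for step = 1:n_steps                        % one step per half beat
        x = build_input(history, chords, step); % hypothetical helper
        if sim(begin_net, x) > 0.5              % note_begin: new note now?
            pitch_out = sim(pitch_net, x);
            interval_out = sim(interval_net, x);
            last_pitch = select_note(pitch_out, interval_out, last_pitch);
        end
        history = update_history(history, last_pitch);  % decay units etc.
    end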

The networks were trained only to compose the next note for a given history of notes and chords, so when I used them to compose the beginning of a melody as well, the beginning was usually primitive, and often the same beginning appeared in more than one melody. Also, the input representation doesn't give the networks much information about when a melody ends, although they can see when the last chord in a melody has been reached (when the next chord is a silence). When composing with the note_begin network, this network usually doesn't start a new note after it has reached a harmonic ending – when the network has reached the last chord (bar), and the last note is identical to the last chord. But when composing according to a given pace, the pitch class and interval networks don't "know" when the last note in a melody is being composed. So I usually forced the composing program to give a note identical to the last chord (ignoring the pitch class network) when composing the last note of a melody.

In the future, we could try to test the "quality" or "harmony" of a composed melody automatically, and keep only some of the composed melodies. This can also be done with Neural Networks. One possible way to do it is by checking the expectations of trained networks for the melody, and seeing if all the notes (or most of them) were chosen with high probabilities. The higher the probabilities of the notes, the better the melody fits the melodies in the training set. So the probabilities can give some indication of whether the melody is harmonic or not. In this case, I think we would have to check all values above 0 (by setting the X_max factor to 0), and it would be a good idea to test one melody with a few neural networks, different from the network that composed the melody. It would be interesting to test this; I haven't implemented it yet. It would also be interesting to compose melodies using an average of a few Neural Networks.
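If implemented, such a check could look roughly like this in MATLAB (a sketch of the proposal only, since it has not been implemented; P and T are assumed to hold the inputs and the one-hot note choices of the composed melody):

    % Sketch of the proposed quality check: average the output value a
    % trained pitch class network assigns to the note actually chosen at
    % each step of the composed melody.
    Y = sim(pitch_net, P);              % P: one input column per note
    score = mean(sum(Y .* T));          % T is one-hot, so sum(Y .* T) picks
                                        % the expected unit's output per note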

Conclusions:

The melodies composed sound very good and interesting, and fit the chords and the pace of the given song – different melodies are composed for the same given chord sequence (and sometimes pace). Because of the randomness in the composition, each composed melody is unique, and of course some sound better than others. But it's also possible to compose songs in a deterministic way – by always choosing the note with the highest probability. It's possible to decide on the amount of randomness given to a composed song.

Future research and projects:

Here are a few ideas for future projects:


Acknowledgements:

Thanks to all the people who helped me work on this project, with advice and information about the subject.

Special thanks to Guenter Nagler, whose free programs for reading and writing MIDI files saved me a lot of work. He is also a good friend, and helped me a lot with his advice. It was his idea to compose melodies according to a given chord progression.


