Edited by: Masanobu Miura, Hachinohe Institute of Technology, Japan
Reviewed by: Andrew McPherson, Queen Mary University of London, United Kingdom; Luca Turchet, Queen Mary University of London, United Kingdom
This article was submitted to Performance Science, a section of the journal Frontiers in Psychology
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Gestures in music are of paramount importance partly because they are directly linked to musicians' sound and expressiveness. At the same time, current motion capture technologies are capable of detecting body motion/gesture details very accurately. We present a machine learning approach to automatic violin bow gesture classification based on Hierarchical Hidden Markov Models (HHMM) and motion data. We recorded motion and audio data corresponding to seven representative bow techniques (détaché, martelé, spiccato, ricochet, sautillé, staccato, and bariolage).
Current motion capture technologies are capable of detecting body motion details very accurately, and they have been used in a variety of sports applications to enhance athletes' performance, as well as in rehabilitation applications (Chi et al.,
TELMI (Technology Enhanced Learning of Musical Instrument Performance) is the framework within which this study is being developed (TELMI,
Among many existing machine learning algorithms, Hidden Markov models (HMMs) have been widely applied to motion and gesture recognition. HMMs describe motion-temporal
There have been several approaches to study gestures in a musical context. Sawada and Hashimoto (
Kolesnik and Wanderley (
Caramiaux et al. (
Bevilacqua et al. (
Schedel and Fiebrink (
In the context of IoMusT, Turchet et al. (
In collaboration with the Royal College of Music, London, a set of seven gestural violin-bowing techniques was recorded as a reference by professional violinist Madeleine Mitchell. All gestures were played in G major for technical accommodation, covering three octaves across the four strings so as to span a comprehensive violin range. Below we describe the seven recorded bowing gestures (music score reference in
Music score of the seven bow strokes. All in G major, as explained in the Music Material section.
In total, 8,020 samples were recorded across the seven gestures, with a median of 35.8 samples per bow stroke and 32 bow strokes per gesture. Each bow stroke covers a time window of approximately 200 ms.
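A minimal sketch of this windowing over the IMU stream is given below; the 50 Hz rate is the Myo's nominal IMU rate, and the hop size is an assumption rather than a parameter reported here.

```python
# Hedged sketch: slicing a continuous IMU stream into ~200 ms bow-stroke windows.
# The 50 Hz rate is the Myo's nominal IMU rate; the hop size is an assumption.
import numpy as np

IMU_RATE_HZ = 50
WINDOW_S = 0.2
WIN = int(IMU_RATE_HZ * WINDOW_S)  # ~10 IMU samples per 200 ms window

def sliding_windows(imu: np.ndarray, hop: int = 1) -> np.ndarray:
    """imu: (n_samples, n_channels) array of Euler/accelerometer/gyroscope values."""
    starts = range(0, len(imu) - WIN + 1, hop)
    return np.stack([imu[s:s + WIN] for s in starts])  # (n_windows, WIN, n_channels)
```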
The Essentia library was used to extract audio features from the recordings. The descriptors extracted with real-time audio buffering analysis were:
RMS: The Root-Mean-Square descriptor informs about the absolute area under the audio waveform; in other words, it describes the power that the waveform delivers to the amplifier.
Onset: A normalized value (0.0 to 1.0) reporting locations within the frame where the onset of a musical phrase, rhythmic (percussive) event, or note has occurred.
Pitch Confidence: A value ranging from zero to one that indicates how stable the detected pitch is within the analysis buffer, as opposed to non-harmonic or tonally undefined sound.
Pitch Salience: A measure of tone sensation, ranging from zero to one, that indicates whether a sound contains several harmonics in its spectrum. It can be useful to discriminate, for instance, between purely rhythmic sounds and pitched instrumental sounds.
Spectral Complexity: It is based on the number of peaks in the sound spectrum of the analysis buffer. It is defined as the ratio between the magnitude of the spectrum's maximum peak and the “bandwidth” of the peak above half its amplitude; this ratio reveals whether the spectrum presents a pronounced maximum peak.
Strong Decay: A normalized value expressing how pronounced the distance is between the signal's energy centroid and its attack. Hence, a signal with a temporal centroid near its start boundary and high energy is said to have a strong decay.
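As a rough, frame-based illustration of this descriptor set (not the real-time buffering implementation used in the study), the following Essentia sketch extracts the six descriptors; the file name, frame size, and hop size are assumptions.

```python
# Hedged sketch: frame-wise extraction of the six descriptors with Essentia's
# Python bindings. File name, frame size, and hop size are assumptions.
import essentia.standard as es

SR, FRAME, HOP = 44100, 1024, 512
audio = es.MonoLoader(filename="bow_stroke.wav", sampleRate=SR)()  # hypothetical file

window = es.Windowing(type="hann")
fft = es.FFT()
c2p = es.CartesianToPolar()
rms = es.RMS()
pitch = es.PitchYinFFT(frameSize=FRAME, sampleRate=SR)   # returns (pitch, confidence)
salience = es.PitchSalience()
complexity = es.SpectralComplexity(sampleRate=SR)
strong_decay = es.StrongDecay(sampleRate=SR)
onset = es.OnsetDetection(method="hfc", sampleRate=SR)

features = []
for frame in es.FrameGenerator(audio, frameSize=FRAME, hopSize=HOP):
    mag, phase = c2p(fft(window(frame)))
    _, pitch_conf = pitch(mag)
    features.append({
        "rms": rms(frame),
        "onset": onset(mag, phase),
        "pitch_confidence": pitch_conf,
        "pitch_salience": salience(mag),
        "spectral_complexity": complexity(mag),
        "strong_decay": strong_decay(frame),  # undefined for silent frames
    })
```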
We used RMS, Pitch Confidence, and Onset to segment the Myo gesture data and eliminate non-gesture data. In this way, we defined meaningful gesture time-intervals and used the corresponding Myo data for training the system.
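A minimal sketch of this segmentation idea follows; the threshold values and the exact combination rule are assumptions, not the precise criterion used in the study.

```python
# Hedged sketch: marking "meaningful gesture" frames from the audio descriptors.
# Threshold values are assumptions; the study's exact rule is not reproduced here.
import numpy as np

def gesture_mask(rms, pitch_conf, onset, rms_thr=0.01, conf_thr=0.6):
    """Return a boolean mask of frames considered part of a bow-stroke gesture."""
    rms, pitch_conf, onset = map(np.asarray, (rms, pitch_conf, onset))
    active = (rms > rms_thr) & (pitch_conf > conf_thr)
    # Also keep frames where an onset was detected, so attacks are not discarded.
    return active | (onset > 0.5)

def mask_to_intervals(mask):
    """Convert a boolean frame mask into (start, end) frame-index intervals."""
    edges = np.diff(mask.astype(int), prepend=0, append=0)
    starts, ends = np.where(edges == 1)[0], np.where(edges == -1)[0]
    return list(zip(starts, ends))
```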
Besides using audio descriptors for data segmentation, a second objective was to complement the Myo information with relevant audio information so as to train the machine learning models with multimodal data. While the
We applied a Hierarchical Hidden Markov Model (HHMM) for real-time continuous gesture recognition (Schnell et al.,
HHMM-likelihood progression of a single bow-stroke phrase example for each technique. The x-axis is time (ms) and the y-axis is the likelihood of a correct prediction, from 0 to 1 (i.e., 0-100%).
Three different musical phrases covering low, mid and high pitch registers were provided for each gesture as performed by the expert. Hence, the model was trained using examples of “
Following this methodology, it is possible to obtain accurate results without the need for a large dataset of training examples. The data is sent from the custom application to the Max implementation through OSC (explained in the Synchronization section). For the regression phase, the HHMM provides an output with a normalized number corresponding to the gesture prediction, and a set of values called
Illustration of an HHMM consisting of 4 states, which emit 2 discrete likelihood estimations.
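As an illustrative, simplified analogue of this recognition scheme, the sketch below trains one Gaussian HMM per bow stroke with ten hidden states and classifies an incoming sequence by its per-gesture log-likelihood, playing the role of the likelihood progressions shown above. It uses hmmlearn rather than the hierarchical model in the Max implementation, so it is an approximation under those assumptions, not the authors' system.

```python
# Hedged sketch: a flat (non-hierarchical) approximation of the per-gesture
# HMM classifier using hmmlearn; the paper's system uses an HHMM inside Max.
import numpy as np
from hmmlearn.hmm import GaussianHMM

GESTURES = ["detache", "martele", "spiccato", "ricochet",
            "sautille", "staccato", "bariolage"]

def train_models(train_data, n_states=10):
    """train_data: dict gesture -> list of (n_frames, n_features) example arrays."""
    models = {}
    for g in GESTURES:
        X = np.vstack(train_data[g])
        lengths = [len(x) for x in train_data[g]]
        m = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        m.fit(X, lengths)
        models[g] = m
    return models

def classify(models, sequence):
    """Return per-gesture log-likelihoods and the most likely gesture label."""
    scores = {g: m.score(sequence) for g, m in models.items()}
    return scores, max(scores, key=scores.get)
```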
An instance of the
We evaluated three HHMMs: one trained with the information from the
Databases setup.
Database | Features
Audio | RMS, Onset, Pitch Confidence, Pitch Salience, Spectral Complexity, Strong Decay
Myo (IMU) | Euler, Accelerometer, Gyroscope
Combined Audio and Myo | Euler, Accelerometer, Gyroscope, RMS, Pitch Confidence
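For reference, a minimal sketch of how these three feature sets could be assembled from time-aligned streams is given below; the field names and the frame-level alignment of audio and IMU data are assumptions.

```python
# Hedged sketch: assembling the three feature sets listed in the table above.
# Field names and the frame-level alignment of the streams are assumptions.
import numpy as np

def build_feature_sets(imu, audio):
    """imu/audio: dicts of equally long, time-aligned feature arrays."""
    myo = np.column_stack([imu["euler"], imu["accelerometer"], imu["gyroscope"]])
    audio_only = np.column_stack([audio["rms"], audio["onset"],
                                  audio["pitch_confidence"], audio["pitch_salience"],
                                  audio["spectral_complexity"], audio["strong_decay"]])
    combined = np.column_stack([myo, audio["rms"], audio["pitch_confidence"]])
    return {"audio": audio_only, "myo": myo, "combined": combined}
```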
We trained decision-tree models using three feature sets: Myo motion features, audio features, and motion and audio features combined. Applying 10-fold cross-validation, we obtained correctly classified instances percentages of 93.32, 39.01, and 94.62% for the motion-only, audio-only, and combined feature sets, respectively. As can be seen in the confusion matrix reported in
Confusion matrix (decision tree).
Class | Détaché | Martelé | Spiccato | Ricochet | Sautillé | Staccato | Bariolage
Détaché | 0.963 | 0.000 | 0.005 | 0.001 | 0.031 | 0.000 | 0.000
Martelé | 0.001 | 0.950 | 0.000 | 0.027 | 0.000 | 0.011 | 0.012
Spiccato | 0.000 | 0.001 | 0.999 | 0.000 | 0.000 | 0.000 | 0.000
Ricochet | 0.000 | 0.025 | 0.001 | 0.951 | 0.000 | 0.017 | 0.006
Sautillé | 0.040 | 0.002 | 0.000 | 0.001 | 0.955 | 0.001 | 0.000
Staccato | 0.000 | 0.092 | 0.000 | 0.095 | 0.003 | 0.725 | 0.084
Bariolage | 0.000 | 0.030 | 0.000 | 0.037 | 0.000 | 0.050 | 0.882
Accuracy by class (combined audio and motion).
Class | TP Rate | FP Rate | Precision | Recall | F-Measure | MCC | ROC Area | PRC Area
Détaché | 0.963 | 0.005 | 0.979 | 0.963 | 0.971 | 0.964 | 0.988 | 0.967
Martelé | 0.950 | 0.015 | 0.948 | 0.950 | 0.949 | 0.934 | 0.975 | 0.940
Spiccato | 0.999 | 0.001 | 0.993 | 0.999 | 0.996 | 0.995 | 0.999 | 0.993
Ricochet | 0.951 | 0.016 | 0.936 | 0.951 | 0.943 | 0.929 | 0.975 | 0.905
Sautillé | 0.955 | 0.007 | 0.938 | 0.955 | 0.947 | 0.940 | 0.987 | 0.942
Staccato | 0.725 | 0.010 | 0.773 | 0.725 | 0.749 | 0.738 | 0.903 | 0.682
Bariolage | 0.882 | 0.008 | 0.889 | 0.882 | 0.886 | 0.877 | 0.960 | 0.865
Weighted Avg. | 0.946 | 0.010 | 0.946 | 0.946 | 0.946 | 0.937 | 0.978 | 0.930
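As an illustrative approximation of the decision-tree evaluation above (the exact toolchain and hyper-parameters are not reproduced here), a generic scikit-learn sketch might look as follows; the feature matrices are assumed to be the ones assembled earlier.

```python
# Hedged sketch: 10-fold cross-validation of a decision tree on one feature set.
# Generic scikit-learn illustration, not the exact tool or settings of the study.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate(X, y, random_state=0):
    """X: (n_samples, n_features) feature matrix; y: gesture label per sample."""
    clf = DecisionTreeClassifier(random_state=random_state)
    y_pred = cross_val_predict(clf, X, y, cv=10)
    acc = accuracy_score(y, y_pred)                     # correctly classified instances
    cm = confusion_matrix(y, y_pred, normalize="true")  # row-normalized, as in the table
    return acc, cm
```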
We trained the HHMM previously described for real-time gesture estimation, resulting in a correctly classified instances percentage of 100% for détaché, martelé, and spiccato; 95.1% for ricochet; 96.1% for sautillé; 88.1% for staccato; and 98.4% for bariolage. These percentages represent the median of the gesture estimation in time. Each bow stroke has ten internal temporal states, and the model produces evaluations as likelihood-probability progressions. The box-plot in the
Box-plot summarizing all HHMM-likelihood progressions in 7,846 samples with a mean of 42,394 samples per gesture. Bow strokes are organized as:
X-axis: gesture collection. Y-axis: 0 to 1 range as a percentage of correct estimations (1:100). The graph summarizes the correct estimations of all gestures and their similarity. For instance, gestures Détaché and Spiccato have some similarities in motion, as they are closely described by the likelihood probability. Articulations: (1) Détaché, (2) Martelé, (3) Spiccato, (4) Ricochet, (5) Sautillé, (6) Staccato, (7) Bariolage.
Confusion matrix (HHMM).
Class | Détaché | Martelé | Spiccato | Ricochet | Sautillé | Staccato | Bariolage
Détaché | 1.000 | 0.335 | 0.673 | 0.050 | 0.643 | 0.000 | 0.514
Martelé | 0.007 | 1.000 | 0.000 | 0.251 | 0.075 | 0.473 | 0.016
Spiccato | 0.551 | 0.000 | 1.000 | 0.000 | 0.200 | 0.000 | 0.334
Ricochet | 0.004 | 0.671 | 0.047 | 0.951 | 0.105 | 0.422 | 0.823
Sautillé | 0.299 | 0.491 | 0.000 | 0.000 | 0.961 | 0.000 | 0.000
Staccato | 0.000 | 0.331 | 0.000 | 0.447 | 0.165 | 0.881 | 0.690
Bariolage | 0.319 | 0.000 | 0.041 | 0.103 | 0.150 | 0.248 | 0.984
openFrameworks implementation to visualize and synchronize the IMU and audio data. It reports the probability of the performed bow stroke in a spider chart.
Cluster: Euler-angle spatial distribution of the seven articulations from the
A single sample of the gestural phrase for each bow-stroke technique.
In the TELMI project, colleagues are developing interactive applications that provide students with information about the quality of their sound and the temporal precision of their interpretation. In future work, we intend to embed the IMU sensors into the bow and violin and to merge both strategies: postural sensing technologies and a desktop/online app. Furthermore, we plan to implement the IMU device called
The datasets generated and analyzed for this study can be found in
DD recorded, processed and analyzed the motion and audio data, and wrote the paper. RR supervised the methodology, the processing, and the analysis of the data, and contributed to the writing of the paper.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We would like to thank Madeleine Mitchell from the Royal College of Music, London, for her willingness to participate in the recordings of the data used in the study.
1. MIT License. This gives everyone the freedom to use OF in any context: commercial or non-commercial, public or private, open or closed source.