The previous blog in this miniseries gave an introduction to sound classification in general. Now it's time to dive a bit deeper into the technology behind it. In fact, there is no single type of technology used for this purpose. The most basic classification systems are rule-based: a type of so-called expert system. Such a system decides whether a sound belongs to a particular class using simple "if X then Y" rules. These rules can be combined into more complex structures called decision trees, and when multiple decision trees are combined, the classifier becomes a random forest. While the most basic rule-based systems need to be designed manually, decision trees and random forests can also be trained. Training algorithms by letting them learn from data is called machine learning.
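To make the progression concrete, here is a minimal sketch in pure Python. The feature names (loudness, duration) and thresholds are invented for illustration only; a real system would use many more features and learn the tree structure from data.

```python
def rule(loudness_db, duration_s):
    # A single hand-written "if X then Y" rule.
    return "impact" if loudness_db > 70 else "other"

def tree(loudness_db, duration_s):
    # Rules combined into a small decision tree.
    if loudness_db > 70:
        if duration_s < 0.5:
            return "impact"
        return "alarm"
    return "other"

def forest(loudness_db, duration_s):
    # A tiny "random forest": several (here slightly varied) trees
    # each cast a vote, and the majority class wins.
    votes = [
        tree(loudness_db, duration_s),
        tree(loudness_db - 5, duration_s),
        tree(loudness_db + 5, duration_s),
    ]
    return max(set(votes), key=votes.count)

print(tree(80, 0.2))    # "impact"
print(forest(80, 0.2))
```

In a trained random forest the trees differ because each is fit on a different random subset of the data, rather than on shifted inputs as in this toy version.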
Nowadays, neural networks are the most popular type of machine learning algorithm for common types of inputs like images, text or audio. This is because neural networks can learn complex, non-linear relationships between inputs and outputs, which is often necessary to make accurate predictions about the class of an image or sound. As the name suggests, a neural network is a network of neurons (or nodes), each of which can process multiple inputs and apply some form of non-linearity to them. Exactly how the inputs are combined, through parameters called weights, is what gets adjusted during the learning process. This type of structure is loosely inspired by the human brain.
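A single artificial neuron is simple enough to write out in a few lines. The sketch below uses a sigmoid as the non-linearity; the weight and input values are arbitrary, chosen purely for illustration.

```python
import math

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs plus a bias term...
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # ...passed through a non-linearity (here: the sigmoid function).
    return 1.0 / (1.0 + math.exp(-z))

# The weights and bias are the parameters adjusted during training.
out = neuron([0.5, -1.2, 3.0], [0.8, 0.1, -0.4], bias=0.2)
print(out)  # a value between 0 and 1
```

Stacking many of these neurons, and feeding the outputs of one group into the next, is what turns this simple building block into a network.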
Usually, neurons are arranged in layers: an input layer (with inputs), an output layer (e.g. with classifications) and in-between layers called hidden layers. When a network contains multiple hidden layers, we speak of deep learning.
Training a neural network for classification purposes basically requires two types of data: input data (audio, images, etc.) and a “ground truth” in the form of labels. This means that the data needs to be annotated by humans, which often is a labour-intensive process. In a future blog from this series, our annotation process will be discussed in more detail.
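The pairing of inputs and ground-truth labels can be pictured as a simple list of annotated examples. The file names and class labels below are hypothetical:

```python
# Each input (here: a path to an audio clip) is paired with a
# human-provided ground-truth label.
dataset = [
    ("clips/0001.wav", "dog_bark"),
    ("clips/0002.wav", "siren"),
    ("clips/0003.wav", "dog_bark"),
]

inputs = [path for path, _ in dataset]
labels = [label for _, label in dataset]
print(labels)  # ['dog_bark', 'siren', 'dog_bark']
```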
Running a neural network-based classifier in production requires quite a bit of computational power, especially when the network is large in terms of the number of layers and/or nodes. Training such a network is even more demanding. Luckily, thanks to the specialized capabilities of chips like Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), training large neural nets can be performed in a reasonable amount of time.
During training, the class labels predicted by the network are continuously compared with the ground truth annotations. Based on the difference between the two, the weights of the neurons in the network are adjusted slightly. If the process is working as intended, this adjustment makes the average difference ("error") between the predicted labels and the ground truth a bit smaller in each step. This process continues in a long loop until a specific criterion is met, e.g. a maximum number of iterations or a minimum error.
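The loop described above can be sketched with a toy example. Instead of a full network, a single linear "neuron" with one weight is trained by gradient descent to recover the relationship y = 2x; the data, learning rate and stopping criteria are invented for illustration.

```python
inputs = [1.0, 2.0, 3.0]
ground_truth = [2.0, 4.0, 6.0]   # true relationship: y = 2x

w = 0.0      # start with an arbitrary weight
lr = 0.05    # learning rate: how big each adjustment is

for step in range(1000):
    # Compare predictions (w * x) with the ground truth: mean squared error.
    error = sum((w * x - y) ** 2 for x, y in zip(inputs, ground_truth)) / len(inputs)
    # Gradient of the error with respect to w tells us which way to adjust.
    grad = sum(2 * (w * x - y) * x for x, y in zip(inputs, ground_truth)) / len(inputs)
    w -= lr * grad               # adjust the weight slightly
    if error < 1e-8:             # stopping criterion: error small enough
        break

print(round(w, 3))  # close to 2.0
```

A real training run does the same thing for millions of weights at once, using backpropagation to compute all the gradients efficiently.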
While sound event classifiers that accept raw audio as input do exist, in most cases the audio is actually transformed into something else first. But we'll discuss that in the next blog of this miniseries!



