Quick start tutorial

This quick start tutorial walks through all steps required to make DeepSS work with your data, using a recording of fly song as an example. Comprehensive documentation of all menus and options can be found in the GUI documentation.

In the tutorial, we will train DeepSS using an iterative and adaptive protocol that makes it possible to quickly create a large dataset of annotations: annotate a few song events, fast-train a network on those annotations, and then use that network to predict new annotations on a larger part of the recording. Initially, these predictions require manual correction, but correcting is typically much faster than annotating everything from scratch. This correct-train-predict cycle is then repeated with ever larger datasets until network performance is satisfactory.

Download example data

To follow the tutorial, download and open this audio file. The recording is of a Drosophila melanogaster male courting a female, recorded by David Stern (Janelia, part of this dataset). We will walk through loading, annotating, training and predicting using this file as an example.

Start the GUI

Install DeepSS following these instructions. Then start the GUI by opening a terminal, activating the conda environment created during install and typing dss gui:

conda activate dss
dss gui

The following window should open:

start screen

Fig. 1 Start screen.

Load audio data

Choose Load audio from file and select the downloaded recording of fly song.

In the dialog that opens, leave everything as is except set Minimal/Maximal spectrogram frequency—the range of frequencies in the spectrogram display—to 50 and 1000 Hz. This will restrict the spectrogram view to only show the frequencies found in fly song.

loading screen

Fig. 2 Loading screen.

Waveform and spectrogram display

Loading the audio will open a window that displays the first second of audio as a waveform (top) and a spectrogram (bottom). You will see the two major modes of fly song: pulse and sine. The recording starts with sine song, a relatively soft oscillation that produces spectral power at ~150 Hz. Pulse song starts after ~0.75 seconds, evident as trains of brief wavelets with a regular interval.

To navigate the view: move forward/backward along the time axis with the A/D keys and zoom in/out along the time axis with the W/S keys (see also the Playback menu). The temporal and frequency resolution of the spectrogram can be adjusted with the R and T keys.

You can play back the waveform on display through your headphones/speakers by pressing E.

waveform and spectrogram display

Fig. 3 Waveform (top) and spectrogram (bottom) display of a single-channel recording of fly song.

Initialize or edit song types

Before you can annotate song, you need to register the sine and pulse song types for annotation. DeepSS discriminates two principal categories of song types:

  • Events are defined by a single time of occurrence. The aforementioned pulse song is a song type of the event category.

  • Segments are song types that extend over time and are defined by a start and a stop time. The aforementioned sine song and the syllables of mouse and bird vocalizations fall into the segment category.
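As a data-structure view of the two categories above, the following minimal sketch models events and segments as Python dataclasses. The class names, field names, and example times are illustrative assumptions, not DeepSS's internal representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A song type occurrence defined by a single time (e.g. a pulse)."""
    name: str
    time: float  # seconds


@dataclass(frozen=True)
class Segment:
    """A song type occurrence defined by a start and a stop time (e.g. sine)."""
    name: str
    start: float  # seconds
    stop: float   # seconds

    def __post_init__(self):
        # A segment extends over time, so it must stop after it starts.
        if self.stop <= self.start:
            raise ValueError("a segment must stop after it starts")

    @property
    def duration(self) -> float:
        return self.stop - self.start


# Hypothetical annotations: one sine segment followed by two pulses.
annotations = [
    Segment("sine", 0.10, 0.70),
    Event("pulse", 0.78),
    Event("pulse", 0.82),
]
```

The distinction matters later in the workflow: an event needs one click to annotate, a segment needs two.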

Add two new song types for annotation via Annotations/Add or edit song types: ‘pulse’ of category ‘event’ and ‘sine’ of category ‘segment’:

edit annotation types

Fig. 4 Create two new song types for annotation.

Create annotations manually

The two new song types “pulse” and “sine” can now be activated for annotation using the dropdown menu on the top left of the main window. The active song type can also be changed with the number keys indicated in the dropdown menu. In this case, 1 activates pulse and 2 activates sine.

Song is annotated by left-clicking the waveform or spectrogram view. If an event-like song type is active, a single left click marks the time of an event. A segment-like song type requires two clicks—one for each boundary of the segment.

annotate song

Fig. 5 Left clicks in waveform or spectrogram view create annotations.

Annotate by thresholding the waveform

Annotation of events can be sped up with a “Thresholding mode”, which detects peaks in the sound energy exceeding a threshold. Activate thresholding mode via the Annotations menu. This will display a draggable horizontal line (the detection threshold) and a smooth pink waveform (the energy envelope of the waveform). Adjust the threshold so that only “correct” peaks in the envelope cross the threshold and then press I to annotate these peaks as events.

annotate song

Fig. 6 Annotations assisted by thresholding and peak detection.
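The idea behind thresholding mode can be sketched in a few lines of Python. This is a simplified illustration, not the GUI's actual implementation: it assumes the energy envelope is a moving average of the squared waveform and that events are local maxima of that envelope above the threshold. The function names, the window size, and the synthetic waveform are all hypothetical.

```python
import math


def energy_envelope(samples, window):
    """Smooth the squared waveform with a centered moving average."""
    half = window // 2
    squared = [s * s for s in samples]
    envelope = []
    for i in range(len(squared)):
        lo = max(0, i - half)
        hi = min(len(squared), i + half + 1)
        envelope.append(sum(squared[lo:hi]) / (hi - lo))
    return envelope


def detect_events(envelope, threshold, samplerate):
    """Return times (in seconds) of local envelope maxima above the threshold."""
    times = []
    for i in range(1, len(envelope) - 1):
        is_peak = envelope[i - 1] < envelope[i] >= envelope[i + 1]
        if is_peak and envelope[i] > threshold:
            times.append(i / samplerate)
    return times


# Synthetic waveform (an assumption for illustration): a loud bump and a soft bump.
samplerate = 1000  # Hz, hypothetical
samples = [
    math.exp(-((i - 200) / 10) ** 2) + 0.2 * math.exp(-((i - 600) / 10) ** 2)
    for i in range(1000)
]
envelope = energy_envelope(samples, window=5)
events = detect_events(envelope, threshold=0.1, samplerate=samplerate)
```

Lowering `threshold` in this sketch lets the softer bump through as well, which mirrors dragging the threshold line down in the GUI.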

Edit annotations

In case you mis-clicked, you can edit and delete annotations. Edit event times and segment bounds by dragging the lines or the boundaries of segments. Drag the shaded area itself to move a segment without changing its duration. Movement can be disabled completely or restricted to the currently selected annotation type via the Audio menu.

Delete annotations of the active song type by right-clicking on the annotation. Annotations of all song types or only the active one in the view can be deleted with U and Y, respectively, or via the Annotations menu.

annotate song

Fig. 7 Dragging moves, right click deletes annotations.

Export annotations and make a dataset

DeepSS achieves good performance with little manual annotation. Once you have completely annotated the song in the first 18 seconds of the tutorial recording—a couple of pulse trains and sine song segments—you can train a network to help with annotating the rest of the data.

Training requires the audio data and the annotations to be in a specific dataset format. First, export the audio data and the annotations via File/Export for DeepSS to a new folder (not the one containing the original audio); let’s call the folder quickstart. In the following dialog, set start seconds and end seconds to the annotated time range: 0 and 18 seconds, respectively.

export audio and annotations

Fig. 8 Export audio data and annotations for the annotated range between 0 and 18 seconds.

Then make a dataset via DeepSS/Make dataset for training. In the file dialog, select the quickstart folder you exported your annotations into. In the next dialog, we will adjust how the data is split into training, validation and testing data. For the small dataset annotated in the first step of this tutorial, we will not test the model. To maximize the data available for optimizing the network (training and validation), set the test split to 0.0 (no test data) and the validation split to 0.4 (40%):

assemble dataset

Fig. 9 Make a dataset for training.
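To get a feel for what these split settings mean for the 18 annotated seconds, here is a small arithmetic sketch. It assumes the splits are fractions of the total duration and reads the validation setting as 40% (0.4). How DeepSS actually assigns stretches of the recording to each split is not covered here, and the function name is hypothetical.

```python
def split_durations(total_seconds, validation_split, test_split=0.0):
    """Divide an annotated duration into train/validation/test portions (seconds)."""
    if validation_split < 0.0 or test_split < 0.0:
        raise ValueError("splits must not be negative")
    if validation_split + test_split >= 1.0:
        raise ValueError("splits must leave some data for training")
    validation = total_seconds * validation_split
    test = total_seconds * test_split
    train = total_seconds - validation - test
    return {"train": train, "validation": validation, "test": test}


# Tutorial settings: 18 annotated seconds, no test data, 40% validation (assumed fraction).
parts = split_durations(18.0, validation_split=0.4, test_split=0.0)
```

With these settings roughly 11 seconds are left for training and roughly 7 seconds for validation, which is why the test split is dropped at this early stage.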

This will create a dataset folder called quickstart.npy that contains the audio data and the annotations ready for training.

Fast training

Configure a network and start training via DeepSS/Train. This will ask you to select the dataset folder, quickstart.npy. Then, a dialog allows you to configure the network. For fast training, change the following:

  • Set both Number of filters and Filter duration (seconds) to 16. This will result in a smaller network with fewer parameters, which trains faster and requires fewer annotations to achieve adequate performance.

  • Set Number of epochs to 10, to finish training earlier.


Fig. 10 Train options

Then hit Start training in GUI. This will start training in a background process. Monitor training progress in the terminal. Training with this small dataset will finish in under 10 minutes on a CPU and within 2 minutes on a GPU. For larger datasets, we highly recommend training on a machine with a discrete Nvidia GPU.


Once training has finished, generate annotations using the trained network via DeepSS/Predict. This will ask you to select a model file containing the trained network. Training creates files in the quickstart.res folder, each starting with the time stamp of training; select the file ending in _model.h5.
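If you train repeatedly, the results folder accumulates several model files. The following sketch shows one way to locate the most recent one from Python. It assumes that the time stamp prefix sorts chronologically when compared as text, which may not hold for every time stamp format; the helper name is hypothetical.

```python
from pathlib import Path


def latest_model(results_dir):
    """Return the last *_model.h5 file in results_dir by name, or None if there is none."""
    candidates = sorted(Path(results_dir).glob("*_model.h5"))
    return candidates[-1] if candidates else None


# Look in the results folder created by training in this tutorial.
model_file = latest_model("quickstart.res")
```

`model_file` is `None` if the folder does not exist or contains no model yet.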

In the next dialog, predict song for 60 seconds starting after your manual annotations:

  • Set Start seconds to 18 and End seconds to 78.

  • Make sure that Proof reading mode is enabled. That way, annotations created by the network will be assigned names ending in _proposals - in our case sine_proposals and pulse_proposals. The proposals will be transformed into proper sine and pulse annotations during proof reading.

  • Enable Fill gaps shorter than (seconds) and Delete segments shorter than (seconds) by checking both check boxes.


Fig. 11 Predict in proof-reading mode on the next 60 seconds.
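The two post-processing options named in the dialog, Fill gaps shorter than (seconds) and Delete segments shorter than (seconds), can be illustrated with a short sketch. It assumes predicted segments are (start, stop) pairs in seconds and that gaps are filled before short segments are deleted; the function names and the 0.05 second thresholds are hypothetical, not DeepSS defaults.

```python
def fill_gaps(segments, max_gap):
    """Merge consecutive (start, stop) segments separated by a gap shorter than max_gap."""
    merged = []
    for start, stop in sorted(segments):
        if merged and start - merged[-1][1] < max_gap:
            # Gap is short: extend the previous segment instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged


def delete_short(segments, min_duration):
    """Drop segments that are shorter than min_duration."""
    return [(start, stop) for start, stop in segments if stop - start >= min_duration]


# Hypothetical sine predictions: one segment chopped in two, plus a spurious blip.
predicted = [(18.0, 18.5), (18.52, 19.0), (20.0, 20.01)]
cleaned = delete_short(fill_gaps(predicted, max_gap=0.05), min_duration=0.05)
```

In this example the chopped-up segment is rejoined and the blip is removed, which is the kind of cleanup that reduces manual work during proof reading.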

In contrast to training, prediction is very fast and does not require a GPU; it should finish within 30 seconds. The proposed annotations should already be good: most pulses should be correctly detected. Sine song is harder to predict and will often be missed or chopped up into multiple segments with gaps in between.

Proof reading

To turn the proposals into proper annotations, fix and approve them. Correct any prediction errors—add missing annotations, remove false positive annotations, adjust the timing of annotations. Once you have corrected all errors in the view, approve annotations with G or H for approving only the active or all song types, respectively. This will rename the proposals in the view to the original names (for instance, sine_proposals -> sine).
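The approval step amounts to renaming proposals inside the current view. A minimal sketch of that logic follows. It assumes annotations are (name, start, stop) tuples with events stored as start equal to stop, and that "in the view" means fully inside the displayed time range; the function name and these conventions are hypothetical.

```python
SUFFIX = "_proposals"


def approve(annotations, view_start, view_stop, active_type=None):
    """Rename proposals inside the view to their proper song type names.

    If active_type is given, only proposals of that song type are approved;
    otherwise proposals of all song types are approved.
    """
    approved = []
    for name, start, stop in annotations:
        in_view = start >= view_start and stop <= view_stop
        base = name[: -len(SUFFIX)] if name.endswith(SUFFIX) else None
        if base is not None and in_view and active_type in (None, base):
            name = base
        approved.append((name, start, stop))
    return approved


# Hypothetical proposals; the last one lies outside the 18 to 20 second view.
proposals = [
    ("pulse_proposals", 18.2, 18.2),
    ("sine_proposals", 18.5, 19.0),
    ("sine_proposals", 25.0, 26.0),
]
only_sine = approve(proposals, 18.0, 20.0, active_type="sine")
all_types = approve(proposals, 18.0, 20.0)
```

Proposals outside the view keep their `_proposals` suffix, so they remain visibly unreviewed until you navigate to them.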

Go back to “Export”

Once all proposals have been approved, export all annotations (now between 0 and 78 seconds), make a new dataset, train, predict, and repeat. Once prediction performance is adequate, fully train the network with a larger number of epochs, this time using a completely new recording as the test set (TODO: add option to specify a file as the test set to the “Make dataset” dialog).