Choice of structural network parameters

DAS performance is relatively robust to the choice of structural network parameters such as filter duration and number, or network depth. The networks tested in (Steinfath et al., 2021) are good starting points for adapting DAS to your own data (see table below). In our experience, a network with 32 filters, a filter duration of 32 samples, 3 TCN blocks, and a chunk duration of 2048 samples produces good results for most signals. An STFT-downsampling layer with 32 frequency bands and 16x downsampling should be included for most signals, except when the signals have a pulsatile character. Since DAS trains quickly, the network structure can be optimized by training networks with different structural parameters, for instance to find the simplest network (in terms of number of filters and TCN blocks) that saturates performance while keeping latency short. Here we provide additional guidelines for choosing a network's key structural parameters:
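These defaults can be passed directly when training. Below is a sketch of such a call, assuming DAS's `das.train.train` API; apart from `pre_nb_conv` (whose 16x-downsampling role is stated in the table notes below), the parameter names (`nb_hist` for chunk duration, `nb_conv` for the number of TCN blocks, `model_name='tcn_stft'` for the STFT front end) and the dataset path are our assumptions — verify them against your installed DAS version:

```python
import das.train

# Hypothetical dataset path; structural parameters follow the defaults above.
das.train.train(
    model_name='tcn_stft',    # TCN with an STFT-downsampling front end (assumed name)
    data_dir='my_dataset.npy',  # placeholder path
    save_dir='results',
    nb_filters=32,            # 32 filters ...
    kernel_size=32,           # ... each 32 samples long
    nb_conv=3,                # 3 TCN blocks (assumed name)
    nb_hist=2048,             # chunk duration in samples (assumed name)
    pre_nb_conv=4,            # 16x STFT downsampling (see table notes below)
)
```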

The chunk duration corresponds to the length of audio the network processes in one step and constitutes an upper bound on the context information available to the network. Choose it long enough that the network has access to the key features of your signal. For instance, for fly song, we ensured that a single chunk encompasses several pulses in a train, so the network can learn to detect song pulses based on their regular occurrence in trains. Chunks that are long relative to this timescale can reduce short false-positive detections, for instance for fly sine song and for bird song. Since increasing the chunk duration does not increase the number of trainable parameters, we recommend using long chunks unless low latency is of the essence (see below). However, networks with longer chunks require more memory, which is often a limiting factor on GPUs.
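To check that a chunk is long enough, convert its duration to seconds and relate it to the timescale of your signal. A minimal sketch; the ~35 ms inter-pulse interval for fly song used here is an approximate value we assume purely for illustration:

```python
def chunk_coverage(chunk_samples, sample_rate_hz, event_interval_s):
    """Return the chunk duration in seconds and how many events it spans."""
    chunk_s = chunk_samples / sample_rate_hz
    return chunk_s, chunk_s / event_interval_s

# 4096 samples at 10 kHz (the single-channel fly-song network from the table below),
# assuming a ~35 ms inter-pulse interval
chunk_s, n_intervals = chunk_coverage(4096, 10_000, 0.035)
print(f"{chunk_s * 1000:.1f} ms per chunk, ~{n_intervals:.0f} inter-pulse intervals")
```

A chunk of 4096 samples at 10 kHz thus spans roughly 410 ms, i.e. on the order of ten pulses in a train.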

Downsampling/STFT weakly affects performance but strongly accelerates convergence during training. This is because (A) the initialization with STFT filters is a good prior that reduces the number of epochs needed to learn the optimal filters, and (B) the downsampling reduces the data bandwidth and thereby the time it takes to complete one training epoch. The overall gain in performance from adding the STFT layer is small because the convolutional layers in the rest of the network can easily replicate its computations. For short pulsatile signals or signals with low sampling rates, STFT and downsampling should be avoided, since the loss of temporal resolution can decrease performance.
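The bandwidth reduction is visible from the shapes alone. Here is a toy magnitude spectrogram in numpy with a hop of 16 samples (16x downsampling) and 32 retained frequency bands; the window length and framing are our choices for illustration, not DAS's exact implementation:

```python
import numpy as np

def toy_stft(x, win=64, hop=16, nb_bands=32):
    """Magnitude spectrogram: frame the signal, window, rfft, keep nb_bands bins."""
    nb_frames = (len(x) - win) // hop + 1
    frames = np.stack([x[i * hop : i * hop + win] for i in range(nb_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1))
    return spec[:, :nb_bands]

x = np.random.randn(2048)          # one chunk of raw audio
spec = toy_stft(x)
print(x.shape, '->', spec.shape)   # time axis shrinks by roughly 16x
```

The network downstream of this layer then processes ~16x fewer time steps, which is where the speed-up comes from.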

The number of TCN blocks controls the network's depth. A deeper network can extract more high-level features, though we found that even for the spectro-temporally complex song of Bengalese finches, deeper networks improved performance only weakly.
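Depth also determines the network's receptive field. A sketch of the standard TCN receptive-field arithmetic; the assumption that dilations double 1, 2, 4, ... within each block is the usual TCN convention, and the layer count per block is an illustrative choice — neither is verified against the DAS code:

```python
def tcn_receptive_field(nb_blocks, layers_per_block, kernel_size):
    """Receptive field (in samples) of stacked dilated convolutions with
    dilations 1, 2, 4, ... within each block (causal convention)."""
    rf = 1
    for _ in range(nb_blocks):
        for layer in range(layers_per_block):
            rf += (kernel_size - 1) * 2**layer
    return rf

# 3 blocks of 5 dilated layers with kernel size 32
print(tcn_receptive_field(3, 5, 32))
```

Each extra block adds the same increment to the receptive field, so more blocks buy context as well as depth.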

Multi-channel audio can be processed with multi-channel filters via full convolutions or with shared channel-wise filters via time-channel separable convolutions. This can be set on a per-TCN-block basis. We recommend using separable convolutions in the first 1–2 blocks, since basic feature extraction is typically the same for each channel. Later blocks can then use full multi-channel filters to allow more complex combinations of information across channels.
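The parameter saving from separable convolutions is easy to quantify. A sketch of the two parameter counts, using the standard time-channel separable factorization (shared temporal filter per channel followed by 1x1 channel mixing; bias terms omitted):

```python
def full_conv_params(kernel_size, ch_in, ch_out):
    # one multi-channel filter per output channel
    return kernel_size * ch_in * ch_out

def separable_conv_params(kernel_size, ch_in, ch_out):
    # shared per-channel temporal filters + 1x1 channel-mixing weights
    return kernel_size * ch_in + ch_in * ch_out

# e.g. a 9-channel recording, 32 filters of 32 samples each
print(full_conv_params(32, 9, 32), 'vs', separable_conv_params(32, 9, 32))
```

For these settings the separable layer needs 16x fewer weights, which is why it suits the early, channel-generic feature-extraction stages.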

Real-time performance can be optimized via the chunk duration and the network complexity (the number and duration of filters, and the number of TCN blocks). We recommend starting with the default parameters suggested above and then benchmarking latency. If necessary, latency can be reduced further by shortening the chunk duration, reducing the number and duration of filters, or reducing the number of TCN blocks.
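Benchmarking latency amounts to a simple timing loop around inference. A sketch in which `predict` is a placeholder workload (a plain numpy convolution) standing in for your trained network's forward pass:

```python
import time
import numpy as np

def predict(chunk, kernel):
    # placeholder for the network's inference call
    return np.convolve(chunk, kernel, mode='same')

chunk, kernel = np.random.randn(2048), np.random.randn(32)
predict(chunk, kernel)  # warm-up run, excluded from timing

times = []
for _ in range(100):
    t0 = time.perf_counter()
    predict(chunk, kernel)
    times.append(time.perf_counter() - t0)
print(f"median latency: {np.median(times) * 1000:.3f} ms")
```

Report the median (or a high percentile) rather than the mean, since the first calls and OS jitter produce outliers.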

Structural parameters of existing networks

Use these parameter sets as starting points for designing your own networks. 16x STFT downsampling is achieved by setting `pre_nb_conv=4`.
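Our reading of the `pre_nb_conv=4` → 16x relation is that each pre-processing layer downsamples 2x, so the overall factor is `2**pre_nb_conv`. A small helper built on that assumption:

```python
import math

def pre_nb_conv_for(downsampling):
    """Number of 2x pre-processing layers for a given STFT downsampling factor,
    assuming the factor is 2**pre_nb_conv."""
    assert downsampling & (downsampling - 1) == 0, "must be a power of two"
    return int(math.log2(downsampling))

print(pre_nb_conv_for(16))  # matches the 16x rows in the table
```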

| Species | Sample rate [kHz] | Chunk duration (samples) | Audio channels | STFT downsampling | Separable conv | TCN stacks | Kernel size | Kernels |
|---|---|---|---|---|---|---|---|---|
| D. mel. (pulse and sine) | 10.0 | 4096 | 1 | - | - | 3 | 32 | 32 |
| D. mel. multi channel (pulse or sine) | 10.0 | 2048 | 9 | - | TCN stacks 1+2 | 4 | 32 | 32 |
| Mice | 300.0 | 8192 | 1 | 16x | - | 2 | 16 | 32 |
| Marmosets | 44.1 | 8192 | 1 | 16x | - | 2 | 16 | 32 |
| Beng. finches | 32.0 | 1024 | 1 | 16x | - | 4 | 32 | 64 |
| Zebra finches (directed song) | 32.0 | 2048 | 1 | 16x | - | 4 | 32 | 64 |