{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Make a dataset from custom data formats\n", "If your data is readable by the GUI, you only need to convert the annotations. You can then load both into the GUI and export them to data and annotations for DAS.\n", "\n", "Three alternatives:\n", "\n", "### 1. Export your data as wav/npz and csv to a folder and make a dataset with the GUI\n", "A `data` folder with `*.wav` or `*.npz` files containing the recordings and matching `*_annotations.csv` files with the annotations (see format details [here](/technical/data_formats)) - recordings and annotations will be matched according to the file base name:\n", "```shell\n", "data\\\n", " file01.wav\n", " file01_annotations.csv\n", " another_file.wav\n", " another_file_annotations.csv\n", " yaf.wav\n", " yaf_annotations.csv\n", "```\n", "\n", "This is the simplest way - no need to deal with the specifics of the data structure required by DAS for training - simply hit _DAS/Make dataset for training_ and select the folder.\n", "\n", "### 2. Use notebook with custom loaders to directly read recordings and annotations\n", "(Intermediate simplicity and flexibility) A folder with the data and annotations in a custom format - provide custom functions for loading the audio data and annotations. This is illustrated in the next section.\n", "\n", "### 3. DIY dataset\n", "(Complex, maximally flexible) Bring your own mappable (for experts): generate the data structure yourself. The description of the [dataset structure](/technical/data_formats) has information on how to do that." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use notebook with custom loaders to directly read recordings and annotations\n", "This notebook creates an annotated dataset for training DAS from lists of filenames for the song recordings and associated annotations.\n", "\n", "This may look a bit complicated at first sight but only requires you to change two things in the next two cells:\n", "\n", "1. 
The functions for loading the data (song recordings) and the annotations need to be adapted to work with your data.\n", "2. Two lists with the names of data files (consumed by `data_loader`) and the associated annotation files (consumed by `annotation_loader`) need to be created.\n", "\n", "The notebook works with a small toy dataset (with song recordings from the data provided with [Stern (2014)](https://www.janelia.org/lab/stern-lab/tools-reagents-data)) and with annotations produced by [Murthylab_FlySongSegmenter](https://github.com/murthylab/MurthyLab_FlySongSegmenter).\n", "\n", "__Note:__ For this tutorial to work, you first need to download some data and example models (266MB) from [here](https://www.dropbox.com/sh/wnj3389k8ei8i1c/AACy7apWxW87IS_fBjI8-7WDa?dl=0) and put the four folders in the `tutorials` folder.\n", "\n", "## Internals\n", "The dataset is created in two steps.\n", "\n", "First, the data and annotations for the different recordings are combined in a large matrix. This works even with large datasets that don't fit in memory thanks to [zarr](https://zarr.readthedocs.io/en/stable/tutorial.html#storage-alternatives), which provides an interface for on-disk arrays. The annotations are converted from lists of event times to one-hot-encoded probability vectors (0 for no event, 1 for a short duration (by default $\pm$30 samples) surrounding the event).\n", "\n", "Second, the zarr dataset is converted to a directory hierarchy of numpy files (npy) using `das.npy_dir.save`. Numpy files allow extremely fast memory mapping, providing faster out-of-memory access than zarr during training.\n", "\n", "The code below closely follows the code used in the GUI - see the [source code](https://github.com/janclemenslab/xarray-behave/blob/master/src/xarray_behave/gui/DAS.py)." 
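The conversion from event times to one-hot probability vectors described above can be sketched as follows. This is a minimal illustration of the idea, not DAS's actual implementation - `events_to_onehot` and its defaults are hypothetical names chosen for this example:

```python
import numpy as np

def events_to_onehot(event_times, samplerate, n_samples, extent=30):
    """Mark +/- `extent` samples around each event time (in seconds) with 1."""
    onehot = np.zeros(n_samples)
    for t in event_times:
        center = int(round(t * samplerate))
        # clip the window to the valid sample range
        onehot[max(0, center - extent):min(n_samples, center + extent + 1)] = 1.0
    return onehot

# a single pulse at 0.5 s in a 1-second recording sampled at 10 kHz
probs = events_to_onehot([0.5], samplerate=10_000, n_samples=10_000)
```

With the default extent of 30 samples, each event produces a block of 61 ones (the event sample plus 30 samples on either side).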
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import zarr\n", "import logging\n", "import das.npy_dir\n", "import das.make_dataset\n", "import das.annot\n", "from glob import glob\n", "import matplotlib.pyplot as plt\n", "import h5py\n", "import pandas as pd\n", "from typing import Tuple, Optional\n", "\n", "plt.style.use('ncb.mplstyle')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define custom audio and annotation loaders\n", "\n", "If your recordings and annotations are not in the format expected by the standard loaders used above (`scipy.io.wavfile.read` for the recordings, `pd.read_csv` with name/start_seconds/stop_seconds columns for the annotations), or if it's hard to convert your data into these standard formats, you can provide your own loaders as long as they conform to the following interface:\n", "\n", "- _data loaders_: `samplerate, data = data_loader(filename)` - accepts a single string argument (the path to the data file) and returns two things: the samplerate of the data and a numpy array with the recording data `[time, channels]`. Note: `scipy.io.wavfile.read` returns `[time,]` arrays - you need to add a new axis to make them 2d!\n", "- _annotation loaders_: `df = annotation_loader(filename)` - accepts a single string argument (the file path) and returns a pandas DataFrame with these three columns: `name`, `start_seconds`, `stop_seconds` (see alternative 1 above).\n", "\n", "Below, we read example data produced in MATLAB (D. Stern). The recording was exported and saved as a MATLAB struct - the audio data is found in `data`; the samplerate is not saved with the data. 
The annotations were produced by the [Murthylab_FlySongSegmenter](https://github.com/murthylab/MurthyLab_FlySongSegmenter) - the pulse times (in samples) and a mask indicating for each sample in the recording whether it was occupied by noise (0), a pulse train (1), or sine song (2) were exported and saved as a struct - `annotations.mask`, `annotations.pulsesamples`.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [], "source": [ "\n", "def data_loader(filename: str, dataset: Optional[str] = None) -> Tuple[float, np.ndarray]:\n", "    with h5py.File(filename, mode='r') as f:\n", "        data = f['data'][:]\n", "    samplerate = 10_000  # the samplerate is not stored in the file, so it is hard-coded here\n", "    return samplerate, data\n", "\n", "def annotation_loader(filename: str) -> pd.DataFrame:\n", "    samplerate = 10_000\n", "    with h5py.File(filename, 'r') as f:\n", "        # fss saves the mask with pulse trains=1, sine=2\n", "        sine = (f['annotations']['mask'][0, :] == 2).astype(float)\n", "        sine_onsets = np.where(np.diff(sine) == 1)[0] / samplerate\n", "        sine_offsets = np.where(np.diff(sine) == -1)[0] / samplerate\n", "\n", "        # fss saves pulse times in samples\n", "        pulse_times = f['annotations']['pulsesamples'][0] / samplerate\n", "\n", "    # make lists of annotation names and start/stop times:\n", "    names = []\n", "    start_seconds = []\n", "    stop_seconds = []\n", "\n", "    for pulse_time in pulse_times:\n", "        names.append('pulse')\n", "        start_seconds.append(float(pulse_time))\n", "        stop_seconds.append(float(pulse_time))\n", "\n", "    for sine_onset, sine_offset in zip(sine_onsets, sine_offsets):\n", "        names.append('sine')\n", "        start_seconds.append(float(sine_onset))\n", "        stop_seconds.append(float(sine_offset))\n", "\n", "    event_times = das.annot.Events.from_lists(names, start_seconds, stop_seconds)\n", "    df = event_times.to_df()\n", "    return df\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## List the names of data and annotation files\n", "The lists `files_data` and `files_annotation` are consumed by `data_loader` and 
`annotation_loader`, respectively." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data \"../data/PS_20130625155828_ch7_recording.mat\" with annotations in \"../data/PS_20130625155828_ch7_song.mat\".\n", "data \"../data/PS_20130625111709_ch10_recording.mat\" with annotations in \"../data/PS_20130625111709_ch10_song.mat\".\n", "data \"../data/PS_20130628144304_ch15_recording.mat\" with annotations in \"../data/PS_20130628144304_ch15_song.mat\".\n", "data \"../data/PS_20130625155828_ch11_recording.mat\" with annotations in \"../data/PS_20130625155828_ch11_song.mat\".\n" ] } ], "source": [ "data_dir = '../data/'\n", "files_data = glob(data_dir + '*_recording.mat')  # list all data files\n", "files_annotation = [file.replace('_recording.mat', '_song.mat') for file in files_data]  # generate the names of the associated annotation files\n", "[print(f'data \"{d}\" with annotations in \"{a}\".') for d, a in zip(files_data, files_annotation)];" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test loaders" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Audio data with samplerate 10000, shape (4100001, 1), and data type float32\n", "Annotations\n" ] }, { "data": { "text/html": [ "
\n", "|      | name  | start_seconds | stop_seconds |\n", "|------|-------|---------------|--------------|\n", "| 0    | sine  | 5.1539   | 5.3898   |\n", "| 1    | sine  | 127.9297 | 128.8389 |\n", "| 2    | sine  | 171.8631 | 172.0899 |\n", "| 3    | sine  | 178.0876 | 178.2290 |\n", "| 4    | sine  | 181.9101 | 182.4161 |\n", "| ...  | ...   | ...      | ...      |\n", "| 1956 | pulse | 404.4817 | 404.4817 |\n", "| 1957 | pulse | 404.5128 | 404.5128 |\n", "| 1958 | pulse | 404.5440 | 404.5440 |\n", "| 1959 | pulse | 404.5751 | 404.5751 |\n", "| 1960 | pulse | 404.6469 | 404.6469 |\n", "\n", "1961 rows × 3 columns
\n", "