{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Make a dataset from custom data formats\n", "If your recordings are readable by the GUI, you only need to convert the annotations. You can then load both into the GUI and export them as data and annotations for DAS.\n", "\n", "There are three alternatives:\n", "\n", "### 1. Export your data as wav/npz and csv to a folder and make a dataset with the GUI\n", "Create a `data` folder with `*.wav` or `*.npz` files containing the recordings and matching `*_annotations.csv` files containing the annotations (see format details [here](/technical/data_formats)). Recordings and annotations are matched by their file base name:\n", "```shell\n", "data\\\n", "    file01.wav\n", "    file01_annotations.csv\n", "    another_file.wav\n", "    another_file_annotations.csv\n", "    yaf.wav\n", "    yaf_annotations.csv\n", "```\n", "\n", "This is the simplest way, since you do not have to deal with the specifics of the data structure required by DAS for training. Simply hit _DAS/Make dataset for training_ and select the folder.\n", "\n", "### 2. Use notebook with custom loaders to directly read recordings and annotations\n", "(Intermediate simplicity and flexibility) Keep a folder with the data and annotations in a custom format and provide custom functions for loading audio data and annotations. This is illustrated in the next section.\n", "\n", "### 3. DIY dataset\n", "(Complex, maximally flexible) Bring your own mappable (for experts): generate the data structure yourself. The description of the [dataset structure](/technical/data_formats) has information on how to do that." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use notebook with custom loaders to directly read recordings and annotations\n", "This notebook creates an annotated dataset for training DAS from lists of filenames for the song recordings and associated annotations.\n", "\n", "This may look a bit complicated at first sight, but it only requires you to change two things in the next two cells:\n", "\n", "1. 
The functions for loading the data (song recordings) and the annotations need to be adapted to work with your data.\n", "2. Two lists with the names of the data files (consumed by `data_loader`) and the associated annotation files (consumed by `annotation_loader`) need to be created.\n", "\n", "The notebook works with a small toy dataset, with song recordings from the data provided with [Stern (2014)](https://www.janelia.org/lab/stern-lab/tools-reagents-data) and annotations produced by [Murthylab_FlySongSegmenter](https://github.com/murthylab/MurthyLab_FlySongSegmenter).\n", "\n", "__Note:__ For this tutorial to work, you first need to download some data and example models (266MB) from [here](https://www.dropbox.com/sh/wnj3389k8ei8i1c/AACy7apWxW87IS_fBjI8-7WDa?dl=0) and put the four folders in the `tutorials` folder.\n", "\n", "## Internals\n", "The dataset is created in two steps.\n", "\n", "First, the data and annotations for the different recordings are combined in a large matrix. This works even with large datasets that don't fit in memory thanks to [zarr](https://zarr.readthedocs.io/en/stable/tutorial.html#storage-alternatives), which provides an interface for on-disk arrays. The annotations are converted from lists of event times to one-hot-encoded probability vectors (0 for no event, 1 for a short window around the event; the default is $\\pm$30 samples).\n", "\n", "Second, the zarr dataset is converted to a directory hierarchy of numpy files (npy) using `das.npy_dir.save`. Numpy files allow extremely fast memory mapping, providing faster out-of-memory access than zarr during training.\n", "\n", "The code below closely follows the code used in the GUI (see the [source code](https://github.com/janclemenslab/xarray-behave/blob/master/src/xarray_behave/gui/DAS.py))." 
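, "\n", "
The conversion from event times to training targets described in the first step can be sketched as follows. This is a minimal illustration with plain numpy and not the code DAS runs: the function `events_to_one_hot` and its `half_width` parameter are hypothetical names chosen for this example, and `das.make_dataset` may place and shape the windows differently.

```python
import numpy as np


def events_to_one_hot(event_seconds, nb_samples, samplerate, half_width=30):
    # Hypothetical helper: column 0 means 'no event', column 1 means 'event'.
    y = np.zeros((nb_samples, 2))
    y[:, 0] = 1.0  # start with 'no event' everywhere
    for t in event_seconds:
        center = int(round(t * samplerate))  # event time in samples
        start = max(center - half_width, 0)
        stop = min(center + half_width + 1, nb_samples)
        y[start:stop, 0] = 0.0
        y[start:stop, 1] = 1.0  # mark the window around the event
    return y


# Two events at 10 ms and 50 ms in a 100 ms recording sampled at 10 kHz
y = events_to_one_hot([0.01, 0.05], nb_samples=1000, samplerate=10_000)
print(y.shape, int(y[:, 1].sum()))
```

Each row of `y` sums to 1, so it can be read as a probability over the two classes.
"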
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import zarr\n", "import logging\n", "import das.npy_dir\n", "import das.make_dataset\n", "import das.annot\n", "from glob import glob\n", "import matplotlib.pyplot as plt\n", "import h5py\n", "import pandas as pd\n", "from typing import Tuple, Optional\n", "\n", "plt.style.use('ncb.mplstyle')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define custom audio and annotation loaders\n", "\n", "If your recordings and annotations are not in the format expected by the standard loaders used in alternative 1 (`scipy.io.wavfile.read` for the recordings, `pd.read_csv` with name/start_seconds/stop_seconds for annotations), or if it's hard to convert your data into these standard formats, you can provide your own loaders as long as they conform to the following interface:\n", "\n", "- _data loaders_: `samplerate, data = data_loader(filename)` accepts a single string argument, the path to the data file, and returns two things: the samplerate of the data and a numpy array with the recording data `[time, channels]`. Note: `scipy.io.wavfile.read` returns `[time,]` arrays for single-channel files, so you need to add a new axis to make them 2D.\n", "- _annotation loaders_: `df = annotation_loader(filename)` accepts a single string argument with the file path and returns a pandas DataFrame with these three columns: `name`, `start_seconds`, `stop_seconds` (see alternative 1 above).\n", "\n", "Below, we read example data produced in Matlab (D. Stern). The recording was exported and saved as a Matlab struct: the audio data is found in `data`, and the samplerate is not saved with the data. 
The annotations were produced by the [Murthylab_FlySongSegmenter](https://github.com/murthylab/MurthyLab_FlySongSegmenter). The pulse times (in samples) and a mask indicating for each sample in the recording whether it was occupied by noise (0), a pulse train (1), or sine song (2) were exported and saved as a struct (`annotations.mask`, `annotations.pulsesamples`).\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [], "source": [ "\n", "def data_loader(filename: str, dataset: Optional[str] = None) -> Tuple[float, np.ndarray]:\n", "    with h5py.File(filename, mode='r') as f:\n", "        data = f['data'][:]\n", "    samplerate = 10_000  # not stored in the file, so it is hard-coded here\n", "    return samplerate, data\n", "\n", "def annotation_loader(filename: str) -> pd.DataFrame:\n", "    samplerate = 10_000\n", "    with h5py.File(filename, 'r') as f:\n", "        # fss saves mask with noise=0, pulse trains=1, sine=2\n", "        sine = (f['annotations']['mask'][0, :] == 2).astype(float)\n", "        sine_onsets = np.where(np.diff(sine) == 1)[0] / samplerate\n", "        sine_offsets = np.where(np.diff(sine) == -1)[0] / samplerate\n", "\n", "        # fss saves pulse times in samples\n", "        pulse_times = f['annotations']['pulsesamples'][0] / samplerate\n", "\n", "    # make lists:\n", "    names = []\n", "    start_seconds = []\n", "    stop_seconds = []\n", "\n", "    # pulses are events: start and stop are identical\n", "    for pulse_time in pulse_times:\n", "        names.append('pulse')\n", "        start_seconds.append(float(pulse_time))\n", "        stop_seconds.append(float(pulse_time))\n", "\n", "    # sine songs are segments: start and stop differ\n", "    for sine_onset, sine_offset in zip(sine_onsets, sine_offsets):\n", "        names.append('sine')\n", "        start_seconds.append(float(sine_onset))\n", "        stop_seconds.append(float(sine_offset))\n", "\n", "    event_times = das.annot.Events.from_lists(names, start_seconds, stop_seconds)\n", "    df = event_times.to_df()\n", "    return df\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## List the names of data and annotation files\n", "The lists `files_data` and `files_annotation` are consumed by `data_loader` and 
`annotation_loader`, respectively." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data \"../data/PS_20130625155828_ch7_recording.mat\" with annotations in \"../data/PS_20130625155828_ch7_song.mat\".\n", "data \"../data/PS_20130625111709_ch10_recording.mat\" with annotations in \"../data/PS_20130625111709_ch10_song.mat\".\n", "data \"../data/PS_20130628144304_ch15_recording.mat\" with annotations in \"../data/PS_20130628144304_ch15_song.mat\".\n", "data \"../data/PS_20130625155828_ch11_recording.mat\" with annotations in \"../data/PS_20130625155828_ch11_song.mat\".\n" ] } ], "source": [ "data_dir = '../data/'\n", "files_data = glob(data_dir + '*_recording.mat')  # list all data files\n", "# generate the names of the associated annotation files\n", "files_annotation = [file.replace('_recording.mat', '_song.mat') for file in files_data]\n", "for d, a in zip(files_data, files_annotation):\n", "    print(f'data \"{d}\" with annotations in \"{a}\".')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test loaders" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Audio data with samplerate 10000, shape (4100001, 1), and data type float32\n", "Annotations\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr><th></th><th>name</th><th>start_seconds</th><th>stop_seconds</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>sine</td><td>5.1539</td><td>5.3898</td></tr>\n", "    <tr><th>1</th><td>sine</td><td>127.9297</td><td>128.8389</td></tr>\n", "    <tr><th>2</th><td>sine</td><td>171.8631</td><td>172.0899</td></tr>\n", "    <tr><th>3</th><td>sine</td><td>178.0876</td><td>178.2290</td></tr>\n", "    <tr><th>4</th><td>sine</td><td>181.9101</td><td>182.4161</td></tr>\n", "    <tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n", "    <tr><th>1956</th><td>pulse</td><td>404.4817</td><td>404.4817</td></tr>\n", "    <tr><th>1957</th><td>pulse</td><td>404.5128</td><td>404.5128</td></tr>\n", "    <tr><th>1958</th><td>pulse</td><td>404.5440</td><td>404.5440</td></tr>\n", "    <tr><th>1959</th><td>pulse</td><td>404.5751</td><td>404.5751</td></tr>\n", "    <tr><th>1960</th><td>pulse</td><td>404.6469</td><td>404.6469</td></tr>\n", "  </tbody>\n", "</table>\n", "<p>1961 rows × 3 columns</p>\n", "</div>" ], "text/plain": [ "       name  start_seconds  stop_seconds\n", "0      sine         5.1539        5.3898\n", "1      sine       127.9297      128.8389\n", "2      sine       171.8631      172.0899\n", "3      sine       178.0876      178.2290\n", "4      sine       181.9101      182.4161\n", "...     ...            ...           ...\n", "1956  pulse       404.4817      404.4817\n", "1957  pulse       404.5128      404.5128\n", "1958  pulse       404.5440      404.5440\n", "1959  pulse       404.5751      404.5751\n", "1960  pulse       404.6469      404.6469\n", "\n", "[1961 rows x 3 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "samplerate, data = data_loader(files_data[0])\n", "print(f\"Audio data with samplerate {samplerate}, shape {data.shape}, and data type {data.dtype}\")\n", "print('Annotations')\n", "df = annotation_loader(files_annotation[0])\n", "df" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Save each recording as an HDF5 file next to the original, with the audio in the 'data' dataset\n", "for file_data in files_data:\n", "    samplerate, data = data_loader(file_data)\n", "    with h5py.File(file_data + '.h5', 'w') as f:\n", "        f['data'] = data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parse annotation files to identify song types\n", "This collects all unique values in the `name` column of the annotation files and infers their categories from the annotation times: song types with identical `start_seconds` and `stop_seconds` are events, all others are segments. Skip this step and define `class_names` and `class_categories` manually if you know the song types and their categories." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "../data/PS_20130625155828_ch7_song.mat\n", "../data/PS_20130625111709_ch10_song.mat\n", "../data/PS_20130628144304_ch15_song.mat\n", "../data/PS_20130625155828_ch11_song.mat\n", "found pulse of type event\n", "found sine of type segment\n" ] } ], "source": [ "# go through all annotation files and collect info on classes\n", "class_names = []\n", "class_categories = []\n", "for file_annotation in files_annotation:\n", "    print(file_annotation)\n", "    df = annotation_loader(file_annotation)\n", "    event_times = das.annot.Events.from_df(df)\n", "    class_names.extend(event_times.names)\n", "    class_categories.extend(event_times.categories.values())\n", "\n", "# keep each class name once, together with the category at its first occurrence\n", "class_names, first_indices = np.unique(class_names, return_index=True)\n", "class_categories = list(np.array(class_categories)[first_indices])\n", "class_names = list(class_names)\n", "\n", "for class_name, class_category in zip(class_names, class_categories):\n", "    print(f'found {class_name} of type {class_category}')\n", "\n", "# Need to add a \"noise\" song type for when there is no song\n", "class_names.insert(0, 'noise')\n", "class_categories.insert(0, 'segment')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Split files into train/val/test sets\n", "Split the list of recordings into train/val/test files:\n", "- `train` is used for optimizing the parameters during model fitting\n", "- `val` is used during training to track model performance and save the current best model based on the performance on the validation data\n", "- `test` is used to evaluate the best model after training is done and to fine-tune inference" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Indices of test file(s): [3] \n", "indices of validation file(s): [2] \n", "indices of train files: [0 1]\n" ] } ], 
"source": [ "np.random.seed(1)  # seed random number generator for reproducible splits\n", "# split the recordings into one for testing, one for validation, and the remainder for training\n", "test_idx, val_idx, train_idx = np.split(np.random.permutation(len(files_data)), (1, 2))\n", "print('Indices of test file(s):', test_idx, '\\nindices of validation file(s):', val_idx, '\\nindices of train files:', train_idx)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initialize the data structure (store)\n", "The store will hold the audio data and the annotations in a format ready for training (see [here](/technical/data_formats) for details). The store also contains metadata that is important for training and inference:\n", "- the samplerates of the data in Hz\n", "- a description of the different classes (names and types, segment or event), largely used to post-process the output of the network after inference (e.g. to detect event times from the probabilities produced by the network).\n", "\n", "We use zarr as a data structure because it provides in-memory as well as out-of-memory arrays with the same dict-like interface (similar to h5py).\n", "\n", "Choose the zarr `store_type` based on the total size of your dataset:\n", "- If it fits in memory, use a `DictStore`, which will place the data and annotation arrays in memory. 
\n", "- For \"big data\", use a `DirectoryStore`, which will place the arrays in chunked files in a directory.\n", "\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "samplerate = 10_000  # this is the sample rate of your data and the pulse times\n", "store = das.make_dataset.init_store(\n", "    nb_channels=1,  # number of channels/microphones in the recording\n", "    nb_classes=len(class_names),  # number of classes to predict - [noise, pulse, sine]\n", "    make_single_class_datasets=True,  # also make y_pulse and y_sine\n", "    samplerate=samplerate,  # make sure audio data and the annotations are all on the same sampling rate\n", "    class_names=class_names,\n", "    class_types=class_categories,\n", "    store_type=zarr.DictStore,  # use DirectoryStore for big data\n", "    store_name='intermediate.zarr',  # only used with DirectoryStore - this is the path to the directory created\n", "    )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load the individual data files\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Assembling data set:\n", " ../data/PS_20130625155828_ch7_recording.mat -> train set\n", " ../data/PS_20130625111709_ch10_recording.mat -> train set\n", " ../data/PS_20130628144304_ch15_recording.mat -> val set\n", " ../data/PS_20130625155828_ch11_recording.mat -> test set\n", "Got (8090002, 1), (4150001, 1), (4000001, 1) train/val/test samples.\n" ] } ], "source": [ "print('Assembling data set:')\n", "for idx, (data_file, annotation_file) in enumerate(zip(files_data, files_annotation)):\n", "    # Determine whether file is test/val/train\n", "    if idx in test_idx:\n", "        target = 'test'\n", "    elif idx in val_idx:\n", "        target = 'val'\n", "    elif idx in train_idx:\n", "        target = 'train'\n", "    else:\n", "        continue\n", "\n", "    print(f\" {data_file} -> {target} set\")\n", "    fs, x = data_loader(data_file)\n", "    nb_samples = x.shape[0]\n", "\n", "    # load annotations\n", "    df = annotation_loader(annotation_file)\n", "    df = df.dropna()\n", "\n", "    # make initial annotation matrix\n", "    y = das.make_dataset.make_annotation_matrix(df, nb_samples, fs, class_names)\n", "\n", "    # blur events\n", "    for class_index, class_category in enumerate(class_categories):\n", "        if class_category == 'event':\n", "            y[:, class_index] = das.make_dataset.blur_events(y[:, class_index],\n", "                                                             event_std_seconds=.0016,\n", "                                                             samplerate=samplerate)\n", "\n", "    # Append the recording (x) and the prediction target (y) to the data set\n", "    store[target]['x'].append(x)\n", "    store[target]['y'].append(das.make_dataset.normalize_probabilities(y))\n", "\n", "    # Make prediction targets for individual song types (noise plus one song type each)\n", "    for cnt, class_name in enumerate(class_names[1:]):\n", "        store[target][f'y_{class_name}'].append(das.make_dataset.normalize_probabilities(y[:, [0, cnt+1]]))\n", "\n", "print(f\"Got {store['train']['x'].shape}, {store['val']['x'].shape}, {store['test']['x'].shape} train/val/test samples.\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspect the dataset\n", "Plot x/y values from the dataset to make sure everything is well aligned." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "class_names = store.attrs['class_names']\n", "for t0 in [45_000]:  # range(0, 500_000, 10_000):\n", "    t1 = t0 + 10_000\n", "    plt.gcf().set_size_inches(20, 6)\n", "    T = np.arange(t0, t1) / store.attrs['samplerate_x_Hz']\n", "    # top: audio trace\n", "    plt.subplot(211)\n", "    plt.plot(T, store['train']['x'][t0:t1], 'k')\n", "    plt.xticks([])\n", "    plt.ylabel('microphone voltage')\n", "    plt.xlim(min(T), max(T))\n", "\n", "    # middle: targets for all classes as an image\n", "    plt.subplot(413)\n", "    plt.imshow(store['train']['y'][t0:t1, :].astype(float).T, cmap='Greys')\n", "    plt.yticks(range(len(class_names)), labels=class_names)\n", "    plt.xticks([])\n", "\n", "    # bottom: targets for the song classes (without noise) as traces\n", "    plt.subplot(414)\n", "    plt.plot(T, store['train']['y'][t0:t1, 1:])\n", "    plt.xlabel('time [seconds]')\n", "    plt.xlim(min(T), max(T))\n", "    plt.legend(class_names[1:])\n", "    plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save as npy_dir\n", "Once all data and annotation files have been appended to the store, save the data as an `npy_dir` store, a directory hierarchy that replicates the nested dictionary structure of the zarr store. For instance, the dataset at `root['train']['x']` (the audio data for training) will be stored as `store_name/train/x.npy`. We use these npy files rather than the zarr store because access is faster. Saving is done by `das.npy_dir.save`. The directory structure is mapped back to a nested dictionary via `das.npy_dir.load`. If you used the `DirectoryStore` during assembly, you can delete the `*.zarr` directory after that final step." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# Save the zarr store as a hierarchy of npy files\n", "store_folder = 'tutorial_dataset.npy'\n", "logging.info(f' Saving to {store_folder}.')\n", "das.npy_dir.save(store_folder, store)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can now train using the [GUI](/tutorials_gui/train), or a [script](/tutorials/train), or the [command line](/tutorials/train)." ] } ], "metadata": { "file_extension": ".py", "kernel_info": { "name": "python3" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.11" }, "mimetype": "text/x-python", "name": "python", "npconvert_exporter": "python", "nteract": { "version": "0.15.0" }, "pygments_lexer": "ipython3", "version": 3 }, "nbformat": 4, "nbformat_minor": 4 }