das.data

Utils for loading and manipulating data for training and prediction.

class das.data.AudioSequence(x: numpy.ndarray, y: Optional[numpy.ndarray] = None, batch_size: int = 32, shuffle: bool = True, nb_hist: int = 1, y_offset: Optional[int] = None, stride: int = 1, cut_trailing_dim: bool = False, with_y_hist: bool = False, data_padding: int = 0, first_sample: int = 0, last_sample: Optional[int] = None, output_stride: int = 1, nb_repeats: int = 1, shuffle_subset: Optional[float] = None, unpack_channels: bool = False, mask_input: Optional[int] = None, batch_processor: Optional[Callable[[numpy.ndarray], numpy.ndarray]] = None, **kwargs)[source]

Generates batches of (optionally overlapping) audio snippets, together with the matching targets and sample weights, for training and prediction.

x and y can be mem-mapped numpy arrays or lazily loaded hdf5 (zarr, xarray) datasets. Dask arrays do not work since they are immutable.

Parameters
  • x (np.ndarray) – Audio data, [nb_samples, …].

  • y (np.ndarray, optional) – Targets, [nb_samples, nb_classes] - class probabilities, so the sum over classes for each sample should be 1.0. Defaults to None. If None, __getitem__ will only return x batches - neither y nor sample weights.
  • batch_size (int, optional) – Number of windows per batch. Defaults to 32.

  • shuffle (bool, optional) – randomize order of batches. Defaults to True.

  • nb_hist (int, optional) – Number of time steps (history) per window. Defaults to 1.

  • y_offset (int, optional) – time offset between x and y. nb_hist/2 if None (predict central sample in each batch). Defaults to None.

  • stride (int, optional) – Number of time steps between consecutive windows. Defaults to 1.

  • cut_trailing_dim (bool, optional) – Remove trailing dimension. Defaults to False.

  • with_y_hist (bool, optional) – y as central value of the x_hist window (False) or the full sequence covering the x_hist window (True). Defaults to False.

  • data_padding (int, optional) – If > 0, sets the sample weights of that many samples at the start and end of each nb_hist window to zero. Defaults to 0.

  • first_sample (int, optional) – Index of the first sample of x to use. Defaults to 0.

  • last_sample (int, optional) – Index of the last sample of x to use. Defaults to None (use x up to its last sample).

  • output_stride (int) – Take every Nth sample as output. Useful in combination with a “downsampling frontend”. Defaults to 1 (every sample).

  • nb_repeats (int) – Number of repeats before the dataset runs out of data. Defaults to 1 (no repeats).

  • shuffle_subset (float, optional) – Fraction of batches to use - only works if shuffle=True. Defaults to None (use all batches).

  • unpack_channels (bool) – For multi-channel models with single-channel preprocessing - unpack [nb_hist, nb_channels] -> [nb_channels * [nb_hist, 1]]

  • mask_input (int, optional) – Half-width of the block of central samples to mask. Defaults to None (no masking).

  • batch_processor (Callable[[np.ndarray], np.ndarray], optional) – For augmentations. Defaults to None.
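The windowing logic behind nb_hist and stride can be sketched with plain numpy. This is an illustration of the concept only, not the library's implementation - the real AudioSequence additionally handles y, shuffling, batching, and sample weights:

```python
import numpy as np

def make_windows(x, nb_hist=1, stride=1):
    """Cut a continuous recording into overlapping windows.

    Sketch of the nb_hist/stride logic: window i covers samples
    [i * stride, i * stride + nb_hist).
    """
    nb_windows = (len(x) - nb_hist) // stride + 1
    return np.stack([x[i * stride:i * stride + nb_hist]
                     for i in range(nb_windows)])

x = np.arange(10, dtype=float)[:, np.newaxis]  # [nb_samples, nb_channels]
windows = make_windows(x, nb_hist=4, stride=2)
print(windows.shape)  # (4, 4, 1): 4 windows of 4 time steps, 1 channel
```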

unroll(return_x=True, merge_batches=True)[source]

Unrolls all batches of the generator into full arrays.

Parameters
  • return_x (bool, optional) – Also return the inputs x. Defaults to True.

  • merge_batches (bool, optional) – Concatenate the individual batches into a single array. Defaults to True.

Returns

The unrolled data as (xx, yy), (None, yy), or (xx,), depending on return_x and on whether the generator has targets y.

Return type

tuple
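A minimal sketch of what merging batches amounts to, assuming (as is conventional for batched generators) that per-batch arrays are concatenated along the first axis:

```python
import numpy as np

# Three batches of shape [batch_size, nb_hist, nb_channels];
# the last batch may be smaller than the others.
batches = [np.zeros((32, 4, 1)), np.zeros((32, 4, 1)), np.zeros((16, 4, 1))]
merged = np.concatenate(batches, axis=0)
print(merged.shape)  # (80, 4, 1)
```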

das.data.sub_range(data_len, fraction: float, min_nb_samples: int = 0, seed=None)[source]

Selects a random contiguous sub-range covering a fraction of the data.

Parameters
  • data_len (int) – total length of data

  • fraction (float) – fraction of data_len to use

  • min_nb_samples (int, optional) – Minimum number of samples in the selected range. Defaults to 0.

  • seed (int, optional) – Seed for the random number generator, for reproducible sub-range selection. Defaults to None.

Returns

first_sample (int), last_sample (int)
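The selection can be sketched as follows. This is a reimplementation for illustration under the assumption that the range is drawn uniformly at random, not the library's code:

```python
import numpy as np

def sub_range(data_len, fraction, min_nb_samples=0, seed=None):
    # Number of samples to keep - at least min_nb_samples.
    nb_samples = max(int(data_len * fraction), min_nb_samples)
    rng = np.random.default_rng(seed)
    # Random, contiguous range of nb_samples samples.
    first_sample = int(rng.integers(0, data_len - nb_samples + 1))
    return first_sample, first_sample + nb_samples

first, last = sub_range(1000, fraction=0.1, seed=1)
print(last - first)  # 100 samples, i.e. 10% of the data
```

With a fixed seed, repeated calls return the same range, which is what makes train/validation splits reproducible.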

das.data.unpack_batches(x: numpy.ndarray, padding: int = 0)[source]

Stitches batches of windows back into a continuous sequence, trimming padding samples from the start and end of each window.

Parameters
  • x (np.ndarray) – Batched data, [nb_batches, nb_hist, …].

  • padding (int, optional) – Number of samples to trim from the start and end of each window. Defaults to 0.

Returns

The concatenated data, [nb_samples, …].

Return type

np.ndarray
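Assuming the function trims the padded edges of each window and concatenates what remains (an inference from the data_padding parameter documented above), the operation can be sketched as:

```python
import numpy as np

def unpack_batches(x, padding=0):
    # Trim `padding` samples from both ends of every window, then
    # stitch the windows back into one continuous sequence.
    if padding > 0:
        x = x[:, padding:-padding]
    return x.reshape((-1,) + x.shape[2:])

batches = np.arange(24, dtype=float).reshape(3, 8, 1)  # 3 windows of 8 steps
out = unpack_batches(batches, padding=2)
print(out.shape)  # (12, 1): 3 windows * (8 - 2*2) retained steps each
```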