Utils for loading and manipulating data for training and prediction.
- class das.data.AudioSequence(x: numpy.ndarray, y: Optional[numpy.ndarray] = None, batch_size: int = 32, shuffle: bool = True, nb_hist: int = 1, y_offset: Optional[int] = None, stride: int = 1, cut_trailing_dim: bool = False, with_y_hist: bool = False, data_padding: int = 0, first_sample: int = 0, last_sample: Optional[int] = None, output_stride: int = 1, nb_repeats: int = 1, shuffle_subset: Optional[float] = None, unpack_channels: bool = False, mask_input: Optional[int] = None, batch_processor: Optional[Callable[[numpy.ndarray], numpy.ndarray]] = None, class_weights: Optional[Sequence[float]] = None, **kwargs)#
x and y can be memory-mapped numpy arrays or lazily loaded hdf5 (zarr, xarray) datasets. Dask arrays do not work since they are immutable.
x (np.ndarray) – [nb_samples, …]
y (np.ndarray, optional) – [nb_samples, nb_classes] - class probabilities, so the sum over classes for each sample should be 1.0. If None, getitem will only return x batches - neither y nor sample weights. Defaults to None.
batch_size (int, optional) – number of windows per batch. Defaults to 32.
shuffle (bool, optional) – randomize order of batches. Defaults to True.
nb_hist (int, optional) – nb of time steps per batch. Defaults to 1.
y_offset (int, optional) – time offset between x and y. nb_hist/2 if None (predict central sample in each batch). Defaults to None.
stride (int, optional) – nb of time steps between batches. Defaults to 1.
cut_trailing_dim (bool, optional) – Remove trailing dimension. Defaults to False.
with_y_hist (bool, optional) – y as central value of the x_hist window (False) or the full sequence covering the x_hist window (True). Defaults to False.
data_padding (int, optional) – if > 0, sets the sample weights of that many samples at the start and end of each nb_hist window to zero. Defaults to 0.
first_sample (int, optional) – index of the first sample in x to use. Defaults to 0.
last_sample (int, optional) – index of the last sample in x to use. If None, uses x up to its last sample. Defaults to None.
output_stride (int) – Take every Nth sample as output. Useful in combination with a “downsampling frontend”. Defaults to 1 (every sample).
nb_repeats (int) – Number of repeats before the dataset runs out of data. Defaults to 1 (no repeats).
shuffle_subset (float, optional) – fraction of batches to use; only applies if shuffle=True. Defaults to None.
unpack_channels (bool) – For multi-channel models with single-channel preprocessing - unpack [nb_hist, nb_channels] -> [nb_channels * [nb_hist, 1]]
mask_input (int, optional) – half-width of the block of central samples to mask. Defaults to None (no masking).
batch_processor (Callable[[np.ndarray], np.ndarray], optional) – For augmentations. Defaults to None.
class_weights (Sequence[float], optional) – Weights for each class used for balancing. Defaults to None (no balancing).
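The windowing behind these parameters can be sketched with plain numpy. This is a simplified illustration, not the actual AudioSequence implementation; it assumes shuffle=False, no labels, and no padding, and the helper name make_batch is hypothetical:

```python
import numpy as np

# Sketch of how AudioSequence cuts batches from x (assumptions: shuffle=False,
# no y, no padding). Each batch holds `batch_size` windows of `nb_hist`
# samples, with consecutive windows `stride` samples apart.
def make_batch(x, batch_index, batch_size=32, nb_hist=1, stride=1):
    starts = [(batch_index * batch_size + i) * stride for i in range(batch_size)]
    return np.stack([x[s:s + nb_hist] for s in starts])

x = np.arange(200, dtype=float)[:, np.newaxis]  # [nb_samples, nb_channels]
batch = make_batch(x, batch_index=0, batch_size=4, nb_hist=10, stride=5)
# batch.shape == (4, 10, 1); window i starts at sample i * stride
```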
- unroll(return_x=True, merge_batches=True)#
Unrolls the sequence into full arrays. Returns (xx, yy), (None, yy) or (xx,), depending on return_x and whether y was provided.
- Return type
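What merging batches amounts to can be sketched as concatenating all batches along the first axis. The helper unroll_batches below is hypothetical and only illustrates the merge_batches=True case:

```python
import numpy as np

# Hedged sketch: concatenate all (x, y) batches of a sequence along the
# first axis, as unroll(merge_batches=True) does for an AudioSequence.
def unroll_batches(batches, return_x=True):
    xs, ys = zip(*batches)
    xx = np.concatenate(xs) if return_x else None
    yy = np.concatenate(ys)
    return xx, yy

batches = [(np.zeros((4, 10, 1)), np.zeros((4, 2))) for _ in range(3)]
xx, yy = unroll_batches(batches)
# xx.shape == (12, 10, 1), yy.shape == (12, 2)
```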
- das.data.compute_class_weights(y: numpy.ndarray) → List[float]#
y (np.ndarray) – [T, nb_classes]
- Return type
- List[float]
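Balancing weights of this kind are typically inverse class frequencies. The sketch below assumes that convention (the exact normalization used by das may differ):

```python
import numpy as np

# Hedged sketch of inverse-frequency class weights for one-hot labels
# y of shape [T, nb_classes]; rarer classes get larger weights.
def class_weights(y):
    counts = y.sum(axis=0)                          # samples per class
    return (counts.sum() / (len(counts) * counts)).tolist()

y = np.array([[1, 0]] * 90 + [[0, 1]] * 10, dtype=float)  # 90:10 imbalance
w = class_weights(y)
# w ≈ [0.56, 5.0] – the rare class is up-weighted
```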
- das.data.sub_range(data_len, fraction: float, min_nb_samples: int = 0, seed=None)#
data_len (int) – total length of the data
fraction (float) – fraction of data_len to use
min_nb_samples (int, optional) – minimum number of samples in the sub-range. Defaults to 0.
seed (int, optional) – seed for the random number generator, for reproducible subset selection. Defaults to None.
- Returns
- first_sample (int), last_sample (int)
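Picking a reproducible contiguous sub-range can be sketched as follows. The body mirrors the documented arguments but is an assumption, not the actual implementation:

```python
import numpy as np

# Sketch: choose a random contiguous [first_sample, last_sample) range
# covering `fraction` of the data, at least `min_nb_samples` long.
def sub_range(data_len, fraction, min_nb_samples=0, seed=None):
    rng = np.random.default_rng(seed)
    nb_samples = max(int(data_len * fraction), min_nb_samples)
    first_sample = int(rng.integers(0, data_len - nb_samples + 1))
    return first_sample, first_sample + nb_samples

first, last = sub_range(1_000, fraction=0.1, seed=1)
# last - first == 100; the same seed always yields the same range
```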
- das.data.unpack_batches(x: numpy.ndarray, padding: int = 0)#
x (np.ndarray) – [description]
padding (int, optional) – number of samples to strip from the start and end of each window. Defaults to 0.
- Return type
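One plausible reading, based on the data_padding parameter above, is that unpack_batches trims `padding` samples from both ends of each window before stitching the windows back together. The sketch below encodes that interpretation and is not a confirmed implementation:

```python
import numpy as np

# Hedged sketch: strip `padding` samples from both ends of each window in
# a batched array [nb_batches, nb_hist, ...] and concatenate the rest.
def unpack_batches(x, padding=0):
    if padding > 0:
        x = x[:, padding:-padding]
    return x.reshape((-1,) + x.shape[2:])

x = np.ones((8, 100, 2))          # 8 windows of 100 samples, 2 classes
out = unpack_batches(x, padding=10)
# out.shape == (8 * 80, 2) == (640, 2)
```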