- `splits` module: Introduced `create_splits` function to perform pre-assigned data splits (see the sketch below).
- `evaluation` module: Added new module to provide metrics for evaluation, compatible with trainer classes.
- `datasets` module:
  - Enabled `XrDataset` to support train/test splitting.
  - Updated `XrDataset` and `MultiProcSampler` to perform feature scaling (normalization and standardization), with support for custom scaling functions.
  - Added callback functionality to `PTXrDataset` for unification, aligning it with other dataset classes.
- `preprocessing` module:
  - Introduced `assign_mask` method for automatic filtering of data cubes.
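To illustrate the pre-assigned split concept, here is a minimal standalone sketch using only `numpy` and `xarray`; the actual `create_splits` signature may differ, and the boolean `split` variable shown is an assumption based on the split-variable convention mentioned further down.

```python
# Minimal sketch of the pre-assigned split idea (not ml4xcube's actual
# create_splits API): a boolean `split` variable marks each sample as
# train (True) or test (False), and the cube is partitioned accordingly.
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"lst": (("time", "lat", "lon"), np.random.rand(10, 4, 4))},
    coords={"time": np.arange(10)},
)
ds["split"] = ("time",), np.random.rand(10) < 0.8  # ~80% of steps for training

train = ds.isel(time=np.flatnonzero(ds["split"].values))
test = ds.isel(time=np.flatnonzero(~ds["split"].values))
```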
- Module renaming:
  - `xr_plots` → `plotting`
  - `cube_insights` → `insights`
  - `cube_utilities` → `utils`
  - `data_splits` → `splitting`
- `datasets` module:
  - Renamed `LargeScaleXrDataset` to `PTXrDataset` (for PyTorch) and `TFXrDataset` (for TensorFlow).
  - Updated `MultiProcSampler` to automatically add metadata to the data cube.
- `preprocessing` module:
  - Renamed `fill_masked_data` → `fill_nan_values`.
  - Updated `drop_nan_values` method to allow NaN dropping strategy selection via the `mode` parameter.
  - Modified `get_range`/`get_statistics` to return comprehensive dataset statistics within a dictionary, excluding predefined variables and the `split` variable (see the sketch below).
  - Standardization of all variables returned by `get_range`/`get_statistics` for the `normalize` and `standardize` functions.
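A minimal standalone illustration of the statistics-dictionary workflow; the helper names and bodies below are simplified assumptions, not the ml4xcube implementations:

```python
# Simplified sketch: compute per-variable statistics into a dictionary,
# skipping the `split` variable, then standardize each data variable
# with its own mean and standard deviation.
import numpy as np
import xarray as xr

def get_statistics(ds, exclude=("split",)):
    return {var: (float(ds[var].mean()), float(ds[var].std()))
            for var in ds.data_vars if var not in exclude}

def standardize(ds, stats):
    out = ds.copy()
    for var, (mean, std) in stats.items():
        out[var] = (ds[var] - mean) / std
    return out

ds = xr.Dataset({
    "lst": (("time",), 270.0 + 30.0 * np.random.rand(100)),
    "split": (("time",), np.random.rand(100) < 0.8),
})
ds_std = standardize(ds, get_statistics(ds))  # `split` left untouched
```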
- `Trainer` class (`training.pytorch` module): Fixed an issue with transferring validation data to the GPU when required.
- Added a script for time series analysis with a CNN (`time_series_analysis.py`).
- Replaced the examples using multidimensional data samples with masked use cases (`use_case_lst_pytorch_masked.ipynb` and `use_case_lst_tensorflow_masked.ipynb`).
- Files that were unintentionally included in version 1.0.0 during the build have been removed to streamline the package and correct the release content.
- `cube_insights` module introduced to provide an overview of the data cube state:
  - `get_insights` function for extracting and printing various characteristics of the data cube.
  - `get_gap_heat_map` for generating heat maps of value counts (non-NaN values) across dimensions (see the sketch below).
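The gap heat map idea can be reproduced with plain `xarray`; this is a standalone sketch of the concept, not the actual `get_gap_heat_map` implementation:

```python
# Count non-NaN values along the time axis to see where gaps cluster.
import numpy as np
import xarray as xr

data = np.random.rand(12, 5, 5)
data[data < 0.3] = np.nan                  # inject artificial gaps
cube = xr.DataArray(data, dims=("time", "lat", "lon"))

heat_map = cube.notnull().sum(dim="time")  # valid-value count per pixel
```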
- `training` module:
  - `plot_loss` method, accessible if `create_loss_plot` is true in the Trainer classes (`pytorch`, `pytorch_distributed`, `sklearn`, `tensorflow`).
  - If `mlflow_run` is specified, models are saved on the MLflow server under the name extracted from the user-defined model path instead of under 'model'.
  - Added the option to define a validation metrics dictionary in the Trainer classes (see the sketch below).
- `datasets` module:
  - Introduced `MultiProcSampler` for efficient creation of train/test sets as zarr data cubes (see the sketch below).
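The underlying idea, persisting sampled train/test subsets as zarr cubes so later training runs can stream them from disk, can be sketched with plain `xarray` (requires the `zarr` package); the actual `MultiProcSampler` interface and its multiprocessing logic are not shown here:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"lst": (("time", "lat", "lon"), np.random.rand(20, 8, 8))})
mask = np.random.rand(ds.sizes["time"]) < 0.8   # ~80% of steps for training

ds.isel(time=np.flatnonzero(mask)).to_zarr("train_cube.zarr", mode="w")
ds.isel(time=np.flatnonzero(~mask)).to_zarr("test_cube.zarr", mode="w")
```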
- `cube_utilities` module:
  - Added `split_chunk` to divide a chunk of the data cube into machine learning samples (see the sketch below).
  - Added `get_dim_range` for retrieving minimum and maximum values of specific cube dimensions.
  - Added `assign_dims` function to map user-defined dimension names for later transformation to xarray.
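A standalone sketch of the `split_chunk` and `get_dim_range` ideas; the function bodies below are simplified stand-ins, not the ml4xcube implementations:

```python
import numpy as np
import xarray as xr

cube = xr.DataArray(np.random.rand(100, 16, 16),
                    dims=("time", "lat", "lon"),
                    coords={"time": np.arange(100)})

def get_dim_range(da, dim):
    # Minimum and maximum of one cube dimension.
    return float(da[dim].min()), float(da[dim].max())

def split_chunk(chunk, sample_size=10):
    # Cut the time axis of an in-memory chunk into equal-length samples.
    n = chunk.sizes["time"] // sample_size
    return [chunk.isel(time=slice(i * sample_size, (i + 1) * sample_size))
            for i in range(n)]

print(get_dim_range(cube, "time"))  # (0.0, 99.0)
samples = split_chunk(cube)         # ten samples of ten time steps each
```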
- `preprocessing` module:
  - Added `fill_masked_data` to fill NaNs using different methods (mean, noise, constant; see the sketch below).
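The three fill strategies can be sketched in a few lines of NumPy; this illustrates the idea only, not the actual `fill_masked_data` signature:

```python
import numpy as np

def fill_nans(arr, method="mean", constant=0.0):
    out = arr.copy()
    nan_mask = np.isnan(out)
    if method == "mean":
        out[nan_mask] = np.nanmean(arr)
    elif method == "noise":
        # Gaussian noise matching the valid values' mean and spread.
        out[nan_mask] = np.random.normal(
            np.nanmean(arr), np.nanstd(arr), nan_mask.sum())
    elif method == "constant":
        out[nan_mask] = constant
    return out

a = np.array([1.0, np.nan, 3.0, np.nan])
print(fill_nans(a, "mean"))       # NaNs replaced by 2.0
print(fill_nans(a, "constant"))   # NaNs replaced by 0.0
```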
- `xr_plots` module:
  - Renamed from `geo_plots`, and transformed `plot_geo_data` into `plot_slice`, utilizing NumPy instead of GeoPandas for increased performance and broader usability.
- `gapfilling` module:
  - Generalized to accommodate any naming convention of cube dimensions.
  - Enabled user-defined directory storage for additional predictors extracted by `HelpingPredictor`.
- `datasets` module:
  - Updated `LargeScaleXrDataset` (for both PyTorch and TensorFlow) and the `XrDataset` class to handle multidimensional data samples.
- `preprocessing` module:
  - Updated `drop_nan_values` and `apply_filter` methods to handle multi-dimensional data.
  - Introduced the `drop_sample` option in `apply_filter` to decide between dropping samples and setting unmatched values to NaN.
  - Updated `drop_nan_values` to utilize a mask for sample validity, dropping invalid samples or those with NaN values in valid data (see the sketch below).
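A standalone sketch of the mask-based dropping idea (the real `drop_nan_values` signature may differ): a sample is kept only if it contains no NaN values at positions the mask marks as valid.

```python
import numpy as np

def drop_nan_values(samples, mask):
    # Keep a sample only if it has no NaNs where the mask marks valid data.
    keep = np.array([not np.isnan(s[mask]).any() for s in samples])
    return samples[keep]

samples = np.random.rand(5, 3, 3)
samples[0, 1, 1] = np.nan          # NaN inside the valid region -> dropped
samples[1, 0, 0] = np.nan          # NaN only where mask is False -> kept
mask = np.ones((3, 3), dtype=bool)
mask[0, 0] = False                 # this position is ignored for validity

print(drop_nan_values(samples, mask).shape)   # (4, 3, 3)
```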
- Added and updated examples in the `Examples` directory:
  - `distributed_dataset_creation.py` demonstrating the use of `MultiProcSampler`.
  - Updated `distributed_training.py` to utilize the new train and test sets.
  - New examples on ESDCs (`use_case_lst_pytorch_masked.ipynb` and `use_case_lst_tensorflow_masked.ipynb`) for multidimensional samples.
- `training` module:
  - Trainer classes for `sklearn`, PyTorch, and TensorFlow to improve usability.
  - `pytorch_distributed.py` for distributed machine learning (originally `distributed_training.py`).
- `datasets` module:
  - `XrDataset` for smaller `xarray` data manageable in memory.
  - `LargeScaleXrDataset` for PyTorch and TensorFlow to handle large datasets by iterating over (batches of) chunks.
  - Enabled batch training (partial fit) for the `training.sklearn` trainer class using either `LargeScaleXrDataset` (from `datasets.pytorch_xr`) or `XrDataset`, integrated with a PyTorch data loader (see the sketch below).
  - Configuration options for `XrDataset` and both `LargeScaleXrDataset` variants to select training data.
  - `prepare_dataloader` method for PyTorch and `prepare_dataset` for TensorFlow to configure data processing during training (e.g., batch size, distributed training).
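A self-contained sketch of the batch-training idea behind the sklearn trainer (the actual ml4xcube class and method names may differ): stream batches from a PyTorch `DataLoader` and update an sklearn model incrementally via `partial_fit`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from sklearn.linear_model import SGDRegressor

x = torch.randn(1000, 4)
y = x @ torch.tensor([1.0, -2.0, 0.5, 0.0]) + 0.1 * torch.randn(1000)
loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

model = SGDRegressor()
for xb, yb in loader:                     # one pass over chunked batches
    model.partial_fit(xb.numpy(), yb.numpy())
print(model.coef_)
```

Incremental fitting is what makes sklearn models usable on datasets that do not fit in memory: each chunk is consumed and discarded before the next one is loaded.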
- `preprocessing.py` module with methods:
  - `apply_filter` for filtering training data based on a filter variable contained in the dataset (see the sketch below).
  - `drop_nan_values` to remove data points containing any NaN values.
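A standalone sketch of the filter-variable idea (not the exact `apply_filter` signature); the `land_mask` variable is a hypothetical example of a filter variable contained in the dataset:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({
    "lst": (("time",), np.random.rand(10)),
    "land_mask": (("time",), np.random.rand(10) < 0.5),  # hypothetical filter
})
# Keep only the data points where the filter variable is True.
filtered = ds["lst"].where(ds["land_mask"], drop=True)
```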
- Published the `ml4xcube` package (v0.0.6) on PyPI and conda-forge.
- Renamed `mltools` to `ml4xcube` due to a name conflict.
- Updated `gapfilling` module:
  - Added a progress bar to visualize progress during gap filling.
  - Updated the final print statement to show the location of the gap-filled data.
- Renamed the `rand` method to `assign_rand_split` in the `data_assignment` module and unified its usage with `assign_block_split` to improve usability (see the sketch below).
- Updated the use cases within the `Examples` directory to demonstrate new and edited functionalities.
- Renamed module `distributed_training` to `pytorch_distributed`.
- Removed the `torch_training` module, including the contained methods `train_one_epoch` and `test`.
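A standalone sketch of the block-split idea behind `assign_block_split` (the real signature may differ): whole contiguous blocks are assigned to train or test, so that spatially or temporally autocorrelated neighboring samples do not leak across the split.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"lst": (("time",), np.random.rand(100))})
block_size, split = 10, 0.8

n_blocks = ds.sizes["time"] // block_size
block_flags = np.random.rand(n_blocks) < split       # one flag per block
ds["split"] = ("time",), np.repeat(block_flags, block_size)
```

A purely random per-sample split (the `assign_rand_split` case) would instead draw one flag per time step.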
- New functionality for gap filling in datasets, implemented in the `gap_dataset.py` and `gap_filling.py` scripts within the `mltools` package. This includes:
  - The `GapDataset` class for handling datasets with data gaps, enabling the slicing of specific dimensions and the addition of artificial gaps for testing gap-filling algorithms.
  - The `EarthSystemDataCubeS3` subclass to facilitate operations on the ESDC.
  - The `Gapfiller` class to fill gaps using machine learning models, with a focus on Support Vector Regression (SVR). This class supports various hyperparameter search methods and predictor strategies for gap filling (see the sketch below).
  - Methods in `gap_dataset.py` for retrieving additional data matrices as predictors and processing the actual data matrix for gap analysis.
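A standalone sketch of the SVR gap-filling idea (the real `GapDataset`/`Gapfiller` classes have richer interfaces, with hyperparameter search and predictor strategies): train an SVR on the observed pixels of a 2-D field and predict the missing ones from their coordinates.

```python
import numpy as np
from sklearn.svm import SVR

field = np.random.rand(20, 20)
field[5:8, 5:8] = np.nan                          # artificial gap

yy, xx = np.mgrid[0:20, 0:20]
coords = np.column_stack([yy.ravel(), xx.ravel()])
values = field.ravel()
observed = ~np.isnan(values)

svr = SVR().fit(coords[observed], values[observed])
filled = field.copy()
filled[np.isnan(field)] = svr.predict(coords[~observed])
```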
- Example script (`gapfilling_process.py`) and detailed usage instructions for the new gap-filling functionality, demonstrating how to apply these methods to real-world datasets.
- New `geo_plots.py` script within the `mltools` package, which includes:
  - `plot_geo_data` function for creating geographical plots of data from a DataFrame, complete with customizable visual features and the ability to save figures.
- The `data_processing` module of the `mltools` package has been renamed to `statistics` to more accurately reflect its purpose and functionality.
- Updated use cases in the `Examples` directory now utilize the `plot_geo_data` function for enhanced visualization of geographic data.
- Revision of examples: Updated the Python examples for `sklearn`, PyTorch, and TensorFlow to be compatible with the current versions of these packages.
- Streamlined the Conda (and pip) package creation process to facilitate faster and more efficient package builds with Miniconda.
- Transformed `mltools.py` into a Python package with modules: `cube_utilities.py`, `data_processing.py`, `torch_training.py`, `sampling.py`.
- New functions integrated into the corresponding modules based on functionality.
- Two new methods in `cube_utilities.py`:
  - `get_chunk_by_index` for retrieving data cube chunks by index (see the sketch below).
  - `rechunk_cube` for re-chunking the data cube.
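A standalone sketch of the chunk-indexing and re-chunking ideas using `xarray` with dask-backed arrays (requires `dask`); the helper below is a simplified stand-in, not the ml4xcube implementation:

```python
import numpy as np
import xarray as xr

cube = xr.DataArray(np.random.rand(100, 40),
                    dims=("time", "lat")).chunk({"time": 25, "lat": 20})

def get_chunk_by_index(da, i):
    # Map a flat chunk index to the cube slice covering that chunk.
    t_sizes, lat_sizes = da.chunks
    ti, li = divmod(i, len(lat_sizes))
    t0, l0 = sum(t_sizes[:ti]), sum(lat_sizes[:li])
    return da.isel(time=slice(t0, t0 + t_sizes[ti]),
                   lat=slice(l0, l0 + lat_sizes[li]))

chunk = get_chunk_by_index(cube, 3)              # 2nd time block, 2nd lat block
rechunked = cube.chunk({"time": 50, "lat": 40})  # re-chunk to a new layout
```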
- `distributed_training.py` added to the `mltools` package for PyTorch distributed training.
- Corresponding `distributed_training.py` example added to the `Examples` directory for practical use.
- Pip packaging support with the `pip_env` directory, including `setup.py` and `requirements.txt`.
- Conda packaging option with the `conda_recipe` directory containing `meta.yaml`.
- Organized the initial `mltools.py` functions by functionality and expanded the collection with new functions.
- Improved the performance of the `get_statistics` method (`data_processing.py`) by utilizing Dask's parallel computation capabilities.
- Removed semicolons from the use cases in the `Examples` directory to keep the code idiomatic Python.