Getting started#
How to use the benchmark and evaluate your own sampler
Installation#
The first step is to fork or clone the benchmark GitHub repository.
$ git clone https://github.com/dataiku-research/benchmark_tabular_active_learning.git
Then install all the required packages used by the benchmark. Run this command from the root of the benchmark folder:
$ pip install -r requirement.txt
Implement your custom sampler#
Sampler architecture#
Your sampler should be defined inside the my_sampler.py file, in the MyCustomSamplerClass class.
The sampler you want to evaluate in this benchmark must follow the architecture below, implementing the fit and select_samples methods:
import numpy as np


class MyCustomSamplerClass:
    """
    Args:
        batch_size: Number of samples to select.
        random_state: Random seeding.
    """

    def __init__(self, batch_size: int, random_state: RandomStateType = None):
        # do some stuff
        return

    def fit(self, X: np.ndarray, y: np.ndarray = None):
        """Fit the model on labeled samples.

        Args:
            X: Labeled samples of shape (n_samples, n_features).
            y: Labels of shape (n_samples).

        Returns:
            The object itself.
        """
        # do some stuff
        return self

    def select_samples(self, X: np.ndarray) -> np.ndarray:
        """Select samples from the unlabeled data using the internal scoring.

        Args:
            X: Pool of unlabeled samples of shape (n_samples, n_features).

        Returns:
            Indices of the selected samples of shape (batch_size,).
        """
        # do some stuff
        return index
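For illustration, here is a minimal sketch of a sampler following this architecture: a purely random sampler that draws batch_size indices uniformly from the unlabeled pool. It is only an example of the expected interface, not part of the benchmark.

import numpy as np

class MyRandomSampler:
    """Example sampler: selects a random batch from the unlabeled pool."""

    def __init__(self, batch_size: int, random_state=None):
        self.batch_size = batch_size
        self.rng = np.random.RandomState(random_state)

    def fit(self, X: np.ndarray, y: np.ndarray = None):
        # A random sampler learns nothing from the labeled data.
        return self

    def select_samples(self, X: np.ndarray) -> np.ndarray:
        # Draw batch_size distinct indices from the unlabeled pool.
        return self.rng.choice(X.shape[0], size=self.batch_size, replace=False)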
Sampler input parameters#
Your custom sampler parameters should be defined inside the main.py file or the main.ipynb file, in the get_my_sampler function.
Everything is already imported, so you only have to manage the sampler input parameters defined inside this function.
You can add your custom sampler parameters and remove the dynamic input parameters already implemented if your sampler does not use them.
This function will be used later in the benchmark to instantiate your sampler with its custom parameters, plus additional dynamic parameters if needed.
def get_my_sampler(params: dict):
    """
    Function used to instantiate your sampler with its parameters.

    Parameters:
        params: parameters used to build your sampler, containing the automatically
            generated 'batch_size', 'clf', 'iteration' and 'seed' entries, set according
            to the selected dataset, the current AL iteration and the current seed used.
            You can remove these parameters from the initialisation below if they are
            not used inside your custom sampler.
    """
    # TODO: add your custom sampler parameters and remove the default ones if not useful
    sampler = MyCustomSamplerClass(
        # remove some of these parameters if not useful
        batch_size=params['batch_size'],
        classifier=params['clf'],
        iteration=params['iteration'],  # current AL iteration
        random_state=params['seed'],    # important for reproducibility (use it as much as possible)
        # add your custom parameters here
    )
    return sampler
About the available dynamic parameters:
batch_size refers to the sampling batch size of the sampler. It is generated automatically according to the selected dataset.
clf refers to the estimator used by the sampler. It is generated automatically according to the selected dataset.
iteration refers to the current AL iteration.
seed refers to the current seed used. As it is very important for reproducibility, you should use this parameter inside your sampler as much as possible.
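To make the flow concrete, here is a minimal sketch of one AL step using these dynamic parameters. The toy data, the RandomForestClassifier and the literal values are placeholders chosen for illustration; the real values are generated by the benchmark, and the sketch assumes get_my_sampler is the function you defined above and that your sampler accepts the parameters you kept in it.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the labeled set and the unlabeled pool.
X_labeled = np.random.rand(50, 4)
y_labeled = np.random.randint(0, 2, size=50)
X_unlabeled = np.random.rand(200, 4)

# Hypothetical params dict, mimicking what the benchmark builds automatically
# from the selected dataset, the current AL iteration and the current seed.
params = {
    'batch_size': 20,
    'clf': RandomForestClassifier(),
    'iteration': 3,
    'seed': 42,
}

sampler = get_my_sampler(params)                # instantiate your sampler
sampler.fit(X_labeled, y_labeled)               # fit on the labeled samples
selected = sampler.select_samples(X_unlabeled)  # indices of the queried samples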
Run the benchmark#
After you have properly defined your custom sampler as shown above, there are two possible ways to run the benchmark, depending on the file in which you chose to define your sampler input parameters.
If you chose to define your input parameters inside the main.ipynb file, you can run the benchmark by running the notebook cells.
On the other hand, if you chose to define your input parameters inside the main.py file, you can run the benchmark by typing the command below from your terminal (at the root of the benchmark folder).
python main.py -datasets_ids [list of datasets ids you want to run]
# Example :
# python main.py -datasets_ids 1461 cifar10
Note: If you want to run all the benchmark datasets in a row, you can leave the datasets_ids argument empty:
python main.py -datasets_ids
Save your results#
After you have run the benchmark for a dataset, a window will automatically pop up, giving you the option to merge your sampler's results into the benchmark results.
If you agree to share your results with the AL community, you just need to open a pull request on the main repository so that your experiments can be verified and merged.
Your results will then be available to everyone.