Measuring Data Drift

Data drift refers in changes to the information in your project dataset over time. This can happen intentionally as a result of changing the scope of your model (i.e. adding a new class to identify, which requires gathering information in a new environment), but can also happen unintentionally (i.e. as a result of adding a camera to collect images in an environment different from that where your model is operating).

To measure data drift, first create a new project in Roboflow. Then, provide the ID associated with that project as the DRIFT_PROJECT documented above in the Configure Roboflow Collect documentation.

For each frame gathered by Roboflow Collect, there is a 1 in 100 chance that image will be uploaded to your DRIFT_PROJECT.

The drift.py script measures the difference between the average CLIP vector in your DRIFT_PROJECT versus the average CLIP vector for all images in the validation set of your ROBOFLOW_PROJECT.

After gathering some data from Roboflow Collect, run the drift.py script with these arguments:

python3.9 drift.py --ROBOFLOW_KEY=<key> --ROBOFLOW_WORKSPACE=<workspace> --ROBOFLOW_PROJECT=<project> --DRIFT_PROJECT=<drift-project>

Where ROBOFLOW_PROJECT is the main project in which you are gathering data, and DRIFT_PROJECT is where you are randomly collecting images.

This script will return a table showing data drift for each month for which data is available, as well as an aggregate score showing how similar images are between your ROBOFLOW_PROJECT and the images collected via Roboflow Collect and saved in the DRIFT_PROJECT.