The Million Dollar Question: What are the common use cases for unit testing in a data science code pipeline?
As you are no doubt aware, the question is not whether unit testing belongs in a DS pipeline; the answer is most definitely hell yes! The real question is what to look out for when unit testing.
What is testing?
Testing is the task of checking whether actual results match expected results, which helps you catch bugs early. It cannot guarantee that your code is bug-free, but it makes bugs far less likely to slip through unnoticed.
Testing should be an integral part of any project, and data science projects are no exception, whether at the analysis stage, the data preparation stage, the algorithmic stage or even the model evaluation stage!
In this blog post, we will cover some of the more common and less common cases in which we implemented specific unit tests here at Fyber.
Data drift tests
Data drift is a change in any aspect of the data: its timing (e.g. being ingested every 2 hours instead of every hour), its structure (e.g. an old column being removed or a new column being added), its type (e.g. a column changing from int to string), and so on.
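To make the structure and type examples concrete, here is a minimal sketch of a schema check, assuming the data lives in a pandas DataFrame. The column names, the expected schema and the `schema_problems` helper are all hypothetical and only illustrate the idea; timing drift is not covered here.

```python
import pandas as pd

# Hypothetical expected schema: column name -> pandas dtype name.
EXPECTED_SCHEMA = {"user_id": "int64", "revenue": "float64"}

def schema_problems(frame: pd.DataFrame, expected: dict) -> list:
    """Return a list of human-readable differences from the expected schema."""
    problems = []
    for column, dtype in expected.items():
        if column not in frame.columns:
            problems.append(f"missing column: {column}")
        elif str(frame[column].dtype) != dtype:
            problems.append(
                f"type changed: {column} is {frame[column].dtype}, expected {dtype}"
            )
    for column in frame.columns:
        if column not in expected:
            problems.append(f"unexpected column: {column}")
    return problems

# A frame that matches the expected schema.
good = pd.DataFrame({
    "user_id": pd.Series([1, 2], dtype="int64"),
    "revenue": pd.Series([0.5, 1.2], dtype="float64"),
})

# A drifted frame: user_id became a string, revenue is gone, clicks is new.
drifted = pd.DataFrame({
    "user_id": ["1", "2"],
    "clicks": pd.Series([3, 4], dtype="int64"),
})

def test_schema_is_unchanged():
    assert schema_problems(good, EXPECTED_SCHEMA) == []
```

Returning a list of problems rather than failing on the first one makes the test output more useful when several things drift at once.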
Relevant use cases can then be:
1. Null values inside a column
This use case covers situations where, following some processing along the way or some data drift, null values show up in a column.
Tests that can be written to tackle this include:
- Check whether a column contains nulls and return a boolean flag accordingly / assert that it contains none
This fits the case where null is not expected in a particular column
- Count the number of null / non-null values inside a column
This is relevant where null is expected in a column, for any given reason, or where we would like to see the number of null vs. non-null values in a certain column
- Iterate through all / a few columns and check each of them for nulls
This fits cases where null is not expected in any of the columns
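The three null checks above could be sketched as follows, assuming a pandas DataFrame. The sample data, column names and helper functions are hypothetical and only there for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "country": ["US", None, "DE", "FR"],
    "revenue": [0.5, 1.2, None, None],
})

def has_nulls(frame: pd.DataFrame, column: str) -> bool:
    """Return True if the column contains at least one null value."""
    return bool(frame[column].isna().any())

def null_counts(frame: pd.DataFrame, column: str) -> dict:
    """Return the number of null and non-null values in the column."""
    n_null = int(frame[column].isna().sum())
    return {"null": n_null, "non_null": len(frame) - n_null}

def columns_with_nulls(frame: pd.DataFrame, columns=None) -> list:
    """Return the given columns (default: all of them) that contain nulls."""
    columns = list(frame.columns) if columns is None else list(columns)
    return [c for c in columns if frame[c].isna().any()]

def test_user_id_has_no_nulls():
    # A column where null is never expected.
    assert not has_nulls(df, "user_id")
```

If you use a test runner such as pytest (an assumption, any runner would do), functions named `test_...` are picked up automatically, while the helpers can also be called inside the pipeline itself to log null counts.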
2. Size / Count of a DataFrame / Column
This use case comes up after data processing steps such as removing outliers or filtering. We’d like to validate the size of the data we are left with, to make sure we don’t feed our model, or any further step in our pipeline, a DataFrame that is “suddenly” too big or too small.
Therefore, we can write different tests / validity checks to tackle this:
- Count DataFrame size / Count Column size
This can be used as a kind of validity check, to log the size of a DataFrame once manipulations have been performed on it.
- Size Assertion Tests
This fits the case where we know (even as a rough estimate) what our output data size is likely to be following data manipulations. Examples include more than 0 records, more than 1M records, not more than 10M records, between 1M and 2M records, etc.
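Both size checks can be sketched in a few lines, again assuming pandas. The `log_size` and `assert_row_count` helpers and the thresholds are hypothetical; in a real pipeline the bounds would come from what you know about your own data.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)

def log_size(frame: pd.DataFrame, step: str) -> int:
    """Log and return the number of rows left after a named pipeline step."""
    n_rows = len(frame)
    logger.info("rows after %s: %d", step, n_rows)
    return n_rows

def assert_row_count(frame: pd.DataFrame, min_rows: int = 1, max_rows=None) -> None:
    """Fail if the row count falls outside the inclusive [min_rows, max_rows] range."""
    n_rows = len(frame)
    assert n_rows >= min_rows, f"expected at least {min_rows} rows, got {n_rows}"
    if max_rows is not None:
        assert n_rows <= max_rows, f"expected at most {max_rows} rows, got {n_rows}"

# Hypothetical data and filter standing in for outlier removal.
raw = pd.DataFrame({"value": range(10)})
filtered = raw[raw["value"] < 8]

log_size(filtered, "outlier removal")
assert_row_count(filtered, min_rows=5, max_rows=10)
```

Keeping the logging and the assertion separate lets you record sizes everywhere but fail only where a bound is actually known.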
3. Domain-specific tests
This section covers tests that depend on the specific domain in which you are working.
Let’s take a generic example: in a chosen column, I know I should always, at any given time and without any filtering applied, have exactly 2 categories (i.e. 2 unique values). So I can count the distinct values in that column to make sure I am analyzing the relevant data.
Specific examples might include:
- E-commerce world: analyzing and processing data for every gender
- Banking world: analyzing and processing data for every “income-group” (low, medium, high…)
- Ad-Tech world: analyzing and processing data for every “Device Operating System”
Therefore, we can write a number of tests / validity checks to tackle this:
- Count Distinct on a specific column
- Aggregate the data by each “category” and count the number of records / sum over some KPIs
This can be a more detailed case, where it is not enough for you to just “have” a category in the data. Instead, you would like to make sure that this category contains enough data, following manipulations made on your DataFrame.
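As a rough sketch of both checks, here is an Ad-Tech flavored example, assuming pandas. The `device_os` and `impressions` columns, the expected categories and the minimum row threshold are all hypothetical.

```python
import pandas as pd

# Hypothetical event data; column names and values are illustrative only.
events = pd.DataFrame({
    "device_os": ["android", "ios", "android", "ios", "android"],
    "impressions": [100, 250, 80, 40, 120],
})

EXPECTED_OS = {"android", "ios"}

def assert_expected_categories(frame: pd.DataFrame, column: str, expected: set) -> None:
    """Fail if the distinct non-null values differ from the expected set."""
    actual = set(frame[column].dropna().unique())
    assert actual == set(expected), f"unexpected categories in {column}: {actual}"

def assert_min_rows_per_category(frame: pd.DataFrame, column: str, min_rows: int) -> None:
    """Fail if any category is left with fewer than min_rows records."""
    counts = frame.groupby(column).size()
    too_small = counts[counts < min_rows]
    assert too_small.empty, f"categories below {min_rows} rows: {too_small.to_dict()}"

def kpi_per_category(frame: pd.DataFrame, column: str, kpi: str) -> pd.Series:
    """Sum a KPI column within each category."""
    return frame.groupby(column)[kpi].sum()

assert_expected_categories(events, "device_os", EXPECTED_OS)
assert_min_rows_per_category(events, "device_os", min_rows=2)
```

Note that `assert_min_rows_per_category` only sees categories that are present in the data, so it is worth pairing it with `assert_expected_categories` to catch a category that has disappeared entirely.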
Finally, if you have a good idea, please contact me with your use cases for unit testing in the data science world.