Ever wondered how to develop an ML model on Spark and actually make it production-grade?
Ever asked yourself how to get an ML model to production quickly, without resorting to Python's pickle or otherwise "dumping" it?
MLeap is a common serialization format and execution engine for machine learning pipelines. It supports Spark, Scikit-learn and Tensorflow for training pipelines and exporting them to an MLeap Bundle.
The above paragraph is taken from MLeap’s official documentation
Fyber’s story with MLeap
At Fyber, our big-data architecture is based on Spark. As a result, our Data Science team needed to adjust itself to develop ML with Spark, which can sometimes be challenging.
And so, our story begins…
We developed an ML model using the known workflow:
- Performed EDA on our data
- Explored our features
- Engineered some new features
- Fed the data into an ML model (XGBoost4J-Spark, the Spark version of XGBoost)
- Tested our model and made sure our metrics got better
- Tuned hyper-parameters
- Tweaked the model
- and so on….
Then, after a few iterations, we were finally ready to deploy it to Production.
But, guess what? We couldn’t save it and pass it on to other teams (e.g. the Data Engineering team).
The main issues were mostly low-level API bugs in the specific ML library, so we searched for a more collaborative way to share models between the different teams.
That’s how we got to know MLeap!
From a Data Scientist’s point of view, the main advantages of MLeap are that it is:
- Super-easy to implement in your environment (just add a new .jar)
- Wraps any supported ML library in its own MLeap "Bundle" (which makes it easy to pass on to other teams as a designated API)
- Keeps your model as it was! Your parameters, your tweaks, same as it was the last time you “played” with it!
- Open Source — it’s always fun to give back to the community, whether via code or documentation, so you can always contribute and help others
- Preserves the order of your pipeline stages: however many steps your data-model workflow has, MLeap saves each and every one of them, and transforms data only in that order, just as you designed and developed it
Now, let’s get into code, and see how it happens.
1. First, after installing MLeap, import the relevant libraries:
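A minimal sketch of the Spark-side imports, assuming a Scala codebase with the `mleap-spark` artifact on the classpath (artifact coordinates depend on your Spark/Scala versions):

```scala
// Bundle I/O plus the implicits that add writeBundle/loadBundle
// methods onto Spark's PipelineModel.
import ml.combust.bundle.BundleFile
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.bundle.SparkBundleContext
import resource._ // scala-arm, used by MLeap's docs for managed resources
```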
2. Once training is complete, serialize your model to a directory on your machine (or cluster):
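Something like the following, where `pipelineModel` is your fitted Spark `PipelineModel` and `df` is a sample `DataFrame` (both placeholder names); the bundle context captures the transformed schema so MLeap can reproduce it at serving time:

```scala
// Capture the output schema of the fitted pipeline.
val sbc = SparkBundleContext().withDataset(pipelineModel.transform(df))

// Write the whole pipeline, stage by stage, into a single zip bundle.
for (bundleFile <- managed(BundleFile("jar:file:/tmp/my-model.zip"))) {
  pipelineModel.writeBundle.save(bundleFile)(sbc).get
}
```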
3. Next, when you want to “unpack” this bundle, use something like this:
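For example, loading it with the MLeap runtime, which needs no SparkContext at all (the bundle path is a placeholder):

```scala
import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import resource._

// Read the zip bundle back into an in-memory MLeap transformer graph.
val bundle = (for (bundleFile <- managed(BundleFile("jar:file:/tmp/my-model.zip"))) yield {
  bundleFile.loadMleapBundle().get
}).opt.get
```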
4. Extract the same pipeline that the DS team had developed:
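The root of the loaded bundle is that same pipeline, with every stage in its original order:

```scala
// The exact pipeline the DS team exported, parameters and all.
val mleapPipeline = bundle.root
```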
5. Finally, transform data on top of it:
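A sketch of scoring a single row: MLeap transforms a `LeapFrame` instead of a Spark `DataFrame`, so you build one that matches the schema the pipeline expects (the column names and values below are placeholders, not from the original model):

```scala
import ml.combust.mleap.core.types._
import ml.combust.mleap.runtime.frame.{DefaultLeapFrame, Row}

// Schema must line up with the pipeline's expected input columns.
val schema = StructType(
  StructField("feature_a", ScalarType.Double),
  StructField("feature_b", ScalarType.Double)
).get

val frame = DefaultLeapFrame(schema, Seq(Row(1.0, 2.0)))

// Runs every stage in order, exactly as Spark would.
val result = mleapPipeline.transform(frame).get
val predictions = result.select("prediction").get
```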
That's all! MLeap is easy to use, easy to implement, and makes it easy to ship production-grade models with Spark!
Enjoy your (Machine) Learning!