• Rick Lentz

Active Learning for Unstructured Data, A CRISP-DM Template

Updated: Jan 18, 2019

Business understanding

Businesses gain competitive advantage by efficiently leveraging information asymmetries. The frontier of information technology investment has been in realizing more value from unstructured data, including voice, social networking, transaction dialogs, and other technical data collections. An Active Learning platform gives business users a way to interact with these unstructured data sources and extract value from them, which makes it an attractive component of a competitive information advantage strategy.

The key value proposition of an Active Learning platform is reducing the cost of labeling unstructured data. As with any tool, the business also cares about other metrics. The platform should cost very little to maintain when not in active use. Information advantage also has a timeliness component, so the platform must scale within minutes when needed. Finally, it should be easily accessible and ‘Google easy’ to use, minimizing tedium and onboarding hurdles to adoption.

Active Learning Cycle

Data understanding

Although current deep learning approaches operate directly on raw data, reducing the dimensionality of unstructured data into embeddings is a proven, practical and useful approach.
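As a rough illustration of reducing text to low-dimensional embeddings, the sketch below swaps in TF-IDF followed by truncated SVD (latent semantic analysis) rather than a deep model. The sample documents and the choice of two components are assumptions made only for this example; they are not taken from the platform described here.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents standing in for unstructured text sources.
docs = [
    "customer called about a late delivery",
    "delivery arrived late and the customer complained",
    "invoice total does not match the purchase order",
    "purchase order and invoice amounts differ",
]

# Sparse, high-dimensional representation: one column per vocabulary term.
tfidf = TfidfVectorizer().fit_transform(docs)

# Dense, low-dimensional embedding: one short vector per document.
svd = TruncatedSVD(n_components=2, random_state=0)
embeddings = svd.fit_transform(tfidf)

print(tfidf.shape, "->", embeddings.shape)
```

The same downstream steps apply if the embeddings come from a pretrained model instead; only the way the vectors are produced changes.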

Data preparation

Text and image data initially go through several processing steps. For text data, these may include translation and filtering of specific encodings, whitespace, or character sequences. For image data, they can include transcoding, rotation, scaling, normalization, and cropping.
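The text-side steps can be sketched with the Python standard library. The function name `clean_text` and the specific rules (Unicode normalization, dropping control characters, removing URLs, collapsing whitespace) are illustrative assumptions, not the platform's actual pipeline.

```python
import re
import unicodedata

# Control characters other than tab, newline, and carriage return,
# which are left in place and collapsed as whitespace below.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
_URLS = re.compile(r"https?://\S+")
_WHITESPACE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Normalize encoding variants, drop unwanted sequences, collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)  # e.g. non-breaking space -> space
    text = _CONTROL_CHARS.sub("", text)        # filter stray control characters
    text = _URLS.sub(" ", text)                # filter a specific character sequence
    text = _WHITESPACE.sub(" ", text)          # collapse runs of whitespace
    return text.strip()


print(clean_text("  Hello\u00a0 world\x00!\n see https://example.com/x now "))
```

Which sequences to filter depends on the data source; URLs are used here only as one plausible case.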


Modeling

Scikit-learn’s Random Forest and Logistic Regression models are used for training and prediction. The training step uses the labeled samples, and prediction is performed on the unlabeled samples. Training takes the labels together with the embeddings, which can be obtained using pretrained language models.
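A minimal sketch of this train-on-labeled, predict-on-unlabeled split follows. The embeddings are synthetic random vectors standing in for language-model output, and the sample counts, dimensions, and hyperparameters are assumptions chosen only to make the example run.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples with 16-dimensional embeddings in two clusters.
X = np.vstack([rng.normal(-1.0, 1.0, (100, 16)), rng.normal(1.0, 1.0, (100, 16))])
y = np.array([0] * 100 + [1] * 100)

# Pretend only 10 samples per class have been labeled so far.
labeled_idx = np.concatenate([
    rng.choice(100, size=10, replace=False),
    100 + rng.choice(100, size=10, replace=False),
])
unlabeled_mask = np.ones(len(X), dtype=bool)
unlabeled_mask[labeled_idx] = False

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Train on the labeled pool, then score every unlabeled sample.
probs = {}
for name, model in models.items():
    model.fit(X[labeled_idx], y[labeled_idx])
    probs[name] = model.predict_proba(X[unlabeled_mask])

print({name: p.shape for name, p in probs.items()})
```

The class probabilities for the unlabeled pool are the input for deciding which sample to label next.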


Evaluation

Evaluation occurs between rounds of user labeling. Labeled samples are used to train the model, and unlabeled samples are scored and ranked so that the most informative sample can be labeled next. Stopping conditions are also evaluated to determine whether additional labeling would yield a useful improvement to the model.
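One common way to score and rank unlabeled samples is uncertainty sampling by prediction entropy, paired with a simple plateau-based stopping rule. Both are assumptions used for illustration; the scoring function, the `min_gain` and `patience` parameters, and the function names are hypothetical and not taken from the platform.

```python
import math


def entropy(probs):
    """Shannon entropy of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def rank_by_uncertainty(class_probs):
    """Return sample indices ordered from most to least uncertain."""
    return sorted(
        range(len(class_probs)),
        key=lambda i: entropy(class_probs[i]),
        reverse=True,
    )


def should_stop(accuracy_history, min_gain=0.005, patience=3):
    """Stop when each of the last `patience` rounds gained less than `min_gain`."""
    if len(accuracy_history) <= patience:
        return False
    recent = accuracy_history[-(patience + 1):]
    gains = [b - a for a, b in zip(recent, recent[1:])]
    return all(g < min_gain for g in gains)


# Predicted class probabilities for three unlabeled samples.
ranking = rank_by_uncertainty([[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]])
print(ranking)  # the sample closest to 50/50 comes first
```

The first index in the ranking is the sample to show the user next; `should_stop` is checked after each round using whatever validation metric the business tracks.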


Deployment

The Active Learning system is implemented entirely using cloud services to minimize lifecycle costs and to provide dynamic compute resources for search, training, and scoring. This approach reduces fixed costs while making the best use of the user’s time on the platform.

If interested in a basic walkthrough of an Active Learning cycle, please check out this Jupyter Notebook:

