Pro Tip: Classify your data to speed up innovation

Pro Tip: Use machine learning to analyze protein structures and classify them as strong candidates for gelation and emulsification.

Identifying novel protein-based gelators and emulsifiers is a time-consuming task. Even the best researchers require weeks to conduct the experiments needed to compare proteins, and with virtually unlimited protein sources available, tools that narrow the field to the best candidates are well worth using.

In a previous Pro Tip, we discussed constructing bioinformatic models and suggested that they can be scanned automatically for features, such as surface hydrophobicity and hydrogen bonding, that affect gelation and emulsification.

In this tip, we propose different machine-learning algorithms for choosing the best proteins for emulsification and gelation based on bioinformatic information. This can help your team identify potential egg replacers, or other functional ingredients, with much less bench time.

Bioinformatic models contain huge amounts of information. Even after the important features have been extracted, such as amino acid composition, surface characteristics and secondary structure, it can be difficult to tell which of them actually matter for gels and emulsions.

Once the extracted data is organized in .csv files, Python’s scikit-learn library can be used to quickly find the key variables. The Random Forest Regressor is one of the fastest and easiest machine-learning algorithms to use for finding variables associated with gelation and emulsification. After those features are identified in the protein models, classifiers can be used to separate the proteins into candidates for different functions.
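As a rough illustration of the feature-ranking step, the sketch below fits scikit-learn's RandomForestRegressor and sorts the features by importance. The feature names, the hyperparameters and the data are all assumptions made for the example: the numbers are synthetic and constructed so that one feature dominates, and a real table would be loaded from your own .csv files instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical feature names; a real table would come from your own .csv files.
FEATURE_NAMES = [
    "surface_hydrophobicity",
    "net_charge",
    "molecular_weight_kda",
    "helix_fraction",
]


def rank_features(X, y, feature_names, seed=0):
    """Fit a random forest and return (name, importance) pairs, highest first."""
    model = RandomForestRegressor(n_estimators=200, random_state=seed)
    model.fit(X, y)
    order = np.argsort(model.feature_importances_)[::-1]
    return [(feature_names[i], float(model.feature_importances_[i])) for i in order]


# Synthetic stand-in data, NOT real measurements. The target (imagine a gel
# strength score) is built to depend mostly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, len(FEATURE_NAMES)))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

ranking = rank_features(X, y, FEATURE_NAMES)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances like these are a quick first screen. They can be skewed when features are strongly correlated, so it is worth confirming the top-ranked variables against bench data.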

There are a number of popular classifiers, including Random Forest, Discriminant Analysis and Artificial Neural Networks. These techniques can all be applied from scikit-learn to separate proteins into different categories based on their similar attributes.

For example, if the goal is to replicate the gels formed by ovalbumin (~55% of the protein in egg white), finding similar structures in your protein dataset may suggest similar mechanisms of gelation.

However, it is also possible to classify based on the bioinformatic features that are fundamental to gelation, as identified in the random forest model. By grouping in this way, the most promising gel formers can be identified quickly.
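To make the classification step concrete, here is a minimal sketch that compares three scikit-learn classifiers of the kinds named above using five-fold cross-validation. Everything specific in it is an assumption: the data is a synthetic table from make_classification standing in for selected protein features, the two classes stand in for labels such as "good gel former" and "poor gel former", and the model settings are arbitrary starting values.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a table of protein features with a two-class label.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# Hypothetical model settings; tune them for a real dataset.
classifiers = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "discriminant_analysis": LinearDiscriminantAnalysis(),
    # Neural networks train more reliably on standardized inputs.
    "neural_network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    ),
}

# Mean accuracy over five cross-validation folds for each classifier.
scores = {
    name: float(cross_val_score(clf, X, y, cv=5).mean())
    for name, clf in classifiers.items()
}

for name, score in scores.items():
    print(f"{name}: mean accuracy {score:.2f}")
```

Cross-validated accuracy gives a fairer comparison than accuracy on the training data, which matters when only a small number of proteins have been measured at the bench.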

In our work, we have found that grouping proteins by their hydrophobicity, ratios of positively and negatively charged amino acids, and molecular weight is a great starting point for predicting the strength of the emulsions they form.
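One possible way to form such groups is unsupervised clustering. The sketch below uses k-means from scikit-learn, which is an assumption for illustration and not necessarily the method used in the work described here. The three columns mirror the features named above, but the values, units and number of clusters are invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, NOT real measurements. Columns (hypothetical units):
# hydrophobicity index, positive-to-negative residue ratio, molecular weight (kDa).
rng = np.random.default_rng(2)
group_a = rng.normal(loc=[0.2, 0.8, 20.0], scale=[0.02, 0.05, 1.0], size=(30, 3))
group_b = rng.normal(loc=[0.6, 1.5, 60.0], scale=[0.02, 0.05, 1.0], size=(30, 3))
features = np.vstack([group_a, group_b])

# Standardize so that molecular weight does not dominate the distance measure.
scaled = StandardScaler().fit_transform(features)

# Two clusters is an assumption; choose the number that suits your dataset.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)

for cluster in sorted(set(labels.tolist())):
    members = features[labels == cluster]
    print(f"cluster {cluster}: {len(members)} proteins, "
          f"mean molecular weight {members[:, 2].mean():.1f} kDa")
```

Each resulting group can then be compared against measured emulsion strength to see whether the grouping tracks function.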

Harrison Helmick is a PhD candidate at Purdue University. Connect on LinkedIn and see his other baking tips at BakeSci.com.

His research is conducted with the support of Jozef Kokini, Andrea Liceaga, and Arun Bhunia.