Eugene Tuv

How we ranked at the top in all Tweakathons!

Hi, my name is Eugene Tuv. I am a principal research scientist with the Logic Technology Group at Intel Corp. I have been leading the ASML (statistical machine learning) group for nearly 15 years. Our team focuses on the research, prototyping, development, and deployment of scalable, robust, real-time, state-of-the-art machine learning and computer vision systems that accelerate semiconductor process development and yield learning.

From the beginning, our primary focus has been on building “universal” learning machines that can handle massive, heterogeneous, noisy, and rapidly changing data and that are easy to deploy for a variety of learning tasks: predictive modeling, feature learning/selection, image/signal analysis, fault detection and isolation, etc.

We have routinely “validated” our internally developed learning engines by participating in academic ML competitions since 2003, consistently demonstrating top results. This has kept us on our toes and has been a major driving force for continuous innovation and improvement.

Machine learning challenges are designed, validated, and supervised by top researchers in the field. They are associated with the top academic conferences and draw worldwide participation from major universities and labs. Participants are guaranteed to compete against the best minds in machine and statistical learning, use the best known technologies, and learn about emerging research trends.

The AutoML challenge is an integral part of the rapid advancement of ML research towards flexible learning machines capable of tackling problems previously deemed impossible. AutoML specifically targets the design of “universal” learning machines with wide applicability and minimal human supervision: “Designing for AutoML, the challenge that takes the human out of the loop.”

Summary of method:

Eugene Tuv presented the method of the ideal.intel.analytics team at the CiML workshop at NIPS, Montreal, December 2015. The software used in the challenge is proprietary. It is a fast implementation of boosted decision trees written in C. The software was developed to classify defects in semiconductor manufacturing. The Intel data include time series, which are first preprocessed into a bag-of-features representation. Using this software, the Intel team has consistently ranked high in ChaLearn challenges since 2003, usually entering only in the last few days of a challenge: overall second in the NIPS 2003 “feature selection challenge”, second on the Gina dataset in the WCCI 2006 “performance prediction challenge”, overall second in the agnostic track of the IJCNN 2007 “agnostic learning vs. prior knowledge challenge”, and winner of the AISTATS 2010 “active learning challenge”.

The method is based on gradient boosting of trees built on random subspaces that are dynamically adjusted to reflect learned feature relevance. A Huber loss function is used. No pre-processing was done, except for feature selection. The feature selection method is a forward selection technique that incrementally adds the features that are most predictive in the null space of the previously selected features. The feature set is augmented with a number of “artificial contrast” features, which are irrelevant by construction. A threshold on the fraction of irrelevant features selected (the false discovery rate) is used to halt the procedure. The classification method, called “Stochastic Gradient Tree and Feature Boosting” (SGTFB), selects a small sample of features at every step of the ensemble construction. The sampling distribution is modified at every iteration to promote the more relevant features. The SGTFB complexity is of the order of Ntree × Ntr × log Ntr × log Nfeat, where Ntree is the number of trees, Ntr the number of training examples, and Nfeat the number of features.
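The Intel software is proprietary, so the following is only a toy sketch of the general idea described above: boosting on the Huber loss, with each step fit on a small random subset of features, and the feature sampling weights increased for features that turn out to be useful. Everything specific here is an assumption made for illustration: the use of single-split stumps instead of full trees, the function names, the default parameters, and in particular the weight update rule (adding the per-sample squared-error reduction of the chosen split). It is not the SGTFB implementation.

```python
import random
import statistics


def huber_gradient(residual, delta):
    """Negative gradient of the Huber loss w.r.t. the prediction (clipped residual)."""
    return max(-delta, min(delta, residual))


def sample_features(rng, weights, k):
    """Draw k distinct feature indices with probability proportional to weights."""
    pool = list(range(len(weights)))
    chosen = []
    for _ in range(min(k, len(pool))):
        j = rng.choices(pool, weights=[weights[p] for p in pool])[0]
        pool.remove(j)
        chosen.append(j)
    return chosen


def fit_stump(X, g, features):
    """Least-squares stump fit to g, restricted to the candidate features.

    Returns (gain, feature, threshold, left_value, right_value) or None.
    """
    n, total = len(X), sum(g)
    best = None
    for j in features:
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += g[order[k - 1]]
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if lo == hi:
                continue  # cannot split between equal values
            right_sum = total - left_sum
            # Reduction in squared error relative to predicting the mean of g.
            gain = left_sum ** 2 / k + right_sum ** 2 / (n - k) - total ** 2 / n
            if best is None or gain > best[0]:
                best = (gain, j, (lo + hi) / 2, left_sum / k, right_sum / (n - k))
    return best


def boost(X, y, n_trees=60, n_sub=2, lr=0.1, delta=1.0, seed=0):
    """Boost stumps on random feature subsets whose sampling weights adapt."""
    rng = random.Random(seed)
    n = len(X)
    weights = [1.0] * len(X[0])  # start from uniform feature sampling
    init = statistics.median(y)
    pred = [init] * n
    stumps = []
    for _ in range(n_trees):
        g = [huber_gradient(yi - pi, delta) for yi, pi in zip(y, pred)]
        stump = fit_stump(X, g, sample_features(rng, weights, n_sub))
        if stump is None:
            continue
        gain, j, thr, left, right = stump
        # Assumed update rule: promote the feature that produced the split.
        weights[j] += max(gain, 0.0) / n
        for i, row in enumerate(X):
            pred[i] += lr * (left if row[j] <= thr else right)
        stumps.append((j, thr, left, right))
    return (init, lr, stumps), weights


def predict(model, row):
    """Sum the shrunken stump outputs on top of the initial median."""
    init, lr, stumps = model
    return init + lr * sum(l if row[j] <= t else r for j, t, l, r in stumps)


def huber_loss(y, pred, delta=1.0):
    """Mean Huber loss: quadratic near zero, linear beyond delta."""
    total = 0.0
    for yi, pi in zip(y, pred):
        r = abs(yi - pi)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(y)
```

Calling `boost(X, y)` on a list of feature rows and a list of targets returns the fitted model together with the final feature sampling weights. On data where only one feature carries signal, that feature's weight should grow fastest, which is the qualitative behavior the paragraph above describes.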

The asml.intel.com team (Victor Kocheganov) competed in an attempt to reproduce Intel's proprietary solution using publicly available software. He used the gradient tree boosting (GTB) implementation in scikit-learn.
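As a rough illustration of what a publicly available counterpart might look like, here is a minimal scikit-learn sketch. The source does not say which estimator or settings were used in the challenge, so every choice below is an assumption: the helper names `fit_public_gtb` and `demo`, the synthetic data, and all parameter values. In scikit-learn the Huber loss is offered by `GradientBoostingRegressor`, so the sketch is a regression example; `subsample` gives stochastic boosting and `max_features` gives a random feature subset per split, but the subset is sampled uniformly and is not adapted to learned feature relevance.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split


def fit_public_gtb(X, y, seed=0):
    """Hypothetical helper: gradient tree boosting with a Huber loss."""
    model = GradientBoostingRegressor(
        loss="huber",         # robust loss, as in the method summary
        alpha=0.9,            # quantile at which the Huber loss turns linear
        n_estimators=200,
        learning_rate=0.1,
        max_depth=3,
        subsample=0.5,        # stochastic boosting: half the rows per tree
        max_features="sqrt",  # uniform random feature subset at each split
        random_state=seed,
    )
    return model.fit(X, y)


def demo(seed=0):
    """Fit on synthetic data and return the model with its held-out R^2."""
    X, y = make_regression(
        n_samples=400, n_features=10, n_informative=5,
        noise=1.0, random_state=seed,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = fit_public_gtb(X_train, y_train, seed=seed)
    return model, model.score(X_test, y_test)
```

After fitting, `model.feature_importances_` gives a per-feature relevance estimate, which is the closest built-in analogue to the learned feature relevance mentioned in the method summary.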

Slides from the CiML 2015 challenge on AutoML: [download]

References:

Borisov, A., Eruhimov, V., and Tuv, E. Tree-Based Ensembles with Dynamic Soft Feature Selection. In Feature Extraction: Foundations and Applications, Studies in Fuzziness and Soft Computing, Vol. 207, Guyon, I., Gunn, S., Nikravesh, M., and Zadeh, L.A. (Eds.), Springer, 2006.