Machine Learning and Data Science in R on Microsoft ML and SQL Servers - with Rafal Lukawiecki

This intensive, hands-on, 5-day course is designed for those who want to learn machine learning and data science with R in more depth. While this week will teach you a lot about the free, open source R, you will also see how Microsoft ML Server and SQL Server ML Services make R fast, scalable, and enterprise-ready while keeping it easy to use. Together with RStudio, these are the key environments that we will focus on during this course.

You will learn 

  • how to prepare and visualise data in R
  • how to build, evaluate and validate models, and how to deploy them to production in-database, in-app and as a web service 
  • the most important classification, clustering, regression and forecasting algorithms 
  • how to write your own code for analysing your results and drawing statistically meaningful conclusions, and how to understand the limits of confidence that you can place in that approach.

How you will benefit 

You will not only learn the technology and how to use it, but, much more importantly, you will understand how ML works, how to avoid common mistakes, such as overfitting/overtraining, how to balance model accuracy against its reliability—the bias-variance trade-off—and how to relate many important ML performance metrics to your business goals.

You will gain clarity on how to start your data science projects and how to finish them. You will know how to express the business need in terms of testable hypotheses, which will guide model building and selection. You will understand what types of work are suited to ML, and which are unlikely to deliver results. You will discover what makes good first projects in your own area of specialisation.

These are the key benefits of studying machine learning with Rafal Lukawiecki, an industry veteran who has been practicing ML, data mining, statistical learning, and data science with his customers for well over a decade, and who studied artificial intelligence at Imperial College in the '90s under the guidance of the leaders and the inventors of this area of industry and science.

Customer testimonials

What did our students say about Rafal's last course?

"Rafal is an incredibly skilled communicator, and the course presentations were consistently among the best I have experienced in any training."
"Rafal is incredibly skilled, well prepared, and friendly/helpful."
"Fantastic instructor! I am eagerly looking forward to more of his courses."

Audience and Prerequisites

Having prior knowledge of programming in any language is helpful when attending this course.

However, based on Rafal Lukawiecki's experience of having taught over 900 data scientists, complete novices who are prepared to work harder can use this week as their introduction to programming.

About Rafal Lukawiecki

As Data Scientist at Project Botticelli Ltd, Rafal focuses on making advanced analytics and artificial intelligence easy and useful for his clients.

He can help you find valuable, meaningful patterns and statistically valid correlations using data mining and machine learning in data sets both big and small. Rafal is also known for his work in business intelligence, data protection, enterprise architecture, and solution delivery. While the majority of his clients come from consumer and corporate finance, entertainment, healthcare, IT, retail, and the public sector, Rafal has worked in almost all industries.

He has been a popular speaker at major IT conferences since 1998.

Course content - detailed description

Today, R is the most powerful language explicitly designed for advanced analytics, statistical learning, data science, and, of course, cutting-edge general-purpose machine learning. While Python is more popular as a universal programming language, and also widely used for image and text analysis using deep learning, R is a clear leader in data science.

R is very well suited for advanced ML on classical data sets, that is, those that are not purely images or natural language. Even though such data might come from a data lake, typically you will find plenty of it in a data warehouse or a relational database, or you can acquire it from files generated by transactional business applications, or from devices such as healthcare equipment, point-of-sale devices, or manufacturing and transportation machinery. Above all, R is great for exploratory analysis of data and it can help you draw meaningful conclusions from real-world experiments, such as A/B marketing tests or product trials.

Microsoft Machine Learning Server and Microsoft SQL Server 2019/2017 Machine Learning Services support both R and Python in a number of proprietary, high-performance, scalable, enterprise-ready, easy-to-use packages and libraries, notably RevoScale and MicrosoftML. You will learn how to use them during this course. You will also learn how to do almost everything using the most popular algorithms provided by open source R packages and functions, such as rpart, kmeansruns, fpc, cluster, clusplot, ts, xts, e1071, caret, glm, and for extra help rattle, qdapTools, MLmetrics, and miscTools.

You will learn how to prepare and visualise data both by using open source packages, mainly dplyr and ggplot2, and other parts of the tidyverse meta-package, like readr, readxl, and lubridate, and how to do it more directly in SQL Server, benefitting from its performance and scalability. We will even combine the power of R with Power BI, to create informative visualisations that are otherwise impossible to do in Power BI alone.

While learning about the data science process and hypothesis testing, you will discover that some complex business questions can be answered using simpler, statistical techniques, such as tests of significant differences between sets of data, or visualisations like notched box plots. We will refresh your knowledge of rudimentary statistical concepts that are necessary for machine learning and data science, like knowing the difference between ordinal, interval and ratio data, and thus why it is not meaningful to calculate a mean star rating, while a median is possible. A little time has been allocated for the discussion of p-values, confidence intervals, and the differences between Bayesian and frequentist interpretations of your results. Bear in mind that this is not a course about statistics, but a little working knowledge is a must in our industry, and it will make the rest of the course easier to follow.
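The point about star ratings can be illustrated with a minimal base R sketch. The ratings below are made up for illustration, and the "middle of the sorted codes" median is just one simple way to do it, not necessarily the approach taught in the course.

```r
# Hypothetical star ratings (1 to 5), stored as an ordered factor because
# the gaps between stars are not guaranteed to be equal.
stars <- factor(c(5, 3, 4, 1, 5, 2, 4, 4, 3), levels = 1:5, ordered = TRUE)

# Base R declines to average a factor: mean() warns and returns NA.
mean_rating <- suppressWarnings(mean(stars))

# A median only needs an ordering: sort the level codes and take the middle
# one (for an even number of ratings this picks the lower of the two middles).
codes <- sort(as.integer(stars))
median_rating <- levels(stars)[codes[ceiling(length(codes) / 2)]]

print(mean_rating)
print(median_rating)
```

With these nine ratings the mean is reported as NA, while the median comes out as the level "4".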

Early in the course, you will learn all the fundamentals of machine learning (prior knowledge is not necessary): data preparation and structure, algorithm classes and their applications, and model evaluation and validation, including all the common performance metrics such as precision and recall. At the heart of this course, however, you will gain an intimate understanding of how some of the most important algorithms work and how to prepare data to make the algorithms give you the most they can.

Starting with clustering, you will learn about k-means, k-medians, spherical k-means and expectation-maximisation. You will find out how to prepare non-numerical and even some numerical data for these algorithms using popular R functions such as mtabulate. Other than using clustering for segmentation, we will also study its use for anomaly detection. We will expand on that subject using other, specialised techniques, such as a One-Class SVM and PCA-Based Anomaly Detection, permitting you to predict anomalies, such as fraud.
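As a rough illustration of using clustering for anomaly detection, the sketch below runs base R's kmeans on a tiny made-up data set and flags the point that lies farthest from the centre of its own cluster. The data, the choice of two clusters, and the "largest distance" rule are all assumptions made for this example.

```r
# Hypothetical toy data: two tight groups plus one point far from both.
x <- rbind(
  c(0, 0), c(0, 1), c(1, 0), c(1, 1),
  c(10, 10), c(10, 11), c(11, 10), c(11, 11),
  c(5, 20)
)

set.seed(42)  # k-means starts from random centres, so fix the seed
km <- kmeans(x, centers = 2, nstart = 25)

# Euclidean distance from every point to the centre of its own cluster.
assigned_centres <- km$centers[km$cluster, , drop = FALSE]
dist_to_centre <- sqrt(rowSums((x - assigned_centres)^2))

# The most distant point is the strongest anomaly candidate (row 9 here).
suspect <- which.max(dist_to_centre)

print(km$size)
print(suspect)
```

In practice you would scale the inputs first and choose a distance threshold from the data rather than simply taking the single largest value.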

We dedicate a full day to building classifiers. You will understand the differences between the most important decision tree algorithms: plain, forests and boosting, and you will study both simpler and more complex neural networks, and how they relate to regressions. We will also cover the widely used logistic regression algorithm, which is a classifier. Later in the course you will meet the large family of regression techniques, starting with classic linear regression, through generalised linear models, to non-linear ML regressions. We will also have some time to cover the remaining big applications of machine and statistical learning, notably forecasting with time series, and, briefly, recommendation engines.
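To show why logistic regression counts as a classifier despite its name, here is a small sketch using base R's glm on the built-in mtcars data set. The choice of data set, the single predictor, and the 0.5 threshold are assumptions made purely for illustration.

```r
# mtcars ships with R; am is 1 for manual and 0 for automatic transmission.
model <- glm(am ~ wt, data = mtcars, family = binomial)

# The fitted values are probabilities, obtained by passing a linear
# combination of the inputs through the logistic function.
prob_manual <- predict(model, type = "response")

# Turning a probability into a class needs a threshold; 0.5 is an assumption.
predicted_class <- ifelse(prob_manual > 0.5, 1, 0)

print(coef(model))
print(table(actual = mtcars$am, predicted = predicted_class))
```

The model itself is a regression on the log-odds; it becomes a classifier only once a threshold is applied, and that threshold should be chosen to match the business cost of each kind of error.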

When deploying models to production, the benefits of using ML Server and SQL ML Services will impress. After seeing how to do it using open source R, we will culminate with an extremely fast in-database deployment using the T-SQL PREDICT statement, and the related real-time sp_rxPredict, which returns predictions on a nanosecond scale! You will also see how to deploy your models using web services, interacting via Azure if needed. Please note, however, that this course does not focus on Azure ML, even though we will briefly discuss how to combine those technologies (please also see our other course by Rafal that focuses on Azure ML).

Every day we will work using the free RStudio, the most popular R IDE and the Microsoft-recommended environment for building R applications on top of SQL and ML Servers. All of our work will follow the modern principles of reproducible research: you will learn how to set up notebooks, manage packages and their dependencies, including versioning, using snapshots, how to save your work, how to manage change using Git, and how to collaborate. At the end of the course you will keep your own R notebook containing almost 1000 lines of code and results! You are also welcome to keep all data sets that you use during the course labs and tutorials. You will notice that throughout the week you understand and write better and more advanced R, whilst experiencing, first-hand, many of its real-world applications.

Model validity is the most important aspect of any machine learning project. A lot of time has been dedicated to explaining it in detail, covering many validity metrics, such as precision, recall, AUC, F1 score, and accuracy (which is rarely a good metric), and the many charts we use to analyse models, especially the confusion matrix, lift/gain charts, ROC curve, precision-recall curve, profit and cost charts, and calibration charts, as well as scatter plots and others used for regression evaluation, like histograms of residuals, the QQ-Norm plot of residuals, scale-location, Cook's distance and many others. You will learn how to create those plots using R, and with the help of other tools. At the end of the course you will know when you can trust your models, and you will be able to explain your work to others, especially your project sponsors, who rarely are machine learning experts.
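The remark that accuracy is rarely a good metric can be made concrete with a small base R sketch. The labels and predictions below are invented for illustration: an imbalanced test set in which the model misses most of the rare positive cases.

```r
# Hypothetical imbalanced test set: 10 positives (e.g. fraud), 90 negatives.
actual    <- c(rep(1, 10), rep(0, 90))
# Hypothetical model output: finds 4 of the 10 positives, raises 2 false alarms.
predicted <- c(rep(1, 4), rep(0, 6), rep(1, 2), rep(0, 88))

# The four cells of the confusion matrix.
tp <- sum(predicted == 1 & actual == 1)
fp <- sum(predicted == 1 & actual == 0)
fn <- sum(predicted == 0 & actual == 1)
tn <- sum(predicted == 0 & actual == 0)

accuracy  <- (tp + tn) / length(actual)
precision <- tp / (tp + fp)   # of the cases flagged, how many were right
recall    <- tp / (tp + fn)   # of the real positives, how many were found
f1        <- 2 * precision * recall / (precision + recall)

print(table(actual = actual, predicted = predicted))
print(round(c(accuracy = accuracy, precision = precision,
              recall = recall, f1 = f1), 3))
```

Here accuracy is 0.92 even though recall is only 0.4, and a model that never flagged anything would still reach 0.90 accuracy on this data, which is why precision, recall and F1 are usually more informative for rare events.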

Above all, you will be studying and learning from the experience of a recognised industry expert, Rafal Lukawiecki, who has been practicing machine learning, data science and AI for well over a decade on many successful commercial projects.