Machine Learning and Data Science in R on Microsoft ML and SQL Servers - with Rafal Lukawiecki
A very intensive, hands-on, online, 5-day course designed for those who want to learn machine learning and data science in greater depth using R. As you study the free, open source R, you will also learn how to easily make R fast, scalable, and enterprise-ready with Microsoft ML Server, SQL Server ML Services, and RStudio. You will learn how to prepare and visualise data in R; how to build, evaluate, and validate models; and how to deploy them to production in-database, in-app, and as a web service. You will study, in detail, the most important classification, clustering, regression, and forecasting algorithms. You will write your own code for analysing your results and for drawing statistically meaningful conclusions, and you will understand the limits of the confidence that you can place in that approach. Prior knowledge of programming in any language is helpful. However, if you are prepared to do a little extra homework, Rafal will help even complete novices get their introduction to programming.
You will learn
- Building and deploying machine learning models using the open source R programming language, including data preparation, visualisation, and stringent model validation.
- High-performance ML using the newest version of Microsoft ML Server and SQL Server 2019 with R and RStudio.
- Deployment to production with nanosecond-scale performance.
- Successful data science project formulation and delivery.
Why attend this class?
Because of Rafal’s 10+ years of real-world machine learning experience.
You will learn all the concepts and tools that you need to know from an experienced teacher who has trained over 900 data scientists worldwide, from a highly respected presenter capable of holding your attention, and, above all, from a practitioner of machine learning. Rafal Lukawiecki has been delivering ML, data mining, and data science projects for customers in the retail, banking, entertainment, healthcare, manufacturing, education, and government sectors for twelve years. Because of that, you will learn:
- everything essential to starting data science, ML, and AI projects
- all fundamental concepts
- how to avoid common pitfalls
- how to work fast yet accurately
- what is really useful and practical
- what is more theoretical but still important
- what hype you should be wary of
You will be able to ask any questions related to your industry, and you will get relevant, pragmatic, no-nonsense answers that help you get ahead with your own projects. Learn from Rafal, who has done it all, not from those who just teach it: this is why it is Practical Machine Learning.
What did our students say about Rafal's last course?
"Rafal is an incredibly skilled communicator, and the course presentations were consistently among the best I have experienced in any training."
"Rafal is incredibly skilled, well prepared, and friendly and helpful."
"Fantastic instructor! I am eagerly looking forward to more of his courses."
Analysts, budding and current data scientists, data engineers, DBAs, BI developers, programmers, power users, predictive modellers, forecasters, consultants, AI engineers, and anyone interested in using ML for AI.
General ability to work with data in any form: using spreadsheets, tables, or databases. Prior knowledge of any programming language is helpful. However, if you are prepared to work harder by asking Rafal questions and doing a little additional homework during the week, you can use this course to learn R as your very first programming language.
This course will teach you machine learning and data science using R and Microsoft technologies: you do not need to know them before attending.
Format, hours and delivery components
50% lectures, 25% demos, 25% lab tutorials.
There are 4 delivery components included in this course format:
- 5 half-day live online lectures by Rafal Lukawiecki, with everyone participating, between 15:00 and 18:30 CET. Each session will consist of a lecture, live demos, and plenty of time to answer any questions.
- Your own work, taking approximately 2–3 hours, to complete the labs and assignments, which you are expected to do before the next half-day lecture starts. We will provide you with the necessary data and files and, if needed, Azure VM images that contain a full set-up of all the necessary software, which you are expected to run using your own Azure account (a free trial is acceptable).
- Small-group (2–3 students) 50-minute online tutoring sessions with Rafal to review the lab work, to provide course assistance, and to answer any additional questions. These sessions will take place outside of the lecture hours and will match the European or American time zones, as needed. Every student will have an opportunity to participate in 2–3 of those tutoring sessions during the week, and we will be flexible in offering additional one-to-one support for anyone struggling with any aspect of the learning process. We want everyone to succeed!
- Students will be able, and encouraged, to work in groups of 2–3 while completing the labs and assignments.
About Rafal Lukawiecki
As Data Scientist at Project Botticelli Ltd, Rafal focuses on making advanced analytics and artificial intelligence easy and useful for his clients.
He can help you find valuable, meaningful patterns and statistically valid correlations using data mining and machine learning in data sets both big and small. Rafal is also known for his work in business intelligence, data protection, enterprise architecture, and solution delivery. While the majority of his clients come from consumer and corporate finance, entertainment, healthcare, IT, retail, and the public sector, Rafal has worked in almost all industries.
He has been a popular speaker at major IT conferences since 1998.
Course content - detailed description
Above all, this course will teach you modern R: currently, the most powerful language explicitly designed for advanced analytics, statistical learning, data science, and cutting-edge general-purpose machine learning. While Python is more popular as a universal programming language, and is also widely used for image and text analysis using deep learning, R is a clear leader in data science. You will learn how to do machine learning in R, especially on the classical data sets that you often encounter in business use. Even though such data might come from a data lake, typically you will find plenty of it in a data warehouse or a relational database, or you can acquire it from transactional business application files, or from devices such as healthcare equipment, point-of-sale devices, or manufacturing and transportation machinery. R is also great for exploratory analysis of data, and it can help you draw meaningful conclusions from real-world experiments, such as A/B marketing tests or product trials. This course will teach you the foundations of hypothesis testing so that you can draw such conclusions with a high degree of confidence.
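To give a flavour of the kind of conclusion an A/B test supports, here is a minimal sketch in base R. It is an illustration only, not taken from the course materials: the counts are made up, and the two-sample test of proportions (`prop.test` from the built-in stats package) is just one reasonable choice of test.

```r
# Hypothetical A/B test results; all counts are made up for illustration.
conversions <- c(A = 120, B = 150)    # customers who bought
visitors    <- c(A = 1000, B = 1000)  # customers who saw each variant

# Two-sample test of equal proportions.
# Null hypothesis: both variants have the same conversion rate.
result <- prop.test(x = conversions, n = visitors)

# Observed conversion rates, the p-value, and a 95% confidence interval
# for the difference between the two rates.
print(result$estimate)
print(result$p.value)
print(result$conf.int)
```

A small p-value suggests that the difference in conversion rates is unlikely to be due to chance alone; the confidence interval shows how large that difference might plausibly be.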
Microsoft Machine Learning Server and Microsoft SQL Server 2019/2017 Machine Learning Services support both R and Python through a number of proprietary, high-performance, scalable, enterprise-ready, easy-to-use packages and libraries, notably RevoScaleR and MicrosoftML. You will learn how to use them during this course. You will also learn how to do almost everything using the most popular algorithms provided by open source R packages and functions, such as rpart, kmeansruns, fpc, cluster, clusplot, ts, xts, e1071, caret, glm, and, for extra help, rattle, qdapTools, MLmetrics, and miscTools.
You will learn how to prepare and visualise data both by using open source packages, mainly dplyr and ggplot2, and other parts of the tidyverse meta-package, like readr, readxl, and lubridate, and how to do it more directly in SQL Server, benefiting from its performance and scalability. We will even combine the power of R with Power BI to create informative visualisations that are otherwise impossible to do in Power BI alone.
While learning about the data science process and hypothesis testing, you will discover that some complex business questions can be answered using simpler, statistical techniques, such as tests of significant differences between sets of data, or visualisations like notched box plots. We will refresh your knowledge of the rudimentary statistical concepts that are necessary for machine learning and data science, like knowing the difference between ordinal, interval, and ratio data, and thus why it does not make sense to calculate a mean star rating, while a median is possible. A little time has been allocated for the discussion of p-values, confidence intervals, and the differences between the Bayesian and frequentist interpretations of your results. Bear in mind that this is not a course about statistics, but a little working knowledge is a must in our industry, and it makes the rest of the course easier to follow.
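The star rating point can be sketched in a few lines of base R. This is an illustrative example with made-up ratings, not course material: storing the ratings as an ordered factor tells R that they are ordinal, and R will then decline to compute a mean (`mean()` returns `NA` with a warning for a factor), while a median can still be read off from the ordering.

```r
# Hypothetical star ratings (1 to 5) for a product; the values are made up.
ratings <- c(5, 4, 4, 1, 5, 3, 5, 2, 4)

# Store them as an ordered factor so R treats them as ordinal, not numeric.
stars <- factor(ratings, levels = 1:5, ordered = TRUE)

# The median only needs the ordering of the levels, so it is well defined.
# With an odd number of ratings it is simply the middle value after sorting.
median_position <- sort(as.integer(stars))[(length(stars) + 1) / 2]
median_star <- levels(stars)[median_position]

# A frequency table is another summary that is always valid for ordinal data.
star_counts <- table(stars)

print(median_star)
print(star_counts)
```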
Early in the course, you will learn all the fundamentals of machine learning—no prior knowledge is necessary. You will study: data preparation and relevant structures, algorithm classes and their applications, model evaluation and validation, including all the common performance metrics such as precision and recall. At the heart of this course, however, you will gain an intimate understanding of how some of the most important algorithms work and how to prepare data to make the algorithms give you the most they can.
Starting with clustering, you will learn about k-means, k-medians, spherical k-means, and expectation-maximisation. You will find out how to prepare non-numerical and even some numerical data for these algorithms using popular R functions such as mtabulate. Other than using clustering for segmentation, we will also study its use for anomaly detection. We will expand on that subject using other, specialised techniques, such as a One Class SVM and PCA-Based Anomaly Detection, permitting you to predict anomalies, such as fraud.
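As a rough sketch of what preparing non-numerical data for k-means can look like, the example below one-hot encodes a categorical column and clusters the result. Note the assumptions: it uses base R's `model.matrix` in place of the `mtabulate` function mentioned above, the customer data is made up, and the choice of two clusters is arbitrary.

```r
set.seed(42)  # k-means starts from random centres, so fix the seed

# Hypothetical customer data with one numeric and one categorical column.
customers <- data.frame(
  spend   = c(120, 135, 110, 900, 950, 870, 125, 910),
  channel = factor(c("web", "web", "store", "store",
                     "store", "web", "web", "store"))
)

# One-hot encode the categorical column; "- 1" drops the intercept so that
# every level of the factor gets its own 0/1 indicator column.
channel_dummies <- model.matrix(~ channel - 1, data = customers)

# Scale the numeric column so that it does not dominate the distances.
features <- cbind(spend = as.numeric(scale(customers$spend)), channel_dummies)

# Fit k-means with two clusters and several random restarts.
fit <- kmeans(features, centers = 2, nstart = 10)

# Cluster sizes and the cluster assigned to each customer.
print(fit$size)
print(fit$cluster)
```

Scaling matters here: without it, the spend column, measured in hundreds, would swamp the 0/1 indicator columns when distances are calculated.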
We dedicate a full day to building classifiers. You will understand the differences between the most important decision tree algorithms (plain, forests, and boosting), and you will study both simpler and more complex neural networks, and how they relate to regressions. We will also cover the widely used logistic regression algorithm, which, despite its name, is a classifier. Later in the course you will meet the large family of regression techniques, starting with classic linear regression, through GLM, the generalised linear model, to non-linear ML regressions. We will also have some time to cover the remaining big applications of machine and statistical learning, notably forecasting with time series, and, briefly, recommendation engines.
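The following minimal sketch shows why logistic regression is a classifier in practice. It is an illustration under stated assumptions, not the course's own lab: it uses the built-in `mtcars` data set, a single predictor chosen for simplicity, and a 0.5 decision threshold that would normally be tuned to the costs of each kind of error.

```r
# Built-in mtcars data: predict whether a car has a manual gearbox (am = 1)
# from its weight. The choice of data set and predictor is illustrative only.
model <- glm(am ~ wt, data = mtcars, family = binomial)

# A logistic regression outputs probabilities between 0 and 1 ...
prob_manual <- predict(model, newdata = mtcars, type = "response")

# ... and becomes a classifier once we apply a decision threshold.
# The 0.5 threshold is an assumption; in practice it is tuned to costs.
predicted_class <- ifelse(prob_manual >= 0.5, 1, 0)

# Compare the predictions with the actual classes.
confusion <- table(actual = mtcars$am, predicted = predicted_class)
print(confusion)
```

For brevity the model is evaluated on the same data it was trained on, which overstates its quality; a proper validation would use held-out data.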
When deploying models to production, the benefits of using ML Server and SQL ML Services will impress. After seeing how to do it using open source R, we will culminate with an extremely fast in-database deployment using the T-SQL PREDICT statement, and the related real-time sp_rxPredict, which returns predictions on a nanosecond scale! You will also see how to deploy your models using web services, interacting via Azure if needed. Please note, however, that this course does not focus on Azure ML, even though we will briefly discuss how to combine those technologies (please also see our other course by Rafal that focuses on Azure ML).
Every day we will work using RStudio, the most popular, and free, R IDE, which is recommended by Microsoft for building R applications on top of SQL and ML Servers. All of our work will follow the modern principles of reproducible research: you will learn how to set up notebooks, how to manage packages and their dependencies, including versioning, using snapshots, how to save your work, how to manage change using Git, and how to collaborate. At the end of the course you will keep your own R notebook containing almost 1000 lines of code and results! You are also welcome to keep all data sets that you use during the course labs and tutorials. You will notice that throughout the week you understand and write better and more advanced R, whilst experiencing, first-hand, many of its real-world applications.
Model validity is the most important aspect of any machine learning project. A lot of time has been dedicated to explaining it in detail: the many validity metrics, such as precision, recall, AUC, F1 score, and accuracy (which is rarely a good metric), and the many charts we use to analyse models, especially the confusion matrix, lift/gain charts, ROC curve, precision-recall curve, profit and cost charts, calibration charts, and scatter plots, as well as others used for regression evaluation, like histograms of residuals, the normal Q-Q plot of residuals, scale-location, Cook’s distance, and many more. You will learn how to create those plots using R, and with the help of other tools. At the end of the course you will know when you can trust your models, and you will be able to explain your work to others, especially your project sponsors, who are rarely machine learning experts.
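To see why accuracy is rarely a good metric, consider this small base R sketch. Everything in it is hypothetical: the labels describe an imaginary imbalanced fraud problem, and the predictions come from an imaginary model. The metrics themselves follow their standard definitions.

```r
# Hypothetical labels for an imbalanced problem: 1 = fraud, 0 = legitimate.
actual <- c(rep(1, 10), rep(0, 90))

# A hypothetical model that finds 6 of the 10 frauds and raises 4 false alarms.
predicted <- c(rep(1, 6), rep(0, 4), rep(1, 4), rep(0, 86))

# Cells of the confusion matrix.
tp <- sum(predicted == 1 & actual == 1)  # true positives
fp <- sum(predicted == 1 & actual == 0)  # false positives
fn <- sum(predicted == 0 & actual == 1)  # false negatives
tn <- sum(predicted == 0 & actual == 0)  # true negatives

precision <- tp / (tp + fp)  # how many flagged cases were really fraud
recall    <- tp / (tp + fn)  # how many frauds were actually found
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- (tp + tn) / length(actual)

# A useless baseline that never predicts fraud, for comparison.
baseline_accuracy <- mean(actual == 0)

print(c(precision = precision, recall = recall, f1 = f1,
        accuracy = accuracy, baseline_accuracy = baseline_accuracy))
```

The model's accuracy is only slightly above that of a baseline that never finds any fraud, while precision and recall reveal how many frauds it actually catches and how many of its alerts are real.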
Above all, this course will not only teach you the technology and how to use it, but, much more importantly, you will understand how ML works, how to avoid common mistakes, such as overfitting/overtraining, how to balance model accuracy against its reliability (the bias-variance trade-off), and how to relate key ML performance metrics to your business goals, making your bosses and clients happy with your progress and results. You will gain clarity on how to start your data science projects and how to finish them. You will know how to express the business need in terms of testable hypotheses, which will guide model building and selection. You will understand what types of work are suited to ML, and which are unlikely to deliver results. You will discover what makes good first projects in your own area of specialisation.