Apache Spark™ for Machine Learning and Data Science

This course is primarily for data scientists but is directly applicable to analysts, architects, software engineers, and technical managers interested in a thorough, hands-on overview of Apache Spark and its applications to Machine Learning.
The course covers the fundamentals of Apache Spark, including Spark’s architecture and internals, the core APIs for using Spark, SQL and other high-level data access tools, and Spark’s streaming capabilities, with a heavy focus on Spark’s machine learning APIs. The class is a mixture of lecture and hands-on labs.

Audience

Data scientists, analysts, architects, software engineers, and technical managers with experience in machine learning who want to adapt traditional machine learning tasks to run at scale using Apache Spark.

Prerequisites

Some familiarity with Apache Spark is helpful but not required.

Some familiarity with Machine Learning and Data Science concepts is highly recommended but not required.

Basic programming experience in an object-oriented or functional language is required.

Learning Objectives

After taking this class, students will be able to:

Use the core Spark APIs to operate on data

Articulate and implement typical use cases for Spark

Build data pipelines and query large data sets using Spark SQL and DataFrames

Analyze Spark jobs using the administration UIs inside Databricks

Create Structured Streaming jobs

Understand the basics of Spark’s internals

Work with graph-structured (relationship) data using the GraphFrames APIs

Understand how a Machine Learning pipeline works

Use various ML algorithms to perform clustering, regression and classification tasks.

Train & export ML models

Train models with third-party libraries such as scikit-learn

Create and transform DataFrames to query large datasets.

Improve performance through judicious use of caching and applying best practices.

Visualize how jobs are broken into stages and tasks and executed within Spark.

Troubleshoot errors and program crashes using Spark UI, executor logs, driver stack traces, and local-mode runtimes.

Course duration

3 Days

Course outline

Spark Overview

In-depth discussion of Spark SQL and DataFrames, including:

The DataFrames/Datasets API
Spark SQL
Data Aggregation
Column Operations
The Functions API: date/time, string manipulation, aggregation
Caching and caching storage levels
Use of the Spark UI to analyze behavior and performance

Overview of Spark internals

Cluster Architecture
How Spark schedules and executes jobs and tasks
Shuffling, shuffle files, and performance
The Catalyst query optimizer

An in-depth overview of Spark’s MLlib Pipeline API for Machine Learning

Build machine learning pipelines for both supervised and unsupervised learning
Transformer/Estimator/Pipeline API
Use transformers to perform pre-processing on a dataset prior to training
Train analytical models with Spark ML’s DataFrame-based estimators, including Linear Regression, Logistic Regression, Decision Trees, Random Forests, Gradient-Boosted Trees, K-Means, Alternating Least Squares, and Neural Nets
Tune hyperparameters via cross-validation and grid search
Evaluate model performance

Spark-sklearn

Distributing single-node algorithms (such as those in scikit-learn) with Spark
Data partitioning concerns

Spark Structured Streaming

Sources and sinks
Structured Streaming APIs
Windowing & Aggregation
Checkpointing & Watermarking
Reliability and Fault Tolerance

Graph processing with GraphFrames

Transforming DataFrames into a graph
Perform graph analysis, including Label Propagation, PageRank, and ShortestPaths

Please contact your training representative for more details on having this course delivered onsite or online.

© Training Outlines All rights reserved