Apache Spark™ for Machine Learning and Data Science
Overview

This course is primarily for data scientists but is directly applicable to analysts, architects, software engineers, and technical managers interested in a thorough, hands-on overview of Apache Spark and its applications to Machine Learning.
The course covers the fundamentals of Apache Spark, including its architecture and internals, the core APIs, SQL and other high-level data access tools, and streaming capabilities, with a heavy focus on Spark’s machine learning APIs. The class is a mixture of lecture and hands-on labs.

Audience

Data scientists, analysts, architects, software engineers, and technical managers with experience in machine learning who want to adapt traditional machine learning tasks to run at scale using Apache Spark.

Prerequisites

Some familiarity with Apache Spark is helpful but not required.

Some familiarity with Machine Learning and Data Science concepts is highly recommended but not required.

Basic programming experience in an object-oriented or functional language is required.


Learning Objectives

After taking this class, students will be able to:
  • Use the core Spark APIs to operate on data
  • Articulate and implement typical use cases for Spark
  • Build data pipelines and query large data sets using Spark SQL and DataFrames
  • Analyze Spark jobs using the administration UIs inside Databricks
  • Create Structured Streaming jobs
  • Understand the basics of Spark’s internals
  • Work with graph-structured (relationship) data using the GraphFrames APIs
  • Understand how a Machine Learning pipeline works
  • Use various ML algorithms to perform clustering, regression, and classification tasks
  • Train and export ML models
  • Train models with third-party libraries such as scikit-learn
  • Create and transform DataFrames to query large datasets
  • Improve performance through judicious use of caching and by applying best practices
  • Visualize how jobs are broken into stages and tasks and executed within Spark
  • Troubleshoot errors and program crashes using the Spark UI, executor logs, driver stack traces, and local-mode runtimes
Course duration

3 Days

Course outline

Spark Overview

In-depth discussion of Spark SQL and DataFrames, including:
  • The DataFrames/Datasets API
  • Spark SQL
  • Data Aggregation
  • Column Operations
  • The Functions API: date/time, string manipulation, aggregation
  • Caching and caching storage levels
  • Use of the Spark UI to analyze behavior and performance
Overview of Spark internals
  • Cluster Architecture
  • How Spark schedules and executes jobs and tasks
  • Shuffling, shuffle files, and performance
  • The Catalyst query optimizer
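
To make the shuffling topic above concrete, here is a minimal plain-Python sketch of what a shuffle-based aggregation does conceptually. It is not Spark code: the names `partition_for` and `shuffle_sum` are hypothetical, and `zlib.crc32` merely stands in for whatever partitioner Spark applies. The point is that records with the same key must end up on the same reducer, which is why data has to move between nodes.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioner (crc32 is an illustrative stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_sum(map_partitions, num_reducers):
    """Sum values per key across input partitions via a simulated shuffle."""
    # Map side: pre-aggregate inside each input partition, then write each
    # partial sum into the bucket of the reducer that owns the key
    # (the buckets play the role of shuffle files).
    shuffle_buckets = [defaultdict(list) for _ in range(num_reducers)]
    for partition in map_partitions:
        partial = defaultdict(int)
        for key, value in partition:
            partial[key] += value
        for key, subtotal in partial.items():
            shuffle_buckets[partition_for(key, num_reducers)][key].append(subtotal)
    # Reduce side: each reducer merges the partial sums it received.
    return [{k: sum(v) for k, v in bucket.items()} for bucket in shuffle_buckets]


input_partitions = [
    [("spark", 1), ("sql", 1), ("spark", 1)],
    [("sql", 1), ("mllib", 1)],
    [("spark", 1)],
]
reduced = shuffle_sum(input_partitions, num_reducers=2)
totals = {k: v for part in reduced for k, v in part.items()}
print(totals)
```

Pre-aggregating on the map side keeps the amount of shuffled data small, which is one reason shuffle behavior has such a large effect on job performance.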
An in-depth overview of Spark’s MLlib Pipeline API for Machine Learning
  • Build machine learning pipelines for both supervised and unsupervised learning
  • Transformer/Estimator/Pipeline API
  • Use transformers to perform pre-processing on a dataset prior to training
  • Train analytical models with Spark ML’s DataFrame-based estimators including Linear Regression, Logistic Regression, Decision Trees + Random Forests, Boosted Trees, K-Means, Alternating Least Squares, and Neural Nets
  • Tune hyperparameters via cross-validation and grid search
  • Evaluate model performance
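
The Transformer/Estimator/Pipeline idea listed above can be sketched in plain Python. This is an illustration of the pattern only, not the Spark ML API: every class below is a hypothetical toy, rows are plain dicts rather than DataFrames, and every pipeline stage is assumed to be an estimator for simplicity. An estimator learns from data in `fit` and returns a transformer; a pipeline chains stages so each one trains on the output of the previous one.

```python
from statistics import mean, pstdev


class ScalerModel:
    """Transformer: applies the statistics learned during fit."""
    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma

    def transform(self, rows):
        return [{**r, "x_scaled": (r["x"] - self.mu) / self.sigma} for r in rows]


class Scaler:
    """Estimator: learns mean and standard deviation, returns a transformer."""
    def fit(self, rows):
        xs = [r["x"] for r in rows]
        return ScalerModel(mean(xs), pstdev(xs) or 1.0)


class LinearModel:
    """Transformer: adds a prediction column from a fitted line."""
    def __init__(self, slope, intercept):
        self.slope, self.intercept = slope, intercept

    def transform(self, rows):
        return [{**r, "prediction": self.slope * r["x_scaled"] + self.intercept}
                for r in rows]


class LinearRegression:
    """Estimator: one-feature least squares on the scaled column."""
    def fit(self, rows):
        xs = [r["x_scaled"] for r in rows]
        ys = [r["y"] for r in rows]
        x_bar, y_bar = mean(xs), mean(ys)
        sxx = sum((x - x_bar) ** 2 for x in xs)
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        slope = sxy / sxx
        return LinearModel(slope, y_bar - slope * x_bar)


class PipelineModel:
    """Transformer made of fitted stages applied in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Estimator that fits each stage on the output of the previous one."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for estimator in self.stages:
            model = estimator.fit(rows)
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


train = [{"x": 1.0, "y": 3.0}, {"x": 2.0, "y": 5.0}, {"x": 3.0, "y": 7.0}]
model = Pipeline([Scaler(), LinearRegression()]).fit(train)
predictions = model.transform([{"x": 4.0}])
print(predictions[0]["prediction"])
```

Because the fitted pipeline is itself a transformer, the same pre-processing learned on the training data is applied automatically at prediction time.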
Spark-sklearn
  • How to distribute single-node algorithms (like scikit-learn) with Spark
  • Partitioning data concerns
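
One common way to distribute a single-node algorithm is to run many independent training jobs in parallel, for example one per hyperparameter setting. The sketch below illustrates that idea in plain Python, under stated assumptions: a thread pool stands in for the cluster's workers, and `fit_ridge` and `evaluate` are hypothetical stand-ins for a single-node library such as scikit-learn. It does not use Spark or scikit-learn APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_ridge(train, alpha):
    """Single-node 'algorithm': one-feature ridge regression through the origin."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + alpha)


def evaluate(task):
    """One unit of parallel work: train with one setting, score on held-out data."""
    alpha, train, valid = task
    weight = fit_ridge(train, alpha)
    mse = sum((y - weight * x) ** 2 for x, y in valid) / len(valid)
    return alpha, mse


train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
valid = [(4.0, 8.0), (5.0, 10.0)]
param_grid = [0.0, 0.1, 1.0, 10.0]

# Each task carries its own copy of the data, mirroring the assumption that
# every worker can hold the full training set in memory.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, [(a, train, valid) for a in param_grid]))

best_alpha, best_mse = min(results, key=lambda r: r[1])
print(best_alpha, best_mse)
```

In this style of parallelism the model is not trained on partitioned data; each task needs the whole dataset, so the approach suits small data with a large search space.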
Spark Structured Streaming
  • Sources and sinks
  • Structured Streaming APIs
  • Windowing & Aggregation
  • Checkpointing & Watermarking
  • Reliability and Fault Tolerance
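
The interaction between windowing and watermarking can be illustrated with a small plain-Python sketch. It is a simplification, not the Structured Streaming API: events are handled one at a time rather than in micro-batches, and the function name and parameters are hypothetical. Events are grouped into tumbling event-time windows, and an event is dropped when its window closed before the watermark (the largest event time seen so far minus the allowed lateness).

```python
from collections import defaultdict


def windowed_counts(events, window_size, allowed_lateness):
    """Count events per (window start, key), dropping events that arrive too late."""
    counts = defaultdict(int)
    dropped = []
    max_event_time = float("-inf")
    for event_time, key in events:  # events are listed in arrival order
        watermark = max_event_time - allowed_lateness
        window_start = (event_time // window_size) * window_size
        if window_start + window_size <= watermark:
            # The window ended before the watermark: treat the event as too late.
            dropped.append((event_time, key))
        else:
            counts[(window_start, key)] += 1
        max_event_time = max(max_event_time, event_time)
    return dict(counts), dropped


# (event time in seconds, key); times 3 and 8 arrive out of order.
events = [(1, "a"), (4, "a"), (12, "a"), (3, "a"), (27, "a"), (8, "a")]
counts, dropped = windowed_counts(events, window_size=10, allowed_lateness=10)
print(counts)
print(dropped)
```

The event at time 3 arrives late but is still counted because the watermark has not yet passed its window; the event at time 8 arrives after the watermark has moved to 17 and is discarded. Bounding lateness this way lets a streaming engine discard state for old windows.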
Graph processing with GraphFrames
  • Transform DataFrames into a graph
  • Perform graph analysis, including Label Propagation, PageRank, and ShortestPaths
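
As a rough illustration of the graph topics above, the following plain-Python sketch treats two tables, one of vertices with an `id` field and one of edges with `src` and `dst` fields, as a graph and runs a textbook PageRank over them. It does not call GraphFrames, the `pagerank` function and its parameters are hypothetical, and the normalization may differ from what a given library reports.

```python
def pagerank(vertices, edges, damping=0.85, iterations=50):
    """Textbook PageRank over vertex and edge records; ranks sum to 1."""
    ids = [v["id"] for v in vertices]
    out_links = {i: [] for i in ids}
    for e in edges:
        out_links[e["src"]].append(e["dst"])
    n = len(ids)
    ranks = {i: 1.0 / n for i in ids}
    for _ in range(iterations):
        # Every vertex receives the same baseline share of rank.
        new_ranks = {i: (1.0 - damping) / n for i in ids}
        for src, dsts in out_links.items():
            if dsts:
                share = damping * ranks[src] / len(dsts)
                for dst in dsts:
                    new_ranks[dst] += share
            else:
                # Dangling vertex: spread its rank evenly over all vertices.
                share = damping * ranks[src] / n
                for i in ids:
                    new_ranks[i] += share
        ranks = new_ranks
    return ranks


vertices = [{"id": "a"}, {"id": "b"}, {"id": "c"}, {"id": "d"}]
edges = [
    {"src": "a", "dst": "b"},
    {"src": "b", "dst": "c"},
    {"src": "c", "dst": "a"},
    {"src": "d", "dst": "a"},
]
ranks = pagerank(vertices, edges)
for vertex_id, rank in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(vertex_id, round(rank, 4))
```

Vertex `a` ranks highest because it receives links from both `c` and `d`, while `d` has no incoming links and keeps only the baseline share.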

Please contact your training representative for more details on having this course delivered onsite or online.
