Python for Data Science

In the information age, data is all around us. Within this data are answers to compelling questions across many societal domains (politics, business, science, etc.). But if you had access to a large dataset, would you be able to find the answers you seek?

In this course, you’ll learn how to use:

Python
Jupyter notebooks
pandas
NumPy
Matplotlib
Git
and many other tools.

You will learn all of these tools within the context of solving compelling data science problems. After completing this course, you’ll be able to find answers within large datasets by using Python tools to import data, explore it, analyze it, learn from it, visualize it, and ultimately generate easily shareable reports.

By learning these skills, you’ll also become a member of a worldwide community that builds data science tools, explores public datasets, and discusses evidence-based findings.

Learning Objectives

Basic process of data science
Python and Jupyter notebooks
An applied understanding of how to manipulate and analyze uncurated datasets
Basic statistical analysis and machine learning methods
How to effectively visualize results

Course duration

3 Days

Course outline

Base Python Introduction

History and current use
- Installing the Software
- Python Distributions
String Literals and numeric objects
Collections (lists, tuples, dicts)
Datetime classes in Python
Memory Management in Python
Control Flow
Functions
Exception Handling
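
As a taste of this module, here is a minimal sketch that combines collections, datetime classes, control flow, functions, and exception handling using only the standard library. The record layout, the sample names, and the parse_visit helper are hypothetical and chosen purely for illustration.

```python
from datetime import datetime


def parse_visit(record):
    """Convert one raw (name, date-string) tuple into a dict with a real date."""
    name, raw_date = record
    try:
        visit_date = datetime.strptime(raw_date, "%Y-%m-%d").date()
    except ValueError:
        # A malformed date is stored as None instead of stopping the program.
        visit_date = None
    return {"name": name, "visit_date": visit_date}


# Hypothetical sample records; the second date is deliberately invalid (month 13).
raw_records = [("Ada", "2021-03-14"), ("Grace", "2021-13-01"), ("Alan", "2021-06-23")]

parsed = [parse_visit(r) for r in raw_records]
valid = [p for p in parsed if p["visit_date"] is not None]

# Subtracting two dates gives a timedelta; .days is the gap in whole days.
span_days = (valid[-1]["visit_date"] - valid[0]["visit_date"]).days
print(len(valid), span_days)
```

Catching only ValueError keeps other, unexpected errors visible rather than silently swallowing them.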

Defining actionable, analytic questions

Defining the quantitative construct to make inference on the question
Identifying the data needed to support the constructs
Identifying limitations to the data and analytic approach
Constructing Sensitivity analyses

Bringing Data In

Structured Data
- Structured Text Files
- Excel workbooks
- SQL databases
Working with Unstructured Text Data
- Reading Unstructured Text
- Introduction to Natural Language Processing with Python

NumPy: Matrix Language

Introduction to the ndarray
NumPy operations
Broadcasting
Missing data in NumPy (masked array)
NumPy Structured arrays
Random number generation
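
To illustrate three of the topics above (broadcasting, masked arrays, and random number generation), here is a minimal sketch. The sample values and the use of -999.0 as a missing-value code are assumptions made for the example, not part of the course material.

```python
import numpy as np

# Hypothetical data: 3 observations (rows) of 2 variables (columns).
x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Broadcasting: the length-2 vector of column means is applied to every row.
col_means = x.mean(axis=0)
centered = x - col_means

# Masked array: treat -999.0 as a missing-value code and exclude it from the mean.
readings = np.array([4.0, -999.0, 6.0])
masked = np.ma.masked_equal(readings, -999.0)
masked_mean = masked.mean()

# Random number generation: a seeded Generator gives reproducible draws.
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=0.0, scale=1.0, size=5)

print(centered)
print(masked_mean)
```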

Data Preparation with Pandas

Filtering
Creating and deleting variables
Discretization of Continuous Data
Scaling and standardizing data
Identifying Duplicates
Dummy Coding
Combining Datasets
Transposing Data
Long to wide and back
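
Several of the steps listed above can be sketched in a few lines of pandas. This is an illustrative example under assumed data: the column names, bin edges, and labels are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: one row per (id, year) pair.
df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "year": [2020, 2021, 2020, 2021],
    "score": [55.0, 70.0, 82.0, 91.0],
    "group": ["a", "a", "b", "b"],
})

# Filtering: keep only the rows with a score above 60.
high_scores = df[df["score"] > 60]

# Discretization: cut a continuous score into three labeled bands.
df["band"] = pd.cut(df["score"], bins=[0, 60, 80, 100], labels=["low", "mid", "high"])

# Dummy coding: one indicator column per category level.
dummies = pd.get_dummies(df["group"], prefix="group")

# Long to wide: one row per id, one column per year.
wide = df.pivot(index="id", columns="year", values="score")

# Wide back to long: melt the year columns into rows again.
long_again = wide.reset_index().melt(id_vars="id", var_name="year", value_name="score")

print(wide)
```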

Exploratory Data Analysis with Pandas

Univariate Statistical Summaries and Detecting Outliers
Multivariate Statistical Summaries and Outlier Detection
Group-wise calculations using Pandas
Pivot Tables

Exploring Data graphically

Histogram
Box-and-whiskers plot
Scatter plots
Forest Plots
Group-by plotting

Advanced Graphing with Matplotlib, Pandas, and Seaborn

Python, Hadoop and Spark

Introduction to the differences between Python, Hadoop, and Spark
Importing data from Spark and Hadoop to Python
Parallel execution leveraging Spark or Hadoop

Missing Data

Exploring and understanding patterns in missing data
Missing at Random
Missing Not at Random
Missing Completely at Random
Data imputation methods

Traditional Inferential Statistics

Comparing Groups
- P-Values, summary statistics, sufficient statistics, inferential targets
- T-Tests (equal and unequal variances)
- ANOVA
- Chi-Square Tests
Correlation

Frequentist Approaches to Multivariate Statistics

Linear Regression
- Multivariate linear regression
- Capturing Non-linear Relationships
- Comparing Model Fits
- Scoring new data
- Poisson Regression Extension
Logistic regression
- Logistic Regression Example
- Classification Metrics

Machine learning approaches to multivariate statistics

Machine Learning Theory
Data pre-processing
- Missing Data
- Dummy Coding
- Standardization
- Training/Test data
Supervised Versus Unsupervised Learning
Unsupervised Learning: Clustering
- Clustering Algorithms
- Evaluating Cluster Performance
Dimensionality Reduction
- A-priori
- Principal Components Analysis
- Penalized Regression

Supervised Learning: Regression

Linear Regression
Penalized Linear Regression
Stochastic Gradient Descent
Scoring New Data Sets
Cross Validation
Bias-Variance Tradeoff
Feature Importance

Supervised Learning: Classification

Logistic Regression
LASSO
Random Forest
Ensemble Methods
Feature Importance
Scoring New Data Sets
Cross Validation

Please contact your training representative for more details on having this course delivered onsite or online.
