Data Science. Programming

Contents

Slide 2

About myself

Who

Faisal Ahmed

TalTech

Where

Narva College

Research Communication and Software Engineering

Contact

Faisal.Ahmed@ut.ee

Now

What

Slide 3

Data?!

Neighbor's name
A place they consider home
Are they working at a company now?
How many U.S. states have they visited?
Their favorite unhealthy food… ?
Do they have any "Data Science" background? (statistics, machine learning, CS)

Where?

Slide 4

Data!

Neighbor's name
A place they consider home
Are they working at a company now?
How many U.S. states have they visited?
Their favorite unhealthy food… ?
Do they have any "Data Science" background? (statistics, machine learning, CS)

Zachary Dodds

Pittsburgh, PA

Harvey Mudd

Where?

44

mostly CS for me…

M&Ms

Slide 5

Data!

Neighbor's name
A place they consider home
Are they working at a company now?
How many U.S. states have they visited?
Their favorite unhealthy food… ?
Do they have any "Data Science" background? (statistics, machine learning, CS)

Zachary Dodds

Pittsburgh, PA

Harvey Mudd

Where?

44

mostly CS for me…

M&Ms

be sure to set up your login + profile for the submission site…

This class is truly seminar-style: I'm here, as you are, in order to gain insights into this very new field…

Slide 6

Data Science concerns

Is "Data Science" important or just trendy?

Slide 7

Hmmm…

Data Science concerns

Slide 8

the companies are expanding as fast as the data!

Slide 9

There's certainly a lot of it!

Data, data everywhere…

logarithmic scale

1 TB = 1000 GB
1 Petabyte == 1000 TB
1 Exabyte
1 Zettabyte

Data produced each year

2002: 5 EB
2006: 161 EB
2009: 800 EB
2011: 1.8 ZB
2015: 8.0 ZB

Human brain's capacity: 14 PB

100 years of HD video + audio: 60 PB

120 PB

References

(brain) 14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store
(2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm
(2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
(2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf
(2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm
(2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
(life in video) 60 PB: in 4320p resolution, extrapolated from 16MB for 1:21 of 640x480 video (w/sound) – almost certainly a gross overestimate, as sleep can be compressed significantly!
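To compare the figures quoted on this slide, here is a minimal R sketch that converts them to bytes and computes two ratios. It assumes decimal (SI) units throughout, extending the slide's "1 Petabyte == 1000 TB" to exabytes and zettabytes; the variable names are illustrative and not part of the course materials.

```r
# Decimal (SI) storage units, expressed in bytes (assumption: powers of 1000)
TB <- 1e12
PB <- 1000 * TB   # 1 Petabyte == 1000 TB, as on the slide
EB <- 1000 * PB
ZB <- 1000 * EB

# Yearly data-production estimates quoted on the slide
produced <- c("2002" = 5 * EB, "2006" = 161 * EB, "2009" = 800 * EB,
              "2011" = 1.8 * ZB, "2015" = 8 * ZB)

# Overall growth factor between the 2002 and 2015 estimates
growth <- produced[["2015"]] / produced[["2002"]]
growth

# How many 14 PB "brains" would the 2015 estimate fill?
brains <- produced[["2015"]] / (14 * PB)
brains

# On a logarithmic scale the yearly figures span only a few units
log10(produced)
```

The log10() call at the end shows why the slide uses a logarithmic scale: the raw values differ by orders of magnitude, but their logarithms sit close together.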

Slide 10

data

information

knowledge

wisdom

I'd call it data, not information

Slide 11

Big Data?

I agree with this…

Slide 12

Make data easier to use ~ by using it!

It may be true that Data Science isn't a science – but that doesn't mean it's not useful!

Slide 13

IST 380 ~ the big picture

What?

Why?

Data Science Programming

Data Rules

All of our insights – large and small, permanent and ephemeral, natural and artificial – come about through the integration of lots of data.
Data Science simply recognizes that the rules and skills behind those insights are widely applicable…

Slide 14

A few examples…

Make3d

How is this being done?

Andrew Ng ~ Computers and Thought award, 2009

… Data Science is at the heart of computer science

and how do we succeed?

Slide 15

A few examples…

… Data Science is at the heart of computer science

Stanford's Autonomous Vehicles project (Thrun et al.)

Learning to Powerslide

Slide 16

A few examples…

… Data Science is at the heart of computer science

"my summer was finding that red line"

Learning ground from obstacles

Slide 17

A few examples…

Learning ground from obstacles

classification

segmentation

Slide 18

Insights beyond science

Slide 19

Marketing

Slide 20

Visualization

Motivation

Slide 21

Slide 22

Recommender Systems

predicting movie ratings

Slide 23

Netflix Prize

Bob Bell, winner of the "Netflix prize"

(I don't know this guy)

Napoleon Dynamite = 1.22
Batman Begins = .75

Finding Nemo = ??
Lord of the Rings = ??

Some films are difficult to predict…

Slide 24

Bob Bell, winner of the "Netflix prize"

(I don't know this guy)

Napoleon Dynamite = 1.22
Batman Begins = .75

Finding Nemo = .67
Lord of the Rings = .42

Some films are difficult to predict… and others are easier!

Netflix Prize

Slide 25

Why IST 380?

Specific skills:

R statistical environment (and the S programming language)

Experience with several statistical analyses (descriptive statistics)

Experience with predictive statistics (modeling) and machine learning algorithms

Slide 26

Why IST 380?

Specific skills:

Broad background:

You'll be confident and capable with whatever datasets you encounter in the future – on your own or as part of a team.

R statistical environment (and the S programming language)

Experience with several statistical analyses (descriptive statistics)

Experience with predictive statistics (modeling) and machine learning algorithms

Final project ~ open-ended with datasets of your choice

Slide 27

About IST 380 …

Slide 28

Details

Web Page:

http://www.cs.hmc.edu/~dodds/IST380

Assignments, online text, necessary files, lecture slides are linked

First week's assignment: Getting started with R

Programming: R

Textbook

An introduction to Data Science

jsresearch.net/groups/teachdatascience/

www.r-project.org/

Grab both of these now…

freely available online

and many online resources…

Slide 29

Homepage

http://www.cs.hmc.edu/~dodds/IST380/

Go to the course page

Grab R and the text from these two links…

Slide 30

Homework

Assignments

~ 2-5 problems/week ~ 100 points extra credit, often

Due Tuesday of the following week by 11:59 pm.

Assignment 1 due Tuesday, February 5.

1 week + 1 day…

Slide 31

Homework

Working on programs:

On your own or in groups of 2.

Divide the work at the keyboard evenly!

Submitting programs: at the submission website

Today's Lab:

install software

ensure accounts are working

try out R - the first HW is officially due on 2/5

Assignments

~ 2-5 problems/week ~ 100 points extra credit, often

Due Tuesday of the following week by 11:59 pm.

Assignment 1 due Tuesday, February 5.

Slide 32

Outline

Weeks 1-5

using R

descriptive statistics

predictive statistics

probability distributions

Weeks 6-10

"Data Science"

"Machine Learning"

statistical modeling

support vector machines (SVMs)

random forests

k-means algorithm

nearest neighbors (NN)

Weeks 11-15

approximate!

Final Project

No breaks?!

Slide 33

Grading

Grades

Final project

if score >= 0.95: grade = "A"
if score >= 0.90: grade = "A-"
if score >= 0.86: grade = "B+"

the last ~4 weeks will work towards a larger, final project

there will be a short design phase and a short final presentation

I'd encourage you to connect R and our Data Science techniques to other datasets or projects that you use/need/like, etc.

Based on points percentage

~ 800 points for assignments

see the course syllabus for the full list...

~ 400 points for the final project

choose your own problem to study (I'll have some suggestions, too.)
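The points-percentage cutoffs shown on this slide can be sketched as a small R function. This is only an illustration: the function name letter_grade and the example point total are hypothetical, the cutoffs are read as "first match wins", and only the three cutoffs visible on the slide are included (the rest are in the course syllabus).

```r
# Map a points fraction (0 to 1) to a letter grade using only the
# three cutoffs shown on the slide; lower cutoffs are in the syllabus.
letter_grade <- function(score) {
  if (score >= 0.95) {
    "A"
  } else if (score >= 0.90) {
    "A-"
  } else if (score >= 0.86) {
    "B+"
  } else {
    NA_character_  # cutoff not listed on the slide
  }
}

total_points <- 800 + 400   # assignments + final project (both approximate)
earned <- 1100              # hypothetical student total
letter_grade(earned / total_points)
```

Using else-if branches keeps the highest matching grade; a sequence of independent if statements would let a later, lower cutoff overwrite it.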

Slide 34

Academic Honesty

This course operates under CGU's (and all of Claremont Schools') Academic Honesty policies…
Your work must be your own. This must be true for the whole team, if you're working in a pair.
Consulting with others (except team members or myself) is encouraged, but has to be limited to discussion and debugging of problems. Sharing of written, electronic, or verbal solutions/files/code is a violation of CGU’s academic honesty policy.
A reasonable guideline: Work is your own if you could delete all of it and recreate it yourself.

Slide 35

Thoughts?

Slide 36

Getting to know… R

Slide 37

Getting to know… R

http://lang-index.sourceforge.net/#categ

R is the programmer's toolkit for statistics; SAS, Stata, and SPSS are preferred by those in business intelligence

Slide 38

Getting to know… R

Free… and very well supported online…

Slide 39

Getting to know… R

R is responsive, up-to-date, and flexible: Data Science vs. Statistics

Slide 40

Getting to know… R

1) Find the IST 380 course webpage

Try it!

www.cs.hmc.edu/~dodds/IST380/

2) Download and install R

3) Run R and try some basic commands at the prompt:

6 * 7

rnorm(10)

x <- 380
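For reference, here is roughly what those three commands do, written as a short commented R session. The set.seed() call and the variable name draws are additions for illustration (so the random numbers repeat from run to run); they are not part of the slide.

```r
6 * 7                # arithmetic: R works as a calculator at the prompt

set.seed(380)        # assumption: fix the seed so the draws are repeatable
draws <- rnorm(10)   # ten random draws from a standard normal distribution
length(draws)        # confirms there are 10 values

x <- 380             # assignment uses the arrow operator <-
x                    # typing a name prints its value
```

Without set.seed(), each call to rnorm(10) returns a different set of ten numbers, which is the expected behavior at the prompt.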

Slide 41

Getting started!

1) Open Matloff's Why R? notes
2) Skip ahead to page 7, the "5 minute example session"
3) Try out the commands in section 2.2 to get started…
4) When you finish, save your session and submit it!

This is problem 1 this week

Slide 42

Saving your session

This is problem 1 this week

1) Create a folder named hw1, perhaps on your desktop
2) Use the Save to file… (Windows) or Save as… (Mac) in order to save your current console session into hw1
3) Name that file pr1.txt
4) From your operating system, open up that file in order to confirm it contains your whole session!

Slide 43

Submitting your work

1) Zip up hw1 into hw1.zip
2) From the course webpage, click on the submission site link.
3) Choose a submission site login name & let me know!
4) Once your account is made, log in, change your password to something you know, and submit hw1.zip
5) You can submit again – all copies are saved…

You've completed Problem 1!

This webserver can be spacey -- I should know!

troubles? email me!

Slide 44

Reflection

Average and standard deviation?

Assignment?

Comments?

Printing?

Comments?

Creating a vector?
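As a quick recap of the reflection questions above, this minimal R sketch touches each item once. The values in x are only an example, not necessarily the ones used in the tutorial session.

```r
# Comments start with a hash sign and run to the end of the line
x <- c(1, 2, 4)    # creating a vector with c(), assigned with <-
avg <- mean(x)     # average
s <- sd(x)         # sample standard deviation (divides by n - 1)
print(avg)         # explicit printing
s                  # typing a name at the prompt also prints it
```

Note that sd() computes the sample standard deviation, so it divides by n - 1 rather than n.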

Slide 45

R types

You can use mode() to view the type of a variable.
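A few calls to mode() on different kinds of values give a feel for R's basic types. The values passed in are arbitrary examples.

```r
mode(380)          # "numeric"
mode("IST 380")    # "character"
mode(TRUE)         # "logical"
mode(c(1, 2, 4))   # "numeric": a vector reports the mode of its elements
mode(mean)         # "function"
```

Related functions such as class() and typeof() report finer-grained information, but mode() is enough to tell numbers, text, and logical values apart.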

Slide 46

Where's the big data?

Vectors are R lists of a single type of element

c ~ concatenate

Slide 47

Where's the big data?

Vectors are R lists of a single type of element

c ~ concatenate

the colon : also creates vectors
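Both ways of building vectors can be tried directly at the prompt. The sketch below uses made-up values and variable names; the last two lines illustrate the "single type of element" rule by mixing a number with a string.

```r
v <- c(3, 8, 0)        # c() concatenates its arguments into one vector
w <- c(v, v, 42)       # vectors passed to c() are flattened into one
length(w)              # 7 elements

r <- 1:10              # the colon builds the integer sequence 1, 2, ..., 10
sum(r)                 # 55

mixed <- c(1, "two")   # mixing types coerces everything to a single type
mode(mixed)            # "character"
```

Because a vector holds one type only, R silently converts the number 1 to the string "1" in the last example instead of raising an error.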

Slide 48

Analyzing vectors – try these…

Square brackets [] can "subset" (or "slice") vectors

Slide 49

Analyzing vectors

Square brackets [] can "subset" (or "slice") vectors

you can use a boolean vector to subset another vector
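Here is a short sketch of the subsetting forms mentioned on these slides, using an example vector that is not from the original deck.

```r
v <- c(10, 20, 30, 40, 50)

v[2]            # a single element (R indexes from 1)
v[2:4]          # a slice: elements 2 through 4
v[-1]           # a negative index drops that element

big <- v > 25   # a boolean (logical) vector, one entry per element of v
big
v[big]          # keeps only the elements where big is TRUE
v[v > 25]       # the same thing in one step
```

The comparison v > 25 is applied element by element, which is what makes the one-step form v[v > 25] work.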

Slide 50

NA

R uses NA to represent data that is "not available"

What is going on here?

The function is.na( ) tests for NA

Slide 51

NA

R uses NA to represent data that is "not available"

What is going on here?

The function is.na( ) tests for NA

This uses subsetting to remove NA values!
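The slide's own console example is not reproduced in this text, so the following is a stand-in sketch with made-up values. It shows how NA propagates through a calculation, how is.na() finds it, and how subsetting removes it.

```r
ages <- c(23, NA, 31, 45, NA)   # hypothetical data with two missing values

mean(ages)                      # NA: arithmetic involving NA yields NA
is.na(ages)                     # TRUE wherever a value is missing

clean <- ages[!is.na(ages)]     # subsetting keeps only the non-missing values
clean
mean(clean)

mean(ages, na.rm = TRUE)        # built-in shortcut for the same result
```

Testing with ages == NA does not work, because any comparison with NA is itself NA; is.na() is the reliable test.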

Slide 52

Data frames

R's fundamental data structures are data frames

The next tutorial will introduce them…

Slide 53

Irises…

setosa

virginica

data() yields many built-in data files. This is iris

Slide 54

Subsetting iris data

As with vectors, you can "subset" data frames.

df[rows,cols]
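The df[rows,cols] pattern can be tried on the built-in iris data frame. This is a minimal sketch; the variable name setosa is illustrative.

```r
data(iris)                  # load the built-in iris data frame
dim(iris)                   # 150 rows, 5 columns

iris[1:3, ]                 # rows 1 to 3, all columns (empty cols = everything)
head(iris[, "Species"])     # one column, all rows (first few values shown)

# A boolean row selector combined with a choice of columns
setosa <- iris[iris$Species == "setosa", c("Sepal.Length", "Species")]
nrow(setosa)                # 50
mean(setosa$Sepal.Length)   # average sepal length for setosa flowers
```

Leaving the rows or cols position empty selects everything along that dimension, which is why the comma is still required in iris[1:3, ].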

Slide 55

Lab…

The 2nd part of each class meeting is dedicated to lab work.

I welcome you to stay for the lab, but it is not required.

Today's lab:

Work through Santorico and Shin's Tutorial for the R Statistical Package and submit the console sessions as pr2_1.txt, pr2_2.txt, pr2_3.txt, pr2_4.txt, and pr2_5.txt.

This is a nice reinforcement of vectors, introduction to data frames, and a look at the graphics that R supports.

Slide 56

Homework

Problem 3: Challenge exercises in R

These will reinforce the "subsetting" and data-analysis introduction from pr2's tutorial.

Problem 4: Introduction to Data Science, early chapters

This is a fuller background on R and the field of data science

(submit your console session for both of these…)

Slide 57

Lab !

Slide 58

CS vs. IS and IT ?

www.acm.org/education/curric_vols/CC2005_Final_Report2.pdf

greater integration
system-wide issues

smaller details
machine specifics

Slide 59

CS vs. IS and IT ?

Where will IS go?

Slide 60

CS vs. IS and IT ?

Slide 61

IT ?

Where will IT go?

Slide 62

IT ?

Slide 63

Slide 64

The bigger picture

Weeks 10-12: Objects

Week 10: classes vs. objects
Week 11: methods and data
Week 12: inheritance

Weeks 13-15: Final Projects

Week 13: final projects
Week 14: final projects
Week 15: final exam

Slide 65

Data?!

Neighbor's name
A place they consider home
Are they working at a company now?
How many U.S. states have they visited?
Their favorite unhealthy food… ?
Do they have any "Data Science" (statistics, machine learning, CS) background?

Where?

Slide 66

state reminders…

Slide 67

Data!

Neighbor's name
A place they consider home
Are they working at a company now?
How many U.S. states have they visited?
Their favorite unhealthy food… ?
Do they have any "Data Science" (statistics, machine learning, CS) background?

Zachary Dodds

Pittsburgh, PA

Harvey Mudd

Where?

44

mostly CS for me…

M&Ms