Data Mining

Contents

Slide 2

Lecture outline

What is Data Mining?
Data
Methods and stages of Data Mining

Slide 3

WHAT IS DATA MINING?

Slide 4

What is Data Mining?

Image source: https://www.mystorybook.com/books/151814

Slide 5

What is Data Mining?

Data Mining is…
Information extraction
Data excavation
Intelligent data analysis
Search for regularities
Knowledge extraction
Pattern analysis
Knowledge Discovery in Databases, KDD
Slide 6

What is Data Mining?

Slide 7

What is Data Mining?

Statistics – the science of collecting, processing and analyzing data in order to detect regularities inherent in the object under study.
Machine learning (ML) – algorithmic acquisition of new knowledge from data by a computer program.
Artificial Intelligence (AI) – the research area concerned with modelling human intellectual processes.
Slide 8

What is Data Mining?

Comparison of statistics, machine learning and Data Mining
Statistics
More theory-based than Data Mining
More focused on hypothesis testing
Machine learning
More heuristic in nature
Focused on improving the performance of learning agents
Data Mining
Integration of theory and heuristics
Focused on the data analysis process as a whole, including data cleaning, learning, integration and visualization of the results
Slide 9

What is Data Mining?

DB technology evolution

Slide 10

What is Data Mining?

Key factors behind the emergence and development of Data Mining:
Improvements in hardware and software
Improvements in data recording and storage technologies
Accumulation of large volumes of historical data
Improvements in data processing algorithms
Slide 11

What is Data Mining?

Slide 12

What is Data Mining?

Slide 13

What is Data Mining?

Data mining is the process of discovering previously unknown, nontrivial, practically useful and interpretable knowledge in raw data, for use in decision making across a wide range of human activities.
Gregory Piatetsky-Shapiro
Slide 14

DATA

Slide 15

Data

What is Data?
Data are facts:
Numbers
Texts
Images
Sounds
Video recordings

Data sources:
Measurements
Experiments
Arithmetic and logical operations
Records

Slide 16

Data

Attributes/Features

Objects

Slide 17

Data

Variable/Attribute/Feature/Characteristic
Value
Discrete/Continuous
Numeric/Categorical
Dependent/Independent
Studied objects
Population - parameters
Sample - statistics
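The distinction between population parameters and sample statistics can be illustrated with a short sketch. The salary figures are synthetic and generated only for this illustration; the distribution, sizes and seed are assumptions, not data from the lecture.

```python
import random
import statistics

# Hypothetical population: desired salaries of 10,000 job seekers (synthetic data).
random.seed(0)
population = [random.gauss(1200, 300) for _ in range(10_000)]

# Parameters describe the whole population.
population_mean = statistics.fmean(population)
population_sd = statistics.pstdev(population)  # population formula, divides by N

# Statistics are computed from a sample and serve as estimates of the parameters.
sample = random.sample(population, 100)
sample_mean = statistics.fmean(sample)
sample_sd = statistics.stdev(sample)  # sample formula, divides by n - 1

print(f"parameter mean = {population_mean:.1f}, statistic mean = {sample_mean:.1f}")
print(f"parameter sd   = {population_sd:.1f}, statistic sd   = {sample_sd:.1f}")
```

The sample values are close to, but generally not equal to, the population values; the gap tends to shrink as the sample grows.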

Slide 18

Data

Types of datasets:
Table data
Transactional data
Graphical data
Graphs
Molecular structures
Maps

Slide 19

Data

Slide 20

Data

Database – electronic data organized and stored in a specific way.
Data schema – a description of the logical structure of the data.
DBMS (database management system) – software for organizing interrelated data tables into a database.
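A minimal sketch of these three terms, using SQLite through Python's standard sqlite3 module as the DBMS. The table and column names (positions, resumes, desired_salary) and the rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# The DBMS here is SQLite; ":memory:" creates a temporary in-memory database.
conn = sqlite3.connect(":memory:")

# The CREATE TABLE statements are the data schema: they describe the logical
# structure. REFERENCES declares the link between the two tables.
conn.executescript("""
CREATE TABLE positions (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE resumes (
    id INTEGER PRIMARY KEY,
    age INTEGER NOT NULL,
    desired_salary REAL NOT NULL,
    position_id INTEGER NOT NULL REFERENCES positions(id)
);
INSERT INTO positions (id, title) VALUES (1, 'programmer'), (2, 'manager');
INSERT INTO resumes (age, desired_salary, position_id) VALUES
    (19, 800, 1), (28, 1100, 1), (41, 1500, 2);
""")

# The DBMS stores the schema itself and can report it back.
schema = [row[0] for row in conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# Interrelated tables are combined with a join.
rows = conn.execute("""
    SELECT r.age, p.title
    FROM resumes AS r JOIN positions AS p ON p.id = r.position_id
    ORDER BY r.age
""").fetchall()

print(rows)
```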
Slide 21

Data

Database requirements:
High performance
Simple data updating
Data independence
Multi-user access
Data safety
Standardized construction and operation of the DB
Data adequacy
User-friendly interface
Slide 22

Data

Data type classification:
Relational data
Multidimensional data
By permanency: variable, constant, conditionally constant
By function: operational, archive, reference
By time: periodic, point

Slide 23

Data

Metadata – data about the data
Catalogues
References
Registries

Slide 24

METHODS AND STAGES OF DATA MINING

Slide 25

Methods and Stages of Data Mining

Data Mining employs a wide variety of tools, ranging from classical statistics to the latest achievements in information technology.
Data Mining methods:
Artificial neural networks
Decision trees
Symbolic rules
K-nearest neighbors
Support vector machines (SVM)
Bayesian networks
Linear regression
Correlation-regression analysis
Clustering (hierarchical, k-means, etc.)
Association rules (Apriori algorithm)
Genetic algorithms
Visualization methods
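To make one of the listed methods concrete, here is a minimal sketch of k-nearest neighbors classification. The training records (age, desired salary in thousands of dollars) and the labels are hypothetical. A practical implementation would normally rescale the features first so that no single feature dominates the distance.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (age, desired salary in $ thousands) -> position
train = [
    ((19, 0.8), "programmer"),
    ((22, 0.9), "programmer"),
    ((24, 1.0), "programmer"),
    ((38, 1.5), "manager"),
    ((45, 1.8), "manager"),
    ((50, 2.0), "manager"),
]

print(knn_predict(train, (21, 0.85)))
print(knn_predict(train, (42, 1.6)))
```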
Slide 26

Methods and Stages of Data Mining

Most Data Mining methods are well-known mathematical algorithms and methods.
The novelty of Data Mining lies in applying them to specific scientific or business problems, which became possible thanks to technological advances.
Algorithm – an exact step-by-step description of the inputs and actions required to achieve the desired output.
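As one concrete instance of this definition, the textbook procedure for solving a quadratic equation can be written as explicit inputs, steps and output. The function name solve_quadratic is chosen here for illustration; the sketch returns real roots only.

```python
import math

def solve_quadratic(a, b, c):
    """Return the real roots of a*x**2 + b*x + c = 0 in ascending order."""
    # Input check: with a == 0 the equation is not quadratic.
    if a == 0:
        raise ValueError("a must be non-zero")
    # Step 1: compute the discriminant.
    d = b * b - 4 * a * c
    # Step 2: decide how many real roots exist.
    if d < 0:
        return []
    if d == 0:
        return [-b / (2 * a)]
    # Step 3: apply the quadratic formula.
    root = math.sqrt(d)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])

print(solve_quadratic(1, -3, 2))
```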
Slide 27

Methods and Stages of Data Mining

Abu Abdallah Muhammad ibn Musa al-Khwarizmi – medieval scientist and mathematician
The book: Al-kitāb al-mukhtaṣar fī ḥisāb al-ğabr wa’l-muqābala
Decimal system
Algorithm for solving quadratic equations
Its Latin translation, Algebra, became a starting point for European mathematics
Contained a compilation of Indian mathematicians’ achievements
Slide 28

Methods and Stages of Data Mining

Slide 29

Methods and Stages of Data Mining

Stage 1 – Discovery
Conditional logic
Associations and affinities
Trends and variations
Rule validation on a test dataset
Example: Using the HH database (induction)
Using queries, an analyst could find that the mean desired salary of specialists aged 25-35 is $1200
Using Data Mining methods, after defining the target variable:
If age<20 and desired salary>$700 then position searched is programmer (target)
If age>35 and desired salary>$1200 then managing position is searched
If managing position is searched and years of experience>15 then age is 35 in 65% of cases
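A rough sketch of how the quality of such rules could be measured on data. The records are hypothetical and do not come from the HH database. A real discovery method searches over many candidate rules, whereas this sketch only scores two rules that are already written down.

```python
# Hypothetical resume records (made up for illustration).
resumes = [
    {"age": 19, "salary": 800, "position": "programmer"},
    {"age": 18, "salary": 750, "position": "programmer"},
    {"age": 19, "salary": 900, "position": "designer"},
    {"age": 40, "salary": 1500, "position": "manager"},
    {"age": 45, "salary": 2000, "position": "manager"},
    {"age": 38, "salary": 1300, "position": "accountant"},
    {"age": 30, "salary": 1100, "position": "programmer"},
]

def rule_quality(records, condition, conclusion):
    """Return (support, confidence) of the rule 'if condition then conclusion'.

    support    = share of all records where condition and conclusion both hold
    confidence = share of records matching the condition where the conclusion holds
    """
    matching = [r for r in records if condition(r)]
    if not matching:
        return 0.0, 0.0
    correct = sum(1 for r in matching if conclusion(r))
    return correct / len(records), correct / len(matching)

support_prog, confidence_prog = rule_quality(
    resumes,
    lambda r: r["age"] < 20 and r["salary"] > 700,
    lambda r: r["position"] == "programmer",
)
support_mgr, confidence_mgr = rule_quality(
    resumes,
    lambda r: r["age"] > 35 and r["salary"] > 1200,
    lambda r: r["position"] == "manager",
)
print(f"programmer rule: support={support_prog:.2f}, confidence={confidence_prog:.2f}")
print(f"manager rule:    support={support_mgr:.2f}, confidence={confidence_mgr:.2f}")
```

Validating a rule on a test dataset amounts to computing the same two numbers on records that were not used to find the rule.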
Slide 30

Methods and Stages of Data Mining

Stage 2 – Forecasting
Use the rules detected in Stage 1 to predict unknown values
Classification and regression
Example: Using the rules derived from HH database analysis (deduction)
If age<20 and desired salary>$700 then position searched is programmer (target)
If age>35 and desired salary>$1200 then managing position is searched
If managing position is searched and years of experience>15 then age is 35 in 65% of cases
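Applying the first two rules to a new record can be sketched as a simple classifier. The function name predict_position and the choice to return None when no rule fires are assumptions made for this illustration.

```python
def predict_position(age, desired_salary):
    """Predict the position a candidate is searching for, using the Stage 1 rules.

    Returns None when no rule covers the record.
    """
    if age < 20 and desired_salary > 700:
        return "programmer"
    if age > 35 and desired_salary > 1200:
        return "manager"
    return None

# New records whose target value (position) is unknown.
for age, salary in [(19, 800), (40, 1500), (30, 1000)]:
    print(age, salary, predict_position(age, salary))
```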
Slide 31

Methods and Stages of Data Mining

Stage 3 – Exception analysis
Detect anomalies, deviations and exceptions
Example:
If age>35 and desired salary>$1200 then a managing position is searched in 90% of cases. What explains the other 10% of cases?
A second rule
An error (useful for data cleaning)
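A small sketch of exception analysis: select the records that satisfy the condition of the rule but not its conclusion, then inspect them. The records are hypothetical and do not reproduce the 90% figure; record 4 is given an implausible salary on purpose to stand in for a data entry error.

```python
# Hypothetical resume records (made up for illustration).
resumes = [
    {"id": 1, "age": 40, "salary": 1500, "position": "manager"},
    {"id": 2, "age": 45, "salary": 2000, "position": "manager"},
    {"id": 3, "age": 52, "salary": 1800, "position": "chief engineer"},
    {"id": 4, "age": 37, "salary": 130000, "position": "courier"},
    {"id": 5, "age": 28, "salary": 1100, "position": "programmer"},
]

def rule_applies(record):
    """Condition of the rule: age > 35 and desired salary > $1200."""
    return record["age"] > 35 and record["salary"] > 1200

def rule_holds(record):
    """Conclusion of the rule: a managing position is searched."""
    return record["position"] == "manager"

covered = [r for r in resumes if rule_applies(r)]
exceptions = [r for r in covered if not rule_holds(r)]

# Each exception is either a candidate for a second rule or a likely data error.
for r in exceptions:
    print(r)
```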
Slide 32

Methods and Stages of Data Mining

Technological method classification
Data preservation
Data is stored in detailed form and used directly
Problems with large amounts of data
Methods – clustering, analogy
Data distillation
Feature engineering
Dimensionality reduction
Methods:
Logical methods: induction, fuzzy logic queries, symbolic rules, decision trees, genetic algorithms
Cross-tabulation methods: agents, Bayesian networks, cross-table visualization
Equation-based methods: statistical methods (correlations, regressions), neural networks
Slide 33

Methods and Stages of Data Mining

Learning method classification
Statistical methods based on historical data
Descriptive analysis (homogeneity, stationarity hypothesis testing, distribution analysis)
Relation analysis (correlation, regression analysis)
Multidimensional statistical analysis (linear and non-linear discriminant analysis, clustering, component analysis, factor analysis)
Time series analysis
Cybernetic methods
Neural networks
Evolutionary algorithms
Genetic algorithms
Association rules
Fuzzy logic
Decision trees
Both types rely on statistics
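As an illustration of relation analysis from the statistical group, the sketch below computes a Pearson correlation coefficient and fits a least-squares line. The experience and salary values are hypothetical, and the helper names pearson_r and fit_line are chosen for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: years of experience and desired salary in dollars.
experience = [1, 2, 3, 4, 5]
salary = [750, 880, 1120, 1290, 1510]

r = pearson_r(experience, salary)
slope, intercept = fit_line(experience, salary)
print(f"r = {r:.3f}, salary ~ {intercept:.0f} + {slope:.0f} * experience")
```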