Data Mining - Classification

Part of the Introduction to Data Mining pathway


Learn how to formulate and solve classification problems for Data Mining and Business Intelligence applications such as fraud detection, customer churn, and network intrusion detection. You will learn how to develop, validate and apply a data mining workflow to solve binary and non-binary classification problems. The course is self-contained and does not require any programming skills. Hands-on lectures are based on the KNIME open source software platform.


This course introduces the basic concepts and methods of Data Mining, with specific reference to Classification. It first covers the following general concepts: data types, summarization, missing data replacement, and data pre-processing. Classification is then introduced together with the following concepts: explanatory and class attributes, train and test data sets, classifier (learner and inducer), performance measures (accuracy, error, precision and recall), k-fold cross-validation, overfitting and underfitting, the curse of dimensionality, the cost matrix, the receiver operating characteristic curve, lift and cumulative gains charts, irrelevant and redundant attributes, and feature selection. The following classification models are described: decision trees, logistic regression, support vector machines, multilayer perceptron, naïve Bayes, tree-augmented naïve Bayes and Bayesian classifiers. The course is as self-contained as possible and does not require any programming skills. Instead, the KNIME open source software platform, which is built around the concept of a graphical workflow, is used to mine datasets consisting of different data types.
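The course itself works in KNIME and needs no programming, so the following is only an illustrative aside. It is a minimal Python sketch of how k-fold cross-validation partitions a data set: each fold serves once as the test set while the remaining folds form the train set. The function names `k_fold_indices` and `train_test_splits` are hypothetical, and the sketch assumes the rows have already been shuffled.

```python
def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k contiguous folds of near-equal size."""
    if not 1 < k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    base, extra = divmod(n_rows, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds take one additional row each.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_test_splits(n_rows, k):
    """Yield one (train_indices, test_indices) pair per fold."""
    folds = k_fold_indices(n_rows, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Averaging a performance measure over the k test folds gives a less optimistic estimate than measuring it on the data used to train the classifier.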

By the end of this course, you will be able to: 

  • develop a Data Mining workflow for solving a classification problem,
  • apply elementary missing-data replacement strategies,
  • apply pre-processing techniques, including dimensionality reduction,
  • select and deploy the “optimal classifier” (whatever that means), also taking decision costs into account,
  • select relevant attributes and remove irrelevant and/or redundant attributes.

You will learn all this using the KNIME open source platform, which integrates the power and expressiveness of Weka, R and Java.

Prerequisites: basic knowledge of probability, statistics and mathematics.

Reference text:
  • Pang-Ning Tan, Michael Steinbach and Vipin Kumar (2006). Introduction to Data Mining. Addison-Wesley.

The course spans four weeks. Each week requires 8 to 10 hours of work. Each week consists of 5 to 7 video-lectures. Each video-lecture consists of a methodology video, a software usage video and a practice session.

You must complete all practice sessions associated with the lectures and then upload the corresponding KNIME workflows you developed to the course platform.

Data Type, Exploration and Preprocessing

This week you will learn how to design and develop data mining workflows for data exploration and pre-processing. In particular, you will learn to load data sets, to summarize categorical, nominal and numeric attributes, to replace missing data, to transform different types of attributes, and to reduce the dimensionality of a given data set.
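In the course these steps are carried out with KNIME nodes. Purely as an illustration, an elementary missing-data replacement strategy can be sketched in Python as below. The sketch assumes missing entries are represented by `None`, fills numeric attributes with the mean and categorical attributes with the most frequent value, and uses a hypothetical function name `replace_missing`.

```python
from collections import Counter
from statistics import mean


def replace_missing(values, numeric):
    """Return a copy of `values` with None replaced by the mean (numeric
    attribute) or by the most frequent value (categorical attribute)."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to estimate the replacement from; leave the column as is.
        return list(values)
    if numeric:
        fill = mean(observed)
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


ages = replace_missing([1.0, None, 3.0], numeric=True)
colors = replace_missing(["a", None, "b", "a"], numeric=False)
```

Here `ages` becomes `[1.0, 2.0, 3.0]` and `colors` becomes `["a", "a", "b", "a"]`. Mean and mode replacement are only the simplest options; they ignore relationships between attributes.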


Lectures

1.1.1 - TYPE - theory (0:13 hours)
1.1.2 - TYPE - software (0:22 hours)
1.2.1 - EXPLORATION - theory (0:21 hours)
1.2.2 - EXPLORATION - software (0:17 hours)
1.3.1 - MISSING REPLACEMENT - theory (0:17 hours)
1.3.2 - MISSING REPLACEMENT - software (0:17 hours)
1.4.1 - PREPROCESSING PART I - theory (0:24 hours)
1.4.2 - PREPROCESSING PART I - software (0:22 hours)
1.5.1 - PREPROCESSING PART II - theory (0:23 hours)
1.5.2 - PREPROCESSING PART II - software (0:15 hours)
1.5.4 - PREPROCESSING PART II - software (0:14 hours)

Introduction to Classification and Classification Techniques

This week you will learn how to design and develop a data mining workflow for solving a classification problem, with specific reference to binary classification. You will learn how to build different classification models, including decision trees, logistic regression, artificial neural networks, support vector machines, the naïve Bayes classifier and Bayesian classifiers.
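To give a feel for what a learner does, here is a minimal Python sketch of a decision stump, that is, a decision tree with a single split on one numeric explanatory attribute. It is a deliberate simplification and not the method used in the course or in KNIME: it picks the threshold by counting training errors, whereas full decision tree learners typically use impurity measures such as entropy or the Gini index and split recursively. The names `fit_stump` and `predict_stump` are hypothetical, and class labels are assumed to be 0 or 1.

```python
def fit_stump(xs, ys):
    """Learn (threshold, label_if_above) with the fewest training errors.

    The rule predicts label_if_above when x > threshold and the other
    label otherwise. Labels are assumed to be 0 or 1.
    """
    points = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    best = None
    for t in candidates:
        for above in (0, 1):
            errors = sum(
                (above if x > t else 1 - above) != y for x, y in zip(xs, ys)
            )
            if best is None or errors < best[0]:
                best = (errors, t, above)
    if best is None:
        raise ValueError("need at least two distinct attribute values")
    return best[1], best[2]


def predict_stump(model, x):
    """Apply the learned rule to a single attribute value."""
    threshold, above = model
    return above if x > threshold else 1 - above


model = fit_stump([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
```

On this toy train set the two classes are perfectly separated, so the stump learns the threshold 6.5 and predicts class 1 above it.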


Lectures

2.1.1 - INTRODUCTION - theory (0:17 hours)
2.1.2 - INTRODUCTION - software (0:22 hours)
2.2.1 - CLASSIFICATION TECHNIQUES - PART I - theory (0:20 hours)
2.2.2 - CLASSIFICATION TECHNIQUES - PART I - software (0:19 hours)
2.3.1 - CLASSIFICATION TECHNIQUES - PART II - theory (0:21 hours)
2.3.2 - CLASSIFICATION TECHNIQUES - PART II - software (0:08 hours)
2.4.1 - CLASSIFICATION TECHNIQUES - PART III - theory (0:12 hours)
2.4.2 - CLASSIFICATION TECHNIQUES - PART III - software (0:07 hours)
2.5.1 - CLASSIFICATION TECHNIQUES - PART IV - theory (0:24 hours)
2.5.2 - CLASSIFICATION TECHNIQUES - PART IV - software (0:08 hours)
2.6.1 - CLASSIFICATION TECHNIQUES - PART V - theory (0:06 hours)
2.6.2 - CLASSIFICATION TECHNIQUES - PART V - software (0:09 hours)

Classifier Performance Evaluation and Classifiers Comparison

This week you will learn how to develop a data mining workflow for evaluating the performance of a classifier using the following performance measures: accuracy, error, precision, and recall. Furthermore, you will learn to develop a data mining workflow to compare different classifiers and to decide which one is the "optimal classifier".
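The four measures named above all derive from the counts of true and false positives and negatives on a test set. The following Python sketch shows one way to compute them for a binary problem; it is illustrative only, since the course computes them with KNIME. The function names are hypothetical, and returning 0.0 when precision or recall is undefined is a convention chosen here, not one stated by the course.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true positives, false positives, true negatives, false negatives."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def performance(actual, predicted, positive=1):
    """Return accuracy, error, precision and recall as a dictionary."""
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no test cases")
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,
        # Assumed convention: report 0.0 when the denominator is zero.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


measures = performance([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
```

In this example 6 of the 8 cases are classified correctly, so accuracy is 0.75 and error is 0.25, while precision and recall are both 2/3.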


Lectures

3.1.1 - PERFORMANCE EVALUATION - PART I - theory (0:17 hours)
3.1.2 - PERFORMANCE EVALUATION - PART I - software (0:16 hours)
3.2.1 - PERFORMANCE EVALUATION - PART II - theory (0:25 hours)
3.2.2 - PERFORMANCE EVALUATION - PART II - software (0:25 hours)
3.3.1 - COMPARING CLASSIFIERS - theory (0:23 hours)
3.3.2 - COMPARING CLASSIFIERS - software (0:15 hours)
3.3.4 - COMPARING CLASSIFIERS - software (0:14 hours)

Class Imbalance Problem, Feature Selection and Non Binary Class

This week you will learn about the class imbalance problem and how to develop a data mining workflow for comparing classifiers in terms of their effectiveness in selecting target customers. You will also learn how to develop a data mining workflow to decide which are the “optimal features” for solving a classification problem. Finally, you will learn how to develop a data mining workflow to solve non-binary classification problems.
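One common way to judge how well a classifier selects target customers is a cumulative gains chart: customers are ranked by the score the classifier assigns to the positive class, and the chart shows what fraction of all positive cases is captured when only the top-ranked fraction is contacted. The Python sketch below computes the points of such a chart. It is an illustration under assumptions, not the course's KNIME workflow; the function name `cumulative_gains` and the toy scores are hypothetical.

```python
def cumulative_gains(scores, actual, positive=1):
    """Return (fraction_contacted, fraction_of_positives_captured) points,
    ranking cases from the highest score to the lowest."""
    total_pos = sum(1 for a in actual if a == positive)
    if total_pos == 0:
        raise ValueError("no positive cases in the data")
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    points, captured = [(0.0, 0.0)], 0
    for i, (_, a) in enumerate(ranked, start=1):
        if a == positive:
            captured += 1
        points.append((i / len(ranked), captured / total_pos))
    return points


gains = cumulative_gains([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

With these toy values, contacting the top 25% of customers already captures half of the positive cases, and contacting the top 75% captures all of them. A classifier that ranked at random would, on average, follow the diagonal instead.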


Lectures

4.1.1 - CLASS IMBALANCE PROBLEM - theory (0:20 hours)
4.1.2 - CLASS IMBALANCE PROBLEM - software (0:10 hours)
4.2.1 - COUNTING THE COST – PART I - theory (0:14 hours)
4.2.2 - COUNTING THE COST – PART I - software (0:13 hours)
4.3.1 - COUNTING THE COST – PART II - theory (0:20 hours)
4.3.2 - COUNTING THE COST – PART II - software (0:06 hours)
4.4.1 - COUNTING THE COST – PART III - theory (0:12 hours)
4.4.2 - COUNTING THE COST – PART III - software (0:09 hours)
4.5.1 - FEATURE SELECTION - theory (0:22 hours)
4.5.2 - FEATURE SELECTION - software (0:24 hours)
4.6.1 - NON BINARY CLASSIFICATION - theory (0:11 hours)
4.6.2 - NON BINARY CLASSIFICATION - software (0:16 hours)

Course mode: Tutoring
Course status: Soft tutoring
Duration: 4 weeks
Workload: 10 hours/week
Category: Computer Science, Data Management and Analysis
Language: English
Type: Online
Level: Basic
Enrollment opens: 4 Apr 2016
Course opens: 22 Apr 2016
Tutoring starts: 2 May 2016
Tutoring ends: 30 Jun 2016
Soft tutoring: 1 Jul 2016
Course closes: Not set

Participation and Certificates

Enrollment fee: FREE!
Certificate of participation: FREE!


FABIO STELLA

Department of Informatics, Systems and Communication
