Data Mining - Clustering and Association

Part of

Learning path: Introduction to Data Mining


Learn how to formulate and solve clustering problems and association rule extraction problems for use in Data Mining and Business Intelligence applications. Clustering and association rule extraction can help solve problems such as store layout, customer profiling, targeted marketing and market basket analysis. You will learn how to develop, validate and apply Data Mining workflows to solve clustering and association rule extraction problems. The course is self-contained, and the hands-on lectures are based on the KNIME open source software platform.

This course introduces the basic concepts and methods of Data Mining, with specific reference to Clustering and Association Rules. We present the concept and purposes of cluster analysis, together with its main components. Partitioning, hierarchical, density-based and graph-based clustering methods are described. Particular attention is devoted to cluster validity measures and clustering validation. The last part of the course introduces association rule discovery. The concepts of association rule, frequent itemset, support and confidence are defined. Furthermore, we give a brief description of the Apriori algorithm for frequent itemset generation and introduce the concepts of maximal and closed frequent itemsets. Finally, different criteria for evaluating the quality of association patterns are introduced.

By the end of this course, you will be able to develop a Data Mining workflow for solving a clustering problem, as well as one for extracting potentially interesting association rules. You will be able to choose an appropriate proximity measure and to select the "optimal clustering model" (whatever that means) for a clustering problem. You will learn all this by using the KNIME open source platform, which integrates the power and expressiveness of Weka, R and Java.

Basic knowledge of probability and statistics. Basic knowledge of R programming.
  1. Pang-Ning Tan, Michael Steinbach and Vipin Kumar (2006). Introduction to Data Mining. Addison-Wesley.
  2. Guojun Gan, Chaoqun Ma and Jianhong Wu (2007). Data Clustering: Theory, Algorithms, and Applications. SIAM.
  3. Rui Xu and Donald C. Wunsch II (2009). Clustering. Wiley.

The course spans four weeks. Each week requires 8 to 10 hours of work. Each week consists of 3 to 5 lectures. Each lecture consists of a methodology video, a software usage video and a practice session.

You must complete all practice sessions associated with the lectures and upload the corresponding KNIME workflows to the course platform.

Introduction to Clustering and Proximity

In this week you will learn about Cluster Analysis and its main components. In particular, you will learn about the different purposes of clustering and the different types of clustering. Furthermore, you will learn to develop data mining workflows for computing proximity (similarity and dissimilarity) between records consisting of multiple attributes of different types. During this week you are expected to develop and upload three KNIME workflows.
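To make the idea of proximity between records with attributes of different types more concrete, here is a minimal Python sketch (the course itself uses KNIME, not this code). It assumes a Gower-style measure: numeric attributes contribute a range-normalised absolute difference, nominal attributes contribute a simple mismatch, and the contributions are averaged. The function name `mixed_dissimilarity`, the example records and the equal attribute weights are all illustrative assumptions.

```python
def mixed_dissimilarity(a, b, kinds, ranges):
    """Average per-attribute dissimilarity in [0, 1] (Gower-style sketch)."""
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "numeric":
            # Range-normalised absolute difference; 0 if the attribute is constant.
            total += abs(x - y) / rng if rng > 0 else 0.0
        else:
            # Nominal attribute: 0 if the values match, 1 otherwise.
            total += 0.0 if x == y else 1.0
    return total / len(kinds)


# Hypothetical records: (age, occupation, monthly income).
records = [
    (25.0, "student", 1200.0),
    (45.0, "manager", 5200.0),
    (27.0, "student", 1500.0),
]
kinds = ("numeric", "nominal", "numeric")

# Observed range of each numeric attribute (unused placeholder 0.0 for nominal ones).
ranges = tuple(
    max(r[i] for r in records) - min(r[i] for r in records) if kind == "numeric" else 0.0
    for i, kind in enumerate(kinds)
)

d01 = mixed_dissimilarity(records[0], records[1], kinds, ranges)
d02 = mixed_dissimilarity(records[0], records[2], kinds, ranges)
s02 = 1.0 - d02  # a dissimilarity in [0, 1] converts to a similarity this way
```

Records 0 and 2 (two students of similar age and income) come out much closer than records 0 and 1. Other choices, such as attribute weights or a different treatment of ordinal attributes, would change the numbers but not the overall scheme.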


Lectures

1.1.1 - INTRODUCTION - PART I - theory (0:07 hours)
1.1.2 - INTRODUCTION - PART II - theory (0:09 hours)
1.1.3 - INTRODUCTION - software (0:14 hours)
1.1.4 - INTRODUCTION - PART III - theory (0:13 hours)
1.1.5 - INTRODUCTION - software (0:20 hours)
1.2.1 - PROXIMITY - PART I - theory (0:18 hours)
1.2.2 - PROXIMITY - PART II - theory (0:15 hours)
1.2.3 - PROXIMITY - software (0:12 hours)
1.2.5 - PROXIMITY - PART III - theory (0:11 hours)
1.2.6 - PROXIMITY - software (0:08 hours)

Clustering Algorithms

In this week you will learn to design and develop a data mining workflow to cluster the records of a dataset by using prototype-based, agglomerative hierarchical, density-based and graph-based clustering algorithms. During this week you are expected to develop and upload one KNIME workflow.
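As an illustration of the prototype-based family, the following Python sketch implements a bare-bones k-means (Lloyd's algorithm). It is not the course's KNIME workflow; the function name `kmeans`, the toy points and the explicitly supplied initial centroids are assumptions made to keep the example deterministic. In practice the initial centroids are usually chosen at random and the algorithm is run several times.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Minimal k-means sketch: returns (labels, centroids)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move again
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return labels, centroids


# Toy data: two visually separated groups in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
```

The prototype of each cluster is its centroid, which is why this family tends to favour compact, roughly spherical clusters; density-based and graph-based methods relax that assumption.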


Lectures

2.1.1 - PROTOTYPE-BASED - PART I - theory (0:14 hours)
2.1.2 - PROTOTYPE-BASED - PART II - theory (0:12 hours)
2.1.3 - PROTOTYPE-BASED - software (0:16 hours)
2.1.4 - PROTOTYPE-BASED - PART III - theory (0:07 hours)
2.1.5 - PROTOTYPE-BASED - PART IV - theory (0:14 hours)
2.1.6 - PROTOTYPE-BASED - software (0:16 hours)
2.2.1 - HIERARCHICAL AGGLOMERATIVE - theory (0:10 hours)
2.2.2 - HIERARCHICAL AGGLOMERATIVE - software (0:12 hours)
2.3.1 - DENSITY-BASED - theory (0:21 hours)
2.3.2 - DENSITY-BASED - software (0:09 hours)
2.4.1 - GRAPH-BASED - PART I - theory (0:12 hours)
2.4.2 - GRAPH-BASED - PART II - theory (0:12 hours)
2.4.3 - GRAPH-BASED - software (0:19 hours)

Clustering Evaluation

In this week you will learn about cluster validation, i.e. how to validate the results of Cluster Analysis. In particular, you will learn about internal and external validation measures, as well as about relative indices and the fundamental problem of cluster validity. During this week you are expected to develop and upload three KNIME workflows.
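The difference between external and internal measures can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions, not the course material: it uses the Rand index as one possible external measure (it needs reference labels) and the mean silhouette coefficient as one possible internal measure (it needs only the data and the clustering). The function names and the toy data are hypothetical.

```python
import math
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """External measure: fraction of record pairs on which two partitions agree."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def mean_silhouette(points, labels):
    """Internal measure: average silhouette coefficient (needs at least 2 clusters)."""
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(same) / len(same)  # mean distance to the point's own cluster
        b = min(                   # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, label in zip(points, labels) if label == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
truth = ["a", "a", "a", "b", "b", "b"]   # reference labels, used only by the external measure
good = [0, 0, 0, 1, 1, 1]
poor = [0, 0, 1, 1, 1, 1]                # third point placed in the wrong cluster

ri_good, ri_poor = rand_index(truth, good), rand_index(truth, poor)
sil_good, sil_poor = mean_silhouette(points, good), mean_silhouette(points, poor)
```

Both measures rank the `good` clustering above the `poor` one here, but only the silhouette could be computed if no reference labels were available.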


Lectures

3.1.1 - EXTERNAL MEASURES - theory (0:16 hours)
3.1.2 - EXTERNAL MEASURES - software (0:14 hours)
3.2.1 - INTERNAL MEASURES - theory (0:23 hours)
3.2.2 - INTERNAL MEASURES - software (0:14 hours)
3.3.1 - VALIDITY PARADIGM - theory (0:05 hours)
3.3.2 - VALIDITY PARADIGM - software (0:22 hours)
3.4.1 - THE FUNDAMENTAL PROBLEM - theory (0:10 hours)
3.4.2 - THE FUNDAMENTAL PROBLEM - software (0:17 hours)

Association Analysis

In this week you will learn how to extract association rules from transaction data. You will also learn about the Apriori principle and how to select, evaluate and compare association rules. Finally, you will learn about Simpson's paradox. During this week you are expected to develop and upload two KNIME workflows.
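The Apriori principle states that every subset of a frequent itemset is itself frequent, so candidate itemsets with an infrequent subset can be pruned without counting them. The Python sketch below shows one simple way to use that principle and then to score rules by support, confidence and lift. It is an unoptimised illustration, not the KNIME workflow used in the course; the function names, the thresholds and the small market-basket dataset are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all frequent itemsets (level-wise search)."""
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while level:
        current = {c: support(c) for c in level}
        current = {c: s for c, s in current.items() if s >= min_support}
        frequent.update(current)
        k += 1
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Apriori pruning: keep a candidate only if all its (k-1)-subsets are frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k - 1))}
    return frequent


def rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence, lift) for rules above the threshold."""
    found = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                rhs = itemset - lhs
                confidence = supp / frequent[lhs]   # estimate of P(rhs | lhs)
                if confidence >= min_confidence:
                    lift = confidence / frequent[rhs]
                    found.append((set(lhs), set(rhs), supp, confidence, lift))
    return found


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(transactions, min_support=0.6)
strong = rules(frequent, min_confidence=0.8)
```

With these thresholds the only rule that survives is {beer} -> {diapers}. A lift above 1 suggests that the two sides occur together more often than they would under independence, which is one of several criteria for judging whether a rule is interesting.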


Lectures

4.1.1 - INTRODUCTION - PART I - theory (0:07 hours)
4.1.2 - INTRODUCTION - PART II - theory (0:14 hours)
4.1.3 - INTRODUCTION - software (0:25 hours)
4.2.1 - RULES EXTRACTION - PART I - theory (0:11 hours)
4.2.2 - RULES EXTRACTION - PART II - theory (0:07 hours)
4.2.3 - RULES EXTRACTION - software (0:12 hours)
4.3.1 - MAXIMAL FREQUENT ITEMSET - theory (0:09 hours)
4.3.2 - CLOSED FREQUENT ITEMSET - theory (0:10 hours)
4.3.3 - MAXIMAL/CLOSED FREQUENT ITEMSETS - software (0:17 hours)
4.4.1 - RULES EVALUATION - PART I - theory (0:18 hours)
4.4.2 - RULES EVALUATION - software (0:17 hours)
4.4.3 - RULES EVALUATION - PART II - theory (0:09 hours)
4.4.4 - RULES EVALUATION - software (0:06 hours)
4.5.1 - INCONSISTENCY - theory (0:11 hours)
4.5.2 - SIMPSON'S PARADOX - theory (0:09 hours)
4.5.3 - SIMPSON'S PARADOX - software (0:13 hours)

Course mode
Tutored
Course status
Soft tutoring
Duration
4 weeks
Effort
10 hours/week
Category
Computer Science, Data Management and Analysis
Language
English
Type
Online
Level
Basic
Enrollment opens
21 Apr 2016
Course opens
14 Sep 2016
Tutoring starts
3 Oct 2016
Tutoring ends
14 Nov 2016
Soft tutoring
15 Nov 2016
Course closes
Not set

Participation and Certificates

Enrollment fee
FREE!
Certificate of Participation
FREE!


FABIO STELLA

Department of Informatics, Systems and Communication


PAOLA CHIESA

Informatics
