Text Mining

Part of

Pathway: Introduction to Data Mining


Natural language text is everywhere: social networks, business, finance, medicine, and biology are just a few of its many sources. However, computers are not well suited to processing natural language text directly. Indeed, Data Mining methods and algorithms, which operate on structured data, cannot be applied directly to unstructured data for knowledge extraction. The Text Mining course, the last one in the Introduction to Data Mining pathway, introduces methods and tools for knowledge extraction from natural language text.

The course assumes you are familiar with the methods and models presented in the previous two courses, namely Data Mining: Classification and Data Mining: Clustering and Association. It shows how Text Mining makes it possible to formulate and solve problems in Business Intelligence, Finance, Recommendation, Medicine, Biomedicine, Social Networks, and Intelligence Gathering, to mention just a few.

In particular, the course introduces methods, models, and algorithms for natural language text preprocessing, text categorization, text clustering, topic modeling, and information extraction.


Text Mining

This course introduces the basic concepts and methods of Text Mining, with specific reference to natural language text preprocessing, text categorization, text clustering, and information extraction. We present basic natural language text preprocessing techniques such as tokenization, filtering, stemming, disambiguation, and sentence boundary determination. We describe how to formulate and solve text categorization problems using models and methods from supervised classification. We show how to exploit text clustering to organize natural language text automatically. We introduce state-of-the-art document organization models, namely topic models, and address the problem of selecting the “optimal number of topics” (whatever that means) for a given natural language text corpus. Finally, we introduce different information extraction tasks, such as named entity recognition, noun phrase coreference resolution, semantic role recognition, entity relation recognition, and timex (temporal expression) and timeline recognition. Furthermore, we describe different instances of an Information Extraction system.
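The first three preprocessing steps named above (tokenization, filtering, stemming) can be sketched in a few lines of plain Python. This is only an illustration, not the KNIME workflow used in the course: the stop-word list, the suffix rules, and all function names are assumptions chosen for brevity, and the stemmer is far cruder than a real one such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the course's list).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def filter_tokens(tokens):
    """Drop stop words and single-character tokens."""
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]

def crude_stem(token):
    """Strip a few common English suffixes, keeping at least 3 characters."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Run tokenization, filtering, and stemming in sequence."""
    return [crude_stem(t) for t in filter_tokens(tokenize(text))]

print(preprocess("The miners are mining texts and extracted the entities."))
```

The output shows why stemming is approximate: "mining" is reduced to "min" and "entities" to "entiti", which is acceptable when stems only serve as feature identifiers.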

By the end of this course, you will be able to develop, validate, and apply Text Mining workflows for the automatic classification and organization of natural language text. Furthermore, you will learn to develop workflows that extract entities (persons, organizations, locations, genes, drugs, etc.) from natural language text and discover their relationships. You will also learn how to access natural language text from many sources, such as RSS feeds, Web pages, YouTube, Twitter, PubMed, and PDF and txt files.

The course is self-contained, and the hands-on lectures use the KNIME open source software platform, which integrates the power and expressiveness of Weka, R, Java, and Python.

Prerequisites: basic knowledge of probability and statistics; basic knowledge of R programming; the courses Data Mining: Classification and Data Mining: Clustering and Association.
  1. Sholom M. Weiss, Nitin Indurkhya and Tong Zhang (2010). Fundamentals of Predictive Text Mining, Springer. 
  2. Marie-Francine Moens (2006). Information Extraction: Algorithms and Prospects in a Retrieval Context, Springer.

The course spans four weeks. Each week requires 6 to 8 hours of work and consists of 4 to 8 video lectures. Each video lecture consists of a methodology video, a software usage video, and a practice session.

You must complete all practice sessions associated with the lectures and upload the corresponding KNIME workflows to the course platform.

Instructors: Fabio Stella and Paola Chiesa


Lessons

How to attend: lecture, practice, and interaction
Course Welcome and Introduction
Text Mining Problems
Text Preprocessing

This week you will learn about text preprocessing, i.e. how to turn textual data into a quantitative representation, with specific reference to the binary, term frequency (TF), and term frequency-inverse document frequency (TF-IDF) representations. You will also learn how to develop a text preprocessing pipeline using KNIME.
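As a rough illustration of the three representations, the sketch below builds binary, TF, and TF-IDF vectors for a made-up three-document corpus. It assumes raw counts for TF and log(N / df) for IDF; other weighting variants exist, and the one KNIME applies may differ. The corpus and all names are hypothetical.

```python
import math
from collections import Counter

# Toy corpus of already tokenized documents (made up for illustration).
docs = [
    ["text", "mining", "text"],
    ["data", "mining"],
    ["text", "clustering"],
]

# Fixed term order shared by all vectors.
vocabulary = sorted({term for doc in docs for term in doc})

def binary_vector(doc):
    """1 if the term occurs in the document, 0 otherwise."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocabulary]

def tf_vector(doc):
    """Raw number of occurrences of each term in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]

def idf(term):
    """log(N / df), where df is the number of documents containing the term."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tfidf_vector(doc):
    """Term frequency weighted by inverse document frequency."""
    return [tf * idf(term) for tf, term in zip(tf_vector(doc), vocabulary)]

print(vocabulary)
print(binary_vector(docs[0]))
print(tf_vector(docs[0]))
print(tfidf_vector(docs[0]))
```

With this weighting, a term that occurs in every document gets an IDF of zero, so it contributes nothing to the TF-IDF vector.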


Lessons

1.1.1 - INTRODUCTION - theory (0:05 hours)
1.1.2 - INTRODUCTION - software (0:11 hours)
1.2.1 - TOKENIZATION - theory (0:14 hours)
1.2.2 - FILTERING AND STEMMING - theory (0:18 hours)
1.2.3 - FILTERING AND STEMMING - software (0:15 hours)
1.3.1 - VECTOR GENERATION - theory (0:12 hours)
1.3.2 - VECTOR GENERATION - software (0:21 hours)
1.3.3 - VECTOR GENERATION - software (0:14 hours)
Text Categorization

This week you will learn about text preprocessing and text categorization. In particular, you will learn to formulate and solve a text categorization problem, i.e. how to classify natural language text when labeled examples are available. Furthermore, you will learn to develop a KNIME text categorization pipeline that predicts the sentiment expressed in IMDb movie reviews.
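To make the idea of learning from labeled examples concrete, here is a minimal sketch of one possible supervised classifier, a multinomial Naive Bayes model with add-one smoothing, written in plain Python. The choice of Naive Bayes is an assumption (the course description does not name a specific classifier), and the four training "reviews" are invented toy data, not IMDb reviews.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Collect class counts and per-class term counts from (tokens, label) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, term_counts, vocabulary

def predict(model, tokens):
    """Return the label with the highest log posterior score."""
    class_counts, term_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore terms never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (term_counts[label][token] + 1) / (total_terms + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training set (not real IMDb data).
training = [
    (["great", "plot", "great", "acting"], "positive"),
    (["wonderful", "film"], "positive"),
    (["boring", "plot"], "negative"),
    (["terrible", "acting", "boring"], "negative"),
]
model = train_naive_bayes(training)
print(predict(model, ["great", "film"]))
print(predict(model, ["boring", "acting"]))
```

In practice the token lists would come from a preprocessing pipeline, and the classifier would be validated on held-out labeled reviews rather than on hand-picked examples.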


Lessons

2.1.1 - TERM FREQUENCY - INVERSE DOCUMENT FREQUENCY - theory (0:13 hours)
2.1.2 - TERM FREQUENCY - INVERSE DOCUMENT FREQUENCY - software (0:15 hours)
2.2.1 - MULTIWORD FEATURES - theory (0:09 hours)
2.2.2 - MULTIWORD FEATURES - software (0:14 hours)
2.3.1 - TEXT CATEGORIZATION - theory (0:09 hours)
2.3.2 - TEXT CATEGORIZATION - software (0:21 hours)
2.3.3 - TEXT CATEGORIZATION - software (0:14 hours)
2.4.1 - TEXT CATEGORIZATION MAIN ISSUES - theory (0:16 hours)
2.4.2 - TEXT CATEGORIZATION MAIN ISSUES - software (0:12 hours)
Document Organization

This week you will learn about text clustering and topic modeling for document organization, i.e. how to group natural language texts according to their similarity. Furthermore, you will learn to develop KNIME workflows that apply text clustering and topic modeling to organize BBC news articles about business, entertainment, politics, sport, and technology.
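Grouping texts "according to their similarity" needs a similarity measure. The sketch below uses cosine similarity between term count vectors and performs a single assignment step of a k-means-style procedure, attaching each document to the most similar of two hand-picked seed documents. The seeds, the toy documents, and the function names are all assumptions; a full clustering algorithm would also update the cluster centers and iterate.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def assign_to_seeds(docs, seed_indices):
    """Map each document index to the index of its most similar seed document."""
    vectors = [Counter(doc) for doc in docs]
    clusters = {}
    for i, vec in enumerate(vectors):
        clusters[i] = max(seed_indices, key=lambda s: cosine(vec, vectors[s]))
    return clusters

# Invented, already tokenized mini-documents (not the BBC news data).
docs = [
    ["market", "shares", "profit"],
    ["match", "goal", "team"],
    ["profit", "market", "bank"],
    ["team", "coach", "goal"],
]
clusters = assign_to_seeds(docs, seed_indices=[0, 1])
print(clusters)
```

Documents 0 and 2 share the terms "market" and "profit" and end up with seed 0, while documents 1 and 3 share "goal" and "team" and end up with seed 1.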


Lessons

3.1.1 - INTRODUCTION - theory (0:03 hours)
3.1.2 - INTRODUCTION - software (0:09 hours)
3.2.1 - TEXT PREPROCESSING - theory (0:11 hours)
3.2.2 - TEXT PREPROCESSING - software (0:09 hours)
3.3.1 - TEXT CLUSTERING - theory (0:04 hours)
3.3.2 - TEXT CLUSTERING - software (0:13 hours)
3.3.3 - TEXT CLUSTERING - software (0:12 hours)
3.4.1 - TOPIC MODELS - theory (0:07 hours)
3.5.1 - LATENT DIRICHLET ALLOCATION - theory (0:13 hours)
3.5.2 - LATENT DIRICHLET ALLOCATION - software (0:07 hours)
3.6.1 - LEARNING LATENT DIRICHLET ALLOCATION - theory (0:13 hours)
3.6.2 - LEARNING LATENT DIRICHLET ALLOCATION - software (0:11 hours)
3.7.1 - TOPIC MODEL VALIDATION - theory (0:12 hours)
3.7.2 - TOPIC MODEL VALIDATION - software (0:09 hours)
3.7.3 - TOPIC MODEL VALIDATION - software (0:23 hours)
3.8.1 - TOPIC MODEL WORKFLOW - theory (0:05 hours)
3.8.2 - TOPIC MODEL WORKFLOW - theory (0:07 hours)
Information Extraction

This week is about extracting information from natural language text. You will learn to extract mentions of different types of entities, such as persons, locations, and organizations. Furthermore, you will learn how to develop information extraction workflows using the KNIME platform.
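The simplest way to picture entity mention extraction is dictionary (gazetteer) lookup, sketched below. This is a deliberately naive stand-in: the gazetteer entries and the example sentence are hypothetical, and the learned sequence prediction models listed in this week's lessons go well beyond exact string matching.

```python
import re

# Hypothetical toy gazetteer mapping known names to entity types.
GAZETTEER = {
    "Alice Rossi": "PERSON",
    "Acme Corp": "ORGANIZATION",
    "Rome": "LOCATION",
}

def extract_entities(text):
    """Return (mention, type, start offset) tuples sorted by position."""
    mentions = []
    for name, entity_type in GAZETTEER.items():
        # Word boundaries prevent matches inside longer words.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            mentions.append((match.group(), entity_type, match.start()))
    return sorted(mentions, key=lambda m: m[2])

entities = extract_entities("Alice Rossi joined Acme Corp in Rome.")
for mention, entity_type, start in entities:
    print(start, entity_type, mention)
```

A lookup like this cannot recognize names missing from the dictionary or resolve ambiguous ones, which is the motivation for learning named entity recognizers from annotated text.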


Lessons

4.1.1 - INTRODUCTION - theory (0:16 hours)
4.1.2 - INTRODUCTION - software (0:12 hours)
4.2.1 - NAMED ENTITY RECOGNITION - theory (0:12 hours)
4.2.2 - NAMED ENTITY RECOGNITION - software (0:22 hours)
4.3.1 - LEARNING NAMED ENTITY RECOGNITION - theory (0:15 hours)
4.4.1 - NAMED ENTITY RECOGNITION - SEQUENCE PREDICTION - theory (0:10 hours)
4.4.2 - NAMED ENTITY RECOGNITION - SEQUENCE PREDICTION - software (0:14 hours)
4.5.1 - INFORMATION EXTRACTION APPLICATIONS - theory (0:08 hours)
4.5.2 - INFORMATION EXTRACTION APPLICATIONS - software (0:17 hours)
4.6.1 - SOME NOTES ABOUT NATURAL LANGUAGE PROCESSING - theory (0:06 hours)
Course mode: Tutored
Course status: Soft tutoring
Duration: 4 weeks
Effort: 9 hours/week
Training hours: 35
Category: Computer Science, Data Management and Analysis
Language: English
Type: Online
Level: Basic
Enrollment opens: 31 Mar 2017
Course opens: 21 Apr 2017
Tutoring starts: 21 Apr 2017
Tutoring ends: 29 May 2017
Soft tutoring: 30 May 2017
Course closes: Not set

Participation and Certificates

Enrollment fee: Free
Certificate of Participation: Free


FABIO STELLA

Department of Informatics, Systems and Communication


DANIELE BELLANI

Department of Informatics, Systems and Communication

PAOLA CHIESA

Department of Informatics, Systems and Communication

ALESSANDRO BREGOLI

Department of Informatics, Systems and Communication
