
Essential Guide to Text Analysis: Comprehensive Syllabus and Complete PDF Notes


📘 UNIT 1 – INTRODUCTION TO TEXT MINING


1.1 Introduction to Text Mining

Text is one of the richest sources of information created by humans. However, raw text is unstructured, meaning it does not follow a rigid data model like databases do. As a result, computers cannot directly analyze it unless it is cleaned, transformed, structured, and interpreted.


Text Mining is defined as: “The process of transforming unstructured text into structured data to discover patterns, insights, and knowledge.”

It is also called Text Data Mining or Text Analytics, although academically:

  • Text Mining focuses on extracting patterns, concepts, relationships.

  • Text Analytics focuses on analyzing extracted information using statistics and ML.


Why Text Mining is Important

  • An estimated 80% of global data is unstructured text (emails, blogs, reports, chats).

  • Businesses use it for decision-making, customer experience monitoring, risk management, and automation.

1.2 Types of Data in Text Mining

1. Structured Data

  • Stored in tables (rows, columns).

  • Easy for machines to process.

  • Examples: database entries, spreadsheets.

2. Unstructured Data

  • Free-form, no predefined schema.

  • Examples: text messages, comments, PDFs, emails.


3. Semi-Structured Data

  • Contains tags or formatting but not fully tabular.

  • Examples: HTML, XML, JSON.

1.3 Key Terms in Text Analytics (with Definitions)

  • Corpus: A collection of text documents.

  • Token: A word or sentence unit produced by splitting text.

  • NLP: The field that enables machines to understand and process human language.

  • Stemming: Reducing words to their root form (e.g., “playing” → “play”).

  • Lemmatization: Converting words to a meaningful base form (generally more accurate than stemming).

  • TF-IDF: Weights terms by their importance in a document relative to the corpus.

  • NER: Identifying names, places, and organizations in text.
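
TF-IDF, the weighting scheme defined above, can be computed directly from its definition. A minimal sketch over a made-up three-document corpus (real pipelines would use a library such as scikit-learn, and often smooth the IDF term):

```python
import math

# Toy corpus: each document is a list of tokens.
docs = [
    ["network", "outage", "complaint"],
    ["billing", "complaint", "refund"],
    ["network", "speed", "praise"],
]

def tf_idf(term, doc, docs):
    # Term frequency: share of the document made up of this term.
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: terms that are rare across the corpus score higher.
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

print(round(tf_idf("network", docs[0], docs), 3))  # appears in 2 of 3 docs
print(round(tf_idf("outage", docs[0], docs), 3))   # appears in only 1 of 3 docs
```

Note that "outage" outscores "network" in the first document even though both occur once, because "outage" is rarer across the corpus.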

1.4 Text Mining Process (IBM SPSS — VERY IMPORTANT)

A typical text mining session consists of:

Step 1: Identify & Collect Text

  • Collect text from emails, feedback, reviews, etc.

Step 2: Prepare the Text

Preparation involves text preprocessing, such as:

  • Language detection

  • Tokenization

  • Stopword removal

  • Lemmatization

  • PoS tagging

  • Normalization
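
The preparation steps above can be sketched in plain Python. This is a minimal illustration only: the stopword set here is a tiny made-up subset, and production pipelines would use libraries such as NLTK or spaCy for tokenization, lemmatization, and PoS tagging:

```python
import re

STOPWORDS = {"the", "a", "is", "and", "to", "of"}  # illustrative subset only

def preprocess(text):
    text = text.lower()                                # normalization
    tokens = re.findall(r"[a-z']+", text)              # crude tokenization
    return [t for t in tokens if t not in STOPWORDS]   # stopword removal

print(preprocess("The network is slow and the calls drop."))
```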

Step 3: Extract Concepts

IBM’s Text Mining Node extracts:

  • Uniterms (single words)

  • Multiterms (multi-word concepts)

  • Synonyms

  • Patterns

Step 4: Build Concept or Category Models

  • Assign types, categories, sentiment groups.

Step 5: Use Extracted Information in ML Models

Combine structured + unstructured features → predictive models.

1.5 Text Mining Techniques in Depth

Technique 1: Information Retrieval (IR)

➡️ Retrieving relevant documents based on queries.

Examples: Google search, digital library search.

Includes:

  • Tokenization

  • Stemming

  • Ranking documents
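
Document ranking can be illustrated with a deliberately naive relevance score: count how often query terms appear in each document. Real IR systems weight terms (e.g., with TF-IDF or BM25) rather than using raw counts:

```python
def score(query_tokens, doc_tokens):
    # Naive relevance: total occurrences of query terms in the document.
    return sum(doc_tokens.count(q) for q in query_tokens)

docs = {
    "doc1": "slow network in the evening".split(),
    "doc2": "billing error on last invoice".split(),
    "doc3": "network outage network down".split(),
}
query = "network outage".split()
ranked = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
print(ranked)
```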

Technique 2: Natural Language Processing (NLP)

➡️ Understanding and manipulating natural language using machines.

Key NLP Tasks:

  • PoS tagging

  • Chunking

  • Sentiment analysis

  • Summarization

  • Text classification

Technique 3: Information Extraction (IE)

➡️ Extracting structured information like entities, relationships, and attributes.

Includes:

  • Named Entity Recognition (NER)

  • Feature extraction

  • Attribute selection
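
A very rough feel for entity extraction: the heuristic below treats runs of capitalized words as entity candidates. This is only a sketch; real NER systems use trained statistical or neural models, and this rule misses lowercase entities and over-matches sentence-initial words:

```python
import re

def naive_ner(text):
    # Heuristic: consecutive capitalized words are entity candidates.
    return re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)

print(naive_ner("Alice Smith from Acme Corp visited New York."))
```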

1.6 IBM SPSS Text Mining Nodes

1. File List Node

Reads multiple documents from a folder.

2. Web Feed Node

Reads RSS/HTML feeds.

3. Text Mining Node

Extracts concepts, types, meanings.

4. Text Link Analysis Node

Identifies relationships between concepts.

5. Translate Node

Translates non-English text to English.

1.7 Steps in a Typical Text Mining Session (Must Learn for Exams)

  1. Select text source

  2. Import data (File List/Web Feed)

  3. Preprocess text

  4. Extract candidate terms

  5. Normalize & group synonyms

  6. Assign categories

  7. Build predictive models

  8. Deploy & monitor

📘 UNIT 2 – READING TEXT DATA

2.1 Introduction

Text data is stored in multiple formats such as:

  • TXT

  • PDF

  • DOC/DOCX

  • HTML

  • RSS feeds

2.2 File List Node (Most Important Topic)

Definition

➡️ A node that generates a list of document names or paths for text mining.


Why Use It?

  • Real datasets contain thousands of files.

  • This node avoids manual loading.

How It Works

  1. Choose folder

  2. SPSS scans it

  3. Generates file list

  4. Passes file paths to Text Mining Node

Output Fields

  • filename

  • fullpath

  • extracted text
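
The File List Node's behavior, producing filename and full path per document, can be approximated in Python with the standard library. Extracting the text itself from PDF or DOCX files would need additional libraries (e.g., pypdf, python-docx):

```python
from pathlib import Path

def list_documents(folder, extensions=(".txt", ".pdf", ".docx")):
    # Scan a folder recursively and return one record per matching document,
    # mirroring the node's "filename" and "fullpath" output fields.
    return [
        {"filename": p.name, "fullpath": str(p.resolve())}
        for p in Path(folder).rglob("*")
        if p.suffix.lower() in extensions
    ]

# Example: list_documents("feedback_docs/") yields records for every
# TXT/PDF/DOCX file under that folder.
```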

2.3 File Viewer Node

Used to preview documents before mining.

2.4 Web Feed Node

Definition

➡️ A node that reads text from online feeds such as RSS or HTML pages.


Uses:

  • Mining blogs

  • Mining online news

  • Streaming live text data

2.5 Demonstration 1 – Reading Multiple Files


Steps

  1. Drag File List Node

  2. Select directory

  3. Select file types

  4. Enable “Extract Text”

  5. Connect to Text Mining node

  6. Run stream

2.6 Python Equivalent

The notes include Python code for:

  • Reading TXT, PDF, DOCX

  • Tokenizing text

  • Extracting word frequencies
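
Word-frequency extraction, the last item above, fits in a few lines of standard-library Python:

```python
from collections import Counter
import re

def word_frequencies(text, top_n=5):
    # Lowercase, tokenize on letters, and count the most common tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_n)

print(word_frequencies("Call dropped. Call quality poor. Dropped again."))
```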

📘 UNIT 3 – LINGUISTIC ANALYSIS & TEXT MINING

3.1 What is Linguistic Analysis?

Linguistic analysis is the heart of text mining. It involves studying text structure to derive meaning.

It Includes:

  • Sentence Detection

  • Tokenization

  • Lemmatization

  • Part-of-Speech Tagging

  • Dependency Parsing

  • Noun Phrase Extraction

3.2 Sentence Detection

Definition: Splitting text into sentence-level units.
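
A minimal rule-based splitter, assuming sentences end in `.`, `!`, or `?` followed by whitespace. Real sentence detectors also handle abbreviations ("Dr.", "e.g.") and decimal numbers, which this sketch does not:

```python
import re

def split_sentences(text):
    # Split after ., !, or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(split_sentences("Service was fine. Then the network failed! Why?"))
```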

3.3 Tokenization

Definition: Splitting text into words/tokens.

Problems encountered:

  • Spelling issues

  • Missing spaces



3.4 Lemmatization

Definition: Converting a word into its base form using grammar rules.

Examples:

  • “better” → “good”

  • “driving” → “drive”

3.5 Part-of-Speech (PoS) Tagging

Methods:

A. Rule-Based PoS Tagging

Uses:

  • Lexicons

  • Morphological rules (suffixes)

  • Contextual rules

Examples:

  • “ing” → verb (VBG)

  • After “the” → noun
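
The three rule types, lexicon lookup, suffix rules, and contextual rules, can be combined into a toy tagger. The lexicon here is a made-up three-entry example; real rule-based taggers use large lexicons and ordered rule sets:

```python
LEXICON = {"the": "DT", "a": "DT", "is": "VBZ"}  # tiny illustrative lexicon

def rule_based_tag(tokens):
    tags = []
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            tags.append(LEXICON[tok])            # 1. lexicon lookup
        elif tok.endswith("ing"):
            tags.append("VBG")                   # 2. morphological (suffix) rule
        elif i > 0 and tokens[i - 1] in ("the", "a"):
            tags.append("NN")                    # 3. contextual rule: after a determiner
        else:
            tags.append("NN")                    # fallback (real taggers do better here)
    return list(zip(tokens, tags))

print(rule_based_tag("the cat is playing".split()))
```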

B. Statistical PoS Tagging

Models:

  • HMM (Hidden Markov Model)

  • MEMM

C. Machine Learning PoS Tagging

Uses:

  • SVM

  • Neural Networks

  • BiLSTM

  • CRF

  • Transformers (BERT)

3.6 Extractor Component – Complete Workflow

Steps:

  1. Ingest documents

  2. Preprocessing

  3. Linguistic Annotation

    • PoS

    • Lemma

    • NER

  4. Candidate Generation

    • N-grams

    • NP-chunks

    • Dependency-based spans

  5. Filtering

  6. Scoring (PMI, TF-IDF, C-value)

  7. Normalization

  8. Export final list of concepts

3.7 Identification of Candidate Terms

Methods:

  • POS-patterns

  • Chunking

  • N-gram enumeration

  • Dependency parsing

  • Collocations (PMI)
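
Pointwise Mutual Information (PMI), listed above for collocation detection, scores how much more often two words co-occur than chance predicts: PMI(x, y) = log( P(x, y) / (P(x) P(y)) ). A sketch over a made-up token stream:

```python
import math

def pmi(pair, tokens):
    # Estimate P(x, y) from adjacent bigram counts, P(x) and P(y) from unigrams.
    n = len(tokens)
    bigrams = list(zip(tokens, tokens[1:]))
    p_xy = bigrams.count(pair) / len(bigrams)
    p_x = tokens.count(pair[0]) / n
    p_y = tokens.count(pair[1]) / n
    return math.log(p_xy / (p_x * p_y))

tokens = "dropped call again dropped call billing issue".split()
print(round(pmi(("dropped", "call"), tokens), 3))
```

A strongly positive PMI flags "dropped call" as a candidate multiterm rather than a chance pairing.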

3.8 Equivalence Classes

Definition

Grouping different forms of words representing the same concept.

Examples:

  • dropped call, call drops → “call_drop”

  • USA, America → “USA”

Techniques:

  • WordNet

  • TF-IDF cosine similarity

  • Word2Vec/BERT

  • Hybrid (Levenshtein + embeddings)
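
String-similarity grouping, the simplest of the techniques above, can be sketched with the standard library's `difflib`. Note this catches spelling variants only; synonym pairs like "USA"/"America" need WordNet or embedding-based similarity:

```python
from difflib import SequenceMatcher

def group_variants(terms, threshold=0.8):
    # Greedy grouping: a term joins the first group whose representative
    # it resembles closely enough, else it starts a new group.
    groups = []
    for term in terms:
        for group in groups:
            if SequenceMatcher(None, term, group[0]).ratio() >= threshold:
                group.append(term)
                break
        else:
            groups.append([term])
    return groups

print(group_variants(["dropped call", "droped call", "billing error"]))
```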

3.9 Forcing & Excluding

  • Forcing: Manually include certain terms

  • Excluding: Remove unwanted terms

Used for:

  • Domain-specific mining

  • Removing noise

3.10 Assigning Types

Types = Semantic labels like:

  • PERSON

  • LOCATION

  • ORGANIZATION



📘 UNIT 4 – MACHINE CATEGORIZATION TECHNIQUES

4.1 Introduction

Machine categorization involves assigning text to categories such as:

  • Complaint

  • Praise

  • Request

  • Product feature

4.2 Types of Categorization Techniques

1. Manual / Rule-Based Categorization

Uses:

  • Dictionaries

  • Rules (if-else)

  • Pattern matching

2. Statistical Categorization

Uses:

  • Naive Bayes

  • SVM

  • Logistic Regression

3. Linguistic Categorization

Uses:

  • Noun phrases

  • PoS-tag patterns

  • Equivalence classes

4. Hybrid Categorization

Combination of:

  • Linguistic features

  • Statistical models

  • Synonym dictionaries

4.3 Text Analysis Packages (TAP)

TAP contains:

  • Synonym libraries

  • Type dictionaries

  • Exclude lists

  • Categorization rules

4.4 Demonstrations (As in Syllabus)

Demo 1: Use TAP to Categorize Data

Steps:

  1. Extract concepts

  2. Attach categories

  3. Create category nugget

Demo 2: Import Predefined Categories (Excel)

Used to map:

  • Product issues

  • Complaint categories

Demo 3: Automated Classification

IBM SPSS identifies categories automatically.

📘 UNIT 5 – MONITORING USING TEXT MINING MODELS

5.1 Objective

Build a hybrid churn prediction model using:

  • Linguistic analysis of customer feedback

  • Structured data

  • ML algorithms

5.2 Business Understanding

  • 18% churn → $6M loss

  • Goal: predict churn probability

5.3 Data Description

  • 5,000 structured customer records

  • 4,500 text feedback files

  • 500 forum posts

5.4 Full Text Mining Pipeline

Steps include:

  1. File List Node equivalent (Python)

  2. Web Feed reading

  3. Data merging

  4. Text preprocessing

  5. PoS Tagging

  6. Noun phrase extraction

  7. Collocations via PMI

  8. TAP-based normalization

  9. Feature engineering

  10. Model building

  11. Scoring new data

5.5 Linguistic Feature Engineering

Features include:

  • Noun phrases

  • Bigrams

  • Synonym clusters

  • Equivalence classes

  • Sentiments

5.6 Machine Learning Models

Models used:

  • Logistic Regression

  • Random Forest

  • XGBoost

  • Neural networks

Evaluation metrics:

  • AUC ≥ 0.88

  • Recall ≥ 0.80

5.7 Scoring New Data

Steps:

  1. Clean new text

  2. Extract linguistic features

  3. Combine with structured features

  4. Use saved model for scoring
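
The scoring steps above reduce to: build a feature vector from the new text plus structured fields, then apply the saved model. A sketch assuming a logistic-regression churn model; the feature names and weights here are entirely hypothetical stand-ins for a real trained model:

```python
import math

# Hypothetical weights, standing in for a saved trained model.
WEIGHTS = {"complaint_terms": 0.9, "negative_sentiment": 1.2, "tenure_months": -0.05}
BIAS = -1.0

def churn_probability(features):
    # Linear combination of features, squashed to a probability by the sigmoid.
    z = BIAS + sum(WEIGHTS[k] * v for k, v in features.items())
    return 1 / (1 + math.exp(-z))

p = churn_probability({"complaint_terms": 3, "negative_sentiment": 1.0, "tenure_months": 24})
print(round(p, 3))
```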

 


