Essential Guide to Text Analysis Comprehensive Syllabus and Complete PDF Notes
- Computer Science


📘 UNIT 1 – INTRODUCTION TO TEXT MINING
1.1 Introduction to Text Mining
Text is one of the richest sources of information created by humans. However, raw text is unstructured, meaning it does not follow a rigid data model like databases do. As a result, computers cannot directly analyze it unless it is cleaned, transformed, structured, and interpreted.
Text Mining is defined as: “The process of transforming unstructured text into structured data to discover patterns, insights, and knowledge.”
It is also called Text Data Mining or Text Analytics, although academically:
Text Mining focuses on extracting patterns, concepts, relationships.
Text Analytics focuses on analyzing extracted information using statistics and ML.
Why Text Mining is Important
An estimated 80% of enterprise data is unstructured text (emails, blogs, reports, chats).
Businesses use it for decision-making, customer experience monitoring, risk management, and automation.
1.2 Types of Data in Text Mining
1. Structured Data
Stored in tables (rows, columns).
Easy for machines to process.
Examples: database entries, spreadsheets.
2. Unstructured Data
Free-form, no predefined schema.
Examples: text messages, comments, PDFs, emails.
3. Semi-Structured Data
Contains tags or formatting but not fully tabular.
Examples: HTML, XML, JSON.
1.3 Key Terms in Text Analytics (with Definitions)
| Term | Meaning |
| --- | --- |
| Corpus | A collection of text documents. |
| Token | A word or sentence unit after splitting text. |
| NLP | Field enabling machines to understand and process human language. |
| Stemming | Reducing words to root form (e.g., “playing” → “play”). |
| Lemmatization | Converting words to a meaningful base form (more accurate than stemming). |
| TF-IDF | Weights terms by their importance in a document relative to the corpus. |
| NER | Identifying names, places, and organizations in text. |
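The TF-IDF weighting from the table can be sketched in plain Python. This is a minimal illustration using the classic tf × log(N/df) variant; SPSS and other tools apply their own smoothed formulas, and the corpus here is invented for the example:

```python
import math

def tf_idf(term, doc, corpus):
    """Weight a term by its frequency in one document, scaled down
    when the term also appears across many documents in the corpus."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)  # assumes the term occurs in at least one doc
    return tf * idf

corpus = [
    ["network", "outage", "complaint"],
    ["billing", "complaint"],
    ["network", "speed"],
]
# "outage" is rare in the corpus, so it scores higher than the common "complaint"
print(tf_idf("outage", corpus[0], corpus))
print(tf_idf("complaint", corpus[0], corpus))
```

Note how the rarer term gets the larger weight even though both occur once in the document.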
1.4 Text Mining Process (IBM SPSS — VERY IMPORTANT)
A typical text mining session consists of:
Step 1: Identify & Collect Text
Collect text from emails, feedback, reviews, etc.
Step 2: Prepare the Text
Preparation involves text preprocessing, such as:
Language detection
Tokenization
Stopword removal
Lemmatization
PoS tagging
Normalization
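The preparation steps above can be sketched in a few lines of Python. This is a toy pipeline: real systems use NLTK or spaCy for proper lemmatization, and the stopword list and suffix rule here are illustrative only:

```python
import re

STOPWORDS = {"the", "a", "is", "was", "and", "to"}  # tiny illustrative list

def preprocess(text):
    # Normalization: lowercase the raw text
    text = text.lower()
    # Tokenization: pull out alphabetic word tokens
    tokens = re.findall(r"[a-z]+", text)
    # Stopword removal
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Crude "-ing" stripping as a stand-in for real lemmatization
    tokens = [t[:-3] if t.endswith("ing") and len(t) > 5 else t for t in tokens]
    return tokens

print(preprocess("The network was dropping calls."))
```

The crude suffix rule produces stems like "dropp" rather than the true lemma "drop", which is exactly why lemmatization (grammar-aware) is preferred over naive stemming.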
Step 3: Extract Concepts
IBM’s Text Mining Node extracts:
Uniterms (single words)
Multiterms (multi-word concepts)
Synonyms
Patterns
Step 4: Build Concept or Category Models
Assign types, categories, sentiment groups.
Step 5: Use Extracted Information in ML Models
Combine structured + unstructured features → predictive models.
1.5 Text Mining Techniques in Depth
Technique 1: Information Retrieval (IR)
➡️ Retrieving relevant documents based on queries.
Examples: Google search, digital library search.
Includes:
Tokenization
Stemming
Ranking documents
Technique 2: Natural Language Processing (NLP)
➡️ Understanding and manipulating natural language using machines.
Key NLP Tasks:
PoS tagging
Chunking
Sentiment analysis
Summarization
Text classification
Technique 3: Information Extraction (IE)
➡️ Extracting structured information like entities, relationships, and attributes.
Includes:
Named Entity Recognition (NER)
Feature extraction
Attribute selection
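A very rough flavour of entity extraction can be shown with a capitalized-word heuristic. This is not real NER (trained models such as spaCy's also classify entities as PERSON, ORG, etc.); it only illustrates the idea of spotting candidate entity spans:

```python
import re

def naive_entities(text):
    # Runs of one or more capitalized words; a crude stand-in for NER.
    # (It will also wrongly pick up sentence-initial words.)
    return re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)

print(naive_entities("Alice Smith joined Acme Corp in New York."))
```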
1.6 IBM SPSS Text Mining Nodes
1. File List Node
Reads multiple documents from a folder.
2. Web Feed Node
Reads RSS/HTML feeds.
3. Text Mining Node
Extracts concepts, types, meanings.
4. Text Link Analysis Node
Identifies relationships between concepts.
5. Translate Node
Translates non-English text to English.
1.7 Steps in a Typical Text Mining Session (Must Learn for Exams)
Select text source
Import data (File List/Web Feed)
Preprocess text
Extract candidate terms
Normalize & group synonyms
Assign categories
Build predictive models
Deploy & monitor
📘 UNIT 2 – READING TEXT DATA
2.1 Introduction
Text data is stored in multiple formats such as:
TXT
PDF
DOC/DOCX
HTML
RSS feeds
2.2 File List Node (Most Important Topic)
Definition
➡️ A node that generates a list of document names or paths for text mining.
Why Use It?
Real datasets contain thousands of files.
This node avoids manual loading.
How It Works
Choose folder
SPSS scans it
Generates file list
Passes file paths to Text Mining Node
Output Fields
filename
fullpath
extracted text
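A Python analogue of the File List Node can be built with the standard library. The field names mirror the node's output fields above; the folder and file contents in the demo are invented:

```python
import tempfile
from pathlib import Path

def file_list(folder, extensions=(".txt",)):
    """Scan a folder and return filename, fullpath, and extracted
    text for each matching document (File List Node equivalent)."""
    rows = []
    for path in sorted(Path(folder).rglob("*")):
        if path.suffix.lower() in extensions:
            rows.append({
                "filename": path.name,
                "fullpath": str(path),
                "text": path.read_text(encoding="utf-8"),
            })
    return rows

# Demo: create two files; only the .txt one is picked up
tmp = Path(tempfile.mkdtemp())
(tmp / "review1.txt").write_text("The battery drains fast.")
(tmp / "notes.md").write_text("ignored")
print([row["filename"] for row in file_list(tmp)])
```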
2.3 File Viewer Node
Used to preview documents before mining.
2.4 Web Feed Node
Definition
➡️ A node that reads text from online feeds like RSS or HTML pages.
Uses:
Mining blogs
Mining online news
Streaming live text data
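The Web Feed Node's core job, turning an RSS document into rows of text, can be sketched with the standard library. The sample feed is invented; a live pipeline would first download the XML (e.g. with urllib or the third-party feedparser):

```python
import xml.etree.ElementTree as ET

SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Demo News</title>
  <item><title>Outage resolved</title><description>Service is back.</description></item>
  <item><title>New plan launched</title><description>Details inside.</description></item>
</channel></rss>"""

def read_feed(xml_text):
    """Extract (title, description) pairs from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("description"))
            for item in root.iter("item")]

print(read_feed(SAMPLE_FEED))
```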
2.5 Demonstration 1 – Reading Multiple Files
Steps
Drag File List Node
Select directory
Select file types
Enable “Extract Text”
Connect to Text Mining node
Run stream
2.6 Python Equivalent
The accompanying Python code covers:
Reading TXT, PDF, DOCX
Tokenizing text
Extracting word frequencies
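The word-frequency part of that pipeline is a one-liner with the standard library; the sample sentence is invented:

```python
import re
from collections import Counter

def word_frequencies(text, top_n=5):
    # Lowercase, tokenize on letters/apostrophes, count occurrences
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_n)

print(word_frequencies("Call dropped. Another call dropped again."))
```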
📘 UNIT 3 – LINGUISTIC ANALYSIS & TEXT MINING
3.1 What is Linguistic Analysis?
Linguistic analysis is the heart of text mining. It involves studying text structure to derive meaning.
It Includes:
Sentence Detection
Tokenization
Lemmatization
Part-of-Speech Tagging
Dependency Parsing
Noun Phrase Extraction
3.2 Sentence Detection
Definition: Splitting text into sentence-level units.
3.3 Tokenization
Definition: Splitting text into words/tokens.
Problems encountered:
Spelling issues
Missing spaces
3.4 Lemmatization
Definition: Converting a word into its base form using grammar rules.
Examples:
“better” → “good”
“driving” → “drive”
3.5 Part-of-Speech (PoS) Tagging
Main approaches:
A. Rule-Based PoS Tagging
Uses:
Lexicons
Morphological rules (suffixes)
Contextual rules
Examples:
“ing” → verb (VBG)
After “the” → noun
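The rule-based approach can be sketched as a toy tagger combining a lexicon, the morphological "-ing" rule, and the contextual "after the" rule. The lexicon and the NN default are illustrative simplifications:

```python
LEXICON = {"the": "DT", "a": "DT", "dog": "NN", "run": "VB"}  # tiny sample

def rule_based_tag(tokens):
    """Toy rule-based tagger: lexicon lookup first, then suffix and
    context rules, defaulting to NN (noun)."""
    tags = []
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            tag = LEXICON[tok]
        elif tok.endswith("ing"):
            tag = "VBG"            # morphological rule: -ing → gerund
        elif i > 0 and tokens[i - 1] == "the":
            tag = "NN"             # contextual rule: after "the" → noun
        else:
            tag = "NN"             # crude default
        tags.append((tok, tag))
    return tags

print(rule_based_tag(["the", "dog", "was", "running"]))
```

The default mis-tags "was" as NN, which is the kind of error that motivates the statistical and ML taggers below.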
B. Statistical PoS Tagging
Models:
HMM (Hidden Markov Model)
MEMM
C. Machine Learning PoS Tagging
Uses:
SVM
Neural Networks
BiLSTM
CRF
Transformers (BERT)
3.6 Extractor Component – Complete Workflow
Steps:
Ingest documents
Preprocessing
Linguistic Annotation
PoS
Lemma
NER
Candidate Generation
N-grams
NP-chunks
Dependency-based spans
Filtering
Scoring (PMI, TF-IDF, C-value)
Normalization
Export final list of concepts
3.7 Identification of Candidate Terms
Methods:
POS-patterns
Chunking
N-gram enumeration
Dependency parsing
Collocations (PMI)
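The PMI scoring used for collocations can be computed directly: PMI(x, y) = log2(p(x, y) / (p(x) · p(y))), where p(x, y) is the probability of the adjacent bigram. The token sample is invented:

```python
import math
from collections import Counter

def pmi(bigram, tokens):
    """Pointwise mutual information of an adjacent word pair.
    High PMI means the pair co-occurs more than chance → likely collocation."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    p_xy = bigrams[bigram] / (n - 1)
    p_x = unigrams[bigram[0]] / n
    p_y = unigrams[bigram[1]] / n
    return math.log2(p_xy / (p_x * p_y))

tokens = "dropped call again dropped call billing issue".split()
print(pmi(("dropped", "call"), tokens))  # positive → "dropped call" is a collocation
```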
3.8 Equivalence Classes
Definition
Grouping different forms of words representing the same concept.
Examples:
dropped call, call drops → “call_drop”
USA, America → “USA”
Techniques:
WordNet
TF-IDF cosine similarity
Word2Vec/BERT
Hybrid (Levenshtein + embeddings)
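A minimal sketch of string-similarity grouping, using the standard library's difflib ratio as a stand-in for Levenshtein distance (the threshold and terms are illustrative):

```python
from difflib import SequenceMatcher

def group_terms(terms, threshold=0.75):
    """Greedy equivalence-class builder: a term joins the first class
    whose representative is string-similar enough, else starts a new class."""
    classes = []
    for term in terms:
        for cls in classes:
            if SequenceMatcher(None, term, cls[0]).ratio() >= threshold:
                cls.append(term)
                break
        else:
            classes.append([term])
    return classes

print(group_terms(["call drop", "call drops", "dropped call", "billing"]))
```

Note that pure string similarity leaves "dropped call" in its own class despite meaning the same as "call drop"; catching such word-order and semantic variants is exactly why WordNet, embeddings, or hybrid methods are added on top.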
3.9 Forcing & Excluding
Forcing: Manually include certain terms
Excluding: Remove unwanted terms
Used for:
Domain-specific mining
Removing noise
3.10 Assigning Types
Types = Semantic labels like:
PERSON
LOCATION
ORGANIZATION
📘 UNIT 4 – MACHINE CATEGORIZATION TECHNIQUES
4.1 Introduction
Machine categorization involves assigning text to categories such as:
Complaint
Praise
Request
Product feature
4.2 Types of Categorization Techniques
1. Manual / Rule-Based Categorization
Uses:
Dictionaries
Rules (if-else)
Pattern matching
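Rule-based categorization boils down to dictionary lookups and pattern matching. The category dictionaries below are invented examples:

```python
CATEGORY_RULES = {
    "Complaint": ["slow", "broken", "drop", "terrible"],
    "Praise": ["great", "love", "excellent"],
    "Request": ["please", "could you", "add"],
}

def categorize(text):
    """Return every category whose keyword list matches the lowercased text.
    A document can receive multiple categories."""
    text = text.lower()
    return [cat for cat, keywords in CATEGORY_RULES.items()
            if any(kw in text for kw in keywords)]

print(categorize("The app is great but uploads are slow"))
```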
2. Statistical Categorization
Uses:
Naive Bayes
SVM
Logistic Regression
3. Linguistic Categorization
Uses:
Noun phrases
PoS-tag patterns
Equivalence classes
4. Hybrid Categorization
Combination of:
Linguistic features
Statistical models
Synonym dictionaries
4.3 Text Analysis Packages (TAP)
TAP contains:
Synonym libraries
Type dictionaries
Exclude lists
Categorization rules
4.4 Demonstrations (As in Syllabus)
Demo 1: Use TAP to Categorize Data
Steps:
Extract concepts
Attach categories
Create category nugget
Demo 2: Import Predefined Categories (Excel)
Used to map:
Product issues
Complaint categories
Demo 3: Automated Classification
IBM SPSS identifies categories automatically.
📘 UNIT 5 – MONITORING USING TEXT MINING MODELS
5.1 Objective
Build a hybrid churn prediction model using:
Linguistic analysis of customer feedback
Structured data
ML algorithms
5.2 Business Understanding
18% churn → $6M loss
Goal: predict churn probability
5.3 Data Description
5,000 structured customer records
4,500 text feedback files
500 forum posts
5.4 Full Text Mining Pipeline
Steps include:
File List Node equivalent (Python)
Web Feed reading
Data merging
Text preprocessing
PoS Tagging
Noun phrase extraction
Collocations via PMI
TAP-based normalization
Feature engineering
Model building
Scoring new data
5.5 Linguistic Feature Engineering
Features include:
Noun phrases
Bigrams
Synonym clusters
Equivalence classes
Sentiments
5.6 Machine Learning Models
Models used:
Logistic Regression
Random Forest
XGBoost
Neural networks
Evaluation metrics:
AUC ≥ 0.88
Recall ≥ 0.80
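The recall target can be made concrete: recall = TP / (TP + FN), i.e. the share of actual churners the model catches. The labels below are invented:

```python
def recall(y_true, y_pred):
    """Recall = TP / (TP + FN): fraction of actual positives (churners)
    that the model correctly flags."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

# 5 actual churners, the model catches 4 of them → recall 0.8,
# exactly at the syllabus target of recall >= 0.80
print(recall([1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 0, 0, 1, 0]))
```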
5.7 Scoring New Data
Steps:
Clean new text
Extract linguistic features
Combine with structured features
Use saved model for scoring



