Essential Guide to Text Analysis: Comprehensive Syllabus and Complete PDF Notes
- Computer Science

- Nov 30
- 5 min read

UNIT 1 - INTRODUCTION TO TEXT MINING
1.1 Introduction to Text Mining
Text is one of the richest sources of information created by humans. However, raw text is unstructured, meaning it does not follow a rigid data model like databases do. As a result, computers cannot directly analyze it unless it is cleaned, transformed, structured, and interpreted.
Text Mining is defined as: "The process of transforming unstructured text into structured data to discover patterns, insights, and knowledge."
It is also called Text Data Mining or Text Analytics, although academically:
Text Mining focuses on extracting patterns, concepts, and relationships.
Text Analytics focuses on analyzing the extracted information using statistics and ML.
Why Text Mining is Important
An estimated 80% of global data is unstructured text (emails, blogs, reports, chats).
Businesses use it for decision-making, customer experience monitoring, risk management, and automation.
1.2 Types of Data in Text Mining
1. Structured Data
Stored in tables (rows, columns).
Easy for machines to process.
Examples: database entries, spreadsheets.
2. Unstructured Data
Free-form, no predefined schema.
Examples: text messages, comments, PDFs, emails.
3. Semi-Structured Data
Contains tags or formatting but not fully tabular.
Examples: HTML, XML, JSON.
1.3 Key Terms in Text Analytics (with Definitions)
| Term | Meaning |
| --- | --- |
| Corpus | A collection of text documents. |
| Token | A word or sentence unit produced by splitting text. |
| NLP | The field that enables machines to understand and process human language. |
| Stemming | Reducing words to a root form (e.g., "playing" → "play"). |
| Lemmatization | Converting words to a meaningful base form (more accurate than stemming). |
| TF-IDF | Weights terms by their importance in a document relative to the corpus. |
| NER | Identifying names, places, and organizations in text. |
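To make TF-IDF concrete, here is a minimal sketch using scikit-learn (an assumed library choice; the three-document corpus is invented for illustration):

```python
# Minimal TF-IDF weighting sketch with scikit-learn (assumed library; toy corpus).
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the network keeps dropping calls",
    "great network coverage and fast data",
    "calls drop every evening on this network",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(corpus)          # document-term matrix

# Show the weight of each term in the first document.
terms = vectorizer.get_feature_names_out()
for term, weight in zip(terms, tfidf.toarray()[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```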
1.4 Text Mining Process (IBM SPSS - VERY IMPORTANT)
A typical text mining session consists of:
Step 1: Identify & Collect Text
Collect text from emails, feedback, reviews, etc.
Step 2: Prepare the Text
Preparation involves text preprocessing steps such as the following (a short Python sketch appears after this list):
Language detection
Tokenization
Stopword removal
Lemmatization
PoS tagging
Normalization
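A minimal sketch of these preprocessing steps, assuming NLTK as the toolkit (the sample sentence is invented; NLTK resource names can vary slightly between versions):

```python
# Preprocessing sketch with NLTK: tokenization, stopword removal, PoS tagging, lemmatization.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time resource downloads (names differ slightly in newer NLTK releases).
for resource in ["punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"]:
    nltk.download(resource, quiet=True)

text = "The customers were complaining about dropped calls."

tokens = nltk.word_tokenize(text.lower())               # tokenization
tokens = [t for t in tokens if t.isalpha()]             # normalization: keep alphabetic tokens
stops = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stops]          # stopword removal
tagged = nltk.pos_tag(tokens)                           # PoS tagging

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t, pos="v") for t in tokens]  # lemmatization (verb-biased here)

print(tagged)
print(lemmas)
```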
Step 3: Extract Concepts
IBM's Text Mining Node extracts:
Uniterms (single words)
Multiterms (multi-word concepts)
Synonyms
Patterns
Step 4: Build Concept or Category Models
Assign types, categories, sentiment groups.
Step 5: Use Extracted Information in ML Models
Combine structured and unstructured features to build predictive models.
1.5 Text Mining Techniques in Depth
Technique 1: Information Retrieval (IR)
Retrieving relevant documents based on queries. Examples: Google search, digital library search. Includes (see the ranking sketch after this list):
Tokenization
Stemming
Ranking documents
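A toy ranking sketch, assuming TF-IDF vectors and cosine similarity from scikit-learn (the documents and query are invented):

```python
# Rank documents against a query by TF-IDF cosine similarity (toy IR example).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "library opening hours and membership",
    "how to reset your router password",
    "digital library search tips for students",
]
query = "search the digital library"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)
query_vec = vectorizer.transform([query])

scores = cosine_similarity(query_vec, doc_matrix)[0]
for score, doc in sorted(zip(scores, documents), reverse=True):  # highest score first
    print(f"{score:.3f}  {doc}")
```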
Technique 2: Natural Language Processing (NLP)
Understanding and manipulating natural language using machines.
Key NLP Tasks:
PoS tagging
Chunking
Sentiment analysis
Summarization
Text classification
Technique 3: Information Extraction (IE)
Extracting structured information such as entities, relationships, and attributes.
Includes (see the NER sketch after this list):
Named Entity Recognition (NER)
Feature extraction
Attribute selection
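A quick NER sketch using spaCy (an assumed choice; it requires the en_core_web_sm model, and the example sentence is invented):

```python
# Named Entity Recognition sketch with spaCy.
# Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tim Cook announced Apple's new store in Mumbai on Monday.")

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. PERSON, ORG, GPE, DATE
```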
1.6 IBM SPSS Text Mining Nodes
1. File List Node
Reads multiple documents from a folder.
2. Web Feed Node
Reads RSS/HTML feeds.
3. Text Mining Node
Extracts concepts, types, meanings.
4. Text Link Analysis Node
Identifies relationships between concepts.
5. Translate Node
Translates non-English text to English.
1.7 Steps in a Typical Text Mining Session (Must Learn for Exams)
Select text source
Import data (File List/Web Feed)
Preprocess text
Extract candidate terms
Normalize & group synonyms
Assign categories
Build predictive models
Deploy & monitor

Text Analytics
UNIT 2 - READING TEXT DATA
2.1 Introduction
Text data is stored in multiple formats such as:
TXT
PDF
DOC/DOCX
HTML
RSS feeds
2.2 File List Node (Most Important Topic)
Definition
A node that generates a list of document names or paths for text mining. (A rough Python equivalent appears at the end of this section.)
Why Use It?
Real datasets contain thousands of files.
This node avoids manual loading.
How It Works
Choose folder
SPSS scans it
Generates file list
Passes file paths to Text Mining Node
Output Fields
filename
fullpath
extracted text
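A rough Python equivalent of the File List Node, assuming a local folder of plain-text files (the folder name customer_feedback is hypothetical):

```python
# Scan a folder and build filename / fullpath / text records, like the File List Node output.
from pathlib import Path

folder = Path("customer_feedback")          # hypothetical directory
records = []
for path in sorted(folder.glob("*.txt")):   # scan the folder for text documents
    records.append({
        "filename": path.name,
        "fullpath": str(path.resolve()),
        "text": path.read_text(encoding="utf-8", errors="ignore"),
    })

print(f"Loaded {len(records)} documents")
```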
2.3 File Viewer Node
Used to preview documents before mining.
2.4 Web Feed Node
Definition
A node that reads text from online feeds such as RSS or HTML pages.
Uses:
Mining blogs
Mining online news
Streaming live text data
2.5 Demonstration 1 ā Reading Multiple Files
Steps
Drag File List Node
Select directory
Select file types
Enable "Extract Text"
Connect to Text Mining node
Run stream
2.6 Python Equivalent
Your file includes Python code for the following (a comparable sketch appears after this list):
Reading TXT, PDF, DOCX
Tokenizing text
Extracting word frequencies
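The exact code is in the notes file; as a stand-in, a sketch along those lines might look like this (assuming the pypdf and python-docx packages; the file name is a placeholder):

```python
# Read TXT/PDF/DOCX files, tokenize, and count word frequencies (illustrative sketch).
import re
from collections import Counter
from pathlib import Path

from pypdf import PdfReader      # pip install pypdf
from docx import Document        # pip install python-docx

def read_any(path: str) -> str:
    """Return plain text from a TXT, PDF, or DOCX file."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    if suffix == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    return Path(path).read_text(encoding="utf-8", errors="ignore")

text = read_any("feedback_sample.txt")          # placeholder file name
tokens = re.findall(r"[a-z]+", text.lower())    # simple tokenization
print(Counter(tokens).most_common(10))          # top-10 word frequencies
```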
UNIT 3 - LINGUISTIC ANALYSIS & TEXT MINING
3.1 What is Linguistic Analysis?
Linguistic analysis is the heart of text mining. It involves studying text structure to derive meaning.
It Includes:
Sentence Detection
Tokenization
Lemmatization
Part-of-Speech Tagging
Dependency Parsing
Noun Phrase Extraction
3.2 Sentence Detection
Definition: Splitting text into sentence-level units.
3.3 Tokenization
Definition: Splitting text into words/tokens.
Problems encountered:
Spelling issues
Missing spaces
3.4 Lemmatization
Definition: Converting a word into its base form using grammar rules.
Examples:
"better" → "good"
"driving" → "drive"
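A small sketch with NLTK's WordNet lemmatizer (an assumed toolkit) showing why the part-of-speech hint matters:

```python
# Lemmatization with NLTK's WordNet lemmatizer; the PoS hint is passed explicitly.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("better", pos="a"))   # -> good   (adjective)
print(lemmatizer.lemmatize("driving", pos="v"))  # -> drive  (verb)
```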
3.5 Part-of-Speech (PoS) Tagging
Methods Explained in Your Files:
A. Rule-Based PoS Tagging
Uses:
Lexicons
Morphological rules (suffixes)
Contextual rules
Examples:
"-ing" suffix → verb (VBG)
After "the" → noun
B. Statistical PoS Tagging
Models:
HMM (Hidden Markov Model)
MEMM
C. Machine Learning PoS Tagging
Uses:
SVM
Neural Networks
BiLSTM
CRF
Transformers (BERT)
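For a quick hands-on example, NLTK's averaged-perceptron tagger (one of the statistical options listed above) can be used as follows; the sentence is invented and resource names vary slightly across NLTK versions:

```python
# PoS tagging sketch with NLTK's averaged-perceptron tagger.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The customer is playing the new game")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('customer', 'NN'), ('is', 'VBZ'), ('playing', 'VBG'), ...]
```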
3.6 Extractor Component ā Complete Workflow
Steps:
Ingest documents
Preprocessing
Linguistic Annotation
PoS
Lemma
NER
Candidate Generation
N-grams
NP-chunks
Dependency-based spans
Filtering
Scoring (PMI, TF-IDF, C-value)
Normalization
Export final list of concepts
3.7 Identification of Candidate Terms
Methods:
POS-patterns
Chunking
N-gram enumeration
Dependency parsing
Collocations (PMI)
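A small collocation sketch scored by PMI, using NLTK's collocation utilities (the word sequence is a toy example):

```python
# Find candidate multiterms as bigram collocations ranked by PMI.
import nltk
from nltk.collocations import BigramCollocationFinder

words = ("the network keeps dropping calls and dropping calls again "
         "because the network is congested").split()

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)                    # keep bigrams that occur at least twice
print(finder.nbest(bigram_measures.pmi, 5))    # top candidate multiterms by PMI
```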
3.8 Equivalence Classes
Definition
Grouping different forms of words representing the same concept.
Examples:
dropped call, call drops → "call_drop"
USA, America → "USA"
Techniques:
WordNet
TF-IDF cosine similarity
Word2Vec/BERT
Hybrid (Levenshtein + embeddings)
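A deliberately simple sketch of grouping variants into equivalence classes using string similarity only (difflib); a real system would add WordNet or embedding similarity as listed above. The concept names and variants here are invented:

```python
# Map surface variants to canonical concepts by string similarity (toy sketch).
from difflib import SequenceMatcher

canonical = ["call_drop", "billing_issue", "network_speed"]   # invented concept names
variants = ["dropped call", "call drops", "billing problem", "slow network speed"]

def closest(term, choices):
    # Pick the canonical concept whose (de-underscored) name is most similar to the term.
    return max(choices, key=lambda c: SequenceMatcher(None, term, c.replace("_", " ")).ratio())

for v in variants:
    print(v, "->", closest(v, canonical))
```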
3.9 Forcing & Excluding
Forcing: Manually include certain terms
Excluding: Remove unwanted terms
Used for:
Domain-specific mining
Removing noise
3.10 Assigning Types
Types = Semantic labels like:
PERSON
LOCATION
ORGANIZATION

Text Analytics Journey
UNIT 4 - MACHINE CATEGORIZATION TECHNIQUES
4.1 Introduction
Machine categorization involves assigning text to categories such as:
Complaint
Praise
Request
Product feature
4.2 Types of Categorization Techniques
1. Manual / Rule-Based Categorization
Uses:
Dictionaries
Rules (if-else)
Pattern matching
2. Statistical Categorization
Uses:
Naive Bayes
SVM
Logistic Regression
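A minimal statistical categorization sketch with TF-IDF features and Naive Bayes in scikit-learn (assumed library; the labelled examples are invented):

```python
# Statistical text categorization: TF-IDF features + Multinomial Naive Bayes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "my calls keep dropping, very poor service",   # Complaint
    "thank you for the quick resolution",          # Praise
    "please activate international roaming",       # Request
    "the bill amount is wrong again",              # Complaint
]
labels = ["Complaint", "Praise", "Request", "Complaint"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["network is terrible and the call dropped"]))
```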
3. Linguistic Categorization
Uses:
Noun phrases
PoS-tag patterns
Equivalence classes
4. Hybrid Categorization
Combination of:
Linguistic features
Statistical models
Synonym dictionaries
4.3 Text Analysis Packages (TAP)
TAP contains:
Synonym libraries
Type dictionaries
Exclude lists
Categorization rules
4.4 Demonstrations (As in Syllabus)
Demo 1: Use TAP to Categorize Data
Steps:
Extract concepts
Attach categories
Create category nugget
Demo 2: Import Predefined Categories (Excel)
Used to map:
Product issues
Complaint categories
Demo 3: Automated Classification
IBM SPSS identifies categories automatically.
UNIT 5 - MONITORING USING TEXT MINING MODELS
5.1 Objective
Build a hybrid churn prediction model using:
Linguistic analysis of customer feedback
Structured data
ML algorithms
5.2 Business Understanding
18% churn → $6M loss
Goal: predict churn probability
5.3 Data Description
5,000 structured customer records
4,500 text feedback files
500 forum posts
5.4 Full Text Mining Pipeline
Steps include:
File List Node equivalent (Python)
Web Feed reading
Data merging
Text preprocessing
PoS Tagging
Noun phrase extraction
Collocations via PMI
TAP-based normalization
Feature engineering
Model building
Scoring new data
5.5 Linguistic Feature Engineering
Features include:
Noun phrases
Bigrams
Synonym clusters
Equivalence classes
Sentiments
5.6 Machine Learning Models
Models used:
Logistic Regression
Random Forest
XGBoost
Neural networks
Evaluation metrics:
AUC ≥ 0.88
Recall ≥ 0.80
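As an illustration only, here is a hedged end-to-end sketch that trains a classifier on synthetic stand-in features and reports AUC and recall (real inputs would merge the structured fields with the linguistic features above; the thresholds listed are targets, not guaranteed results):

```python
# Train a churn-style classifier on synthetic features and report AUC and recall.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 12))   # stand-in for combined structured + text features
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)  # synthetic churn flag

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]
print("AUC   :", round(roc_auc_score(y_test, proba), 3))
print("Recall:", round(recall_score(y_test, clf.predict(X_test)), 3))
```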
5.7 Scoring New Data
Steps:
Clean new text
Extract linguistic features
Combine with structured features
Use saved model for scoring