# Module 3 – Reflection

Based on your Module topics, what did you find new and interesting? And what appeared to be a review?

Business Analytics


© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.

Descriptive Data Mining

Chapter 5


Introduction (Slide 1 of 2)

The increase in the use of data-mining techniques in business has been caused largely by three events:

The explosion in the amount of data being produced and electronically tracked.

The ability to electronically warehouse these data.

The affordability of computer power to analyze the data.


Introduction (Slide 2 of 2)

Observation: Set of recorded values of variables associated with a single entity.

Unsupervised learning: A descriptive data-mining technique used to identify relationships between observations.

Thought of as high-dimensional descriptive analytics.

There is no outcome variable to predict; instead, qualitative assessments are used to assess and compare the results.


Cluster Analysis

Measuring Similarity Between Observations

Hierarchical Clustering

k-Means Clustering

Hierarchical Clustering versus k-Means Clustering

Cluster Analysis (Slide 1 of 21)

Goal of clustering is to segment observations into similar groups based on observed variables.

Can be employed during the data-preparation step to identify variables or observations that can be aggregated or removed from consideration.

Commonly used in marketing to divide customers into different homogeneous groups; known as market segmentation.

Used to identify outliers.

Cluster Analysis (Slide 2 of 21)

Clustering methods:

Bottom-up hierarchical clustering starts with each observation belonging to its own cluster and then sequentially merges the most similar clusters to create a series of nested clusters.

k-means clustering assigns each observation to one of k clusters in a manner such that the observations assigned to the same cluster are as similar as possible.

Both methods depend on how similarity between two observations is defined; hence, we must first measure similarity between observations.

Cluster Analysis (Slide 3 of 21)

Measuring Similarity Between Observations:

When observations include numeric variables, Euclidean distance is the most common method to measure dissimilarity between observations.

Let observations u = (u1, u2, ..., uq) and v = (v1, v2, ..., vq) each comprise measurements of q variables.

The Euclidean distance between observations u and v is:

d(u, v) = sqrt((u1 − v1)² + (u2 − v2)² + ... + (uq − vq)²)

Cluster Analysis (Slide 4 of 21)

Measuring Similarity Between Observations:

Illustration:

KTC is a financial advising company that provides personalized financial advice to its clients.

KTC would like to segment its customers into several groups (or clusters) so that the customers within a group are similar to each other and dissimilar to customers in other groups with respect to key characteristics.

For each customer, KTC has an observation of seven variables: Age, Female, Income, Married, Children, Car Loan, Mortgage.

Example: The observation u = (61, 0, 57881, 1, 2, 0, 0) corresponds to a 61-year-old male with an annual income of $57,881, married with two children, but no car loan and no mortgage.
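The distance formula above can be sketched in a few lines of Python. The observation u comes from the text; the second customer v is a made-up value for illustration only:

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two observations measured on q variables."""
    return math.sqrt(sum((uj - vj) ** 2 for uj, vj in zip(u, v)))

# Observation from the text: 61-year-old male, $57,881 income, married,
# two children, no car loan, no mortgage.
u = (61, 0, 57881, 1, 2, 0, 0)
# A second, hypothetical customer for illustration.
v = (35, 1, 42000, 0, 1, 1, 0)

d = euclidean_distance(u, v)
```

Note that on these raw values the income variable dominates the distance, which previews the scaling issue discussed next.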


Cluster Analysis (Slide 5 of 21)

Figure 5.1: Euclidean Distance

Euclidean distance becomes smaller as a pair of observations become more similar with respect to their variable values.

Figure 5.1 depicts Euclidean distance for two observations consisting of two variable measurements.

Euclidean distance is highly influenced by the scale on which variables are measured.

Therefore, it is common to standardize the units of each variable j of each observation u; for example, uj, the value of variable j in observation u, is replaced with its z-score, zj.

The conversion to z-scores also makes it easier to identify outlier measurements, which can distort the Euclidean distance between observations.
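A minimal sketch of the z-score conversion, using only the standard library (this uses the population standard deviation; software packages may use the sample standard deviation instead):

```python
import statistics

def z_scores(column):
    """Replace each value of a variable with its z-score: (x - mean) / stdev."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)  # population standard deviation
    return [(x - mu) / sigma for x in column]

# Hypothetical ages for four customers; standardize before computing distances.
ages = [61, 35, 53, 28]
standardized_ages = z_scores(ages)
```

Each variable (column) is standardized separately, and Euclidean distance is then computed on the standardized values.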



Cluster Analysis (Slide 7 of 21)

When clustering observations solely on the basis of categorical variables encoded as 0–1, a better measure of similarity between two observations can be achieved by counting the number of variables with matching values.

The simplest overlap measure is called the matching coefficient and is computed as:

matching coefficient = (number of variables with matching values for observations u and v) / (total number of variables)

Cluster Analysis (Slide 8 of 21)

A weakness of the matching coefficient is that if two observations both have a 0 entry for a categorical variable, this is counted as a sign of similarity between the two observations.

To avoid misstating similarity due to the absence of a feature, a similarity measure called Jaccard’s coefficient does not count matching zero entries and is computed as:

Jaccard’s coefficient = (number of variables with matching nonzero values for observations u and v) / (total number of variables − number of variables with matching zero values)
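Both measures can be sketched directly from their definitions. This minimal example checks the similarity of observations 3 and 4 from Table 5.1:

```python
def matching_coefficient(u, v):
    """Fraction of variables on which u and v agree (0-0 matches count)."""
    return sum(a == b for a, b in zip(u, v)) / len(u)

def jaccard_coefficient(u, v):
    """Like the matching coefficient, but 0-0 matches are excluded from
    both the numerator and the denominator."""
    both_one = sum(a == 1 and b == 1 for a, b in zip(u, v))
    both_zero = sum(a == 0 and b == 0 for a, b in zip(u, v))
    return both_one / (len(u) - both_zero)

# Observations 3 and 4 from Table 5.1: (Female, Married, Loan, Mortgage)
obs3, obs4 = (1, 1, 1, 0), (1, 1, 0, 0)
print(matching_coefficient(obs3, obs4))  # 0.75, as in the matching matrix
print(jaccard_coefficient(obs3, obs4))   # 0.667 (= 2/3), as in the Jaccard matrix
```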

Cluster Analysis (Slide 9 of 21)

Table 5.1: Comparison of Similarity Matrixes for Observations with Binary Variables

| Observation | Female | Married | Loan | Mortgage |
|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 1 |
| 3 | 1 | 1 | 1 | 0 |
| 4 | 1 | 1 | 0 | 0 |
| 5 | 1 | 1 | 0 | 0 |


Cluster Analysis (Slide 10 of 21)

Table 5.1: Comparison of Similarity Matrixes for Observations with Binary Variables (cont.)

Similarity Matrix Based on Matching Coefficient:

| Observation | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 1 | | | | |
| 2 | 0 | 1 | | | |
| 3 | 0.5 | 0.5 | 1 | | |
| 4 | 0.75 | 0.25 | 0.75 | 1 | |
| 5 | 0.75 | 0.25 | 0.75 | 1 | 1 |


Cluster Analysis (Slide 11 of 21)

Table 5.1: Comparison of Similarity Matrixes for Observations with Binary Variables (cont.)

Similarity Matrix Based on Jaccard’s Coefficient:

| Observation | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 1 | | | | |
| 2 | 0 | 1 | | | |
| 3 | 0.333 | 0.5 | 1 | | |
| 4 | 0.5 | 0.25 | 0.667 | 1 | |
| 5 | 0.5 | 0.25 | 0.667 | 1 | 1 |


Cluster Analysis (Slide 12 of 21)

Hierarchical Clustering:

Determines the similarity of two clusters by considering the similarity between the observations composing either cluster.

Starts with each observation in its own cluster and then iteratively combines the two clusters that are the most similar into a single cluster.

Given a way to measure similarity between observations, there are several clustering method alternatives for comparing observations in two clusters to obtain a cluster similarity measure:

Single linkage.

Complete linkage.

Group average linkage.

Median linkage.

Centroid linkage.

Cluster Analysis (Slide 13 of 21)

Single linkage: The similarity between two clusters is defined by the similarity of the pair of observations (one from each cluster) that are the most similar.

Complete linkage: This clustering method defines the similarity between two clusters as the similarity of the pair of observations (one from each cluster) that are the most different.

Group average linkage: Defines the similarity between two clusters to be the average similarity computed over all pairs of observations between the two clusters.

Median linkage: Analogous to group average linkage except that it uses the median of the similarities computed between all pairs of observations between the two clusters.

Centroid linkage uses the averaging concept of cluster centroids to define between-cluster similarity.

Single linkage will consider two clusters to be close if an observation in one of the clusters is close to at least one observation in the other cluster.

Complete linkage will consider two clusters to be close if their most-different pair of observations are close. This method produces clusters such that all member observations of a cluster are relatively close to each other.


Cluster Analysis (Slide 14 of 21)

Figure 5.2: Measuring Similarity Between Clusters

Cluster Analysis (Slide 15 of 21)

Ward’s method merges two clusters such that the dissimilarity of the observations with the resulting single cluster increases as little as possible.

When McQuitty’s method considers merging two clusters A and B, the dissimilarity of the resulting cluster AB to any other cluster C is calculated as the average of the two cluster-to-C dissimilarities: ((dissimilarity between A and C) + (dissimilarity between B and C)) / 2.

A dendrogram is a chart that depicts the set of nested clusters resulting at each step of aggregation.
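The bottom-up merging process and the linkage choices above can be sketched in plain Python. This is a minimal illustration (not the textbook's software) of agglomerative clustering with single, complete, or group average linkage; the sample points are made up:

```python
def agglomerative(points, k, linkage="average"):
    """Bottom-up clustering: start with each observation in its own cluster,
    then repeatedly merge the pair of clusters that are most similar
    (smallest between-cluster dissimilarity) until k clusters remain."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pair = [dist(p, q) for p in clusters[i] for q in clusters[j]]
                if linkage == "single":
                    d = min(pair)              # most-similar pair of observations
                elif linkage == "complete":
                    d = max(pair)              # most-different pair
                else:
                    d = sum(pair) / len(pair)  # group average
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)         # merge the two closest clusters
    return clusters

# Two well-separated pairs collapse into two clusters:
clusters = agglomerative([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
print(clusters)  # [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
```

Recording the merge order (and the distance at each merge) is exactly the information a dendrogram displays.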

Cluster Analysis (Slide 16 of 21)

Figure 5.3: Dendrogram for KTC Using Matching Coefficients and Group Average Linkage

Cluster Analysis (Slide 17 of 21)

k-Means Clustering:

Given a value of k, the k-means algorithm randomly assigns each observation to one of the k clusters.

After all observations have been assigned to a cluster, the resulting cluster centroids are calculated.

Using the updated cluster centroids, all observations are reassigned to the cluster with the closest centroid.

The algorithm repeats this process (calculate cluster centroid, assign observation to cluster with nearest centroid) until there is no change in the clusters or a specified maximum number of iterations is reached.

One rule of thumb is that the ratio of between-cluster distance to within-cluster distance should exceed 1.0 for useful clusters.
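The assign-then-recompute loop just described can be sketched in pure Python. This is a minimal illustration under stated assumptions (random initial assignment, squared Euclidean distance, made-up sample points), not a production implementation:

```python
import random

def k_means(points, k, max_iter=100, seed=1):
    """Assign to nearest centroid, recompute centroids, repeat until
    assignments stop changing (or max_iter is reached)."""
    rng = random.Random(seed)
    # Step 1: randomly assign each observation to one of the k clusters.
    labels = [rng.randrange(k) for _ in points]
    for _ in range(max_iter):
        # Compute each cluster's centroid (the mean of its members).
        centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if not members:                      # re-seed an empty cluster
                members = [rng.choice(points)]
            centroids.append(tuple(sum(x) / len(members)
                                   for x in zip(*members)))
        # Reassign each observation to the cluster with the closest centroid.
        new = [min(range(k),
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(p, centroids[c])))
               for p in points]
        if new == labels:                        # no change: converged
            break
        labels = new
    return labels, centroids

# Hypothetical 2-D observations; the two nearby pairs end up together.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centroids = k_means(points, k=2)
```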


Cluster Analysis (Slide 18 of 21)

Figure 5.4: Clustering Observations by Age and Income Using k-Means Clustering with k = 3

To illustrate k-means clustering, we consider a 3-means clustering of a small sample of KTC’s customer data in the file DemoKTC.

Figure 5.4 shows three clusters based on customer income and age.

Cluster 1 is characterized by relatively younger, lower-income customers (Cluster 1’s centroid is at [33, $20,364]).

Cluster 2 is characterized by relatively older, higher-income customers (Cluster 2’s centroid is at [58, $47,729]).

Cluster 3 is characterized by relatively older, lower-income customers (Cluster 3’s centroid is at [53, $21,416]).


Cluster Analysis (Slide 19 of 21)

Table 5.2: Average Distances Within Clusters

| | No. of Observations | Average Distance Between Observations in Cluster |
|---|---|---|
| Cluster 1 | 12 | 0.622 |
| Cluster 2 | 8 | 0.739 |
| Cluster 3 | 10 | 0.520 |

Table 5.2 shows that Cluster 2 is the smallest, most heterogeneous cluster, whereas Cluster 1 is the largest cluster, and Cluster 3 is the most homogeneous cluster.

In Table 5.3, we compare the average distances between clusters to the average distance within clusters in Table 5.2.

Cluster 1 and Cluster 2 are the most distinct from each other.

Cluster 2 and Cluster 3 are the least distinct from each other.

Comparing the distance between the Cluster 2 and Cluster 3 centroids (1.964) to the average distance between observations within Cluster 2 (0.739) suggests that some observations in Cluster 2 are more similar to observations in Cluster 3 than to other observations in Cluster 2.


Cluster Analysis (Slide 20 of 21)

Table 5.3: Distances Between Cluster Centroids

| | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| Cluster 1 | 0 | 2.784 | 1.529 |
| Cluster 2 | 2.784 | 0 | 1.964 |
| Cluster 3 | 1.529 | 1.964 | 0 |


Cluster Analysis (Slide 21 of 21)

Hierarchical Clustering versus k-Means Clustering

| Hierarchical Clustering | k-Means Clustering |
|---|---|
| Suitable when we have a small data set (e.g., fewer than 500 observations) and want to easily examine solutions with increasing numbers of clusters. | Suitable when you know how many clusters you want and you have a larger data set (e.g., more than 500 observations). |
| Convenient method if you want to observe how clusters are nested. | Partitions the observations, which is appropriate if trying to summarize the data with k “average” observations that describe the data with the minimum amount of error. Because Euclidean distance is the standard metric for k-means clustering, it is generally not as appropriate for binary or ordinal data for which an “average” is not meaningful. |


Association Rules

Evaluating Association Rules

Association Rules (Slide 1 of 7)

Association rules: If-then statements which convey the likelihood of certain items being purchased together.

Although association rules are an important tool in market basket analysis, they are also applicable to other disciplines.

Antecedent: The collection of items (or item set) corresponding to the if portion of the rule.

Consequent: The item set corresponding to the then portion of the rule.

Support count of an item set: Number of transactions in the data that include that item set.


Association Rules (Slide 2 of 7)

Table 5.4: Shopping-Cart Transactions

| Transaction | Shopping Cart |
|---|---|
| 1 | bread, peanut butter, milk, fruit, jelly |
| 2 | bread, jelly, soda, potato chips, milk, fruit, vegetables, peanut butter |
| 3 | whipped cream, fruit, chocolate sauce, beer |
| 4 | steak, jelly, soda, potato chips, bread, fruit |
| 5 | jelly, soda, peanut butter, milk, fruit |
| 6 | jelly, soda, potato chips, milk, bread, fruit |
| 7 | fruit, soda, potato chips, milk |
| 8 | fruit, soda, peanut butter, milk |
| 9 | fruit, cheese, yogurt |
| 10 | yogurt, vegetables, beer |

Hy-Vee grocery store would like to gain insight into its customers’ purchase patterns to possibly improve its in-aisle product placement and cross-product promotions.

Table 5.4 contains a small sample of data where each transaction comprises the items purchased by a shopper in a single visit to a Hy-Vee.

An example of an association rule from this data would be “if {bread, jelly}, then {peanut butter}” meaning that “if a transaction includes bread and jelly, then it also includes peanut butter.”

Antecedent – {bread, jelly},

Consequent – {peanut butter}

The potential impact of an association rule is often governed by the number of transactions it may affect, which is measured by computing the support count of the item set consisting of the union of its antecedent and consequent.

Investigating the rule “if {bread, jelly}, then {peanut butter}” from Table 5.4, we see the support count of {bread, jelly, peanut butter} is 2.


Association Rules (Slide 3 of 7)

Confidence: Helps identify reliable association rules. For the rule “if A, then C,” it is computed as:

confidence = (support count of A ∪ C) / (support count of A)

Lift ratio: Measure to evaluate the efficiency of a rule:

lift ratio = confidence / (support count of C / total number of transactions)

For the data in Table 5.4, the rule “if {bread, jelly}, then {peanut butter}” has confidence = 2/4 = 0.5.

This measure of confidence can be viewed as the conditional probability that the consequent item set occurs given that the antecedent item set occurs.

A high value of confidence suggests a rule in which the consequent is frequently true when the antecedent is true, but a high value of confidence can be misleading.

For example, if the support of the consequent is high—that is, the item set corresponding to the then part is very frequent—then the confidence of the association rule could be high even if there is little or no association between the items.

A lift ratio greater than 1 suggests that there is some usefulness to the rule and that it is better at identifying cases when the consequent occurs than no rule at all.

For the data in Table 5.4, the rule “if {bread, jelly}, then {peanut butter}” has confidence = 2/4 = 0.5 and a lift ratio = 0.5/(4/10) = 1.25.

In other words, identifying a customer who purchased both bread and jelly as one who also purchased peanut butter is 25% better than just guessing that a random customer purchased peanut butter.
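The support, confidence, and lift calculations above can be verified with a short sketch over the Table 5.4 transactions:

```python
transactions = [  # the ten shopping carts from Table 5.4
    {"bread", "peanut butter", "milk", "fruit", "jelly"},
    {"bread", "jelly", "soda", "potato chips", "milk", "fruit",
     "vegetables", "peanut butter"},
    {"whipped cream", "fruit", "chocolate sauce", "beer"},
    {"steak", "jelly", "soda", "potato chips", "bread", "fruit"},
    {"jelly", "soda", "peanut butter", "milk", "fruit"},
    {"jelly", "soda", "potato chips", "milk", "bread", "fruit"},
    {"fruit", "soda", "potato chips", "milk"},
    {"fruit", "soda", "peanut butter", "milk"},
    {"fruit", "cheese", "yogurt"},
    {"yogurt", "vegetables", "beer"},
]

def support(item_set):
    """Support count: number of transactions containing the item set."""
    return sum(item_set <= t for t in transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    # Benchmark confidence: fraction of all transactions with the consequent.
    return confidence(antecedent, consequent) / (
        support(consequent) / len(transactions))

rule_a, rule_c = {"bread", "jelly"}, {"peanut butter"}
print(confidence(rule_a, rule_c))  # 0.5
print(lift(rule_a, rule_c))        # 1.25
```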


Association Rules (Slide 4 of 7)

Table 5.5: Association Rules for Hy-Vee

| Antecedent (A) | Consequent (C) | Support for A | Support for C | Support for A & C | Confidence (%) | Lift Ratio |
|---|---|---|---|---|---|---|
| Bread | Fruit, Jelly | 4 | 5 | 4 | 100.0 | 2.00 |
| Bread | Jelly | 4 | 5 | 4 | 100.0 | 2.00 |
| Bread, Fruit | Jelly | 4 | 5 | 4 | 100.0 | 2.00 |
| Fruit, Jelly | Bread | 5 | 4 | 4 | 80.0 | 2.00 |
| Jelly | Bread | 5 | 4 | 4 | 80.0 | 2.00 |
| Jelly | Bread, Fruit | 5 | 4 | 4 | 80.0 | 2.00 |
| Fruit, Potato Chips | Soda | 4 | 6 | 4 | 100.0 | 1.67 |
| Peanut Butter | Milk | 4 | 6 | 4 | 100.0 | 1.67 |
| Peanut Butter | Milk, Fruit | 4 | 6 | 4 | 100.0 | 1.67 |

Association Rules (Slide 5 of 7)

Table 5.5: Association Rules for Hy-Vee (cont.)

| Antecedent (A) | Consequent (C) | Support for A | Support for C | Support for A & C | Confidence (%) | Lift Ratio |
|---|---|---|---|---|---|---|
| Peanut Butter, Fruit | Milk | 4 | 6 | 4 | 100.0 | 1.67 |
| Potato Chips | Fruit, Soda | 4 | 6 | 4 | 100.0 | 1.67 |
| Potato Chips | Soda | 4 | 6 | 4 | 100.0 | 1.67 |
| Fruit, Soda | Potato Chips | 6 | 4 | 4 | 66.7 | 1.67 |
| Milk | Peanut Butter | 6 | 4 | 4 | 66.7 | 1.67 |
| Milk | Peanut Butter, Fruit | 6 | 4 | 4 | 66.7 | 1.67 |
| Milk, Fruit | Peanut Butter | 6 | 4 | 4 | 66.7 | 1.67 |
| Soda | Fruit, Potato Chips | 6 | 4 | 4 | 66.7 | 1.67 |

Association Rules (Slide 6 of 7)

Table 5.5: Association Rules for Hy-Vee (cont.)

| Antecedent (A) | Consequent (C) | Support for A | Support for C | Support for A & C | Confidence (%) | Lift Ratio |
|---|---|---|---|---|---|---|
| Soda | Potato Chips | 6 | 4 | 4 | 66.7 | 1.67 |
| Fruit, Soda | Milk | 6 | 6 | 5 | 83.3 | 1.39 |
| Milk | Fruit, Soda | 6 | 6 | 5 | 83.3 | 1.39 |
| Milk | Soda | 6 | 6 | 5 | 83.3 | 1.39 |
| Milk, Fruit | Soda | 6 | 6 | 5 | 83.3 | 1.39 |
| Soda | Milk | 6 | 6 | 5 | 83.3 | 1.39 |
| Soda | Milk, Fruit | 6 | 6 | 5 | 83.3 | 1.39 |

Association Rules (Slide 7 of 7)

Evaluating Association Rules:

An association rule is ultimately judged on how actionable it is and how well it explains the relationship between item sets.

For example, Walmart mined its transactional data to uncover strong evidence of the association rule, “If a customer purchases a Barbie doll, then a customer also purchases a candy bar.”

An association rule is useful if it is well supported and explains an important previously unknown relationship.

The support of an association rule can generally be improved by basing it on less specific antecedent and consequent item sets.


Text Mining

Voice of the Customer at Triad Airline

Preprocessing Text Data for Analysis

Movie Reviews

Text Mining (1 of 12)

Text, like numerical data, may contain information that can help solve problems and lead to better decisions.

Text mining is the process of extracting useful information from text data.

Text data is often referred to as unstructured data because in its raw form, it cannot be stored in a traditional structured database (rows and columns).

Audio and video data are also examples of unstructured data.

Data mining with text data is more challenging than data mining with traditional numerical data, because it requires more preprocessing to convert the text to a format amenable for analysis.


Text Mining (2 of 12)

Voice of the Customer at Triad Airline:

Triad solicits feedback from its customers through a follow-up e-mail the day after the customer has completed a flight.

Survey asks the customer to rate various aspects of the flight and asks the respondent to type comments into a dialog box in the e-mail; includes:

Quantitative feedback from the ratings.

Comments entered by the respondents which need to be analyzed.

A collection of text documents to be analyzed is called a corpus.

Text Mining (3 of 12)

Table 5.6: Ten Respondents’ Concerns for Triad Airlines

Concerns

The wi-fi service was horrible. It was slow and cut off several times.

My seat was uncomfortable.

My flight was delayed 2 hours for no apparent reason.

My seat would not recline.

The man at the ticket counter was rude. Service was horrible.

The flight attendant was rude. Service was bad.

My flight was delayed with no explanation.

My drink spilled when the guy in front of me reclined his seat.

My flight was canceled.

The arm rest of my seat was nasty.

Text Mining (4 of 12)

Voice of the Customer at Triad Airline:

To be analyzed, text data needs to be converted to structured data (rows and columns of numerical data) so that the tools of descriptive statistics, data visualization and data mining can be applied.

Think of converting a group of documents into a matrix where each row corresponds to a document and each column corresponds to a particular word.

A presence/absence or binary term-document matrix is a matrix with the rows representing documents and the columns representing words.

Entries in the columns indicate either the presence or the absence of a particular word in a particular document.

Text Mining (5 of 12)

Voice of the Customer at Triad Airline (cont.):

Creating the list of terms to use in the presence/absence matrix can be a complicated matter:

Too many terms results in a matrix with many columns, which may be difficult to manage and could yield meaningless results.

Too few terms may miss important relationships.

Term frequency along with the problem context are often used as a guide.

In Triad’s case, management used word frequency and the context of having a goal of satisfied customers to come up with the following list of terms they feel are relevant for categorizing the respondent’s comments: delayed, flight, horrible, recline, rude, seat, and service.

Text Mining (6 of 12)

Table 5.7: The Presence/Absence Term-Document Matrix for Triad Airlines

| Document | Delayed | Flight | Horrible | Recline | Rude | Seat | Service |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 5 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 6 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 7 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 9 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 10 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
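As a sketch, the matrix in Table 5.7 can be reproduced from the Table 5.6 comments with a few lines of Python. The prefix match below (so that "reclined" counts for "recline") is a crude, assumed stand-in for the stemming discussed later, not the textbook's procedure:

```python
import string

terms = ["delayed", "flight", "horrible", "recline", "rude", "seat", "service"]

comments = [  # the ten respondents' concerns from Table 5.6
    "The wi-fi service was horrible. It was slow and cut off several times.",
    "My seat was uncomfortable.",
    "My flight was delayed 2 hours for no apparent reason.",
    "My seat would not recline.",
    "The man at the ticket counter was rude. Service was horrible.",
    "The flight attendant was rude. Service was bad.",
    "My flight was delayed with no explanation.",
    "My drink spilled when the guy in front of me reclined his seat.",
    "My flight was canceled.",
    "The arm rest of my seat was nasty.",
]

def tokenize(text):
    """Lowercase, strip punctuation, split into words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

# Entry (i, j) is 1 if any token of document i starts with term j.
matrix = [[int(any(w.startswith(term) for w in tokenize(doc)))
           for term in terms]
          for doc in comments]
```

Each row of `matrix` matches the corresponding row of the presence/absence term-document matrix above.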

Text Mining (7 of 12)

Preprocessing Text Data for Analysis:

The text-mining process converts unstructured text into numerical data and applies quantitative techniques.

Which terms become the headers of the columns of the term-document matrix can greatly impact the analysis.

Tokenization is the process of dividing text into separate terms, referred to as tokens:

Symbols and punctuation must be removed from the document, and all letters should be converted to lowercase.

Different forms of the same word, such as “stacking,” “stacked,” and “stack” probably should not be considered as distinct terms.

Stemming is the process of converting a word to its stem or root word.
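A deliberately naive suffix-stripping sketch illustrates the idea; real text-mining tools use much more careful stemmers (e.g., the Porter stemmer), and the suffix list here is an assumption for illustration:

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Crude stemming: strip the first matching suffix, but only when
    enough of the word remains to be a plausible root."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["stacking", "stacked", "stack"]])
# all three reduce to "stack", so they count as one term
```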

Text Mining (8 of 12)

Preprocessing Text Data for Analysis (cont.):

The goal of preprocessing is to generate a list of most-relevant terms that is sufficiently small so as to lend itself to analysis:

Frequency can be used to eliminate words from consideration as tokens.

Low-frequency words probably will not be very useful as tokens.

Consolidating words that are synonyms can reduce the set of tokens.

Most text-mining software gives the user the ability to manually specify terms to include or exclude as tokens.

The use of slang, humor, and sarcasm can cause interpretation problems and might require more sophisticated data cleansing and subjective intervention on the part of the analyst to avoid misinterpretation.

Data preprocessing parses the original text data down to the set of tokens deemed relevant for the topic being studied.
