CL Research Implementation of Minnesota Contextual Content Analysis

Content analysis (and specifically Minnesota Contextual Content Analysis, or MCCA) characterizes a text by the relative frequency with which words in each category are used, compared to norms determined from general usage statistics for the English language. The analysis generates several statistics, which can be sent to the screen or to a file for direct use or for further statistical analysis. For more details on MCCA and its use, see Content Analysis Papers at CL Research.

All words in one or more texts are divided into 116 idea categories (plus a "not classified" category). The MCCA dictionary groups word meanings into categories thought to express (singly or in combinations of categories) ideas important to an investigator. Two kinds of normed scores (emphasis or E-scores and context or C-scores) are generated for each analyzed text. If a word has two or more categories associated with it, MCCA applies a sense disambiguation routine to choose just one of these categories.

MCCALite for Windows is a light version of MCCA. It omits the addition of reference groups, the more sophisticated statistical analyses, and the ability to modify the MCCA dictionary. This version now includes the ability to save statistics to Word, Microsoft Excel, and CSV files. It is suitable for analyses and comparisons of sets of texts (from sentences to books) and multi-person transcripts, including plays, focus groups, interviews, hearings, and TV scripts. The download is an installation executable. After running it to install MCCALite on the Windows Start Menu, you can perform an immediate analysis of Hamlet or the text on which McTavish & Pirro was based. Click on Content Analysis for more details. MCCALite is free for non-commercial use.

Basic Statistics

Word Accounting: word statistics over all text groups in the file and for each individual text group. These include: (1) the total number of words; (2) the percentage of unique words in the texts; (3) the total number of words for which a category was available; (4) the percentage of tokens in the text that were categorized; (5) the percentage of unique words that were categorized; and (6) statistics on the length of the words in the texts, specifically the average length, the standard deviation of the lengths, and the shortest and longest lengths. (See screen shot).
Lookup: a list of all unique tokens in each of the text groups in the input text, as well as a concordance of their uses. (See list screen shot). (See concordance screen shot).
Words in Category: tokens in a specified text group that have been used at least a specified number of times, grouped by category. The list gives the name of each category, the tokens in that category meeting the cutoff restriction, and, for each token, its use percentage (relative to the total number of tokens in the text group) and its frequency. (See screen shot).
Word List: tokens in a specified text group that have been used at least a specified number of times, in decreasing frequency order. Each entry gives the token's rank in the list, the token itself, its use percentage (relative to the total number of tokens in the text group), its frequency, and its category number and name. (See screen shot).
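The Word Accounting and Word List statistics described above can be sketched in a few lines of Python. This is only an illustration, not MCCALite's own code: the tokenizer, the `category_of` dictionary (a stand-in for the MCCA dictionary), the function names, and the choice of a population standard deviation are all assumptions made for the example.

```python
import re
import statistics
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def word_accounting(text, category_of):
    """Word Accounting figures for one text group.

    category_of is a hypothetical dict mapping a word to a category number.
    """
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text group contains no words")
    unique = set(tokens)
    categorized = [t for t in tokens if t in category_of]
    unique_categorized = {t for t in unique if t in category_of}
    lengths = [len(t) for t in tokens]
    return {
        "total_words": len(tokens),
        "pct_unique": 100.0 * len(unique) / len(tokens),
        "categorized_words": len(categorized),
        "pct_tokens_categorized": 100.0 * len(categorized) / len(tokens),
        "pct_unique_categorized": 100.0 * len(unique_categorized) / len(unique),
        "mean_length": statistics.mean(lengths),
        # Population standard deviation is an assumption of this sketch.
        "stdev_length": statistics.pstdev(lengths),
        "shortest": min(lengths),
        "longest": max(lengths),
    }


def word_list(text, category_of, min_count=1):
    """Rows of (rank, token, use percentage, frequency, category number)."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    # Keep tokens meeting the cutoff, most frequent first (ties alphabetical).
    kept = sorted(
        ((t, f) for t, f in counts.items() if f >= min_count),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return [
        (rank, token, 100.0 * freq / len(tokens), freq, category_of.get(token))
        for rank, (token, freq) in enumerate(kept, start=1)
    ]
```

Calling `word_list(text, category_of, min_count=2)` keeps only tokens used at least twice, which corresponds to the "specified number of times" cutoff in the Word List description.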

E-Score

High Categories: emphasis scores (E-scores) for those categories for which either an E-score for one of the text groups has an absolute value greater than 5.0 or the difference between the E-scores for two of the text groups is greater than 5.0. The 116 MCCA categories are grouped into 23 supercategories. The results include the percent of words in all the texts in the supercategories and the important categories, and summary statistics on the categories, including the mean and standard deviation of the E-scores for categories meeting the cutoff criterion. (See screen shot).
Selected Plots: plots of emphasis scores (E-scores) for those categories for which either an E-score for one of the text groups has an absolute value greater than 5.0 or the difference between the E-scores for any two of the text groups is greater than 5.0. These plots consist of arrows from a zero-point to an approximate plus or minus point corresponding to the E-score for the category, allowing the E-scores from different texts to be compared. (See screen shot).
Difference Analysis: the difference in E-scores between any one of the text groups and all the others. The output has three parts: (1) summary word accounting statistics for each text group, showing the total number of tokens and unique tokens, along with their percentages of all words in the text, broken down into tokens that were classified and tokens that were not classified (leftover), with percentages of the total number of tokens and unique tokens; (2) the E-score mean and standard deviation over these categories, together with the low E-score, the high E-score, and the E-score range; and (3) the E-score differences between the specified text and the others for the selected categories (those with scores or differences of 5.0 or more). (See screen shot).
Diagnostic Plots: 43 emphasis score (E-score) combinations that may be usable for analyzing a text group, plotted for easy comparisons. (See screen shot).
Distance Matrix: a "probability" distance between each pair of text groups in the input file over all 116 E-score categories plus the leftover category. Texts that are "more" similar to one another have lower "distances" between them. (See screen shot).
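The selection rule used by High Categories and Selected Plots (an absolute E-score above 5.0 in some text group, or a gap above 5.0 between two text groups) can be illustrated with a short Python sketch. It assumes the E-scores have already been computed and are held in a hypothetical nested dictionary; it does not show how MCCA derives the E-scores themselves, and the strict "greater than" comparison follows the High Categories wording.

```python
def high_categories(escores, cutoff=5.0):
    """Return the category numbers that meet the cutoff criterion.

    escores is a hypothetical layout: {text_group: {category: e_score}}.
    A category missing from a text group is treated as 0.0 (an assumption).
    """
    categories = set()
    for scores in escores.values():
        categories.update(scores)

    selected = []
    for cat in sorted(categories):
        values = [scores.get(cat, 0.0) for scores in escores.values()]
        # Criterion 1: some text group has |E-score| above the cutoff.
        big_score = any(abs(v) > cutoff for v in values)
        # Criterion 2: some pair of text groups differs by more than the
        # cutoff, which is the same as the overall spread exceeding it.
        big_gap = (max(values) - min(values)) > cutoff
        if big_score or big_gap:
            selected.append(cat)
    return selected


# Made-up E-scores for two text groups, purely for illustration.
example = {
    "A": {1: 6.0, 2: 1.0, 3: 3.0},
    "B": {1: 0.5, 2: 2.0, 3: -3.0},
}
print(high_categories(example))
```

In the made-up example, category 1 qualifies on its absolute score in group A, category 3 qualifies on the gap between the two groups, and category 2 meets neither criterion.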

C-Score

Scores (Weighted): the weighted context scores (C-Scores) for each text, scaled so that their absolute values sum to 50, to facilitate comparing C-Score profiles (Traditional, Practical, Emotional, and Analytic) across texts. (See screen shot).
Plots (Weighted): plots of the scaled context scores (C-scores) for the text groups for each context type (Traditional, Practical, Emotional, and Analytic), permitting comparison of the context strength across text groups. (See screen shot).
Scores (Raw): the unnormalized context scores for each text for each context type (Traditional, Practical, Emotional, and Analytic); these scores sum to zero but are not scaled. (See screen shot).
Plots (Raw): plots of the raw context scores (C-scores) for the text groups for each context type (Traditional, Practical, Emotional, and Analytic). (See screen shot).
Distance Matrix: a Euclidean distance between the context scores of each pair of text groups in the input file. Texts that are "more" similar to one another have lower "distances" between them. (See screen shot).
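Two of the C-score operations described above are simple enough to sketch in Python: scaling a set of raw scores so that their absolute values sum to 50, and taking the Euclidean distance between the scores of each pair of text groups. The function names and the dictionary layout are assumptions for illustration. The source does not say whether the distance matrix uses raw or weighted scores, so the distance function accepts whichever is passed in.

```python
import math
from itertools import combinations

CONTEXTS = ("Traditional", "Practical", "Emotional", "Analytic")


def scale_cscores(raw, target=50.0):
    """Scale raw C-scores so their absolute values sum to `target`."""
    total = sum(abs(raw[c]) for c in CONTEXTS)
    if total == 0:
        # All-zero scores cannot be scaled; return them unchanged.
        return {c: 0.0 for c in CONTEXTS}
    return {c: raw[c] * target / total for c in CONTEXTS}


def euclidean(a, b):
    """Euclidean distance between two C-score profiles."""
    return math.sqrt(sum((a[c] - b[c]) ** 2 for c in CONTEXTS))


def distance_matrix(groups):
    """Pairwise distances; `groups` maps a text group name to its C-scores."""
    return {
        (g1, g2): euclidean(groups[g1], groups[g2])
        for g1, g2 in combinations(sorted(groups), 2)
    }


# Made-up raw scores that sum to zero, purely for illustration.
raw = {"Traditional": 2.0, "Practical": -1.0, "Emotional": 3.0, "Analytic": -4.0}
weighted = scale_cscores(raw)
```

Scaling preserves the sign and relative size of each context score, which is why weighted profiles from texts of different lengths can be compared directly.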

Other

Co-Occurrence: an analysis of the co-occurrence of categories in a text group. For each category, this analysis shows the four categories that most frequently follow it. For example, the phrase "the house" may have the category sequence 116 (for "the") and 49 (for "house"); the analysis then reports the four most frequent categories that follow category 116 words.
SPSS Output: a set of summary statistics characterizing each of the text groups in a format suitable for analysis using any statistical package. In particular, the output follows the formatting specifications of an accompanying file, SPSS.CTL, which is an SPSS data definition file that includes variable names and specifications, variable labels, and variable list labels. The data consist of the following elements: total number of tokens; total number of categorized tokens; number of unique tokens; number of unique categorized tokens; 4 context scores (C-Scores); emphasis score (E-Score) mean, standard deviation, low, high, and range; and 116 emphasis scores (E-Scores).
KYST Output: files that can be used directly as input to the included KYST program, a multidimensional scaling routine for analyzing the C-score and E-score distance matrices, useful for examining the differences between a set of texts.
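The Co-Occurrence analysis described above amounts to counting, for each category, which categories immediately follow it and keeping the four most frequent. A minimal Python sketch follows, assuming the text group has already been reduced to a sequence of category numbers (one per word, after disambiguation); the function name and input layout are hypothetical.

```python
from collections import Counter, defaultdict


def following_categories(category_sequence, top_n=4):
    """For each category, list the top_n categories that most often follow it.

    Returns {category: [(following_category, count), ...]}.
    """
    followers = defaultdict(Counter)
    # Pair each category with the one immediately after it.
    for current, nxt in zip(category_sequence, category_sequence[1:]):
        followers[current][nxt] += 1
    return {cat: counts.most_common(top_n) for cat, counts in followers.items()}


# Made-up sequence: 116 ("the") followed by 49 ("house") twice, then by 7.
sequence = [116, 49, 116, 49, 116, 7]
print(following_categories(sequence)[116])
```

For the made-up sequence, category 49 follows category 116 twice and category 7 follows it once, so those are the entries reported for category 116.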