# Glossary of statistical definitions - making sense of popular statistical terms

*At Iridium Insights we are all about making sense of data. Because many technical statistics terms can be confusing, we’ve created this quick look-up glossary of statistical definitions.*

*We’ve also added some links to articles that show how these statistical methods are used in the real world.*

#### Quick look-up list

Bayesian modelling

CHAID analysis

Cluster analysis

Correlation analysis

Decision tree

Decision tree analysis

Machine learning

Marketing mix modelling

Multivariate regression

Prediction interval/confidence interval

(Multivariable) Regression analysis

R

Random Forest

#### Full glossary

#### Bayesian modelling

There are two main statistical approaches to gaining insights from data: frequentist (or classical) and Bayesian. The frequentist approach builds a model based only on the data observed, while the Bayesian approach combines the observations with prior beliefs about the model, which may be subjective or based on earlier evidence.
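As a minimal sketch of the difference, the example below estimates a campaign response rate in both ways. The campaign numbers and the Beta(2, 8) prior are assumptions made up for illustration; the Beta prior is used because it makes the update a simple addition.

```python
def beta_binomial_update(prior_a, prior_b, successes, failures):
    """Return the posterior Beta parameters after observing the data."""
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical campaign: 12 of 40 recipients responded.
successes, failures = 12, 28

# Frequentist-style estimate: uses only the observed data.
observed_rate = successes / (successes + failures)

# Bayesian estimate: start from an assumed prior belief of roughly a 20%
# response rate, Beta(2, 8), then update it with the same data.
post_a, post_b = beta_binomial_update(2, 8, successes, failures)
posterior_rate = beta_mean(post_a, post_b)

print(f"observed rate:  {observed_rate:.3f}")
print(f"posterior rate: {posterior_rate:.3f}")
```

The posterior estimate sits between the prior belief and the observed rate, and moves closer to the observed rate as more data arrives.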

*For a more in-depth description, see the second part of our blog explanation How effective is your marketing strategy? – The maths behind the measurement.*

#### CHAID analysis (Chi squared automatic interaction detector)

CHAID is a type of decision tree algorithm that determines relationships between the variable of interest (for example, the number of purchases of a particular product) and the independent variables (for example, customer characteristics such as age, gender and socioeconomic status). CHAID builds the decision tree automatically, using chi-squared tests to find the splits that best separate the data. It can then help explain customers’ responses to a marketing campaign and is often used for customer segmentation.
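The sketch below shows only the core idea behind the "chi squared" part of the name: compare candidate variables by how strongly each is associated with the outcome. The counts are made up, and a full CHAID implementation does more than this (for example, merging similar categories and adjusting for multiple comparisons).

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts of [purchased, did not purchase] for each category.
candidates = {
    "age_group": [[30, 70], [60, 40]],  # under 35, 35 and over
    "gender": [[44, 56], [46, 54]],     # female, male
}

scores = {name: chi_squared(table) for name, table in candidates.items()}

# The variable with the largest statistic is the strongest candidate split.
best_split = max(scores, key=scores.get)
print(best_split, scores)
```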

#### Cluster analysis

Cluster analysis is an exploratory data analysis method that helps identify meaningful structures within data. It defines groups (or segments) of data points that share similarities across several measures. In the marketing industry, cluster analysis is often used to identify customer segments.

CHAID is also often used for customer segmentation, but it is a very different algorithm from cluster analysis. Cluster analysis treats all the variables in the data uniformly, while CHAID distinguishes between the variable of interest and the independent variables.
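To make the idea concrete, here is a minimal sketch using k-means, one of the simplest clustering algorithms (not the model-based method mentioned in our blog). The customer data and the starting centroids are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-centre."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared distance from this point to every centroid.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical customers: (visits per month, average spend).
customers = [(1, 10), (2, 12), (1, 11), (8, 40), (9, 42), (10, 41)]
centroids, clusters = kmeans(customers, centroids=[(1, 10), (10, 41)])

for centre, members in zip(centroids, clusters):
    print(centre, members)
```

Note that every variable is treated the same way when measuring distance; there is no "variable of interest", which is the key contrast with CHAID.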

*To see how we used the model-based clustering method to gain strategic insights from beer consumption data, see our blog on How we used cluster analysis to drive beer company strategy.*

#### Correlation analysis

Correlation analysis studies the relationship between a variable of interest and an explanatory variable. For example, a variable of interest could be premium juice consumption while an explanatory variable could be GDP per capita. If the relationship proves to be statistically significant, the explanatory variable is said to be related to, or associated with, the variable of interest. The correlation coefficient (r) and r-squared measure the strength of the relationship, while the p-value indicates whether it is statistically significant.
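As a rough illustration, the Pearson correlation coefficient and r-squared can be computed as below. The figures are made up, and the sketch does not compute a p-value.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data for five markets.
gdp_per_capita = [20, 25, 30, 35, 40]     # thousands
juice_litres = [2.0, 2.6, 2.9, 3.7, 3.8]  # premium juice per person

r = pearson_r(gdp_per_capita, juice_litres)
r_squared = r ** 2
print(f"r = {r:.2f}, r-squared = {r_squared:.2f}")
```

A value of r close to 1 or -1 indicates a strong linear relationship, and a value close to 0 indicates a weak one.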

*To find out more, read our blog: Correlation doesn’t imply causation but it still matters*

#### Decision tree

A decision tree is a representation of a dataset in the form of a tree, where each node of the tree represents an attribute (i.e. a column label) and the branches coming from a node represent the possible values that the attribute can take.
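One simple way to picture this structure is as nested dictionaries. The attribute names and values below are hypothetical.

```python
# Each internal node names an attribute; each branch is one value of it.
# Leaves are plain strings holding the outcome.
tree = {
    "attribute": "age_group",
    "branches": {
        "under_35": {
            "attribute": "saw_campaign",
            "branches": {"yes": "buys", "no": "does_not_buy"},
        },
        "35_and_over": "does_not_buy",
    },
}


def classify(node, record):
    """Follow the branches matching the record until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node


customer = {"age_group": "under_35", "saw_campaign": "yes"}
print(classify(tree, customer))
```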

#### Decision tree analysis

Decision tree analysis is a general name given to techniques that analyse every possible outcome of a decision. A decision tree is a graph that visualises these outcomes and can be easily interpreted. Decision trees can help to understand and evaluate risks and uncertainties. They can also help answer questions such as: What are the factors that affect the sales of a product the most? Can we predict a consumer group’s response to a marketing campaign?
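One common way to evaluate the outcomes of a decision is to weigh each by its probability and compare expected values. The sketch below assumes made-up probabilities and payoffs for two hypothetical options.

```python
# Each decision maps to its possible outcomes as (probability, payoff) pairs.
# All numbers are hypothetical.
decisions = {
    "launch_campaign": [(0.6, 50_000), (0.4, -20_000)],
    "do_nothing": [(1.0, 0)],
}


def expected_value(outcomes):
    """Probability-weighted average payoff across all outcomes."""
    return sum(probability * payoff for probability, payoff in outcomes)


values = {name: expected_value(outcomes) for name, outcomes in decisions.items()}
best_decision = max(values, key=values.get)
print(values, best_decision)
```

The expected value alone hides the risk: the campaign has the higher average payoff here but also a 40% chance of a loss, which the tree makes visible.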

#### Machine learning

Machine learning is a method of data analysis that iteratively “learns” from data as it arrives, with little human intervention. Machine learning can analyse large amounts of data quickly, enabling businesses to make decisions about their marketing campaigns in real time and to gain insights into complex consumer behaviours.
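The sketch below illustrates "learning as data arrives" with about the simplest model possible: a single weight that is nudged after every observation (stochastic gradient descent). The class name, learning rate and data stream are all assumptions for illustration.

```python
class OnlineLinearModel:
    """Learns y ≈ weight * x one observation at a time."""

    def __init__(self, learning_rate=0.01):
        self.weight = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.weight * x

    def update(self, x, y):
        # Nudge the weight in the direction that reduces the prediction error.
        error = y - self.predict(x)
        self.weight += self.learning_rate * error * x


# Hypothetical stream in which the true relationship is y = 2 * x.
stream = [(x, 2.0 * x) for x in (1, 2, 3, 4)] * 50

model = OnlineLinearModel()
for x, y in stream:
    model.update(x, y)

print(f"learned weight: {model.weight:.4f}")
```

No one tells the model that the answer is 2; it gets there by correcting itself a little with each new data point.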

*For more details, see our blog How Machine Learning is advancing customer marketing strategies.*

#### Marketing mix modelling (MM modelling)

Marketing mix modelling is a method of data analysis used to quantify the impact of marketing activities on product sales. In the simplest terms, MM modelling gives weights to the different factors that affect product sales. The weights can be determined using, for example, multivariable regression modelling.
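Once the weights are known, splitting sales into contributions is simple arithmetic. In the sketch below the weights, baseline and weekly figures are all hypothetical; in practice the weights would be estimated from historical data.

```python
# Hypothetical weights: extra units sold per unit of each marketing activity.
weights = {"tv_spend": 0.8, "online_spend": 1.5, "price_discount": 40.0}
baseline_sales = 1000.0  # assumed sales with no marketing activity

# Hypothetical activity levels for one week.
week = {"tv_spend": 500, "online_spend": 200, "price_discount": 5}

# Contribution of each factor = its weight times its activity level.
contributions = {factor: weights[factor] * week[factor] for factor in weights}
predicted_sales = baseline_sales + sum(contributions.values())

for factor, units in contributions.items():
    print(f"{factor}: {units:.0f} units ({units / predicted_sales:.0%} of sales)")
print(f"predicted sales: {predicted_sales:.0f}")
```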

*For more details, see the first part of our blog How effective is your marketing strategy? – The maths behind the measurement.*

#### Multivariate regression

Multivariate regression analysis studies the relationship between several variables of interest and several explanatory variables. For example, the variables of interest could be consumption of beer, cider and wine, while the explanatory variables could be GDP per capita, commodity prices, new product launches, population demographics and so on. Multivariate regression analysis helps to understand how changes in the explanatory variables affect each of the variables of interest differently.

#### Prediction interval/confidence interval

A confidence interval is a range of values that is likely to contain an unknown quantity, such as the true average of a variable. A prediction interval is a related range that is likely to contain a value that is yet to be observed.

For example, let the local train delay in minutes represent a variable of interest. If we know from experience that the train is never on time, but arrives within 15 minutes of its scheduled time (early or late) 95% of the time, then we would say that we are 95% confident that the next train will arrive at the station between 15 minutes before and 15 minutes after its scheduled time.
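The sketch below computes both kinds of interval for a set of made-up train delays. It assumes the delays are roughly normally distributed and uses the 1.96 multiplier for 95% coverage; with a sample this small, a t-distribution multiplier would be more appropriate in practice.

```python
import math
import statistics

# Hypothetical delays in minutes (negative means the train was early).
delays = [-12, -5, 3, 8, 15, -9, 6, 11, -3, 4]

n = len(delays)
mean = statistics.mean(delays)
sd = statistics.stdev(delays)  # sample standard deviation
z = 1.96                       # normal approximation for 95% coverage

# Confidence interval: where the true average delay is likely to lie.
ci_margin = z * sd / math.sqrt(n)
confidence_interval = (mean - ci_margin, mean + ci_margin)

# Prediction interval: where the next single delay is likely to fall.
pi_margin = z * sd * math.sqrt(1 + 1 / n)
prediction_interval = (mean - pi_margin, mean + pi_margin)

print("confidence interval:", confidence_interval)
print("prediction interval:", prediction_interval)
```

The prediction interval is always the wider of the two, because a single future value varies more than an average does.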

#### (Multivariable) Regression analysis

Regression analysis is a more general form of correlation analysis, in which the relationships between one variable of interest and several explanatory variables are measured. For example, the variable of interest could be premium beer consumption while the explanatory variables could be GDP per capita, commodity prices, new product launches and so on. Regression analysis helps to understand how changes in the explanatory variables affect the variable of interest. It is widely used for predictions and forecasts.
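As a minimal sketch, the example below fits a regression with two explanatory variables using ordinary least squares, solved via the normal equations. The data are made up and constructed from a known relationship (10 + 0.5 × GDP − 2 × price), so the fitted coefficients can be checked; real data would include noise.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    X = [[1.0] + list(row) for row in rows]  # leading 1.0 gives the intercept
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical markets: (GDP per capita in thousands, price index).
rows = [(20, 5), (25, 4), (30, 6), (35, 3), (40, 5)]
beer_consumption = [10.0, 14.5, 13.0, 21.5, 20.0]

intercept, gdp_effect, price_effect = fit_ols(rows, beer_consumption)
print(f"{intercept:.2f} + {gdp_effect:.2f} * gdp + {price_effect:.2f} * price")
```

Each coefficient describes how the variable of interest changes when that explanatory variable increases by one unit and the others stay fixed.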

#### R

R is a programming language for statistical computing and data visualisation. It is widely used by data scientists and is available for free under the GNU General Public License, although Microsoft has a proprietary version. R can be extended by adding packages that provide particular statistical methods or other additional functionality.

#### Random Forest

Random Forest is a machine learning algorithm. It consists of a collection of decision trees, each constructed on a randomly chosen sample of a training dataset. When applied to new data, the Random Forest predicts the outcome of the unknown variable by taking the prediction made by the majority of its decision trees.
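The sketch below shows the two ingredients described above, random samples of the training data and a majority vote, using the smallest possible trees (a single split each). The data are made up, and full Random Forest implementations also choose a random subset of variables at each split, which this one-variable example leaves out.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(sample):
    """One-split decision tree: choose the threshold with the fewest errors."""
    best = None
    for t in sorted({x for x, _ in sample}):
        left = [label for x, label in sample if x < t]
        right = [label for x, label in sample if x >= t]
        right_label = majority(right)
        left_label = majority(left) if left else right_label
        errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    return best[1:]  # (threshold, label below it, label at or above it)


def train_forest(data, n_trees=25, seed=0):
    rng = random.Random(seed)
    # Each tree sees a bootstrap sample: rows drawn at random with replacement.
    return [train_stump(rng.choices(data, k=len(data))) for _ in range(n_trees)]


def predict(forest, x):
    """Ask every tree for its prediction and return the majority vote."""
    votes = [right if x >= t else left for t, left, right in forest]
    return majority(votes)


# Hypothetical training data: spend level and whether the customer is "low" or "high" value.
data = [(x, "low" if x < 5 else "high") for x in range(10)]
forest = train_forest(data)

print(predict(forest, 1), predict(forest, 9))
```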

*If you are interested in what Iridium Insights could do with your data, please get in contact: info@iridium-insights.com*