SL Math - Analysis and Approaches - A

The course code for this page is MHF4U7.

4 - Statistics and probability

!!! note "Definition"
    - Statistics: The techniques and procedures to analyse, interpret, display, and make decisions based on data.
    - Descriptive statistics: The use of methods to work with and describe the entire data set.
    - Inferential statistics: The use of samples to make judgements about a population.
    - Data set: A collection of data with elements and observations, typically in the form of a table. It is similar to a map or dictionary in programming.
    - Element: The name of one or more observations, similar to a key in a map/dictionary in programming.
    - Observation: The collected data linked to an element, similar to a value in a map/dictionary in programming.
    - Population: A collection of all elements of interest within a data set.
    - Sample: The selection of a few elements within a population to represent that population.
    - Raw data: Data collected prior to processing or ranking.
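
The dictionary analogy in the definitions can be made concrete with a minimal sketch. The names and heights below are hypothetical, chosen only for illustration, and the sample is picked by position (not randomly) purely to keep the example short.

```python
# Hypothetical data set: each element (key) maps to its observation (value).
heights_cm = {
    "Alice": 162,
    "Bob": 175,
    "Chen": 168,
    "Dana": 171,
}

# The population is every element of interest in the data set.
population = list(heights_cm)

# A sample is a few elements chosen to represent the population.
sample = population[:2]

# Look up the observation linked to each sampled element.
sample_observations = [heights_cm[name] for name in sample]
```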

Sampling

A good sample:

represents the relevant features of the full population,
is as large as reasonably possible so that it adequately represents the full population,
and is random.

Sampling methods include the following (convenience and quota sampling are not random):

Simple: Choosing a sample completely randomly.
Convenience: Choosing a sample based on ease of access to the data.
Systematic: Choosing a random starting point, then choosing the rest of the sample at a consistent interval in a list.
Quota: Choosing a sample whose members have specific characteristics.
Stratified: Choosing a sample so that the proportion of specific characteristics matches that of the population.

??? example
    - Simple: Using a random number generator to pick items from a list.
    - Convenience: Asking the first 20 people met to answer a survey.
    - Systematic: Rolling a die and getting a 6, so choosing the 6th element and every 10th element after that.
    - Quota: Ensuring that all members of the sample wear red jackets.
    - Stratified: The population is 45% male and 55% female, so the sample is also 45% male and 55% female.
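
The simple, systematic, and stratified methods can be sketched in Python as follows. This is an illustration under stated assumptions, not a prescribed procedure: the function names, the population of 100 numbered elements, and the group labels are all hypothetical, and the systematic version picks its random start within the first interval.

```python
import random


def simple_sample(population, k, rng):
    """Simple random sample: every element is equally likely to be chosen."""
    return rng.sample(population, k)


def systematic_sample(population, interval, rng):
    """Random starting point in the first interval, then every interval-th element."""
    start = rng.randrange(interval)
    return population[start::interval]


def stratified_sample(strata, k, rng):
    """Sample each stratum in proportion to its share of the population.

    `strata` maps a (hypothetical) group label to the list of its members.
    Rounding may make the total differ slightly from k.
    """
    total = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        share = round(k * len(members) / total)
        sample.extend(rng.sample(members, share))
    return sample


rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = list(range(1, 101))
# Hypothetical population that is 45% male and 55% female.
strata = {"male": list(range(45)), "female": list(range(45, 100))}

simple = simple_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(strata, 20, rng)
```

With a sample size of 20, the stratified sample contains 9 members of the first group and 11 of the second, matching the 45%/55% split of the population.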

Types of data

!!! note "Definition"
    - Quantitative variable: A variable that is numerical and can be sorted.
    - Discrete variable: A quantitative variable that is countable.
    - Continuous variable: A quantitative variable that can contain an infinite number of values between any two values.
    - Qualitative variable: A variable that is not numerical and cannot be sorted.
    - Bias: An unfair influence in data during the collection process, causing the data to be not truly representative of the population.

Frequency distribution

A frequency distribution is a table that lists categories/ranges and the number of values in each category/range.

A frequency distribution table includes:

A number of classes, all of the same width.
- This number is arbitrarily chosen, but a commonly used formula is \(\lceil\sqrt{\text{# of elements}}\rceil\).
- The width (size) of each class is \(\lceil\frac{\text{max value} - \text{min value}}{\text{number of classes}}\rceil\).
- Each class includes its lower bound and excludes its upper bound (\(\text{lower} ≤ x < \text{upper}\)).
- The relative frequency of a class is the proportion of the whole data set that falls in that class, expressed as a decimal.
The number of values that fall under each class.
- The largest value can either be included in the final class (changing its range to \(\text{lower} ≤ x ≤ \text{highest}\)), or put in a completely new class above the largest class.

??? example
    | Height \(x\) (cm) | Frequency |
    | --- | --- |
    | \(1≤x<5\) | 2 |
    | \(5≤x<9\) | 3 |
    | \(9≤x≤14\) | 1 |
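
The construction rules for a frequency distribution table can be sketched in Python. This is a minimal illustration that assumes the values are not all identical (otherwise the class width would be zero) and that the largest value is folded into the final class. The function name and the sample heights are hypothetical.

```python
import math


def frequency_table(values):
    """Return (lower, upper, frequency, relative frequency) for each class.

    Classes are [lower, upper); the final class also includes its upper bound.
    """
    n_classes = math.ceil(math.sqrt(len(values)))    # commonly used rule of thumb
    low, high = min(values), max(values)
    width = math.ceil((high - low) / n_classes)      # same width for every class
    table = []
    for i in range(n_classes):
        lower = low + i * width
        upper = lower + width
        is_last = i == n_classes - 1
        count = sum(
            1 for v in values
            if lower <= v < upper or (is_last and v == upper)
        )
        table.append((lower, upper, count, count / len(values)))
    return table


heights = [1, 2, 5, 6, 8, 13]  # hypothetical raw data
table = frequency_table(heights)
```

For these six values the sketch uses 3 classes of width 4, giving the classes 1 to 5, 5 to 9, and 9 to 13 with frequencies 2, 3, and 1.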

For a given class \(i\), the midpoint of that class is as follows: \[x_{i} = \frac{\text{lower bound} + \text{upper bound}}{2}\]
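
As a quick worked instance of the formula, a class \(5≤x<9\) has a lower bound of 5 and an upper bound of 9, so its midpoint is: \[x_{i} = \frac{5 + 9}{2} = 7\]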

Representing frequency

A stem and leaf plot can list out all the data points while grouping them simultaneously.
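
A stem and leaf plot can be generated with a short Python sketch. It assumes non-negative integers, uses the units digit as the leaf and the remaining digits as the stem, and the function name and data are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the plot as a list of lines, one per stem."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)  # stem = tens and above, leaf = units
    lines = []
    # Include empty stems so gaps in the data stay visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


plot = stem_and_leaf([12, 15, 21, 23, 23, 38, 40])
print("\n".join(plot))
```

Each row groups the data by stem while still listing every individual value as a leaf.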

A frequency histogram can be used to represent frequency distribution, with the x-axis containing class boundaries, and the y-axis representing frequency.

(Source: Kognity)

!!! note
    If data is discrete, a gap must be left between the bars. If data is continuous, there must not be a gap between the bars.

A cumulative frequency table can be used to find the number of data values below a certain class boundary. It involves the addition of a cumulative frequency column which represents the sum of the frequency of the current class as well as every class before it. It is similar to a prefix sum array in computer science.

??? example
    | Height \(h\) (cm) | Frequency | Cumulative frequency |
    | --- | --- | --- |
    | \(1≤h<10\) | 2 | 2 |
    | \(10≤h<19\) | 5 | 7 |
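
The prefix sum comparison can be shown directly in Python. This minimal sketch uses the two class frequencies from the table above and the standard library's `itertools.accumulate`. The variable names are illustrative.

```python
from itertools import accumulate

frequencies = [2, 5]  # frequencies of the classes 1 <= h < 10 and 10 <= h < 19

# Each entry is the frequency of its class plus that of every class before it.
cumulative = list(accumulate(frequencies))

# The last entry is the number of data values below the final upper boundary.
total_below_19 = cumulative[-1]
```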

Outliers

Outliers are data values that differ significantly from the rest of the data set. They may be caused by:

a random natural occurrence, or
abnormal circumstances

Outliers can be ignored once identified.
