- **Inferential statistics:** The use of samples to make judgements about a population.
- **Data set:** A collection of data with elements and observations, typically in the form of a table. It is similar to a map or dictionary in programming.
- **Element:** The name or label of one or more observations, similar to a key in a map/dictionary in programming.
- **Observation:** The collected data linked to an element, similar to a value in a map/dictionary in programming.
- The number of classes is arbitrarily chosen, but a commonly used formula is $\lceil\sqrt{\text{number of elements}}\rceil$.
- The width (size) of each class is $\lceil\frac{\text{max value} - \text{min value}}{\text{number of classes}}\rceil$.
- Each class includes its lower bound and excludes its upper bound ($\text{lower} ≤ x < \text{upper}$).
- The **relative frequency** of a class is the proportion of the whole data set that falls in that class, expressed as a decimal.
- The **frequency** of a class is the number of values that fall under that class.
- The largest value can either be included in the final class (changing its range to $\text{lower} ≤ x ≤ \text{highest}$), or put in a completely new class above the largest class.
??? example
| Height $x$ (cm) | Frequency |
| --- | --- |
| $1≤x<5$ | 2 |
| $5≤x<9$ | 3 |
| $9≤x≤14$ | 1 |
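
The class rules above can be sketched in Python. This is only an illustration under stated assumptions: the function name `frequency_table` and the sample `heights` list are hypothetical, the number of classes uses the square-root rule of thumb, and the largest value is folded into the final class rather than given a class of its own.

```python
import math

def frequency_table(values):
    """Group values into classes; return (lower, upper, frequency, relative frequency) rows."""
    n_classes = math.ceil(math.sqrt(len(values)))   # ceil(sqrt(number of elements))
    low, high = min(values), max(values)
    width = math.ceil((high - low) / n_classes)     # ceil((max - min) / number of classes)

    rows = []
    for i in range(n_classes):
        lower = low + i * width
        upper = lower + width
        is_last = i == n_classes - 1
        # lower <= x < upper; the final class also admits the largest value
        count = sum(1 for x in values if lower <= x < upper or (is_last and x == high))
        rows.append((lower, upper, count, count / len(values)))
    return rows

heights = [1, 2, 5, 6, 8, 14]  # hypothetical sample data
for lower, upper, count, rel in frequency_table(heights):
    print(f"{lower} <= x < {upper}: {count} ({rel:.2f})")
# 1 <= x < 6: 3 (0.50)
# 6 <= x < 11: 2 (0.33)
# 11 <= x < 16: 1 (0.17)
```

The relative frequencies of all classes sum to 1, which is a quick sanity check on any table built this way.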
For a given class $i$, the midpoint of that class is as follows:

$$\text{midpoint}_i = \frac{\text{lower bound}_i + \text{upper bound}_i}{2}$$
A **stem and leaf plot** can list out all the data points while grouping them simultaneously.
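
As a rough sketch of how such a plot groups data, the snippet below splits each value into a stem (the tens) and a leaf (the units). The function name `stem_and_leaf` and the sample values are hypothetical, and the code assumes non-negative integers.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens) to its sorted list of leaves (units)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 27 -> stem 2, leaf 7
        plot[stem].append(leaf)
    return dict(plot)

for stem, leaves in stem_and_leaf([12, 15, 21, 27, 27, 34]).items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
# 1 | 2 5
# 2 | 1 7 7
# 3 | 4
```

Every original data point can be read back from the plot, while each row doubles as a class of width 10.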
A **frequency histogram** can be used to represent a frequency distribution, with the x-axis containing the class boundaries and the y-axis representing frequency.
If data is discrete, a gap must be left between the bars. If data is continuous, there must *not* be a gap between the bars.
A **cumulative frequency table** can be used to find the number of data values below a certain class boundary. It involves the addition of a **cumulative frequency** column which represents the sum of the frequency of the current class as well as every class before it. It is similar to a prefix sum array in computer science.
??? example
| Height $h$ (cm) | Frequency | Cumulative frequency |
| --- | --- | --- |
| $1≤h<10$ | 2 | 2 |
| $10≤h<19$ | 5 | 7 |
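
The prefix-sum analogy can be made concrete with a short sketch. The helper name `cumulative_frequencies` is hypothetical; the input is the frequency column from the example table above.

```python
from itertools import accumulate

def cumulative_frequencies(frequencies):
    """Running total of class frequencies, i.e. a prefix sum."""
    return list(accumulate(frequencies))

frequencies = [2, 5]  # frequency column from the example table
cumulative = cumulative_frequencies(frequencies)
print(cumulative)  # [2, 7]
```

The last entry always equals the total number of data values, so it can be used to check the table.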
### Outliers
Outliers are data values that differ significantly from the rest of the data set. They may be caused by: