math: add frequency data structures
parent d060ad4322
commit de940be05e
@@ -6,21 +6,21 @@ The course code for this page is **MHF4U7**.
|
|
||||||
!!! note "Definition"
|
!!! note "Definition"
|
||||||
- **Statistics:** The techniques and procedures to analyse, interpret, display, and make decisions based on data.
|
- **Statistics:** The techniques and procedures to analyse, interpret, display, and make decisions based on data.
|
||||||
- **Descriptive statistics:** The use of methods to organise, display, and describe data by using various charts and summary methods to reduce data to a manageable size.
|
- **Descriptive statistics:** The use of methods to work with and describe the **entire** data set.
|
||||||
- **Inferential statistics:** The use of samples to make judgements about a population.
|
- **Inferential statistics:** The use of samples to make judgements about a population.
|
||||||
- **Data set:** A collection of data with elements and observations, typically in the form of a table. It is similar to a map or dictionary in programming.
|
- **Data set:** A collection of data with elements and observations, typically in the form of a table. It is similar to a map or dictionary in programming.
|
||||||
- **Element:** The name of an observation(s), similar to a key to a map/dictionary in programming.
|
- **Element:** The name of an observation(s), similar to a key to a map/dictionary in programming.
|
||||||
- **Observation:** The collected data linked to an element, similar to a value to a map/dictionary in programming.
|
- **Observation:** The collected data linked to an element, similar to a value to a map/dictionary in programming.
|
||||||
- **Population**: A collection of all elements of interest within a data set.
|
- **Population**: A collection of all elements of interest within a data set.
|
||||||
- **Sample**: The selection of a few elements within a population to represent that population.
|
- **Sample**: The selection of a few elements within a population to represent that population.
|
||||||
- **Raw data:** Data collected prior to processing or ranking.\
|
- **Raw data:** Data collected prior to processing or ranking.
|
||||||
|
|
||||||
### Sampling
|
### Sampling
|
||||||
|
|
||||||
A good sample:
|
A good sample:
|
||||||
|
|
||||||
- represents the relevant features of the full population,
|
- represents the relevant features of the full population,
|
||||||
- is large enough so that it decently represents the full population,
|
- is as large as reasonably possible so that it decently represents the full population,
|
||||||
- and is random.
|
- and is random.
|
||||||
|
|
||||||
The types of random sampling include:
|
The types of random sampling include:
|
||||||
@@ -47,13 +47,58 @@ The types of random sampling include:
- **Qualitative variable**: A variable that is not numerical and cannot be sorted.
|
- **Qualitative variable**: A variable that is not numerical and cannot be sorted.
|
||||||
- **Bias**: An unfair influence in data during the collection process, causing the data to be not truly representative of the population.
|
- **Bias**: An unfair influence in data during the collection process, causing the data to be not truly representative of the population.
|
||||||
|
|
||||||
|
|
||||||
### Frequency distribution
|
### Frequency distribution
|
||||||
|
|
||||||
A **frequency distribution** is a data set that lists ranges and the number of values in each range. It can be displayed using a frequency distribution table.
|
A **frequency distribution** is a table that lists categories/ranges and the number of values in each category/range.
|
||||||
|
|
||||||
!!! note "Definition"
|
A frequency distribution table includes:
|
||||||
|
|
||||||
|
- A number of classes, all of the same width.
- This number is arbitrarily chosen, but a commonly used formula is $\lceil\sqrt{\text{number of elements}}\rceil$.
- The width (size) of each class is $\lceil\frac{\text{max value} - \text{min value}}{\text{number of classes}}\rceil$.
- Each class includes its lower bound and excludes its upper bound ($\text{lower} ≤ x < \text{upper}$).
- The **relative frequency** of a class is the proportion of the whole data set that falls in that class, expressed as a decimal.
- The number of values that fall under each class.
- The largest value can either be included in the final class (changing its range to $\text{lower} ≤ x ≤ \text{highest}$), or put in a completely new class above the largest class.
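The construction described in the list above can be sketched in Python. This is a minimal illustration rather than part of the course material: the function name `frequency_table` and the sample heights are made up, the data is assumed to be numeric with at least two distinct values, and the sketch takes the first of the two options for the largest value (folding it into the final class).

```python
import math


def frequency_table(values):
    """Return (lower, upper, frequency, relative frequency) rows.

    Hypothetical helper: classes are equal-width, and the final class
    is closed on both ends so that the largest value is counted.
    """
    n = len(values)
    num_classes = math.ceil(math.sqrt(n))        # ceil(sqrt(number of elements))
    lo, hi = min(values), max(values)
    width = math.ceil((hi - lo) / num_classes)   # ceil((max - min) / number of classes)

    rows = []
    for i in range(num_classes):
        lower = lo + i * width
        upper = lower + width
        last = i == num_classes - 1
        # lower <= x < upper, except that the final class also includes its upper bound
        count = sum(1 for x in values if lower <= x < upper or (last and x == upper))
        rows.append((lower, upper, count, count / n))
    return rows


# Made-up heights in cm, purely for illustration
heights = [1, 2, 4, 5, 7, 8, 9, 10, 13]
for lower, upper, freq, rel in frequency_table(heights):
    print(lower, upper, freq, round(rel, 3))
```

With these nine values the sketch produces three classes of width 4, each with frequency 3, and the relative frequencies sum to 1.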
|
||||||
|
|
||||||
|
??? example
|
||||||
|
| Height $x$ (cm) | Frequency |
| --- | --- |
| $1≤x<5$ | 2 |
| $5≤x<9$ | 3 |
| $9≤x≤14$ | 1 |
|
||||||
|
|
||||||
|
For a given class $i$, the midpoint of that class is as follows:
|
||||||
|
$$x_{i} = \frac{\text{lower bound} + \text{upper bound}}{2}$$
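As a worked instance, take the first class of the example table above, $1≤x<5$, which has a lower bound of 1 and an upper bound of 5:

$$x_{1} = \frac{1 + 5}{2} = 3$$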
|
||||||
|
|
||||||
|
### Representing frequency
|
||||||
|
|
||||||
|
A **stem and leaf plot** can list out all the data points while grouping them simultaneously.
|
||||||
|
|
||||||
|
A **frequency histogram** can be used to represent a frequency distribution, with the x-axis containing class boundaries and the y-axis representing frequency.
|
||||||
|
|
||||||
|
<img src="/resources/images/frequency-discrete.png" width=700>(Source: Kognity)</img>
|
||||||
|
|
||||||
|
!!! note
|
||||||
|
If data is discrete, a gap must be left between the bars. If data is continuous, there must *not* be a gap between the bars.
|
||||||
|
|
||||||
|
A **cumulative frequency table** can be used to find the number of data values below a certain class boundary. It adds a **cumulative frequency** column, which holds the sum of the frequencies of the current class and every class before it. It is similar to a prefix sum array in computer science.
|
||||||
|
|
||||||
|
??? example
|
||||||
|
| Height $h$ (cm) | Frequency | Cumulative frequency |
| --- | --- | --- |
| $1≤h<10$ | 2 | 2 |
| $10≤h<19$ | 5 | 7 |
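The prefix-sum analogy can be made concrete with a short Python sketch. It relies on `itertools.accumulate` from the standard library. The variable names are illustrative only, and the frequencies are the ones from the example table above.

```python
from itertools import accumulate

# Frequencies of the classes 1 <= h < 10 and 10 <= h < 19, in order
frequencies = [2, 5]

# Running total: each entry is the frequency of its class plus all earlier classes
cumulative = list(accumulate(frequencies))

print(cumulative)  # [2, 7]
```

The last entry of `cumulative` always equals the total number of data values, which gives a quick sanity check on a hand-built table.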
|
||||||
|
|
||||||
|
|
||||||
|
### Outliers
|
||||||
|
|
||||||
|
Outliers are data values that differ significantly from the rest of the data set. They may be caused by:
|
||||||
|
|
||||||
|
- a random natural occurrence, or
|
||||||
|
- abnormal circumstances.
|
||||||
|
|
||||||
|
Outliers can be ignored once identified.
|
||||||
|
|
||||||
## Resources
|
## Resources
|
||||||
|
|
||||||
|