Slovin's formula for finding the sample size:
`n = N/(1+Ne^2)`
`"where " n: "Sample Size, " N: "Population, " e: "Margin of Error"`
Slovin's formula for finding the population size:
`N = (-n)/(n e^2-1)`
`"where " n: "Sample Size, " N: "Population, " e: "Margin of Error"`
Slovin's formula for finding the margin of error:
`e = sqrt((N/n - 1)/N)`
`"where " n: "Sample Size, " N: "Population"`
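As a quick sanity check, two of the three forms above can be sketched in Python (the function names are illustrative, not from any library):

```python
import math

def slovin_sample_size(N: float, e: float) -> float:
    """Slovin's formula: n = N / (1 + N * e^2)."""
    return N / (1 + N * e ** 2)

def slovin_margin_of_error(N: float, n: float) -> float:
    """Rearranged for e: e = sqrt((N/n - 1) / N)."""
    return math.sqrt((N / n - 1) / N)

# e.g. a population of 1,000 surveyed with a 5% margin of error
n = slovin_sample_size(1000, 0.05)
print(round(n))                                    # -> 286 respondents
print(round(slovin_margin_of_error(1000, n), 2))   # recovers e = 0.05
```

Note that rearranging the sample-size form for `e` recovers the original margin of error, which confirms the three formulas are algebraically consistent.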
Terms used most frequently in statistics and probability:
- Variables: Each property we collect in our data about the individuals.
- Individuals: The set of elements (whether people or otherwise) that are surveyed to form a set of data.
- Data: The collection of individuals and variables.
- Data table: A table that organizes the data, including the individuals and their variables.
Types of Variables in Statistics:
- Categorical variables: Non-numerical variables, also called “qualitative” variables. Their values aren’t represented with numbers.
- Quantitative variables: Numerical variables. Their values are numbers.
- Discrete variables: Variables we can obtain by counting. Therefore, they can take on only certain numerical values.
- Continuous variables: Variables that can include data such as decimals, fractions, or irrational numbers.
Types of measurement in Statistics:
- Nominal scale of measurement: Things like favorite food, colors, names, and “yes” or “no” responses have a nominal scale of measurement. Only categorical data can be measured with a nominal scale.
- Ordinal scale of measurement: Categorical data can also be ordinal. This type of data can be ordered.
- Interval scale of measurement: Data measured using an interval scale can be ordered like ordinal data. But interval data also gives us a known interval between measurements.
- Ratio scale of measurement: Data measured using a ratio scale is just like interval scale data, except that ratio scale data has a starting point, or absolute zero.
Points to consider when calculating a sample size:
- Make sure the sample size is at least 30. It has been demonstrated statistically that 30 is the minimum sample size at which the sample mean begins to approximate the population mean.
- The most popular confidence level is 95%, while 90% is occasionally appropriate.
- Use a greater sample size for a better degree of confidence.
- Increase the sample size to reduce the margin of error.
- Use a higher sample size to achieve greater statistical significance.
Why is a sample size of at least 30 required?
The Central Limit Theorem (CLT) from probability and statistics is the foundation for this recommendation. As the sample size grows, the distribution of sample results more closely approximates the normal (bell-shaped) distribution. Thirty is commonly cited as the smallest sample size for which the CLT holds well enough in practice. Regression analysis, a statistical technique used by researchers to determine the relationships between independent and dependent variables, also favors a minimum sample size of thirty.
Use Case of Sample Size:
Sample sizes differ depending on the business situation. The kind of business problem you are attempting to address determines the sample size that you need. For instance, 180,000 respondents to a survey conducted in a 200,000-person city would be considered a big sample size. But in the absence of that, how would one go about determining a suitable smaller sample size?
If the respondents were chosen from each district in the city, would 250 be acceptable?
Answer: It depends on the research question.
- If your business's goal is to learn what the citizens thought of the new library, 250 people might be a sufficient sample size.
- If your business problem is to find out how citizens would vote to fund the library, a sample size of 250 might not be sufficient.
- Note: If you were to survey people about their opinions of the new library, as opposed to how they would vote to fund it, you could probably tolerate a bigger margin of error. As a result, the voter survey would need the larger sample size.
Large Sample, Higher Cost
Larger sample sizes cost more. You must balance the expense of a bigger sample size against the advantage of more precise data. Someone attempting to understand consumer preferences for a new product line does not need as large a sample as someone attempting to understand the effects of a new medication. For drug safety, the benefits of a bigger sample size outweigh the costs. For consumer preferences, results from a smaller sample size at a cheaper cost might be adequate.
Terminology | Definitions |
---|---|
Population | The entire group that you are interested in for your study. For example, if you are surveying people in your company, the population would be all the employees in your company. |
Sample | A subset of your population. Just like a food sample, it is called a sample because it is only a taste. So if your company is too large to survey every individual, you can survey a representative sample of your population. |
Margin of error | Since a sample is used to represent a population, the sample’s results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. The smaller the margin of error, the closer the results of the sample are to what the result would have been if you had surveyed the entire population. |
Confidence level | How confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times. Confidence level is targeted before you start your study because it will affect how big your margin of error is at the end of your study. |
Confidence interval | The range of possible values that the population’s result would be at the confidence level of the study. This range is the sample result +/- the margin of error. |
Statistical significance | The determination of whether your result could be due to random chance or not. The greater the significance, the less due to chance. |
Z-Score
The z-score indicates how far a data point deviates from the mean, measured in standard deviations. For normally distributed data (normal distributions), where x is the data point, `mu` is the mean, and `sigma` is the standard deviation, we apply the formula:
`z = (x-mu)/sigma`
Because a z-score expresses a data point's distance from the mean, it should always be read in units of standard deviations.
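The formula translates directly to Python; a minimal sketch (the helper name is illustrative):

```python
def z_score(x: float, mu: float, sigma: float) -> float:
    """How many standard deviations x lies from the mean mu."""
    return (x - mu) / sigma

print(z_score(22, 24, 4))   # -> -0.5 (below the mean, left of center)
print(z_score(60, 55, 3))   # ~ 1.67 (above the mean, right of center)
```

The sign of the result tells you which z-table to consult: negative z-scores are looked up in the negative table below, positive ones in the positive table.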
Z-Table
A negative z-score indicates that the data point is less than the mean, and it will be located to the left of the mean. You can search for them in the negative z-score table.
Negative Z-score Table
Z | .00 | .01 | .02 | .03 | .04 | .05 | .06 | .07 | .08 | .09 |
-3.4 | .0003 | .0003 | .0003 | .0003 | .0003 | .0003 | .0003 | .0003 | .0003 | .0002 |
-3.3 | .0005 | .0005 | .0005 | .0004 | .0004 | .0004 | .0004 | .0004 | .0004 | .0003 |
-3.2 | .0007 | .0007 | .0006 | .0006 | .0006 | .0006 | .0006 | .0005 | .0005 | .0005 |
-3.1 | .0010 | .0009 | .0009 | .0009 | .0008 | .0008 | .0008 | .0008 | .0007 | .0007 |
-3.0 | .0013 | .0013 | .0013 | .0012 | .0012 | .0011 | .0011 | .0011 | .0010 | .0010 |
-2.9 | .0019 | .0018 | .0018 | .0017 | .0016 | .0016 | .0015 | .0015 | .0014 | .0014 |
-2.8 | .0026 | .0025 | .0024 | .0023 | .0023 | .0022 | .0021 | .0021 | .0020 | .0019 |
-2.7 | .0035 | .0034 | .0033 | .0032 | .0031 | .0030 | .0029 | .0028 | .0027 | .0026 |
-2.6 | .0047 | .0045 | .0044 | .0043 | .0041 | .0040 | .0039 | .0038 | .0037 | .0036 |
-2.5 | .0062 | .0060 | .0059 | .0057 | .0055 | .0054 | .0052 | .0051 | .0049 | .0048 |
-2.4 | .0082 | .0080 | .0078 | .0075 | .0073 | .0071 | .0069 | .0068 | .0066 | .0064 |
-2.3 | .0107 | .0104 | .0102 | .0099 | .0096 | .0094 | .0091 | .0089 | .0087 | .0084 |
-2.2 | .0139 | .0136 | .0132 | .0129 | .0125 | .0122 | .0119 | .0116 | .0113 | .0110 |
-2.1 | .0179 | .0174 | .0170 | .0166 | .0162 | .0158 | .0154 | .0150 | .0146 | .0143 |
-2.0 | .0228 | .0222 | .0217 | .0212 | .0207 | .0202 | .0197 | .0192 | .0188 | .0183 |
-1.9 | .0287 | .0281 | .0274 | .0268 | .0262 | .0256 | .0250 | .0244 | .0239 | .0233 |
-1.8 | .0359 | .0351 | .0344 | .0336 | .0329 | .0322 | .0314 | .0307 | .0301 | .0294 |
-1.7 | .0446 | .0436 | .0427 | .0418 | .0409 | .0401 | .0392 | .0384 | .0375 | .0367 |
-1.6 | .0548 | .0537 | .0526 | .0516 | .0505 | .0495 | .0485 | .0475 | .0465 | .0455 |
-1.5 | .0668 | .0655 | .0643 | .0630 | .0618 | .0606 | .0594 | .0582 | .0571 | .0559 |
-1.4 | .0808 | .0793 | .0778 | .0764 | .0749 | .0735 | .0721 | .0708 | .0694 | .0681 |
-1.3 | .0968 | .0951 | .0934 | .0918 | .0901 | .0885 | .0869 | .0853 | .0838 | .0823 |
-1.2 | .1151 | .1131 | .1112 | .1093 | .1075 | .1056 | .1038 | .1020 | .1003 | .0985 |
-1.1 | .1357 | .1335 | .1314 | .1292 | .1271 | .1251 | .1230 | .1210 | .1190 | .1170 |
-1.0 | .1587 | .1562 | .1539 | .1515 | .1492 | .1469 | .1446 | .1423 | .1401 | .1379 |
-0.9 | .1841 | .1814 | .1788 | .1762 | .1736 | .1711 | .1685 | .1660 | .1635 | .1611 |
-0.8 | .2119 | .2090 | .2061 | .2033 | .2005 | .1977 | .1949 | .1922 | .1894 | .1867 |
-0.7 | .2420 | .2389 | .2358 | .2327 | .2296 | .2266 | .2236 | .2206 | .2177 | .2148 |
-0.6 | .2743 | .2709 | .2676 | .2643 | .2611 | .2578 | .2546 | .2514 | .2483 | .2451 |
-0.5 | .3085 | .3050 | .3015 | .2981 | .2946 | .2912 | .2877 | .2843 | .2810 | .2776 |
-0.4 | .3446 | .3409 | .3372 | .3336 | .3300 | .3264 | .3228 | .3192 | .3156 | .3121 |
-0.3 | .3821 | .3783 | .3745 | .3707 | .3669 | .3632 | .3594 | .3557 | .3520 | .3483 |
-0.2 | .4207 | .4168 | .4129 | .4090 | .4052 | .4013 | .3974 | .3936 | .3897 | .3859 |
-0.1 | .4602 | .4562 | .4522 | .4483 | .4443 | .4404 | .4364 | .4325 | .4286 | .4247 |
0.0 | .5000 | .4960 | .4920 | .4880 | .4840 | .4801 | .4761 | .4721 | .4681 | .4641 |
Data points greater than the mean have a positive z-score and are positioned to the right of the mean. Look them up in the positive z-score table.
Positive Z-score Table
Z | .00 | .01 | .02 | .03 | .04 | .05 | .06 | .07 | .08 | .09 |
0.0 | .5000 | .5040 | .5080 | .5120 | .5160 | .5199 | .5239 | .5279 | .5319 | .5359 |
0.1 | .5398 | .5438 | .5478 | .5517 | .5557 | .5596 | .5636 | .5675 | .5714 | .5753 |
0.2 | .5793 | .5832 | .5871 | .5910 | .5948 | .5987 | .6026 | .6064 | .6103 | .6141 |
0.3 | .6179 | .6217 | .6255 | .6293 | .6331 | .6368 | .6406 | .6443 | .6480 | .6517 |
0.4 | .6554 | .6591 | .6628 | .6664 | .6700 | .6736 | .6772 | .6808 | .6844 | .6879 |
0.5 | .6915 | .6950 | .6985 | .7019 | .7054 | .7088 | .7123 | .7157 | .7190 | .7224 |
0.6 | .7257 | .7291 | .7324 | .7357 | .7389 | .7422 | .7454 | .7486 | .7517 | .7549 |
0.7 | .7580 | .7611 | .7642 | .7673 | .7704 | .7734 | .7764 | .7794 | .7823 | .7852 |
0.8 | .7881 | .7910 | .7939 | .7967 | .7995 | .8023 | .8051 | .8078 | .8106 | .8133 |
0.9 | .8159 | .8186 | .8212 | .8238 | .8264 | .8289 | .8315 | .8340 | .8365 | .8389 |
1.0 | .8413 | .8438 | .8461 | .8485 | .8508 | .8531 | .8554 | .8577 | .8599 | .8621 |
1.1 | .8643 | .8665 | .8686 | .8708 | .8729 | .8749 | .8770 | .8790 | .8810 | .8830 |
1.2 | .8849 | .8869 | .8888 | .8907 | .8925 | .8944 | .8962 | .8980 | .8997 | .9015 |
1.3 | .9032 | .9049 | .9066 | .9082 | .9099 | .9115 | .9131 | .9147 | .9162 | .9177 |
1.4 | .9192 | .9207 | .9222 | .9236 | .9251 | .9265 | .9279 | .9292 | .9306 | .9319 |
1.5 | .9332 | .9345 | .9357 | .9370 | .9382 | .9394 | .9406 | .9418 | .9429 | .9441 |
1.6 | .9452 | .9463 | .9474 | .9484 | .9495 | .9505 | .9515 | .9525 | .9535 | .9545 |
1.7 | .9554 | .9564 | .9573 | .9582 | .9591 | .9599 | .9608 | .9616 | .9625 | .9633 |
1.8 | .9641 | .9649 | .9656 | .9664 | .9671 | .9678 | .9686 | .9693 | .9699 | .9706 |
1.9 | .9713 | .9719 | .9726 | .9732 | .9738 | .9744 | .9750 | .9756 | .9761 | .9767 |
2.0 | .9772 | .9778 | .9783 | .9788 | .9793 | .9798 | .9803 | .9808 | .9812 | .9817 |
2.1 | .9821 | .9826 | .9830 | .9834 | .9838 | .9842 | .9846 | .9850 | .9854 | .9857 |
2.2 | .9861 | .9864 | .9868 | .9871 | .9875 | .9878 | .9881 | .9884 | .9887 | .9890 |
2.3 | .9893 | .9896 | .9898 | .9901 | .9904 | .9906 | .9909 | .9911 | .9913 | .9916 |
2.4 | .9918 | .9920 | .9922 | .9925 | .9927 | .9929 | .9931 | .9932 | .9934 | .9936 |
2.5 | .9938 | .9940 | .9941 | .9943 | .9945 | .9946 | .9948 | .9949 | .9951 | .9952 |
2.6 | .9953 | .9955 | .9956 | .9957 | .9959 | .9960 | .9961 | .9962 | .9963 | .9964 |
2.7 | .9965 | .9966 | .9967 | .9968 | .9969 | .9970 | .9971 | .9972 | .9973 | .9974 |
2.8 | .9974 | .9975 | .9976 | .9977 | .9977 | .9978 | .9979 | .9979 | .9980 | .9981 |
2.9 | .9981 | .9982 | .9982 | .9983 | .9984 | .9984 | .9985 | .9985 | .9986 | .9986 |
3.0 | .9987 | .9987 | .9987 | .9988 | .9988 | .9989 | .9989 | .9989 | .9990 | .9990 |
3.1 | .9990 | .9991 | .9991 | .9991 | .9992 | .9992 | .9992 | .9992 | .9993 | .9993 |
3.2 | .9993 | .9993 | .9994 | .9994 | .9994 | .9994 | .9994 | .9995 | .9995 | .9995 |
3.3 | .9995 | .9995 | .9995 | .9996 | .9996 | .9996 | .9996 | .9996 | .9996 | .9997 |
3.4 | .9997 | .9997 | .9997 | .9997 | .9997 | .9997 | .9997 | .9997 | .9997 | .9998 |
Mean
In a group of numbers, the mean is the average value. It is a statistic used to measure a probability distribution's central tendency.
General representation of the mean:
`mu = (sum_(i=1)^n x_i)/n`
Mean of a Population Formula:
`mu = (sum_(i=1)^N x_i)/N`
Mean of a Sample Formula:
`bar x = (sum_(i=1)^n x_i)/n`
Variance
Variance is the measure of how far the data is spread from the mean.
Variance of a Population Formula:
Notice that `mu` is the population mean, which means that `(x_i - mu)^2` gives the distance of each point from the mean, which is the deviation of each point.
`sigma^2 = (sum_(i=1)^N (x_i - mu)^2)/N`
Variance of a Sample Formula:
- Unbiased Sample Variance: Dividing by `n - 1` instead of `n` makes the sample variance a much more accurate estimate of the population variance. Dividing by `n` tends to underestimate the variance, while dividing by `n - 2` would overestimate it. For this reason, we use the unbiased sample variance formula rather than the biased one.
`s^2 = (sum_(i=1)^n (x_i - bar x)^2)/(n-1)`
- Biased Sample Variance: The biased sample variance formula is shown here for reference, but we do not apply it in practice.
`s^2 = (sum_(i=1)^n (x_i - bar x)^2)/n`
Notice that the formula for sample variance, `s^2`, is identical to the formula for population variance, except that we've swapped out `mu` for `bar x` (since `bar x` is sample mean, whereas `mu` is population mean), and we've changed N to n (since n refers to sample size, whereas N refers to population size).
Standard deviation:
Standard deviation is a measure of how much the data in a set varies from the mean. The larger the value of standard deviation, the more the data in the set varies from the mean. The smaller the value of standard deviation, the less the data in the set varies from the mean.
Standard deviation of Population:
Population Standard deviation is the positive square root of population variance. Since population variance is given by `sigma^2`, population standard deviation is given by `sigma`.
`sigma = sqrt((sum_(i=1)^N (x_i - mu)^2)/N)`
Standard deviation of Sample:
Since sample variance is given by `s^2`, sample standard deviation is given by `s`.
`s = sqrt((sum_(i=1)^n (x_i - bar x)^2)/(n-1))`
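The formulas above can be sketched in Python; the standard `statistics` module implements the same definitions (`pstdev` divides by N, `stdev` by n - 1), so it serves as a cross-check:

```python
import math
import statistics

def sample_std(data):
    """Unbiased sample standard deviation: divide by n - 1."""
    n = len(data)
    xbar = sum(data) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))

# ages from the worked example later in these notes
ages = [43, 35, 38, 56, 29, 33, 46, 63, 49, 40]
print(round(sample_std(ages), 2))        # -> 10.58
print(round(statistics.stdev(ages), 2))  # -> 10.58 (same definition)
```

Swapping the `n - 1` for `n` (and the sample mean for the population mean) gives the population standard deviation, matching `statistics.pstdev`.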
Distribution
The area under the curve displays the distribution of the data, so the smooth line enclosing a distribution is generally referred to as a density curve.
- Normal distribution (symmetric): When a density curve is fully symmetric, the mean and median are located exactly in the middle of the distribution. In a symmetric distribution, the mean and median are always found where both the left and right sides of the distribution have equal amounts of data.
- Skewed Distribution: Non-symmetric distributions with a leftward or rightward tilt are known as skewed distributions. We will examine two types of distributions: positively skewed (also known as right-skewed or right-tailed distributions) and negatively skewed (also known as left-skewed or left-tailed distributions).
- Left-skewed distribution (left-tailed distribution): In this distribution the "tail" is on the left. The median of a left-skewed distribution is still at the point that divides the area into two equal parts. The mean is further to the left than the median, more towards the tail on the left side, and the mode is where the data peaks.
- Right-skewed distribution (right-tailed distribution): In this distribution the "tail" is on the right. The median of a right-skewed distribution is still at the point that divides the area into two equal parts. The mean is further to the right than the median, more towards the tail on the right side, and the mode is still where the data peaks.
Formulas for calculating the low-outlier and high-outlier fences. They are common to both left-skewed and right-skewed distributions:
`"Low outlier" = Q_1 - 1.5(IQR)`
`"High outlier" = Q_3 + 1.5(IQR)`
`"where " IQR: "interquartile range" = Q_3 - Q_1`
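These fences can be sketched with the standard `statistics` module. One caveat: quartile values depend on the interpolation method, and `statistics.quantiles` defaults to the "exclusive" method, so other tools may give slightly different fences for the same data:

```python
import statistics

def outlier_fences(data):
    """Tukey's fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR.
    Quartiles come from statistics.quantiles (default 'exclusive' method)."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # three cut points -> quartiles
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# ages from the worked example below; any point outside (low, high)
# would be flagged as an outlier
low, high = outlier_fences([29, 33, 35, 38, 40, 43, 46, 49, 56, 63])
print(low, high)
```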
Solved Examples - Practice Exercises.
Example 1: A school principal wants to know the average age of teachers in her school. She took a sample of 10 teachers and recorded their ages. Find the sample mean and sample variance.
`"ages" = {43, 35, 38, 56, 29, 33, 46, 63, 49, 40}`
Solution: Let's find the sample mean first.
`bar x = (sum_(i=1)^n x_i)/n`
`bar x = (43+35+38+56+29+33+46+63+49+40)/10 = 43.2`
Now we'll use a table to calculate sample variance. We'll first calculate how far each data point is from the sample mean to get the deviation of each data point, and then we can calculate squared deviations.
ages | Deviations `(x_i-bar x)` | Squared Deviations `(x_i-bar x)^2` |
---|---|---|
43 | `43-43.2=-0.2` | 0.04 |
35 | `35-43.2=-8.2` | 67.24 |
38 | `38-43.2=-5.2` | 27.04 |
56 | `56-43.2=12.8` | 163.84 |
29 | `29-43.2=-14.2` | 201.64 |
33 | `33-43.2=-10.2` | 104.04 |
46 | `46-43.2=2.8` | 7.84 |
63 | `63-43.2=19.8` | 392.04 |
49 | `49-43.2=5.8` | 33.64 |
40 | `40-43.2=-3.2` | 10.24 |
Total | | 1007.6 |
Now let's find the sample variance:
`s^2 = (sum_(i=1)^n (x_i - bar x)^2)/(n-1)`
`s^2 = (1007.6)/(10-1)`
`s^2 ~~ 111.96`
Therefore, the mean age of teachers, based on the sample, is 43.2. The sample variance is 111.96.
Example 2: A school principal wants to know the average age of teachers in her school. She takes a sample of 10 teachers and finds sample mean `bar x =43.2` and sample variance `s^2 ~~ 111.96`. What is the standard deviation of her sample?
Solution: Let's find the standard deviation of the sample. As we know, the formula for the variance of sample data is:
`"variance formula: " s^2 = (sum_(i=1)^n (x_i - bar x)^2)/(n-1)`
`"Taking the square root of both sides: " sqrt(s^2) = sqrt((sum_(i=1)^n (x_i - bar x)^2)/(n-1))`
`-> s = sqrt((sum_(i=1)^n (x_i - bar x)^2)/(n-1)) <=> "standard deviation"`
Hence, `s = sqrt(s^2) = sqrt(111.96) ~~ 10.58`
Example 3: Find the population mean and standard deviation for the student data set.
`"data" = {6, 3, 3, 2, 2}`
Solution: Use the population's mean and standard deviation formula to find the results.
`"Correct Answer: " mu=3.2 " and " sigma ~~ 1.4697`
Example 4: Find the sample's mean and standard deviation for the student data set.
`"data" = {2, 4, 7, 9, 10}`
Solution: Use the sample's mean and standard deviation formula to find the results.
`"Correct Answer: " bar x = 6.4 " and " s ~~ 3.3615`
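Examples 3 and 4 can be checked with Python's standard `statistics` module, which distinguishes the population formulas (`pstdev`, divide by N) from the sample formulas (`stdev`, divide by n - 1):

```python
import statistics

# Example 3: population mean and standard deviation (divide by N)
print(statistics.mean([6, 3, 3, 2, 2]))               # -> 3.2
print(round(statistics.pstdev([6, 3, 3, 2, 2]), 4))   # -> 1.4697

# Example 4: sample mean and standard deviation (divide by n - 1)
print(statistics.mean([2, 4, 7, 9, 10]))              # -> 6.4
print(round(statistics.stdev([2, 4, 7, 9, 10]), 4))   # -> 3.3615
```

Picking `stdev` when `pstdev` is meant (or vice versa) is a common mistake; the two differ noticeably for small data sets like these.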
Example 5: With a standard deviation of 3", the mean height of an electricity pole is 55". What is the approximate percentile of an electricity pole that is 60" tall?
Solution:
`z = (x-mu)/sigma`
` = (60-55)/3 `
` ~~ 1.67`
Hence, based on a Z-table lookup (`z ~~ 1.67 -> .9525`), the approximate percentile is the 95th.
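Instead of a table lookup, the percentile can be read from the standard normal CDF, which Python ships in `statistics.NormalDist`:

```python
from statistics import NormalDist

z = (60 - 55) / 3                       # ~ 1.67
percentile = NormalDist().cdf(z) * 100  # standard normal CDF, as a percentage
print(round(percentile))                # -> 95
```

The CDF gives the fraction of the distribution at or below `z`, which is exactly what the z-tables above tabulate.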
Example 6: A sample value from the normally distributed sample data is 22, and its z-score is -0.50. What is the standard deviation of the distribution, if its mean is 24?
Solution:
`z = (x-mu)/sigma`
`-0.50 = (22-24)/sigma `
` sigma = (22-24)/-0.50`
`"standard deviation " = 4`
Example 7: Calculate the confidence interval and margin of error - Let's assume the mean of a sample is 80, the standard deviation is 20, the z-score is 1.96, and the sample size is 100. Calculate the margin of error and the confidence interval.
Solution : As given,
`"Sample Size(n)"= 100`
`barx = 80`
`sigma = 20`
`Z = 1.96 " [equivalent to confidence level 95% ]"`
`"Margin of error" = Z* sigma/sqrt(n)`
`" " = 1.96* 20/sqrt(100)`
`" " = 3.92`
`"Confidence interval" = bar x +- Z* sigma/sqrt(n) = 80 +- 3.92`
`"Lower Limit " = 80 - 3.92 = 76.08`
`"Upper Limit " = 80 + 3.92 = 83.92`
Confidence Level | Sample Size | z-score | Error Margin | Lower Limit | Upper Limit |
95% | 49 | 1.96 | 5.60 | 74.40 | 85.60 |
95% | 100 | 1.96 | 3.92 | 76.08 | 83.92 |
95% | 196 | 1.96 | 2.80 | 77.20 | 82.80 |
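The margin of error shrinks as the sample size grows, as the table shows. A sketch that computes `E = Z * sigma/sqrt(n)` and the resulting interval limits for several sample sizes (the function name is illustrative):

```python
import math

def margin_of_error(sigma: float, n: int, z: float = 1.96) -> float:
    """E = z * sigma / sqrt(n); z = 1.96 corresponds to a 95% confidence level."""
    return z * sigma / math.sqrt(n)

xbar, sigma = 80, 20
for n in (49, 100, 196):
    e = margin_of_error(sigma, n)
    # sample size, margin of error, lower limit, upper limit
    print(n, round(e, 2), round(xbar - e, 2), round(xbar + e, 2))
```

Note that quadrupling the sample size (49 to 196) only halves the margin of error, which is one reason larger samples quickly stop being worth their cost.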