Introduction to dplyr

Review of Statistics

Descriptive Measures:


The central tendency is the extent to which all the data values group around a typical or central value. The variation is the amount of dispersion or scattering of values. The shape is the pattern of the distribution of values from the lowest value to the highest value.

Central Tendency:

The mean is generally used unless extreme values (outliers) exist. In that case the median is often preferred, since it is not sensitive to extreme values. For example, median home prices may be reported for a region because a few very expensive homes would distort the mean. In some situations, it makes sense to report both the mean and the median.
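
As a small illustration of why the median resists outliers, the sketch below compares the two measures on a made-up vector of home prices. The values and the variable names (prices, mean_price, median_price) are assumptions chosen for the example, not data from any real region.

```r
# Hypothetical home prices (in thousands); the last value is an outlier.
prices <- c(210, 225, 240, 250, 265, 280, 1900)

mean_price   <- mean(prices)    # pulled upward by the single extreme value
median_price <- median(prices)  # middle value, unaffected by the outlier

cat("mean:", mean_price, " median:", median_price, "\n")
```

Here the median stays at the middle observation, while the mean is dragged well above every price except the outlier.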

Variation Measures:

  • Coefficient of Variation: CV = (standard deviation / mean) × 100%. Measures relative variation. Always expressed as a percentage (%). Shows variation relative to the mean. Can be used to compare the variability of two or more sets of data measured in different units.

  • Locating Extreme Outliers: Z-Score:

To compute the Z-score of a data value, subtract the mean and divide by the standard deviation. The Z-score is the number of standard deviations a data value is from the mean. A data value is considered an extreme outlier if its Z-score is less than -3.0 or greater than +3.0. The larger the absolute value of the Z-score, the farther the data value is from the mean.
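
The two variation measures above can be sketched in a few lines of base R. The data vector is hypothetical, with one value (95) placed deliberately far from the rest, and the names cv, z, and extreme are illustrative.

```r
# Hypothetical sample; 95 is deliberately far from the other values.
x <- c(12, 15, 14, 10, 13, 16, 11, 14, 12, 15,
       12, 15, 14, 10, 13, 16, 11, 14, 12, 15, 95)

# Coefficient of variation: standard deviation relative to the mean, in percent.
cv <- sd(x) / mean(x) * 100

# Z-score: subtract the mean, then divide by the standard deviation.
z <- (x - mean(x)) / sd(x)

# Values more than 3 standard deviations from the mean are extreme outliers.
extreme <- x[abs(z) > 3]

cat("CV (%):", round(cv, 1), "\n")
cat("extreme outliers:", extreme, "\n")
```

Note that in a very small sample no value can reach |Z| > 3, because the outlier itself inflates the standard deviation, so the rule is most useful with a reasonable number of observations.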

Chebyshev’s Rule

This rule applies to all types of probability distributions. Regardless of how the data are distributed, at least (1 - 1/k^2) × 100% of the values will fall within k standard deviations of the mean (for k > 1). For example, at least 75% of the values lie within 2 standard deviations of the mean.
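
To see the rule at work on clearly non-normal data, the following sketch draws a skewed sample and compares the observed share of values within k standard deviations to the Chebyshev lower bound. The choice of an exponential sample, the seed, and k = 2 are assumptions made for illustration.

```r
set.seed(1)

# A strongly right-skewed sample (exponential), chosen only for illustration.
x <- rexp(10000, rate = 1)

k <- 2
bound <- 1 - 1 / k^2                          # Chebyshev lower bound

# Observed proportion of values within k standard deviations of the mean.
within <- mean(abs(x - mean(x)) <= k * sd(x))

cat("Chebyshev bound:", bound, "\n")
cat("observed share: ", within, "\n")
```

The bound is deliberately conservative: it holds for any distribution, so the observed share is usually much higher than the guaranteed minimum.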

The Binomial Distribution (discrete distribution)

A fixed number of observations, n (e.g., 15 tosses of a coin, or ten light bulbs taken from a warehouse).
Each observation is categorized as to whether or not the “event of interest” occurred.
Observations are independent: the outcome of one observation does not affect the outcome of another.
Two sampling methods deliver independence: sampling from an infinite population without replacement, and sampling from a finite population with replacement.

For most common distributions, R comes with four types of functions associated with the distribution:

Density or point probability (d)
Cumulative probability distribution (p)
Quantiles (q)
Random number generator (r)

The associated functions for the binomial distribution are:

dbinom(x, size, prob)                       # P(X = x)
pbinom(q, size, prob, lower.tail = TRUE)    # F(q) = P(X <= q)
qbinom(p, size, prob, lower.tail = TRUE)    # smallest x with P(X <= x) >= p
rbinom(n, size, prob)                       # n random draws

Parameters:

x: vector or single value to be evaluated
size: number of trials
prob: probability of success
lower.tail: whether to start from the lower tail (TRUE) or the upper tail (FALSE)
q: vector or single value of quantiles
p: vector or single value of probabilities
n: number of observations
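
As a rough usage sketch of the four binomial functions, the example below takes the 15 coin tosses mentioned earlier and assumes a fair coin (prob = 0.5). The variable names and the seed are illustrative.

```r
n_tosses <- 15    # fixed number of observations
p_heads  <- 0.5   # assumed probability of the event of interest (fair coin)

# P(X = 7): probability of exactly 7 heads.
p_exactly_7 <- dbinom(7, size = n_tosses, prob = p_heads)

# P(X <= 7) and its complement P(X > 7).
p_at_most_7   <- pbinom(7, size = n_tosses, prob = p_heads)
p_more_than_7 <- pbinom(7, size = n_tosses, prob = p_heads, lower.tail = FALSE)

# Smallest number of heads x such that P(X <= x) >= 0.9.
q_90 <- qbinom(0.9, size = n_tosses, prob = p_heads)

# 1000 simulated experiments of 15 tosses each.
set.seed(42)
draws <- rbinom(1000, size = n_tosses, prob = p_heads)

cat("P(X = 7):", p_exactly_7, "\n")
cat("P(X <= 7) + P(X > 7):", p_at_most_7 + p_more_than_7, "\n")
```

Setting lower.tail = FALSE returns the upper-tail probability P(X > q) directly, which avoids computing 1 - pbinom(...) by hand.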

Poisson Distribution

You use the Poisson distribution (a discrete distribution) when you are interested in the number of times an event occurs in a given area of opportunity. An area of opportunity is a continuous unit or interval of time, volume, or similar area in which more than one occurrence of an event can occur. For example:

The number of scratches in a car’s paint
The number of mosquito bites on a person
The number of computer crashes in a day

The associated functions for the Poisson distribution are:

dpois(x, lambda)
ppois(q, lambda, lower.tail = TRUE)
qpois(p, lambda, lower.tail = TRUE)
rpois(n, lambda)

Apply the Poisson Distribution when:

You wish to count the number of times an event occurs in a given area of opportunity.
The probability that an event occurs in one area of opportunity is the same for all areas of opportunity.
The number of events that occur in one area of opportunity is independent of the number of events that occur in the other areas of opportunity.
The probability that two or more events occur in an area of opportunity approaches zero as the area of opportunity becomes smaller.
The average number of events per unit is lambda.
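
A minimal sketch of the Poisson functions, using the computer-crash example and assuming an average of 2 crashes per day. That value of lambda, the variable names, and the seed are made up for illustration.

```r
lambda <- 2   # assumed average number of crashes per day

# P(X = 0): probability of a crash-free day.
p_none <- dpois(0, lambda)

# P(X <= 3) and P(X > 3).
p_at_most_3   <- ppois(3, lambda)
p_more_than_3 <- ppois(3, lambda, lower.tail = FALSE)

# Smallest count x such that P(X <= x) >= 0.95.
q_95 <- qpois(0.95, lambda)

# Simulate one year of daily crash counts.
set.seed(42)
crashes <- rpois(365, lambda)

cat("P(no crashes):", p_none, "\n")
cat("simulated daily average:", mean(crashes), "\n")
```

The simulated daily average should land close to lambda, since lambda is the expected number of events per area of opportunity.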

The Normal Distribution (continuous distribution)

Bell-shaped and symmetrical.
Mean, median, and mode are equal.
Location is determined by the mean, μ.
Spread is determined by the standard deviation, σ.
The random variable has an infinite theoretical range: from -∞ to +∞.

Checking whether data are approximately normally distributed:

A normal distribution is bell-shaped and symmetric, with mean = median = mode.
The best indication is having those properties together with a kurtosis approximately equal to 3.
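
Base R has no built-in kurtosis function, so the sketch below defines a small hypothetical helper, kurtosis(), from the moment formula (fourth central moment divided by the squared variance). It is applied to a simulated normal sample and, for contrast, a skewed exponential sample. The seed and sample sizes are arbitrary.

```r
# Hypothetical helper: moment-based kurtosis (about 3 for a normal distribution).
kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

set.seed(123)
normal_sample <- rnorm(100000)   # simulated normal data
skewed_sample <- rexp(100000)    # simulated right-skewed data for contrast

k_normal <- kurtosis(normal_sample)
k_skewed <- kurtosis(skewed_sample)

cat("normal: mean", mean(normal_sample), "median", median(normal_sample),
    "kurtosis", k_normal, "\n")
cat("skewed: mean", mean(skewed_sample), "median", median(skewed_sample),
    "kurtosis", k_skewed, "\n")
```

For the normal sample the mean and median nearly coincide and the kurtosis is close to 3. For the skewed sample the mean exceeds the median and the kurtosis is much larger.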

The Standardized Normal Distribution

Any normal distribution (with any mean and standard deviation combination) can be transformed into the standardized normal distribution (Z). To do so, transform X units into Z units using Z = (X - μ) / σ. The standardized normal distribution (Z) has a mean of 0 and a standard deviation of 1.

The associated functions for the normal distribution are:

dnorm(x, mean = 0, sd = 1)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE)
rnorm(n, mean = 0, sd = 1)
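
The sketch below illustrates the standardization step: a probability computed on the original X scale matches the one computed on the Z scale. The mean of 100, standard deviation of 15, and the value 130 are assumptions chosen to give a round Z-score.

```r
mu    <- 100   # assumed mean of X
sigma <- 15    # assumed standard deviation of X
x     <- 130   # value of interest

# Transform X units into Z units.
z <- (x - mu) / sigma

# P(X <= 130) computed directly and via the standardized normal.
p_direct   <- pnorm(x, mean = mu, sd = sigma)
p_standard <- pnorm(z)   # defaults: mean = 0, sd = 1

# Value of X below which 95% of the distribution lies.
x_95 <- qnorm(0.95, mean = mu, sd = sigma)

cat("z:", z, "\n")
cat("P(X <= 130):", p_direct, " via Z:", p_standard, "\n")
cat("95th percentile of X:", x_95, "\n")
```

Because pnorm, qnorm, dnorm, and rnorm accept mean and sd directly, the manual transformation is rarely needed in practice. It is shown here only to make the equivalence visible.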

The Exponential Distribution (continuous distribution)

Often used to model the length of time between two occurrences of an event (the time between arrivals).

Examples:

Time between trucks arriving at an unloading dock
Time between transactions at an ATM
Time between phone calls to the main operator

The expected value and the standard deviation of an exponentially distributed random variable X are both equal to the inverse of the rate, 1/λ.

The associated functions for the exponential distribution are:

dexp(x, rate = 1)
pexp(q, rate = 1, lower.tail = TRUE)
qexp(p, rate = 1, lower.tail = TRUE)
rexp(n, rate = 1)

Parameter: rate (Lambda)
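
As an illustrative sketch, suppose customers arrive at an ATM at an assumed rate of 0.5 per minute, so the mean time between arrivals is 1/0.5 = 2 minutes. The rate, the variable names, and the seed are all hypothetical.

```r
rate <- 0.5   # assumed arrivals per minute; mean wait = 1 / rate = 2 minutes

# P(wait <= 1 minute) and P(wait > 5 minutes).
p_within_1 <- pexp(1, rate = rate)
p_longer_5 <- pexp(5, rate = rate, lower.tail = FALSE)

# Median waiting time.
median_wait <- qexp(0.5, rate = rate)

# Simulated waiting times: mean and SD should both be near 1 / rate.
set.seed(42)
waits <- rexp(100000, rate = rate)

cat("P(wait <= 1):", p_within_1, "\n")
cat("median wait:", median_wait, "\n")
cat("simulated mean:", mean(waits), " simulated sd:", sd(waits), "\n")
```

The median wait is shorter than the mean wait, which reflects the right skew of the exponential distribution.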

The Uniform Distribution

The uniform distribution is a probability distribution that has equal probabilities for all possible outcomes of the random variable. It is also called a rectangular distribution.

The associated functions for the uniform distribution are:

dunif(x, min = 0, max = 1)
punif(q, min = 0, max = 1, lower.tail = TRUE)
qunif(p, min = 0, max = 1, lower.tail = TRUE)
runif(n, min = 0, max = 1)
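
A short sketch of the uniform functions on an assumed interval from 0 to 10. The interval and the variable names are chosen only for illustration.

```r
a <- 0    # assumed lower limit
b <- 10   # assumed upper limit

# The density is constant, 1 / (b - a), everywhere inside the interval.
density_at_3 <- dunif(3, min = a, max = b)

# P(2 <= X <= 7) is the width of the sub-interval divided by (b - a).
p_between <- punif(7, min = a, max = b) - punif(2, min = a, max = b)

# First quartile of the distribution.
q_25 <- qunif(0.25, min = a, max = b)

# Five random draws from the interval.
set.seed(42)
draws <- runif(5, min = a, max = b)

cat("density:", density_at_3, " P(2 <= X <= 7):", p_between,
    " first quartile:", q_25, "\n")
```

The flat density is what gives the distribution its rectangular shape.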

Monte-Carlo Distributions

We can compute probabilities for a combination of the previous random variables by using the random number generator functions. This is called the Monte-Carlo approach: generate a large sample of the variable we want to study, then use the sample to estimate the required distribution or probabilities.
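
A minimal sketch of the Monte-Carlo approach: estimate the probability that the sum of two independent exponential waiting times exceeds 3. This case is chosen because the exact answer is known (the sum follows a gamma distribution with shape 2), which lets the estimate be checked. The rates, the threshold, the sample size, and the seed are assumptions.

```r
set.seed(2024)
n_sim <- 100000   # size of the simulated sample

# Step 1: generate a large sample of the variable we want to study.
wait_1 <- rexp(n_sim, rate = 1)
wait_2 <- rexp(n_sim, rate = 1)
total  <- wait_1 + wait_2

# Step 2: use the sample to estimate the required probability.
p_estimate <- mean(total > 3)

# Exact value for comparison: the sum of two Exp(1) variables is Gamma(shape = 2, rate = 1).
p_exact <- pgamma(3, shape = 2, rate = 1, lower.tail = FALSE)

cat("Monte-Carlo estimate:", p_estimate, "\n")
cat("exact probability:   ", p_exact, "\n")
```

The same two steps apply when no closed-form answer exists, for example when mixing normal, uniform, and exponential variables. Only the line that builds the simulated variable changes.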

Residuals

A residual is the difference between an observed value of the dependent variable and the value predicted by the linear model. Residuals are important for checking the assumptions of the regression model.

R^2 Statistic

The R^2 statistic indicates how well an estimated regression function fits the data, with 0 <= R^2 <= 1. It measures the proportion of the total variation in Y around its mean that is accounted for by the estimated regression equation.

F-test for Overall Significance of the Model

Shows whether there is a linear relationship between all of the X variables considered together and Y.
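
The three regression quantities above can be read off a fitted model in base R. The sketch below uses the built-in cars dataset (stopping distance against speed) purely as a convenient example. The choice of dataset and the variable names are assumptions, not part of the original notes.

```r
# Simple linear regression on R's built-in cars dataset.
fit <- lm(dist ~ speed, data = cars)

# Residuals: observed values minus the values predicted by the model.
res <- residuals(fit)

# R^2 from the model summary, and the same value computed by hand.
fit_summary <- summary(fit)
r2          <- fit_summary$r.squared
r2_manual   <- 1 - sum(res^2) / sum((cars$dist - mean(cars$dist))^2)

# Overall F-test: F statistic with its degrees of freedom, then the p-value.
f_stat  <- fit_summary$fstatistic
p_value <- pf(f_stat[["value"]], f_stat[["numdf"]], f_stat[["dendf"]],
              lower.tail = FALSE)

cat("R^2:", r2, " (by hand:", r2_manual, ")\n")
cat("F statistic:", f_stat[["value"]], " p-value:", p_value, "\n")
```

A small p-value for the F-test suggests that the X variables, taken together, have a linear relationship with Y. Plotting res against fitted(fit) is a common way to check the model assumptions visually.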

GitHub – tonyleidong