Sampling Distribution

Generate Samples: Draw 1,000 samples of size 10 from a normal distribution with mean 5 and SD 3, and compute the mean and the SD of each sample.

means = replicate(1000,mean(rnorm(10,mean=5,sd=3)))   # 1,000 sample means
sds = replicate(1000,sd(rnorm(10,mean=5,sd=3)))       # 1,000 sample SDs (each call draws fresh samples)
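The two replicate() calls above each draw their own random samples, so the means and SDs do not come from the same 1,000 samples. A minimal sketch, assuming we want both statistics from the same draws (the names samples, means2, sds2 and the seed are illustrative, not from the original):

```r
set.seed(1)  # assumed seed, only for reproducibility

# Each column of `samples` is one sample of size 10 from N(mean = 5, sd = 3)
samples <- replicate(1000, rnorm(10, mean = 5, sd = 3))

means2 <- colMeans(samples)      # mean of each sample
sds2   <- apply(samples, 2, sd)  # SD of each sample

# The SD of the sample means should be close to the standard error 3 / sqrt(10)
sd(means2)
3 / sqrt(10)
```

Storing the draws in a matrix keeps each mean paired with the SD of the same sample, which matters if the two are later combined (for example in a confidence interval).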

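Notice the difference between the rep() and replicate() functions in R: rep() repeats a value that has been evaluated once, while replicate() re-evaluates the expression on every repetition, so each repetition draws a new random sample.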
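A small sketch of that difference, using a single random draw as the expression (the variable names and seed are assumptions for illustration):

```r
set.seed(2)  # assumed seed

a <- rep(rnorm(1), 3)        # rnorm(1) is evaluated once, then the value is repeated
b <- replicate(3, rnorm(1))  # rnorm(1) is re-evaluated for each replication

length(unique(a))  # 1: all three values are identical
length(unique(b))  # 3: three separate draws
```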

Limit Laws: Rules of Thumb

For most distributions, n > 30 will give a sampling distribution of the mean that is nearly normal
For fairly symmetric distributions, n > 15 is usually enough
For normal population distributions, the sampling distribution of the mean is always normally distributed, whatever the sample size
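To see the first rule of thumb at work, here is a hedged sketch that draws sample means from a strongly right-skewed population. The exponential population, the 5,000 replications, and the seed are assumptions chosen for illustration:

```r
set.seed(3)  # assumed seed

n <- 30
# Population: exponential with rate 1 (mean 1, SD 1), strongly right-skewed
xbar <- replicate(5000, mean(rexp(n, rate = 1)))

mean(xbar)  # should be close to the population mean, 1
sd(xbar)    # should be close to 1 / sqrt(n)

# If the sample means are nearly normal, about 95% of them should fall
# within 1.96 standard errors of the population mean
mean(abs(xbar - 1) < 1.96 / sqrt(n))
```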

Confidence Intervals: A point estimate is a single number; a confidence interval provides additional information about the variability of the estimate.
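As an illustration, one common choice is the 95% t-based interval for a mean: the sample mean plus or minus a t quantile times the estimated standard error. The sketch below assumes a single sample of size 10 from the same normal population used earlier; the seed and variable names are illustrative:

```r
set.seed(4)  # assumed seed
x <- rnorm(10, mean = 5, sd = 3)  # one sample of size 10

n  <- length(x)
se <- sd(x) / sqrt(n)  # estimated standard error of the mean
ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se

mean(x)  # point estimate
ci       # 95% confidence interval (lower, upper)

# Cross-check against R's built-in one-sample t interval
t.test(x)$conf.int
```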

Monte-Carlo Distributions Practical Codes

Example 1: There are two tasks. Task 1 duration follows a normal distribution with mean 5 days and standard deviation 1 day. Task 2 duration follows a uniform distribution between 3 and 7 days. What is the probability that both tasks are completed within 10 days if they are done sequentially? What if they are done in parallel? What is the median completion time in each case?

Solution:

  • Simulate times:

    T1 = rnorm(1000,5,1)
    T2 = runif(1000,min=3,max=7)
  • Compute probability for sequential project

    S = T1 + T2
    sum(S<10)/1000

  • Compute probability for parallel project

    P = pmax(T1,T2)
    sum(P<10)/1000

  • Compute medians:

    median(S)
    median(P)
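The simulated answers can be checked against exact values. Both task durations are symmetric about 5 days, so the sequential total is symmetric about 10 days: its median is 10 and the probability of finishing within 10 days is 0.5. The sketch below computes the remaining quantities numerically, assuming (as the simulation does) that the two tasks are independent; the names p_seq, p_par, cdf_max and med_par are illustrative:

```r
# Sequential: P(T1 + T2 < 10), integrating over the uniform duration u of Task 2
p_seq <- integrate(
  function(u) pnorm(10 - u, mean = 5, sd = 1) * dunif(u, min = 3, max = 7),
  lower = 3, upper = 7
)$value

# Parallel: both tasks must finish by day 10 (independence assumed)
p_par <- pnorm(10, mean = 5, sd = 1) * punif(10, min = 3, max = 7)

# Median of the parallel completion time: solve P(max(T1, T2) <= m) = 0.5
cdf_max <- function(m) pnorm(m, mean = 5, sd = 1) * punif(m, min = 3, max = 7)
med_par <- uniroot(function(m) cdf_max(m) - 0.5, interval = c(3, 10))$root

p_seq
p_par
med_par
```

With only 1,000 simulated projects, the simulated values will differ slightly from these from run to run.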

pmax function example:

    > x <- c(3, 26, 122, 6)
    > y <- c(43,2,54,8)
    > z <- c(9,32,1,9)
    > pmax(x,y,z)
    [1]  43  32 122   9

Example 2: Phoncessories manufactures several customized accessories for smartphones and packages them into boxes. Each box consists of 20 units. Processing each unit in a box takes 2 minutes (constant), so processing a box takes 20 × 2 = 40 minutes. The company classifies boxes as “simple” (ordered 60% of the time) or “complex” (ordered 40% of the time):

  • Simple box setup time is exponentially distributed with mean = 1 hour (60 minutes)
  • Complex box setup time is exponentially distributed with mean = 1.5 hours (90 minutes)

The firm would like to study the overall time to process a random order for a box.

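This is the code to generate a sample of production times (in minutes) for 1,000 boxes:

    TSimple=rexp(1000,rate=1/60)     # simple setup times, mean 60 minutes
    TComplex=rexp(1000,rate=1/90)    # complex setup times, mean 90 minutes
    OrderType=sample(c(0,1),size=1000,replace=TRUE,prob=c(0.6,0.4))    # 0 = simple, 1 = complex
    ProdTime=ifelse(OrderType==0,TSimple,TComplex)+40    # setup time plus 40 minutes of processing (20 units x 2 minutes)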
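Once the production times are simulated, they can be summarized and compared with theory. The expected production time is 0.6 × 60 + 0.4 × 90 + 40 = 112 minutes. The sketch below repeats the simulation with a larger number of boxes to reduce simulation noise; the value n = 100000 and the seed are assumptions, not part of the original example:

```r
set.seed(5)  # assumed seed
n <- 100000  # assumed: larger than the original 1,000 boxes

TSimple   <- rexp(n, rate = 1/60)
TComplex  <- rexp(n, rate = 1/90)
OrderType <- sample(c(0, 1), size = n, replace = TRUE, prob = c(0.6, 0.4))
ProdTime  <- ifelse(OrderType == 0, TSimple, TComplex) + 40

mean(ProdTime)                   # simulated mean, in minutes
0.6 * 60 + 0.4 * 90 + 40         # theoretical mean: 112 minutes
quantile(ProdTime, c(0.5, 0.9))  # median and 90th percentile, in minutes
```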

Binomial Distribution Practical Codes

Binomial Distribution Formula: PPT page 54

Example: PPT page 56

R Binomial Function

Density or point probability (d): dbinom(x, size, prob)

Cumulative probability distribution (p): pbinom(q, size, prob)

Quantiles (q): qbinom(p, size, prob)

Random generator (r): rbinom(n, size, prob)
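A quick sketch of the four function families on a hypothetical binomial variable with 10 trials and success probability 0.5 (these parameter values and the seed are chosen only for illustration):

```r
# Hypothetical example: X ~ Binomial(size = 10, prob = 0.5)
dbinom(3, size = 10, prob = 0.5)    # d: P(X = 3)
pbinom(3, size = 10, prob = 0.5)    # p: P(X <= 3)
qbinom(0.5, size = 10, prob = 0.5)  # q: smallest x with P(X <= x) >= 0.5

set.seed(6)                         # assumed seed
draws <- rbinom(5, size = 10, prob = 0.5)  # r: five random draws of X
draws
```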

Practice 1: Based on past records, there is a 0.08 probability that an online retail order is fraudulent. Suppose we have 20 orders. What is the probability that no order is fraudulent?

Analyze the Problem: Trial -> checking an order (fraudulent or not?); Success -> the order is fraudulent; n = 20 trials; π = probability of success = 0.08; X = the number of fraudulent orders among the 20

Solution: A point probability P(X = a) uses the d function; a cumulative probability P(X <= a) = F(a) uses the p function. P(X = 0) is what we want, so we use the d function.

R coding: PPT Page 63

dbinom(0, size = 20, prob = 0.08) # P( X = 0 )
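As a sanity check, the same probability can be computed from the binomial formula by hand. With zero successes the formula reduces to 0.92^20, roughly 0.19. The variable names below are illustrative:

```r
n <- 20    # number of orders
p <- 0.08  # probability that an order is fraudulent
k <- 0     # number of fraudulent orders we are asking about

# Binomial formula: choose(n, k) * p^k * (1 - p)^(n - k)
by_hand <- choose(n, k) * p^k * (1 - p)^(n - k)

by_hand
all.equal(by_hand, dbinom(k, size = n, prob = p))  # TRUE
```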

Practice 2: What is the probability of 2 or more fraudulent orders among the 20?

Solution: P(X>=2) = P(X>1) = 1 - F(1)

R Coding:

1-pbinom(1, size = 20, prob = 0.08)
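The complement rule can be cross-checked in two other ways: by summing the point probabilities for 2 through 20 fraudulent orders, or by asking pbinom for the upper tail directly. A short sketch (variable names are illustrative):

```r
direct     <- sum(dbinom(2:20, size = 20, prob = 0.08))              # P(X = 2) + ... + P(X = 20)
complement <- 1 - pbinom(1, size = 20, prob = 0.08)                  # 1 - F(1)
upper_tail <- pbinom(1, size = 20, prob = 0.08, lower.tail = FALSE)  # P(X > 1) directly

c(direct, complement, upper_tail)  # all three agree, roughly 0.48
```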

Mean & Standard Deviation in PPT page 60

EV.Fraud = 20* 0.08   # mean = n * pi = 1.6

SD.Fraud = sqrt(20*0.08*(1-0.08))   # SD = sqrt(n * pi * (1 - pi)), about 1.21

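Chebyshev Rule: At least 75% of the time, the count falls within 2 standard deviations of the mean. Here that interval is about -0.83 to 4.03, so at least 75% of the time we will get 0 to 4 fraudulent orders.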

EV.Fraud - 2*SD.Fraud
EV.Fraud + 2*SD.Fraud
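Chebyshev's rule only gives a lower bound. Since X is binomial here, the exact probability of 0 to 4 fraudulent orders can be computed and compared with the 75% bound. A minimal sketch (the names lower, upper and exact are illustrative):

```r
EV.Fraud <- 20 * 0.08
SD.Fraud <- sqrt(20 * 0.08 * (1 - 0.08))

# Whole-number counts inside the 2-SD interval; counts cannot be negative
lower <- max(0, ceiling(EV.Fraud - 2 * SD.Fraud))
upper <- floor(EV.Fraud + 2 * SD.Fraud)
c(lower, upper)  # 0 and 4

# Exact P(lower <= X <= upper); because lower is 0 this is just F(upper)
exact <- pbinom(upper, size = 20, prob = 0.08)
exact  # well above the 0.75 guaranteed by Chebyshev
```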

GitHub – tonyleidong