Stratified sampling (WOR) under Neyman allocation


-Pranav S. Bakal (2148110) 


Introduction 

Complete enumeration of a population is expensive, time-consuming and difficult to administer. To overcome these difficulties, we use sampling: we choose a small part of the population and use this sample to estimate various characteristics (parameters) of the population. The estimates must be close to the true values; otherwise the purpose of sampling is defeated. To achieve this, we use different methods of choosing a sample. An ideal sample is one that represents the population well. There are several sampling procedures, such as simple random sampling, cluster sampling and stratified sampling, and each has its own advantages and disadvantages. Here, we discuss stratified sampling in some detail.

Stratified Sampling 

A population may not be homogeneous, which makes it important to choose the sample so that it properly represents the population. In such situations, stratified sampling is used. In stratified sampling, we divide the population into smaller, homogeneous subgroups called strata (singular: stratum). Once stratification is done, we select an appropriate number of elements from each stratum, and these elements together make up our sample. We must bear two important points in mind while stratifying:

  1. Strata should be non-overlapping and should together make up the entire population, i.e., they should be mutually exclusive and exhaustive.

  2. Strata must be homogeneous within themselves.
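The two conditions on strata can be checked mechanically. The following is a minimal sketch on a made-up population (the data frame `pop`, its `age` values and the cut point 40 are assumptions for illustration only): an overlapping rule places some units in two strata, while `cut()` assigns every unit to exactly one.

```r
# Hypothetical population with an age variable (illustrative values only)
pop <- data.frame(id = 1:8, age = c(23, 35, 40, 40, 52, 61, 18, 47))

# Overlapping rule: age 40 satisfies both conditions, so the strata are not exclusive
young   <- pop$id[pop$age <= 40]
old     <- pop$id[pop$age >= 40]
overlap <- intersect(young, old)   # ids that fall in both strata
print(overlap)

# cut() places each unit in exactly one interval: (0, 40] or (40, Inf)
pop$stratum   <- cut(pop$age, breaks = c(0, 40, Inf), labels = c("<=40", ">40"))
stratum_sizes <- table(pop$stratum)
print(stratum_sizes)

# Exhaustive: no unit is left unassigned and the stratum sizes add up to the population size
print(!anyNA(pop$stratum) && sum(stratum_sizes) == nrow(pop))
```

If the last line prints `FALSE`, some unit was left out of every stratum and the stratification rule needs revisiting.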

The next important step in stratified sampling is deciding how many elements to take from each stratum; together these make up the total sample size. The allocation of the sample across strata depends on three factors: the total number of units in the stratum, the variability within the stratum, and the cost of taking an observation per sampling unit in the stratum. An allocation that makes the most effective use of the available resources is considered a good allocation. In stratified sampling, there are four common allocation methods: equal allocation, proportional allocation, Neyman allocation and optimum allocation. Here, we take a closer look at Neyman allocation.
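To see how the allocation methods differ, here is a small sketch comparing equal, proportional and Neyman allocation. The stratum sizes `Nh`, standard deviations `Sh` and total sample size `n` are hypothetical values chosen for illustration; optimum allocation is left out because it additionally needs per-unit costs.

```r
# Hypothetical stratum sizes and standard deviations (illustrative only)
Nh <- c(400, 300, 200, 100)
Sh <- c(10, 20, 30, 40)
n  <- 100   # total sample size to distribute

equal_alloc        <- rep(n / length(Nh), length(Nh))   # same size in every stratum
proportional_alloc <- n * Nh / sum(Nh)                  # proportional to stratum size
neyman_alloc       <- n * Nh * Sh / sum(Nh * Sh)        # proportional to size times variability

allocations <- rbind(equal        = equal_alloc,
                     proportional = proportional_alloc,
                     neyman       = neyman_alloc)
print(allocations)
```

Compared with proportional allocation, Neyman allocation moves sample units towards the strata with larger standard deviations.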

 

Neyman Allocation 

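This is also known as minimum variance allocation. It is based on the size and the variability of each stratum together. We assume that the sampling cost per element is the same in every stratum and that the total sample size n is fixed. Then, the sample sizes are allocated as

$$n_i = n \cdot \frac{N_i S_i}{\sum_{j=1}^{k} N_j S_j}$$

On substituting this value of $n_i$ into the variance of the stratified sample mean, we get the minimum variance as

$$V_{opt}(\bar{y}_{st}) = \frac{1}{n}\left(\sum_{i=1}^{k} W_i S_i\right)^2 - \frac{1}{N}\sum_{i=1}^{k} W_i S_i^2$$

Here, $n_i$ is the sample size from the ith stratum

$N_i$ is the size of the ith stratum

$S_i^2$ is the mean square for the ith stratum (so $S_i$ is its standard deviation)

n is the total sample size and N is the population size

$W_i$ is the weight given to the ith stratum, $W_i = N_i/N$

k is the total number of strata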
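The substitution step can be checked numerically. The sketch below assumes the standard formulas: Neyman sizes n_i = n N_i S_i / sum(N_j S_j), the general variance of the stratified mean sum(W_i^2 S_i^2 (1/n_i - 1/N_i)), and the closed form (1/n)(sum W_i S_i)^2 - (1/N) sum W_i S_i^2. The values of `Nh`, `Sh` and `n` are hypothetical.

```r
# Hypothetical stratum sizes and standard deviations (illustrative only)
Nh <- c(400, 300, 200, 100)
Sh <- c(10, 20, 30, 40)
n  <- 100
N  <- sum(Nh)
Wh <- Nh / N

# Neyman allocation
nh <- n * Nh * Sh / sum(Nh * Sh)

# General variance of the stratified mean, evaluated at the Neyman sizes
v_general <- sum(Wh^2 * Sh^2 * (1 / nh - 1 / Nh))

# Closed-form minimum variance
v_min <- (1 / n) * sum(Wh * Sh)^2 - (1 / N) * sum(Wh * Sh^2)

# Same general formula evaluated at proportional allocation, for comparison
nh_prop <- n * Wh
v_prop  <- sum(Wh^2 * Sh^2 * (1 / nh_prop - 1 / Nh))

print(c(v_general = v_general, v_min = v_min, v_prop = v_prop))
```

For these inputs the first two values agree, and the proportional-allocation variance is larger, which is what "minimum variance allocation" means.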

Advantages and Disadvantages of stratified sampling 

Stratification gives a smaller error in estimation and greater precision than simple random sampling. The greater the difference between the strata, the greater the gain in precision. Stratification also gives a sample that represents the population properly.
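The link between stratum differences and the gain in precision can be illustrated with a toy calculation. This is a sketch under stated assumptions: `make_pop`, `var_srs` and `var_neyman` are hypothetical helpers, and the two tiny populations differ only in how far apart their stratum means are.

```r
# Hypothetical population: two strata with the same spread, means `shift` apart
make_pop <- function(shift) {
  base <- c(-2, -1, 0, 1, 2)
  data.frame(y = c(base, base + shift), h = rep(c(1, 2), each = 5))
}

# Variance of the sample mean under simple random sampling without replacement
var_srs <- function(pop, n) {
  N <- nrow(pop)
  (1 / n - 1 / N) * var(pop$y)
}

# Minimum variance of the stratified mean under Neyman allocation
var_neyman <- function(pop, n) {
  N  <- nrow(pop)
  Nh <- tapply(pop$y, pop$h, length)
  Sh <- tapply(pop$y, pop$h, sd)
  Wh <- Nh / N
  (1 / n) * sum(Wh * Sh)^2 - (1 / N) * sum(Wh * Sh^2)
}

n <- 4
gain_near <- var_srs(make_pop(2), n)  - var_neyman(make_pop(2), n)
gain_far  <- var_srs(make_pop(10), n) - var_neyman(make_pop(10), n)
print(c(gain_near = gain_near, gain_far = gain_far))
```

The stratified variance is the same in both populations because the within-stratum spread is unchanged; only the simple random sampling variance grows as the strata move apart.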

Stratified sampling also has some disadvantages. Several conditions have to be satisfied for it to be used effectively. Overlap is an issue if there are elements that can fall in more than one stratum, so it is a serious drawback if we cannot confidently decide where to put an element. Also, creating an exhaustive and mutually exclusive list of elements is a difficult task.


R Code (example):


Let us consider data about heart disease obtained from Kaggle, recording different heart conditions and measurements of attributes such as chest pain type and maximum heart rate for patients in a hospital in the USA. Here, we try to estimate the mean and total cholesterol level of the patients.


### Finding stratum sizes, mean and standard deviation 

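```{r}
# `heart` is assumed to be already loaded as a data frame with columns
# `cp` (chest pain type, 0 to 3) and `chol` (cholesterol level).

# Split the population into four strata by chest pain type
stratum0 <- heart[heart$cp == "0", ]
stratum1 <- heart[heart$cp == "1", ]
stratum2 <- heart[heart$cp == "2", ]
stratum3 <- heart[heart$cp == "3", ]

head(stratum0)
head(stratum1)
head(stratum2)
head(stratum3)

# Stratum sizes and population size
N0 = nrow(stratum0)
N1 = nrow(stratum1)
N2 = nrow(stratum2)
N3 = nrow(stratum3)
N = sum(c(N0, N1, N2, N3))

# Stratum means
mean_stratum0 = mean(stratum0$chol)
mean_stratum1 = mean(stratum1$chol)
mean_stratum2 = mean(stratum2$chol)
mean_stratum3 = mean(stratum3$chol)

mean_stratum0
mean_stratum1
mean_stratum2
mean_stratum3

# Population mean: weight each stratum mean by its stratum size.
# mean() summarises only its first argument, so the four means are combined explicitly.
pop_mean = sum(c(N0, N1, N2, N3) *
               c(mean_stratum0, mean_stratum1, mean_stratum2, mean_stratum3)) / N
pop_mean

# Population total
pop_total = pop_mean * N
pop_total

# Stratum standard deviations
S0 = sd(stratum0$chol)
S1 = sd(stratum1$chol)
S2 = sd(stratum2$chol)
S3 = sd(stratum3$chol)

S0
S1
S2
S3
```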




Next, we find the size of the sample to be taken from each stratum, draw the samples accordingly and move on to the analysis.


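### Finding sizes of samples to be taken from different strata under Neyman allocation

```{r}
library(samplingbook)  # provides stratasamp() and stratamean()

# type = "opt" requests the optimal (Neyman) allocation based on stratum sizes and standard deviations
sample_sizes = stratasamp(102, c(N0, N1, N2, N3), c(S0, S1, S2, S3), type = "opt")
sample_sizes
```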
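The sizes drawn in the sampling step (48, 14, 34 and 5) add up to 101, although 102 is passed to `stratasamp`. One possible reason, assumed here rather than confirmed, is that fractional allocations are rounded stratum by stratum. The sketch below uses hypothetical `Nh`, `Sh` and `n` to show how rounding can change the total.

```r
# Hypothetical stratum sizes and standard deviations (illustrative only)
Nh <- c(170, 85, 80)
Sh <- c(2, 4, 4)
n  <- 10

exact_alloc   <- n * Nh * Sh / sum(Nh * Sh)   # fractional Neyman sizes
rounded_alloc <- round(exact_alloc)           # whole units per stratum

print(exact_alloc)
print(rounded_alloc)
print(sum(rounded_alloc))   # may differ from n after rounding
```

When the rounded total differs from the intended n, the realised total is the one to use in the variance formula.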



### Taking samples of appropriate sizes

```{r}
# Draw a simple random sample without replacement from each stratum
set.seed(0)
sample_strata0 <- stratum0[sample(1:nrow(stratum0), 48, replace = FALSE), ]
set.seed(1)
sample_strata1 <- stratum1[sample(1:nrow(stratum1), 14, replace = FALSE), ]
set.seed(2)
sample_strata2 <- stratum2[sample(1:nrow(stratum2), 34, replace = FALSE), ]
set.seed(3)
sample_strata3 <- stratum3[sample(1:nrow(stratum3), 5, replace = FALSE), ]

# Combine the four stratum samples into the final sample
final_sample <- rbind(sample_strata0, sample_strata1, sample_strata2, sample_strata3)
head(final_sample)
```



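### Estimation

```{r}
# Stratum weights
W0 = N0/N
W1 = N1/N
W2 = N2/N
W3 = N3/N

# Stratified estimate of the population mean
res = stratamean(y = final_sample$chol, h = as.vector(final_sample$cp), wh = c(W0, W1, W2, W3))
res

est_pop_mean = res$mean
est_pop_mean

# Estimated population total
est_pop_total = N * res$mean
est_pop_total

# Minimum variance under Neyman allocation,
# with total sample size n = 48 + 14 + 34 + 5 = 101 and population size N = 303
n = 48 + 14 + 34 + 5
v_opt = (1/n) * (S0 * W0 + S1 * W1 + S2 * W2 + S3 * W3)^2 - (1/N) * (W0 * S0^2 + W1 * S1^2 + W2 * S2^2 + W3 * S3^2)
v_opt
```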
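The stratified estimates can also be computed by hand, which makes the role of the weights explicit. This is a minimal sketch on a made-up sample (the values of `y`, the labels `h` and the stratum sizes `Nh` are assumptions for illustration), using the usual estimators; it is not a description of how `stratamean` works internally.

```r
# Hypothetical sample values with their stratum labels (illustrative only)
y  <- c(10, 12, 14, 20, 22, 30, 34, 38)
h  <- c(0, 0, 0, 1, 1, 2, 2, 2)
Nh <- c("0" = 50, "1" = 30, "2" = 20)   # hypothetical stratum sizes
N  <- sum(Nh)
Wh <- Nh / N

ybar_h <- tapply(y, h, mean)     # sample mean in each stratum
nh     <- tapply(y, h, length)   # sample size in each stratum
s2_h   <- tapply(y, h, var)      # sample variance in each stratum

# Align the weights and sizes with the order of the tapply() results
Wh <- Wh[names(ybar_h)]
Nh <- Nh[names(ybar_h)]

mean_st  <- sum(Wh * ybar_h)     # stratified estimate of the population mean
total_st <- N * mean_st          # estimated population total

# Estimated variance of the stratified mean, with finite population correction
v_hat <- sum(Wh^2 * (s2_h / nh) * (1 - nh / Nh))
se_st <- sqrt(v_hat)

print(c(mean = mean_st, total = total_st, se = se_st))
```

Each stratum mean enters the estimate with weight N_h/N, so a stratum with few sampled units still counts according to its share of the population.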




Code explanation and analysis:

We first divide the population (here, heart) into 4 strata (stratum0, stratum1, stratum2, stratum3) according to the type of chest pain (cp). We also find the population mean and total to compare with the estimated values. The standard deviation of each stratum is used in the allocation and later in the variance formula, where W0, W1, W2, W3 act as weights. Next, we take samples (sample_strata0, sample_strata1, sample_strata2, sample_strata3) of the appropriate sizes (under Neyman allocation) from each stratum and combine them into the complete sample (final_sample). Samples are drawn without replacement; that is, once an element is chosen, it cannot be chosen again. We then estimate the population mean and total from this sample.

  1. We get the estimated population mean as 239.9202, while the actual value is 250.1329.

  2. The estimated population total is 72695.83 and the real value is 75790.26.


