Exhaustive Comparison of SRSWOR and SRSWR

Aim

We aim to write a brief blog post on SRSWOR and SRSWR and to use R programming to draw the important conclusions.

Objective

The objective of this blog is to illustrate that SRSWOR is a more efficient sampling technique than SRSWR.

What is Sampling? 

To get to the deeper concepts, we first need to understand what a sample is. A sample is a small part or quantity taken from something larger (the population) and is usually meant to be representative of it. A random sample is one in which each population unit has an equal chance of getting picked.

This photo illustrates the act of picking a random sample from a population.


While picking a random sample, the researcher takes care to avoid bias towards any of the units, i.e. the researcher must not know the outcome of any draw in advance.

Sampling in Our Daily Life

The purpose of this study is not to introduce a new concept but to shed light on the importance of an existing one, since sampling goods before actually using them is a practice as old as time. Early humans sampled berries before bringing them back home, and modern humans sample groceries at a supermarket before buying more of the same. Sampling has been all around us throughout time, and now we are going to discuss the two ways of drawing a simple random sample and compare them using real data.

Procedure of selection of a random sample
The selection of a random sample follows these steps:
1. Identify the N units in the population with the numbers 1 to N.
2. Choose an arbitrary starting point in the random number table and start reading numbers.
3. Choose the sampling unit whose serial number corresponds to the random number drawn from the table of random numbers.
4. In the case of SRSWR, all the random numbers are accepted, even if repeated more than once. In the case of SRSWOR, if any random number is repeated, it is ignored and more numbers are drawn.
As a result, units can be repeated in SRSWR but never in SRSWOR.
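The steps above can be sketched in R. This is a minimal illustration, not the code used later in this post: the function name `draw_srs`, its arguments, and the sizes N = 10 and n = 5 are all hypothetical, and `sample.int()` stands in for reading a number from a random number table.

```r
# Hypothetical helper that mimics the random-number-table procedure.
draw_srs <- function(N, n, with_replacement) {
  # Without replacement we cannot draw more distinct units than exist
  stopifnot(with_replacement || n <= N)
  chosen <- integer(0)
  while (length(chosen) < n) {
    r <- sample.int(N, 1)   # one random number between 1 and N
    if (!with_replacement && r %in% chosen) {
      next                  # SRSWOR: ignore a repeated number and draw again
    }
    chosen <- c(chosen, r)  # SRSWR: every number is accepted, repeats included
  }
  chosen
}

set.seed(1)  # arbitrary seed, only so the illustration is reproducible
draw_srs(N = 10, n = 5, with_replacement = TRUE)   # labels may repeat
draw_srs(N = 10, n = 5, with_replacement = FALSE)  # labels are all distinct
```

In practice, `sample()` with `replace = TRUE` or `replace = FALSE` does the same job in a single call; the explicit loop is only meant to mirror the steps listed above.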

Assumptions in SRS
1. Units are selected by a random process, so the outcome of any particular draw is not known in advance.
2. Each sampling unit has a predefined probability of being picked, and every unit conforms to it.
3. The sampling method used must yield a unique estimate of the parameters.
4. The set of samples from which the sample for estimation is to be picked is well defined.

Limitations
1. An accurate estimation of population parameters demands a complete list of the population, which may not be readily available at the time of analysis.
2. Following from the above, when we do not have the entire list of the population, attempts to cover the whole population can be very time consuming and often not worth the time and effort.
3. Sampling bias may occur during selection: the sample picked from the sample set may not be unbiased, i.e. may not be random through and through.


SRSWR (Simple random sampling with replacement)
This is a sampling technique in which every unit has an equal chance, 1/N, of being picked at every draw, so each ordered sample of n units has a 1/N^n chance of being selected. The method can be visualised as a bowl containing many sampling units: the researcher draws a unit, puts it back into the bowl, and then picks another unit from the now replenished bowl.

Formula
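Since each unit has a 1/N chance of being selected at any draw, the expected value of every draw is the population mean.
Among the many estimators for the population mean and variance, the sample arithmetic mean and variance are popular picks. For the estimation of the population mean, we take the simple ȳ = (1/n) * Σyi, where i goes from 1 to n. Averaging n draws that each have the population mean as their expected value shows that ȳ is an unbiased estimator of the population mean.
For the variance of ȳ, we split the variance equation into the variances of the individual draws plus a term K that collects the cross products between the ith and jth draws (i ≠ j).
Under SRSWR, K = 0, since the ith and jth draws are independent.
This causes the variance under SRSWR to be (N-1)S²/(Nn), where S² is the population variance computed with divisor N-1.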
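The steps of this section can be written out as follows. This is a sketch of the standard textbook derivation, not a reproduction of the post's own equations. The notation is assumed here: $Y_k$ is the value of the k-th population unit, $\bar{Y}$ the population mean, $\sigma^2$ the population variance with divisor $N$, and $S^2$ the population variance with divisor $N-1$, so that $\sigma^2 = \frac{N-1}{N}S^2$.

```latex
\begin{align*}
E(y_i) &= \sum_{k=1}^{N} \frac{1}{N}\, Y_k = \bar{Y}
  \quad\Longrightarrow\quad
  E(\bar{y}) = \frac{1}{n}\sum_{i=1}^{n} E(y_i) = \bar{Y} \\[6pt]
\operatorname{Var}(\bar{y})
  &= \frac{1}{n^2}\left[\sum_{i=1}^{n} E\left(y_i-\bar{Y}\right)^2 + K\right],
  \qquad
  K = \sum_{i \neq j} E\left[\left(y_i-\bar{Y}\right)\left(y_j-\bar{Y}\right)\right] \\[6pt]
\text{SRSWR (independent draws):}\quad K &= 0 \\[6pt]
\operatorname{Var}_{WR}(\bar{y})
  &= \frac{1}{n^2}\, n\,\sigma^2 = \frac{\sigma^2}{n} = \frac{N-1}{Nn}\,S^2
\end{align*}
```

The last expression is the formula the R code later in the post evaluates as `vsrswr`.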

SRSWOR
This method of sampling is close to the SRSWR method explained above, except that here a unit, once picked, is not put back into the sampling pool. Consequently, for a given sample of n units, the probability that the first draw picks one of them is n/N, the probability that the second draw picks one of the remaining ones is (n-1)/(N-1), then (n-2)/(N-2), and so on. Multiplying these probabilities, every possible sample of n distinct units is equally likely, with probability 1/C(N,n), where C(N,n) is the number of ways of choosing n units out of N.
For population mean estimation, we start with steps similar to those in SRSWR and take the same ȳ, which can again be shown to be an unbiased estimator of the population mean.
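Unbiasedness under both schemes can be checked exhaustively on a tiny population. The sketch below uses a made-up population of five values (an assumption for illustration, not the dataset analysed later in this post) and enumerates every possible sample of size 2 with base R's `combn()` and `expand.grid()`.

```r
# Hypothetical population of N = 5 values, chosen only for illustration
y <- c(2, 4, 7, 11, 16)
N <- length(y)
n <- 2

# SRSWOR: all choose(N, n) unordered samples, one per column
wor_samples <- combn(y, n)
wor_means <- colMeans(wor_samples)

# SRSWR: all N^n ordered samples, one per row
wr_samples <- expand.grid(rep(list(y), n))
wr_means <- rowMeans(wr_samples)

# Every sample within a scheme is equally likely, so the expected value of
# the sample mean is the plain average of the sample means.
mean(y)          # population mean
mean(wor_means)  # agrees with mean(y)
mean(wr_means)   # agrees with mean(y)
```

Because each scheme gives all of its samples the same probability, a simple average over the enumerated samples is exactly the expectation; no simulation is involved.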
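Now, for the variance, we start from the same base equation as above; only the value of K changes.
Because a unit cannot be drawn twice, the draws are no longer independent and K is negative, which gives the variance of ȳ under SRSWOR as (N-n)S²/(Nn).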
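The SRSWOR variance can be derived along the same lines. Again this is a sketch of the standard derivation under assumed notation: $Y_k$ is the value of the k-th population unit, $\bar{Y}$ the population mean, $S^2$ the population variance with divisor $N-1$, and $K$ the sum of cross-product expectations over pairs of distinct draws.

```latex
\begin{align*}
E\left[\left(y_i-\bar{Y}\right)\left(y_j-\bar{Y}\right)\right]
  &= \frac{1}{N(N-1)} \sum_{k \neq l} \left(Y_k-\bar{Y}\right)\left(Y_l-\bar{Y}\right)
   = -\frac{1}{N(N-1)} \sum_{k=1}^{N} \left(Y_k-\bar{Y}\right)^2
   = -\frac{S^2}{N},
   \qquad i \neq j \\[6pt]
K &= n(n-1)\left(-\frac{S^2}{N}\right) = -\frac{n(n-1)}{N}\,S^2 \\[6pt]
\operatorname{Var}_{WOR}(\bar{y})
  &= \frac{1}{n^2}\left[\, n\,\frac{N-1}{N}\,S^2 - \frac{n(n-1)}{N}\,S^2 \right]
   = \frac{N-n}{Nn}\,S^2
\end{align*}
```

The second equality in the first line uses the fact that the deviations $Y_k-\bar{Y}$ sum to zero, so the sum of their cross products equals minus the sum of their squares. The final expression is the formula the R code later in the post evaluates as `vsrswor`.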


Gain In Efficiency

Now, before going into the code and experimentation, we look at the theoretical side of the claim that the sample mean under SRSWOR is a more efficient estimator of the population mean than under SRSWR.
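A compact way to state the claim, assuming the usual notation ($N$ population size, $n$ sample size, $S^2$ population variance with divisor $N-1$) and the two variance formulas that the R code in this post evaluates, is the following sketch:

```latex
\begin{align*}
\operatorname{Var}_{WR}(\bar{y}) - \operatorname{Var}_{WOR}(\bar{y})
  &= \frac{N-1}{Nn}\,S^2 - \frac{N-n}{Nn}\,S^2
   = \frac{n-1}{Nn}\,S^2 \;\geq\; 0 \\[6pt]
\frac{\operatorname{Var}_{WR}(\bar{y}) - \operatorname{Var}_{WOR}(\bar{y})}
     {\operatorname{Var}_{WR}(\bar{y})}
  &= \frac{n-1}{N-1}
\end{align*}
```

The relative gain therefore depends only on $N$ and $n$, not on the data values. With $N = 187$ and $n = 32$, the sizes used in the experiment, it equals $31/186 = 1/6 \approx 0.1667$.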
##################################################################################
To support the theory, we take publicly available data from Kaggle and run a sampling exercise under both methods to compare their variances.

Aim of the Experiment

The aim here is to take a dataset and use one of its variables to compare the variance of the estimates under SRSWOR and SRSWR. Since we have already argued above that SRSWOR is the more efficient sampling method, this experiment serves to illustrate that statement with real data.

Data Description

The data were acquired from a freely available Kaggle dataset which records various COVID-19 details such as deaths, confirmed cases and recoveries, broken down by country. We take the number of deaths as our study variable and draw samples using both SRSWOR and SRSWR.

Codes and analysis

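library(readxl)  # provides read_excel()
# The path below is specific to the author's machine; point it at your own copy of the file
DS <- read_excel("C:/Users/ShinPyro/Desktop/College/Assignment/Countrywise Covid dist.xlsx")
is.data.frame(DS)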

## [1] TRUE

# head() gives a first look at the dataset; is.data.frame() above has
# already confirmed that the data are stored as a data frame.
head(DS)

## # A tibble: 6 x 15
##   `Country/Region`    Confirmed Deaths Recovered Active `New cases` `New deaths`
##   <chr>                   <dbl>  <dbl>     <dbl>  <dbl>       <dbl>        <dbl>
## 1 Afghanistan             36263   1269     25198   9796         106           10
## 2 Albania                  4880    144      2745   1991         117            6
## 3 Algeria                 27973   1163     18837   7973         616            8
## 4 Andorra                   907     52       803     52          10            0
## 5 Angola                    950     41       242    667          18            1
## 6 Antigua and Barbuda        86      3        65     18           4            0
## # ... with 8 more variables: New recovered <dbl>, Deaths / 100 Cases <dbl>,
## #   Recovered / 100 Cases <dbl>, Deaths / 100 Recovered <chr>,
## #   Confirmed last week <dbl>, 1 week change <dbl>, 1 week % increase <dbl>,
## #   WHO Region <chr>

# As we can see, the data are arranged alphabetically by country.

# Population total of deaths: the true value against which the SRSWR and
# SRSWOR estimates of the total, computed next, will be compared.
PTotal <- sum(DS$Deaths)
PTotal

## [1] 654036

set.seed(39014)  # fixes the random number generator so the same sample is drawn on every run
# sample() draws 32 units from DS$Deaths; replace = TRUE means a unit can be drawn more than once
sampleWR <- sample(DS$Deaths, 32, replace = TRUE, prob = NULL)
sampleWR
# The selected sample is:

##  [1]  6160  8777     7    43   116   121   483  4838  1166   408  7067   146
## [13]     0     6     0 35112    22  4838    26  5532   294 33408  1761   474
## [25]     1    24    34   748     0  1764     0  1945

m_WR<-mean(sampleWR)
m_WR

## [1] 3603.781

# Estimated population total = N * sample mean, with N = 187 countries in the dataset
total_WR <- m_WR * 187
total_WR

## [1] 673907.1

# Using the SRSWR sample, the estimated population total is ~673907 deaths.

set.seed(39014)
# replace = FALSE: a unit that has been drawn cannot be drawn again
sampleWOR <- sample(DS$Deaths, 32, replace = FALSE, prob = NULL)
sampleWOR

##  [1]  6160  8777     7    43   116   121  4838  1166   408  7067     0     6
## [13]     0 35112    22   146    26  5532   294 33408  1761   474     1    24
## [25]    34   748     0  1764     0  1945   141   423

m_WOR<-mean(sampleWOR)
m_WOR

## [1] 3455.125

# Estimated population total = N * sample mean, with N = 187 countries in the dataset
total_WOR <- m_WOR * 187
total_WOR

## [1] 646108.4

# Using the SRSWOR sample, the estimated population total is ~646108 deaths.

In the samples drawn above, SRSWOR gives an estimate of the population total that is closer to the true value than SRSWR does. A single pair of samples cannot settle the matter, however, so to confirm our statement we calculate the variance of the sample mean in each case and the resulting gain in efficiency.

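N <- 187  # population size (number of countries)
n <- 32   # sample size
# Variance of the sample mean under SRSWOR: ((N - n) / (N * n)) * S^2,
# where var() returns S^2, the variance with divisor N - 1
vsrswor <- ((N - n) / (N * n)) * var(DS$Deaths)
vsrswor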

## [1] 5149659

# Variance of the sample mean under SRSWR: ((N - 1) / (N * n)) * S^2
vsrswr <- ((N - 1) / (N * n)) * var(DS$Deaths)
vsrswr

## [1] 6179591

# Relative reduction in variance when moving from SRSWR to SRSWOR
gain <- (vsrswr - vsrswor) / vsrswr
gain

## [1] 0.1666667

Code-based Conclusion

As we see above, the gain in efficiency is positive, which suggests that samples taken using the SRSWOR technique yield more precise estimates of the population parameters.
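The variance formulas used above can also be checked by simulation. The sketch below is an illustration under stated assumptions rather than part of the original analysis: it replaces the Kaggle data with a synthetic skewed population generated by `rexp()`, keeps the same N = 187 and n = 32, and uses an arbitrary seed and number of repetitions.

```r
set.seed(2024)                   # arbitrary seed, for reproducibility only
N <- 187                         # same population size as in the post
n <- 32                          # same sample size as in the post
pop <- rexp(N, rate = 1 / 3500)  # synthetic skewed population (assumption)
reps <- 20000                    # number of simulated samples per scheme

# Sample means from repeated draws under each scheme
means_wr  <- replicate(reps, mean(sample(pop, n, replace = TRUE)))
means_wor <- replicate(reps, mean(sample(pop, n, replace = FALSE)))

# Empirical variances of the sample mean
emp_wr  <- var(means_wr)
emp_wor <- var(means_wor)

# Theoretical variances from the formulas used in the post
theo_wr  <- (N - 1) / (N * n) * var(pop)
theo_wor <- (N - n) / (N * n) * var(pop)

round(c(emp_wr = emp_wr, theo_wr = theo_wr,
        emp_wor = emp_wor, theo_wor = theo_wor))

# Empirical relative gain next to its theoretical value
(emp_wr - emp_wor) / emp_wr
(n - 1) / (N - 1)
```

If the formulas hold, each empirical variance should land close to its theoretical counterpart, and the empirical gain should sit near (n - 1) / (N - 1), up to simulation noise.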

#########################################################################

Blog Conclusion

Under "Gain in Efficiency", we saw that the variance under SRSWR is higher than under SRSWOR: the SRSWR variance exceeds the SRSWOR variance by a positive term.

Thinking about it practically, under SRSWR there is always a chance of drawing the same unit more than once, and a repeated unit adds no new information about the population. Under SRSWOR this chance is eliminated completely, so every draw contributes a new unit and the sample is more representative of the population.

Our theoretical argument is borne out when we apply both variance formulas to a real-world dataset and see a gain in efficiency of about 0.167, or 16.7%.

We also notice that, for the samples drawn here, SRSWOR gives an estimate of the population total (646108) that is closer to the actual population total (654036) than the estimate from SRSWR (673907), which is consistent with SRSWOR being the more efficient way to estimate population parameters.
