Simple Random Sampling Without Replacement


Tejus Prabhu

2148116



Simple random sampling is one of the most fundamental methods of selecting a sample from a given population. Units are selected from the population at random, without any predefined order, and every unit has an equal chance of selection.


This blog explains the method of Simple Random Sampling Without Replacement (SRSWOR). It is a method of selecting n sample units from N population units one at a time, without returning a selected unit to the population, such that at every draw each of the remaining units has an equal chance (equal probability) of selection.



The figure above shows the procedure of Simple Random Sampling Without Replacement graphically, using a population of white and black balls. A selected sample can consist of only white balls, only black balls, or a mix of black and white balls. In this case, the first sample has two black and two white balls, and the second and third samples each contain one black and three white balls.




The procedure of simple random sampling is as follows:

  1. Take the N population units and label every unit from 1 to N.

  2. Choose a number between 1 and N at random and select the unit whose label matches that number.

  3. If a number that has already been drawn comes up again, ignore it and continue drawing until n distinct units have been selected.
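
The steps above can be sketched in R as follows. This is only an illustration: the helper name srswor_draw is hypothetical, and the values N = 81 and n = 36 are chosen to match the example later in this post.

```r
# Draw an SRSWOR sample of size n from units labelled 1..N, one unit at a time.
# srswor_draw is a hypothetical helper name used only for this illustration.
srswor_draw <- function(N, n) {
  stopifnot(n <= N)
  chosen <- integer(0)
  while (length(chosen) < n) {
    candidate <- sample.int(N, 1)      # step 2: pick a label at random
    if (!(candidate %in% chosen)) {    # step 3: skip labels already drawn
      chosen <- c(chosen, candidate)
    }
  }
  chosen
}

set.seed(1)
s <- srswor_draw(N = 81, n = 36)
length(s)          # 36 labels
anyDuplicated(s)   # 0, i.e. no unit appears twice
```

In practice the single call sample(1:N, n, replace = FALSE) gives the same kind of sample without the explicit loop.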


The probability of selecting any particular sample using this method is 1/(NCn), where NCn = N!/(n!(N-n)!).

The number of possible samples is NCn.
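
A small numerical check of these two statements, assuming a toy population of N = 5 units and samples of size n = 2 (values picked only for illustration):

```r
N <- 5
n <- 2

n_samples <- choose(N, n)     # number of distinct samples, NCn
n_samples                     # 10
1 / n_samples                 # probability of any one sample: 0.1

all_samples <- combn(N, n)    # each column is one possible sample of labels
ncol(all_samples)             # 10
```

For the data used later in this post the same count would be choose(81, 36), which is far too large to enumerate with combn.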


Advantages of Simple Random Sampling Without Replacement:

  1. It is a very simple method of sample selection.

  2. Selection is objective, since every unit has an equal chance of being chosen.

  3. The researcher doesn’t need any prior knowledge about the dataset and can select units at random without any prior preparation.


Disadvantages of Simple Random Sampling Without Replacement:

  1. It is often less efficient than methods like Stratified or Systematic Sampling.

  2. It can be a costly and time-consuming process, particularly for large populations.

  3. Bias can occur during the sampling process.


Errors can also arise because the method may, by chance, select outliers rather than representative units.


Formulas Involving Simple Random Sampling Without Replacement:


The above formula represents the mean of the sample.


The above formula represents the mean of the population.


The above shows the variance of the sample population.


The above shows the unbiased estimator of the variance.
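
Written out in standard textbook notation, these four quantities are usually given as below. The symbols are assumptions, since they are not fixed elsewhere in this post: y_i are the sample values, Y_i the population values, and the two variance lines are read here as the variance of the sample mean and its unbiased estimator.

```latex
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
\qquad\text{(sample mean)}

\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i
\qquad\text{(population mean)}

V(\bar{y}) = \frac{N-n}{Nn}\,S^2,
\qquad S^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i-\bar{Y}\right)^2

\hat{V}(\bar{y}) = \frac{N-n}{Nn}\,s^2,
\qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2
```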


The confidence interval formulas are given by:

Lower Limit=sample mean - Z*Standard Error

Upper Limit=sample mean + Z*Standard Error
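
As a rough sketch of how the two limits could be computed in R: the helper name srswor_ci is hypothetical, the vector y holds made-up sample values, and the standard error is assumed to take the usual SRSWOR form sqrt(((N-n)/(N*n))*s^2), with s^2 the sample variance.

```r
# srswor_ci is a hypothetical helper, not part of any package.
srswor_ci <- function(y, N, z = 1.96) {
  n <- length(y)
  ybar <- mean(y)                          # sample mean
  v_hat <- ((N - n) / (N * n)) * var(y)    # estimated variance of the sample mean
  se <- sqrt(v_hat)                        # standard error
  c(lower = ybar - z * se, upper = ybar + z * se)
}

y <- c(12, 15, 11, 18, 14)   # made-up sample values from a population of N = 20
ci <- srswor_ci(y, N = 20)
ci
```

The factor (N - n)/N is the finite population correction; when n equals N the interval collapses to a single point, because the whole population has been observed.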


Simple Random Sampling Without Replacement is considered more efficient than Simple Random Sampling With Replacement because, for the same sample size, the variance of the sample mean is lower.
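
The comparison can be checked numerically. The sketch below assumes the standard textbook variances of the sample mean, ((N-n)/(N*n))*S^2 without replacement and ((N-1)/(N*n))*S^2 with replacement. The value S2 = 1 is arbitrary because it cancels in the ratio.

```r
N <- 81
n <- 36
S2 <- 1   # any positive population variance; it cancels in the ratio

var_wor <- ((N - n) / (N * n)) * S2   # without replacement
var_wr  <- ((N - 1) / (N * n)) * S2   # with replacement

var_wor / var_wr                      # (N - n) / (N - 1) = 45/80 = 0.5625
```

The ratio is below 1 whenever n is greater than 1, so the without-replacement estimator is never less precise for the same sample size.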


Applications of Simple Random Sampling Without Replacement:

Simple Random Sampling can be used in various domains of research, such as business, medicine, and engineering. Large volumes of data are generated through various means of data collection, and it can be difficult to analyze each and every unit of the population, so a sample is analyzed instead.


Now, we demonstrate the method of simple random sampling without replacement using the R programming language.





11/23/2021

AIM

We have to take a sample from the given population using the method of Simple Random Sampling Without Replacement and provide the estimates.

Data Description

Here, the data give information about Covid-19 vaccination in different cities of Turkey, including the first-dose, second-dose and total counts and the population of each city. The dataset has been taken from Kaggle.

Now, we import the dataset.

library(readxl)
DS=read_excel("C:/Users/tejus/OneDrive/Desktop/covid19Vaccination.xlsx")
DS

## # A tibble: 81 x 13
##        ID DATE_               SEQID CITY           CITY2 `_1DOSE` `_2DOSE` `_TOTAL`
##     <dbl> <dttm>              <dbl> <chr>          <chr>    <dbl>    <dbl>    <dbl>
##  1 438102 2021-06-26 22:41:31  5412 Adana          Adana   784464   355404  1139868
##  2 438103 2021-06-26 22:41:31  5412 Adiyaman       Adiy~   170847    77955   248802
##  3 438104 2021-06-26 22:41:31  5412 Afyonkarahisar Afyon   292611   138934   431545
##  4 438105 2021-06-26 22:41:31  5412 Agri           Agri     94807    41146   135953
##  5 438106 2021-06-26 22:41:31  5412 Aksaray        Aksa~   131889    62603   194492
##  6 438107 2021-06-26 22:41:31  5412 Amasya         Amas~   169285    87322   256607
##  7 438108 2021-06-26 22:41:31  5412 Ankara         Anka~  2652085  1241906  3893991
##  8 438109 2021-06-26 22:41:31  5412 Antalya        Anta~  1055350   530415  1585765
##  9 438110 2021-06-26 22:41:31  5412 Ardahan        Arda~    34629    18642    53271
## 10 438111 2021-06-26 22:41:31  5412 Artvin         Artv~    80554    43132   123686
## # ... with 71 more rows, and 5 more variables: POPULATION <dbl>,
## #   DIFF_1DOSE <dbl>, DIFF_2DOSE <dbl>, DIFF_TOTAL <dbl>, PREVID <dbl>

summary(DS)

##        ID             DATE_                         SEQID    
##  Min.   :438102   Min.   :2021-06-26 22:41:31   Min.   :5412 
##  1st Qu.:438122   1st Qu.:2021-06-26 22:41:31   1st Qu.:5412 
##  Median :438142   Median :2021-06-26 22:41:31   Median :5412 
##  Mean   :438142   Mean   :2021-06-26 22:41:31   Mean   :5412 
##  3rd Qu.:438162   3rd Qu.:2021-06-26 22:41:31   3rd Qu.:5412 
##  Max.   :438182   Max.   :2021-06-26 22:41:31   Max.   :5412 
##      CITY              CITY2               _1DOSE            _2DOSE      
##  Length:81          Length:81          Min.   :  24128   Min.   :  13155 
##  Class :character   Class :character   1st Qu.: 106990   1st Qu.:  49145 
##  Mode  :character   Mode  :character   Median : 197631   Median :  98896 
##                                        Mean   : 396543   Mean   : 182891 
##                                        3rd Qu.: 395582   3rd Qu.: 187138 
##                                        Max.   :6197550   Max.   :2419591 
##      _TOTAL          POPULATION         DIFF_1DOSE        DIFF_2DOSE    
##  Min.   :  37283   Min.   :   81910   Min.   :  0.000   Min.   : 0.0000 
##  1st Qu.: 157545   1st Qu.:  284923   1st Qu.:  0.000   1st Qu.: 0.0000 
##  Median : 294010   Median :  537762   Median :  1.000   Median : 0.0000 
##  Mean   : 579434   Mean   : 1032276   Mean   :  8.407   Mean   : 0.4938 
##  3rd Qu.: 580524   3rd Qu.: 1081065   3rd Qu.:  6.000   3rd Qu.: 0.0000 
##  Max.   :8617141   Max.   :15462452   Max.   :193.000   Max.   :11.0000 
##    DIFF_TOTAL          PREVID     
##  Min.   :  0.000   Min.   :438021 
##  1st Qu.:  0.000   1st Qu.:438041 
##  Median :  1.000   Median :438061 
##  Mean   :  8.901   Mean   :438061 
##  3rd Qu.:  6.000   3rd Qu.:438081 
##  Max.   :204.000   Max.   :438101

str(DS)

## tibble [81 x 13] (S3: tbl_df/tbl/data.frame)
##  $ ID        : num [1:81] 438102 438103 438104 438105 438106 ...
##  $ DATE_     : POSIXct[1:81], format: "2021-06-26 22:41:31" "2021-06-26 22:41:31" ...
##  $ SEQID     : num [1:81] 5412 5412 5412 5412 5412 ...
##  $ CITY      : chr [1:81] "Adana" "Adiyaman" "Afyonkarahisar" "Agri" ...
##  $ CITY2     : chr [1:81] "Adana" "Adiyaman" "Afyon" "Agri" ...
##  $ _1DOSE    : num [1:81] 784464 170847 292611 94807 131889 ...
##  $ _2DOSE    : num [1:81] 355404 77955 138934 41146 62603 ...
##  $ _TOTAL    : num [1:81] 1139868 248802 431545 135953 194492 ...
##  $ POPULATION: num [1:81] 2258718 632459 736912 535435 423011 ...
##  $ DIFF_1DOSE: num [1:81] 15 2 8 0 3 1 91 29 0 0 ...
##  $ DIFF_2DOSE: num [1:81] 0 0 0 0 0 0 6 4 0 0 ...
##  $ DIFF_TOTAL: num [1:81] 15 2 8 0 3 1 97 33 0 0 ...
##  $ PREVID    : num [1:81] 438021 438022 438023 438024 438025 ...

plot(DS)

Now, we have to select a sample of size 36 from a population of size 81.

library(samplingbook)

## Loading required package: pps

## Loading required package: sampling

## Loading required package: survey

## Loading required package: grid

## Loading required package: Matrix

## Loading required package: survival

##
## Attaching package: 'survival'

## The following objects are masked from 'package: sampling':
##
##     cluster, strata

##
## Attaching package: 'survey'

## The following object is masked from 'package:graphics':
##
##     dotchart

set.seed(420)
DD=DS[sample(1:nrow(DS),36,replace=FALSE),]
DD

## # A tibble: 36 x 13
##        ID DATE_               SEQID CITY          CITY2 `_1DOSE` `_2DOSE` `_TOTAL`
##     <dbl> <dttm>              <dbl> <chr>         <chr>    <dbl>    <dbl>    <dbl>
##  1 438106 2021-06-26 22:41:31  5412 Aksaray       Aksa~   131889    62603   194492
##  2 438143 2021-06-26 22:41:31  5412 Kahramanmaras Kahr~   336185   149207   485392
##  3 438179 2021-06-26 22:41:31  5412 Van           Van     236641    91208   327849
##  4 438174 2021-06-26 22:41:31  5412 Tekirdag      Teki~   477726   208146   685872
##  5 438138 2021-06-26 22:41:31  5412 Hatay         Hatay   512074   223365   735439
##  6 438122 2021-06-26 22:41:31  5412 Bursa         Bursa  1272038   550539  1822577
##  7 438126 2021-06-26 22:41:31  5412 Denizli       Deni~   475605   212478   688083
##  8 438125 2021-06-26 22:41:31  5412 Çorum         Corum   238440   121891   360331
##  9 438142 2021-06-26 22:41:31  5412 Izmir         Izmir  2136705  1006223  3142928
## 10 438161 2021-06-26 22:41:31  5412 Mus           Mus      62187    29056    91243
## # ... with 26 more rows, and 5 more variables: POPULATION <dbl>,
## #   DIFF_1DOSE <dbl>, DIFF_2DOSE <dbl>, DIFF_TOTAL <dbl>, PREVID <dbl>

A sample of size 36 is created as a result of using the method of Simple Random Sampling Without Replacement.

plot(DD)

Next, we analyze the sample by checking certain properties.

We take the POPULATION column (the population of each city) and analyze it.

We estimate the mean and its confidence interval.

Mean=mean(DD$POPULATION)
Mean #sample mean

## [1] 1410072

variance=var(DS$POPULATION)
variance #population variance

## [1] 3.50654e+12

N=81
n=36
V=((N-n)/(N*n))*variance
V # variance of the sample mean under SRSWOR: ((N-n)/(N*n))*S^2

## [1] 54113274623

SE=sqrt(V)
SE # the standard error

## [1] 232622.6

LL=Mean-1.96*SE
LL #lower limit

## [1] 954131.3

UL=Mean+1.96*SE
UL #upper limit

## [1] 1866012

CI=c(LL,UL)
CI

## [1]  954131.3 1866011.9

This is the confidence interval at the 95% confidence level (5% level of significance). We use 1.96 because the sample size is above 30, so the normal approximation applies. Conclusion: we can be 95% confident that the population mean lies between the above values.
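
What a 95% interval means can be illustrated with a small simulation: if many samples are drawn and an interval is built from each, roughly 95% of those intervals should contain the population mean. The sketch below uses a synthetic population generated with rnorm, not the Turkey data, and assumes the population variance is known, as in the calculation above.

```r
set.seed(123)
N <- 81
n <- 36
pop <- rnorm(N, mean = 100, sd = 15)   # synthetic population, not the Turkey data
true_mean <- mean(pop)

# Standard error of the sample mean under SRSWOR, using the known S^2
se <- sqrt(((N - n) / (N * n)) * var(pop))

# Draw 2000 SRSWOR samples and record whether each interval covers the true mean
covers <- replicate(2000, {
  ybar <- mean(sample(pop, n, replace = FALSE))
  (ybar - 1.96 * se <= true_mean) && (true_mean <= ybar + 1.96 * se)
})

coverage <- mean(covers)   # proportion of intervals containing the population mean
coverage
```

For a strongly skewed population, such as city populations dominated by a few very large cities, the normal approximation is rougher and the observed coverage may differ from 95%.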

