PROBABILITY PROPORTIONAL TO SIZE WITHOUT REPLACEMENT -2148149

 2148149--VARNIKA V.N

            PROBABILITY PROPORTIONAL TO SIZE WITHOUT REPLACEMENT


What is PPS?

In PPS (Probability Proportional to size) Sampling, the probability of selection of element is proportional to its size. This improves the accuracy of population estimates because the larger elements have the greatest impact on the population estimates.

In order to ensure that all units (ex. individuals) in the population have the same probability of selection irrespective of the size of their cluster, each of the hierarchical levels prior to the ultimate level has to be sampled according to the size of ultimate units it contains, but the same number of units has to be sampled from each cluster at the last hierarchical level. This method also facilitates planning for fieldwork because a pre-determined number of individuals is interviewed in each unit selected, and staff can be allocated accordingly It is most useful when the sampling units vary considerably in size because it assures that those in larger sites have the same probability of getting into the sample as those in smaller sites, and vice verse. 

The basic difference between simple random sampling and pps sampling procedure is in simple random sampling, the probability of drawing any specified unit at any given draw is the same while in PPS sampling it differs from draw to draw.

Let us take this example of Wikipedia which illustrates clearly the concept of PPS Sampling,

Suppose we have six schools with a population of 150,180,200,220,260 and 490 students respectively. Total 1500 students and we want to use student population as the basis for a PPS sample of size 3. To this we will allocate the first school numbers 1 to 150, the second school 151 to 330(=150+180), the third school 331to 530, and soon to the last school (10111to 1500). We then generate a random start between 1 and 500(equal to 1500/3)and count through the school populations by multiples of 500. If our random start was 137,  we would select the schools which have been allocated numbers 137,637, and 1137 i.e.the first, fourth and sixth schools.

In PPS there are 2 ways in which we can draw samples they are,

  • PPS with replacement 
  • PPS without replacement.

In this blog, we will look into the concept of probability proportional to size precisely without replacement technique and sampling without replacement is more efficient than with replacement technique.

PPSWOR is divided into ordered and unordered estimators for better clarity.

Ordered estimators:

  • Des-Raj's ordered, estimator

Unordered estimators:

  • Horvitz Thompson Estimator
  • Murthy's unordered Estimator

Ordered Estimators:

In ordered estimation to overcome the difficulty of changing expectation with each draw, associate a new variate with each draw such that its expectation is equal to the population value of the variate under study. Estimators take into account the order of the draw.
Des Raj (1956) have proposed such ordered estimators which make use of conditional probabilities.
Des Raj's ordered estimators
Desraj estimator computes the estimated value of a finite population total when values of the study variable for the sampled units and the corresponding selection probabilities are given as per the order of selection. It is meant for use in Probability Proportional to Size Without Replacement Sampling Scheme

Formulas:
Case1: Case of two draws 
Let y1 and y2 denote the values of the units  Ui(1) and  Ui(2) drawn at first and second draws respectively.

 


Unbiased estimator of V( ŶD) is obtained by,


APPLICATION OF  DESRAJ ESTIMATOR:

The application of desraj estimator is demonstrated on the data-set which shows the area Under lime (in acres) and No. of bearing lime trees in each of the 22 villages growing lime in one of the tehsils of Bangalore district. The procedure includes drawing the sample of size 5 and with ppswor scheme and estimate the total number of bearing lime trees using ordered Desraj estimator, then also give the bound on the error of estimation.

#ANALYSIS:

#loading the package

library(fpest)
library(readxl)
tree_data
<- read_excel("C:/Users/DELL/Downloads/tree_data.xlsx")
View(tree_data)

attach(tree_data)

#Now wew obtain sum of area under lime

S<-sum(`Area Under lime(in acres)`)
S

## [1] 497.66

#Now, we have to find the probability of the values of the area under lime(in acres)

pi<-(`Area Under lime(in acres)`)/S
pi

##  [1] 0.0658481694 0.0160149500 0.0012458305 0.0313667966 0.0861029619
##  [6] 0.0804364426 0.0188683037 0.0127195274 0.0101474903 0.1899891492
## [11] 0.1079250894 0.0013463007 0.0016477113 0.0043202186 0.0008640437
## [16] 0.2478800788 0.0005827272 0.0060282120 0.0080376160 0.0040188080
## [21] 0.0124783989 0.0921311739

Now we have to combine the data frame and their respective probabilities.

Data_1<-cbind(tree_data,pi)
Data_1

##    S.No. of villages Area Under lime(in acres) No. of bearing lime trees
## 1                  1                     32.77                      2328
## 2                  2                      7.97                       754
## 3                  3                      0.62                       105
## 4                  4                     15.61                       949
## 5                  5                     42.85                      3091
## 6                  6                     40.03                      1736
## 7                  7                      9.39                       840
## 8                  8                      6.33                       311
## 9                  9                      5.05                         0
## 10                10                     94.55                      3044
## 11                11                     53.71                      2483
## 12                12                      0.67                       128
## 13                13                      0.82                       102
## 14                14                      2.15                        60
## 15                15                      0.43                         0
## 16                16                    123.36                     11799
## 17                17                      0.29                        26
## 18                18                      3.00                       317
## 19                19                      4.00                       190
## 20                20                      2.00                       180
## 21                21                      6.21                       752
## 22                22                     45.85                      3091
##              pi
## 1  0.0658481694
## 2  0.0160149500
## 3  0.0012458305
## 4  0.0313667966
## 5  0.0861029619
## 6  0.0804364426
## 7  0.0188683037
## 8  0.0127195274
## 9  0.0101474903
## 10 0.1899891492
## 11 0.1079250894
## 12 0.0013463007
## 13 0.0016477113
## 14 0.0043202186
## 15 0.0008640437
## 16 0.2478800788
## 17 0.0005827272
## 18 0.0060282120
## 19 0.0080376160
## 20 0.0040188080
## 21 0.0124783989
## 22 0.0921311739

#Now we have to obtain a sample using the methodof PPSWOR #FIRT,we have to import the fpest library

set.seed(123)
PPS_WOR
<-Data_1[sample(1:nrow(Data_1),5,replace = F),]
PPS_WOR

##    S.No. of villages Area Under lime(in acres) No. of bearing lime trees
## 15                15                      0.43                         0
## 19                19                      4.00                       190
## 14                14                      2.15                        60
## 3                  3                      0.62                       105
## 10                10                     94.55                      3044
##              pi
## 15 0.0008640437
## 19 0.0080376160
## 14 0.0043202186
## 3  0.0012458305
## 10 0.1899891492

desraj(PPS_WOR$`No. of bearing lime trees`,PPS_WOR$pi)

## $est
## [1] 27426.98
##
## $estvar
## [1] 210519342
##
## $tvals
## [1]     0.00 23618.43 13954.56 83416.77 16145.17

#Hence,the estimate of the total number of bearing lime trees is 27426.98.

#Now we have to find the bound on Error of Estimation.

variance<-210519342
Standard_Error
<-sqrt(variance)
Standard_Error

## [1] 14509.28

Bound_Error<-2*Standard_Error
Bound_Error

## [1] 29018.57

CONCLUSION:

The analysis shows that the estimation of the total no. of bearing lime trees in the population is 27427 trees with the variance on the estimated value as 210519342. The bound-on error estimate is 29018.57. So, we can say that almost about 27427 total no. of trees are in 22 villages growing lime in one of the Bangalore district's tehsils.

UNORDERED ESTIMATORS:

In the ordered estimator, the order in which the units are drawn is considered. Corresponding to any ordered estimator, there exists an unordered estimator which does not depend on the order in which the units are drawn and has a smaller variance than the ordered estimator.

 In case of sampling WOR from a population of size N, there are Ncn unordered sample(s) of size n.

Corresponding to an unordered sample(s) of size n units, there are n! ordered samples. For example, for n =2 if the units are u1 and u2 , then

- there are 2! ordered samples - (u1,u2)  and (u2,u1).

- there is one unordered sample (u1,u2)

  1. Horvitz-Thompson Estimator:
The unordered estimates have limited applicability as they lack simplicity and the expressions for the estimators and their variance becomes unmanageable when the sample size is even moderately large. The HT estimate is simpler than other estimators. Let N be the population size and, yi=( 1, 2,..., N)  be the value of characteristic understudy and a sample of size n is drawn by WOR using the arbitrary probability of selection at each draw.

Thus prior to each succeeding draw, there is defined a new probability distribution for the units available at that draw. The probability distribution at each draw may or may not depend upon the initial probability at the first draw. 



# STATISTICAL ANALYSIS:

library(readxl)
tree_data
<- read_excel("C:/Users/DELL/Downloads/tree_data.xlsx")
View(tree_data)
data
<-tree_data
attach(data)
head(data)

## # A tibble: 6 x 3
##   `S.No. of villages` `Area Under lime(in acres)` `No. of bearing lime trees`
##                 <dbl>                       <dbl>                       <dbl>
## 1                   1                       32.8                         2328
## 2                   2                        7.97                         754
## 3                   3                        0.62                         105
## 4                   4                       15.6                          949
## 5                   5                       42.8                         3091
## 6                   6                       40.0                         1736

library(samplingbook)

## Loading required package: pps

## Loading required package: sampling

## Loading required package: survey

## Loading required package: grid

## Loading required package: Matrix

## Loading required package: survival

##
## Attaching package: 'survival'

## The following objects are masked from 'package:sampling':
##
##     cluster, strata

##
## Attaching package: 'survey'

## The following object is masked from 'package:graphics':
##
##     dotchart

#Horvitz-Thompson Method

set.seed(123)

pps
=pps.sampling(data$`No. of bearing lime trees`,n=8,method="midzuno")
pps

##
## pps.sampling object: Sample with probabilities proportional to size
## Method of Midzuno:
##
## PPS sample:
## [1]  1  5  6 10 11 16 20 22
##
## Sample probabilities:
##            [,1]       [,2]      [,3]       [,4]       [,5]       [,6]
## [1,] 0.81249455 0.81249455 0.4268213 0.81249455 0.67908559 0.81249455
## [2,] 0.81249455 1.00000000 0.6058808 1.00000000 0.86659105 1.00000000
## [3,] 0.42682130 0.60588081 0.6058808 0.60588081 0.47848110 0.60588081
## [4,] 0.81249455 1.00000000 0.6058808 1.00000000 0.86659105 1.00000000
## [5,] 0.67908559 0.86659105 0.4784811 0.86659105 0.86659105 0.86659105
## [6,] 0.81249455 1.00000000 0.6058808 1.00000000 0.86659105 1.00000000
## [7,] 0.04959955 0.06282174 0.0350299 0.06282174 0.05341423 0.06282174
## [8,] 0.81249455 1.00000000 0.6058808 1.00000000 0.86659105 1.00000000
##            [,7]       [,8]
## [1,] 0.04959955 0.81249455
## [2,] 0.06282174 1.00000000
## [3,] 0.03502990 0.60588081
## [4,] 0.06282174 1.00000000
## [5,] 0.05341423 0.86659105
## [6,] 0.06282174 1.00000000
## [7,] 0.06282174 0.06282174
## [8,] 0.06282174 1.00000000

sample=data[pps$sample, ]
sample
#selected sample using midzuno method

## # A tibble: 8 x 3
##   `S.No. of villages` `Area Under lime(in acres)` `No. of bearing lime trees`
##                 <dbl>                       <dbl>                       <dbl>
## 1                   1                        32.8                        2328
## 2                   5                        42.8                        3091
## 3                   6                        40.0                        1736
## 4                  10                        94.6                        3044
## 5                  11                        53.7                        2483
## 6                  16                       123.                        11799
## 7                  20                         2                           180
## 8                  22                        45.8                        3091

N=nrow(data)
N

## [1] 22

popmean=mean(data$`No. of bearing lime trees`)
PI
=pps$PI
head(PI)

##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]       [,7]
## [1,] 0.8124945 0.8124945 0.4268213 0.8124945 0.6790856 0.8124945 0.04959955
## [2,] 0.8124945 1.0000000 0.6058808 1.0000000 0.8665910 1.0000000 0.06282174
## [3,] 0.4268213 0.6058808 0.6058808 0.6058808 0.4784811 0.6058808 0.03502990
## [4,] 0.8124945 1.0000000 0.6058808 1.0000000 0.8665910 1.0000000 0.06282174
## [5,] 0.6790856 0.8665910 0.4784811 0.8665910 0.8665910 0.8665910 0.05341423
## [6,] 0.8124945 1.0000000 0.6058808 1.0000000 0.8665910 1.0000000 0.06282174
##           [,8]
## [1,] 0.8124945
## [2,] 1.0000000
## [3,] 0.6058808
## [4,] 1.0000000
## [5,] 0.8665910
## [6,] 1.0000000

PI=as.matrix(PI)

#Estimation of variance using HT Method.
var_ht
=htestimate(sample$`No. of bearing lime trees`,N=N,PI=PI,method='ht')
var_ht

##
## htestimate object: Estimator for samples with probabilities proportional to size
## Method of Horvitz-Thompson:
##
## Mean estimator: 1476.636
## Standard Error: 117.0282

#pik is the 1st order inclusion prob for all population units
po_tota
=popmean*N
po_tota

## [1] 32486

INTERPRETATION:

Using the Horvitz-Thompson method, the estimated average number of bearing lime trees is 1476.636 with a standard error of 117.0282. 


Ex2: 

In this example below we calculated the Horvitz-Thompson estimate with different methods for variance estimation such as Yates and Grundy, Hansen-Hurwitz, and Hajek for dataset influenza.

input:

data(influenza)
summary(influenza)

# pps.sampling()
set.seed(108506)
pps <- pps.sampling(z=influenza$population,n=20,method='midzuno')
sample <- influenza[pps$sample,]
# htestimate()
N <- nrow(influenza)
# exact variance estimate
PI <- pps$PI
htestimate(sample$cases, N=N, PI=PI, method='yg')
htestimate(sample$cases, N=N, PI=PI, method='ht')
# approximate variance estimate
pk <- pps$pik[pps$sample]
htestimate(sample$cases, N=N, pk=pk, method='hh')
pik <- pps$pik
htestimate(sample$cases, N=N, pk=pk, pik=pik, method='ha')
# without pik just approximate calculation of Hajek method
htestimate(sample$cases, N=N, pk=pk, method='ha') 
# calculate confidence interval based on normal distribution for number of cases
est.ht <- htestimate(sample$cases, N=N, PI=PI, method='ht')
est.ht$mean*N  
lower <- est.ht$mean*N - qnorm(0.975)*N*est.ht$se
upper <- est.ht$mean*N + qnorm(0.975)*N*est.ht$se
c(lower,upper) 
# true number of influenza cases
sum(influenza$cases)

output:
Loading required package: pps
Loading required package: sampling
Loading required package: survey
Loading required package: grid
Loading required package: Matrix
Loading required package: survival

Attaching package: 'survival'

The following objects are masked from 'package:sampling':

    cluster, strata


Attaching package: 'survey'

The following object is masked from 'package:graphics':

    dotchart

       id                        district     population          cases       
 Min.   : 1001   LK Aachen           :  1   Min.   :  34719   Min.   :  0.00  
 1st Qu.: 5877   LK Ahrweiler        :  1   1st Qu.: 104553   1st Qu.:  9.00  
 Median : 8331   LK Aichach-Friedberg:  1   Median : 145130   Median : 27.00  
 Mean   : 8468   LK Alb-Donau-Kreis  :  1   Mean   : 193910   Mean   : 44.58  
 3rd Qu.: 9778   LK Altenburger Land :  1   3rd Qu.: 244154   3rd Qu.: 59.00  
 Max.   :16077   LK Altenkirchen     :  1   Max.   :1770629   Max.   :410.00  
                 (Other)             :418                                     

htestimate object: Estimator for samples with probabilities proportional to size
Method of Yates and Grundy:

Mean estimator: 40.36766
Standard Error: 8.059507


htestimate object: Estimator for samples with probabilities proportional to size
Method of Horvitz-Thompson:

Mean estimator: 40.36766
Standard Error: 8.227719


htestimate object: Estimator for samples with probabilities proportional to size
Method of Hansen-Hurwitz (approximate variance):

Mean estimator: 40.36766
Standard Error: 8.534792


htestimate object: Estimator for samples with probabilities proportional to size
Method of Hajek (approximate variance):

Mean estimator: 40.36766
Standard Error: 8.262482


htestimate object: Estimator for samples with probabilities proportional to size
Method of Hajek (approximate variance):

Mean estimator: 40.36766
Standard Error: 8.244296

Conclusion:
we can say the least biased and most
accurate method to be used is Hajek method followed by Yates-Grundy,
Horvitz-Thompson and Hansen Hurwitz> in that order.
Murthy’s unordered estimator corresponding to Des Raj’s ordered estimator for the
sample size 2

Suppose yi and yj are the values of units  Ui and Uj selected in the first and second draws respectively with varying probability and WOR in a sample of size 2 and let pi and pj be the corresponding initial probabilities of selection. So now we have two ordered estimates corresponding to the ordered samples s1* and s2* as follows 

s1*=(yi,yj) and  s2*=(Ui,Uj)



    Thus we can conclude that Unordered sampling is more efficient than ordered sampling.































































Comments

Popular posts from this blog

PPSWOR AND HORVITZ THOMPSON ESTIMATOR

Population Proportion of Size Without Replacement Using DesRaj Estimator

HORVITZ-THOMPSON ESTIMATOR - An Unordered Estimator